Hi everyone, considering how long my stage 1 post was, I decided to summarize it for easier consumption. For links to the various terminology, you can look at the original post here.
For my SPO600 project, I chose the image compression program "Guetzli" because it is command-line based, written in C++, and takes a long time to do conversions. I also chose it because, ideally, I want to optimize the code so that it can run on low-powered computers, not unlike the ones found in modern smartphones.
Why smartphones? Well, smartphones also serve as fairly powerful digital cameras, capable of taking pictures at very large resolutions. This large resolution is, of course, meant to capture as much detail as possible. That said, most if not all smartphones produce these images in the form of a highly compressed JPEG. For most casual users this is enough; however, thanks to the ability to share one's experiences via social media, more and more people are demanding phones with greater camera capabilities. Now, there are already cameras that capture incredibly detailed pictures, but they use file formats that hold 32-bit color information, like TARGA, TIFF, and PNG. That is to say, the file format stores color information that can be manipulated accurately, to emulate how one would adjust the exposure of an image. This works because the three color channels together use 24 bits (8 bits each for red, green, and blue), with the remaining 8 bits reserved for transparency (the alpha channel). In total that's 32 bits, and yes, that's where the "32 bit" comes into play.
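As a concrete illustration (the channel values here are arbitrary, just for demonstration), packing one 32-bit RGBA pixel looks like this, with each channel occupying exactly one byte:

```shell
# Pack an RGBA pixel (R=255, G=128, B=64, A=255) into a single 32-bit value.
# 8 bits per color channel (24 bits of color) + 8 bits of alpha = 32 bits.
r=255; g=128; b=64; a=255
pixel=$(( (r << 24) | (g << 16) | (b << 8) | a ))
printf 'RGBA pixel: 0x%08X\n' "$pixel"   # prints RGBA pixel: 0xFF8040FF
```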
While there are a few hardware differences between a smartphone camera and a digital camera (namely the lens), the major bottleneck for the former is the file format. Compared to a 32-bit PNG, a JPEG's color information is more limited: it has no alpha channel, and its lossy compression discards fine detail, which can produce visual inaccuracies even at the best quality settings. So why do we use JPEG anyway? Because JPEG files hold less data, they take up less storage space. This is a good compromise for devices with low storage capacity; however, 1 TB SD cards are no longer in the realm of wishful thinking, so storage is becoming a moot point. Also, as mentioned before, consumer demand is ravenous, so it may not be outside the realm of possibility to switch to 32-bit file formats in the near future.
OK, so we now have potentially millions of people creating very large, high-quality images so that they can post them on social media. How can we possibly store all that information? It is possible if you have money to spare, but companies like to save, and this is where image compression can help. Imagine, if you will, an app for one of these social media platforms with the capability of compressing the image down to a JPEG while preserving nearly all the detail of the 32-bit original. This would be very convenient for both users and service providers, if it were quick and effective.
Therefore, ladies and gentlemen, my challenge will be to optimize the slow Guetzli program so that it can run quickly on low-powered AArch64 computers. To accomplish this, I conducted a series of tests on two computers with that processor, and a third with an x86_64 processor. The AArch64 computers were called Archie and Charlie, while the x86_64 one was called Xerxes. The test data consisted of the same image scaled to typical HD resolutions, or close approximations of them. The only exception was Archie, the weakest of the three computers, where I used an image with dimensions of 444 x 258.
With the resolutions constrained mostly to standard sizes, processing time landed around the intended target of 4 minutes, ranging from 3 minutes and 18 seconds to 5 minutes and 45 seconds. Given that these were nowhere near sub-second values, I decided to proceed.
The Test
For each computer I used an image resolution that put a meaningful CPU load on that machine, and executed the program under four additional optimization settings: -O1, -O2, -O3, and -Ofast. I also ran the program at least twice per optimization level, adding runs whenever a result looked like an outlier. In every such case, the second execution ended up taking longer than the first, which should not happen when the input data is identical. Finally, to mitigate any processor load from other users being logged into the computers, I conducted these tests between 11 am and 4 am.
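As a sketch of that protocol (this harness is a simplified stand-in, not the exact script used; the guetzli invocation in the comment is its real command-line form):

```shell
#!/bin/sh
# Time a command twice, mirroring the two-runs-per-level protocol; a second
# run that is slower than the first (same input) is treated as an outlier.
time_runs() {
    for run in 1 2; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "run $run: $((end - start)) s"
    done
}

# Stand-in workload; on the test machines this would be, e.g.:
#   time_runs ./guetzli --quality 95 input.png output.jpg
time_runs sleep 1
```

Each level would be built first, e.g. `make CFLAGS=-O2 CXXFLAGS=-O2` (assuming Guetzli's Makefile honors those variables), before running the timing loop.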
Archie – 444 x 258
Archie's performance using its stock build settings produced a general processing time of 3 min and 18.6 seconds. Under -O1, the time increased by roughly 0.1 seconds; -O2 showed comparable results, but the first execution under this level took roughly 0.15 seconds longer than -O1, and about 0.2 seconds longer than the stock build. -O3 yielded bizarre results, with the first execution matching -O2 but the second execution beating even the stock build by about 0.1 seconds. The last level, -Ofast, delivered results that were faster than the stock build, but still slower than the second attempt under -O3. One other observation was that the time spent in the system ("sys" time, i.e. kernel and hardware access) increased under -Ofast. That makes sense: if the computation itself finishes faster, system-level work becomes a larger share of the run. While this was promising, the speed gains were still negligible.
Charlie – 853 x 480
**Full disclosure: the image used for Charlie was scaled down from a 4K image, but it did not come out to an exact standard 480p size.
For Charlie, the stock build results were around 5 min and 42.87 seconds for the first run and 5 min and 42.4 seconds for the second. -O1 came in at around 5 min and 42.75 seconds; if nothing else, it was faster than the stock build's first run, and it maintained a consistent execution time apart from one outlier that took 2 seconds longer. -O2 fared worse, adding roughly 1 second to the execution time for both runs. -O3 was faster than -O2, but averaged around 5 min 42.99 seconds. -Ofast was overall faster than the other optimization levels at around 5 min 42.6 seconds, but still slower than the best time for the stock configuration.
That is it for the AArch64 series of tests. While there is a bit of potential, I believe more testing, and perhaps hand-tuned optimizations that go beyond the compiler's flags, may be required to get better performance on AArch64 hardware.
Xerxes – 1280 x 720
**The 4K image scaled down neatly to the 720p resolution, which makes the situation with Charlie all the stranger.
For Xerxes, the stock test results ranged from 3 min and 24.5 seconds down to 3 min and 19.9 seconds. With -O1, however, the program managed to drop around 5 seconds, resulting in an average of 3 min and 19.5 seconds! -O2 continued this trend by shedding around 0.2 seconds, and oddly enough, it was the first run under this level that yielded the fastest time thus far: 3 min and 19.16 seconds. Interestingly, -O3 produced consistently slower results than the first two levels, even adding a whole second to the processing time; perhaps some of its more aggressive settings impacted the program negatively. The same cannot be said for the last setting, as it lived up to its name and produced results faster than all the other settings, with a top result of 3 min and 19.14 seconds. -Ofast is clearly the right direction; however, this level does increase the output file size (likely because its relaxed floating-point math changes the encoder's results), so that may also be a factor to consider in future stages.
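If you want to see exactly what -Ofast adds on top of -O3, GCC can report it directly (assuming GCC is the compiler in use on these machines; the math-related switches are the likely culprits for the changed output):

```shell
# List the optimizer switches enabled at each level and diff them; -Ofast's
# extras include -ffast-math, which relaxes IEEE floating-point rules and
# can therefore change Guetzli's numeric results (and output file size).
command -v gcc >/dev/null 2>&1 || exit 0   # skip quietly if gcc is absent
gcc -Q --help=optimizers -O3    > o3.txt
gcc -Q --help=optimizers -Ofast > ofast.txt
diff o3.txt ofast.txt | grep -i math || true
```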
Well, that wraps up the first stage of my optimization research. I look forward to updating you all in the next project post. See you then!