Hi everyone! Keeping the momentum from the last post: last week I was part of a group tasked with diagnosing potential bottlenecks in three programs running on different computers. Each computer had a different processor and memory configuration, and as a result showed varying execution times for the same set of programs. Now, what sort of programs were we running, you ask? Well, they were three different algorithmic solutions for adjusting an audio source’s volume through multiplication. One approach used floating-point conversion (program vol1), another used a lookup table (program vol2), and the last used fixed-point math (program vol3) to adjust the volume of each sound sample it processed.
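To make the three approaches concrete, here is a minimal sketch of what each one could look like. This is my own illustration, not the actual course code: the sample format (signed 16-bit) and the function names are assumptions.

```c
#include <stdint.h>

/* vol1-style: convert to floating point, multiply, convert back */
static int16_t scale_float(int16_t sample, float volume) {
    return (int16_t)(sample * volume);
}

/* vol2-style: precompute the scaled value of every possible 16-bit
 * sample once, then adjust each sample with a table lookup */
static int16_t lookup[65536];

static void build_lookup(float volume) {
    for (int32_t s = -32768; s <= 32767; s++)
        lookup[(uint16_t)s] = (int16_t)(s * volume);
}

static int16_t scale_lookup(int16_t sample) {
    return lookup[(uint16_t)sample];
}

/* vol3-style: fixed point -- store the volume as an integer scaled
 * by 256 (8 fractional bits), multiply, then bit-shift right by 8 */
static int16_t scale_fixed(int16_t sample, int16_t volume_fixed) {
    return (int16_t)((sample * volume_fixed) >> 8);
}
```

For example, with a volume of 0.75 the fixed-point factor is `(int16_t)(0.75 * 256) = 192`, and a sample of 1000 comes out as 750 under all three approaches.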
So what were my initial predictions? I believed that a lookup table would be helpful for computers with low processing power, and that computers with more memory would see speed gains as well. That said, unless the lookup table held values for every possible level at a minimum of two degrees of precision, the volume adjustment would likely be less accurate or smooth. In other words, the approach in vol2 would be the least processor intensive, but also less accurate. Regarding vol1, my only thought was that processing time would be greater because it represents decimal numbers (floats) with whole numbers (integers) through a conversion. In other words: more accurate, but also more processing intensive. As for vol3’s fixed-point method, I assumed it would be faster than vol1 since there was no conversion, but I had no idea what the accuracy might be like. That said, I did think it would be slower but more accurate than vol2.
With early predictions out of the way, our group had access to servers hosting five different computer configurations to run our tests: four AArch64 configurations and one x86_64 configuration. On each computer we ran two pairs of tests. The first pair ran the programs with their data-processing algorithms removed; this gave us a baseline showing what other kinds of stress these programs were putting on the computers. The second pair of tests included the algorithms, so we could see each program's full impact. Afterwards, we could determine how much load the actual processing placed on the computers by subtracting the two test results.
So what were the results? Well, for the sake of this post let's look at results from two AArch64 configurations, as well as the x86_64 one. Each of these computers had a name assigned to it, so I will use these names to refer to each one:
Archie –> The wimpy kid with lots of heart
AArch64
4GB memory
1GHz processor with 24 cores
two 32KB L1 caches
256KB L2 cache
4MB L3 cache
GeForce GT 710 graphics card
two hard drives: 1TB, 512GB
Israel –> Fully loaded
AArch64
30GB memory
2GHz processor with 16 cores
32KB L1d cache
48KB L1i cache
1MB L2 cache
Xerxes –> Seemingly unending army of resources
x86_64
32GB memory
4GHz processor with 8 cores but 2 threads per core (virtual cores)*
two 32KB L1 caches
256KB L2 cache
1MB L3 cache
*Makes the processor issue two instruction streams to each core, effectively doing twice the work and giving performance close to that of a processor with double the number of cores. This comes with some drawbacks, but those are beyond the scope of this post.
With everyone introduced, I will begin with Archie and continue with the other computers in a follow-up post:
Algorithm
Test1
| Program | Real (s) | User (s) | Sys (s) |
| vol1 | 58.623 | 56.846 | 1.614 |
| vol2 | 64.169 | 62.095 | 1.936 |
| vol3 | 55.731 | 53.407 | 2.203 |
Test2
| Program | Real (s) | User (s) | Sys (s) |
| vol1 | 58.901 | 56.958 | 1.825 |
| vol2 | 64.345 | 62.218 | 1.975 |
| vol3 | 55.702 | 53.187 | 2.405 |
No Algorithm
Test1
| Program | Real (s) | User (s) | Sys (s) |
| vol1 | 51.613 | 49.299 | 2.214 |
| vol2 | 51.622 | 49.17 | 2.354 |
| vol3 | 51.388 | 49.243 | 2.045 |
Test2
| Program | Real (s) | User (s) | Sys (s) |
| vol1 | 51.499 | 49.347 | 2.047 |
| vol2 | 51.731 | 49.254 | 2.363 |
| vol3 | 51.436 | 49.244 | 2.096 |
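As described earlier, subtracting the no-algorithm baseline from the full run gives a rough estimate of the processing cost alone. A tiny helper makes the idea explicit; the figures in the comments come straight from Archie's Test1 tables above:

```c
/* Processing-only cost = Real time with the algorithm minus the
 * no-algorithm baseline. Using Archie's Test1 Real times:
 *   vol1: 58.623 - 51.613 = ~7.01 s
 *   vol2: 64.169 - 51.622 = ~12.55 s
 *   vol3: 55.731 - 51.388 = ~4.34 s
 */
static double processing_cost(double with_algorithm, double baseline) {
    return with_algorithm - baseline;
}
```

By this measure, the fixed-point approach spends roughly a third of the processing time that the lookup-table approach does.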
Dang it Nigel, what's with all these weird terms: “Real, User, Sys”!?
Haha, sorry about that! “Real” refers to the total wall-clock time the program takes to run. “User” refers to the CPU time spent in user mode, or in other words, when the program is running its own code and library calls rather than directly accessing the computer hardware. “Sys” is the CPU time the program spends in kernel mode, when the operating system (Windows, OS X, Linux, etc.) does work like system calls on the program's behalf.
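If you want to peek at the user/sys split from inside a program rather than through the `time` command, POSIX systems expose it via `getrusage()`. This is just an illustrative sketch, not part of our test setup:

```c
#include <sys/resource.h>

/* CPU time this process has spent in user mode so far, in seconds */
static double user_cpu_seconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
}

/* CPU time this process has spent in kernel (sys) mode so far, in seconds */
static double sys_cpu_seconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}
```

A tight arithmetic loop with no system calls will grow the user figure while leaving sys nearly flat, which is exactly the split the tables above are reporting.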
Okay, back to the results. At a glance, when not using the algorithm the results are reasonably similar, with overall execution time sitting around 51.5 seconds; however, take a look at vol2. While execution times were expected to vary, since other groups were on the servers as well, vol2's Sys time seemed somewhat longer than the other two approaches even when no real work was being done. If I recall correctly, we omitted all code related to processing the data, so the program should really only be loading some data types and the sound sample. This is peculiar, but if I had to guess, it may literally come down to the fact that vol2 had more lines of code:
vol1 = 47
vol2 = 52
vol3 = 46
With that in mind, tighter code may result in slightly faster execution of the program. Well, what about the numbers when the algorithm is included? Looking them over, we see some very interesting results. For one, the Sys time is lower for all programs except vol3, but let's first look at the Real time. Despite spending more time in the kernel, vol3 processes the data much faster than all the other programs. It is even far faster than vol2, which, to my surprise, is definitely the slowest method. This, of course, completely destroys my initial prediction that the table method would be better suited to low-powered processors.
Now, looking only at the User phase, much of the processing load is being done there, with the exception of vol3. To execute its math, vol3 uses a low-level programming technique called bit shifting, which allows for operations close to the hardware; that may be why vol3 spends more of its time in the kernel than the others. It should be noted, however, that vol1 is no slouch as a solution either. Despite mostly executing in user mode, it is still fairly quick, and because it runs in the User phase, the data can be salvaged if something fails.
That said, these results make it clear that optimizing code to take advantage of operations close to the hardware will net considerable performance gains if speed is your goal. If you value safety, in the sense of not losing your data, designing code to operate effectively in the user phase is not a bad idea either.
Well, these were my initial assumptions and ultimate discoveries based on the data my group gathered. The next post will cover the other architectures mentioned, as well as my project. See you then!