More Detailed Profiling Results

Now that I have a list of functions to profile, I want to measure exactly how long each one takes to execute in a Release build of the executable, while keeping the overhead of the profiling/benchmarking code itself as small as possible.  To do this, I need to do the following:

  • Create a “StopWatch” class that I can use to:
    • Name a section of code for easy identification
    • Time that section of code using a high-resolution counter like QueryPerformanceCounter (available only on the Windows Platform)
    • Store some important variables
  • Find all function calls from the previous post and use the StopWatch class to time those sections of code and to note the value of certain important variables that would indicate things like:
    • the length of any input array
    • the length of the FFT being calculated
    • other similar function specific parameters
  • Keep the impact on the application’s performance as small as possible (e.g. don’t insert a printf statement inside a for loop, where it would add a significant delay to every iteration)

So, a quick search on Google turned up the code for a class named “CStopWatch”.  See http://cplus.about.com/od/howtodothingsi2/a/timing.htm for more information.

I modified it, of course, to suit my needs, adding a constructor that lets you “name” a section of code and pass two integer parameters that help identify exactly what was happening in a given iteration.
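A minimal sketch of such a class might look like the following. It is an illustration only, not the code from the linked article: it uses the portable std::chrono::steady_clock in place of QueryPerformanceCounter, and the class name StopWatch, its member names, and the meaning of the two integer parameters are all assumptions made for the example.

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for the modified CStopWatch described above.
// Assumption: steady_clock replaces QueryPerformanceCounter for portability.
class StopWatch {
public:
    using Clock = std::chrono::steady_clock;

    // name:   label for the section of code being timed
    // param1: e.g. the length of an input array (illustrative)
    // param2: e.g. the length of the FFT being calculated (illustrative)
    explicit StopWatch(std::string name, int param1 = 0, int param2 = 0)
        : name_(std::move(name)), param1_(param1), param2_(param2),
          start_(Clock::now()), stop_(start_) {}

    void start() { start_ = Clock::now(); stop_ = start_; }
    void stop()  { stop_ = Clock::now(); }

    // Time between the last start() and stop(), in whole microseconds.
    std::int64_t elapsedMicroseconds() const {
        return std::chrono::duration_cast<std::chrono::microseconds>(
                   stop_ - start_).count();
    }

    const std::string& name() const { return name_; }
    int param1() const { return param1_; }
    int param2() const { return param2_; }

private:
    std::string name_;
    int param1_;
    int param2_;
    Clock::time_point start_;
    Clock::time_point stop_;
};
```

Typical use would be to construct a StopWatch with the section name and the interesting parameters, call start() before the section and stop() after it, and hand the result to the logger outside of any hot loop.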

My CStopWatch class writes everything to standard output from a separate thread, of course, to limit its impact on the analysis.  After a first run, the file stderr.txt was growing too fast, so I modified my logging class to log only executions that take more than 750 microseconds in total.  I then wrote a small awk script to parse the output and give me a maximum and average execution time for each function.  To sum things up, here are my results.  (Note: I added functions that were not listed in the original profiler, and I named sections of larger functions using the underscore character and a section number.)
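The filtering and summarizing steps can be sketched roughly as follows. This is not the logging class or the awk script themselves; it is a C++ illustration of the same idea, and the names TimingSummary, TimingStats, and record are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Only executions longer than this are kept, mirroring the 750 microsecond
// cutoff described above.
constexpr std::int64_t kThresholdMicros = 750;

// Running totals for one named section of code (hypothetical structure).
struct TimingStats {
    std::int64_t count = 0;
    std::int64_t totalMicros = 0;
    std::int64_t maxMicros = 0;

    double averageMillis() const {
        return count == 0 ? 0.0
                          : (static_cast<double>(totalMicros) / count) / 1000.0;
    }
    double maxMillis() const { return maxMicros / 1000.0; }
};

class TimingSummary {
public:
    // Records one execution; returns false if it was below the cutoff
    // and therefore dropped.
    bool record(const std::string& name, std::int64_t micros) {
        if (micros <= kThresholdMicros) return false;
        TimingStats& s = stats_[name];
        ++s.count;
        s.totalMicros += micros;
        s.maxMicros = std::max(s.maxMicros, micros);
        return true;
    }

    const std::map<std::string, TimingStats>& stats() const { return stats_; }

private:
    std::map<std::string, TimingStats> stats_;
};
```

One consequence of this design is worth keeping in mind: because short executions are dropped before they are counted, each average is an average over the logged executions only, not over every call.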

 

Function Name           Average (milliseconds)  Maximum (milliseconds)
FindSpikes                             184.917               3,153.166
analyze_pot_section_1                  155.373                 369.653
outerloop                              109.149               3,328.679
analyze_pot_section_2                   59.354               3,152.386
ChirpData                                4.052                   14.12
find_pulse                               4.032                  38.737
fftwf_execute_r2r                        2.512                   9.079
f_GetChiSq                               2.181                   3.688
GetFixedPoT                              1.871                   7.457
GaussFit                                 1.556                  22.356
FindAutoCorrelation                      1.336                   1.336
f_GetPeak2                               1.333                   2.727
f_GetTrueMean2                           1.313                   2.463
fftwf_execute_dft                        1.254                   6.409
GetPowerSpectrum                         1.118                   1.711
f_GetPeak1                               0.765                   0.765

FindSpikes is 83 lines long and has 2 sets of nested for loops.

analyze_pot_section_1 is 44 lines long and has one for loop, which calls “GetFixedPoT” and “GaussFit” many times.

outerloop is the one big for loop that iterates over each chirp/FFT pair, so we can forget about putting that whole section of code into an FPGA…

analyze_pot_section_2 also has a large for loop nested inside yet another for loop, and the inner loop makes many calls to find_triplets and find_pulse.

The rest of the functions have a much shorter average execution time, probably because they are much simpler.  I will go through them and list how many for loops and how many lines of code each one has.

ChirpData – 56 lines of code – a single for loop that does a simple trigonometric calculation on the input data.  Looks like a good candidate for FPGA acceleration.

find_pulse – 253 lines of code – that much code is pretty hard to squeeze into an FPGA.
fftwf_execute_r2r – an external libfftw function – LabVIEW for FPGA products come with a lot of Fast Fourier Transform methods, so I will have to find out exactly what this function does before looking to LabVIEW for an optimized version.  Still, this is another good candidate for FPGA acceleration.
f_GetChiSq – 52 lines of code – one for loop with a simple calculation going on inside.  Looks like another good candidate for FPGA acceleration.
GetFixedPoT – 102 lines of code – has a huge collection of if-else if-else statements… skip this for now…
The rest of the functions average less than 2 milliseconds, an execution time that I tend to shy away from when doing FPGA work because I fear the transfer time over PCI/PCIe might be too much.  However, it is not a foregone conclusion, so I will still analyze them, but I will keep them in a separate category.
So here are the rest of the functions:
GaussFit – 219 lines of code – this function calls many other smaller functions which also made the list of longest execution time.
FindAutoCorrelation – 86 lines of code – has two sets of nested for loops, looks like a lot of stuff is going on in each for loop, but it still appears to be a likely candidate for FPGA acceleration, although I would look at this function after I look at some other ones first.
f_GetPeak2 – 19 lines of code – has a simple for loop that calculates a weighted sum.  This is definitely a candidate for FPGA acceleration.  Note: I appended the number 2 to the name of this call because f_GetPeak is called from another place as well.
f_GetTrueMean2 – 15 lines of code – one for loop that adds up all elements in an array.  This is embarrassingly easy to stick into an FPGA.  So, one more candidate for FPGA acceleration.
fftwf_execute_dft – must learn the libfftw library for this one… Another candidate for FPGA acceleration.
GetPowerSpectrum – 13 lines of code.  A simple for loop with two multiplications and an addition in each iteration.  Another embarrassingly easy candidate for FPGA acceleration.
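To make the shape of these “embarrassingly easy” loops concrete, here is a rough sketch of what such functions could look like. It is not the application’s actual source: the Complex32 type, the function names powerSpectrum and sumElements, and the use of std::vector are assumptions for illustration, based only on the descriptions above (two multiplications and an addition per iteration, and a loop that adds up all elements of an array).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical complex sample type: real and imaginary parts as floats.
struct Complex32 {
    float re;
    float im;
};

// One pass over the input; each iteration does two multiplications and
// one addition, computing the squared magnitude of each sample.
std::vector<float> powerSpectrum(const std::vector<Complex32>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i].re * in[i].re + in[i].im * in[i].im;
    }
    return out;
}

// One pass over the input that adds up all of the elements.
float sumElements(const std::vector<float>& values) {
    float total = 0.0f;
    for (float v : values) {
        total += v;
    }
    return total;
}
```

Loops like these have a fixed amount of arithmetic per element and no branching, which is what makes them attractive to move into hardware; whether it pays off still depends on how much data has to cross the bus for each call.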
So overall, we have the following functions for FPGA analysis:
  1. f_GetChiSq
  2. ChirpData
  3. fftwf_execute_r2r
  4. f_GetPeak
  5. f_GetTrueMean
  6. fftwf_execute_dft
  7. GetPowerSpectrum
What is the best way to proceed? Well, for starters, I can use the parameters recorded when calling those functions to learn exactly how many iterations each one goes through, and then I can look at the source data to see how much data each function requires.  Then I can make a test function that resides in an FPGA and see what it does for us!
