[S5R3/R4] How to check Performance when Testing new Apps

Anonymous
Topic 13645

I just found this in another thread and think it might be suitable to be answered here:

Quote:
Quote:
Alternatively, we could just go ask Bernd instead of expending all that energy. I'm rather curious as to why he hasn't put us out of our misery earlier :). Of course that would completely take away the thrill of the chase :).

Quote:
I bet Bernd is reading these threads, and just chortling quietly at us as we bumble around re-inventing his algorithm......

There's probably an E@H Head Office Sweepstake riding on this, and it might have attracted serious money! Keep up the top work guys .... lookin' good. :-)

Cheers, Mike.


No, there isn't.

Actually I wasn't aware of how much time and effort people here had put into this. I read about a "Ready Reckoner" in some beta test thread and was satisfied that people had apparently found a tool that made it easier for them to judge Apps. Apart from that I was kept busy enough by the threads I'm subscribed to (which usually are the Beta App threads), and I'm probably the only one in the "office" with noticeable time to spend reading the forums. Sure, I meant to read through all the related threads once I had a bit more time - it's just that I hadn't. I still have read only a small fraction of the posts.

Wow, what excitement and ingenuity I missed!

To add my two cents here's some very, very technical background (which you probably already largely figured out), and a proposal at the end:

The "Hierarchical Search" we have been using since S5R2 consists of two steps performed alternately for each "template": an "F-statistic" (or "Fstat" for short) step that is similar (but not identical) to a DFT, and a Hough-transform-based coincidence analysis. Previous runs consisted only of the Fstat step, and the coincidence analysis was part of the post-processing. In the Hierarchical Search we analyze data from two detectors with the Fstat method, then essentially look for similarities in the results, as a gravitational wave should show up in the signals of both instruments while local noise and disturbances shouldn't. This "coincidence analysis" is currently done in the Hough part.

When you point a profiler such as Intel's VTune or Apple's Shark at a current App, you should see the pattern of alternating calls to the functions ComputeFStatFreqBand and ComputeFstatHoughMap (and their subfunctions; in recent Apps these are replaced by optimized functions with the prefix "Local") that implement these two steps. Currently about 2/3 of a task's run-time is spent in the Fstat analysis and 1/3 in the Hough part. The machinery around these two functions doesn't matter much for the computation time.

The time needed for the Fstat part is constant for every template, i.e. independent of the actual sky position and frequency. This isn't true for the Hough step. The size of a fundamental data structure in the Hough part depends on the declination of the current sky position (actually on the sine - abs() or squared, I don't remember off the top of my head - of the angle between the sky position and the ecliptic plane), and so does the time of every computation that involves it. (This is very much simplified, but if you neglect everything that averages out over a task even in S5R3, it boils down to this.)

This effect averages out in an "all-sky" task, which is what we had in all previous analysis runs up to and including S5R2. This is why we didn't see this run-time variation in S5R2, although we already used the Hierarchical Search there, and also why the credits are right on average in S5R3.

In order to get reasonable and equal run-times for each task (even though the amount of data to be analyzed keeps growing from run to run), we found it necessary in S5R3 to split up not only the frequency range but also the sky positions between the tasks. This splitting is actually done in the application:

Having had quite some trouble in early runs with calculating the grid of sky positions in the application itself (due to numerical differences between architectures and compilers), we now distribute files with lists of pre-calculated sky positions, the "sky-grid" files. The granularity of this grid depends on the frequency; there is a new sky-grid file every 10 Hz. The sky positions in such a file start at one pole, followed by all sky positions next to it in order of right ascension, then the next-nearest circle of sky positions, and so forth, until the other pole is reached.

If you look at the command line of the App (e.g. in client_state.xml), you'll see that it gets passed options named "numSkyPartitions" and "partitionIndex", originally set by the workunit generator for this particular workunit. The App splits the sky-grid file into "numSkyPartitions" portions and calculates the sky positions of portion number "partitionIndex". (In hindsight it looks cleverer to me to take every "numSkyPartitions"th sky point of the file, starting with the "partitionIndex"th; that would still cover the whole sky with every task, look more interesting in the graphics, and average out run-times.)
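As a minimal sketch, here are the two partitioning schemes side by side: the contiguous split described above, and the interleaved "every numSkyPartitions-th point" alternative. The function names and the handling of a non-divisible remainder are my own illustration, not the App's actual code:

```python
def contiguous_partition(skypoints, num_partitions, partition_index):
    # Roughly what the App does with numSkyPartitions / partitionIndex:
    # split the sky-grid list into contiguous chunks. The remainder
    # handling here is a guess, not the App's actual code.
    base, rem = divmod(len(skypoints), num_partitions)
    start = partition_index * base + min(partition_index, rem)
    size = base + (1 if partition_index < rem else 0)
    return skypoints[start:start + size]


def interleaved_partition(skypoints, num_partitions, partition_index):
    # The alternative described above: every num_partitions-th sky point,
    # starting at partition_index -- each task still samples the whole sky.
    return skypoints[partition_index::num_partitions]
```

With the contiguous split, one task gets a band of neighboring sky positions (and thus similar declinations); with the interleaved split, every task samples the whole declination range.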

So the current run-time is the sum of a part determined only by the number of templates in each task (which can be derived from the currently assigned credit), and a part that varies with the (average) declination of the sky positions in the sky partition assigned to the task.
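As a toy illustration of this sum (the constants are made-up placeholders, not measured values; in reality they differ per App version, machine, and frequency, and whether the Hough dependence is abs() or squared is left open above - abs() is used here for illustration):

```python
A = 1.0   # per-template Fstat cost (constant for every template)
B = 0.5   # scale of the declination-dependent Hough cost

def template_runtime(sin_beta):
    # sin_beta: sine of the angle between the sky position
    # and the ecliptic plane.
    return A + B * abs(sin_beta)

def mean_runtime_over_sky(n=1000):
    # Averaging over sin_beta spread evenly across [-1, 1], as an
    # all-sky task effectively does, washes the variation out.
    grid = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return sum(template_runtime(s) for s in grid) / n
```

A task confined to one sky band sits somewhere between A (near sin_beta = 0) and A + B (near the extremes), while an all-sky task always lands near the same average - which is why the variation only showed up once the sky was partitioned.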

The trouble is that the ratio between these two parts differs, of course, for every App (version), but also for every machine, and it finally also depends on the frequency (the higher the frequency, the larger the Fstat part). The Fstat part is FLOPs-bound: it doesn't read or write much data and depends on the speed of the FPU (or whatever SIMD unit the current App is using). The Hough part is largely limited by memory bandwidth, which is somewhat orthogonal to FP speed. The first App optimizations, to the "kernel loop" and the "sin/cos approximation", addressed the Fstat part, while the recent "prefetching" affects the memory access of the Hough part.

Honestly I find it very hard to "tune" the credits of a workunit (i.e. predict the run-time for it) in a way that would do justice to everyone and every machine, especially if that should also serve as a reference for comparing speed of Apps.

For the latter I think it would be best to construct a "reference workunit" that covers the whole sky and thus averages out these variations, e.g. by taking a current workunit and picking some points from the sky-grid file in the way described above. This would, however, mean that people interested in speed comparisons would need to run it on their machine with every new App they get. Remember that you'd do this without getting BOINC credit for it, just for the comparison. How much (run-)time would you be willing to sacrifice for this?

BM

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820


Quote:

I guess one of the remaining mysteries is the connection between the

- frequency displayed in the task name (f_t),
- the frequency in the command line (f_c)
- and the frequency of the sky grid file (f_g).

As far as I can see,

f_g = ceil(f_c/10.0)*10.0

but the connection between f_t and f_c is less obvious. It's close to

f_c ~ f_t + 0.15

but not exactly that.


I haven't got the code in front of me, so this is also off the top of my head (and actually Reinhard wrote the workunit generator, not me):

- f_c is the actual base frequency where the search for this task starts.
- f_g is the sky-grid file for this frequency, your formula is correct.
- f_t is actually part of another name: all task names begin with the name of the "lowest" data file that is required for this task.

We need some "wings" below the base frequency and above the top frequency of the task, the latter being f_c + bw (the bandwidth; I think the command-line option is --FreqBand). The size of the necessary "wings" also increases with frequency: you'll need more data files at higher frequencies even if the bandwidth stays the same, so the difference between f_t and f_c will also increase with f_c.

The first column in the result file should always lie between f_c and f_c + bw. The length of the output file is limited, so we keep only the n "most interesting" candidates (IIRC n = 15000). Of course they are not necessarily equally spaced in frequency.
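The confirmed f_g relation can be sketched as a one-liner (the function name is my own, purely for illustration):

```python
import math

def skygrid_frequency(f_c):
    # There is a new sky-grid file every 10 Hz, so round the task's base
    # frequency up to the next multiple of 10: f_g = ceil(f_c/10.0)*10.0
    return math.ceil(f_c / 10.0) * 10.0
```

So, for example, a task with a base frequency anywhere above 1000 Hz up to and including 1010 Hz uses the 1010 Hz sky-grid file.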

BM



Ok, here's the first try of a reference workunit to measure the performance of Apps: refwu.zip.

It is a current h1_1005.00_S5R3 workunit with a modified skygrid file that covers the whole sky in 31 points.

It is meant to work on MacOS and Linux (for now); hopefully someone better at scripting on Windows than me could write a batch file along the lines of the shell script "run.sh" - I'll be happy to include it in the archive.

The script takes the application executable as its first (and only) argument. It requires either curl or wget to be present on the system and downloads additional data files from the Einstein@home server (if not already present) before running the App.

Depending on the App version and the system the run-time should be about half an hour.

Have a try and post your comments here.

BM



Quote:
Quote:
The script takes the application executable as its first (and only) argument. It requires either curl or wget to be present on the system and downloads additional data files from the Einstein@home server (if not already present) before running the App.

I unzipped it in the ~BOINC/projects/ein* directory and ran it. But it produced the error message "cannot find binary to execute".


You probably need to tell it which App you want to test.

Try something like

./run.sh ./einstein_S5R3_4.49_i686-pc-linux-gnu_1

BTW: with "switching" Apps calling the switcher doesn't work right away, you'll need to call the actual App.

BM



Quote:
Quote:


Try something like

./run.sh ./einstein_S5R3_4.49_i686-pc-linux-gnu_1

BTW: with "switching" Apps calling the switcher doesn't work right away, you'll need to call the actual App.

BM


This is the result with *linux-gnu_1
real 11m42.814s
user 11m34.615s
sys 0m3.040s
Tullio


Looks like a rather fast machine. Can you post the details?

Actually the time only makes sense for comparing Apps on the same machine. Could you run this with the generic App (*linux-gnu_0), and possibly with the SSE2 App?

BM



Note that you had better not unzip the archive and run the script in the original BOINC/projects/einstein.phys.uwm.edu directory. If you have a workunit in the same frequency range, you might mess up your sky-grid file and thus your results (unzip should ask for confirmation before overwriting, but BOINC doesn't). At the very least, delete the sky-grid file after running the test.

BM



For the curious: The reference workunit should still work with S5R4 Apps.

BM

