GNU/Linux S5R3 "power users" App 4.27 available

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820
Topic 13642

This App incorporates SSE vector code but has no CPU feature detection. It will crash badly on non-SSE machines.
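Since the App does not check for SSE itself, a wrapper could do so before launching it. A minimal sketch, assuming a Linux host where /proc/cpuinfo carries a "flags" line; the helper names are made up for illustration and are not part of the App or of BOINC:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse(cpuinfo_text):
    """True if the 'sse' flag is listed (an exact match, so 'sse2' alone does not count)."""
    return "sse" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling has_sse(read_cpuinfo()) on the target machine would tell you whether running the App is safe in this respect.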

Other calculations are done in x87 FPU math, including the linear SIN/COS approximation, which should make it run (and run fast!) on older CPUs (e.g. PIII).
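The App's actual SIN/COS code is not shown in this thread. As a generic illustration of what a linear approximation can look like, here is a sketch using a lookup table with linear interpolation between entries; the table size of 64 is an assumption chosen for the example:

```python
import math

TABLE_SIZE = 64  # assumed resolution; the App's real table size is not stated
STEP = 2.0 * math.pi / TABLE_SIZE
# One extra entry so that index i + 1 is always valid.
SIN_TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE + 1)]
COS_TABLE = [math.cos(i * STEP) for i in range(TABLE_SIZE + 1)]


def sincos_linear(x):
    """Approximate (sin x, cos x) by interpolating linearly between table entries."""
    t = (x % (2.0 * math.pi)) / STEP  # position in table units
    i = min(int(t), TABLE_SIZE - 1)   # lower table index
    frac = t - i                      # distance from entry i, in table units
    s = SIN_TABLE[i] + frac * (SIN_TABLE[i + 1] - SIN_TABLE[i])
    c = COS_TABLE[i] + frac * (COS_TABLE[i + 1] - COS_TABLE[i])
    return s, c
```

With 64 entries the interpolation error stays below roughly 0.0013; a larger table trades memory for accuracy.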

This is also the first Linux App featuring graphics in a separate program - it doesn't need any resources if you don't run it.

Only run this if you're sure of what you're doing.

Find it on the Power User Apps page.

Happy crunching!

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

GNU/Linux S5R3 "power users" App 4.27 available

Quote:
But that windows stuff in the app_info.xml is funny.


Sorry, the app_info.xml was derived from the Windows one. Maybe I should delete the tag altogether. All people having trouble on 64 bit: give it a try. Remove the line manually.

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

RE: Looks like the fastest

Quote:
Looks like the fastest S5R3 app so far.


Do you mean the fastest Linux S5R3 App or the fastest S5R3 App altogether? The Intel Macs should be able to boot Ubuntu (at least after the firmware update that comes with installing BootCamp or Leopard), so would anyone give it a try and compare it to 4.22? A live CD should do...

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

RE: RE: But that windows

Message 3424 in response to message 3422

Quote:
Quote:
But that windows stuff in the app_info.xml is funny.

Sorry, the app_info.xml was derived from the Windows one. Maybe I should delete the tag altogether. All people having trouble on 64 bit: give it a try. Remove the line manually.


I removed the tags in the app_info.xml for Apps 4.25-4.27.

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820


Well, 4.27 has been stripped of all the SSE2 instructions that made it fast on SSE2-capable machines. I was hoping that the faster sin/cos code would make up for this, but whether it actually does depends on the particular CPU. If it isn't slower than 4.21, then the bugfixes and the broader variety of CPUs it runs on should be worth it anyway.

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

RE: RE: I'm running

Quote:
Quote:

I'm running Ubuntu 8.04 64 bit with the 5.10.30 boinc-client.

The "Show Graphics" button in boinc manager doesn't produce any output.
But if I open a terminal and run the einstein_S5R3_4.27_graphics_i686-pc-linux-gnu program with a workunit from slots/0 as parameter I get a perfect graphics window.

From stderr:

No protocol specified
GLUT: Fatal Error in BOINC: could not open display: :1.0

http://einstein.phys.uwm.edu//task/90982448


Open a terminal and type "echo $DISPLAY". If you get something other than ":1" or ":1.0", terminate the Manager and the client ("killall boinc") and run "BOINC/run_manager" again. Having a DISPLAY of :1 is slightly unusual if you have only one screen and are not logged in remotely.
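The comparison can also be scripted. A small sketch, assuming the usual "host:display.screen" form of X11 display strings, where a missing screen number means screen 0 (so ":1" and ":1.0" are the same display); the function names are illustrative only:

```python
import os


def parse_display(value):
    """Split an X11 display string 'host:display.screen' into its three parts."""
    host, _, rest = value.rpartition(":")
    display, _, screen = rest.partition(".")
    return host, display, screen or "0"


def same_display(a, b):
    """True if two display strings name the same host, display and screen."""
    return parse_display(a) == parse_display(b)


def current_display():
    """Return $DISPLAY of the calling process, or '' if it is unset."""
    return os.environ.get("DISPLAY", "")
```

If same_display(current_display(), ":1.0") is false in your terminal, the client was started with a DISPLAY that differs from your session's.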

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

RE: A couple of errors I've

Quote:

A couple of errors I've noticed with this 4.27 app. Both of them were on the same host, which is overclocked, but has been at an unchanged level of overclocking for the last 3 years (24/7 operation) with no prior history of problems. The HSF seems OK but I guess I should clean the fins again to be sure, seeing as it is the middle of summer here at the moment.

This result had a "process exited with code 99" error. Later on in the output it said, "Required frequency-bins [-7, 8] not covered by SFT-interval [1412793, 1413317]"

This result had a "process exited with code 41" - which on inspection is a signal 11. I thought they were supposed to be solved now :).


I'd consider it rather normal for a computer to run for a certain time without problems and then suddenly show all kinds of weird behavior. Three years might look a bit short, but I'd suspect this to be within normal tolerance. Our old Merlin cluster machines ran under perfectly controlled conditions for almost precisely three years; now one or two machines die per week. In addition, I'd expect overclocked machines to age faster on average than ones that are run within specifications.

The "frequency-bins [-7, 8]" points to a problem in the FPU (a variable that lives only in an FPU register became NaN). A "signal 11" is the Linux equivalent of a "General access violation" on Windows - the number of possible reasons is infinite, and they could be in hardware as well as in software.

I'd think the machine has reached EOL, at least for number crunching.

One reason for a 'signal 11' in the App was in software (the BOINC library) and has been fixed. It occurred when the Core Client became unresponsive, e.g. when waiting for a DNS lookup. You can simulate this by sending the client a STOP signal ("killall -STOP boinc"), waiting for ~30 seconds and then sending a CONT signal ("killall -CONT boinc"). Old Apps that were running would crash with a "signal 11"; new ones should "exit with 0 status and no 'finished' file" and should be restarted by the Client.
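The STOP/CONT experiment can be scripted as well. A minimal sketch, assuming you already know the PID of the client (killall selects by name, this helper takes a PID); the function name is made up for illustration:

```python
import os
import signal
import time


def freeze_process(pid, seconds):
    """Suspend process `pid` with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

Calling freeze_process(client_pid, 30) corresponds to the ~30 second pause described above.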

According to Charlie Fenton the BOINC developers are currently implementing asynchronous DNS requests, which should fix the same problem from the Client side, i.e. it will respond to the Apps even when waiting for a DNS lookup.

An exit status of 41 is a "signal 11" that was caught by the signal handler. In contrast to a "signal 11" with exit status 11 it means that there should be a stacktrace in stderr that tells us at least where the error happened. And indeed it does! I'll take a look as soon as I have time.

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

RE: Fraction done and

Quote:
Fraction done and estimated time to finish are broken with EAH 4.20 + CC 4.19 Linux (Windows does not have that problem), with EAH 4.27 those features returned :


FWIW: This is not a feature of the Core Client but of the Linux kernel version (actually of the pthread library that comes with older kernels, which violates the POSIX standard). We fixed this in BOINC by preserving the "old" (non-standard) CPU time query on Linux (calling getrusage(2) in a signal handler...) while using the standard way on MacOS, which avoids the deadlocking we were observing there.
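For readers unfamiliar with getrusage(2): it reports the CPU time a process has consumed so far. The sketch below shows only the plain call from ordinary code, not the signal-handler technique used in the BOINC library; the function name is illustrative:

```python
import resource


def cpu_seconds():
    """CPU time (user + system) consumed so far by this process, in seconds."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime
```

Sampling cpu_seconds() at two points in time and relating the difference to the work done in between is the kind of information that a "fraction done" or "time to finish" estimate depends on.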

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820


A new SSE Linux App is out: 4.35.

BM


Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820


Kathryn, Annika,

thanks for your report.

I do have a possible explanation, though it's very technical. Maybe someone can translate it:

The code in the BOINC library (which gets linked into the Apps) that determines the CPU time has been changed quite a few times in recent months. The reason is that old Linux kernels (i.e. the pthread library there) behave in a non-standard way, so a non-standard "trick" is used to get this information anyway.

Using this trick on all (non-Windows) systems led to deadlocks on MacOS, with the App apparently being "stuck" (no progress for hours until it is restarted). So currently the BOINC library uses the "standard" method on MacOS and the non-standard "trick" on Linux.

It might be that a deadlock similar to the one we previously saw on MacOS can now happen with certain newer Linux kernels, too. If it's really a deadlock, it might depend on the version of the BOINC library that's in other Apps running at the same time, so this may well be limited to certain project pairs.

If this is what's happening, the only way around it would be to choose the method at run-time based on the kernel/pthread version of the system, which I guess would take quite some programming effort (in the BOINC library, though).
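To illustrate what such a run-time choice could look like, here is a sketch that asks glibc which threading library is in use. The rule "NPTL means standard method, anything else means trick" is purely an assumption for the example; which versions are actually affected is not known from this thread, and the function names are hypothetical:

```python
import os


def pthread_version():
    """Return the threading library string reported by glibc, or '' if unknown."""
    try:
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION") or ""
    except (AttributeError, ValueError, OSError):
        # Name or function not available on this platform (e.g. non-glibc systems).
        return ""


def pick_cpu_time_method(version):
    """Choose 'standard' for NPTL and 'trick' otherwise (assumed rule, see above)."""
    return "standard" if version.upper().startswith("NPTL") else "trick"
```

A real implementation would live in the BOINC library and would need a rule derived from the actual failing systems.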

Is it possible to explicitly trigger this problem? The only way to find out what's wrong is to attach a debugger or profiler to the App and see what it actually does or where it is stuck (on MacOS I found it with Shark).

BM

