new units not downloading

Anonymous
Topic 13012

This may be at least partly a screw-up on my side.

The "new" S4 data files are named l1_XXXX.X and h1_XXXX.X, in contrast to the "old" files which are named L1_XXXX.X and H1_XXXX.X.

Unfortunately I had not realized that on Win32, file names are case-insensitive.
So there may be some issues in the next few days if workunits which are supposed to use the file H1_0400.0 (which has a particular size and checksum) instead try to use the file h1_0400.0 (which has a DIFFERENT size and checksum).

Meanwhile, I'll see what I can do on the server side to ameliorate this issue.

Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

new units not downloading

After some discussions with David Anderson, I've taken the simple way out. I've cancelled the workunits with names that start with "h1_" (NOTE: this is case-sensitive; work starting with "H1_" is NOT cancelled).

I've also removed the problematic h1_XXXX.X data files from the download servers. After these changes propagate to the data server mirrors (15 to 30 minutes) this should generate hard download errors for any client that attempts to download these WUs.

I'll rename the workunits and files using "w1" (w for Washington state, where the Hanford detector is located) and reissue them.

Apologies to everyone for this fiasco. It's my fault. Hopefully we can recover quickly.

Please feel free to manually abort any h1_ workunits. My apologies for wasted CPU cycles. Fortunately these workunits have only been out there for a half-day so this shouldn't be too severe.

Bruce

Bruce Allen

RE: Any chance to reset the

Quote:
Any chance to reset the "daily quota" things too for today?

Good idea. I should be able to reset the daily quota for any host that has had WU cancelled. I'll work on this now.

[Update 10 minutes later]

DONE!

I've reset the daily result quota for any host that received an h1 workunit.
By the way, I don't think I ever said 'thank you' to those people who pointed out that something was wrong.

THANK YOU VERY MUCH!!

Could anyone suggest a simple and reliable way to abort h1_ workunits from any host, including those running old clients? Since the input data file is no longer on the download servers, I would have thought a simple and guaranteed solution was (1) stop BOINC (2) delete all files named h1_* (LOWER CASE!) and (3) restart BOINC. Can anyone confirm that this works? Is there an easier way?

Bruce

Bruce Allen

RE: I've had a large number

Quote:
I've had a large number of boxes affected by this. I've only just noticed it a few minutes ago on one box. I stopped BOINC, deleted the h1 file (lower case h), restarted BOINC, forced an update and got a new file (l1 this time - lower case ell) and everything seems sweet again.

I'm glad this works. I think that this is probably the easiest procedure for most users.

Quote:
I've started looking at other boxes that I can't physically get to immediately and have found quite a number (probably about 10 so far) that have errored out work for no apparent reason today. Interestingly a number of these show signs of autorecovering in that fresh work is appearing in the list of results.

The basic problem is that some hosts may have WU that refer to different files, named (for example) H1_0050.0 and h1_0050.0. These have different lengths and different checksums. But Windows treats these files as the same and will replace one with the other. Hence a WU may error out because the checksum stated in the workunit does not agree with the calculated checksum from the file. If this happens, then all is well because the WU will exit immediately with no wasted CPU time.
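The failure mode described above can be sketched in a few lines. This is an illustration only, not BOINC's actual code: the file contents are made up, and the point is simply that two files whose names differ only in case collapse into one file on Windows, so one workunit's stated checksum can no longer match what is on disk.

```python
import hashlib
import os
import tempfile

def md5sum(path):
    """Compute the MD5 checksum of a file, the way a client would
    verify a downloaded data file against the workunit description."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Two data files that differ only in the case of their names
# (contents are placeholders for illustration).
d = tempfile.mkdtemp()
with open(os.path.join(d, "H1_0050.0"), "wb") as f:
    f.write(b"old S4 data")          # checksum expected by H1_ workunits
with open(os.path.join(d, "h1_0050.0"), "wb") as f:
    f.write(b"new S4 data, longer")  # different length, different checksum

# On a case-sensitive filesystem (Linux) both files coexist; on Windows
# the second write silently replaces the first, so whichever workunit
# expected the other content now fails its checksum test and errors out.
case_insensitive_fs = len(os.listdir(d)) == 1
```

That immediate checksum failure is the benign outcome: the WU exits right away rather than burning CPU time on data it was never meant to process.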

Quote:
I'm not at all angry about this - c'est la vie, as they say. All I'd like to know is whether all affected boxes will autorecover now that the 8 per day has been reset, or will I physically have to go to each box and delete the offending h1 file?

I'm glad you're not mad, though I imagine that others will be! In a few hours I will again re-run the script that resets the daily result quota for machines that got h1_ workunits. This should help the machines to get more work right away.

If you don't delete the offending h1 file, I am not sure what will happen. In some cases, if there is no conflict with an H1 file name, the WU may well complete. Then the main issue is wasted CPU cycles, since I cancelled these WU on the server side.

Cheers,
Bruce

Bruce Allen

RE: @ Bruce Allen, Why you

Quote:

@ Bruce Allen,

Why haven't you changed the application from 4.79 to 0.03 (Windows) yet?

-only a Q.-

greetz littleBouncer

We should probably have this discussion in the other thread. But the short answer is that the new executable seems to be slower in most cases. We need to understand and fix that problem before distributing it widely.

Bruce

Bruce Allen

RE: OK, thanks very much

Quote:

OK, thanks very much for the reply. Let me get this straight. If I'm seeing repeated attempts to get a file and repeated checksum errors it's due to a clash between a H1_xxxx and a h1_xxxx and this results in rapidly errored out work.

However, if I see any box with work in its results list starting h1_nnnn then whilst it appears at the moment to be proceeding normally, I'm going to get a rude awakening when that work is finished and attempted to be reported so I'm going to be wasting cycles big time unless I go and delete all h1_ work on all boxes that have it.

Does that about sum it up in layman's terms? :).

Yes!

I have CANCELLED all h1_ workunits. That means that any CPU time spent on them is entirely wasted. No credits, no glory, no purpose.

Shoot those workunits before they tire out your CPUs.

(And once again, sincere apologies for this fiasco.)

Bruce

Bruce Allen

RE: The interesting thing

Quote:
The interesting thing was that on at least three occasions BOINC claimed to be able to re-get at least part of the h1_nnnn large file. I thought they were all supposedly deleted? Maybe BOINC was kidding itself :).

E@H uses five different data servers. Four are mirrored off the root server at UWM. I deleted the files from that root server about 8 hours ago, and the secondary servers are supposed to mirror that change after no more than 15 minutes. However if one or more of them failed to mirror the changes, then it will continue to serve out the files and might cause the behavior that you saw.

Bruce

Bruce Allen

RE: What about the "l1_xxx"

Quote:
What about the "l1_xxx" WU's? I understand the point that lowercase h WU's are troublesome right now and should be aborted. What about lowercase l WU's?

Lowercase l workunits (l1_XXXX.X__...) are FINE! This is because we don't have any data sets labeled L1_XXXX.[05] for them to get confused with.

Bruce

Bruce Allen

RE: The interesting

Quote:
The interesting question is what is going to be the reaction of the silent majority out there who don't regularly follow the lists and are going to be mightily confused by these strange happenings. Is it possible to send a small email to all registered users to warn them to check if they have h1_nnnn style data files and if so check the web page for details? I can just imagine the complaints if someone has a couple of days of h1 work and they don't immediately notice that there is no credit. I watched one of mine do that and that spurred me into action :).

I have thought about doing this. There are about 6000 host machines that got these workunits, and about 5000 users. But it would take me some hours to cobble together and test scripts for mailing the users, and I would rather spend the time testing the new w1_XXXX workunits to make sure they are OK.

[Edit added 30 min later]
I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits which I have cancelled. I am going to use this to grant credit to people who have had the misfortune of getting and doing work then having it cancelled.

Bruce

Bruce Allen

RE: RE: I found a script

Quote:
Quote:


I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits which I have cancelled. I am going to use this to grant credit to people who have had the misfortune of getting and doing work then having it cancelled.

that is a nice touch, Bruce.

Fortunately it does not affect me, but I'm pleased to see the swift way the problem has been dealt with.

Thank you very much.

Real science is VERY error prone. In fact, one of the distinguishing characteristics of real research is that (especially the first and second time) one gets it wrong more often than not. The only saving grace in all of this is that with other scientists you get 99.9% forgiveness for being brutally honest about what happened and why. That's the one thing I can promise Einstein@Home participants they will get 100% of the time.

Bruce Allen

RE: RE: [Edit added 30

Quote:
Quote:


[Edit added 30 min later]
I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits which I have cancelled. I am going to use this to grant credit to people who have had the misfortune of getting and doing work then having it cancelled.

Bruce

I'm very pleased that you have done that and it will be good for the silent majority who probably aren't even aware of the problem yet.

However, it's not my day today :). I took your advice and cancelled running work that was in many cases 80-90% complete!!! And I'm still not mad at you in the slightest :). I'd rather lose the credits than hold up the science by doing work that will only have to be repeated anyway so my cancelling the partly completed work was still the right thing to do.

Good news -- I'm giving credit for cancelled and 'download error' work as well as successful and valid results. Since these problems were my fault it seems the least I can do.

Quote:

It must have been one of those nightmare days (and nights) for you :).

I confess to being in a pretty foul mood for most of the day today!
