Serious BUG: Phantom WUs NOT on user client machines but on result pages

Anonymous
Topic 12782

> Could that be related to the fact that in reality we have 2 (two) servers, one
> in the states and one in Germany? Maybe one server says: "Ok, take this
> xyz-wu" then after some time you do a manual update and reach (by accident)
> the other server and it says: "Hey, you never got that xyz-wu from me, so
> forget it!".

I don't believe that, because there are not two 'scheduler' servers, only two 'data' servers.

This means it may be possible to download the WUs from several servers, but only one server decides which WU is yours.

I don't know whether we can upload to several servers, but I doubt it, because that would require synchronizing over the web. And if that had been implemented, I am sure we would have heard about it.

Doris and Jens
Joined: 30 Oct 04
Posts: 34
Credit: 366,238
RAC: 597

> I have two on my machine and both were sent on Feb 26, the same day as the
> server crash. I suspect a lot of us have these phantom WUs from the same date as
> well. My big concern is just that they won't get completed if it looks like
> they've been sent out already.

If not enough valid results are returned, the scheduler will send out new results until a canonical result for the WU is found.

Greetings from Bremen/Germany

Jens Seidler (TheBigJens)

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

I'm going to add something more to the FAQ about this.

The basic problem is that some hosts are not getting the scheduler replies. When this happens, two things go wrong:
[A] The work sent to the host in the scheduler reply is lost.
[B] If the host was registering for the first time, it never gets its hostid back, and
then registers AGAIN (and sometimes AGAIN and AGAIN!). These hosts also
have problem [A] above!

There appear to be at least two separate bugs. One is 'understood and solved' and the other is not.

[1] The 4.19 (and some earlier) clients did not handle some HTTP proxy servers correctly. As a result, hosts behind a proxy, or networked with some variants of Windows networking options, either did not receive scheduler replies at all or received them incomplete.

[2] Even when not using an http proxy server, some scheduler replies do not make it back to the host. I haven't been able to isolate (yet) when/how this happens. I'd be grateful for assistance!

Topic [B] above is *already* addressed in the FAQ.

If you have a particular host machine which continues to exhibit problem [A] (work sent to it is repeatedly lost) and it is NOT behind a proxy server, please write something about the host in this thread so that we can try to understand what is wrong.

I've modified the server to try and detect problem [B] above and to send an error message to the BOINC client. This may reduce the impact of the bug, but is only a workaround, not a solution.

Cheers,
Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

> Should we let you know about lost work even if it isn't a case of it being
> continually lost or will those WUs be taken care of some other way?

If the work is not being continually lost, but was only lost until the host got registered, then please ignore it. The lost work will be resent as soon as it times out (after one week).

On the other hand, if you have a host which is not behind a firewall and which is continuously losing work, I'd like to know. This should be very useful for debugging BOINC.

Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

> What should I do to delete a "phantom WU" from my Results list and remove myself
> from the list of participants processing this unit?

You can't remove these from your result list. But eventually they will time out (after a week) and get issued to some other user. When three successful results have been returned and validated, the whole WU and all the phantom results will get purged.
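
The life cycle described here (a phantom result expires after a week, a replacement is issued, and the WU is purged once three results have validated) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `WorkUnit` and `Result` classes, the `tick` method and the state names are invented for the sketch and are not the BOINC server's actual data model; only the one-week deadline and the quorum of three come from the post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

DEADLINE = timedelta(days=7)  # "after a week", from the post
QUORUM = 3                    # "three successful results", from the post


@dataclass
class Result:
    sent: datetime
    state: str = "in_progress"  # hypothetical states: in_progress, valid, timed_out


@dataclass
class WorkUnit:
    results: list = field(default_factory=list)
    purged: bool = False

    def tick(self, now):
        """Expire overdue results, issue replacements, purge once the quorum is met."""
        for r in self.results:
            if r.state == "in_progress" and now - r.sent >= DEADLINE:
                r.state = "timed_out"  # a phantom result ends up here
        valid = sum(r.state == "valid" for r in self.results)
        if valid >= QUORUM:
            self.purged = True  # the WU and all its results, phantoms included, go away
            return
        outstanding = sum(r.state == "in_progress" for r in self.results)
        for _ in range(QUORUM - valid - outstanding):
            self.results.append(Result(sent=now))  # reissue to some other host


t0 = datetime(2005, 2, 26)
wu = WorkUnit()
wu.tick(t0)                        # three results go out
wu.results[1].state = "valid"
wu.results[2].state = "valid"      # results[0] is the phantom: it never reached a host
wu.tick(t0 + timedelta(days=7))    # phantom times out, one replacement is issued
wu.results[3].state = "valid"
wu.tick(t0 + timedelta(days=8))
print(len(wu.results), wu.purged)  # 4 True
```

In this sketch the phantom result costs the project nothing but a week of delay, which matches the advice to simply ignore it.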

Bruce

Bernd Machenschalk
Joined: 15 Oct 04
Posts: 2,684
Credit: 25,950,161
RAC: 34,820

This is definitely not related.

BM

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

> ^TOP^
>
> Bruce,
>
> The issue is still repeating itself for the hosts listed below. None are not
> behind a firewall and they're networked at the same location. What networking
> details do you need?
>
>
> 7672 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.40GHz Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)
>
> 1397642 395805 25 Feb 2005 11:23:06 UTC 4 Mar 2005 11:23:06 UTC In Progress Unknown New --- --- ---
> 1393899 394912 25 Feb 2005 11:23:00 UTC 4 Mar 2005 11:23:00 UTC In Progress Unknown New --- --- ---
> 1203915 351169 18 Feb 2005 7:28:45 UTC 25 Feb 2005 7:28:45 UTC Over No reply New 0.00 --- ---
>
>
> 7660 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00)
>
> 1440862 405896 27 Feb 2005 14:19:17 UTC 6 Mar 2005 14:19:17 UTC In Progress Unknown New --- --- ---
> 1440850 405893 27 Feb 2005 14:19:17 UTC 6 Mar 2005 14:19:17 UTC In Progress Unknown New --- --- ---
> 1178852 347092 15 Feb 2005 21:57:28 UTC 22 Feb 2005 21:57:28 UTC Over No reply New 0.00 --- ---
> 1177546 346808 16 Feb 2005 4:09:06 UTC 23 Feb 2005 4:09:06 UTC Over No reply New 0.00 --- ---
>
>
> 7668 AuthenticAMD AMD Athlon(tm) Processor Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)
>
> 1254561 362688 20 Feb 2005 21:52:55 UTC 27 Feb 2005 21:52:55 UTC In Progress Unknown New --- --- ---
>
>
> This is not a complete list as some WU's have already been deleted.

Michael, this can happen if your host machine makes contact with the server, but then the connection is broken after the scheduler has taken work out of the database but hasn't sent it to you yet. At that point it can't easily put the work 'back in the database', so it's lost. If the problem continues to replicate itself on these machines, I'll get in touch with you next week to follow up.

Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

> Bruce,
>
> I've seen this problem where WU's are sent from the server but not seen by one
> of my computers twice recently, on two different computers. Both computers are
> running Win2K SP4. So far I have not seen this problem with Linux or XP, but I
> have a very small sample size... I do have a firewall, but no proxy.
>
> Possibly relevant is that Time Warner's road runner service has been having
> some connectivity issues in my area (Durham, North Carolina) recently.
>
> However, it seems to me that the server and the boinc client should have some
> sort of handshake agreeing on the receipt of the WU. That does not appear to
> be the case.
>
> If a packet trace from my side would be useful please let me know and I'll
> setup Ethereal to try to capture an instance.
>
> Here's the most recent WU that shows this issue for computer 47378:
>
> ResultID WorkID Sent
> 1632725 442582 4 Mar 2005 13:23:17 UTC
>
> Here's the relevant log entries from the computer, showing the failed request
> and the followup that succeeded:
>
> --- - 2005-03-04 08:26:05 - May run out of work in 0.01 days; requesting more
> Einstein@Home - 2005-03-04 08:26:05 - Requesting 1725 seconds of work
> Einstein@Home - 2005-03-04 08:26:05 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 08:26:08 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
> Einstein@Home - 2005-03-04 08:26:08 - No schedulers responded
> Einstein@Home - 2005-03-04 08:26:08 - Deferring communication with project for 1 minutes and 0 seconds
> --- - 2005-03-04 08:27:09 - May run out of work in 0.01 days; requesting more
> Einstein@Home - 2005-03-04 08:27:09 - Requesting 1808 seconds of work
> Einstein@Home - 2005-03-04 08:27:09 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 08:27:19 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded

Chris, here are the relevant bits of the scheduler logs:
2005-03-04 13:23:17 [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-03-04 13:23:17 [normal ] Request [HOST#47378] Database [HOST#47378] Request [RPC#51] Database [RPC#50]
2005-03-04 13:23:17 [normal ] Processing request from [USER#12153] [HOST#47378] [IP XX.XX.215.43] [RPC#51] core client version 4.19
2005-03-04 13:23:17 [normal ] [HOST#47378] got request for 1725.305507 seconds of work; available disk 50.000000 GB
2005-03-04 13:23:17 [debug ] [HOST#47378]: has file H1_0976.4
2005-03-04 13:23:17 [debug ] in_send_results_for_file(H1_0976.4, 0) prev_result.id=1632345
2005-03-04 13:23:17 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:23:17 [debug ] Sorted list of URLs follows [host timezone: UTC-18000]
2005-03-04 13:23:17 [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-03-04 13:23:17 [debug ] zone=+3600 url=http://einstein.aei.mpg.de
2005-03-04 13:23:17 [debug ] [HOST#47378] Sending app_version einstein windows_intelx86 479
2005-03-04 13:23:17 [debug ] [HOST#47378] Already has file H1_0976.4
2005-03-04 13:23:17 [debug ] [HOST#47378] reducing disk needed for WU by 14736000 bytes (length of H1_0976.4)
2005-03-04 13:23:17 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:23:17 [normal ] [HOST#47378] Sending [RESULT#1632725 H1_0976.4__0976.5_0.1_T24_Test02_4] (fills 34113.15 seconds)
2005-03-04 13:23:17 [normal ] [HOST#47378] Sent 1 results

Then:

2005-03-04 13:24:21 [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-03-04 13:24:21 [normal ] Request [HOST#47378] Database [HOST#47378] Request [RPC#52] Database [RPC#51]
2005-03-04 13:24:21 [normal ] Processing request from [USER#12153] [HOST#47378] [IP XX.XX.215.43] [RPC#52] core client version 4.19
2005-03-04 13:24:21 [normal ] [HOST#47378] got request for 1807.746221 seconds of work; available disk 50.000000 GB
2005-03-04 13:24:21 [debug ] [HOST#47378]: has file H1_0976.4
2005-03-04 13:24:21 [debug ] in_send_results_for_file(H1_0976.4, 0) prev_result.id=1632725
2005-03-04 13:24:21 [debug ] touched ../locality_scheduling/need_work/H1_0976.4: need work for file H1_0976.4
2005-03-04 13:24:21 [debug ] make_more_work_for_file(H1_0976.4, 0)=0
2005-03-04 13:24:27 [debug ] in_send_results_for_file(H1_0976.4, 1) prev_result.id=1632725
2005-03-04 13:24:27 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:24:27 [debug ] Sorted list of URLs follows [host timezone: UTC-18000]
2005-03-04 13:24:27 [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-03-04 13:24:27 [debug ] zone=+3600 url=http://einstein.aei.mpg.de
2005-03-04 13:24:27 [debug ] [HOST#47378] Sending app_version einstein windows_intelx86 479
2005-03-04 13:24:27 [debug ] [HOST#47378] Already has file H1_0976.4
2005-03-04 13:24:27 [debug ] [HOST#47378] reducing disk needed for WU by 14736000 bytes (length of H1_0976.4)
2005-03-04 13:24:27 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:24:27 [normal ] [HOST#47378] Sending [RESULT#1635259 H1_0976.4__0976.9_0.1_T26_Test02_0] (fills 34113.15 seconds)
2005-03-04 13:24:28 [normal ] [HOST#47378] Sent 1 results

Is this machine part of a Windows network, or some other network? Could you describe your network infrastructure, please?

By the way, it would help in doing this if you could set your computer's clock to the right time. The best way is to set it up to automatically sync with an NTP (Network Time Protocol) server. This would make it easier to compare the logs.
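
To give a rough idea of how far apart the two clocks are, the client and scheduler timestamps quoted above can be lined up. The sketch below assumes that the request for 1725 seconds of work in the client log is RPC#51 in the scheduler log and the request for 1808 seconds is RPC#52, that the host offset is the "UTC-18000" printed by the scheduler, and that network latency is negligible; the function and variable names are made up for the example.

```python
from datetime import datetime, timedelta

# "host timezone: UTC-18000" as printed in the scheduler log
HOST_UTC_OFFSET = timedelta(seconds=-18000)

# (client local time of "Sending request", scheduler UTC time of the matching RPC)
pairs = [
    ("2005-03-04 08:26:05", "2005-03-04 13:23:17"),  # 1725 s request, RPC#51
    ("2005-03-04 08:27:09", "2005-03-04 13:24:21"),  # 1808 s request, RPC#52
]


def clock_skew(client_local, server_utc):
    """Return how far the client clock runs ahead of the server clock."""
    fmt = "%Y-%m-%d %H:%M:%S"
    client_utc = datetime.strptime(client_local, fmt) - HOST_UTC_OFFSET
    return client_utc - datetime.strptime(server_utc, fmt)


skews = [clock_skew(c, s) for c, s in pairs]
for skew in skews:
    print(skew.total_seconds())  # 168.0 for both pairs
```

Under those assumptions the host clock is about 2 minutes 48 seconds ahead of the server, which is why matching log lines do not share a timestamp.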

And no, BOINC does not have a handshake which verifies that WUs are not lost when they are sent. I've thought about trying to add one to the scheduler, but it's too complex to do in the near future.
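
For illustration only, one possible shape for such a handshake is to have the host list the results it actually holds in every request, so that the scheduler can spot anything it believes it sent but the host never received. The sketch below is purely hypothetical: the `Scheduler` class, its `sent` table and the `handle_request` signature are invented here and are not BOINC code. The result names are the two from the scheduler log above. Removing results once they have been reported as finished is left out.

```python
class Scheduler:
    """Hypothetical scheduler that remembers what it thinks each host holds."""

    def __init__(self):
        self.sent = {}  # host_id -> set of result names believed to be on the host

    def handle_request(self, host_id, results_on_host, new_work):
        """Return (resend, fresh): results to send again and newly assigned results."""
        believed = self.sent.setdefault(host_id, set())
        # Anything the server marked as sent but the host does not list was lost in transit.
        resend = sorted(believed - set(results_on_host))
        believed.update(new_work)
        return resend, list(new_work)


sched = Scheduler()
first = "H1_0976.4__0976.5_0.1_T24_Test02_4"
second = "H1_0976.4__0976.9_0.1_T26_Test02_0"

# RPC#51: work is assigned, but suppose the reply never reaches the host.
sched.handle_request(47378, results_on_host=[], new_work=[first])

# RPC#52: the host still reports nothing, so the lost result is offered again.
resend, fresh = sched.handle_request(47378, results_on_host=[], new_work=[second])
print(resend)  # ['H1_0976.4__0976.5_0.1_T24_Test02_4']
print(fresh)   # ['H1_0976.4__0976.9_0.1_T26_Test02_0']
```

Even this toy version shows where the complexity comes from: the scheduler has to keep per-host state and reconcile it on every request.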

Cheers,
Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

> OK--so I've setup Ethereal to watch the packets go by on both clients where
> I've seen this issue. I haven't seen the issue since getting it setup as
> neither one has completed any work.
>
> However, I have had failures when running a manual update of the project--and
> I am speculating that the reason that I am getting failures then is the same
> as why my computers are occasionally missing WU's. The messages ("Scheduler
> RPC...failed", "No schedulers responded") appear to be the same in both
> cases.
>
> Here's the log (time should be right now!) from computer 15933:
>
> Einstein@Home - 2005-03-04 21:19:49 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 21:20:04 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
> Einstein@Home - 2005-03-04 21:20:04 - No schedulers responded
> Einstein@Home - 2005-03-04 21:20:04 - Deferring communication with project for 1 minutes and 0 seconds
> Einstein@Home - 2005-03-04 21:20:22 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 21:20:25 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
>
> Looking at the packet traces I see one difference. The client sends a HTTP
> POST message to the server and the server responds with a HTTP reply
> containing the . When things succeed this packet is 977 bytes long. When
> things fail the exact same packet is sent, except it is 998 bytes long and the
> HTTP header includes this additional field:
>
> Content-Length: 775
>
> This is exactly 21 bytes long (exactly accounts for the difference in length
> of the two packets) and is the only difference between them. After an exchange
> of FIN/ACKs, the exchange halts in the case of the failing connection. When it
> succeeds the client continues with a HTTP GET and things continue from there.
>
> Now I have no idea why this Content-Length field is really the issue or why
> the client should reject the connection after receiving this packet, but I
> have lots of examples of updates that succeed and none of them include this
> field while the two failures I've captured both do.
>
> I also don't know why your server should sometimes send this field and other
> times not. A guess: Is the server a virtual device where one of the real
> servers is behaving slightly differently?
>
> Hope you have a clue. I will attempt to capture a WU failure and see if I see
> the same thing.
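
The byte count in the quoted trace is easy to check: an HTTP header field occupies its text plus the two-byte CRLF line terminator. The snippet below uses only the numbers reported in the trace; the variable names are arbitrary.

```python
# Header field as it appears on the wire: the text plus the CRLF that ends the line.
header_line = b"Content-Length: 775\r\n"

# Packet sizes reported in the trace: without and with the extra header field.
size_ok, size_failed = 977, 998

print(len(header_line))       # 21
print(size_failed - size_ok)  # 21
```

So the extra header line accounts for the whole difference between the two packets, as the trace says.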

Thanks for looking carefully at this. I have no idea why the scheduler/apache is putting Content-Length onto some replies and not onto others. Note that proxies *often* do this sort of thing, but you're not using a proxy or packet filtering software that might do this.

I just got off the phone with Rom Walton and he confirmed that the 4.19 client has a bug in which if the Content-Length is specified (correctly) it will sometimes misbehave because a variable that stores this internally is not properly initialized/reset.
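
To illustrate the general class of bug (state that should be reset for every reply but is not), here is a small sketch. It is a guess at the pattern only and NOT the 4.19 client's code: the `ReplyChecker` class, its `bytes_seen` counter and the `accept` method are all invented for the example.

```python
class ReplyChecker:
    """Hypothetical reply validator; not the BOINC client's real code."""

    def __init__(self, reset_between_replies):
        self.reset_between_replies = reset_between_replies
        self.bytes_seen = 0  # should start from 0 for every reply

    def accept(self, body, content_length=None):
        if self.reset_between_replies:
            self.bytes_seen = 0  # the fix: re-initialize for each reply
        self.bytes_seen += len(body)
        if content_length is None:
            return True  # nothing to compare against, so stale state goes unnoticed
        return self.bytes_seen == content_length


buggy = ReplyChecker(reset_between_replies=False)
fixed = ReplyChecker(reset_between_replies=True)

for checker in (buggy, fixed):
    checker.accept(b"x" * 100)  # an earlier reply without Content-Length

print(buggy.accept(b"y" * 775, content_length=775))  # False: counter still holds 100 stale bytes
print(fixed.accept(b"y" * 775, content_length=775))  # True
```

A bug of this shape would only show up when the header is present and some earlier traffic has left the counter non-zero, which would fit failures that appear only "sometimes".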

Advice for now is to try the 4.25 core client.

Nowhere in the scheduler code is Content-Length written, so if you are getting this it must either be coming from Apache or your network is somehow adding it. I'll look at the Apache config file and see if there's some way to disable adding Content-Length to the reply.

Cheers,
Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

> OK, I've verified that the difference in the two requests/responses for more
> work is just the content-length field.
>
> If you are convinced that your side is not inserting this field then I could
> trace the packets before they hit my firewall. That takes a bit more work and
> means finding an Ethernet hub--I blew up the power supply for mine in Australia
> (thought it was dual power...whoops!). Let me know.

This would be really nice. I've grepped the scheduler code pretty thoroughly and it's NOT putting Content-Length in the reply. So it must be coming from Apache, or elsewhere. If you could verify that sometimes it's there, and sometimes not, when coming into your firewall, this would be helpful in pinning down where it's coming from.

> How stable is the 4.25 client with Einstein? I've seen references to bugs with
> Einstein and earlier beta clients.

I don't know. I'm running 4.24 on an XP test machine without obvious problems, but I simply don't know if 4.25 is reliable or not. In the worst case you can always put 4.19 back!

Cheers,
Bruce

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

Chris,

Nice work!

I've had a long discussion with David Anderson about this. First, the problems provoked by having 'Content-Length' in the reply have been fixed in the core client. But this still leaves the question: where do these come from?

They are not coming from Apache at my end: I've used tcpdump to look at some of the packets. David says that he thinks that your ISP is using some 'hidden proxies' to do filtering before the packets ever get to your network port. He says that depending upon the routing that they use, sometimes they are using one set of filtering and sometimes another.

Personally I am sceptical of this explanation, but can't see any alternative. What do you think?

Cheers,
Bruce
