Whats up with thi`s

Anonymous
Topic 13047

Jan, Walt, here's the view from the scheduler side today...

2005-08-01 09:49:00 [PID=28131] [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-08-01 09:49:01 [PID=28131] [debug ] Request [HOST#103595] Database [HOST#103595] Request [RPC#1412] Database [RPC#1411]
2005-08-01 09:49:01 [PID=28131] [normal ] Processing request from [USER#70690] [HOST#103595] [IP 193.3.142.123] [RPC#1412] core client version 4.45
2005-08-01 09:49:01 [PID=28131] [debug ] [HOST#103595] Resetting nresults_today
2005-08-01 09:49:01 [PID=28131] [normal ] [HOST#103595] [RESULT#6941996 w1_0662.0__0662.4_0.1_T11_S4hA_3] got result
2005-08-01 09:49:01 [PID=28131] [debug ] cpu 26717.562500 cpcs 0.002347, cc 62.712644
2005-08-01 09:49:01 [PID=28131] [debug ] [RESULT#6941996 w1_0662.0__0662.4_0.1_T11_S4hA_3]: client_state 3 exit_status -1073741819; setting outcome ERROR
2005-08-01 09:49:01 [PID=28131] [normal ] [HOST#103595] got request for 8640.000000 seconds of work; available disk 20.007354 GB
2005-08-01 09:49:01 [PID=28131] [debug ] [HOST#103595]: has file w1_0662.0
2005-08-01 09:49:01 [PID=28131] [debug ] send_old_work() no feasible result older than 168.0 hours
2005-08-01 09:49:01 [PID=28131] [debug ] in_send_results_for_file(w1_0662.0, 0) prev_result.id=6941996
2005-08-01 09:49:01 [PID=28131] [debug ] Sorted list of URLs follows [host timezone: UTC+3600]
2005-08-01 09:49:01 [PID=28131] [debug ] zone=+03600 url=http://einstein.aei.mpg.de
2005-08-01 09:49:01 [PID=28131] [debug ] zone=+00000 url=http://einstein.astro.gla.ac.uk
2005-08-01 09:49:01 [PID=28131] [debug ] zone=-18000 url=http://einstein.aset.psu.edu
2005-08-01 09:49:01 [PID=28131] [debug ] zone=-18000 url=http://morel.mit.edu
2005-08-01 09:49:01 [PID=28131] [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-08-01 09:49:01 [PID=28131] [debug ] [HOST#103595] Sending app_version einstein windows_intelx86 479
2005-08-01 09:49:01 [PID=28131] [debug ] [HOST#103595] Already has file w1_0662.0
2005-08-01 09:49:01 [PID=28131] [debug ] [HOST#103595] reducing disk needed for WU by 8350400 bytes (length of w1_0662.0)
2005-08-01 09:49:01 [PID=28131] [debug ] est cpu dur 29038.461538; running_frac 0.251879; rsf 1.000000; est 115287.346458
2005-08-01 09:49:01 [PID=28131] [normal ] [HOST#103595] Sending [RESULT#7065325 w1_0662.0__0662.3_0.1_T13_S4hA_2] (fills 115287.35 seconds)
2005-08-01 09:49:01 [PID=28131] [normal ] [HOST#103595] Sent 1 results [scheduler ran 1 seconds]

2005-08-01 11:06:29 [PID=11044] [debug ] REQUEST_METHOD=POST CONTENT_TYPE=application/octet-stream HTTP_ACCEPT= HTTP_USER_AGENT=BOINC client (windows_intelx86 4.45)
2005-08-01 11:06:29 [PID=11044] [debug ] CONTENT_LENGTH=3802 from 193.3.142.123
2005-08-01 11:06:29 [PID=11044] [normal ] Handling request: IP 193.3.142.123, auth host 103595, platform windows_intelx86, version 4.45
2005-08-01 11:06:29 [PID=11044] [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-08-01 11:06:29 [PID=11044] [debug ] Request [HOST#103595] Database [HOST#103595] Request [RPC#1413] Database [RPC#1412]
2005-08-01 11:06:29 [PID=11044] [normal ] Processing request from [USER#70690] [HOST#103595] [IP 193.3.142.123] [RPC#1413] core client version 4.45
2005-08-01 11:06:29 [PID=11044] [normal ] [HOST#103595] [RESULT#7065325 w1_0662.0__0662.3_0.1_T13_S4hA_2] got result
2005-08-01 11:06:29 [PID=11044] [debug ] cpu 3917.656300 cpcs 0.002347, cc 9.195696
2005-08-01 11:06:29 [PID=11044] [debug ] [RESULT#7065325 w1_0662.0__0662.3_0.1_T13_S4hA_2]: client_state 3 exit_status -1073741819; setting outcome ERROR
2005-08-01 11:06:29 [PID=11044] [normal ] [HOST#103595] got request for 8640.000000 seconds of work; available disk 20.007354 GB
2005-08-01 11:06:29 [PID=11044] [debug ] [HOST#103595]: has file w1_0662.0
2005-08-01 11:06:29 [PID=11044] [debug ] send_old_work() no feasible result older than 168.0 hours
2005-08-01 11:06:29 [PID=11044] [debug ] in_send_results_for_file(w1_0662.0, 0) prev_result.id=7065325
2005-08-01 11:06:29 [PID=11044] [debug ] Sorted list of URLs follows [host timezone: UTC+3600]
2005-08-01 11:06:29 [PID=11044] [debug ] zone=+03600 url=http://einstein.aei.mpg.de
2005-08-01 11:06:29 [PID=11044] [debug ] zone=+00000 url=http://einstein.astro.gla.ac.uk
2005-08-01 11:06:29 [PID=11044] [debug ] zone=-18000 url=http://einstein.aset.psu.edu
2005-08-01 11:06:29 [PID=11044] [debug ] zone=-18000 url=http://morel.mit.edu
2005-08-01 11:06:29 [PID=11044] [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-08-01 11:06:29 [PID=11044] [debug ] [HOST#103595] Sending app_version einstein windows_intelx86 479
2005-08-01 11:06:29 [PID=11044] [debug ] [HOST#103595] Already has file w1_0662.0
2005-08-01 11:06:29 [PID=11044] [debug ] [HOST#103595] reducing disk needed for WU by 8350400 bytes (length of w1_0662.0)
2005-08-01 11:06:29 [PID=11044] [debug ] est cpu dur 29038.461538; running_frac 0.253218; rsf 1.000000; est 114677.714611
2005-08-01 11:06:29 [PID=11044] [normal ] [HOST#103595] Sending [RESULT#7096890 w1_0662.0__0662.4_0.1_T13_S4hA_1] (fills 114677.71 seconds)
2005-08-01 11:06:29 [PID=11044] [normal ] [HOST#103595] Sent 1 results [scheduler ran 0 seconds]

Bruce Allen
Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

Whats up with thi`s

Quote:

Do you have anything from July 27, around 10:20 am UTC? Thats in the middle of the "no scheduler response" messages.

No Walt, but I have stuff from an hour later. I think that the host may be ahead of UTC by one hour.

2005-07-27 11:18:48 [PID=27892] [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-07-27 11:18:48 [PID=27892] [debug ] Request [HOST#103595] Database [HOST#103595] Request [RPC#1408] Database [RPC#48]
2005-07-27 11:18:48 [PID=27892] [normal ] Processing request from [USER#70690] [HOST#103595] [IP 193.3.142.123] [RPC#1408] core client version 4.45
2005-07-27 11:18:48 [PID=27892] [debug ] [HOST#103595] Resetting nresults_today
2005-07-27 11:18:48 [PID=27892] [normal ] [HOST#103595] [RESULT#6673270 w1_0662.0__0662.1_0.1_T08_S4hA_2] got result
2005-07-27 11:18:48 [PID=27892] [debug ] cpu 19641.531300 cpcs 0.002449, cc 48.106584
2005-07-27 11:18:48 [PID=27892] [debug ] [RESULT#6673270 w1_0662.0__0662.1_0.1_T08_S4hA_2]: client_state 5 exit_status -164; setting outcome ERROR

2005-07-27 11:18:53 [PID=27911] [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-07-27 11:18:53 [PID=27911] [debug ] Request [HOST#103595] Database [HOST#103595] Request [RPC#1409] Database [RPC#1408]
2005-07-27 11:18:53 [PID=27911] [normal ] Processing request from [USER#70690] [HOST#103595] [IP 193.3.142.123] [RPC#1409] core client version 4.45
2005-07-27 11:18:53 [PID=27911] [normal ] Not sending work - last RPC too recent: 5.076002
2005-07-27 11:18:53 [PID=27911] [debug ] [HOST#103595] MSG( low) Not sending work - last RPC too recent: 5 sec
2005-07-27 11:18:53 [PID=27911] [normal ] sending delay request 60.600000

2005-07-27 11:19:55 [PID=28170] [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-07-27 11:19:55 [PID=28170] [debug ] Request [HOST#103595] Database [HOST#103595] Request [RPC#1410] Database [RPC#1409]
2005-07-27 11:19:55 [PID=28170] [normal ] Processing request from [USER#70690] [HOST#103595] [IP 193.3.142.123] [RPC#1410] core client version 4.45
2005-07-27 11:19:55 [PID=28170] [normal ] [HOST#103595] got request for 8640.000000 seconds of work; available disk 20.007354 GB
2005-07-27 11:19:55 [PID=28170] [debug ] [HOST#103595]: has file w1_0662.0
2005-07-27 11:19:55 [PID=28170] [debug ] send_old_work() no feasible result older than 168.0 hours
2005-07-27 11:19:55 [PID=28170] [debug ] in_send_results_for_file(w1_0662.0, 0) prev_result.id=6673270
2005-07-27 11:19:55 [PID=28170] [debug ] Sorted list of URLs follows [host timezone: UTC+3600]
2005-07-27 11:19:55 [PID=28170] [debug ] zone=+03600 url=http://einstein.aei.mpg.de
2005-07-27 11:19:55 [PID=28170] [debug ] zone=+00000 url=http://einstein.astro.gla.ac.uk
2005-07-27 11:19:55 [PID=28170] [debug ] zone=-18000 url=http://einstein.aset.psu.edu
2005-07-27 11:19:55 [PID=28170] [debug ] zone=-18000 url=http://morel.mit.edu
2005-07-27 11:19:55 [PID=28170] [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-07-27 11:19:55 [PID=28170] [debug ] [HOST#103595] Sending app_version einstein windows_intelx86 479
2005-07-27 11:19:55 [PID=28170] [debug ] [HOST#103595] Already has file w1_0662.0
2005-07-27 11:19:55 [PID=28170] [debug ] [HOST#103595] reducing disk needed for WU by 8350400 bytes (length of w1_0662.0)
2005-07-27 11:19:55 [PID=28170] [debug ] est cpu dur 28169.642857; running_frac 0.270034; rsf 1.000000; est 104318.874131
2005-07-27 11:19:55 [PID=28170] [normal ] [HOST#103595] Sending [RESULT#6839349 w1_0662.0__0662.2_0.1_T11_S4hA_2] (fills 104318.87 seconds)
2005-07-27 11:19:55 [PID=28170] [normal ] [HOST#103595] Sent 1 results [scheduler ran 0 seconds]

Bruce Allen
Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

RE: The computer summary

Quote:


The computer summary web page doesn't show the timezone, but I did pick up the UTC+1 hr from the server log. These messages show that the BOINC client never got thru to the server. At least not until after he filed a bug report here.

And roughly 30 minutes after the bug report, his BOINC client got thru. Would be nice to know what was wrong earlier. I see too many of these "no scheduler responded" messages when something came back.

Indeed this seems to be a very common problem which I simply don't understand. It's why I added the 'retransmission of lost work' functionality to the scheduler. Something that to me seems very odd is that many of these 'no scheduler responded' messages occur when the server is up and functioning. As far as I can tell from the apache logs and cgi logs, the scheduler gets the connection but then the host at the other end 'goes silent'. Finally after some time (up to a couple of minutes) the connection is dropped.

I suspect that ISPs may be doing very odd things with the http transactions. But I don't know enough to figure this out. The 'retransmission of lost work' is meant to work around (at least some of) these problems.

Bruce

Bruce Allen
Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0

RE: I wonder how many of

Quote:

I wonder how many of these come from hosts connecting thru a proxy. [edit] I know some ISP's put their own proxy servers 'in the middle', maybe these mess things up.

But the real problem is with BOINC, it considers a response without the proper scheduler reply tag the same as no response at all. Its on my list of "things to look into", if I can find some free time that is.
Walt

So if for example the ISP delivered only the first 4kB of an 8kB scheduler reply, the client would simply say 'no scheduler responded' rather than some kind of 'incomplete response' message? This is something that I could ask Rom or David to add and/or fix. With no feedback to users suggesting that something is wrong, there's not much chance of identifying the cause.

Bruce

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.