I'm working to fix these problems now. I think there is a bug in the BOINC scheduling code, where it is incorrectly calculating the 'on-time' and 'active-time' fractions for host machines. I've just stuck in a workaround that doesn't really fix this problem but should at least ensure that your machines get some work. Could you try 'update project' and see if you get some work now? Or just wait until your machine contacts the scheduler again?
Cheers,
Bruce
Message from server: No work available (there was work but your
)
Good, glad to see that my 'work-around' fixed it. The bug is in a part of the BOINC scheduler that I don't understand well enough to fix easily myself. I've written to the person responsible and hopefully this will be fixed 'properly' fairly soon. For the moment I'll leave my 'work-around' in place.
Cheers,
Bruce
> What would be really nice would be if the scheduler used the actual work on
> hand on the client rather than a resource share to determine whether and how
> much work to download. Slower machines that could finish one WU on time for
> one project are now going to be limited to one project with no backup. This
> means that if that project goes down for any reason, those machines will be
> sitting idle.
John,
I've spent quite a bit of time on the scheduler during the past couple of days. It now does this, at least to the degree that is possible with the information that is available to the server.
The basic idea is that if a machine does not have any work to do for Einstein@Home, it will at least get one workunit. After that, it will only get work if the estimated completion time of the work (taking into account any work that the machine is already doing) is before the deadline for that work. This is perhaps not ideal in all cases but hopefully will work fairly well most of the time.
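The rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in this post, not the actual BOINC scheduler code; the function and parameter names are hypothetical.

```python
def should_send_work(has_project_work: bool,
                     queued_seconds: float,
                     new_work_seconds: float,
                     delay_bound: float) -> bool:
    """Decide whether to send one more workunit (hypothetical sketch).

    queued_seconds   -- estimated time to finish work already on the host
    new_work_seconds -- estimated wallclock time for the candidate workunit
    delay_bound      -- deadline for the candidate workunit, in seconds
    """
    # A host with no work for this project always gets one workunit.
    if not has_project_work:
        return True
    # Otherwise the new work, started after the queued work,
    # must be estimated to finish before its deadline.
    return queued_seconds + new_work_seconds < delay_bound
```

With the numbers from the logs later in this thread, `should_send_work(True, 259197.0, 358738.0, 604800.0)` is false, while a host with no work at all gets a workunit regardless of the estimate.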
Bruce
Ziran,
I thought it would be worthwhile for me to explain the logic behind this, not so much for you as for others who may be wondering what's going on behind the scenes. So my comments are interspersed below in your message log.
> --- - 2005-02-16 19:04:35 - Starting BOINC client version 4.19 for
> windows_intelx86
> Einstein@Home - 2005-02-16 19:04:38 - Project prefs: no separate prefs for
> home; using your defaults
> Einstein@Home - 2005-02-16 19:04:39 - Host ID is 5349
> --- - 2005-02-16 19:04:42 - General prefs: from Einstein@Home (last modified
> 2005-02-07 00:21:01)
> --- - 2005-02-16 19:04:42 - General prefs: no separate prefs for home; using
> your defaults
> Einstein@Home - 2005-02-16 19:04:57 - Resuming computation for result
> H1_0081.9__0082.2_0.1_T25_Test02_5 using einstein version 4.75
> --- - 2005-02-16 22:51:34 - May run out of work in 0.01 days; requesting more
> Einstein@Home - 2005-02-16 22:51:34 - Requesting 335 seconds of work
> Einstein@Home - 2005-02-16 22:51:34 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-02-16 22:51:35 - Computation for result
> H1_0081.9__0082.2_0.1_T25_Test02 finished
> Einstein@Home - 2005-02-16 22:51:35 - Started upload of
> H1_0081.9__0082.2_0.1_T25_Test02_5_0
> Einstein@Home - 2005-02-16 22:51:39 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
> Einstein@Home - 2005-02-16 22:51:39 - Message from server: No work available
> (there was work but your computer would not finish it before it is due
Let's see what led to this:
2005-02-16 21:44:20 [normal ] [HOST#5349] got request for 334.990120 seconds of work; available disk 0.707837 GB
2005-02-16 21:44:20 [debug ] est cpu dur 127051.428571; running_frac 0.200000; rsf 1.000000; est 635257.142857
2005-02-16 21:44:20 [debug ] [WU#348347 H1_0249.4__0249.6_0.1_T29_Test02] needs 635257 seconds on [HOST#5349]; delay_bound is 604800 (request.estimated_delay is 747.976118)
2005-02-16 21:44:20 [normal ] [HOST#5349] Sent 0 results
So the scheduler has concluded that your computer is available LESS than 20% of the time (running_frac). This is the product of the fraction of the time the computer is turned on and the fraction of that time during which BOINC is actually running, rather than suspended because you are using the computer for other things. The WU that was being considered is estimated to take 127051 CPU seconds on your machine. Dividing this by 0.2 gives an estimated completion time of 635257 seconds, which is longer than the deadline of one week (604800 seconds).
(Note that 747 seconds is the estimated time to complete the E@H work that was still on your computer.)
So these numbers explain the reply just above. You would not be able to complete the work before the delay bound of one week = 604800 seconds.
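The arithmetic behind that refusal can be reproduced as follows. The formula (CPU time divided by running_frac times rsf) is inferred from the numbers in the debug lines, not taken from the scheduler source, and the function name is hypothetical.

```python
def estimated_wallclock(est_cpu_dur: float, running_frac: float, rsf: float) -> float:
    """Estimated wallclock seconds to finish a workunit (inferred formula)."""
    return est_cpu_dur / (running_frac * rsf)

# Values copied from the debug line for HOST#5349.
est = estimated_wallclock(127051.428571, 0.2, 1.0)

# Delay bound of one week, in seconds.
DELAY_BOUND = 7 * 24 * 3600

print(f"estimated completion: {est:.0f} s")
print(f"delay bound:          {DELAY_BOUND} s")
print("misses deadline:", est > DELAY_BOUND)
```

The estimate comes out at roughly 635257 seconds against a delay bound of 604800 seconds, so the workunit cannot be finished in time.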
> Einstein@Home - 2005-02-16 22:51:39 - Project prefs: no separate prefs for
> home; using your defaults
> Einstein@Home - 2005-02-16 22:51:39 - Got request to delete file: H1_0081.9
> Einstein@Home - 2005-02-16 22:51:39 - No work from project
> Einstein@Home - 2005-02-16 22:51:39 - Deferring communication with project for
> 1 hours, 0 minutes, and 0 seconds
> Einstein@Home - 2005-02-16 22:51:43 - Finished upload of
> H1_0081.9__0082.2_0.1_T25_Test02_5_0
> Einstein@Home - 2005-02-16 22:51:43 - Throughput 423 bytes/sec
Now your computer COMPLETES its E@H work and has no further E@H work to do. So....
> Einstein@Home - 2005-02-16 23:29:44 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
At this point your computer has no further work, and so it sends another request to the scheduler.
> Einstein@Home - 2005-02-16 23:29:48 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
Now the E@H scheduler is programmed (as described above) to completely IGNORE the fraction of time you are on, and your resource share, and always send ONE result IF you have no further E@H work to do.
2005-02-16 22:22:30 [normal ] [HOST#5349] got request for 591.666881 seconds of work; available disk 0.707837 GB
2005-02-16 22:22:30 [debug ] [HOST#5349] Sending app_version einstein windows_intelx86 479
2005-02-16 22:22:30 [debug ] est cpu dur 127051.428571; running_frac 0.200000; rsf 1.000000; est 635257.142857
2005-02-16 22:22:30 [normal ] [HOST#5349] Sending [RESULT#1187443 H1_0953.4__0953.5_0.1_T00_Test02_1] (fills 635257.14 seconds)
2005-02-16 22:22:30 [normal ] [HOST#5349] Sent 1 results
The estimate is that this WU will take 127051 CPU seconds on your machine.
Since your machine is able to run BOINC only 20% of the time, this is estimated to take 635257 seconds (more than 1 week!).
The basic conclusion is this: if you want to get more work from the scheduler, you need to either leave your computer turned on for longer, or alternatively you need to ensure that BOINC spends more of its time running and less of its time waiting for the computer to be "Free".
I hope this helps to explain what's going on at the server end!
Cheers,
Bruce
> Perhaps some actual data would help.
>
> Here is the share from one of my computers:
>
>
>
> Now this is not the fastest computer in the world but it has scheduled
> completions in 19 hours. Asking for three days worth, and providing a 97%
> resource share, it seems to me, should result in my having at the very
> minimum, three work units in my queue - I now have ONE.
James, I thought I'd provide a detailed reply to give you and others a snapshot of how this works from the server side. I hope this is helpful.
The key point is that 19 hours is the CPU time to complete the result, NOT the wallclock time. Apparently, according to the core client, your computer is able to run BOINC jobs less than 20% of the time. Hence the 19 hours of CPU time is more like 100 hours of wallclock time.
2005-02-16 14:44:45 [debug ] est cpu dur 69433.173333; running_frac 0.200000; rsf 0.967742; est 358738.038306
The key number is the 0.200000 (running_frac), which is the product of the fraction of time that your computer is turned on and the fraction of that time that it is running BOINC. If this product is less than 0.20, it is set to 0.20 by the scheduler. The estimate is also divided by rsf (0.967742 here, which matches your 97% resource share). Hence although the job is estimated to take 69433 seconds of CPU time on your machine, the estimated time to completion is 69433 / (0.20 * 0.967742) = 358738 seconds, which is around four days.
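The 0.20 floor on running_frac can be illustrated like this. It is a sketch of the behaviour described above, with hypothetical names (`on_frac`, `active_frac`) for the two fractions; the real scheduler code may differ.

```python
MIN_RUNNING_FRAC = 0.20  # floor applied by the scheduler, per the explanation above

def running_frac(on_frac: float, active_frac: float) -> float:
    """Fraction of wallclock time the host is assumed to spend on BOINC work.

    on_frac     -- fraction of time the computer is turned on
    active_frac -- fraction of that time BOINC is actually running
    """
    return max(MIN_RUNNING_FRAC, on_frac * active_frac)

# A host that is on half the time and runs BOINC 30% of that time
# has a product of 0.15, which is raised to the 0.20 floor.
print(running_frac(0.5, 0.3))

# A host that is nearly always on and active keeps its real value.
print(running_frac(0.9, 0.8))
```

The floor means that a logged running_frac of exactly 0.200000 tells you only that the real product is 0.20 or lower.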
2005-02-16 14:44:45 [debug ] [WU#348347 H1_0249.4__0249.6_0.1_T29_Test02] needs 358738 seconds on [HOST#10548]; delay_bound is 604800 (request.estimated_delay is 259197.491119)
The work currently on your machine is estimated to take 259197 seconds (around 3 days) to complete. Adding the 358738 seconds for the new work gives 617935 seconds, so it does not make sense to send new work NOW: it would not finish by the deadline of 1 week (604800 seconds).
2005-02-16 14:44:45 [normal ] [HOST#10548] Sent 0 results
2005-02-16 14:44:45 [normal ] sending delay request 51839
The delay request is 1/5 of the time before your machine will need new work. This is intended to ensure that if your machine works faster than expected, it will get plenty of new work.
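As a quick check of the 1/5 rule against the logged values, here is a minimal sketch. The function name is hypothetical, and truncating to whole seconds is an assumption; rounding gives the same result for these numbers.

```python
def delay_request(estimated_delay: float) -> int:
    """Seconds the client is asked to wait before contacting the scheduler again.

    Assumed here to be one fifth of the estimated time until the host
    needs new work, truncated to whole seconds.
    """
    return int(estimated_delay / 5)

# request.estimated_delay taken from the debug line for HOST#10548.
print(delay_request(259197.491119))
```

This reproduces the 51839 seconds shown in the "sending delay request" line.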
Bottom line: to get more work, leave your computer on more, and make sure that BOINC runs a larger fraction of the time.
Cheers,
Bruce