Ghost WU and resending lost results

Bruce Allen
Joined: 15 Oct 04
Posts: 958
Credit: 170,849,008
RAC: 0
Topic 13051

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.
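As a rough illustration of the idea (not the actual scheduler code, which is part of the BOINC server), a lost result is one that the server lists as in progress on a host but that the client does not report having. The function name and the second result name below are made up for the example:

```python
def find_lost_results(server_in_progress, client_reported):
    """Return results the server believes are on the host but the client did not report.

    Both arguments are lists of result names. This is a hypothetical helper,
    not a function from the BOINC code base.
    """
    reported = set(client_reported)
    # Preserve the server's ordering so messages come out in a stable order.
    return [name for name in server_in_progress if name not in reported]


# "hypothetical_result_B" is an invented name used only for this example.
lost = find_lost_results(
    ["w1_0399.5__0399.6_0.1_T09_S4hA_0", "hypothetical_result_B"],
    ["hypothetical_result_B"],
)
for name in lost:
    print(f"Resent lost result {name}")
```

In this sketch every missing result is resent, regardless of how close it is to its deadline, which matches the behaviour described above.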

Please report good and/or bad experiences with this feature in this thread.

Bruce

Bruce Allen

Quote:
Quote:

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce

Bruce, it works real well (thanks!) but for one thing. Maybe an unintentional side effect based on what you wrote - when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.

Here's a line from my results list:

Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New

After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New

[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.

Walt

Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!

Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.

[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??

Bruce Allen

Quote:
Quote:
Perhaps I'm better off just to abort them?

My heart was bleeding when I saw the pile of WUs ... I aborted all that already have quorum and validated.

The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.

Thank you for this post and the previous one as well. I hadn't realized that when merging hosts, the new 'child' host would get any work that had been sent to the 'parent' hosts, and which was not on the child host.

I intend to watch this thread and 'tweak' the behavior of this re-send mechanism over the coming days. [For example, if the result which would be re-sent is already close to the deadline, I could mark it as an error and generate a new result instead (which would go to some other host).] But I would like to keep this mechanism as simple as possible for the moment, so for now I just plan to 'watch and wait'.

If you have suggestions about changes or refinements to this mechanism, please post them here.

Bruce

Bruce Allen

I've made an additional change as Walt and I discussed.

For results that are re-sent, the REPORT DEADLINE is left unchanged. However, I update the SENT TIME when the result is re-sent. Thus if

(REPORT_DEADLINE-SENT_TIME) is less than 7 days

it means that the work was resent one or more times.
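A minimal sketch of that check, assuming the two database fields are available as Python datetime values and that the normal deadline window is 7 days. The function name is made up, and the timestamps are sample values only:

```python
from datetime import datetime, timedelta

# Assumed normal window between sending a result and its deadline.
NOMINAL_WINDOW = timedelta(days=7)


def was_resent(sent_time, report_deadline, window=NOMINAL_WINDOW):
    """True if the gap between sent time and deadline is shorter than the normal window.

    With the deadline left unchanged and the sent time updated on every resend,
    a shorter-than-normal gap indicates at least one resend.
    """
    return report_deadline - sent_time < window


deadline = datetime(2005, 8, 4, 20, 14, 59)
original_sent = datetime(2005, 7, 28, 20, 14, 59)  # exactly 7 days before the deadline
updated_sent = datetime(2005, 7, 28, 20, 22, 29)   # sent time updated by a resend

print(was_resent(original_sent, deadline))
print(was_resent(updated_sent, deadline))
```

The first call compares a gap of exactly 7 days, so it does not count as resent; the second has a gap 7.5 minutes short of 7 days, so it does.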

Bruce Allen

Quote:

I haven't had any resent to me (I think) but from looking at the posts here and the change notes I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Bruce Allen

Quote:
Quote:
Quote:

I haven't had any resent to me (I think) but from looking at the posts here and the change notes I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the deadline, provided the deadline is not reset when the WU is resent.

Agreed.

Bruce Allen

Quote:
I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?

I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??

Cheers,
Bruce

Bruce Allen

Quote:
The deduction of a resend from the pending work makes sense. I'd also like to see the deadline recalculated from TODAY, for another 7 days.


This would have a bad consequence. A host which had a proxy problem and never received a work unit, but which kept contacting the scheduler, would cause that workunit to never finish.
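To make the problem concrete, here is a toy simulation, assuming a host that contacts the scheduler once a day but never actually receives the work. The function name, the start date, and the once-a-day contact pattern are assumptions for illustration only:

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)


def days_until_timeout(reset_deadline_on_resend, max_days=60):
    """Return the day on which the result times out, or None if it never does
    within max_days. The host contacts the scheduler once per day and the
    result is resent on every contact because it is always missing.
    """
    start = datetime(2005, 7, 28)
    deadline = start + WEEK
    for day in range(1, max_days + 1):
        now = start + timedelta(days=day)
        if now > deadline:
            return day  # deadline has passed, the result times out
        if reset_deadline_on_resend:
            # Recalculating the deadline from "today" pushes it out again.
            deadline = now + WEEK
    return None


print(days_until_timeout(reset_deadline_on_resend=False))
print(days_until_timeout(reset_deadline_on_resend=True))
```

With the deadline left alone, the result times out once the original week has passed. With the deadline recalculated on every resend, it is always a week away and the simulated result never times out.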

Quote:

Lastly, I'd like it not to resend them if the client is already at capacity. Just wait on them until there's room, or mark them out as errored and resend to another client if need be.


I don't know how to make this determination.

However I have just made the following changes. IF
- Work has less than 25% of its deadline window remaining (42 hours of the 7 days for Einstein@Home), OR
- Work no longer needed (Canonical result already exists), OR
- Work unit has error flag set (something wrong), THEN
the scheduler no longer resends the workunit, but instead marks it as timed out in the database. The scheduler will then send an informational message to the client reporting that this WU has been 'expired'.
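The three conditions above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function and parameter names are invented, the deadline window is taken to be 7 days, and "within 25% of deadline" is read as less than 25% of that window remaining:

```python
from datetime import datetime, timedelta


def should_expire(now, report_deadline, canonical_exists, error_flag,
                  window=timedelta(days=7), fraction=0.25):
    """Return True if a lost result should be marked as timed out instead of resent.

    All names here are hypothetical; they do not come from the BOINC scheduler.
    """
    time_left = report_deadline - now
    # 0.25 * 7 days = 42 hours for the assumed Einstein@Home window.
    near_deadline = time_left < fraction * window
    return near_deadline or canonical_exists or error_flag


deadline = datetime(2005, 8, 4, 20, 14, 59)

# 41 hours left: too close to the deadline, so expire rather than resend.
print(should_expire(deadline - timedelta(hours=41), deadline, False, False))
# 43 hours left and nothing else wrong: resend as before.
print(should_expire(deadline - timedelta(hours=43), deadline, False, False))
```

A result with plenty of time left is still expired if a canonical result already exists or the work unit has its error flag set.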

I'll test this over the next few hours, and see if it has undesirable side effects.

Bruce

Bruce Allen

Based on the feedback in this forum, I've made some additional modifications to the scheduler policy on resending lost workunits. Details may be found here:
deadline_proposal.txt. This extends the deadlines (up to a total of an additional week) for machines that did not get the work when it was originally sent.

Bruce
