> @Bernd or any def people
> here got a longrider.
T(om?)hanks!
> Still interested about the boinc folder?
Thanks for offering, but I think we have found the problem. The WUs causing this kind of trouble seem to be the ones analyzing the frequency range around 60 Hz (you can currently tell from the name). We are working to get this problem out of the way.
If you don't mind, keep the archive for a while in case we need it. I think it is of no use to us right now, but it may come in handy in the future.
Thanks a lot for your help!
BM
> Is this related to my CPU time resetting back about an hour whenever einstein
> was paused and resumed by BOINC? Is there maybe no checkpoint built into the
> 3rd stage of analysis since it was expected to be so short? I have my
> settings set to switch between projects every 60 minutes and to remove them
> from memory when doing so.
Your analysis of the problem is correct. There is no checkpoint in the third stage of processing because it is supposed to take only a few seconds.
Please see the front page news item about this.
Bruce
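The restart loop confirmed above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the project's code: the function name `cpu_minutes_until_done` and the 180-minute stage length are made up for illustration, while the 60-minute switch interval and the "remove from memory" setting come from the quoted post.

```python
def cpu_minutes_until_done(stage_minutes, switch_minutes, checkpointing,
                           max_slices=1000):
    """Return the CPU minutes spent on one stage, or None if it never finishes.

    Assumed model: the app runs for switch_minutes, is then removed from
    memory, and on resume continues from its last checkpoint. Without
    checkpoints it resumes from the start of the stage.
    """
    progress = 0.0  # minutes of stage work that survive a suspend
    spent = 0.0     # total CPU minutes consumed so far
    for _ in range(max_slices):
        remaining = stage_minutes - progress
        if remaining <= switch_minutes:
            # The stage fits into this time slice and completes.
            return spent + remaining
        spent += switch_minutes
        if checkpointing:
            progress += switch_minutes  # work is saved before the suspend
        # Without checkpointing, progress stays at zero.
    return None  # gave up: the stage restarts forever


# A stage of a few seconds finishes even without a checkpoint.
print(cpu_minutes_until_done(0.1, 60, checkpointing=False))
# A hypothetical 180-minute stage never finishes with 60-minute switching.
print(cpu_minutes_until_done(180, 60, checkpointing=False))
# With checkpoints the same stage would finish after 180 CPU minutes.
print(cpu_minutes_until_done(180, 60, checkpointing=True))
```

In this model the stage only completes without checkpoints if it fits into a single time slice, which matches the advice elsewhere in the thread to lengthen the swap time or keep the app in memory.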
> And it's a pleasure to help in a modest way
This was a big help. The problem has been isolated and we're working on a fix. Please see the front page news item about this.
Bruce
> In case this helps anyone:
>
> The problem appears to be in the data, not in the code. The WU will eventually
> finish, but it may take quite some time (even more than we expect in the max
> CPU time value and exceeding the deadline), and maybe also more memory than we
> expected (possibly causing more problems). We'll try to avoid such WUs in the
> future.
I want to say this a bit differently.
The problem is in our code. For certain data sets, it is not as efficient as it could or should be. So we are in the process of fixing the code to make it efficient in all cases.
Unfortunately we can't identify the 'troublesome' data sets or cases without actually analyzing the data! So we can't easily avoid such WUs. Instead we need to make the code work efficiently with any of our input data sets.
Cheers,
Bruce
> Another never-ending story in "H1_0059.9__0060.0_0.1_T17_Test02". If you can
> delete this workunit, the others will thank you.
For what it's worth, the problem occurs when analyzing the band of data containing 60 Hz (power mains frequency in the USA).
I'll talk with others in our team about cancelling these WUs. I am not doing it right away because looking at the results coming back (they are slow but do complete) may help us to fix this.
Cheers,
Bruce
> 0060.0_ So are you saying if I look at this part of the WU ID that there could
> be a potential problem with the WU taking too much time to finish ... ???
Actually it's the __0060.0 (TWO underscores) that's the clue. I wouldn't be surprised if the __0059.9 (TWO underscores) workunits also show this behavior.
Bruce
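For anyone who wants to scan their task list for the double-underscore clue, here is a minimal sketch. It assumes the number right after the two underscores is the frequency in Hz; the thread only confirms `__0060.0`, with `__0059.9` merely suspected. The function names and the second example name are hypothetical.

```python
import re

# Assumption: the number directly after the double underscore is the
# frequency (in Hz) that the workunit analyzes.
_FREQ_AFTER_DOUBLE_UNDERSCORE = re.compile(r"__(\d+\.\d+)")

# 60.0 is named in the thread as the clue; 59.9 is only suspected.
SUSPECT_FREQS_HZ = {59.9, 60.0}


def frequency_after_double_underscore(wu_name):
    """Return the number following '__' as a float, or None if absent."""
    match = _FREQ_AFTER_DOUBLE_UNDERSCORE.search(wu_name)
    if match is None:
        return None
    return float(match.group(1))


def looks_troublesome(wu_name):
    """True if the workunit name carries one of the suspect frequencies."""
    freq = frequency_after_double_underscore(wu_name)
    return freq is not None and freq in SUSPECT_FREQS_HZ


# The name reported in this thread is flagged.
print(looks_troublesome("H1_0059.9__0060.0_0.1_T17_Test02"))
# A made-up name in a different band is not.
print(looks_troublesome("H1_0250.0__0250.3_0.1_T05_Test02"))
```

A single underscore before `0060.0` would not match, which is the distinction made above.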
> Is there a transmission at 74Hz as well? ;)
Well, IMHO everything below 100Hz may cause problems with the current data sets.
We are working on new apps and different data pre-processing to solve this problem.
BM
Just to give you an update on this issue:
1. We have identified the problem and are working on it. A new set of apps fixing this should be available in the next few days.
2. We still appreciate the uploaded process directories (slots). They will help us test these new apps before releasing them.
3. If you see such a "never ending result" apparently staying at 100% for hours, please do the following:
- report it to us, preferably in this thread
- if you can, zip (or tar.gz) the appropriate slots directory (and maybe the projects/einstein directory, too, but that's not so important) and make it available to us
- if you are only running E@H, just be patient - the WU will spend quite some time at the 100% mark, but will eventually finish
- if you are swapping between different projects and have set the app to be removed from memory when suspended, the Result will probably never finish. The reason is that the app does not write any checkpoints during this last stage, so it always gets suspended before completing it and thus starts at the beginning of this phase over and over again.
If you have an (experimental) client that allows suspending individual projects (and aborting individual results), suspend all projects except E@H until this Result is finished. You may also abort this individual result if you don't want to affect the other projects, but you will then lose the CPU time spent on it.
If you are using a stock client (4.1x) without this possibility, well, there may be no other way than resetting the E@H project to get this Result out of the way, losing the CPU time you spent on it (and causing another 11MB data file download).
It may help to modify your preferences for a longer swap time and/or to keep the app in memory, but I'm not sure if it helps once the app is in this stage.
Thanks a lot for your help!
BM
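If you have no zip or tar tool at hand, packing a slots directory as tar.gz can be sketched with Python's standard library. This is just one possible way, not an official procedure; the helper name `archive_slot` and the paths in the usage comment are hypothetical, so adjust them to your own BOINC directory.

```python
import tarfile
from pathlib import Path


def archive_slot(slot_dir, out_path):
    """Pack one slots directory into a .tar.gz file and return its path."""
    slot = Path(slot_dir)
    if not slot.is_dir():
        raise NotADirectoryError(f"not a directory: {slot}")
    with tarfile.open(str(out_path), "w:gz") as tar:
        # arcname keeps only the directory name, not the full local path.
        tar.add(str(slot), arcname=slot.name)
    return Path(out_path)


# Hypothetical usage, run from inside the BOINC directory:
# archive_slot("slots/0", "einstein_slot0.tar.gz")
```

It is probably best to suspend the client while archiving so the files do not change halfway through.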
ric, Blizzard,
I found the uploads from Rebirther (to test our new apps with them), but nothing from ric or others - have they already been deleted?
BM