We still have serious problems with our database and are working on it. For the moment we disabled the scheduler (which was getting mostly DB connection errors anyway) to let the rest of the daemons catch up.
The ABP1 validator and assimilator are running; however, they will never again run on a machine the server status page has direct access to, so I removed their status indicators from that page.
Also, due to the DB connection problems, the server status page isn't being updated and still shows an old status from 10:37 UTC.
BM

"Project is down" for 19 hours now
OK, the daemons have worked through their backlogs and the DB looks responsive again. I have started the scheduler; let's see how things go.
BM
RE: I haven't seen any
Surely a number of things added up. We haven't spent much time investigating this yet, but we surely will.
Under heavy load our old DB server finally crashed, and we couldn't bring it back online for more than a last dump (which itself took several hours). So we put together the hardware we could find into a new, larger server (32 GB RAM) that is running now. While the additional memory surely helped, the performance of this new DB is still poor. We are looking into this and have already solved some issues. For now the project has been stopped again to run some offline analysis, indexing and other optimization on the database.
We'll continue to fix the problems as they occur. So far we have had all kinds of things, ranging from hardware failures to bugs in software not written by us (the last one fixed just minutes ago), all within just a few days.
I strongly intend to have the project up and running later today.
BM
Status:
Good:
Database is running more smoothly than ever, daemons (transitioner, assimilator, validator) are working through their backlogs, scheduler seems to send out work.
Bad:
Everything related to ABP1 is currently offline, because for some reason the connection to Hannover isn't stable. For some other reason the S5R5 workunit generator is running too slowly to keep up with work requests, so the project might run out of work some time over the weekend unless someone fixes these issues.
But after some 12 hours of hard work Oliver and I are too tired to take care of this; we would possibly do more harm than good. We need to get some sleep.
BM
Update:
* We found a workaround for the problems with the S5R5 WUG (WU Generator), so we probably won't run out of work in the near future.
* We also found a workaround to run the ABP1 validator and assimilator; the ABP1 WUG is still causing some headaches.
* We needed a couple of workarounds to bring the project back online, which will probably require a scheduled downtime for a few hours some time next month to remove them. But for now we're getting back on track.
BM
I just turned the dumps back on.
BM