coredump occurs around 4 o'clock on master #13
Comments
What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????
Yes, 4:00 AM local server time is generally when Cronicle runs its daily maintenance (see maintenance to configure when this happens), where it prunes lists that have grown too big, especially the There's only one thing I can suggest at this time. Please edit this file on all your Cronicle servers:
Locate this line inside the file:
And change it to this:
Basically, we're adding a Node.js command-line argument there. Then restart Cronicle on all your servers. This should allow Node.js to use up to 4 GB of RAM (the default is 1 GB). If this doesn't work, then I have no idea how to help you. You may simply be pushing this system far beyond where it was designed to go. Good luck! - Joe
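(A minimal sketch of what this change looks like, since the exact file and line were not preserved in this thread; the install path and launch command are assumptions.)

```sh
# Sketch only: the relevant Node.js flag is --max-old-space-size (value in MB).
# It goes on the node command line that launches the Cronicle daemon, e.g.:
#   node --max-old-space-size=4096 <cronicle launch script>
# After editing, restart Cronicle on every server (default install path assumed):
/opt/cronicle/bin/control.sh restart
```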
Oh my... So if 1w = 10k (10,000), and you're doing 4w per day, then that's 40,000 jobs per day??? That's extraordinarily excessive. My jaw is on the floor right now. Unfortunately this is way, way more than the system was ever designed to handle. That's very likely why you're running out of memory. Okay, so at this sheer magnitude of jobs, the standard listSplice() call, which is used to prune the lists during the daily maintenance, isn't going to work for you. It loads many pages into memory. That's simply too many items for it to splice at once. You're better off just clearing the lists every night. You'll lose the history of the completed items and activity, but I can't think of anything else, without redesigning the system somehow. You can try sticking these commands in a shell script, activated by a crontab (or Cronicle I guess). Make sure you only run this on your Cronicle master server (not all the servers).
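The commands themselves were lost from this comment; based on the storage-cli.js usage later in the thread, the script was presumably along these lines (logs/completed is confirmed below; logs/activity is my guess for the second list):

```sh
#!/bin/sh
# Hypothetical nightly cleanup script (run on the master server only).
# logs/completed is confirmed later in this thread; logs/activity is assumed.
/opt/cronicle/bin/storage-cli.js list_delete logs/completed
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
```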
These are the two largest lists, and probably the ones causing all the problems for you. If you delete these every night before the maintenance runs (maybe 3 AM?) it might work around your problem. I'm worried about race conditions, however, because the CLI script will be "fighting" with Cronicle to read/write to these same lists as they are being deleted, especially if you are launching a job every 2 seconds (~40K per day). We may run into data corruption here -- but I don't know what else to try. What you may have to do is stop Cronicle, run the commands, then start it again. Like this:
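Something along these lines (the exact commands were not preserved; the control.sh path is assumed from the default install, and the list names are as in the sketch above):

```sh
# Hypothetical stop / delete / start sequence, run on the master only.
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/storage-cli.js list_delete logs/completed
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
/opt/cronicle/bin/control.sh start
```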
Again, please note that this should only be executed on the Cronicle master server (which is the only server that should ever "write" to the data store). Good luck, and sorry. Please understand that Cronicle is not designed for this level of scale (not even close), and is still in pre-release beta (not yet even v1.0). I've never run more than 500 jobs in a day before. You're doing 80 times that amount. - Joe
Thank you for your advice. We have a dozen events running every minute, so ... hah~ When we execute "/opt/cronicle/bin/storage-cli.js list_delete logs/complete", we get an exception:
and the "get" command's output looks like this:
Oh, I see: it's logs/completed, not logs/complete. Stopping all masters and clearing the lists is a bit unfriendly for production users, especially in master/backup mode, where they have to stop every server in the master group to ensure no write activity is in progress.
@iambocai I do apologize, that was a typo on my part. It is indeed logs/completed. At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results), and if that doesn't work, try the nightly crontab delete. I understand that stopping and starting the service is not ideal. You can skip this part and just do the delete "hot" (while the service is running). We're most likely already in a situation where your data is a bit corrupted, because we've core dumped at least once in the middle of a maintenance run. So at this point I'd say, if the Node.js memory increase doesn't work, put in a crontab that runs this every night at 3 AM on your master server:
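The crontab entry itself was not preserved; presumably something like this (list names as assumed above):

```sh
# Hypothetical root crontab entries: hot-delete the big lists at 3 AM,
# one hour before the 4 AM maintenance window.
0 3 * * * /opt/cronicle/bin/storage-cli.js list_delete logs/completed
0 3 * * * /opt/cronicle/bin/storage-cli.js list_delete logs/activity
```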
Also, please allow me to make another recommendation. Since you are running 40K jobs per day, you should probably greatly decrease the job_data_expire_days config setting from 180 days down to maybe 7 days. This is because we're pruning the main lists down to 10,000 items every night, so we're essentially losing the history of all your jobs after one day. So there's no need to keep the job logs around for much longer. However, you can still drill down into the job history for individual events, which are stored as different lists, so a few days is probably still good. I am working on a new version of Cronicle which will attempt to prune lists without blowing out memory. I'll let you know when this is finished and released.
Hey @iambocai, I just released Cronicle v0.6.9, which should fix this bug, if it is what I think it is. The nightly maintenance should now use far less memory when chopping old items off huge lists. https://github.com/jhuckaby/Cronicle/releases/tag/v0.6.9 Hope this helps.
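For reference, a hedged sketch of that change (the config path is assumed from the default install):

```sh
# Assumed location of the Cronicle config. Edit it and change:
#   "job_data_expire_days": 180
# to:
#   "job_data_expire_days": 7
vi /opt/cronicle/conf/config.json
# Then restart so the shorter expiration applies to newly completed jobs:
/opt/cronicle/bin/control.sh restart
```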
Oh wow. That is insane. Okay, the situation has grown beyond my ability to "fix" it, so we have to export all your critical data, delete your entire data directory, and then restore the backup into a fresh one. Here is what the backup will include, meaning all the following data will be preserved and restored:
Here is what will be lost completely (no choice really):
I am sorry about this, but keep in mind that Cronicle is still in pre-release (v0.6) and is just not designed for your immense scale, or for repairing corrupted data on disk. I will work to make this system more robust before I ever release v1.0. Here are the steps. Please do this on your master server only (and make sure you don't have any secondary backup servers that will take over as soon as the master goes down). Make sure you are the root user (i.e. superuser). FYI, this is documented here: Data Import and Export
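A sketch of the export step following those docs (the backup file path is just an example):

```sh
# Run on the master only, as root. Stop the service, then export the
# critical data (schedule, users, server groups, etc.) to a backup file.
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/control.sh export /tmp/cronicle-backup.txt --verbose
```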
You should now have a backup of all your critical data in the backup file you specified.
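The move step itself was omitted above; presumably something like this (the data directory path is assumed from the default Filesystem storage config, so adjust it to your own Storage configuration):

```sh
# Set the old (possibly corrupted) data directory aside and start fresh.
mv /opt/cronicle/data /opt/cronicle/data.old
mkdir /opt/cronicle/data
```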
So this has moved the old data directory out of the way. If you are not sure which version you are running, now is a good time to make sure that Cronicle is on the latest version (i.e. v0.6.9), which has the nightly memory fix in it.
Now let's restore your backup into the fresh new data directory:
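A sketch of the restore (same example backup path as above):

```sh
# Import the backup into the fresh data directory, then start Cronicle back up.
/opt/cronicle/bin/control.sh import /tmp/cronicle-backup.txt
/opt/cronicle/bin/control.sh start
```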
That should be it. Please let me know if any step fails or emits errors. I am hoping and praying that all the critical data contained in the backup doesn't have any corruption. With any luck the damaged records are all within the huge completed logs, which we are not bringing over. Fingers crossed. - Joe
Yeah, it works! About 49 GB of history data has gone with the wind. :D
Whew, that's a relief! 😌 Let's hope the new code in v0.6.9 keeps it under control from now on. It should prune all the lists down to 10,000 max items (configurable) every night. Thanks for the issue report, and sorry about the core dumps and corrupted log data.
That's odd, it's supposed to automatically delete those. However, please note that changing job_data_expire_days only affects newly completed jobs; it won't retroactively clean up data already on disk. If this directory keeps growing out of control, you can put in a cronjob to delete old files from it, at least until I can find the bug: find /home/homework/data/jobs -type f -mtime +3 -exec rm {} \; Unfortunately your job volume is so extreme that I don't know of any feasible way to examine the nightly maintenance logs on your server. It would just be noise on top of all the other jobs your servers are running. However, when I have some time I will try to recreate this issue on a test server.
We have 5 servers in our Cronicle cluster:
and a total of 308w+ completed jobs so far (maybe 4w+ per day):
Over the last few days, the master server has dumped core around 4:00 AM every day, but no crash.log file appears. We are very worried about this situation.
From the core stack, the coredump appears to be caused by OOM (out of memory):
From Storage.log, it seems that a lot of "completed log" items are being loaded into memory at that time (maybe during the maintenance operation?).
If there is any other information I can provide, please feel free to reply to me.
Thank you!