coredump occurs around 4 o'clock on master #13

Closed

iambocai opened this issue Feb 17, 2017 · 13 comments
iambocai commented Feb 17, 2017

We have 5 servers in our Cronicle cluster:
[screenshot]

and 308w+ completed jobs in total so far (maybe 4w+ per day):
[screenshot]

For the last few days, the master server has dumped core around 4:00 AM every day, but no crash.log file appears. We are very worried about this situation.

From the core stack, the coredump is caused by an OOM:

(gdb) bt
#0 0x00007f4a300325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f4a30033dc5 in abort () at abort.c:92
#2 0x000000000125f101 in node::Abort() ()
#3 0x000000000125f13c in node::OnFatalError(char const*, char const*) ()
#4 0x0000000000a5f412 in v8::Utils::ReportOOMFailure(char const*, bool) ()
#5 0x0000000000a5f61b in v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) ()
#6 0x0000000000e1a9a1 in v8::internal::Factory::NewFixedArray(int, v8::internal::PretenureFlag) ()
#7 0x0000000000f8f481 in v8::internal::HashTable<v8::internal::StringTable, v8::internal::StringTableShape, v8::internal::HashTableKey*>::New(v8::internal::Isolate*, int, v8::internal::MinimumCapacity, v8::internal::PretenureFlag) ()
#8 0x0000000000f902f9 in v8::internal::HashTable<v8::internal::StringTable, v8::internal::StringTableShape, v8::internal::HashTableKey*>::EnsureCapacity(v8::internal::Handle<v8::internal::StringTable>, int, v8::internal::HashTableKey*, v8::internal::PretenureFlag) ()
#9 0x0000000000f90880 in v8::internal::StringTable::LookupString(v8::internal::Isolate*, v8::internal::Handle<v8::internal::String>) ()
#10 0x00000000010b1aaf in v8::internal::Runtime_ObjectHasOwnProperty(int, v8::internal::Object**, v8::internal::Isolate*) ()
#11 0x00000dad3c3063a7 in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb) q

From Storage.log, it seems that a lot of "completed log" list pages were being loaded into memory at that time (maybe as part of the maintenance operation?):

[1487276212.593][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13280][]
[1487276212.595][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13279][]
[1487276212.596][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13278][]
[1487276212.598][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13277][]
[1487276212.599][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13276][]
[1487276212.601][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13275][]
[1487276212.602][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13274][]
[1487276212.604][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13273][]
[1487276212.606][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13272][]
[1487276212.607][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13271][]
[1487276212.609][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13270][]
[1487276212.61][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13269][]
[1487276212.612][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13268][]
[1487276212.614][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13267][]
[1487276212.617][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13266][]
[1487276212.618][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13265][]
[1487276212.62][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13264][]
[1487276212.622][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13263][]
[1487276212.624][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13262][]
[1487276212.626][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13261][]
[1487276212.628][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13260][]
[1487276212.629][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13259][]
[1487276212.631][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13258][]
[1487276212.634][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13257][]
[1487276212.635][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13256][]
[1487276212.637][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13255][]
[1487276212.638][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13254][]
[1487276212.64][2017-02-17 04:16:52][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13253][]
[1487276220.862][2017-02-17 04:17:00][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13252][]
[1487276220.872][2017-02-17 04:17:00][CronicleMaster][Storage][debug][9][Fetching 0 items at position 0 from list: global/schedule][]
[1487276220.872][2017-02-17 04:17:00][CronicleMaster][Storage][debug][9][Loading list: global/schedule][]
[1487276229.266][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.27][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-19][]
[1487276229.271][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-18][]
[1487276229.271][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-17][]
[1487276229.271][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-16][]
[1487276229.276][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Locating item in list: global/server_groups][{"id":"giyfkrp0u01"}]
[1487276229.276][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list: global/server_groups][]
[1487276229.276][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/server_groups/0][]
[1487276229.281][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Locating item in list: global/plugins][{"id":"shellplug"}]
[1487276229.281][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list: global/plugins][]
[1487276229.281][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/plugins/0][]
[1487276229.286][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Locating item in list: global/categories][{"id":"civm2b4vm02"}]
[1487276229.286][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list: global/categories][]
[1487276229.286][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/categories/0][]
[1487276229.289][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13251][]
[1487276229.292][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.293][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.296][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Setting expiration on: jobs/jiz8tuhnp6o/log.txt.gz to 1502828229][]
[1487276229.296][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"expire_set","key":"jobs/jiz8tuhnp6o/log.txt.gz","expiration":1502828229,"force":false}]
[1487276229.297][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.297][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Setting expiration on: jobs/jiz8tuhnp6o to 1502828229][]
[1487276229.297][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"expire_set","key":"jobs/jiz8tuhnp6o","expiration":1502828229,"force":false}]
[1487276229.297][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.297][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.297][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.3][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13250][]
[1487276229.306][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13249][]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"expire_set","key":"jobs/jiz8tuhoy6v/log.txt.gz","expiration":1502828229,"force":false}]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Setting expiration on: jobs/jiz8tuhoy6v/log.txt.gz to 1502828229][]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Setting expiration on: jobs/jiz8tuhoy6v to 1502828229][]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"expire_set","key":"jobs/jiz8tuhoy6v","expiration":1502828229,"force":false}]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.308][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.314][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13248][]
[1487276229.315][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.318][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Locating item in list: global/server_groups][{"id":"giyfkrp0u01"}]
[1487276229.318][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list: global/server_groups][]
[1487276229.318][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/server_groups/0][]
[1487276229.32][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Locating item in list: global/plugins][{"id":"shellplug"}]
[1487276229.32][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list: global/plugins][]
[1487276229.321][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/plugins/0][]
[1487276229.321][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13247][]
[1487276229.322][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Locating item in list: global/categories][{"id":"civm2b4vm02"}]
[1487276229.322][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list: global/categories][]
[1487276229.322][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: global/categories/0][]
[1487276229.323][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Setting expiration on: jobs/jiz8tuhmo6l/log.txt.gz to 1502828229][]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"expire_set","key":"jobs/jiz8tuhmo6l/log.txt.gz","expiration":1502828229,"force":false}]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Setting expiration on: jobs/jiz8tuhmo6l to 1502828229][]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"expire_set","key":"jobs/jiz8tuhmo6l","expiration":1502828229,"force":false}]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.326][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276229.33][2017-02-17 04:17:09][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13246][]
[1487276281.359][2017-02-17 04:18:01][CronicleMaster][Storage][debug][9][Fetching 0 items at position 0 from list: global/schedule][]
[1487276281.359][2017-02-17 04:18:01][CronicleMaster][Storage][debug][9][Loading list: global/schedule][]
[1487276281.361][2017-02-17 04:18:01][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276281.367][2017-02-17 04:18:01][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276285.708][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276285.717][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-19][]
[1487276285.717][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-18][]
[1487276285.718][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-17][]
[1487276285.718][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/schedule/-16][]
[1487276285.719][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Locating item in list: global/server_groups][{"id":"giyfkrp0u01"}]
[1487276285.719][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list: global/server_groups][]
[1487276285.72][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/server_groups/0][]
[1487276285.732][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Locating item in list: global/plugins][{"id":"shellplug"}]
[1487276285.732][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list: global/plugins][]
[1487276285.732][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/plugins/0][]
[1487276285.735][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Locating item in list: global/categories][{"id":"civm2b4vm02"}]
[1487276285.735][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list: global/categories][]
[1487276285.735][2017-02-17 04:18:05][CronicleMaster][Storage][debug][9][Loading list page: global/categories/0][]
[1487276309.519][2017-02-17 04:18:29][CronicleMaster][Storage][debug][9][Loading list page: logs/completed/-13245][]
[1487276324.101][2017-02-17 04:18:44][CronicleMaster][Storage][debug][9][Enqueuing async task][{"action":"custom"}]
[1487276369.48][2017-02-17 04:19:29][CronicleMaster][Storage][debug][9][Fetching 0 items at position 0 from list: global/schedule][]
[1487276369.48][2017-02-17 04:19:29][CronicleMaster][Storage][debug][9][Loading list: global/schedule][]
[1487276457.089][2017-02-17 04:20:57][CronicleMaster][Storage][debug][2][Setting up storage system][]
[1487276457.112][2017-02-17 04:20:57][CronicleMaster][Storage][debug][9][Fetching 0 items at position 0 from list: global/servers][]
[1487276457.113][2017-02-17 04:20:57][CronicleMaster][Storage][debug][9][Loading list: global/servers][]

If there is anything else I can provide, please feel free to let me know.

Thank you!

@jhuckaby (Owner) commented:

and 308w+ completed jobs in total so far (maybe 4w+ per day):

What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????

From Storage.log, it seems that there are a lot of "completed log" items loading to memory at that time (may be in the maintenance operation?),

Yes, 4:00 AM local server time is generally when Cronicle runs its daily maintenance (see maintenance to configure when this happens), where it prunes lists that have grown too big, especially the logs/completed list. I've honestly never had this run out of memory before. It looks like you are running so many jobs per day that this operation has exceeded Node's default 1 GB RAM ceiling.
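
For reference, here is a quick way to see where that is configured, assuming a default install layout; the conf path and the HH:MM format of the "maintenance" key are assumptions on my part, so double-check against the configuration docs:

# Hypothetical default config location; the maintenance hour is local server time.
grep -n '"maintenance"' /opt/cronicle/conf/config.json
# Example value (HH:MM, assumed format); restart Cronicle after changing it:
#   "maintenance": "04:00",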

There's only one thing I can suggest at this time. Please edit this file on all your Cronicle servers:

/opt/cronicle/bin/control.sh

Locate this line inside the file:

BINARY="node --expose_gc --always_compact $HOMEDIR/lib/main.js"

And change it to this:

BINARY="node --expose_gc --always_compact --max_old_space_size=4031 $HOMEDIR/lib/main.js"

Basically, we're adding in this Node.js command-line argument here: --max_old_space_size=4031

Then restart Cronicle on all your servers. This should allow Node.js to use up to 4 GB of RAM (the default is 1 GB).
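
As a quick sanity check after the restart (optional, and assuming a standard Node.js install), you can confirm the process picked up the flag and see what heap ceiling Node grants for it; the v8 module call below is built into Node and reports the limit in bytes:

# Confirm the Cronicle process was restarted with the new flag:
ps -ef | grep '[m]ain.js'
# Report the heap size limit (in MB) that Node grants for the flag:
node --max_old_space_size=4031 -e "console.log(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024)"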

If this doesn't work, then I have no idea how to help you. You may simply be pushing this system far beyond where it was designed to go.

Good luck!

- Joe

@iambocai (Author) commented Feb 17, 2017

What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????

Sorry, 1w = 10k; "w" is the abbreviation of wan [万], a unit of magnitude commonly used in China. :)

I have changed the control script and restarted the service; I hope it works.

This is the monitoring graph trend at that time:
[screenshot]

Can I disable the daily maintenance in the configuration, and do the maintenance with a separate script in the system crontab? That should avoid the service OOM.

@jhuckaby (Owner) commented:

Oh my... So if 1w = 10k (10,000), and you're doing 4w per day, then that's 40,000 jobs per day??? That's extraordinarily excessive. My jaw is on the floor right now. Unfortunately this is way, way more than the system was ever designed to handle. That's very likely why you're running out of memory.

Okay, so at this sheer magnitude of jobs, the standard listSplice() call, which is used to prune the lists during the daily maintenance, isn't going to work for you. It loads many pages into memory. That's simply too many items for it to splice at once. You're better off just clearing the lists every night. You'll lose the history of the completed items and activity, but I can't think of anything else, without redesigning the system somehow.

You can try sticking these commands in a shell script, activated by a crontab (or Cronicle I guess). Make sure you only run this on your Cronicle master server (not all the servers).

#!/bin/bash
/opt/cronicle/bin/storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js list_delete logs/activity

These are the two largest lists, and probably the ones causing all the problems for you. If you delete these every night before the maintenance runs (maybe 3 AM?) it might work around your problem.

I'm worried about race conditions, however, because the CLI script will be "fighting" with Cronicle to read/write to these same lists as they are being deleted, especially if you are launching a job every 2 seconds (~40K per day). We may run into data corruption here -- but I don't know what else to try. What you may have to do is stop Cronicle, run the commands, then start it again. Like this:

#!/bin/bash
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
/opt/cronicle/bin/control.sh start

Again, please note that this should only be executed on the Cronicle master server (which is the only server that should ever "write" to the data store).

Good luck, and sorry. Please understand that Cronicle is not designed for this level of scale (not even close), and is still in pre-release beta (not yet even v1.0). I've never run more than 500 jobs in a day before. You're doing 80 times that amount.

- Joe

@iambocai (Author) commented:

Thank you for your advice. We have a dozen events running every minute, so... hah~

When we execute "/opt/cronicle/bin/storage-cli.js list_delete logs/complete", we get an exception:

[root@master bin]# ./storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^

Error: Failed to fetch key: logs/complete: File not found
at ReadFileContext.callback (/opt/cronicle/node_modules/pixl-server-storage/engines/Filesystem.js:255:6)
at FSReqWrap.readFileAfterOpen [as oncomplete] (fs.js:335:13)

and "get" command's output like this:

[screenshot]

config.json's Storage section:
[screenshot]

@iambocai (Author) commented Feb 17, 2017

Oh, I see, it's logs/completed, not logs/complete.

Stopping all masters and clearing the data is a bit unfriendly for production users, especially those using master/backup mode; they have to stop all servers in the master group to ensure no write activity is in progress.
Could you provide a "pause/continue" feature that temporarily pauses/resumes new job scheduling for all events (the pause command could check status every 30s and exit once all previously running jobs have finished)? That way we could do the manual maintenance more safely, without stopping all our masters (which would produce many job-failure emails).

@jhuckaby (Owner) commented:

@iambocai I do apologize, that was a typo on my part. It is indeed logs/completed, as you said.

At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results), and if that doesn't work, try the nightly crontab delete.

I understand that stopping and starting the service is not ideal. You can skip this part, and just do the delete "hot" (while the service is running). We're most likely already in a situation where your data is a bit corrupted, because we've core dumped at least once in the middle of a maintenance run. So at this point I'd say, if the Node.js memory increase doesn't work, put in a crontab that runs this every night at 3 AM on your Master server:

#!/bin/bash
/opt/cronicle/bin/storage-cli.js list_delete logs/completed
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
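
For example, assuming those two commands are saved to a script at a hypothetical path such as /opt/cronicle/bin/nightly-prune.sh (made executable with chmod +x), the crontab entry on the master could look like this; the 3 AM time and the log path are just illustrative:

# m  h  dom mon dow  command
0 3 * * * /opt/cronicle/bin/nightly-prune.sh >> /var/log/cronicle-prune.log 2>&1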

Also, please allow me to make another recommendation. Since you are running 40K jobs per day, you should probably greatly decrease the job_data_expire_days config setting from 180 days down to maybe 7 days. This is because we're pruning the main lists down to 10,000 items every night, so we're essentially losing the history of all your jobs after one day. So there's no need to keep the job logs around for much longer. However, you can still drill down into the job history for individual events, which are stored as different lists, so a few days is probably still good.
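
A minimal sketch of that change, assuming a default install with the config at /opt/cronicle/conf/config.json (the path is an assumption; adjust for your layout), and remembering that it only applies to newly created jobs:

# Find the current setting (hypothetical default path):
grep -n job_data_expire_days /opt/cronicle/conf/config.json
# Edit the value in config.json, e.g. change 180 down to:
#   "job_data_expire_days": 7,
# Then restart so new jobs pick it up (or use the stop/start commands shown earlier):
/opt/cronicle/bin/control.sh restart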

I am working on a new version of Cronicle which will attempt to prune lists without blowing out memory. I'll let you know when this is finished and released.

@jhuckaby (Owner) commented:

Hey @iambocai,

I just released Cronicle v0.6.9, which should fix this bug, if it is what I think it is. The nightly maintenance should now use far less memory when chopping old items off huge lists.

https://github.com/jhuckaby/Cronicle/releases/tag/v0.6.9

Hope this helps.

@iambocai (Author) commented Feb 18, 2017

At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results)

Yes, I changed it yesterday. The good news is that it did not dump core last night; the bad news is that it also seems not to have deleted the completed logs successfully.
[screenshot]
(Ah, yes, there are 3219k+ completed logs in file storage now.)

And when I tried to do the delete "hot", it kept throwing exceptions; I guess we hit exactly the "CLI script fighting with Cronicle" condition you mentioned.

I know nothing about pixl-server-storage. Is there a way to delete them without suffering from read/write conflicts (by page number or last update time)? For example, would deleting only the logs/completed pages from a few days ago help?

Again, thank you very much for helping me over the last few days. You are so nice! :)

[root@master bin]# node storage-cli.js list_delete logs/completed
[ERROR] Failed to delete object: logs/completed/-62935: File not found
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^

Error: ENOENT: no such file or directory, unlink '/home/homework/data/cronicle/logs/6a/fc/e6/6afce60cf3ff8c95c37153b5f48b43cf.json'
[root@master bin]# node storage-cli.js list_delete logs/completed
[ERROR] Failed to delete object: logs/completed/-62935: File not found
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^

Error: ENOENT: no such file or directory, unlink '/home/homework/data/cronicle/logs/6a/fc/e6/6afce60cf3ff8c95c37153b5f48b43cf.json'

@jhuckaby (Owner) commented:

Oh wow. That is insane. Okay, the situation has grown beyond my ability to "fix" it, so we have to export all your critical data, delete your entire data directory contents, and then restore the backup. This is actually very easy, but you will have to temporarily stop the Master server. As long as you have Cronicle v0.6.9 or higher, this issue shouldn't happen again, after we restore your data into a fresh directory.

Here is what the backup will include, meaning all the following data will be preserved and restored:

  • All your users
  • All your API keys
  • All your Plugins
  • All your server groups
  • All your event categories
  • All your schedule entries
  • All your servers (cluster info)

Here is what will be lost completely (no choice really):

  • All historical (completed) jobs logs will be lost
  • All event history (stats) will be lost
  • The "Activity" log contents in the Admin tab will be lost
  • All user session cookies (meaning you will have to login with your user/password again)

I am sorry about this, but keep in mind that Cronicle is still in pre-release (v0.6) and is just not designed for your immense scale, or for repairing corrupted data on disk. I will work to make this system more robust before I ever release v1.0.

Here are the steps. Please do this on your master server only (and make sure you don't have any secondary backup servers that will take over as soon as the master goes down). Make sure you are the root user (i.e. superuser).

FYI, this is documented here: Data Import and Export

cd /opt/cronicle
./bin/control.sh stop
./bin/control.sh export data-backup.txt --verbose

You should now have a backup of all your critical data in /opt/cronicle/data-backup.txt. For safety I don't recommend you hard delete the data directory at this point. Instead, you can rename or tarball it if you want. Renaming is faster, so we minimize your downtime:

mv /home/homework/data /home/homework/data-OLD
mkdir /home/homework/data

So this has moved the old data directory aside, renaming it to data-OLD, and we created a fresh new empty data directory. This way you can still revert if you want (i.e. just stop, swap the directories back, then start).

If you are not sure which version you are on, now is a good time to make sure that Cronicle is on the latest (i.e. v0.6.9), which has the nightly memory fix in it.

./bin/control.sh upgrade

Now let's restore your backup into the fresh new /home/homework/data directory we just created, and start the service:

./bin/control.sh import data-backup.txt
./bin/control.sh start
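
Before resuming normal use, it may be worth a quick spot check that the restore worked; the "status" subcommand and the storage-cli "get" usage below are assumptions on my part, based on the control script and the CLI output shown earlier in this thread:

# Check the daemon is running (assumes control.sh supports a status subcommand):
./bin/control.sh status
# Spot-check a restored record, e.g. the schedule list header (key as seen in the Storage.log excerpts above):
./bin/storage-cli.js get global/schedule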

That should be it. Please let me know if any step fails or emits errors. I am hoping and praying that all the critical data contained in the backup doesn't have any corruption. With any luck the damaged records are all within the huge completed logs, which we are not bringing over.

Fingers crossed.

- Joe

@iambocai (Author) commented:

Yeah, it works! About 49 GB of history data has gone with the wind. :D

@jhuckaby (Owner) commented:

Whew, that's a relief! 😌

Let's hope the new code in v0.6.9 keeps it under control from now on. It should prune all the lists down to 10,000 max items (configurable) every night.

Thanks for the issue report, and sorry about the core dumps and corrupted log data.

@iambocai (Author) commented Mar 3, 2017

I have kept watching the system for two weeks; the daily maintenance of the logs/completed and activity lists works fine now. But unfortunately, the jobs dir still seems to grow fast:

in config.json:

"job_data_expire_days": 3,

disk usage:
[screenshot]

There are 893k files under the jobs dir now, far more than 120k (40k jobs per day * 3 days):
[screenshot]

@jhuckaby (Owner) commented Mar 4, 2017

That's odd, it's supposed to automatically delete those. However, please note that changing job_data_expire_days only takes effect for new jobs, meaning it doesn't retroactively expire older ones from before you made the change.

However, if this directory keeps growing out of control, you can put in a cronjob to delete old files from it, at least until I can find the bug:

find /home/homework/data/jobs -type f -mtime +3 -exec rm {} \;

Unfortunately your job volume is so extreme that I don't know of any feasible way to examine the nightly maintenance logs on your server. It would just be noise on top of all the other jobs your servers are running. However, when I have some time I will try to recreate this issue on a test server.
