How to continue a crashed run #8

Open
papaig opened this issue Feb 21, 2019 · 1 comment

Comments

papaig commented Feb 21, 2019

Dear developers,
I'm running Thunder on a CPU cluster. Despite using 8 nodes with 16 cores each, my run didn't finish in 14 days. Unfortunately, 14 days is the time limit for a run on this cluster, so the run was cancelled. I would like to continue it, and I wonder how I should do that. If I put the last .thu file in the JSON file as ".thu File Storing Paths and CTFs of Images", Thunder seems to restart the run from the beginning. I chose a new output folder so as not to overwrite the files from the previous, crashed run.
I would be grateful if you could tell me how to continue the run from where it crashed.
Thank you,
Gabor

Owner

thuem commented Mar 23, 2019

I am so sorry for the delay; the email system blocked the notification.

There are two situations.

First situation: the run ended during global search. In this case, just put the last .thu file as ".thu File Storing Paths and CTFs of Images", and change the initial model and initial resolution to the reference and resolution, respectively, that you achieved in the last round of the previous run. It should then work.

Second situation: the run ended after global search. In this case, in addition to the actions for the first situation, the "Global Search" option should be changed from true to false.
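
The two situations above could be scripted roughly as follows. This is only a sketch, not part of THUNDER: the key names ".thu File Storing Paths and CTFs of Images" and "Global Search" are the ones quoted in this thread, while "Initial Model" and "Initial Resolution (Angstrom)" are assumed placeholders that may differ in your JSON file. The helper deliberately refuses to add a key that is not already present, so a wrong key name is reported instead of being silently written.

```python
import copy

# Key names quoted in this thread.
THU_KEY = ".thu File Storing Paths and CTFs of Images"
GLOBAL_SEARCH_KEY = "Global Search"
# ASSUMED key names (hypothetical): check your own JSON file for the real ones.
INITIAL_MODEL_KEY = "Initial Model"
INITIAL_RESOLUTION_KEY = "Initial Resolution (Angstrom)"


def set_existing_key(node, key, value):
    """Overwrite `key` wherever it already exists in a (possibly nested) dict.

    Returns how many occurrences were updated; never creates a new key.
    """
    count = 0
    for k in node:
        if k == key:
            node[k] = value
            count += 1
        elif isinstance(node[k], dict):
            count += set_existing_key(node[k], key, value)
    return count


def make_restart_config(config, last_thu, last_reference, last_resolution,
                        ended_during_global_search,
                        model_key=INITIAL_MODEL_KEY,
                        resolution_key=INITIAL_RESOLUTION_KEY):
    """Return a modified copy of `config` for continuing an interrupted run."""
    new_config = copy.deepcopy(config)  # leave the caller's dict untouched
    updates = {
        THU_KEY: last_thu,
        model_key: last_reference,
        resolution_key: last_resolution,
    }
    # Second situation: the run had already finished global search.
    if not ended_during_global_search:
        updates[GLOBAL_SEARCH_KEY] = False
    for key, value in updates.items():
        if set_existing_key(new_config, key, value) == 0:
            raise KeyError(f"{key!r} not found; check the key names in your JSON file")
    return new_config
```

To use it, load the previous configuration with `json.load`, pass the resulting dict to `make_restart_config`, and write the result to a new file with `json.dump`, keeping a separate output folder as in the original post.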

Moreover, THUNDER uses cluster resources differently from RELION. If you are using 8 nodes with 16 cores each, it is better to run 1 process with 16 threads on each node. Also, since some job management systems, such as LSF, restrict the number of physical cores assigned to each process, it is important to check the configuration of the job management system and make sure that each node runs one THUNDER process and that this process can use all of the node's CPU resources through threading. If this does not accelerate your job, please contact us with your job information, such as the number of images, box size, and symmetry. We will compare it with our benchmark.
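
As a quick sanity check of the layout described above, the arithmetic can be written down as below. This is a hypothetical helper, not a THUNDER or scheduler API; it only compares a planned process/thread layout against the "one process per node, one thread per core" recommendation.

```python
def recommended_layout(nodes, cores_per_node):
    """Layout suggested in this thread: one process per node, one thread per core."""
    return {"processes": nodes, "threads_per_process": cores_per_node}


def check_layout(nodes, cores_per_node, processes, threads_per_process):
    """Return a list of warnings for a planned layout (empty if it matches)."""
    target = recommended_layout(nodes, cores_per_node)
    warnings = []
    if processes != target["processes"]:
        warnings.append(
            f"{processes} processes on {nodes} nodes; "
            f"expected {target['processes']} (one per node)"
        )
    if threads_per_process != target["threads_per_process"]:
        warnings.append(
            f"{threads_per_process} threads per process; "
            f"expected {target['threads_per_process']} (all cores of a node)"
        )
    return warnings
```

For the cluster in the original post, `recommended_layout(8, 16)` gives 8 processes with 16 threads each. Whether the scheduler actually grants each process all 16 cores still has to be confirmed in the job management system's own configuration.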

Best regards.
