
ddcal worker fails due to a MemoryError #1582

Open
a-benati opened this issue May 11, 2024 · 27 comments

@a-benati

a-benati commented May 11, 2024

Hello,

the ddcal worker fails with:
MemoryError: Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.

I don't understand which parameters should be modified to solve this problem. The number of worker processes (dist_nworker) is set to 0. I tried changing the data_chunkhours parameter from the default 0.05 to 0.01, and nothing seems to be different.

Here is the log file where the error is encountered:

# INFO      01:08:53 - main               [0.8 11.8 0.0Gb] multi-process mode: 1+1 workers, --dist-nthread 1
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Detected a total of 503.77GiB of system memory.
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Per-solver (worker) memory use estimated at 789.68GiB: 156.75% of total system memory.
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Peak I/O memory use estimated at 571.53GiB: 113.45% of total system memory.
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Total peak memory usage estimated at 1361.20GiB: 270.20% of total system memory.
# INFO      01:08:53 - main               [0.8 11.8 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.)
#  Traceback (most recent call last):
#   File "/opt/venv/lib/python3.8/site-packages/cubical/main.py", line 548, in main
#     estimate_mem(ms, tile_list, GD["data"], GD["dist"])
#   File "/opt/venv/lib/python3.8/site-packages/cubical/data_handler/wisdom.py", line 89, in estimate_mem
#     raise MemoryError(
# MemoryError: Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.

I found the same problem in #1466, but adjusting the parameters dd_g_timeslots_int and dd_dd_timeslots_int does not seem to improve the situation (I tried dd_g_timeslots_int: 16 with dd_dd_timeslots_int: 16, and dd_g_timeslots_int: 4 with dd_dd_timeslots_int: 4).

Do you know how I can solve this problem?

@Athanaseus
Collaborator

Hi @a-benati , thanks for reporting this.

Can you please share the full log?
And does it help setting dist_nworker to 1 or 2?
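For reference, a minimal sketch of how that would look in the config (I believe dist_nworker sits under the ddcal worker's calibrate_dd section, but please double-check the exact nesting in the ddcal worker docs):

ddcal:
  enable: True
  calibrate_dd:
    dist_nworker: 1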

Best regards

@a-benati
Author

Hi @Athanaseus,

thanks for your answer. Here is the full log:
log-caracal.txt

Setting dist_nworker to a higher value actually makes the situation worse since the required memory increases.

@Athanaseus
Collaborator

Thanks @a-benati ,
By default the parameter is 0, meaning the entire dataset is loaded.
I'm curious to see the log results of dist_nworker: 1
so that I can compare the requested memory.
Regards

@JSKenyon

I believe that the issue is the absence of time and frequency chunks in the input parameters. You will be working with extremely large chunks. I would suggest setting the input time and frequency chunks to match the solution interval of your DDEs in this case.

@a-benati
Author

Thanks @Athanaseus. Here is the log result of dist_nworker: 4, since I already have it and it would take ~9 hours to try with dist_nworker: 1.
log-caracal_dist_nworker_4.txt

@a-benati
Author

@JSKenyon thanks for your answer. I agree that I need smaller time and frequency chunks, but I am not sure which parameters to change: are they dd_dd_timeslots_int and dd_dd_chan_int? And what would be a reasonable value? Or rather, what should I inspect to work out a reasonable value?

@JSKenyon

This is where the options are set in the ddcal worker:

"data-time-chunk": ddsols_t * int(min(1, config[key]['dist_nworker'])) if (ddsols_f == 0 or config[key]['dd_g_chan_int'] == 0) else ddsols_t * int(min(1, np.sqrt(config[key]['dist_nworker']))),
"data-freq-chunk": 0 if (ddsols_f == 0 or config[key]['dd_g_chan_int'] == 0) else ddsols_f * int(min(1, np.sqrt(config[key]['dist_nworker']))),

I am not much of a CaraCal user so I am not sure of the easiest way to adjust those parameters.

@JSKenyon

In principle, for the parameters in the log you shared, data-time-chunk=4 would probably be ideal. Currently, it gets set to zero, which means each scan is treated as a single chunk. Let me know if you manage to give that a go, and feel free to share further logs - I might be able to offer further insight.

@a-benati
Author

a-benati commented May 14, 2024

Thanks @JSKenyon. When dist_nworker: 0 both data-time-chunk and data-freq-chunk are set to 0. When I set dist_nworker: 4, I have data-time-chunk=100 and data-freq-chunk=0, but the memory error persists (you can check the log file above). Do you think I should set data-time-chunk=4 directly in the code of ddcal_worker.py?

@JSKenyon

You could definitely give it a try and see if it resolves the issue. I see that you have 24 directions in your model - that is pretty extreme (your model will be 24 times larger than the associated visibilities). I would also suggest making your frequency solution interval something that divides 512 (the number of channels, if I am not mistaken), e.g. 128. That should prevent some complications. Unfortunately, CubiCal (the underlying software package for the ddcal step) was never particularly light on memory.
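Just to illustrate why the chunk size dominates here, a back-of-the-envelope sketch (my own rough numbers, not CubiCal's internal estimate; the 2016 baselines, 360 timeslots per scan, 4 correlations and 8-byte complex visibilities are all assumptions):

# Rough, illustrative per-chunk memory estimate (not CubiCal's exact bookkeeping).
def chunk_gib(n_times, n_chans, n_dirs, n_bl=2016, n_corr=4, bytes_per_vis=8):
    vis = n_times * n_bl * n_chans * n_corr * bytes_per_vis  # observed visibilities
    model = n_dirs * vis                                     # one model array per direction
    return (vis + model) / 2**30

print(chunk_gib(360, 512, 24))  # whole scan as one chunk, 24 directions: ~280 GiB
print(chunk_gib(4, 128, 24))    # 4 timeslots x 128 channels, 24 directions: <1 GiB

The point being that with 24 directions the model term is 24x the data term, so shrinking the chunk is the only practical way to keep the per-worker footprint sane.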

@a-benati
Author

@JSKenyon thanks, I will try setting data-time-chunk=4 and data-freq-chunk=128 directly in the code and see what happens. I will keep you posted.

@a-benati
Author

@Athanaseus, @JSKenyon thanks. I think I solved that error since the code gets through the part where it was stuck before. However, now I get another error, which I believe is related to flagging in DDFacet (I think all data are flagged). Here is the log file:
log-caracal_new.txt

@JSKenyon

It looks like the data has been almost completely flagged, possibly by CubiCal. You should probably check your flagging before and after that step. CubiCal is also very unhappy about the SNR in many of the directions. I would suggest looking at your image prior to DD calibration to make sure that all 24 of those directions really require DD solutions.
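To check the flag fractions, something like CASA's flagdata in summary mode (run on the MS before and after the CubiCal step) should do; a minimal sketch, with the MS path as a placeholder:

# Print the overall flagged fraction of an MS using CASA.
from casatasks import flagdata

summary = flagdata(vis="your_data.ms", mode="summary")  # substitute your MS path
print("flagged: %.1f%%" % (100.0 * summary["flagged"] / summary["total"]))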

@a-benati
Author

@JSKenyon thanks. I reduced the number of facets to 12, but I don't think I really need this many directions: I only have 3 or 4 very bright sources in my field that corrupt everything else. Do you think that reducing the number of facets to 4 or 6 could solve the flagging issue? Or is it a completely independent problem?

@JSKenyon

Unfortunately I did not implement the DDFacet component of the visibility prediction, so I am not an expert. I think that you would likely need to edit the region file passed to CubiCal such that it only includes the 4 or so problematic sources.

@a-benati
Author

a-benati commented May 15, 2024

Thanks @JSKenyon. Do you know which region file with the 4 sources is passed to CubiCal, or where it is created? I can edit it and tell CubiCal to use that file instead of having a new one created automatically, right?

@JSKenyon

Thanks @JSKenyon. Do you know which region file with the 4 sources is passed to CubiCal, or where it is created? I can edit it and tell CubiCal to use that file instead of having a new one created automatically, right?

Based on your log, it is /stimela_mount/output/de-Abell3667.reg. It is created by CatDagger in the previous step. I would suggest manually creating your own region file using Carta or DS9. You could then modify the model option in the CubiCal step to use your region file (note that it appears twice in the specification of the model).
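For what it's worth, a hand-drawn region file for only the problematic sources can be very simple; a sketch of what the DS9-format file might look like (the coordinates and radii below are placeholders, use the positions of your 3-4 bright sources):

# Region file format: DS9 version 4.1
fk5
circle(20:12:30.00,-56:49:00.0,60")
circle(20:13:10.50,-56:52:20.0,60")
circle(20:11:05.20,-56:45:40.0,60")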

Pinging @bennahugo as he is more knowledgeable about this functionality than I am.

@bennahugo
Collaborator

bennahugo commented May 15, 2024 via email

@a-benati
Author

@JSKenyon @bennahugo thank you. I will try creating a region file manually with DS9 and telling CubiCal to use that file instead of the one created by CatDagger. I will let you know if it works.

@a-benati
Author

a-benati commented May 17, 2024

@JSKenyon @bennahugo I created the region file manually with Carta and gave it as input to CubiCal, but I still get the same error related to the flagged data. I actually think the code stops at an earlier step, since the point where the region file is read is never reached in the log file. For example, the file caracaldE_sub.log used to be created, but now it is not. Here is my log file.
log-caracal_latest.txt

@JSKenyon

Can you please check the status of the flagging on the original data, prior to the pipeline being run? I don't think that the pipeline is resetting the flags to their original state, i.e. now that your data is 100% flagged, it will remain that way.

@a-benati
Author

@JSKenyon yes, my data is now 100% flagged even before the pipeline runs. Do you know how I could reset the flagging? I am running CARACal starting directly from the ddcal worker; maybe I need to start over from the beginning to get it right? And in that case, giving the right list of tagged sources to CubiCal should solve the flagging error, right?

@paoloserra
Collaborator

Sorry to jump in, but CARACal does support resetting and rewinding flags in a number of ways. See https://caracal.readthedocs.io/en/latest/manual/reduction/flag/index.html .

The ddcal worker might be the only one with no flag-rewinding option, but you could add a flag worker block to your config to just rewind to whatever flag version you need.
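A minimal sketch of such a block, assuming the rewind_flags syntax from the docs linked above (the instance label and version name are placeholders):

flag__rewind:                  # an extra flag worker instance; label assumed
  enable: True
  rewind_flags:
    enable: True
    mode: rewind_to_version
    version: caracal_legacy    # placeholder: pick a version that exists in FLAG_VERSION_LIST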

@JSKenyon

Sorry to jump in, but CARACal does support resetting and rewinding flags in a number of ways. See https://caracal.readthedocs.io/en/latest/manual/reduction/flag/index.html .

The ddcal worker might be the only one with no flag-rewinding option, but you could add a flag worker block to your config to just rewind to whatever flag version you need.

Thanks for jumping in! I am not really a CARACal expert so I appreciate it!

@a-benati
Author

@paoloserra thanks! I will definitely look into that, hoping that giving the manual region file to CubiCal solves the issue.

@a-benati
Author

@paoloserra I get an error saying that there aren't any flag versions for my MS file:

2024-05-17 18:06:41 CARACal INFO: flag__3: initializing
2024-05-17 18:06:41 CARACal ERROR: You have asked to rewind the flags of 1685906777_sdp_l0.ms to the version "caracal_flag__3_before" but this version
2024-05-17 18:06:41 CARACal ERROR: does not exist. The available flag versions for this .MS file are:
2024-05-17 18:06:41 CARACal ERROR: Note that if you are running Caracal on multiple targets and/or .MS files you should rewind to a flag
2024-05-17 18:06:41 CARACal ERROR: version that exists for all of them.
2024-05-17 18:06:41 CARACal ERROR: Flag version conflicts. [RuntimeError]

I attach here my log file.
log-caracal_flag.txt

@Athanaseus
Collaborator

Hi @a-benati,

You can also provide the name of the flag version like:

rewind_flags:
  enable:                       True
  mode:                         rewind_to_version
  version:                      caracal_selfcal_after

You can look up the flag versions in the flag table (<dataid>.ms.flagversions/FLAG_VERSION_LIST) to get the one you need.
A description of the parameter is here: https://caracal.readthedocs.io/en/latest/manual/workers/flag/index.html#rewind-flags
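For example, from your msdir (the FLAG_VERSION_LIST file is plain text):

cat 1685906777_sdp_l0.ms.flagversions/FLAG_VERSION_LIST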

Note that label_in is set to an empty string, meaning the original MS (1685906777_sdp_l0.ms) is used.

Best regards
