ddcal worker fails due to a MemoryError #1582
Comments
Hi @a-benati, thanks for reporting this. Can you please share the full log? Best regards
Hi @Athanaseus, thanks for your answer. Here is the full log:
Thanks @a-benati,
I believe that the issue is the absence of time and frequency chunks in the input parameters. You will be working with extremely large chunks. I would suggest setting the input time and frequency chunks to match the solution interval on your DDE in this case.
Thanks @Athanaseus. Here is the log result of
@JSKenyon thanks for your answer. I agree that I need smaller time and frequency chunks, but I am not sure which parameters to change: are they
This is where the options are set in the ddcal worker: caracal/caracal/workers/ddcal_worker.py Lines 330 to 331 in 2d338e2
I am not much of a CARACal user so I am not sure of the easiest way to adjust those parameters.
In principle, for the parameters in the log you shared,
Thanks @JSKenyon. When
You could definitely give it a try and see if it resolves the issue. I see that you have 24 directions in your model - that is pretty extreme (your model will be 24 times larger than the associated visibilities). I would also suggest making your frequency solution interval something which divides 512 (the number of channels if I am not mistaken) e.g. 128. That should prevent some complications. Unfortunately, CubiCal (the underlying software package for the ddcal step) was never particularly light on memory.
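To make the scaling concrete, here is a rough back-of-envelope sketch (not CubiCal's actual memory estimator) of how the per-chunk model grows with the number of directions. All dimensions below are illustrative placeholders, and complex64 visibilities are assumed:

```python
# Back-of-envelope estimate of per-chunk memory, assuming complex64 data.
# All dimensions are placeholders -- substitute your own chunk sizes.
n_times, n_chan, n_bl, n_corr = 60, 512, 2016, 4   # one chunk (illustrative)
n_dir = 24                                          # directions tagged in the region file
bytes_per_vis = 8                                   # complex64

vis_gb = n_times * n_chan * n_bl * n_corr * bytes_per_vis / 1e9
model_gb = n_dir * vis_gb                           # one model slot per direction

print(f"visibilities per chunk: ~{vis_gb:.1f} GB")
print(f"DD model per chunk:     ~{model_gb:.1f} GB")

# Sanity-check the frequency solution interval against the channel count,
# per the suggestion above (e.g. 128 divides 512).
freq_int = 128
print("freq interval divides channel count:", n_chan % freq_int == 0)
```

With the placeholder chunk above this already puts the direction-dependent model in the tens of gigabytes, which is why shrinking the chunks and the number of directions both help.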
@JSKenyon thanks, I will try setting
@Athanaseus, @JSKenyon thanks. I think I solved that error, since the code now gets past the part where it was stuck before. However, I now get another error, which I believe is related to flagging in DDFacet (I think all the data are flagged). Here is the log file:
It looks like the data has been almost completely flagged, possibly by CubiCal. You should probably check your flagging before and after that step. CubiCal is also very unhappy about the SNR in many of the directions. I would suggest looking at your image prior to DD calibration to make sure that all 24 of those directions really require DD solutions.
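For reference, a minimal way to inspect the flagged fraction before and after the CubiCal step is CASA's flagdata task in summary mode (assuming casatasks is available; the MS name below is a placeholder):

```python
from casatasks import flagdata

# mode="summary" only reports flag counts; it does not change any flags.
summary = flagdata(vis="my_data.ms", mode="summary")
print(f"Overall flagged: {100 * summary['flagged'] / summary['total']:.1f}%")

# The returned dict also holds per-field (and per-antenna, per-spw, ...) breakdowns.
for field, stats in summary["field"].items():
    print(f"{field}: {100 * stats['flagged'] / stats['total']:.1f}% flagged")
```

Running this before the ddcal worker and again afterwards makes it clear whether that step is responsible for the extra flags.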
@JSKenyon thanks. I reduced the number of facets to 12, but I don't think I really need that many directions: I only have 3 or 4 very bright sources in my field which corrupt everything else. Do you think that reducing the number of facets to 4 or 6 could solve the flagging issue? Or is it a completely independent problem?
Unfortunately I did not implement the DDFacet component of the visibility prediction, so I am not an expert. I think that you would likely need to edit the region file passed to CubiCal such that it only includes the 4 or so problematic sources.
Thanks @JSKenyon. Do you know which region file is passed to CubiCal with the 4 sources, or where it is created? I can edit it and tell CubiCal to use that file instead of automatically creating a new one, right?
Based on your log, it is /stimela_mount/output/de-Abell3667.reg. It is created by CatDagger in the previous step. I would suggest manually creating your own region file using Carta or DS9. You could then modify the model option in the CubiCal step to use your region file (note that it appears twice in the specification of the model). Pinging @bennahugo as he is more knowledgeable about this functionality than I am.
Yup, you may need to increase the local sigma thresholding to the autotagger if you want to use it -- alternatively, manually create a pixel-coordinate region file for your target with astropy / ds9 to pass into CubiCal, per @JSKenyon's suggestion.

The number of facets has no traction on the memory footprint though -- only the number of directions you marked in the region file. I do agree that 12 tags are on the excessive end.
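As a concrete (hypothetical) example of the manual route, the sketch below writes a small DS9-format region file in pixel coordinates for a handful of bright sources. The positions and radii are placeholders to be read off your image in CARTA or DS9, and the output filename is arbitrary:

```python
# Placeholder (x, y) pixel positions of the 3-4 problematic bright sources,
# read off the image by eye in DS9 or CARTA.
bright_sources_pix = [
    (1024.0, 2048.0),
    (3150.5, 980.2),
    (2200.0, 2900.0),
]

# Write a minimal DS9 region file in image (pixel) coordinates.
with open("de-manual.reg", "w") as f:
    f.write("# Region file format: DS9\n")
    f.write("image\n")
    for x, y in bright_sources_pix:
        # The 30-pixel radius is arbitrary -- make it large enough to enclose each source.
        f.write(f"circle({x},{y},30)\n")
```

You would then point the model option in the CubiCal step at this file (in both places it appears in the model specification), as described above.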
@JSKenyon @bennahugo thank you. I will try manually creating a region file with ds9 and telling CubiCal to use that file instead of the one created by CatDagger. I will let you know if it works.
@JSKenyon @bennahugo I created the region file manually with CARTA and gave it as input to CubiCal, but I still get the same error related to the flagged data. I actually think that the code stops at an earlier step, since the point in the log file where the region file is read is never reached. For example, previously the file caracaldE_sub.log was created, but now it is not. Here is my log file.
Can you please check the status of the flagging on the original data, prior to the pipeline being run? I don't think that the pipeline is resetting the flags to their original state, i.e. now that your data is 100% flagged, it will remain that way.
@JSKenyon yes, my data is now 100% flagged even prior to the run of the pipeline. Do you know how I could reset the flagging? I am running caracal starting directly from the ddcal worker; maybe I need to start over from the beginning to get it right? And in that case, giving the right list of tagged sources to CubiCal should solve the flagging error, right?
Sorry to jump in, but CARACal does support flag resetting and rewinding in a number of ways. See https://caracal.readthedocs.io/en/latest/manual/reduction/flag/index.html . The ddcal worker might be the only one with no flag-rewinding option, but you could add a flag worker block to your config to just do the rewinding to whatever flag version you need.
Thanks for jumping in! I am not really a CARACal expert so I appreciate it!
@paoloserra thanks! I will definitely look into that, hoping that giving the manual region file to CubiCal solves the issue.
@paoloserra I get an error saying that there aren't any flag versions for my ms file:
I attach my log file here.
Hi @a-benati,

You can also provide the name of the flag version like:
You can look up the flag versions in the flag table. Note that the

Best regards
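For completeness, outside of the pipeline you can also list and restore flag versions directly with CASA's flagmanager task; the MS and version names below are placeholders (use whatever the list mode reports):

```python
from casatasks import flagmanager

# List the flag versions saved for this measurement set.
flagmanager(vis="my_data.ms", mode="list")

# Restore one of them, e.g. a version taken before the ddcal step.
# "before_ddcal" is a placeholder name -- pick one from the list above.
flagmanager(vis="my_data.ms", mode="restore", versionname="before_ddcal")
```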
Hello,
the ddcal worker fails with:
MemoryError: Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.
I don't understand which parameters should be modified in order to solve this problem. The number of worker processes (`dist_nworker`) is set to 0. I tried changing the `data_chunkhours` parameter to 0.01 instead of the default 0.05 and nothing seems to be different. Here is the log file where the error is encountered:
I found the same problem in #1466, but trying to adjust the parameters `dd_g_timeslots_int` and `dd_dd_timeslots_int` does not seem to improve the situation (I tried with `dd_g_timeslots_int: 16` and `dd_dd_timeslots_int: 16`, and with `dd_g_timeslots_int: 4` and `dd_dd_timeslots_int: 4`). Do you know how I can solve this problem?