convert_msv2_to_processing_set performance #321

Open

kswang1029 opened this issue Nov 23, 2024 · 3 comments

@kswang1029
I tested the conversion on two measurement sets (MSv2) of different sizes and found that the MSv2-to-MSv4 conversion time does not appear to scale linearly with input size.

"""
# this is a SV data W51 band1
msv2_name = "./W51_Band1_UncalibratedData/uid___A002_X1048ed8_X4af7.ms"
convert_out = "./W51_Band1_UncalibratedData/uid___A002_X1048ed8_X4af7.vis.zarr"
convert_msv2_to_processing_set(
in_file=msv2_name,
out_file=convert_out,
overwrite=True,
)
# convert_msv2_to_processing_set is single-threaded, could it be parallelized?
# file size of uid___A002_X1048ed8_X4af7.ms is 10.65 GB
# elapsed time = 3.5018158674240114 min...
# output uid___A002_X1048ed8_X4af7.vis.zarr is 5.63 GB
# [2024-11-20 14:44:14,549] INFO viperlog: Partition scheme that will be used: ['DATA_DESC_ID', 'OBS_MODE', 'OBSERVATION_ID', 'FIELD_ID']
# [2024-11-20 14:44:15,407] INFO viperlog: Number of partitions: 155

"""

"""
# this is a SV data W51 band1 with two ms combined via casa-concat task

msv2_name = "./W51_Band1_UncalibratedData/combined.ms"
convert_out = "./W51_Band1_UncalibratedData/combined.vis.zarr"
convert_msv2_to_processing_set(
in_file=msv2_name,
out_file=convert_out,
overwrite=True,
)
# file size of combined.ms is 23.55 GB
# elapsed time = 9.483368102709452 min...
# output combined.vis.zarr is 12.34 GB
# [2024-11-20 14:51:44,953] INFO viperlog: Partition scheme that will be used: ['DATA_DESC_ID', 'OBS_MODE', 'OBSERVATION_ID', 'FIELD_ID']
# [2024-11-20 14:51:48,699] INFO viperlog: Number of partitions: 310

# the elapsed time seems not linearly scaled with the ms file size. It gets slower.

"""

Is this expected behavior? In addition, can the conversion be parallelized to improve performance?

@FedeMPouzols
Collaborator

@kswang1029: yes, the convert function can run in parallel (see the parallel parameter). For that you need to initialize a Dask client first; for simple examples, see the local_client used in https://github.com/casangi/xradio/blob/main/demo/demo.ipynb or in https://github.com/casangi/xradio/blob/main/docs/source/measurement_set/tutorials/ps_vis.ipynb.
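
For reference, a minimal sketch of a parallel conversion. It assumes a plain dask.distributed local client (the linked tutorials use a local_client convenience wrapper instead), and the import path of the convert function may differ between xradio versions:

"""
from dask.distributed import Client  # standard Dask distributed client
# import path is version-dependent; adjust to your xradio release
from xradio.measurement_set import convert_msv2_to_processing_set

# start a local Dask cluster once per session; worker counts and memory
# limits here are illustrative, not recommendations
client = Client(n_workers=4, threads_per_worker=1, memory_limit="8GB")

convert_msv2_to_processing_set(
    in_file="./W51_Band1_UncalibratedData/combined.ms",
    out_file="./W51_Band1_UncalibratedData/combined.vis.zarr",
    parallel=True,  # distribute the per-partition conversion over the workers
    overwrite=True,
)
"""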

@kswang1029
Author

@FedeMPouzols Thanks for pointing out the tutorials; I will give them a try. So the behavior is environment-dependent (i.e., on Dask). Does this mean that every time I run the convert function (and, more generally, any function that operates on a processing set), I need to configure my Dask environment as a first step? Could this go in a configuration file so I only have to do it once? I also see there is an estimate function for suggested resources; could it be run and applied behind the scenes so that users would not have to worry about it for every function (I imagine the estimated resources are function/task-dependent)? Sorry, this may not be related to the MSv4 schema but is a more general question about using xarray and Dask in the new era.
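
(For what it's worth, Dask itself supports one-time configuration, either through YAML files under ~/.config/dask/ or programmatically; a minimal sketch, independent of xradio:)

"""
import dask

# defaults can also be set once in a YAML file under ~/.config/dask/,
# which Dask reads automatically at import time; dask.config.set is the
# programmatic equivalent
dask.config.set({"distributed.worker.memory.target": 0.8})
print(dask.config.get("distributed.worker.memory.target"))
"""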

@FedeMPouzols
Collaborator

Hi @kswang1029: the estimate function is specifically for the MSv2 => MSv4 convert function. The Dask environment does not need to be re-configured for different functions; the tutorials and demo just convert a dataset and do a few simple operations, which might give the wrong impression. You would normally configure the Dask environment only once per data manipulation/reduction session.
The convert function can require a significant amount of memory, and the estimate function was introduced recently so that you can get an approximate idea of the requirements beforehand.
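
A sketch of that session pattern (the estimate function's actual name and signature are not given in this thread, so estimate_conversion_resources below is a hypothetical stand-in; check the xradio API docs for the real one):

"""
from dask.distributed import Client

# configure Dask once, at the start of the reduction session
client = Client(n_workers=4, memory_limit="8GB")

# hypothetical stand-in for the resource-estimate function mentioned above;
# run it before converting to gauge memory/core requirements
# estimate = estimate_conversion_resources("./combined.ms")

# ...then run convert_msv2_to_processing_set and any later processing-set
# operations against this same client, with no reconfiguration in between
"""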
