convert_msv2_to_processing_set performance #321

Open

kswang1029 opened this issue Nov 23, 2024 · 3 comments

@kswang1029
I tested the conversion on two measurement sets (MSv2) of different sizes and found that the MSv2-to-MSv4 conversion time does not appear to scale linearly with input size.

"""
# this is a SV data W51 band1
msv2_name = "./W51_Band1_UncalibratedData/uid___A002_X1048ed8_X4af7.ms"
convert_out = "./W51_Band1_UncalibratedData/uid___A002_X1048ed8_X4af7.vis.zarr"
convert_msv2_to_processing_set(
in_file=msv2_name,
out_file=convert_out,
overwrite=True,
)
# convert_msv2_to_processing_set is single-threaded, could it be parallelized?
# file size of uid___A002_X1048ed8_X4af7.ms is 10.65 GB
# elapsed time = 3.5018158674240114 min...
# output uid___A002_X1048ed8_X4af7.vis.zarr is 5.63 GB
# [2024-11-20 14:44:14,549] INFO viperlog: Partition scheme that will be used: ['DATA_DESC_ID', 'OBS_MODE', 'OBSERVATION_ID', 'FIELD_ID']
# [2024-11-20 14:44:15,407] INFO viperlog: Number of partitions: 155

"""

"""
# this is a SV data W51 band1 with two ms combined via casa-concat task

msv2_name = "./W51_Band1_UncalibratedData/combined.ms"
convert_out = "./W51_Band1_UncalibratedData/combined.vis.zarr"
convert_msv2_to_processing_set(
in_file=msv2_name,
out_file=convert_out,
overwrite=True,
)
# file size of combined.ms is 23.55 GB
# elapsed time = 9.483368102709452 min...
# output combined.vis.zarr is 12.34 GB
# [2024-11-20 14:51:44,953] INFO viperlog: Partition scheme that will be used: ['DATA_DESC_ID', 'OBS_MODE', 'OBSERVATION_ID', 'FIELD_ID']
# [2024-11-20 14:51:48,699] INFO viperlog: Number of partitions: 310

# the elapsed time seems not linearly scaled with the ms file size. It gets slower.

"""

Is this expected behavior? In addition, can the conversion be parallelized to improve performance?

@FedeMPouzols
Collaborator

@kswang1029: yes, the convert function can run in parallel (see the parallel parameter). For that you need to initialize a Dask client first; for simple examples, see the local_client used in https://github.com/casangi/xradio/blob/main/demo/demo.ipynb or in https://github.com/casangi/xradio/blob/main/docs/source/measurement_set/tutorials/ps_vis.ipynb.
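
For reference, a minimal sketch of a parallel conversion. It assumes a plain dask.distributed local client (the linked tutorials use a local_client convenience wrapper instead), and the import path of the convert function may differ between xradio versions:

"""
from dask.distributed import Client  # standard Dask distributed client
# import path is version-dependent; adjust to your xradio release
from xradio.measurement_set import convert_msv2_to_processing_set

# start a local Dask cluster once per session; worker counts and memory
# limits here are illustrative, not recommendations
client = Client(n_workers=4, threads_per_worker=1, memory_limit="8GB")

convert_msv2_to_processing_set(
    in_file="./W51_Band1_UncalibratedData/combined.ms",
    out_file="./W51_Band1_UncalibratedData/combined.vis.zarr",
    parallel=True,  # distribute the per-partition conversion over the workers
    overwrite=True,
)
"""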

@kswang1029
Author

@FedeMPouzols Thanks for pointing out the tutorials; I will give them a try. So the behavior is environment-dependent (i.e., on Dask). Does this mean that every time I run the convert function (and, more generally, any function that operates on a processing set), I need to configure my Dask environment as a first step? Could this go in a configuration file so I only have to do it once? I also see there is an estimate function for suggested resources; could it be run and applied behind the scenes so that users would not have to worry about it for every function (I imagine the estimated resources are function/task-dependent)? Sorry, this may not be related to the MSv4 schema but is a more general question about using xarray and Dask in the new era.
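
(For what it's worth, Dask itself supports one-time configuration, either through YAML files under ~/.config/dask/ or programmatically; a minimal sketch, independent of xradio:)

"""
import dask

# defaults can also be set once in a YAML file under ~/.config/dask/,
# which Dask reads automatically at import time; dask.config.set is the
# programmatic equivalent
dask.config.set({"distributed.worker.memory.target": 0.8})
print(dask.config.get("distributed.worker.memory.target"))
"""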

@FedeMPouzols
Collaborator

Hi @kswang1029: the estimate function is specifically for the MSv2 => MSv4 convert function. The Dask environment does not need to be re-configured for different functions; the tutorials and demo just convert a dataset and do a few simple operations, which might give the wrong impression. You would normally configure the Dask environment only once per data manipulation/reduction session.
The convert function can require a significant amount of memory, and the estimate function was introduced recently so that you can get an approximate idea of the requirements beforehand.
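
A sketch of that session pattern (the estimate function's actual name and signature are not given in this thread, so estimate_conversion_resources below is a hypothetical stand-in; check the xradio API docs for the real one):

"""
from dask.distributed import Client

# configure Dask once, at the start of the reduction session
client = Client(n_workers=4, memory_limit="8GB")

# hypothetical stand-in for the resource-estimate function mentioned above;
# run it before converting to gauge memory/core requirements
# estimate = estimate_conversion_resources("./combined.ms")

# ...then run convert_msv2_to_processing_set and any later processing-set
# operations against this same client, with no reconfiguration in between
"""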
