convert_msv2_to_processing_set performance #321
@kswang1029: yes, the convert function can run in parallel (see the …
@FedeMPouzols Thanks for pointing out the tutorials; I will give them a try. So it seems the parallelism is environment-dependent (i.e., Dask). Does this imply that every time I run the convert function (as an example here; in general, any function applied to a processing set), I will need to configure my Dask environment as a first step? Could there be a configuration file so that I only need to do this once? I see there is an estimate function for suggested resources. Could it be run and applied behind the scenes so that users would not have to worry about it every time, and for every other function (I imagine the estimated resources are function/task-dependent)? Sorry, this might not be related to the MSv4 schema; it is a more general question about using xarray and Dask in the new era.
Hi @kswang1029: the estimate function is specifically for the MSv2 => MSv4 convert function. The Dask environment does not need to be re-configured for different functions. In the tutorials and demo they just convert a dataset and do a few simple operations, so that might give the wrong impression. You would normally configure the Dask environment only once per data manipulation/reduction session.
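To illustrate the "configure once per session" point, here is a minimal sketch of setting up a local Dask cluster with `dask.distributed`. The worker counts and memory limit are illustrative assumptions, not recommended values, and this is not xradio-specific code:

```python
# Minimal sketch: start a local Dask cluster once at the start of a
# session; subsequent Dask-aware calls pick up the active client.
# Worker/thread/memory values are illustrative, not recommendations.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=2,
    processes=False,  # in-process threads; keeps this sketch lightweight
    memory_limit="4GB",
)
client = Client(cluster)
print(client.dashboard_link)  # monitor task progress in the browser
```

After this, any function built on Dask can distribute its work across the configured workers without further setup in the same session.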
I ran a test with two Measurement Sets (MSv2) of different sizes and found that the MSv2-to-MSv4 conversion time does not appear to scale linearly with file size.
"""
# this is a SV data W51 band1
msv2_name = "./W51_Band1_UncalibratedData/uid___A002_X1048ed8_X4af7.ms"
convert_out = "./W51_Band1_UncalibratedData/uid___A002_X1048ed8_X4af7.vis.zarr"
convert_msv2_to_processing_set(
in_file=msv2_name,
out_file=convert_out,
overwrite=True,
)
# convert_msv2_to_processing_set is single-threaded, could it be parallelized?
# file size of uid___A002_X1048ed8_X4af7.ms is 10.65 GB
# elapsed time = 3.5018158674240114 min...
# output uid___A002_X1048ed8_X4af7.vis.zarr is 5.63 GB
# [2024-11-20 14:44:14,549] INFO viperlog: Partition scheme that will be used: ['DATA_DESC_ID', 'OBS_MODE', 'OBSERVATION_ID', 'FIELD_ID']
# [2024-11-20 14:44:15,407] INFO viperlog: Number of partitions: 155
"""
"""
# this is a SV data W51 band1 with two ms combined via casa-concat task
msv2_name = "./W51_Band1_UncalibratedData/combined.ms"
convert_out = "./W51_Band1_UncalibratedData/combined.vis.zarr"
convert_msv2_to_processing_set(
in_file=msv2_name,
out_file=convert_out,
overwrite=True,
)
# file size of combined.ms is 23.55 GB
# elapsed time = 9.483368102709452 min...
# output combined.vis.zarr is 12.34 GB
# [2024-11-20 14:51:44,953] INFO viperlog: Partition scheme that will be used: ['DATA_DESC_ID', 'OBS_MODE', 'OBSERVATION_ID', 'FIELD_ID']
# [2024-11-20 14:51:48,699] INFO viperlog: Number of partitions: 310
# the elapsed time seems not linearly scaled with the ms file size. It gets slower.
"""
Is this expected behavior? In addition, can the conversion be parallelized to improve performance?