Using Awkward Arrays at some stage of Argo data processing? #210
Hi @jpivarski

First, I should note that the xarray Dataset is so intimately used by argopy as its internal data model that changing it would require a huge amount of work, which I think could only be motivated by a dramatic improvement in performance. And as of now, the performance bottleneck is on the server side, so using a new cloud-native format to access Argo data is the way to go, I guess. That's also why I would be very curious to see how you managed to convert the full dataset into a Parquet file (which, by the way, I could not access); would you be willing to share your code for doing this?

Second, if I understand a little of what Awkward Arrays are made for, I can see the added value for manipulating Argo float trajectories. This would be close to the Lagrangian analysis you're pointing at. As a first step, I imagine the internal possibility of converting an Argo index into an Awkward Array with each float's coordinates (and possibly other metadata). This Awkward Array could then be used to provide more data-selection mechanisms (like the shore distance you mention above), but also the much-awaited path-specific selection (e.g. #169).
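As a rough illustration of that first step, here is a minimal sketch of turning an argopy index into a per-float Awkward Array. The region box is only an example, and the column names (wmo, latitude, longitude, date) are assumptions about what the index DataFrame exposes in a given argopy version:

```python
import numpy as np
import awkward as ak
from argopy import IndexFetcher as ArgoIndexFetcher

# Fetch an Argo index with argopy's existing machinery; the box is only an example.
df = ArgoIndexFetcher().region([-75, -45, 20, 30, "2020-01", "2021-01"]).to_dataframe()

# Sort by float identifier so each float's entries are contiguous,
# then count how many index entries each float has.
df = df.sort_values("wmo")
wmos, counts = np.unique(df["wmo"].to_numpy(), return_counts=True)

# One record per float: its WMO number and a variable-length list of positions.
trajectories = ak.zip(
    {
        "wmo": wmos,
        "positions": ak.unflatten(
            ak.zip(
                {
                    "latitude": df["latitude"].to_numpy(),
                    "longitude": df["longitude"].to_numpy(),
                    "date": df["date"].to_numpy(),
                }
            ),
            counts,
        ),
    },
    depth_limit=1,
)

# Selections can then be expressed per float, e.g. floats whose whole
# trajectory stays north of 25°N:
north_only = trajectories[ak.all(trajectories.positions.latitude > 25, axis=1)]
```

Keeping one record per float is what makes whole-trajectory predicates (like the ak.all above) one-liners; a flat table with a wmo column would need a groupby for the same question.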
Hi @jpivarski
Yes, No and Yes. "get data before this processing step". Indeed the DAC
If you're still up for this, I guess a good start would be to work with the multi-profile files under:
This would be awesome to try!
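As a quick way to see why the multi-profile files are a natural starting point, here is a minimal sketch (the file name is hypothetical; any <WMO>_prof.nc downloaded from the GDAC would do):

```python
import xarray as xr

# Hypothetical local copy of a multi-profile file; substitute any
# <WMO>_prof.nc downloaded from the GDAC.
ds = xr.open_dataset("6902746_prof.nc")

# Variables like PRES, TEMP, and PSAL are 2-D (N_PROF x N_LEVELS) and are
# padded with fill values wherever a profile is shorter than the longest one,
# which is exactly the ragged structure an Awkward Array stores without padding.
print(ds.dims)
print(ds["PRES"].shape)
```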
Hi, I'm the author of Awkward Array, a library for arrays with heterogeneous types and shapes (GitHub). We've been using it in Nuclear and High Energy Physics (NHEP) for a few years now, and I've been thinking about other application domains to broaden its usefulness.
For example, it was by finding an application in radio astronomy that we realized we needed to support complex numbers, and it was through feedback from a data scientist that we realized we needed date-times. (The date-times then ended up being useful for Argo time data, too, as you'll see below.)
Half a year ago, I came across Nicolas Mortimer's article about trying to include Argo data in Pangeo and Zarr, but the trouble was that Argo data have variable-length lists (heterogeneous in size). Awkward Arrays are designed for that sort of thing, and we're in discussion about how to incorporate it into Zarr.
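To make "variable-length lists" concrete, here's a toy ragged array (the numbers are made up, not real Argo data):

```python
import awkward as ak

# Three profiles with different numbers of pressure levels. In a rectangular
# NetCDF/xarray layout these would be padded to the longest profile; as an
# Awkward Array they are stored as variable-length lists with no padding.
pres = ak.Array([
    [5.1, 10.2, 20.4, 50.3],
    [4.9, 9.8],
    [5.0, 10.1, 19.9],
])

print(ak.num(pres))            # [4, 2, 3]: levels per profile
print(ak.mean(pres, axis=1))   # per-profile mean pressure, no NaN bookkeeping
```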
However, I also noticed that the entire Argo dataset is rather small. A full set of NetCDF files from 1997 through 2021 is 135 GB (downloaded from ftp.ifremer.fr), and saving all the fields that argopy uses in "expert mode" results in a single 7.0 GB Parquet file. I made a copy of this file available over HTTP and at s3://pivarski-princeton/argo-floats-expert.parquet, though it should be understood that this is just a snapshot: I can see from the NetCDF file dates that the old data are occasionally rewritten (maybe to update the "adjusted" pressure, temperature, and salinity values with new calibrations?).

In this gist, I made a demonstration of exploring the Argo data with Awkward Array. I know that the xarray interface is the most familiar to Argo users; the point here is to show what new things you can do if you can address the entire dataset as one array with nested structure. argopy has data fetchers for probes by ID and by longitude/latitude/time boxes, but suppose you wanted to select all data by distance from shore:
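The gist contains the actual demonstration; what follows is only a rough sketch of the shape such a selector can take, with placeholder field names and a toy "shoreline" (not the schema or code from the gist):

```python
import awkward as ak
import numba as nb
import numpy as np

# Load the Parquet snapshot. The field names used below (latitude, longitude)
# are placeholders for whatever the actual schema calls them.
argo = ak.from_parquet("argo-floats-expert.parquet")

# A stand-in "shoreline": a handful of (lat, lon) points. A real selector
# would use a proper coastline dataset.
SHORE = np.array([[43.0, -70.0], [35.0, -75.0], [25.0, -80.0]])

@nb.njit
def min_shore_distance(lat, lon, shore):
    # Crude flat-earth distance in degrees, just to show the Numba pattern.
    best = np.inf
    for k in range(len(shore)):
        d = (lat - shore[k, 0]) ** 2 + (lon - shore[k, 1]) ** 2
        if d < best:
            best = d
    return np.sqrt(best)

@nb.njit
def far_from_shore(profiles, shore, threshold, mask_out):
    # Awkward Arrays can be iterated inside Numba-compiled functions,
    # with record fields accessed as attributes.
    for i in range(len(profiles)):
        p = profiles[i]
        mask_out[i] = min_shore_distance(p.latitude, p.longitude, shore) > threshold

mask = np.zeros(len(argo), dtype=np.bool_)
far_from_shore(argo, SHORE, 5.0, mask)
selected = argo[mask]
```

Numba sees the Awkward Array's ragged record structure directly, so the selection loop stays in compiled code without flattening or padding anything first.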
(I'm making something up; I don't actually know what new kinds of selectors would be useful.) Using Numba to accelerate the "distance from shore" calculation and not much else (single-threaded, because I haven't used Awkward Array's Dask hooks yet), this run over all of the Argo data, producing xarray data with all of the "standard" fields, took 7 minutes.
Oceanography is new to me (my background is in NHEP), but I'm working with @philippemiron on a similar project: he included a test of Awkward Array in a demonstration of Lagrangian analysis for the EarthCube Annual Meeting. I'd like to know whether any of this could be useful for Argo analysis, either visibly at the end-user stage or hidden in a workflow, such as a service that provides novel data fetchers. (For the latter, there's a lot of room for improvement on the 7-minute test above: not just Dask, but a computer with 64 GB of RAM could comfortably hold all of the standard fields uncompressed in memory, and 128 or 256 GB could process all of the expert data.)
Let me know—thanks!