Using Awkward Arrays at some stage of Argo data processing? #210
Hi @jpivarski

First, I should note that the xarray Dataset is so intimately used by argopy as its internal data model that changing it would require a huge amount of work, which I think could only be motivated by a dramatic improvement in performance. And as of now, the performance bottleneck is on the server side, so using a new cloud-native format to access Argo data is the way to go, I guess. That's also why I would be very curious to see how you managed to convert the full dataset into a Parquet file (which, by the way, I could not access); would you be willing to share your code for doing this?

Second, if I understand a little of what Awkward Arrays are made for, I can see the added value for manipulating Argo float trajectories. This would be close to the Lagrangian analysis you're pointing at. As a first step, I imagine the internal possibility of converting an Argo index into an Awkward Array with each float's coordinates (and possibly other metadata). This Awkward Array could then be used to provide more data-selection mechanisms (like the shore distance you mention above), but also the much-awaited path-specific selection (e.g. #169).
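As a rough illustration of that first step, here is a minimal sketch of turning an argopy index into a per-float Awkward Array. The region box is only an example, and the column names (wmo, latitude, longitude, date) are assumptions about what the index DataFrame exposes in a given argopy version:

```python
import numpy as np
import awkward as ak
from argopy import IndexFetcher as ArgoIndexFetcher

# Fetch an Argo index with argopy's existing machinery; the box is only an example.
df = ArgoIndexFetcher().region([-75, -45, 20, 30, "2020-01", "2021-01"]).to_dataframe()

# Sort by float identifier so each float's entries are contiguous,
# then count how many index entries each float has.
df = df.sort_values("wmo")
wmos, counts = np.unique(df["wmo"].to_numpy(), return_counts=True)

# One record per float: its WMO number and a variable-length list of positions.
trajectories = ak.zip(
    {
        "wmo": wmos,
        "positions": ak.unflatten(
            ak.zip(
                {
                    "latitude": df["latitude"].to_numpy(),
                    "longitude": df["longitude"].to_numpy(),
                    "date": df["date"].to_numpy(),
                }
            ),
            counts,
        ),
    },
    depth_limit=1,
)

# Selections can then be expressed per float, e.g. floats whose whole
# trajectory stays north of 25°N:
north_only = trajectories[ak.all(trajectories.positions.latitude > 25, axis=1)]
```

Keeping one record per float is what makes whole-trajectory predicates (like the ak.all above) one-liners; a flat table with a wmo column would need a groupby for the same question.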
Hi @jpivarski
Yes, No and Yes. "get data before this processing step". Indeed the DAC
If you're still up for this, I guess a good start would be to work with the multi-profile files under:
This would be awesome to try!
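As a quick way to see why the multi-profile files are a natural starting point, here is a minimal sketch (the file name is hypothetical; any <WMO>_prof.nc downloaded from the GDAC would do):

```python
import xarray as xr

# Hypothetical local copy of a multi-profile file; substitute any
# <WMO>_prof.nc downloaded from the GDAC.
ds = xr.open_dataset("6902746_prof.nc")

# Variables like PRES, TEMP, and PSAL are 2-D (N_PROF x N_LEVELS) and are
# padded with fill values wherever a profile is shorter than the longest one,
# which is exactly the ragged structure an Awkward Array stores without padding.
print(ds.dims)
print(ds["PRES"].shape)
```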
Hi, I'm the author of Awkward Array, a library for arrays with heterogeneous types and shapes (GitHub). We've been using it in Nuclear and High Energy Physics (NHEP) for a few years now, and I've been thinking about other application domains to broaden its usefulness.
For example, it was by finding an application in radio astronomy that we realized we needed to support complex numbers, and it was through feedback from a data scientist that we realized we needed date-times. (The date-times then ended up being useful for Argo time data, too, as you'll see below.)
Half a year ago, I came across Nicolas Mortimer's article about trying to include Argo data in Pangeo and Zarr, but the trouble was that Argo data have variable-length lists (heterogeneous in size). Awkward Arrays are designed for that sort of thing, and we're in discussion about how to incorporate it into Zarr.
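To make "variable-length lists" concrete, here's a toy ragged array (the numbers are made up, not real Argo data):

```python
import awkward as ak

# Three profiles with different numbers of pressure levels. In a rectangular
# NetCDF/xarray layout these would be padded to the longest profile; as an
# Awkward Array they are stored as variable-length lists with no padding.
pres = ak.Array([
    [5.1, 10.2, 20.4, 50.3],
    [4.9, 9.8],
    [5.0, 10.1, 19.9],
])

print(ak.num(pres))            # [4, 2, 3]: levels per profile
print(ak.mean(pres, axis=1))   # per-profile mean pressure, no NaN bookkeeping
```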
However, I also noticed that the entire Argo dataset is rather small. A full set of NetCDF files from 1997 through 2021 is 135 GB (downloaded from ftp.ifremer.fr), and saving all the fields that argopy uses in "expert mode" results in a single 7.0 GB Parquet file. I made a copy of this file available over HTTP and at s3://pivarski-princeton/argo-floats-expert.parquet, though it should be understood that this is just a snapshot: I can see from the NetCDF file dates that the old data are occasionally rewritten (maybe to update the "adjusted" pressure, temperature, and salinity values with new calibrations?).

In this gist, I made a demonstration of exploring the Argo data with Awkward Array. I know that the xarray interface is the most familiar to Argo users; the point here is to show what new things you can do if you can address the entire dataset as one array with nested structure. argopy has data fetchers for probes by ID and by longitude/latitude/time boxes, but suppose you wanted to select all data by distance from shore:
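The gist contains the actual demonstration; what follows is only a rough sketch of the shape such a selector can take, with placeholder field names and a toy "shoreline" (not the schema or code from the gist):

```python
import awkward as ak
import numba as nb
import numpy as np

# Load the Parquet snapshot. The field names used below (latitude, longitude)
# are placeholders for whatever the actual schema calls them.
argo = ak.from_parquet("argo-floats-expert.parquet")

# A stand-in "shoreline": a handful of (lat, lon) points. A real selector
# would use a proper coastline dataset.
SHORE = np.array([[43.0, -70.0], [35.0, -75.0], [25.0, -80.0]])

@nb.njit
def min_shore_distance(lat, lon, shore):
    # Crude flat-earth distance in degrees, just to show the Numba pattern.
    best = np.inf
    for k in range(len(shore)):
        d = (lat - shore[k, 0]) ** 2 + (lon - shore[k, 1]) ** 2
        if d < best:
            best = d
    return np.sqrt(best)

@nb.njit
def far_from_shore(profiles, shore, threshold, mask_out):
    # Awkward Arrays can be iterated inside Numba-compiled functions,
    # with record fields accessed as attributes.
    for i in range(len(profiles)):
        p = profiles[i]
        mask_out[i] = min_shore_distance(p.latitude, p.longitude, shore) > threshold

mask = np.zeros(len(argo), dtype=np.bool_)
far_from_shore(argo, SHORE, 5.0, mask)
selected = argo[mask]
```

Numba sees the Awkward Array's ragged record structure directly, so the selection loop stays in compiled code without flattening or padding anything first.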
(I'm making something up; I don't actually know what new kinds of selectors would be useful.) Using Numba to accelerate the "distance from shore" calculation and not much else (single-threaded, because I haven't used Awkward Array's Dask hooks yet), this run over all of the Argo data, producing xarray data with all of the "standard" fields, took 7 minutes.
Oceanography is new to me (my background is in NHEP), but I'm working with @philippemiron on a similar project: he included a test of Awkward Array in a demonstration of Lagrangian analysis for the EarthCube Annual Meeting. I'd like to know whether any of this could be useful for Argo analysis, either visibly at the end-user stage or hidden in a workflow, such as a service that provides novel data fetchers. (For the latter, there's a lot of room for improvement on the 7-minute test above: not just Dask, but a computer with 64 GB of RAM could comfortably hold all of the standard fields uncompressed in memory, and 128 or 256 GB could process all of the expert data.)
Let me know—thanks!