
Efficiently parsing erddap server metadata #3

Open

brey opened this issue May 5, 2022 · 8 comments

@brey
Contributor

brey commented May 5, 2022

When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables. This results in a long wait depending on the volume of data.

There has to be a way to simplify/expedite this.

@ocefpaf

ocefpaf commented Jul 7, 2022

When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables.

The get_info_url method should download only the metadata; it is something like this:

import pandas as pd

# `e` is an ERDDAP instance; only the metadata table is downloaded here,
# not the data rows.
info_url = e.get_info_url(dataset_id, response="csv")

info = pd.read_csv(info_url)
info.head()
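For example, once you have that table you can slice it without touching the data at all. The column names below assume ERDDAP's standard info response ("Row Type", "Variable Name", "Attribute Name", "Value"):

# Global (dataset-level) attributes are listed under the special name "NC_GLOBAL".
global_attrs = info[info["Variable Name"] == "NC_GLOBAL"]

# Variables exposed by the dataset, taken straight from the metadata.
variables = info.loc[info["Row Type"] == "variable", "Variable Name"].tolist()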

However, that is quite "low level"; ideally we should have a "dataset-like" class that holds the metadata and loads the data lazily afterwards. We are working on a refactor to go in this direction.

With that said, I believe that libraries building on top of erddapy should use the low-level interface; the high-level one is mostly for end users.
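A rough sketch of the kind of "dataset-like" wrapper mentioned above, with hypothetical names (this is not the current erddapy API):

import pandas as pd
from erddapy import ERDDAP


class LazyDataset:
    """Hold a dataset's metadata up front; download data only on request."""

    def __init__(self, server, dataset_id, protocol="tabledap"):
        self._e = ERDDAP(server=server, protocol=protocol)
        self._e.dataset_id = dataset_id
        # Only the metadata table is fetched at construction time.
        self.metadata = pd.read_csv(self._e.get_info_url(response="csv"))

    def to_pandas(self, **kwargs):
        # The actual data is downloaded lazily, only when explicitly asked for.
        return self._e.to_pandas(**kwargs)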

@brey
Contributor Author

brey commented Sep 20, 2023

Hi @ocefpaf. I have finally come back to this issue. Thanks for the tip above, but my problem remains. Using get_info_url I get all the variables/attributes, which is fine. But then I would like to retrieve a subset of them, and I can't see how to avoid the time dimension.

Here is an example, using the EMODnet server:

from erddapy import ERDDAP
import pandas as pd

e = ERDDAP(
  server="https://erddap.emodnet-physics.eu/erddap",
  protocol="tabledap",
)
e.response = "csv"
e.dataset_id = "EMODPACE_NMDIS_PSMSL_L2A_SLEV_TG_TS"


info_url = e.get_info_url(response='csv')
info = pd.read_csv(info_url)

info['Variable Name'].unique()

info['Attribute Name'].unique()

So far so good. However, what I need is the following:

e.variables = [
    "StationName",
    "EP_PLATFORM_CODE",
    "EP_PLATFORM_TYPE",
    "EP_PLATFORM_LINK",
    "StationCountry",
    "longitude",
    "latitude",
]

If I use

df = e.to_pandas(low_memory=False)

I get all times. How can I get the above info without the time dimension?

@pmav99
Member

pmav99 commented Oct 5, 2023

@brey is this still an issue? I just tested it, and I don't see time in the returned results:

> df.head()
  StationName EP_PLATFORM_CODE EP_PLATFORM_TYPE                                   EP_PLATFORM_LINK StationCountry  longitude (degrees_east)  latitude (degrees_north)
0      Dalian           Dalian               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    121.68                     38.87
1      Kanmen           Kanmen               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    121.28                     28.08
2      Nansha           Nansha               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    112.88                      9.55
3       Xisha            Xisha               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    112.30                     16.80
4       Zhapo            Zhapo               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    111.81                     21.58

@brey
Contributor Author

brey commented Oct 5, 2023

Try

df.loc[df.EP_PLATFORM_CODE=='Xisha']

You get one entry per timestamp for each station.

@pmav99
Member

pmav99 commented Oct 5, 2023

All the rows are identical, are they not? Then maybe,

df.loc[df.EP_PLATFORM_CODE=='Xisha'].iloc[0]

might be enough?

Or maybe even:

df.groupby(df.EP_PLATFORM_CODE).first()

@brey
Contributor Author

brey commented Oct 5, 2023

I know, but that means that if another server has a longer time range, the amount of data you'll download will be quite large.

@ocefpaf

ocefpaf commented Jan 9, 2024

Sorry, this one flew under the radar, but I just found it. Maybe

df = e.to_pandas(distinct=True)

can help you there. That would return only the unique values, filtered on the server side first. The result should be similar to calling the pandas unique method after downloading.
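As far as I understand, distinct=True maps to ERDDAP's server-side &distinct() filter, so combined with the variables set earlier the whole request would look roughly like this:

e.variables = [
    "StationName",
    "EP_PLATFORM_CODE",
    "EP_PLATFORM_TYPE",
    "EP_PLATFORM_LINK",
    "StationCountry",
    "longitude",
    "latitude",
]

# Only the unique rows for these variables are computed on the server and
# transferred; the per-timestamp duplicates never leave ERDDAP.
df = e.to_pandas(distinct=True)
df.head()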


@pmav99
Member

pmav99 commented Jan 9, 2024
