Rethinking the infamous standard_names #401
Replies: 7 comments 16 replies
-
Thanks for bringing this up -- I do think more "structure" to the standard names would be helpful. I have also noted, in some of the discussions of new standard names, that sometimes none of the CF committee folks involved really have the expertise to evaluate the name in a particular field. For the most part, we trust the proposer, but that may not always be adequate. However, one problem is that it's hard to find people with the time, interest and expertise for this kind of work, so it may be a challenge to flesh out the committees for each of the "specific communities". Who would do the work?
Perhaps an incremental solution would be to formalize the "categories" of standard names more fully. Currently they are: Atmospheric Chemistry, Atmosphere Dynamics, Carbon Cycle, Cloud, Hydrology, Ocean Dynamics, Radiation, Sea Ice, and Surface (which, I note, aren't quite aligned with disciplines at the moment). Once that more formal structure is established, it would be easier to change the governance for some or all of the categories.
Also, is there any precedent for a community coming up with a set of standard names for a particular field, and either maintaining it independently or proposing it en masse to the CF standard name table? I ask because I have started such an effort for the oil spill modeling community. We are working on a set of standard names for the results from oil spill models. Some of them are quite field-specific, so it may not make sense to manage them the same way as all the other standard names.
-
I don't disagree, but I'm not sure I agree either. There are definitely things I would want to do differently if we were (re-)designing the standard_name system from scratch, and I think there's a lot of appeal to the idea of distinguishing between the fundamental data type and the data context. However, the more I think about it, the more complicated it gets. There are plenty of cases where it's easy to separate things that way, but there are also a lot where things are not as clear. Ditto with, say, incoming solar radiation at TOA and the irradiance of a wildfire. They're both radiative energy fluxes, but are they really the same thing? And energy is energy; you can combine enthalpy and potential energy into dry static energy; is enthalpy diffusion the same thing as heat advection? In one sense, it's all energy moving from point A to point B, but in another, those are very different things. I don't think there's a single right answer to the question of what the fundamental data types are, and that's what makes ontology difficult and messy. Everything is context-dependent, which is how we have ended up where we are today. (Or if there is a single answer, it's already captured in the
-
But all of that aside, probably the bigger barrier is backwards compatibility. One of the principles of CF is not to invalidate the existing data that's out there, so if you want a sweeping redesign of the standard_name system to get any traction, you'll need to figure out a way to do it that handles that problem.
-
CF has become, as far as I can tell, the de facto standard for anything climate-related, which does cross a lot of disciplines. I'm coming at this from an observational oceanographic perspective, and having both models and observations use a common data format is basically a miracle. This miracle is enabled, I feel, by the very issues that are pointed out. I want the centralization of all this information within CF and its controlled vocabularies and governance. It brings the communities together around something that is just "good enough" even if not "community specific". That is, the situation of every community disliking a common format is much better than communities developing their own formats; in that case CF as a convention might as well not exist. Standard names are not trying to be an ontology.
-
I also agree with Andrew. The three issues given in the original post also act as advantages to others, particularly those who regularly use data created by other people. Just to note that most standard names do follow a general pattern: `[surface] [component] base_quantity [at surface] [in medium] [due to process] [assuming condition]` (see, for instance, "CF Standard Names: Current status and Steps toward interoperability in the environmental sciences" and "Towards an explorer tool for visualisation of grammatical patterns in the CF Standard Names through decomposition into n-grams").
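To make the pattern above concrete, here is a minimal sketch of how a name might be decomposed along it. It is only an illustration: the `SURFACES` and `COMPONENTS` word lists, the `decompose` function and the field names are assumptions made for this example, not part of any CF tooling, and real standard names have many exceptions that a single regular expression will not capture.

```python
import re

# Hypothetical, deliberately incomplete vocabularies for the optional prefixes.
SURFACES = ("surface", "toa", "tropopause")
COMPONENTS = ("upward", "downward", "northward", "eastward")

# [surface] [component] base_quantity [at surface] [in medium]
# [due to process] [assuming condition]
PATTERN = re.compile(
    r"^(?:(?P<surface>" + "|".join(SURFACES) + r")_)?"
    r"(?:(?P<component>" + "|".join(COMPONENTS) + r")_)?"
    r"(?P<base_quantity>.+?)"
    r"(?:_at_(?P<at_surface>.+?))?"
    r"(?:_in_(?P<medium>.+?))?"
    r"(?:_due_to_(?P<process>.+?))?"
    r"(?:_assuming_(?P<condition>.+))?$"
)


def decompose(name):
    """Return the pattern slots that are present in a standard name."""
    match = PATTERN.match(name)
    if match is None:  # only happens for an empty string
        return {}
    return {slot: text for slot, text in match.groupdict().items() if text is not None}


print(decompose("air_pressure_at_cloud_base"))
print(decompose("surface_upward_sensible_heat_flux"))
print(decompose("tendency_of_air_temperature_due_to_convection"))
```

For `air_pressure_at_cloud_base` this yields `air_pressure` as the base quantity and `cloud_base` as the "at surface" qualifier. Anything the optional slots do not claim ends up in `base_quantity`, so a name that departs from the pattern is returned largely undivided.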
-
Dear all,
The concept and construction of standard names has certainly been discussed more than once before, but I don't have a note of which old emails or Trac tickets it appears in. In my presentation at the CF workshop in 2018, I tried to answer some questions and objections about the design of standard names. Here are some points from that presentation:
The first point is most relevant to Arlindo's initial comment. I think it's an advantage and an intention of standard names that they should all be self-explanatory to a geoscientist of any field, who isn't necessarily a domain expert. At least then you can tell what the variable is about, even if you don't fully understand it.
One reason for this is interoperability. CF is used for datasets from many disciplines, and some individual datasets are of relevance to many, e.g. the output from Earth system models. To make this efficient, or even possible, we all have to speak the same language. That is why standard names are a centrally controlled vocabulary. If they were delegated to separate communities, it is inevitable that some standard names would be jargon or shorthand familiar in the relevant field but incomprehensible or ambiguous outside it, and some quantities would be named in more than one field, probably with different names. They wouldn't be "standard" names that serve the CF purpose of allowing users to decide which variables are comparable in datasets from different sources.
The cost of this policy is that deciding on standard names is sometimes hard. It involves people from various fields understanding what the quantity is, and describing it in terms anyone can understand. But with this design, that cost arises only once. If we had standard names which were not generally self-explanatory, then many users would pay the price of trying to reconcile different vocabularies.
The design of standard names also follows the CF principle of addressing present use-cases, rather than trying to anticipate them. It aims to be "good enough", as @DocOtak said. Hence we don't have to address the philosophical question of what is a "fundamental" quantity, such as Seth mentioned. We decide as we go along whether two things are the same or not.
I agree, and we've often discussed, that it would be useful to provide tools that make it easier to propose new names, in the simpler cases at least. It's easy when a new name uses existing vocabulary and patterns, and usually easy if we just need new vocabulary. There is much more consistency in construction than the guidelines alone suggest, because we follow precedent. For example, a long time (14 years) ago, I did an analysis of the standard names of the time into lexicon and syntax. Maybe machine learning could help with this task.
Best wishes
Jonathan
-
Maybe this is a "yes-and" situation. Making it easier to extend standard names does not have to mean distributing the responsibility for approving them. If we accept that additional terms are being created in CF-ish vocabularies, it's easy to imagine (maybe even implement) an API to receive those names and represent them as proposals. They clearly have a use case, and the API could enforce existence tests on required attributes. In an ideal world, the process could identify a matching pattern, or suggest (to submitters or curators) possible patterns to follow instead of what's submitted, including pre-filling predefined text. I'm pretty sure an LLM could do this well, given a list of existing patterns or just some training on existing names. In the end curation would still be needed, maybe at an unsupportable scale. But at least the process would be capturing more real-world use cases.
John
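As a rough sketch of what the intake step of such an API might check, the snippet below tests that required attributes exist, checks the name's syntax, and suggests similar existing names for the submitter or curator to compare against. Everything here is an assumption for illustration: the `review_proposal` function, the `REQUIRED_ATTRIBUTES` list, the syntax rule and the three-entry `EXISTING_NAMES` sample are hypothetical, and plain string similarity stands in for the pattern matching (or LLM) suggested above.

```python
import difflib
import re

# Assumed required attributes for a proposal (hypothetical list).
REQUIRED_ATTRIBUTES = ("name", "canonical_units", "description")

# Tiny illustrative sample; a real service would load the full table.
EXISTING_NAMES = ["air_pressure", "air_pressure_at_cloud_base", "air_temperature"]

# Assumed syntax rule: lower-case words joined by single underscores.
NAME_SYNTAX = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def review_proposal(proposal, existing=EXISTING_NAMES):
    """Run existence and syntax checks and suggest similar existing names."""
    problems = [
        f"missing required attribute: {attr}"
        for attr in REQUIRED_ATTRIBUTES
        if not proposal.get(attr)
    ]
    name = proposal.get("name", "")
    similar = []
    if name:
        if not NAME_SYNTAX.match(name):
            problems.append("name does not follow the assumed syntax rule")
        if name in existing:
            problems.append("name already exists")
        # Closest existing names first, for a curator to compare against.
        similar = difflib.get_close_matches(name, existing, n=3, cutoff=0.6)
    return {"problems": problems, "similar_existing_names": similar}


report = review_proposal({"name": "air_pressure_at_cloud_top", "canonical_units": "Pa"})
print(report["problems"])
print(report["similar_existing_names"])
```

A check like this only filters and annotates submissions; the decision on each proposal would still rest with the curators.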
-
Topic for discussion
Perceived Issue
I hope I am not starting a holy war, but for full disclosure, I really hate the standard_names.
While there is definitely much value in the standardization of variable names, I am probably not alone in saying that the standard_names in CF are perhaps one of its most confusing aspects and the main impediment to its full adoption. While CF stands for Climate and Forecast, much of the standardization provided by CF goes beyond the climate and forecast community. Except, perhaps, for the infamous standard_names. Some of the issues are:
The current standard_names confound 2 main attributes of a variable: a) fundamental data type (pressure, temperature, relative humidity, etc.) and b) what we could broadly call data context. For example, air_pressure_at_cloud_base has pressure as the fundamental data type, and air at cloud base as the context. In other words, the characterization of a variable is at least 2D: (fundamental data type, context).
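One way to picture this two-part characterization is the small sketch below, which splits a name into a (fundamental data type, context) pair. The `FUNDAMENTAL_TYPES` tuple, the `Characterization` type and the `characterize` function are hypothetical names invented for this illustration; deciding what actually belongs in the list of fundamental types is the hard part and is not addressed here.

```python
from typing import NamedTuple, Optional

# Hypothetical, deliberately tiny list of "fundamental data types".
FUNDAMENTAL_TYPES = ("relative_humidity", "pressure", "temperature")


class Characterization(NamedTuple):
    fundamental_type: str
    context: str


def characterize(standard_name: str) -> Optional[Characterization]:
    """Split a name into (fundamental data type, context), if a known type occurs in it."""
    tokens = standard_name.split("_")
    # Try longer type names first so multi-word types are matched whole.
    for ftype in sorted(FUNDAMENTAL_TYPES, key=len, reverse=True):
        type_tokens = ftype.split("_")
        width = len(type_tokens)
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == type_tokens:
                # Whatever is left once the type is removed is the context.
                rest = tokens[:start] + tokens[start + width:]
                return Characterization(ftype, " ".join(rest))
    return None


print(characterize("air_pressure_at_cloud_base"))
```

For `air_pressure_at_cloud_base` this gives `pressure` as the fundamental data type and `air at cloud base` as the context, matching the example above.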
Suggestion: break up the standard_name
While the fundamental data types are universal and easier to standardize, the data context is better standardized by specific communities. Thus the foundational CF standard should not go beyond the fundamental data types, and be extensible to multiple contexts managed by specific communities. These would be much smaller tables. A contextual table for the Climate-Forecast community would be in order, extensible by other communities, say with a global attribute:
So, this extends the CF data contexts with the additional contexts defined by the Other-1.0 data context conventions.