Stache edited this page Aug 8, 2022 · 8 revisions

This methodology assumes prior knowledge of the tool and its terminology.

📓 Note: This methodology is constantly evolving and is discussed internally at Hestia. Do not hesitate to give feedback.

One of the recurring questions when adding data semantics is which granularity to use. In other words, how precisely do you want to specify what the data is?

For example, if you take a Twitter user ID, how do you define it? As a Twitter user numeric ID? As a numeric ID? As an identifier? As an int?

In principle, any granularity can be used, but the right choice depends strongly on the intended uses, the context, and the precision required. Moreover, changing the precision partway through the process is likely to be very counter-productive.

General information

Reminder of the available fields

As a reminder, the fields to be filled in are the following:

  • descriptiveType (URL/URI): The semantic link of the value type description.
  • unique (Boolean (true/false)): Whether the value is unique or not.
  • default: Default value of the field.
  • description: Human-readable description of the field.
  • choices (List): If the values are discrete, which values can be chosen.
  • regex: Which regex can be used to recognise the field value.

Special values for fields

  • If a field is not yet known (e.g. not yet filled in), it must be set to "none".
  • If a field cannot be used (e.g. it is not relevant), it must be set to "N/A".
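
The field list and the two special values above can be pictured as a small record. The sketch below is only an illustration: the key names come from the field list, but the dict layout and the helper names `new_record` and `pending_fields` are assumptions, not the Argonodes storage format or API.

```python
# Hypothetical record layout: key names follow the field list above,
# the dict itself is an illustration and not the Argonodes format.
UNKNOWN = "none"        # not yet known / not yet filled in
NOT_APPLICABLE = "N/A"  # cannot be used / not relevant

FIELD_NAMES = ("descriptiveType", "unique", "default",
               "description", "choices", "regex")


def new_record(**known):
    """Return a record where every field not supplied is set to "none"."""
    unexpected = set(known) - set(FIELD_NAMES)
    if unexpected:
        raise KeyError(f"unknown field(s): {sorted(unexpected)}")
    return {name: known.get(name, UNKNOWN) for name in FIELD_NAMES}


def pending_fields(record):
    """List the fields still marked "none", i.e. the work left to do."""
    return [name for name in FIELD_NAMES if record[name] == UNKNOWN]


twitter_id = new_record(
    descriptiveType="https://schema.org/identifier",  # illustrative choice
    description="Numeric identifier of a Twitter user.",
    default=NOT_APPLICABLE,  # a default makes no sense for an identifier
)
todo = pending_fields(twitter_id)  # fields still to investigate
```

Keeping "none" and "N/A" as distinct values makes it possible to list what still needs attention without confusing it with what was deliberately ruled out.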

Semantic sources

The semantic sources to use are, in this order:

  • Schema.org: Generally used to add high-level semantics.
  • Wikidata: Generally used to add semantics with a finer granularity.
  • Custom Argonodes: If no sufficient semantics exists, one must be created on the Wiki (cf. ).
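
One possible reading of "in this order" is a simple precedence lookup. In the sketch below the source names come from the list above, while the function name `pick_definition`, the input mapping, and the example.org URIs are placeholders invented for illustration.

```python
# Source names are taken from the list above; everything else is hypothetical.
SOURCE_ORDER = ("Schema.org", "Wikidata", "Custom Argonodes")


def pick_definition(candidates):
    """Return (source, uri) for the first source in SOURCE_ORDER with a match.

    `candidates` maps a source name to the URI found there, if any.
    Returns None when no source offers a definition, which is the case
    where a custom one may have to be created.
    """
    for source in SOURCE_ORDER:
        uri = candidates.get(source)
        if uri:
            return source, uri
    return None


found = pick_definition({
    "Wikidata": "https://example.org/placeholder-wikidata-entry",
    "Custom Argonodes": "https://example.org/placeholder-custom-entry",
})
```

This only encodes the order of preference; whether the first match is precise enough still has to be judged against the granularity rules described under "Descriptive type".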

Workflow

To begin with, it is recommended to start by exploring the surface of the file(s) to be completed with semantics, to visualise the whole and to have a vague idea of the work to be done. In particular, some datasets contain "meta fields" that are actually redundant, and so time can be saved by not adding the same semantics twice.

It is very important to consider the power of "N/A" when adding information. "N/A" actively tells any person or machine using the Models: "I can safely ignore this field, as it will not be useful to me". For example, when exporting Models to Argonodes, an "N/A" in descriptiveType tells the tool that it can collapse the path between two nodes to optimise the Model.

In general, when in doubt, do not hesitate to use "none" rather than tagging information incorrectly or ignoring it.

Descriptive type

We aim for the highest possible level of precision beyond the basic type, giving priority to already existing semantics in order to facilitate interoperability. Semantics are chosen as narrowly as possible, on the premise that it is always possible to go up the ontology hierarchy if a concept is too narrow for the purpose, whereas the reverse is not true.

👩‍🏫 Taking the example of the Twitter user ID, the lowest level of granularity is "int" (because it is an integer), and the highest level is Twitter user numeric ID.

Absence of a higher level

If, when adding semantics, a higher level is not available, it is left to the person making the addition to judge whether it is necessary to create a new one, or not.

👩‍🏫 If we take the Twitter user ID example again, and assume that Twitter user numeric ID does not exist, it is not necessarily useful to have the extra information of "this is a Twitter ID specifically" - the level directly below, namely numeric ID, should suffice in most cases.
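
The idea of starting narrow and falling back to a broader level can be sketched as a walk up a hierarchy. The labels below come from the Twitter user ID example; the parent links and the function names are assumptions made for illustration and do not reflect an actual ontology.

```python
# Hypothetical hierarchy for the Twitter user ID example.
PARENT = {
    "Twitter user numeric ID": "numeric ID",
    "numeric ID": "identifier",
    "identifier": "int",
}


def generalisations(concept):
    """Yield the concept, then each broader concept up to the basic type."""
    while concept is not None:
        yield concept
        concept = PARENT.get(concept)


def first_available(concept, available):
    """Return the narrowest concept that exists in `available`, or None."""
    for candidate in generalisations(concept):
        if candidate in available:
            return candidate
    return None


# Assume the most specific definition does not exist in any source.
available = {"numeric ID", "identifier", "int"}
chosen = first_available("Twitter user numeric ID", available)
```

Going up is a lookup in a known chain, whereas going down from "identifier" to something narrower would require information that was never recorded, which is why the narrow choice is made first.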

Definition not compatible in a given context

If, when adding semantics, a definition does not match the need, one should first look for a matching definition in the other sources. If none is found, it is left to the person making the addition to judge whether it is necessary to create a new one.

👩‍🏫 Again with the Twitter user ID example, there are two identifier definitions, one from Schema.org, and one from Wikidata. When building the model, a selection can be made according to the needs.

📓 Note: you should always give a context when building a model (you can think of it as a "namespace" like in programming), which allows you to say which definition you are using.

Collapsing path

If an element is not useful for a Model and can be safely ignored, then it is possible to put "N/A" in place of a URI. This will indicate in the Model that the explored node is not useful and/or not used.

👩‍🏫 In Twitter, the files are created so that some contain lists which contain objects which contain lists (etc.), but only the last path is actually useful. Thus, it is not useful to tag each intermediate step as "list" or as "dict", and one can tag them as "N/A".
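
The effect of tagging intermediate steps as "N/A" can be illustrated with a plain list of path steps. Only the "N/A" convention comes from the text; the step names, the path notation, and the `collapse` function are hypothetical and are not how Argonodes represents or exports paths.

```python
# Each entry is (path step, descriptiveType). The nesting is illustrative.
annotated_path = [
    ("$", "N/A"),         # root of the file
    ("[*]", "N/A"),       # outer list
    ("tweet", "N/A"),     # wrapping object
    ("entities", "N/A"),  # nested object
    ("hashtags", "N/A"),  # inner list holder
    ("[*]", "N/A"),       # inner list
    ("text", "https://schema.org/Text"),  # the only step worth keeping
]


def collapse(path):
    """Drop every step tagged "N/A", keeping the remaining steps in order."""
    return [(step, dtype) for step, dtype in path if dtype != "N/A"]


kept = collapse(annotated_path)
```

A step tagged "none" would survive this filter, which matches the distinction made earlier: "none" means still to be investigated, while "N/A" means safe to skip.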

Description

It is easy to think, especially as a developer, that a human-readable, informal description of values is not useful. Nevertheless:

  • To follow the Argonodes philosophy, we want as many people as possible to benefit from the knowledge produced by the tool.
  • Sometimes semantics alone is not enough, and adding context for humans can be very useful.

Thus, it is requested to systematically add a human-readable description of the content of the fields, to facilitate their recognition later on. Note also that additional details, such as exploration paths or value details, can be added in the description field.

👩‍🏫 For the Twitter example, some descriptions are already provided when the data is exported - so it is easy to add descriptions.

Regex

Regexes are particularly important when creating Models for specific fields, but can sometimes safely be ignored for other, less stringent fields. It is however recommended, if time and knowledge permit, to at least add a regex to specify the type of field, ideally its length, and if possible its limitations.

In particular, when a field holds a date, we expressly ask that the date format be indicated, to save time when using the Models. Automatic date recognition can be used in the analysis step with the corresponding Applier.
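
As an illustration of why recording the date format saves time, the sketch below pairs a regex with the matching `strptime` format. The ISO-style layout, the constant names, and the `parse_date` helper are assumptions chosen for the example, not a format taken from a specific export.

```python
import re
from datetime import datetime

# Illustrative values: an ISO-style date layout is assumed here.
DATE_REGEX = r"^\d{4}-\d{2}-\d{2}$"  # candidate value for the regex field
DATE_FORMAT = "%Y-%m-%d"             # matching strptime format


def parse_date(value):
    """Return a datetime if the value follows the declared format, else None."""
    if re.fullmatch(DATE_REGEX, value) is None:
        return None  # wrong shape, e.g. "08/08/2022"
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return None  # right shape but impossible date, e.g. "2022-13-40"


parsed = parse_date("2022-08-08")
```

The regex only checks the shape of the value; the format string is what lets a consumer turn it into an actual date without guessing the order of day and month.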

👩‍🏫 For Twitter, we know that a Tweet will contain text, but the specifications are actually a little more subtle and precise. Here, we simply propose to put in regex ^.{1,280}$.
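
The proposed regex can be tried with Python's standard `re` module to see what it accepts. Using Python's regex dialect is an assumption, and the `matches` helper is defined here only for the example.

```python
import re

TWEET_REGEX = r"^.{1,280}$"  # regex proposed in the text


def matches(pattern, value, flags=0):
    """True if the whole value matches the pattern."""
    return re.fullmatch(pattern, value, flags) is not None


short_ok = matches(TWEET_REGEX, "hello")
empty_ok = matches(TWEET_REGEX, "")        # at least one character is required
long_ok = matches(TWEET_REGEX, "x" * 281)  # one character over the limit

# In Python, "." does not match a newline unless re.DOTALL is set,
# so a multi-line value is rejected by default.
multiline = "line one\nline two"
multiline_default = matches(TWEET_REGEX, multiline)
multiline_dotall = matches(TWEET_REGEX, multiline, re.DOTALL)
```

The newline behaviour is one example of the subtleties mentioned above: a regex that looks sufficient can still reject valid values, so it is worth testing it against real data before recording it.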

📓 Note: There is the FoundRegex, handy for assisting in the detection of regexes in the exploration phase.

📓 Note: regex101 and Regex Generator are pretty cool tools to help with regexes - use them!

Unique, Default, Choices

These fields are particular in the sense that it is possible to prove them false with a counterexample, but it is difficult to prove them right with a generality. Therefore, they should be added with caution, only when there are enough data points, and otherwise put "none".
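
The asymmetry between disproving and proving these fields can be sketched as follows. The threshold of 100 samples, the limit of 10 distinct values, and both function names are arbitrary assumptions for illustration; what counts as "enough data points" remains a judgment call.

```python
from collections import Counter

MIN_SAMPLES = 100  # arbitrary stand-in for "enough data points"


def infer_unique(values, min_samples=MIN_SAMPLES):
    """Return False on a duplicate, True on a large clean sample, else "none"."""
    counts = Counter(values)
    if any(n > 1 for n in counts.values()):
        return False   # a single counterexample disproves uniqueness
    if len(values) >= min_samples:
        return True    # no counterexample in a reasonably large sample
    return "none"      # too little data to claim anything


def infer_choices(values, max_choices=10, min_samples=MIN_SAMPLES):
    """Propose a choices list only when many samples share few distinct values."""
    distinct = sorted(set(values))
    if len(values) >= min_samples and len(distinct) <= max_choices:
        return distinct
    return "none"


small_sample = infer_unique([1, 2, 3])
with_duplicate = infer_unique([1, 2, 2])
```

Note that `False` can be returned from very little data, while `True` is only ever a cautious guess that a later counterexample may overturn.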
