Skip to content

Latest commit

 

History

History
197 lines (173 loc) · 11.4 KB

common_ntfs_rules.md

File metadata and controls

197 lines (173 loc) · 11.4 KB

Shared specifications for all converters

This document explains the shared parts among all the converters when converting a data set from a given format into a NTFS dataset.

Data prefix

The construction of NTFS objects IDs requires, for uniqueness purpose, that a unique prefix (specified for each source of data as an additional parameter to each converter) needs to be included in every object's id.

Prepending all the identifiers with a unique prefix ensures that the NTFS identifiers are unique accross all the NTFS datasets. With this assumption, merging two NTFS datasets can be done without worrying about conflicting identifiers.

This prefix should be applied to all NTFS identifiers except for the physical mode identifiers that are standardized and fixed values. Fixed values are described in the NTFS specifications

To reinforce the uniqueness some objects might have a sub-prefix in addition to their prefix.
The pattern is the following <prefix>:<sub_prefix>:<object_id>.

Adding a sub-prefix allows the merge of seasonal datasets; similar referential (e.g. networks, lines, stop areas, stop points) but different schedules (e.g. trips, dates).

The objects that may be concerned by this sub-prefix are: calendars, trips, trip_properties, frequencies, comments, comment_links, geometries, equipments (see each connector's documentation for details).

Configuration of each converter

A configuration file config.json, as it is shown below, is provided for each converter and contains additional information about the data source as well as about the upstream system that generated the data (if available). In particular, it provides the necessary information for:

  • the required NTFS files contributors.txt and datasets.txt
  • some additional metadata can also be inserted in the file feed_infos.txt.
{
    "contributor": {
        "contributor_id": "DefaultContributorId",
        "contributor_name": "DefaultContributorName",
        "contributor_license": "DefaultDatasourceLicense",
        "contributor_website": "http://www.default-datasource-website.com"
    },
    "dataset": {
        "dataset_id": "DefaultDatasetId"
    },
    "feed_infos": {
        "feed_publisher_name": "DefaultContributorName",
        "feed_license": "DefaultDatasourceLicense",
        "feed_license_url": "http://www.default-datasource-website.com",
    }
}

The objects contributor and dataset are required, containing at least the corresponding identifier (and the name for contributor), otherwise the conversion stops with an error. The object feed_infos is optional.

The files contributors.txt and datasets.txt provide additional information about the data source.

Loading Contributor

NTFS file NTFS field key in config.json Constraint Note
contributors.txt contributor_id contributor_id Required This field is prefixed.
contributors.txt contributor_name contributor_name Required
contributors.txt contributor_license contributor_license Optional
contributors.txt contributor_website contributor_website Optional

Loading Dataset

NTFS file NTFS field key in config.json Constraint Note
datasets.txt dataset_id dataset_id Required This field is prefixed.
datasets.txt contributor_id contributor_id Required This field is prefixed.
datasets.txt dataset_start_date Smallest date of all the trips of the dataset.
datasets.txt dataset_end_date Greatest date of all the trips of the dataset.

CO2 emissions and fallback modes

Physical modes may not contain CO2 emissions. If the value is missing, we are using default values (see below), mostly based on what is provided by ADEME.

Physical Mode CO2 emission (gCO2-eq/km)
Air 144.6
Boat NC
Bus 132
BusRapidTransit 84
Coach 171
Ferry 279
Funicular 3
LocalTrain 30.7
LongDistanceTrain 3.4
Metro 3
RapidTransit 6.2
RailShuttle NC
Shuttle NC
SuspendedCableCar NC
Taxi 184
Train 11.9
Tramway 4

The following fallback modes are also added to the model (they're usually not referenced by any trip).

Physical Mode CO2 emission (gCO2-eq/km)
Bike 0
BikeSharingService 0
Car 184

Common practices

The following rules apply to every converter, unless otherwise explicitly specified.

General rules

  • When one or more stop_points in the input data are not attached to a stop_area, a stop_area is automatically created for each one. The name, the coordinates, the visibility, and the timezone of the new stop_area are the same as the corresponding stop_point, the identifier is the stop_point's identifier prefixed with Navitia:.
  • If a stop_area doesn't have coordinates, the barycenter of the contained stop_points is used.
  • Unless otherwise specified, dates of service are transformed into a list of active dates as if using a single NTFS file calendar_dates.txt. Those list of dates are then transformed to calendar and calendar_dates automatically.
  • Any / character in an identifier of an object is removed.
  • If a trip doesn't have a trip_headsign, it is automatically generated based on the name of the last stop point of the trip
  • If a route doesn't have a direction_type (or empty), the direction_type "forward" is assigned by default
  • If a route doesn't have a name (or empty), name and destination_id are automatically generated:
    • the route.name is generated with the following rules:
      1. select the most frequent stop_area origin and most frequent stop_area destination of all the associated trips
      2. in case of equal frequencies, the biggest stop_areas (the most stop_points) are chosen
      3. in case of stop_area of equal sizes, the stop_area names are sorted alphabetically and the first ones are taken
      4. finally, the route.name is generated with: [name of origin's stop area] - [name of destination's stop area]
    • the route.destination_id is set (overridden if needed) with the destination's stop area selected with the above rule
  • If a line doesn't have a name (or empty), name is automatically set with the name of its first route in the forward direction (in alphabetical order)
  • If a line has an empty opening or closing times, then they are both generated.
    • the line.opening_time is generated with the smallest departure time (at the first stop) of all journeys on the lines.
    • the line.closing_time is generated with the biggest arrival time (at the last stop) of all journeys on the lines (+ 24h if the end is earlier than the start time).
    • if a line has several periods without circulation in the day, only the main one (larger and earlier) is used to define the opening and closing times.
    • lines with continuous circulation are indicated by default with an opening at 00:00 and a closing at 23:59.
  • If a trip contains stop times matching any of the following conditions, the trip is deleted and logged with a WARN:
    • two stop times with the same sequence number
    • if the arrival time of a stop time is greater that the departure time of the same stop time
    • if the departure time of a stop time is greater that the arrival time of the next stop time

Conflicting identifiers

The model will raise a critical error if identifiers of 2 objects of the same type are identical. For example:

  • if 2 datasets have the same identifier
  • if 2 lines have the same identifier
  • if 2 stop_points have the same identifier
  • if 2 stop_areas have the same identifier
  • if 2 routes have the same identifier
  • if 2 trips have the same identifier

Please note that a stop_area and a stop_point can have the same identifier because they are considered as different types of objects.

Incoherences

Dangling references are cleaned up:

  • if a transfer refers a stop which doesn't exist (from_stop_id and to_stop_id)
  • if a trip refers to a route which doesn't exist
  • if a trip refers to a commercial mode which doesn't exist
  • if a trip refers to a dataset which doesn't exist
  • if a trip refers to a company which doesn't exist
  • if a trip refers to a calendar which doesn't exist
  • if a line refers to a network which doesn't exist
  • if a line refers to a commercial mode which doesn't exist
  • if a route refers to a line which doesn't exist
  • if a stop_point refers to a stop_area which doesn't exist
  • if a dataset refers to a contributor which doesn't exist

Unnecessary objects

Objects that are not relevant are cleaned up:

  • datasets which are not referenced by trips
  • contributors which are not referenced by datasets
  • companies which are not referenced by trips
  • networks containing no line
  • lines containing no route
  • routes containing no trips
  • trips containing no stop_time or with empty calendars
  • stop_points which are not referenced by stop_times
  • stop_areas which are not referenced by stop_points or routes
  • calendars which doesn't contain any active date
  • geometries which are not referenced
  • equipments which are not referenced by stop_points
  • frequencies which are not referenced by trips
  • physical_modes which are not referenced by trips
  • commercial_modes which are not referenced by lines
  • trip_properties which are not referenced by trips
  • comments which are not referenced
  • grid_calendar which refers to a line which does not exist (through the relation in the file grid_rel_calendar_line.txt); Exception: when the line_external_code is used and the line_id is empty, the grid_calendar is kept
  • grid_exception date which refers to a grid_calendar which does not exist
  • grid_period which refers to a grid_calendar which does not exist