This document describes the rules shared by all the converters when converting a dataset from a given format into an NTFS dataset.
To guarantee uniqueness, every NTFS object identifier must include a unique prefix, specified for each data source as an additional parameter to each converter.
Prepending all the identifiers with a unique prefix ensures that the NTFS identifiers are unique across all the NTFS datasets. With this assumption, merging two NTFS datasets can be done without worrying about conflicting identifiers.
This prefix should be applied to all NTFS identifiers except for the physical mode identifiers, which are standardized, fixed values. Fixed values are described in the NTFS specifications.
To reinforce the uniqueness, some objects might have a sub-prefix in addition to their prefix.
The pattern is the following: `<prefix>:<sub_prefix>:<object_id>`.
Adding a sub-prefix allows the merge of seasonal datasets, that is, datasets sharing a similar referential (e.g. `networks`, `lines`, `stop areas`, `stop points`) but having different schedules (e.g. `trips`, `dates`).
The objects that may be concerned by this sub-prefix are: `calendars`, `trips`, `trip_properties`, `frequencies`, `comments`, `comment_links`, `geometries`, `equipments` (see each connector's documentation for details).
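
As an illustration, the prefixing scheme could be sketched as follows in Python. The helper name `prefixed_id` and the sample values are hypothetical, and the `<prefix>:<object_id>` form used when no sub-prefix applies is an assumption inferred from the pattern above, not taken from the converters' code.

```python
def prefixed_id(prefix, object_id, sub_prefix=None):
    """Build '<prefix>:<sub_prefix>:<object_id>', or '<prefix>:<object_id>'
    when no sub-prefix is given (assumed form)."""
    parts = [prefix, sub_prefix, object_id] if sub_prefix else [prefix, object_id]
    return ":".join(parts)


# Two seasonal datasets from the same source share the prefix "ABC" (sample value).
winter_trip = prefixed_id("ABC", "trip1", sub_prefix="winter")
summer_trip = prefixed_id("ABC", "trip1", sub_prefix="summer")
# Referential objects such as stop areas only carry the prefix.
stop_area = prefixed_id("ABC", "SA1")
```

With distinct sub-prefixes, the two trips keep distinct identifiers after a merge, while the shared stop area resolves to the same identifier in both datasets.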
A configuration file `config.json`, as shown below, is provided for each converter and contains additional information about the data source as well as about the upstream system that generated the data (if available). In particular, it provides the necessary information for:
- the required NTFS files `contributors.txt` and `datasets.txt`
- some additional metadata that can also be inserted in the file `feed_infos.txt`.
{
  "contributor": {
    "contributor_id": "DefaultContributorId",
    "contributor_name": "DefaultContributorName",
    "contributor_license": "DefaultDatasourceLicense",
    "contributor_website": "http://www.default-datasource-website.com"
  },
  "dataset": {
    "dataset_id": "DefaultDatasetId"
  },
  "feed_infos": {
    "feed_publisher_name": "DefaultContributorName",
    "feed_license": "DefaultDatasourceLicense",
    "feed_license_url": "http://www.default-datasource-website.com"
  }
}
The objects `contributor` and `dataset` are required, containing at least the corresponding identifier (and the name for `contributor`), otherwise the conversion stops with an error. The object `feed_infos` is optional.
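
A minimal sketch of this check, in Python, might look like the following. The function name `check_config`, the error messages, and the choice to treat an empty string as missing are assumptions made for illustration; the converters may report the error differently.

```python
import json

# Required objects and, for each, the keys that must be present.
REQUIRED = {
    "contributor": ["contributor_id", "contributor_name"],
    "dataset": ["dataset_id"],
}


def check_config(config):
    """Raise ValueError if a required object or key is missing.

    Returns the optional `feed_infos` object (empty dict when absent).
    """
    for section, keys in REQUIRED.items():
        if section not in config:
            raise ValueError(f"missing required object: {section}")
        for key in keys:
            # Assumption: an empty value counts as missing.
            if not config[section].get(key):
                raise ValueError(f"missing required key: {section}.{key}")
    return config.get("feed_infos", {})


config = json.loads(
    '{"contributor": {"contributor_id": "c1", "contributor_name": "C"},'
    ' "dataset": {"dataset_id": "d1"}}'
)
feed_infos = check_config(config)  # feed_infos is absent, so this is {}
```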
The files `contributors.txt` and `datasets.txt` provide additional information about the data source.
NTFS file | NTFS field | key in config.json | Constraint | Note |
---|---|---|---|---|
contributors.txt | contributor_id | contributor_id | Required | This field is prefixed. |
contributors.txt | contributor_name | contributor_name | Required | |
contributors.txt | contributor_license | contributor_license | Optional | |
contributors.txt | contributor_website | contributor_website | Optional | |

NTFS file | NTFS field | key in config.json | Constraint | Note |
---|---|---|---|---|
datasets.txt | dataset_id | dataset_id | Required | This field is prefixed. |
datasets.txt | contributor_id | contributor_id | Required | This field is prefixed. |
datasets.txt | dataset_start_date | | | Smallest date of all the trips of the dataset. |
datasets.txt | dataset_end_date | | | Greatest date of all the trips of the dataset. |
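
The two computed date fields can be sketched as below. This is only an illustration: the function name `dataset_validity` and the input shape (a mapping from trip identifier to its active dates) are hypothetical.

```python
from datetime import date


def dataset_validity(trip_dates):
    """Return (start, end): the smallest and greatest active dates over all trips."""
    all_dates = [d for dates in trip_dates.values() for d in dates]
    return min(all_dates), max(all_dates)


start, end = dataset_validity({
    "trip1": [date(2024, 1, 3), date(2024, 1, 10)],
    "trip2": [date(2023, 12, 28), date(2024, 1, 5)],
})
```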
Physical modes may not contain CO2 emissions. If the value is missing, default values are used (see below), mostly based on the figures provided by ADEME.
Physical Mode | CO2 emission (gCO2-eq/km) |
---|---|
Air | 144.6 |
Boat | NC |
Bus | 132 |
BusRapidTransit | 84 |
Coach | 171 |
Ferry | 279 |
Funicular | 3 |
LocalTrain | 30.7 |
LongDistanceTrain | 3.4 |
Metro | 3 |
RapidTransit | 6.2 |
RailShuttle | NC |
Shuttle | NC |
SuspendedCableCar | NC |
Taxi | 184 |
Train | 11.9 |
Tramway | 4 |
The following fallback modes are also added to the model (they're usually not referenced by any trip).
Physical Mode | CO2 emission (gCO2-eq/km) |
---|---|
Bike | 0 |
BikeSharingService | 0 |
Car | 184 |
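
The fallback behaviour could be sketched as follows. The lookup table is only an excerpt of the values above, the function name `co2_emission` is hypothetical, and representing `NC` as `None` is an assumption.

```python
# Excerpt of the default values listed above, in gCO2-eq/km.
# None stands for "NC" (assumption: no default value available).
DEFAULT_CO2 = {
    "Bus": 132.0,
    "Metro": 3.0,
    "Tramway": 4.0,
    "Boat": None,
}


def co2_emission(physical_mode_id, provided=None):
    """Keep the value from the input data when present, else use the default."""
    if provided is not None:
        return provided
    return DEFAULT_CO2.get(physical_mode_id)
```

For instance, `co2_emission("Bus")` falls back to the default, while `co2_emission("Bus", 100.0)` keeps the value supplied by the input data.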
The following rules apply to every converter, unless otherwise explicitly specified.
- When one or more stop_points in the input data are not attached to a stop_area, a stop_area is automatically created for each one. The name, the coordinates, the visibility, and the timezone of the new `stop_area` are the same as the corresponding stop_point; the identifier is the `stop_point`'s identifier prefixed with `Navitia:`.
- If a `stop_area` doesn't have coordinates, the barycenter of the contained `stop_points` is used.
- Unless otherwise specified, dates of service are transformed into a list of active dates as if using a single NTFS file `calendar_dates.txt`. Those lists of dates are then transformed to `calendar` and `calendar_dates` automatically.
- Any `/` character in an identifier of an object is removed.
- If a trip doesn't have a `trip_headsign`, it is automatically generated based on the name of the last stop point of the trip.
- If a route doesn't have a `direction_type` (or it is empty), the `direction_type` "forward" is assigned by default.
- If a route doesn't have a name (or it is empty), `name` and `destination_id` are automatically generated:
  - the `route.name` is generated with the following rules:
    - select the most frequent `stop_area` origin and most frequent `stop_area` destination of all the associated trips
    - in case of equal frequencies, the biggest `stop_area`s (those with the most `stop_points`) are chosen
    - in case of `stop_area`s of equal sizes, the `stop_area` names are sorted alphabetically and the first ones are taken
    - finally, the `route.name` is generated with: `[name of origin's stop area] - [name of destination's stop area]`
  - the `route.destination_id` is set (overridden if needed) with the destination's stop area selected with the above rule
- If a line doesn't have a name (or it is empty), `name` is automatically set with the `name` of its first route in the forward direction (in alphabetical order).
- If a line has an empty opening or closing time, then both are generated:
  - the `line.opening_time` is generated with the smallest departure time (at the first stop) of all journeys on the line.
  - the `line.closing_time` is generated with the biggest arrival time (at the last stop) of all journeys on the line (+ 24h if the end is earlier than the start time).
  - if a line has several periods without circulation in the day, only the main one (larger and earlier) is used to define the opening and closing times.
  - lines with continuous circulation are indicated by default with an opening at 00:00 and a closing at 23:59.
- If a trip contains stop times matching any of the following conditions, the trip is deleted and logged with a WARN:
  - two stop times with the same sequence number
  - if the arrival time of a stop time is greater than the departure time of the same stop time
  - if the departure time of a stop time is greater than the arrival time of the next stop time
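
Two of the rules above, the generation of a route name and the validation of a trip's stop times, can be sketched as follows. This is a simplified illustration, not the converters' implementation: the function names and the input shapes (trips as `(origin, destination)` stop area pairs, stop times as `(sequence, arrival, departure)` tuples in seconds) are hypothetical.

```python
from collections import Counter


def pick_stop_area(candidates, sizes, names):
    """Most frequent stop area; ties broken by size, then by name."""
    counts = Counter(candidates)
    return min(counts, key=lambda sa: (-counts[sa], -sizes[sa], names[sa]))


def route_name(trips, sizes, names):
    """Return (generated route name, destination stop area id)."""
    origin = pick_stop_area([o for o, _ in trips], sizes, names)
    destination = pick_stop_area([d for _, d in trips], sizes, names)
    return f"{names[origin]} - {names[destination]}", destination


def is_valid_trip(stop_times):
    """False when the trip matches one of the three deletion conditions."""
    ordered = sorted(stop_times)
    sequences = [seq for seq, _, _ in ordered]
    if len(sequences) != len(set(sequences)):
        return False  # duplicated sequence number
    if any(arrival > departure for _, arrival, departure in ordered):
        return False  # arrival after departure at the same stop
    # departure must not be after the arrival at the next stop
    return all(
        departure <= next_arrival
        for (_, _, departure), (_, next_arrival, _) in zip(ordered, ordered[1:])
    )


names = {"sa1": "Alpha", "sa2": "Beta", "sa3": "Gamma", "sa4": "Delta"}
sizes = {"sa1": 2, "sa2": 1, "sa3": 1, "sa4": 5}
name, destination_id = route_name(
    [("sa1", "sa2"), ("sa1", "sa3"), ("sa4", "sa3")], sizes, names
)
```

Encoding the three criteria as a single sort key keeps the tie-breaking order explicit: frequency first, then number of stop points, then alphabetical name.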
The model will raise a critical error if identifiers of 2 objects of the same type are identical. For example:
- if 2 datasets have the same identifier
- if 2 lines have the same identifier
- if 2 stop_points have the same identifier
- if 2 stop_areas have the same identifier
- if 2 routes have the same identifier
- if 2 trips have the same identifier
Please note that a stop_area and a stop_point can have the same identifier because they are considered as different types of objects.
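
The uniqueness check can be pictured with the short sketch below, where identifiers are compared per object type. The exception class `DuplicateIdError` and the function `check_unique_ids` are hypothetical names used for illustration only.

```python
class DuplicateIdError(Exception):
    """Raised when two objects of the same type share an identifier."""


def check_unique_ids(objects):
    """objects: iterable of (object_type, identifier) pairs."""
    seen = set()
    for object_type, identifier in objects:
        key = (object_type, identifier)
        if key in seen:
            raise DuplicateIdError(f"{object_type} '{identifier}' is defined twice")
        seen.add(key)
    return seen


# A stop_area and a stop_point may share an identifier: different types.
accepted = check_unique_ids([("stop_area", "ABC:S1"), ("stop_point", "ABC:S1")])
```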
Dangling references are cleaned up:
- if a transfer refers to a stop which doesn't exist (`from_stop_id` and `to_stop_id`)
- if a trip refers to a route which doesn't exist
- if a trip refers to a commercial mode which doesn't exist
- if a trip refers to a dataset which doesn't exist
- if a trip refers to a company which doesn't exist
- if a trip refers to a calendar which doesn't exist
- if a line refers to a network which doesn't exist
- if a line refers to a commercial mode which doesn't exist
- if a route refers to a line which doesn't exist
- if a stop_point refers to a stop_area which doesn't exist
- if a dataset refers to a contributor which doesn't exist
Objects that are not relevant are cleaned up:
- `datasets` which are not referenced by `trips`
- `contributors` which are not referenced by `datasets`
- `companies` which are not referenced by `trips`
- `networks` containing no `line`
- `lines` containing no `route`
- `routes` containing no `trips`
- `trips` containing no `stop_time` or with empty `calendars`
- `stop_points` which are not referenced by `stop_times`
- `stop_areas` which are not referenced by `stop_points` or `routes`
- `calendars` which don't contain any active date
- `geometries` which are not referenced
- `equipments` which are not referenced by `stop_points`
- `frequencies` which are not referenced by `trips`
- `physical_modes` which are not referenced by `trips`
- `commercial_modes` which are not referenced by `lines`
- `trip_properties` which are not referenced by `trips`
- `comments` which are not referenced
- `grid_calendar` which refers to a `line` which does not exist (through the relation in the file `grid_rel_calendar_line.txt`); Exception: when the `line_external_code` is used and the `line_id` is empty, the `grid_calendar` is kept
- `grid_exception` date which refers to a `grid_calendar` which does not exist
- `grid_period` which refers to a `grid_calendar` which does not exist
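
Part of this clean-up (routes, lines and networks) can be sketched as a bottom-up pruning pass. The function name `prune_unused`, the dictionary-based data model, and the order of the passes are assumptions made for illustration.

```python
def prune_unused(networks, lines, routes, trips):
    """Drop routes without trips, then lines without routes, then empty networks.

    networks: set of network ids
    lines:    {line_id: network_id}
    routes:   {route_id: line_id}
    trips:    {trip_id: route_id}
    """
    used_routes = set(trips.values())
    routes = {r: line for r, line in routes.items() if r in used_routes}
    used_lines = set(routes.values())
    lines = {l: net for l, net in lines.items() if l in used_lines}
    used_networks = set(lines.values())
    networks = {n for n in networks if n in used_networks}
    return networks, lines, routes


kept_networks, kept_lines, kept_routes = prune_unused(
    networks={"n1", "n2"},
    lines={"l1": "n1", "l2": "n2"},
    routes={"r1": "l1", "r2": "l2"},
    trips={"t1": "r1"},
)
```

Pruning from the trips upwards means that removing a route can in turn empty its line and network within the same pass.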