This is a generic scraper for driving durations using open APIs.
To use this scraper:
- Install poetry to manage the python dependencies (see the link for instructions).
- From within the project root, run
poetry install
which fetches all required dependencies. - Interact with the scrapers below by running
poetry run scrapy runspider …
(see the different scrapers for details)
OpenRouteService can perform different kinds of distance calculations if you are looking to estimate distances by car, by foot, in a wheelchair, or by bike. If you need public transit, see the section on VbbRest below.
The scraper works by providing the paths of two CSV files when invoking it. It will build the cartesian product of all items in the two files and return routing information from OpenRouteService.
We need two CSV files, each require the fields lat
and lng
. Any other field will be kept around but is not required.
examples/sources.csv
:
id,lat,lng,name
1,52.5033,13.3848,Berlin
examples/destinations.csv
id,lat,lng,name
1,51.7784,14.3875,Cottbus
Invocation:
poetry run scrapy crawl OpenRouteService \
-a api_key=YOUR_API_TOKEN \
-a source_csv=examples/sources.csv \
-a destination_csv=examples/destinations.csv \
-o output.json
The api_key
can be copy-and-pasted from the OpenRouteService UI. The two CSV files are the paths given above. -o output.json
determines the name and format of the file that stores all scraper results.
You can also pass -a profile=...
to customize the way the driving duration and distance are calculated (more in the api documentation). Interesting profiles might be driving-car
, cycling-regular
, wheelchair
or foot-walking
.
The response is an array of GeoJSON features. Each feature represents one trip. The most important properties are properties.summary
, properties.source
and properties.destination
.
Read the comments below that aim to help understanding the result:
[
{
"bbox": [
13.383565,
51.722495,
14.388388,
52.503489
],
"type": "Feature",
"properties": {
"segments": [
{
"distance": 130873.1, // segment distance driven in meters; may be less than total distance (see "summary" below)
"duration": 6042.1, // segment duration in seconds
"steps": [
{
"distance": 19.9,
"duration": 4.8,
"type": 11,
"instruction": "Head southwest",
"name": "-",
"way_points": [
0,
1
]
},
// many more steps with durations, distances and instructions
]
}
],
"summary": {
"distance": 130873.1, // ← !! total distance driven in meters
"duration": 6042.1 // ← !! total trip duration in seconds
},
"way_points": [
0,
1171
],
"source": {
// contains the complete source row of your csv for matching
"id": "1",
"lat": "52.5033",
"lng": "13.3848",
"name": "Berlin"
},
"destination": {
// contains the complete destination row of your csv for matching
"id": "1",
"lat": "51.7784",
"lng": "14.3875",
"name": "Cottbus"
}
},
"geometry": {
"coordinates": [
[
13.384808,
52.503294
],
// many more coordinates so you can draw a detailed shape
],
"type": "LineString"
}
}
]
The VBB Rest API needs stop ids to calculate trips. These are provided, as above, in a designated row of the input and output CSV files. If your CSV does not contain a stop_id
column, you will a lat
and lng
and an address
* column (where address can be a street address or city name), and use the VbbRestStopIds
scraper to fetch the closest stop for you.
If you CSVs already contain a stop_id
column that can be consumed by the VBB Rest API, you can skip the next section.
Warning, there's a rant following. If you don't want to skip it you're of course free to do so, but if you're wondering why this is kind of messy, please read on.
I know, that address column is very annoying. I mean, a latitude and longitude can be of arbitrary precision, much more so than an address. Why does it need that? I don't know. I didn't write the API, and if I did I would have probably not written it the way it is written. But it's what's there and it's still very useful.
On the other hand, there is another endpoint of the HAFAS engine - exposed as stops/nearby
in the REST API - which unfortunately is not very reliable. If you just want to say "I have the name of a city, give me whatever is the main stop", then the stops/reachable-from
endpoint, which is used by this scraper, has given better results.
Given a CSV with slightly more information as the source.csv
as above:
id,lat,lng,name,address
1,52.5033,13.3848,Berlin,Stresemannstraße 70
poetry run scrapy crawl VbbRestStopIds \
-a source_csv=sources-with-address.csv \
-o source_stops.csv
Results in the following CSV:
id,lat,lng,name,address,stop_id,stop_duration,stop_name,stop_lat,stop_lng,stop_products
1,52.5033,13.3848,Berlin,Stresemannstraße 70,900000012101,4,S Anhalter Bahnhof,52.504537,13.38208,"suburban,bus"
Note that the stop_duration
column gives you the walking distance estimated by the API.
You can also tell the scraper to exclude some transit types (separated with ,
) when considering the closest stop. Available transit types are suburban
, subway
, tram
, bus
, ferry
, express
and regional
(you can read the API documentation for more information about what exactly these types mean):
poetry run scrapy crawl VbbRestStopIds \
-a source_csv=sources-with-address.csv \
-a excluded_products=suburban,bus \
-o source_stops.csv
Results in the following response, because Möckernbrücke offers a subway:
id,lat,lng,name,address,stop_id,stop_duration,stop_name,stop_lat,stop_lng,stop_products
1,52.5033,13.3848,Berlin,Stresemannstraße 70,900000017104,11,U Möckernbrücke,52.498945,13.383257,"subway,bus"
The CSV just generated can be used as input for the next scraper.
The following invocation will calculate trips from any row in sources.csv
to any row in destinations.csv
.
poetry run scrapy crawl VbbRestJourneys \
-a source_csv=examples/sources.csv \
-o source_stops.csv
Additional, optional arguments are:
- Either
-a departure=...
or-a arrival=...
- By default it will assume that you want to depart now, and date / time parameters can be passed like described in the API documentation (e.g.
today 2pm
,2020-04-29T19:30:00+02:00
or any unix timestamp).
- By default it will assume that you want to depart now, and date / time parameters can be passed like described in the API documentation (e.g.
-a excluded_products
, which is a comma-separated list as described in the previous section
Given two CSVs, each with a column stop_id
:
sources.csv
:
id,lat,lng,name,stop_id,stop_name,stop_lat,stop_lng,stop_distance,stop_products
1,52.5033,13.3848,Berlin,900000012101,S Anhalter Bahnhof,52.504537,13.38208,230,"suburban,bus"
destinations.csv
:
id,lat,lng,name,stop_id,stop_name,stop_lat,stop_lng,stop_distance,stop_products
1,52.5033,13.3848,Berlin,900000017104,U Möckernbrücke,52.498945,13.383257,495,"subway,bus"
You can find out information about trips that started at S Anhalter Bahnhof and arrived at 10am today like so:
poetry run scrapy crawl VbbRestJourneys -a source_csv=source_stops.csv -a destination_csv=source_stops_excluded.csv -a arrival='today 10am' -o trips_today_10am.jl
Where the last trip returned looks like this (the order is as you'd expect in your public transit app, with the last trip is the one immediately before your deadline):
{
"type": "journey",
"legs": [
// a leg is a single stop in the joruney
{
"origin": {
"type": "stop",
"id": "900000012101",
"name": "S Anhalter Bahnhof",
"location": {
"type": "location",
"id": "900012101",
"latitude": 52.504537,
"longitude": 13.38208
},
"products": {
"suburban": true,
"subway": false,
"tram": false,
"bus": true,
"ferry": false,
"express": false,
"regional": false
}
},
"destination": {
"type": "stop",
"id": "900000012151",
"name": "Willy-Brandt-Haus",
"location": {
"type": "location",
"id": "900012151",
"latitude": 52.500411,
"longitude": 13.387437
},
"products": {
"suburban": false,
"subway": false,
"tram": false,
"bus": true,
"ferry": false,
"express": false,
"regional": false
}
},
"departure": "2021-04-08T09:50:00+02:00",
"plannedDeparture": "2021-04-08T09:50:00+02:00",
"departureDelay": null,
"arrival": "2021-04-08T09:51:00+02:00",
"plannedArrival": "2021-04-08T09:51:00+02:00",
"arrivalDelay": null,
"reachable": true,
"tripId": "1|22282|23|86|8042021",
"line": {
"type": "line",
"id": "m41",
"fahrtNr": "36959",
"name": "M41",
"public": true,
"adminCode": "BVB",
"mode": "bus",
"product": "bus",
"operator": {
"type": "operator",
"id": "berliner-verkehrsbetriebe",
"name": "Berliner Verkehrsbetriebe"
},
"symbol": "M",
"nr": 41,
"metro": true,
"express": false,
"night": false
},
"direction": "Sonnenallee/Baumschulenstr.",
"arrivalPlatform": null,
"plannedArrivalPlatform": null,
"departurePlatform": null,
"plannedDeparturePlatform": null,
"cycle": {
"min": 540,
"max": 600,
"nr": 13
}
}, // ... followed by other stops in the journey
],
// this token can be used to continuously refresh information about the trip,
// so you can keep the delay information up-to-date
// the endpoint is described here: https://v5.vbb.transport.rest/api.html#get-journeysref
"refreshToken": "¶HKI¶T$A=1@O=S Anhalter Bahnhof (Berlin)@L=900012101@a=128@$A=1@O=Willy-Brandt-Haus (Berlin)@L=900012151@a=128@$202104080950$202104080951$ M41$$1$$$$§G@F$A=1@O=Willy-Brandt-Haus (Berlin)@L=900012151@a=128@$A=1@O=U Möckernbrücke (Berlin)@L=900017104@a=128@$202104080951$202104081000$$$1$$$$¶GP¶ft@0@2000@120@0@100@1@@0@@@@@false@0@-1@0@-1@-1@$f@$f@$f@$f@$f@$§bt@0@2000@120@0@100@1@@0@@@@@false@0@-1@0@-1@-1@$f@$f@$f@$f@$f@$§tt@0@250000@120@0@100@1@@0@@@@@false@0@-1@0@-1@-1@$t@0@250000@120@0@100@1@@0@@@@@false@0@-1@0@-1@-1@$t@0@0@0@0@100@-1@0@0@@@@@false@0@-1@0@-1@-1@$f@$f@$f@$§",
// how often does this type of journey repeat?
"cycle": {
"min": 540
},
"tickets": [
// you can even get information about ticket offers
{
"name": "Berlin Kurzstrecke (Via: Kurzstrecke): Kurzstrecke – Regeltarif",
"price": 2,
"tariff": "Berlin",
"coverage": "short trip",
"variant": "adult",
"amount": 1,
"shortTrip": true
},
// … followed by more ticket information
],
"source": {
// this can be used to match the row in your source csv
"id": "1",
"lat": "52.5033",
"lng": "13.3848",
"name": "Berlin",
"stop_id": "900000012101",
"stop_name": "S Anhalter Bahnhof",
"stop_lat": "52.504537",
"stop_lng": "13.38208",
"stop_distance": "230",
"stop_products": "suburban,bus"
},
"destination": {
// this can be used to match the row in your destination csv
"id": "1",
"lat": "52.5033",
"lng": "13.3848",
"name": "Berlin",
"stop_id": "900000017104",
"stop_name": "U Möckernbrücke",
"stop_lat": "52.498945",
"stop_lng": "13.383257",
"stop_distance": "495",
"stop_products": "subway,bus"
}
}
If you want to calculate the total trip time for example, you can use the following code in Python 3.7:
# assume that the journey above is available as `journey`
import datetime
start_time = datetime.datetime.strptime(journey['legs'][0]['departure'], "%Y-%m-%dT%H:%M:%S%z")
end_time = datetime.datetime.strptime(journey['legs'][-1]['arrival'], "%Y-%m-%dT%H:%M:%S%z")
end_time - start_time
# → datetime.timedelta(seconds=600)