Ground truth datasets for our paper Similarity Classification of Public Transit Stations (Hannah Bast, Patrick Brosi, Markus Näther).
The datasets contain pairs of station identifiers, each consisting of a station label and its geographic position. Each pair is marked either similar (1) or not similar (0). The datasets have been generated from OpenStreetMap data. Basically, two identifiers are similar if they belong to the same station object (a way or node describing a station or platform) or to two station objects which are grouped by a relation public_transport=station. Two identifiers are not similar if they are part of different public_transport=station relations. To avoid a blowup of not similar pairs, we only include not similar pairs that are at most 1,000 meters apart. Beyond that, we consider the similarity classification trivial.
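The labeling rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to generate the datasets: the `Identifier` class, its `object_id` and `relation_id` fields, the function names, and the choice of the haversine great-circle distance are all hypothetical. Pairs that match neither rule are returned as `None`, since the text does not say how they are handled.

```python
import math
from dataclasses import dataclass
from typing import Optional

MAX_NOT_SIMILAR_DIST_M = 1000.0  # distance cut-off stated in the text
EARTH_RADIUS_M = 6371000.0       # mean Earth radius (assumed distance model)


@dataclass(frozen=True)
class Identifier:
    """Hypothetical station identifier: label, position, and OSM membership."""
    label: str
    lat: float
    lon: float
    object_id: int                     # OSM way or node the identifier belongs to
    relation_id: Optional[int] = None  # public_transport=station relation, if any


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def label_pair(a: Identifier, b: Identifier) -> Optional[int]:
    """Return 1 (similar), 0 (not similar), or None (pair not included)."""
    # Same station object: similar.
    if a.object_id == b.object_id:
        return 1
    # Different objects grouped by the same station relation: similar.
    if a.relation_id is not None and a.relation_id == b.relation_id:
        return 1
    # Both in (different) station relations: not similar, but only if close enough.
    if a.relation_id is not None and b.relation_id is not None:
        if haversine_m(a.lat, a.lon, b.lat, b.lon) <= MAX_NOT_SIMILAR_DIST_M:
            return 0
    # Everything else is left out (assumption: the text does not cover this case).
    return None


# Made-up object and relation ids, for illustration only.
a = Identifier("Bertoldsbrunnen, Freiburg im Breisgau", 47.995183, 7.8502314, 1, 10)
b = Identifier("Bertoldsbrunnen", 47.9945867, 7.8501955, 2, 10)
c = Identifier("Nearby stop", 47.9960, 7.8520, 3, 11)
d = Identifier("Distant stop", 48.1, 7.85, 4, 12)
e = Identifier("Ungrouped stop", 47.9950, 7.8503, 5)
```

With these made-up identifiers, `a` and `b` share a relation and are labeled similar, `a` and `c` are in different relations within the cut-off and are labeled not similar, and the pairs involving `d` (too far away) or `e` (no relation) are dropped.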
Both datasets are sanitized to filter out obvious mapping mistakes. As a general rule, whenever our heuristics decide that something is a mapping mistake, we completely remove the suspicious pair from the dataset (it will appear neither as similar nor as not similar).
Each dataset is randomly split into a training dataset (20%) and a testing dataset (80%). The datasets come as gzipped tab-separated files (.tsv) with the following columns:
station1_id, station1_name, station1_lat, station1_lon, station2_id, station2_name, station2_lat, station2_lon, is_similar
An example row for the Freiburg area:
3614 Bertoldsbrunnen, Freiburg im Breisgau 47.995183 7.8502314 4118 Bertoldsbrunnen 47.9945867 7.8501955 1
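A minimal sketch of reading such a file in Python, using only the standard library. The column order follows the list above. The names `StationPair`, `parse_pairs`, and `read_pairs` are hypothetical, and the sketch assumes that the fields are separated by single tabs, that no quoting is used, and that a header row, if present at all, starts with `station1_id`.

```python
import csv
import gzip
import io
from typing import Iterable, Iterator, NamedTuple


class StationPair(NamedTuple):
    station1_id: str
    station1_name: str
    station1_lat: float
    station1_lon: float
    station2_id: str
    station2_name: str
    station2_lat: float
    station2_lon: float
    is_similar: bool


def parse_pairs(lines: Iterable[str]) -> Iterator[StationPair]:
    """Parse tab-separated lines into StationPair records."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[0] == "station1_id":
            continue  # skip a header row, should the file contain one
        if len(row) != 9:
            raise ValueError(f"expected 9 columns, got {len(row)}: {row!r}")
        yield StationPair(
            row[0], row[1], float(row[2]), float(row[3]),
            row[4], row[5], float(row[6]), float(row[7]),
            row[8] == "1",
        )


def read_pairs(path: str) -> Iterator[StationPair]:
    """Stream records from a gzipped .tsv file without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        yield from parse_pairs(f)


# The example row from above, with tabs as field separators.
SAMPLE = (
    "3614\tBertoldsbrunnen, Freiburg im Breisgau\t47.995183\t7.8502314\t"
    "4118\tBertoldsbrunnen\t47.9945867\t7.8501955\t1\n"
)
sample_pairs = list(parse_pairs(io.StringIO(SAMPLE)))
```

Streaming row by row keeps memory use low, which matters for the larger dataset with its 17.3 M pairs.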
Dataset name | Description | Source | # pairs | Datasets |
---|---|---|---|---|
UK | Ground truth data for Great Britain and Ireland | OSM data | 2.2 M | train test |
DACH | Ground truth data for Germany (D), Austria (A) and Switzerland (CH) | OSM data | 17.3 M | train test |
Pre-trained classification models for statsimi:
Dataset name | Description | Model | Playground |
---|---|---|---|
Belgium | - | model.lib | |
Britain & Ireland | - | model.lib | map |
Czech Republic | - | model.lib | |
D-A-CH | Combined model for Germany (D), Austria (A) and Switzerland (CH) | model.lib | map |
Denmark | - | model.lib | |
Estonia | - | model.lib | |
Finland | - | model.lib | |
France | - | model.lib | |
Hungary | - | model.lib | |
Italy | - | model.lib | |
Netherlands | - | model.lib | |
Norway | - | model.lib | |
Poland | - | model.lib | |
Portugal | - | model.lib | |
Slovakia | - | model.lib | |
Slovenia | - | model.lib | |
Spain | - | model.lib | |
Sweden | - | model.lib | |
Dataset name | Description | Model | Playground |
---|---|---|---|
Canada | - | model.lib | |
USA | - | model.lib | |
Dataset name | Description | Model | Playground |
---|---|---|---|
Australia | - | model.lib | |
New Zealand | - | model.lib | |