Ground truth datasets for our paper "Similarity Classification of Public Transit Stations" (Hannah Bast, Patrick Brosi, Markus Näther).
The datasets contain pairs of station identifiers, each consisting of a station label and its geographic position. Each pair is marked either similar (1) or not similar (0). The datasets have been generated from OpenStreetMap (OSM) data. In essence, two identifiers are similar if they belong to the same station object (a way or node describing a station or platform) or to two station objects that are grouped by the same public_transport=station relation. Two identifiers are not similar if they belong to different public_transport=station relations. To avoid a blowup of not-similar pairs, we only include not-similar pairs that are at most 1,000 meters apart. Beyond that distance, we consider the similarity classification trivial.
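The 1,000 meter cutoff for not-similar pairs can be illustrated with a short sketch. The text above does not say which distance measure was used to build the datasets, so the haversine formula, the Earth radius constant, and the helper names `haversine_m` and `keep_not_similar_pair` are assumptions made for illustration only.

```python
import math

# Assumption of this sketch: mean Earth radius in meters.
EARTH_RADIUS_M = 6_371_000.0
# Cutoff for not-similar pairs, as stated in the dataset description.
MAX_DIST_M = 1_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def keep_not_similar_pair(lat1: float, lon1: float,
                          lat2: float, lon2: float,
                          max_dist_m: float = MAX_DIST_M) -> bool:
    """Hypothetical filter: keep a not-similar pair only if it is close enough."""
    return haversine_m(lat1, lon1, lat2, lon2) <= max_dist_m
```

Under these assumptions, two stations in the same city center would pass the filter, while a pair that is several kilometers apart would be dropped as a trivial case.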
Both datasets are sanitized to filter out obvious mapping mistakes. As a general rule, whenever our heuristics decide that something is a mapping mistake, we completely remove the suspicious pair from the dataset (it will appear neither as similar nor as not similar).
Each dataset is randomly split into a training dataset (20%) and a testing dataset (80%). The datasets come as gzipped tab-separated files (.tsv) with the following columns:
station1_id, station1_name, station1_lat, station1_lon, station2_id, station2_name, station2_lat, station2_lon, is_similar
An example row for the Freiburg area:
3614 Bertoldsbrunnen, Freiburg im Breisgau 47.995183 7.8502314 4118 Bertoldsbrunnen 47.9945867 7.8501955 1
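The following sketch shows one way to read such a file with the Python standard library. It assumes that the files are UTF-8 encoded, contain no header line, and have exactly the nine columns listed above; the names `StationPair`, `parse_row`, and `read_pairs` are hypothetical and not part of statsimi. Station IDs are kept as strings because their exact type is not specified here.

```python
import csv
import gzip
from typing import Iterator, NamedTuple, Sequence


class StationPair(NamedTuple):
    id1: str
    name1: str
    lat1: float
    lon1: float
    id2: str
    name2: str
    lat2: float
    lon2: float
    is_similar: bool


def parse_row(row: Sequence[str]) -> StationPair:
    """Convert one tab-separated row (nine fields) into a StationPair."""
    if len(row) != 9:
        raise ValueError(f"expected 9 columns, got {len(row)}")
    return StationPair(
        row[0], row[1], float(row[2]), float(row[3]),
        row[4], row[5], float(row[6]), float(row[7]),
        row[8].strip() == "1",
    )


def read_pairs(path: str) -> Iterator[StationPair]:
    """Yield all pairs from a gzipped TSV file (assumed to have no header)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: station names may contain quote characters.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:  # skip empty lines
                yield parse_row(row)
```

Reading lazily with a generator keeps memory use low, which matters for files with millions of pairs. If the files turn out to contain a header line, the first row would need to be skipped before parsing.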
| Dataset name | Description | Source | # pairs | Datasets |
|---|---|---|---|---|
| UK | Ground truth data for Great Britain and Ireland | OSM data | 2.2 M | train test |
| DACH | Ground truth data for Germany (D), Austria (A) and Switzerland (CH) | OSM data | 17.3 M | train test |
Pre-trained classification models for statsimi:
| Region | Description | Model | Playground |
|---|---|---|---|
| Belgium | - | model.lib | |
| Britain & Ireland | - | model.lib | map |
| Czech Republic | - | model.lib | |
| D-A-CH | Combined model for Germany (D), Austria (A) and Switzerland (CH) | model.lib | map |
| Denmark | - | model.lib | |
| Estonia | - | model.lib | |
| Finland | - | model.lib | |
| France | - | model.lib | |
| Hungary | - | model.lib | |
| Italy | - | model.lib | |
| Netherlands | - | model.lib | |
| Norway | - | model.lib | |
| Poland | - | model.lib | |
| Portugal | - | model.lib | |
| Slovakia | - | model.lib | |
| Slovenia | - | model.lib | |
| Spain | - | model.lib | |
| Sweden | - | model.lib | |
| Canada | - | model.lib | |
| USA | - | model.lib | |
| Australia | - | model.lib | |
| New Zealand | - | model.lib | |