station similarity classification datasets

Ground truth datasets for our paper Similarity Classification of Public Transit Stations (Hannah Bast, Patrick Brosi, Markus Näther).

Description

The datasets contain pairs of station identifiers, each consisting of a station label and its geographic position. Each pair is marked either similar (1) or not similar (0). The datasets have been generated from OpenStreetMap data. Two identifiers are similar if they belong to the same station object (a way or node describing a station or platform) or to two station objects that are grouped by a public_transport=station relation. Two identifiers are not similar if both are part of public_transport=station relations and those relations differ. To avoid a blowup of not-similar pairs, we only include not-similar pairs whose positions are at most 1,000 meters apart. Beyond that distance, we consider the similarity classification trivial.
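The 1,000 meter cutoff for not-similar pairs can be illustrated with a small sketch. It is not the code used to generate the datasets: the haversine formula, the mean Earth radius, and the helper names `haversine_m` and `within_cutoff` are assumptions made for this example only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)
MAX_DIST_M = 1_000          # cutoff for not-similar pairs, as stated above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_cutoff(lat1, lon1, lat2, lon2, max_dist_m=MAX_DIST_M):
    """True if a not-similar pair is close enough to be kept (hypothetical helper)."""
    return haversine_m(lat1, lon1, lat2, lon2) <= max_dist_m


# The two Bertoldsbrunnen positions from the Freiburg example row in this README
# lie well inside the cutoff.
print(within_cutoff(47.995183, 7.8502314, 47.9945867, 7.8501955))
```

A planar approximation would likely work just as well at this scale; haversine is used here only because it needs no projection.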

Both datasets are sanitized to filter out obvious mapping mistakes. As a general rule, whenever our heuristics decide that something is a mapping mistake, we completely remove the suspicious pair from the dataset (it will appear neither as similar nor as not similar).

Each dataset is randomly split into a training dataset (20%) and a testing dataset (80%). The datasets come as gzipped tab-separated files (.tsv) with the following columns:

station1_id, station1_name, station1_lat, station1_lon, station2_id, station2_name, station2_lat, station2_lon, is_similar

An example row for the Freiburg area:

3614    Bertoldsbrunnen, Freiburg im Breisgau   47.995183   7.8502314   4118    Bertoldsbrunnen 47.9945867  7.8501955   1
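A minimal sketch of how such a file could be read in Python, using only the standard library. The column names follow the list above. The function names `parse_pairs` and `read_pairs` and the file name in the usage note are hypothetical, and the sketch assumes each line has exactly nine tab-separated fields (a header line, if present, is skipped).

```python
import csv
import gzip

COLUMNS = [
    "station1_id", "station1_name", "station1_lat", "station1_lon",
    "station2_id", "station2_name", "station2_lat", "station2_lon",
    "is_similar",
]
FLOAT_COLUMNS = ("station1_lat", "station1_lon", "station2_lat", "station2_lon")


def parse_pairs(lines):
    """Yield one dict per station pair from an iterable of TSV lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        if row[0] == COLUMNS[0]:
            continue  # skip a header line, should the file have one
        rec = dict(zip(COLUMNS, row))
        for key in FLOAT_COLUMNS:
            rec[key] = float(rec[key])
        rec["is_similar"] = rec["is_similar"] == "1"  # 1 = similar, 0 = not similar
        yield rec


def read_pairs(path):
    """Yield station pairs from a gzipped .tsv file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        yield from parse_pairs(f)
```

Usage would look like `for pair in read_pairs("train.tsv.gz"): ...`, where the path is a placeholder for wherever the downloaded file is stored. Station IDs are kept as strings because the README does not state their type.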

Datasets

| Dataset name | Description | Source | # pairs | Datasets |
|---|---|---|---|---|
| UK | Ground truth data for Great Britain and Ireland | OSM data | 2.2 M | train, test |
| DACH | Ground truth data for Germany (D), Austria (A) and Switzerland (CH) | OSM data | 17.3 M | train, test |

Pre-trained models

Pre-trained classification models for statsimi:

Europe

| Dataset name | Description | Datasets | Playground |
|---|---|---|---|
| Belgium | - | model.lib | |
| Britain & Ireland | - | model.lib | map |
| Czech Republic | - | model.lib | |
| D-A-CH | Combined model for Germany (D), Austria (A) and Switzerland (CH) | model.lib | map |
| Denmark | - | model.lib | |
| Estonia | - | model.lib | |
| Finland | - | model.lib | |
| France | - | model.lib | |
| Hungary | - | model.lib | |
| Italy | - | model.lib | |
| Netherlands | - | model.lib | |
| Norway | - | model.lib | |
| Poland | - | model.lib | |
| Portugal | - | model.lib | |
| Slovakia | - | model.lib | |
| Slovenia | - | model.lib | |
| Spain | - | model.lib | |
| Sweden | - | model.lib | |

North America

| Dataset name | Description | Datasets | Playground |
|---|---|---|---|
| Canada | - | model.lib | |
| USA | - | model.lib | |

Oceania

| Dataset name | Description | Datasets | Playground |
|---|---|---|---|
| Australia | - | model.lib | |
| New Zealand | - | model.lib | |