Primary language: F#. License: MIT.

Wikitext Template Parser

Parse wikitext route diagram templates and compare them with the DB register of infrastructure and the open data of Deutsche Bahn.

Introduction

  • a route is a railway route described in a Wikipedia article; an article may contain multiple routes,
  • the entities compared are operational points (Betriebsstellen) such as stations and stops,
  • the reference is the DB register of infrastructure, with data from RINF.

Comparison

The comparison of wiki data with the available DB data gives the following results:

| Item | Count | Example |
| --- | --- | --- |
| routes with all DB data found in wiki data | 644 | Route 1700 |
| routes with some DB data not found in wiki data | 48 | Weinheim-Sulzbach, Route 3601 |
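
The two categories above can be illustrated with a minimal Python sketch. It is not the project's F# implementation: the function name `missing_in_wiki` and the exact-name comparison are assumptions made for illustration, and the real matching is more lenient (see the match kinds further down).

```python
from typing import Iterable, List


def missing_in_wiki(db_points: Iterable[str], wiki_points: Iterable[str]) -> List[str]:
    """Return the DB operational points that have no counterpart in the wiki data.

    Assumption: names are compared case-insensitively and otherwise exactly.
    """
    wiki = {name.casefold() for name in wiki_points}
    return [name for name in db_points if name.casefold() not in wiki]


def route_category(db_points: Iterable[str], wiki_points: Iterable[str]) -> str:
    """Classify a route into one of the two categories of the table above."""
    if missing_in_wiki(db_points, wiki_points):
        return "some DB data not found in wiki data"
    return "all DB data found in wiki data"
```

A route falls into the first category only when the list of missing operational points is empty.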

Statistics about operational points found

How many operational points are found:

| Item | Count | Routes |
| --- | --- | --- |
| operational points matched | 6796 | 644 |
| operational points missing | 52 | 38 |
| operational points with specified matching | 11 | 10 |

How do operational points match:

| Match kind | Count |
| --- | --- |
| equal short names | 2083 |
| equal names | 4130 |
| same substring | 556 |
| border | 37 |

For 234 routes, all operational points match by equal short names or equal names only.
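
As a rough illustration of the match kinds listed above, the following Python sketch tries them in order of strictness. It is not the project's F# code: the function `classify_match`, the normalization, the minimum substring length, and the reading of "same substring" as "one name contains the other" are all assumptions. The "border" kind is left out because this README does not define it.

```python
from enum import Enum
from typing import Optional


class MatchKind(Enum):
    EQUAL_SHORT_NAMES = "equal short names"
    EQUAL_NAMES = "equal names"
    SAME_SUBSTRING = "same substring"


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.casefold().split())


def classify_match(db_name: str, db_short: str,
                   wiki_name: str, wiki_short: str,
                   min_len: int = 5) -> Optional[MatchKind]:
    """Return the strictest match kind that applies, or None if nothing matches."""
    # Short names only count when both sides actually have one.
    if db_short and wiki_short and normalize(db_short) == normalize(wiki_short):
        return MatchKind.EQUAL_SHORT_NAMES
    a, b = normalize(db_name), normalize(wiki_name)
    if a == b:
        return MatchKind.EQUAL_NAMES
    # Assumed rule: the shorter name must be contained in the longer one
    # and be long enough (min_len) to avoid accidental matches.
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= min_len and shorter in longer:
        return MatchKind.SAME_SUBSTRING
    return None
```

Trying the strict kinds first keeps the looser substring rule as a fallback, which fits the counts above, where most matches are of the two exact kinds.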

Statistics about articles

There are several reasons why it is not possible to compare the data:

| Item | Count | Remarks | Example |
| --- | --- | --- | --- |
| articles total | 1499 | all articles with route templates are parsed | |
| articles with empty route parameter | 317 | | Schluff Eisenbahn |
| route is no passenger train | 305 | urban trains and freight trains are not checked | Route 1734 |
| routes compared with DB data | 697 | routes with available DB data are compared | Route 1700 |
| routes shutdown | 473 | remark in railway guide (KBS) or operational points out of service | Route 3745 |
| routes with no DB data found | 185 | articles with shut down routes | Route 6603 down |

Extracting the route infos

Extracting the route information (i.e. the route number and the start and stop operational points) from the 'STRECKENNR' template gives the following results:

| Item | Count | Example |
| --- | --- | --- |
| route number without operational point names | 1014 | 4250 |
| route number without operational point names and text ignored | 124 | 6967; sä.MN |
| operational point names in <small> format tags | 571 | 6135 <small>(Bln. Südkreuz–Elsterwerda)</small> |
| operational point names in text | 22 | 1101 Lütjenbrode–Heiligenhafen |

Operational points from route parameters should match operational points from templates that have distances; 100 entries are specified manually.
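
The four example values in the table above can be parsed with a small Python sketch like the one below. It is an illustration only, not the project's F# parser: `RouteInfo` and `parse_route_info` are hypothetical names, and the rules (leading digits are the route number, names are separated by an en dash) are assumptions derived from those four examples.

```python
import re
from typing import NamedTuple, Optional


class RouteInfo(NamedTuple):
    number: str
    start: Optional[str]
    stop: Optional[str]


# Assumption: the value starts with the route number, anything may follow.
_NUMBER = re.compile(r"^\s*(\d+)\b(.*)$", re.DOTALL)
# Names inside <small>…</small>, optionally wrapped in parentheses.
_SMALL = re.compile(r"<small>\s*\(?(.*?)\)?\s*</small>", re.IGNORECASE)


def parse_route_info(value: str) -> Optional[RouteInfo]:
    """Extract the route number and, if present, start and stop names."""
    m = _NUMBER.match(value)
    if m is None:
        return None
    number, rest = m.group(1), m.group(2)
    small = _SMALL.search(rest)
    text = (small.group(1) if small else rest).strip()
    # Only the en dash separates names; hyphens occur inside station names.
    if "–" in text:
        start, stop = (part.strip() for part in text.split("–", 1))
        if start and stop:
            return RouteInfo(number, start, stop)
    # Any other trailing text (e.g. "; sä.MN") is ignored.
    return RouteInfo(number, None, None)
```

For example, `parse_route_info("1101 Lütjenbrode–Heiligenhafen")` yields the number `1101` with `Lütjenbrode` and `Heiligenhafen` as start and stop, while `parse_route_info("6967; sä.MN")` yields only the number.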

Usage of shortnames

Analyzing the short names of operational points in the route diagrams gives the following results:

| Item | Count | Example |
| --- | --- | --- |
| distinct operational points with links to articles | 12753 | |
| articles with infobox Bahnhof and short name | 2143 | Wuppertal in route Düsseldorf–Elberfeld |
| articles without infobox Bahnhof | 10610 | Troisdorf in route Rechte Rheinstrecke |

Installation

  • Log in to RINF
    • manually download data of type SOL to file dbdata/RINF/SectionOfLines.csv,
    • manually download data of type OP to file dbdata/RINF/OperationalPoints.csv,
  • execute script scripts/restore.sh to download DB open data,
  • execute script scripts/rebuild.sh to download wikipedia articles and compare data,
  • execute dotnet run --project src/ResultsViewer/ResultsViewer.fsproj to view results.

There is a Dockerfile containing these steps.
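
Because the two RINF files are downloaded manually, it can help to check that they are in place before running the scripts. The following Python sketch is a hypothetical helper, not part of the repository; only the two file paths are taken from the steps above.

```python
from pathlib import Path
from typing import List

# Paths as given in the installation steps, relative to the repository root.
REQUIRED_RINF_FILES = (
    "dbdata/RINF/SectionOfLines.csv",
    "dbdata/RINF/OperationalPoints.csv",
)


def missing_rinf_files(repo_root: str = ".") -> List[str]:
    """Return the required RINF files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_RINF_FILES if not (root / rel).is_file()]
```

An empty result from `missing_rinf_files()` means both manual downloads are present and the restore and rebuild scripts can be run.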