Toimik.Wikimedia

.NET 8 C# Wikimedia Downloads processing tools.

License: Apache-2.0

Features

  • Extracts external URL link records from local or remote externallinks.sql.gz datasets (schema version 1.29 and above)
  • More to come ...

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.Wikimedia

.NET CLI

> dotnet add package Toimik.Wikimedia

Usage

Extracting External URL Links

Some of the datasets are collections of URLs that point to third-party resources. Of particular interest are those whose filename ends with externallinks.sql.gz and starts with <xx>wiki, where <xx> is a two- or three-letter language code.

For example: enwiki...externallinks.sql.gz and ruwiki...externallinks.sql.gz.
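The naming convention above can be checked with a regular expression. The sketch below is only an illustration: `IsExternalLinksDump` is a hypothetical helper that is not part of Toimik.Wikimedia, and it assumes the language code is exactly two or three lowercase ASCII letters, as described above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Toimik.Wikimedia): returns true when a
// filename looks like "<xx>wiki...externallinks.sql.gz".
static bool IsExternalLinksDump(string filename)
{
    // Assumption: <xx> is two or three lowercase ASCII letters.
    return Regex.IsMatch(filename, @"^[a-z]{2,3}wiki.*externallinks\.sql\.gz$");
}

Console.WriteLine(IsExternalLinksDump("enwiki-20211120-externallinks.sql.gz")); // True
Console.WriteLine(IsExternalLinksDump("enwiki-20211120-pagelinks.sql.gz"));     // False
```

Filenames whose prefix is longer than three letters would be rejected by this pattern, so it may need loosening depending on which datasets are of interest.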

As the filename's extension implies, each dataset is an SQL file that is compressed using GZip. Specifically, each is a MySQL script that, once decompressed, is fed to a MySQL client to create and populate a table based on the respective schema detailed at https://www.mediawiki.org/wiki/Manual:Externallinks_table.
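To illustrate what "an SQL file compressed using GZip" means in practice, the sketch below reads a GZip-compressed text stream line by line with the standard `GZipStream` class, without writing a decompressed copy to disk. It is not taken from this library's source: `ReadGzipLines` and `CompressSample` are hypothetical helpers, and the sample text is a made-up stand-in for a real dump.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: lazily yields the lines of a GZip-compressed text stream.
static IEnumerable<string> ReadGzipLines(Stream compressed)
{
    using var gzip = new GZipStream(compressed, CompressionMode.Decompress);
    using var reader = new StreamReader(gzip, Encoding.UTF8);
    while (reader.ReadLine() is { } line)
    {
        yield return line;
    }
}

// Hypothetical helper: compresses text in memory so the example needs no file.
static MemoryStream CompressSample(string text)
{
    var buffer = new MemoryStream();
    using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        gzip.Write(bytes, 0, bytes.Length);
    }

    buffer.Position = 0; // Rewind so the caller can read from the start
    return buffer;
}

// Made-up stand-in content; a real dump contains MySQL statements.
using var sample = CompressSample("-- first line of the script\n-- second line of the script\n");
foreach (var line in ReadGzipLines(sample))
{
    Console.WriteLine(line);
}
```

For a local dataset, the same helper could be given a stream opened with `File.OpenRead`.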

 

The following classes extract the links without using MySQL at all, which reduces disk space and memory requirements.

V129ExternalLinksExtractor.cs

// As the name implies, this extractor extracts from datasets meant for schema version 1.29 and above
var extractor = new V129ExternalLinksExtractor();
var path = ... // Path to a local `externallinks.sql.gz`
await foreach (ExternalLinksExtractor.Result result in extractor.Extract(path))
{
    ...
}

 

ExternalLinksStreamer.cs

var streamer = new ExternalLinksStreamer(
    new HttpClient(), // Ideally, a singleton should be used
    new V129ExternalLinksExtractor());
// This example streams the external URL links from November 2021's English dataset
var dataset = new Uri("http://dumps.wikimedia.org/enwiki/20211120/enwiki-20211120-externallinks.sql.gz");
await foreach (ExternalLinksExtractor.Result result in streamer.Stream(dataset))
{
    ...
}

 

Known Issue

Streaming some large files over HTTPS may throw a System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.

If that happens, consider using HTTP instead.
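A minimal sketch of switching a dataset address from HTTPS to HTTP is shown below. `ToHttp` is a hypothetical helper, not part of Toimik.Wikimedia. It relies on the standard `UriBuilder` class; the port is reset explicitly because otherwise the HTTPS default port (443) would be carried over into the HTTP address.

```csharp
using System;

// Hypothetical helper: returns the same address using the "http" scheme.
static Uri ToHttp(Uri uri)
{
    var builder = new UriBuilder(uri)
    {
        Scheme = Uri.UriSchemeHttp,
        // Keep an explicit custom port; -1 means "use the scheme's default port".
        Port = uri.IsDefaultPort ? -1 : uri.Port,
    };
    return builder.Uri;
}

var https = new Uri("https://dumps.wikimedia.org/enwiki/20211120/enwiki-20211120-externallinks.sql.gz");
Console.WriteLine(ToHttp(https));
// http://dumps.wikimedia.org/enwiki/20211120/enwiki-20211120-externallinks.sql.gz
```

If streaming over HTTPS fails with the exception above, the same dataset could then be requested again with the address returned by `ToHttp`.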