dojo_lang_detector

The objective of this kata is to label Wikipedia articles with their languages.

To achieve this, training and test data are provided in this repository, with their schemas documented below.

To get a score on the dojo leaderboard, your script must read a test dataset on standard input and write an answer file on standard output, e.g.:

python label_articles.py < test_200.json > team_n_answers.json

The grading.py script is then used to compute the official dojo score.

The score is computed by adding 1 for each correct guess, -1 for each incorrect guess, and 0 for no guess.
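Under that rule, the scoring can be sketched as follows (a minimal illustration, not the actual grading.py; the dict-based `score` helper and the example identifiers are assumptions for demonstration):

```python
def score(answers, truth):
    """Sum +1 for each correct guess, -1 for each incorrect guess,
    and 0 when the guess is None (no guess)."""
    total = 0
    for example_id, guess in answers.items():
        if guess is None:  # abstaining costs nothing
            continue
        total += 1 if guess == truth[example_id] else -1
    return total

# Hypothetical data: two correct, one wrong, one abstention -> 1 + 1 - 1 + 0 = 2 - 1 = 1
truth = {"a": "it", "b": "be", "c": "en", "d": "fr"}
answers = {"a": "it", "b": "be", "c": "de", "d": None}
print(score(answers, truth))  # 1
```

Abstaining (scoring 0) can therefore beat guessing at random, which loses a point on every miss.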

File schemas

lang_train.json

Labeled training dataset, as a jsonl (JSON Lines) file containing objects with the following schema:

  1. text: UTF-8 extract from a Wikipedia article, cleared of HTML tags.
  2. lang: ISO code of the language of the extract.
  3. subject: the Wikipedia subject of the extract.

train_*.json

Another labeled training dataset, as a jsonl file containing objects with the same schema as lang_train.json, but in which the text is only 100 or 200 characters taken from the middle of the article.

test_100.json

Unlabeled test dataset, as a jsonl file containing objects with the following schema:

text
100-character UTF-8 extract from a Wikipedia article, cleared of HTML tags.
example
Example identifier.

random_solution.py

Example solution, to demonstrate the expected output.

random_solution_answers.json

An example answer file as generated by random_solution.py, following this schema:

example
Example identifier.
lang
ISO code of the language guessed by the solution. null denotes no guess.
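Producing an answer file in that schema means emitting one JSON object per line, with null for abstentions. A minimal sketch (the `write_answers` helper and the "ex1"/"ex2" identifiers are illustrative assumptions, not part of the kata's API):

```python
import io
import json

def write_answers(guesses, out):
    """Write one {"example": ..., "lang": ...} JSON object per line.
    A lang of None serializes to JSON null, denoting no guess."""
    for example_id, lang in guesses:
        out.write(json.dumps({"example": example_id, "lang": lang}) + "\n")

# Hypothetical identifiers; "ex2" abstains.
buf = io.StringIO()
write_answers([("ex1", "it"), ("ex2", None)], buf)
print(buf.getvalue(), end="")
# {"example": "ex1", "lang": "it"}
# {"example": "ex2", "lang": null}
```

In a real solution, `out` would be sys.stdout so the shell redirection shown above captures the answers.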

languages.json

A JSON object containing the mapping between language ISO codes and their human-readable names.

jsonl

The jsonl format consists of newline-delimited JSON objects. For example:

    {"lang": "it", "text": "ico del Nord...", "subject": "Atlantic_Ocean"}
    {"lang": "be", "text": "га да гораду ...", "subject": "New_York_City"}

A typical way to decode those files in Python is a generator expression such as (json.loads(line) for line in open(json_l_filename)).
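Wrapped in a small helper, that decoding idiom looks like this (a sketch; the `read_jsonl` name is an assumption, and the commented usage below assumes lang_train.json follows the schema above):

```python
import json

def read_jsonl(path):
    """Yield one decoded object per line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)

# Hypothetical usage against the training file described above:
# for record in read_jsonl("lang_train.json"):
#     print(record["lang"], record["subject"])
```

Reading lazily this way keeps memory use flat even on the full training file.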

More details at http://jsonlines.org/.