The objective of this dojo is to label Wikipedia articles with their language.
To achieve this, training and test data are provided in this repository, with their schemas documented below.
To get a score on the dojo leaderboard, your script has to read a test dataset on its standard input and write an answer file on its standard output, e.g.:
python label_articles.py < test_200.json > team_n_answers.json
The grading.py script is then used to compute the official dojo score.
The score adds 1 for each correct guess, -1 for each incorrect guess, and 0 for each example with no guess.
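The snippet below is only a rough sketch of this scoring rule, not the official grading.py; it assumes the answer file uses the same JSON Lines layout as the datasets, and that `truth` maps example identifiers to the correct ISO codes.

```python
# Rough sketch of the scoring rule, not the official grading.py:
# +1 per correct guess, -1 per incorrect guess, 0 when "lang" is null.
# Assumes the answer file is JSON Lines and that `truth` maps example
# identifiers to the correct ISO language codes.
import json

def score(answer_filename, truth):
    total = 0
    with open(answer_filename, encoding="utf-8") as f:
        for line in f:
            answer = json.loads(line)
            guess = answer["lang"]
            if guess is None:
                continue  # no guess: 0 points
            total += 1 if guess == truth[answer["example"]] else -1
    return total
```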
Labeled training dataset, a jsonl (JSON Lines) file containing objects with the following schema (a loading sketch follows the field list):
- text: UTF-8 extract from a Wikipedia article, cleared of HTML tags.
- lang: ISO code of the language of the extract.
- subject: the Wikipedia subject (article) the extract was taken from.
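A minimal loading sketch, assuming the file is named lang_train.json as referenced below; each decoded object exposes the three fields above.

```python
# Minimal sketch: decode the labeled training file and count records per
# language. The filename lang_train.json is taken from the description
# below; adjust it to the actual file shipped in the repository.
import json
from collections import Counter

with open("lang_train.json", encoding="utf-8") as f:
    train = [json.loads(line) for line in f]

print(Counter(record["lang"] for record in train).most_common(5))
print(train[0]["text"][:80], train[0]["lang"], train[0]["subject"])
```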
Another labeled training dataset, a jsonl file containing objects with the same schema as lang_train.json, but in which each text is only 100 or 200 characters taken from the middle of the article.
Unlabeled test dataset, a jsonl file containing objects with the following schema:
- text: 100-character-long UTF-8 extract from a Wikipedia article, cleared of HTML tags.
- example: example identifier.
Example solution, to demonstrate the expected output.
An example answer file as generated by random_solution.py, following this schema (a minimal label_articles.py skeleton is sketched after the list):
- example: example identifier.
- lang: code of the language guessed by the solution; null denotes no guess.
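A minimal skeleton for label_articles.py matching the invocation shown earlier, under the assumption that the answer file uses the same JSON Lines layout as the datasets; the placeholder classifier always abstains.

```python
# Minimal skeleton for label_articles.py: read test records on standard
# input and write one answer object per line on standard output, matching
#   python label_articles.py < test_200.json > team_n_answers.json
# Assumes the answer file is JSON Lines like the datasets; replace
# guess_language with a real classifier.
import json
import sys

def guess_language(text):
    # Placeholder: always abstain (null in the output, scored 0).
    return None

for line in sys.stdin:
    record = json.loads(line)
    answer = {"example": record["example"], "lang": guess_language(record["text"])}
    print(json.dumps(answer, ensure_ascii=False))
```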
A JSON object containing the mapping between the language ISO codes and their human-readable names.
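Loading it is a single json.load call; in the sketch below the filename lang_names.json is only a placeholder, not necessarily the repository's actual file name.

```python
# Sketch: load the code-to-name mapping. The filename lang_names.json is
# a placeholder; use the actual file shipped in the repository.
import json

with open("lang_names.json", encoding="utf-8") as f:
    lang_names = json.load(f)

print(lang_names.get("it"))  # e.g. "Italian"
print(lang_names.get("be"))  # e.g. "Belarusian"
```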
A jsonl file consists of newline-delimited JSON objects. For example:
{"lang": "it", "text": "ico del Nord...", "subject": "Atlantic_Ocean"}
{"lang": "be", "text": "га да гораду ...", "subject": "New_York_City"}
A typical way to decode those files in Python is a generator expression such as (json.loads(line) for line in open(json_l_filename)).
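A slightly more robust variant wraps that expression in a small generator function so the file is closed once iteration ends; test_200.json is used here purely as an illustration.

```python
# The one-liner above, wrapped in a function so the file handle is closed
# when iteration finishes; works for any of the .json files described here.
import json

def read_jsonl(json_l_filename):
    with open(json_l_filename, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for record in read_jsonl("test_200.json"):
    print(record["example"], record["text"][:30])
```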
More details at http://jsonlines.org/.