The objective of this dojo is to label Wikipedia articles with their language.
To achieve this, training and test data are provided in this repository, with their schemas documented below.
To get a score on the dojo leaderboard, your script has to read a test dataset on its standard input and write an answer file on its standard output, e.g.:
python label_articles.py < test_200.json > team_n_answers.json
The grading.py script is then used to compute the official dojo score.
The score adds 1 for each correct guess, -1 for each incorrect guess, and 0 for each example with no guess.
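The snippet below is only a rough sketch of this scoring rule, not the official grading.py; it assumes the answer file uses the same JSON Lines layout as the datasets, and that `truth` maps example identifiers to the correct ISO codes.

```python
# Rough sketch of the scoring rule, not the official grading.py:
# +1 per correct guess, -1 per incorrect guess, 0 when "lang" is null.
# Assumes the answer file is JSON Lines and that `truth` maps example
# identifiers to the correct ISO language codes.
import json

def score(answer_filename, truth):
    total = 0
    with open(answer_filename, encoding="utf-8") as f:
        for line in f:
            answer = json.loads(line)
            guess = answer["lang"]
            if guess is None:
                continue  # no guess: 0 points
            total += 1 if guess == truth[answer["example"]] else -1
    return total
```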
Labeled training dataset, a jsonl (JSON Lines) file containing objects with the following schema (a loading sketch follows the field list):
- text: UTF-8 extract from a Wikipedia article, cleared of HTML tags.
- lang: ISO code of the language of the extract.
- subject: the Wikipedia subject (article) the extract was taken from.
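A minimal loading sketch, assuming the file is named lang_train.json as referenced below; each decoded object exposes the three fields above.

```python
# Minimal sketch: decode the labeled training file and count records per
# language. The filename lang_train.json is taken from the description
# below; adjust it to the actual file shipped in the repository.
import json
from collections import Counter

with open("lang_train.json", encoding="utf-8") as f:
    train = [json.loads(line) for line in f]

print(Counter(record["lang"] for record in train).most_common(5))
print(train[0]["text"][:80], train[0]["lang"], train[0]["subject"])
```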
Another labeled training dataset, a jsonl file containing objects with the same schema as lang_train.json, but in which each text is only 100 or 200 characters taken from the middle of the article.
Unlabeled test dataset, a jsonl file containing objects with the following schema:
- text: 100-character-long UTF-8 extract from a Wikipedia article, cleared of HTML tags.
- example: example identifier.
Example solution, to demonstrate the expected output.
An example answer file as generated by random_solution.py, following this schema (a minimal label_articles.py skeleton is sketched after the list):
- example: example identifier.
- lang: code of the language guessed by the solution; null denotes no guess.
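A minimal skeleton for label_articles.py matching the invocation shown earlier, under the assumption that the answer file uses the same JSON Lines layout as the datasets; the placeholder classifier always abstains.

```python
# Minimal skeleton for label_articles.py: read test records on standard
# input and write one answer object per line on standard output, matching
#   python label_articles.py < test_200.json > team_n_answers.json
# Assumes the answer file is JSON Lines like the datasets; replace
# guess_language with a real classifier.
import json
import sys

def guess_language(text):
    # Placeholder: always abstain (null in the output, scored 0).
    return None

for line in sys.stdin:
    record = json.loads(line)
    answer = {"example": record["example"], "lang": guess_language(record["text"])}
    print(json.dumps(answer, ensure_ascii=False))
```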
A JSON object containing the mapping between the language ISO codes and their human-readable names.
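Loading it is a single json.load call; in the sketch below the filename lang_names.json is only a placeholder, not necessarily the repository's actual file name.

```python
# Sketch: load the code-to-name mapping. The filename lang_names.json is
# a placeholder; use the actual file shipped in the repository.
import json

with open("lang_names.json", encoding="utf-8") as f:
    lang_names = json.load(f)

print(lang_names.get("it"))  # e.g. "Italian"
print(lang_names.get("be"))  # e.g. "Belarusian"
```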
A jsonl file consists of newline-delimited JSON objects. For example:
{"lang": "it", "text": "ico del Nord...", "subject": "Atlantic_Ocean"}
{"lang": "be", "text": "га да гораду ...", "subject": "New_York_City"}
A typical way to decode those files in Python is a generator expression such as (json.loads(line) for line in open(json_l_filename)).
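A slightly more robust variant wraps that expression in a small generator function so the file is closed once iteration ends; test_200.json is used here purely as an illustration.

```python
# The one-liner above, wrapped in a function so the file handle is closed
# when iteration finishes; works for any of the .json files described here.
import json

def read_jsonl(json_l_filename):
    with open(json_l_filename, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for record in read_jsonl("test_200.json"):
    print(record["example"], record["text"][:30])
```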
More details at http://jsonlines.org/.