A human-annotated morphosyntactic treebank for Turkish.
This is not an official Google product.
name | Turkish Web Treebank |
description | A human-annotated morphosyntactic
treebank for Turkish. |
sameAs | https://github.com/google-research-datasets/turkish-treebanks |
license | http://www.apache.org/licenses/LICENSE-2.0.txt |
Turkish Web Treebank (TWT, ISLRN: 177-333-742-633-6) consists of 4,851 sentences (66,466 words, and 81,370 inflectional groups), which are manually annotated for segmentation, morphology, part-of-speech and dependency relations. It is composed of two sections: web and Wikipedia. Web section is built by sampling and annotating 2,541 sentences from a representative set of Turkish Forum, Blog, How-to, Review & Guides webpages. Wikipedia section is built by sampling a sentence from 2,310 Turkish Wikipedia pages and annotating them.
Sentences | Words | Tokens | |
---|---|---|---|
Web | 2,541 | 26,519 | 32,422 |
Wiki | 2,310 | 39,947 | 48,498 |
In terms of splits, in our experiments, we use every 9th sentence in above linked CoNLL-U format files as the development set, and every 10th sentence as the test set. All other sentences belong to the training set. We advise you to do the same for comparable results. Python API section describes the library that you can use to retrieve the splits.
Both sections of TWT is provided in CoNLL-U format in separate files. We follow the descriptions for the fields of the CoNLL-U format as they are defined in the documentation of the Universal Dependencies project, only with the following differences:
- we use the original UPOS field to specifiy the coarse part-of-speech of the tokens.
- we use the original XPOS field to specify the fine part-of-speech of the tokens.
- we do not use the DEPS field as we do not provide enhanced dependency graph annotations, therefore we use "_" to mark the value of this field for all tokens.
- since our tokens correspond to inflectional groups, we only list the lemmas for multi inflectional group words in the LEMMA field of the first inflectional group; we mark the LEMMA field of all other inflectional groups of such words with "_".
Below is an example annotation for the sentence "Üst öğrenimi bitirmenin sağladığı haklar..." in CoNNL-U format as we use it.
ID | FORM | LEMMA | CPOS | FPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | Üst | üst | ADJ | JJ | Proper=False | 2 | amod | _ | _ |
2 | öğrenimi | öğrenim | NOUN | NN | PersonNumber=A3sg|Possessive=Pnon|Case=Acc|Proper=False | 4 | dobj | _ | _ |
3 | bit | bit | VERB | VB | Proper=False | 4 | ig | _ | SpaceAfter=No |
4 | ir | _ | VERB | VB | Derivation=Cau|Polarity=Pos|Proper=False | 5 | ig | _ | SpaceAfter=No |
5 | menin | _ | NOUN | VN | Derivation=Nonf|PersonNumber=A3sg|Possessive=Pnon|Case=Gen|Proper=False | 7 | poss | _ | _ |
6 | sağla | sağla | VERB | VB | Polarity=Pos|Proper=False | 7 | ig | _ | SpaceAfter=No |
7 | dığı | _ | ADJ | VJ | Derivation=PastPart|Possessive=P3sg|Proper=False | 8 | rcmod | _ | _ |
8 | haklar | hak | NOUN | NN | PersonNumber=A3sg|Possessive=Pnon|Case=Bare|Proper=False | 0 | root | _ | SpaceAfter=No |
9 | ... | ... | PUNCT | . | Proper=False | 8 | p | _ | _ |
Part-of-speech and morphology layer of TWT is annotated using the Tukish morphological analyzer. You can see that repository for the full part-of-speech and morphological feature category-value tagsets and their descriptions.
The dependency layer is annotated using a label set of 44 dependency relations. Below table provides the descriptions for the dependency relations that are used in annotating the TWT.
Label | Description |
---|---|
ROOT | root of the sentence |
acomp | adjectival complement |
advcl | adverbial clause |
advmod | adverbial modifier |
amod | adjectival modifier of NP |
appos | appositional modifier of NP |
attr | attribute dependent of a copular verb |
aux | auxiliary verb |
cc | coordinating conjunction |
ccomp | clausal complement of a verb or adjective |
clas | classifier |
conj | conjunct |
csubj | clausal subject |
det | determiner |
discourse | interjections and other discourse elements |
dislocated | dislocated elements |
dobj | direct object |
goeswith | parts of a word that were mistokenized |
ig | inflectional group |
iobj | indirect object |
list | list for chains of comparable items |
mark | complementizer (words introducing finite subordinate clause) |
mwe | multiword expression |
narg | argument of a nominal |
neg | negation |
nn | nominal modifier |
npadvmod | noun phrase used as an adverbial modifier of a verb |
nsubj | nominal subject |
num | numeric modifier of a noun |
number | element of compound number |
p | punctuation |
parataxis | parataxis |
pcomp | clausal complement of postposition |
pobj | object of postposition |
poss | possessive modifier |
preconj | preconjuct |
predet | predeterminer |
prep | postposition |
prt | particle |
rcmod | relative clause modifier |
remnant | ellipsis |
tmod | temporal modifier |
vocative | vocative |
xcomp | open clausal complement |
Following the Universal Dependencies annotation scheme, we also provide shallow segmentation annotations as miscellaneous features on tokens which are not whitespace segmented from the following ones in source text. We mark them with "SpaceAfter=No" feature category-value pair.
Together with the dataset we also provide a Python API that can be used to read annotated sentences (per web or Wikipedia sections and/or "train", "dev", "test" splits).
If you are using Bazel, you can depend on this repository as an external dependency of your project by adding the following to your WORKSPACE file:
git_repository(
name = "google_research_turkish_treebanks",
remote = "https://github.com/google-research-datasets/turkish-treebanks.git",
tag = "{version-tag}",
)
Then, you can simply use
@google_research_turkish_treebanks//turkish_treebanks:read
as a
dependecy of your relevant py_library
or py_binary
BUILD targets.
The API is also available on PyPi. To install the latest release from PyPi, run:
python3 -m pip install turkish-treebanks
To install from source, run below from the project root directory (preferably within a Python virtual environment):
bazel build //...
bazel-bin/setup install
To build and run the tools install Bazel version 5.3.2 and Python 3.10. All other intrinsic dependencies will be imported, built and taken care of by Bazel according to the WORKSPACE setup. If you are installing from PyPi, you need pip.
If you use or discuss this dataset in your work, please cite:
Kayadelen, T., Öztürel, A. & Bohnet, B. (2020). A Gold Standard Dependency Treebank for Turkish. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020).
@inproceedings{kayadelen-ozturel-bohnet:2020:LREC,
author = {Kayadelen, Tolga and \"{O}zt\"{u}rel, Adnan and Bohnet, Bernd},
title = {A Gold Standard Dependency Treebank for Turkish},
booktitle = {Proceedings of The 12th Language Resources and Evaluation
Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {5158--5165},
url = {https://www.aclweb.org/anthology/2020.lrec-1.634}
}
If you have a technical question regarding the dataset, code or publication, please create an issue in this repository.
Unless otherwise noted, all original files are licensed under Apache License Version 2.0.