Skip to content

Latest commit

 

History

History
239 lines (200 loc) · 11.8 KB

README.md

File metadata and controls

239 lines (200 loc) · 11.8 KB

Turkish Treebanks

A human-annotated morphosyntactic treebank for Turkish.

This is not an official Google product.

Dataset Metadata

name Turkish Web Treebank
description A human-annotated morphosyntactic treebank for Turkish.
sameAs https://github.com/google-research-datasets/turkish-treebanks
license http://www.apache.org/licenses/LICENSE-2.0.txt

Dataset Description

Turkish Web Treebank (TWT, ISLRN: 177-333-742-633-6) consists of 4,851 sentences (66,466 words, and 81,370 inflectional groups), which are manually annotated for segmentation, morphology, part-of-speech and dependency relations. It is composed of two sections: web and Wikipedia. Web section is built by sampling and annotating 2,541 sentences from a representative set of Turkish Forum, Blog, How-to, Review & Guides webpages. Wikipedia section is built by sampling a sentence from 2,310 Turkish Wikipedia pages and annotating them.

Sentences Words Tokens
Web 2,541 26,519 32,422
Wiki 2,310 39,947 48,498

In terms of splits, in our experiments, we use every 9th sentence in above linked CoNLL-U format files as the development set, and every 10th sentence as the test set. All other sentences belong to the training set. We advise you to do the same for comparable results. Python API section describes the library that you can use to retrieve the splits.

Data format

Both sections of TWT is provided in CoNLL-U format in separate files. We follow the descriptions for the fields of the CoNLL-U format as they are defined in the documentation of the Universal Dependencies project, only with the following differences:

  • we use the original UPOS field to specifiy the coarse part-of-speech of the tokens.
  • we use the original XPOS field to specify the fine part-of-speech of the tokens.
  • we do not use the DEPS field as we do not provide enhanced dependency graph annotations, therefore we use "_" to mark the value of this field for all tokens.
  • since our tokens correspond to inflectional groups, we only list the lemmas for multi inflectional group words in the LEMMA field of the first inflectional group; we mark the LEMMA field of all other inflectional groups of such words with "_".

Below is an example annotation for the sentence "Üst öğrenimi bitirmenin sağladığı haklar..." in CoNNL-U format as we use it.

ID FORM LEMMA CPOS FPOS FEATS HEAD DEPREL DEPS MISC
1 Üst üst ADJ JJ Proper=False 2 amod _ _
2 öğrenimi öğrenim NOUN NN PersonNumber=A3sg|Possessive=Pnon|Case=Acc|Proper=False 4 dobj _ _
3 bit bit VERB VB Proper=False 4 ig _ SpaceAfter=No
4 ir _ VERB VB Derivation=Cau|Polarity=Pos|Proper=False 5 ig _ SpaceAfter=No
5 menin _ NOUN VN Derivation=Nonf|PersonNumber=A3sg|Possessive=Pnon|Case=Gen|Proper=False 7 poss _ _
6 sağla sağla VERB VB Polarity=Pos|Proper=False 7 ig _ SpaceAfter=No
7 dığı _ ADJ VJ Derivation=PastPart|Possessive=P3sg|Proper=False 8 rcmod _ _
8 haklar hak NOUN NN PersonNumber=A3sg|Possessive=Pnon|Case=Bare|Proper=False 0 root _ SpaceAfter=No
9 ... ... PUNCT . Proper=False 8 p _ _

Annotations

Part-of-speech and morphology layer of TWT is annotated using the Tukish morphological analyzer. You can see that repository for the full part-of-speech and morphological feature category-value tagsets and their descriptions.

The dependency layer is annotated using a label set of 44 dependency relations. Below table provides the descriptions for the dependency relations that are used in annotating the TWT.

Label Description
ROOT root of the sentence
acomp adjectival complement
advcl adverbial clause
advmod adverbial modifier
amod adjectival modifier of NP
appos appositional modifier of NP
attr attribute dependent of a copular verb
aux auxiliary verb
cc coordinating conjunction
ccomp clausal complement of a verb or adjective
clas classifier
conj conjunct
csubj clausal subject
det determiner
discourse interjections and other discourse elements
dislocated dislocated elements
dobj direct object
goeswith parts of a word that were mistokenized
ig inflectional group
iobj indirect object
list list for chains of comparable items
mark complementizer (words introducing finite subordinate clause)
mwe multiword expression
narg argument of a nominal
neg negation
nn nominal modifier
npadvmod noun phrase used as an adverbial modifier of a verb
nsubj nominal subject
num numeric modifier of a noun
number element of compound number
p punctuation
parataxis parataxis
pcomp clausal complement of postposition
pobj object of postposition
poss possessive modifier
preconj preconjuct
predet predeterminer
prep postposition
prt particle
rcmod relative clause modifier
remnant ellipsis
tmod temporal modifier
vocative vocative
xcomp open clausal complement

Following the Universal Dependencies annotation scheme, we also provide shallow segmentation annotations as miscellaneous features on tokens which are not whitespace segmented from the following ones in source text. We mark them with "SpaceAfter=No" feature category-value pair.

Python API

Together with the dataset we also provide a Python API that can be used to read annotated sentences (per web or Wikipedia sections and/or "train", "dev", "test" splits).

If you are using Bazel, you can depend on this repository as an external dependency of your project by adding the following to your WORKSPACE file:

git_repository(
  name = "google_research_turkish_treebanks",
  remote = "https://github.com/google-research-datasets/turkish-treebanks.git",
  tag = "{version-tag}",
)

Then, you can simply use @google_research_turkish_treebanks//turkish_treebanks:read as a dependecy of your relevant py_library or py_binary BUILD targets.

The API is also available on PyPi. To install the latest release from PyPi, run:

python3 -m pip install turkish-treebanks

To install from source, run below from the project root directory (preferably within a Python virtual environment):

bazel build //...
bazel-bin/setup install

Requirements

To build and run the tools install Bazel version 5.3.2 and Python 3.10. All other intrinsic dependencies will be imported, built and taken care of by Bazel according to the WORKSPACE setup. If you are installing from PyPi, you need pip.

Citing

If you use or discuss this dataset in your work, please cite:

Kayadelen, T., Öztürel, A. & Bohnet, B. (2020). A Gold Standard Dependency Treebank for Turkish. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020).

@inproceedings{kayadelen-ozturel-bohnet:2020:LREC,
  author = {Kayadelen, Tolga  and  \"{O}zt\"{u}rel, Adnan  and  Bohnet, Bernd},
  title = {A Gold Standard Dependency Treebank for Turkish},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation
    Conference},
  month = {May},
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
  pages = {5158--5165},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.634}
}

Contact

If you have a technical question regarding the dataset, code or publication, please create an issue in this repository.

License

Unless otherwise noted, all original files are licensed under Apache License Version 2.0.