forked from ontoportal/ncbo_annotator
Commit
Merge branch 'master' into sync-ncbo
Showing 81 changed files with 24,928 additions and 24 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
|
||
This package contains the TreeTagger, a probabilistic part-of-speech | ||
tagger developed by Helmut Schmid. All rights are reserved by the | ||
Institute for Computational Linguistics at the University of Stuttgart. | ||
The programs have been compiled for Sun Sparcstations with SunOS operating | ||
system version 5.6 or higher. | ||
|
||
Files contained in this package: | ||
|
||
- FILES this file | ||
- COPYRIGHT Copyright notice | ||
- README How to use the tagger | ||
- bin/train-tree-tagger training program | ||
- bin/tree-tagger tagger program | ||
- cmd/lookup.perl Perl script for pretagging | ||
- doc/nemlap94.ps paper describing the TreeTagger | ||
- doc/sigdat95.ps paper describing the TreeTagger | ||
|
||
This package can be downloaded at | ||
http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html | ||
|
||
Also available at this URL: | ||
- parameter files | ||
- shell scripts which convert text to the format required by the tagger | ||
- papers about the TreeTagger | ||
|
||
The shell script package should be unpacked in the same directory as the | ||
tagger package and the parameter files should be decompressed and moved | ||
to the lib subdirectory. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
|
||
************************ | ||
** TreeTagger License ** | ||
************************ | ||
|
||
1. The Institut fuer maschinelle Sprachverarbeitung, Universitaet | ||
Stuttgart, subsequently called ``the licenser'', grants you (the | ||
licensee) the rights to use the TreeTagger software subsequently | ||
called ``the system'' for evaluation, research and teaching | ||
purposes. Usage of the system for commercial purposes is forbidden. | ||
|
||
2. The licensee has no right to give or sell the system to third | ||
parties without written permission from the licenser. | ||
|
||
3. The licenser has no obligation to maintain the system. | ||
Nevertheless the licensee is encouraged to report to the licenser | ||
any problems with or suggestions for improvement of the system. | ||
|
||
4. The licenser has no obligation to make new releases available to the | ||
licensee, but where such updates are supplied they shall be governed by | ||
the terms of this agreement. | ||
|
||
NO WARRANTY | ||
|
||
5. BECAUSE THE SYSTEM IS LICENSED FREE OF CHARGE, WE PROVIDE | ||
ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE | ||
LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE LICENSER PROVIDES THE | ||
SYSTEM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR | ||
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF | ||
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK | ||
AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH THE LICENSEE. | ||
SHOULD THE SYSTEM PROVE DEFECTIVE, THE LICENSEE ASSUMES THE COST OF | ||
ALL NECESSARY SERVICING, REPAIR OR CORRECTION. | ||
|
||
6. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL THE LICENSER BE | ||
LIABLE TO THE LICENSEE FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST | ||
MONIES, OR OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING | ||
OUT OF THE USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS | ||
OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD | ||
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAM) | ||
THE PROGRAM, EVEN IF THE LICENSEE HAS BEEN ADVISED OF THE POSSIBILITY | ||
OF SUCH DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY. | ||
|
||
|
||
The wording of this license agreement has been adapted from the | ||
license of the ALF system by Michael Hanus, Max-Planck-Institut | ||
Saarbruecken and the GnuEmacs General Public License (c) 1991 Free | ||
Software Foundation. | ||
|
||
|
||
Contact Address: | ||
|
||
Helmut Schmid | ||
Institut fuer maschinelle Sprachverarbeitung (IMS) | ||
Universitaet Stuttgart | ||
Azenbergstr. 12 | ||
D-70174 Stuttgart, Germany | ||
|
||
[email protected] | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
|
||
/***************************************************************************/ | ||
/* How to use the TreeTagger */ | ||
/* Author: Helmut Schmid, University of Stuttgart, Germany */ | ||
/***************************************************************************/ | ||
|
||
|
||
The TreeTagger consists of two programs: train-tree-tagger is used to | ||
create a parameter file from a lexicon and a handtagged corpus. | ||
tree-tagger expects a parameter file and a text file as arguments and | ||
annotates the text with part-of-speech tags. The file formats are | ||
described below. By default, the programs are located in the ./bin | ||
sub-directory. | ||
|
||
If either of the programs is called without arguments, it will print | ||
information about its usage. | ||
|
||
|
||
Tagging | ||
------- | ||
|
||
Tagging is done with the tree-tagger program. It requires at least one | ||
command line argument, the parameter file. If no input file is specified, | ||
input will be read from stdin. If neither an input file nor an output file | ||
is specified, the tagger will print to stdout. | ||
|
||
tree-tagger {-options-} <parameter file> {<input file> {<output file>}} | ||
|
||
Description of the command line arguments: | ||
|
||
* <parameter file>: Name of a parameter file which was created with the | ||
train-tree-tagger program. | ||
* <input file>: Name of the file which is to be tagged. Each token in this | ||
file has to be on a separate line. Tokens may contain blanks. It is possible | ||
to override the lexical information contained in the parameter file of the | ||
tagger by specifying a list of possible tags after a token. This list has | ||
to be preceded by a tab character and the elements are separated by tab | ||
characters. This pretagging feature could be used e.g. to ensure that | ||
certain text-specific expressions are tagged properly. | ||
Punctuation marks must be on separate lines as well. Clitics (like "'s", | ||
"'re", and "'d" in English or "-la" and "-t-elle" in French) should be | ||
separated if they were separated in the training data. (The French and | ||
English parameter files available by ftp expect separation of clitics). | ||
Sample input file: | ||
He | ||
moved | ||
to | ||
New York City NP | ||
. | ||
* <output file>: Name of the file to which the tagger should write its output. | ||
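The input format described above can be produced programmatically. The following is a minimal Python sketch and is not part of the TreeTagger distribution; the function name format_tagger_input and the (token, tags) pair representation are assumptions made for illustration. It writes one token per line and appends any pretagging tags, each preceded by a tab character.

```python
def format_tagger_input(tokens):
    """Return tagger input text: one token per line, optional pretags after tabs.

    `tokens` is a list of (token, tags) pairs, where `tags` is a possibly
    empty list of pretagging tags. This representation is hypothetical.
    """
    lines = []
    for token, tags in tokens:
        if "\t" in token or "\n" in token:
            raise ValueError("token must not contain tab or newline: %r" % token)
        # Tokens may contain blanks; each pretagging tag is preceded by a tab.
        lines.append("\t".join([token] + list(tags)))
    return "".join(line + "\n" for line in lines)


# The sample input file from the text, with "New York City" pretagged as NP.
sample = [("He", []), ("moved", []), ("to", []),
          ("New York City", ["NP"]), (".", [])]
print(format_tagger_input(sample), end="")
```

Keeping the tags as a list makes it straightforward to offer several possible tags for one token, since the list elements are simply joined with tab characters.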
|
||
Further optional command line arguments: | ||
|
||
* -token: tells the tagger to print the words also. | ||
* -lemma: tells the tagger to print the lemmas of the words also. | ||
* -sgml: tells the tagger to ignore tokens starting with '<' and ending | ||
with '>' (SGML tags). | ||
* -no-unknown: If an unknown word is encountered, emit the word form | ||
as lemma. This was previously the default behaviour. Now, the default | ||
behaviour is to print "<unknown>" as lemma. | ||
* -threshold <p>: This option tells the tagger to print all tags of a | ||
word with a probability higher than <p> times the largest probability. | ||
(The tagger will use a different algorithm in this case and the set of | ||
best tags might be different from the tags generated without this | ||
option.) | ||
* -prob: Print tag probabilities (in combination with option -threshold) | ||
* -pt-with-prob: If this option is specified, then each pretagging tag | ||
(see above) has to be followed by a whitespace and a tag probability | ||
value. | ||
* -pt-with-lemma: If this option is specified, then each pretagging tag | ||
(see above) has to be followed by a whitespace and a lemma. Lemmas may | ||
contain blanks. | ||
If both -pt-with-prob and -pt-with-lemma have been specified, then each | ||
pretagging tag is followed by a probability and a lemma in that order. | ||
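The selection rule behind -threshold can be illustrated with a small Python sketch. This only shows the filtering rule stated above (keep every tag whose probability exceeds <p> times the largest probability); it does not reproduce the tagger's internal algorithm, and the function name and the probability values are made up for illustration.

```python
def tags_above_threshold(tag_probs, p):
    """Keep tags whose probability is higher than p times the largest one.

    `tag_probs` maps a tag to its probability for one word (hypothetical data).
    The result is sorted by descending probability.
    """
    if not tag_probs:
        return []
    cutoff = p * max(tag_probs.values())
    kept = [(tag, prob) for tag, prob in tag_probs.items() if prob > cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Invented probabilities for a single word.
probs = {"NN": 0.6, "VB": 0.3, "JJ": 0.1}
print(tags_above_threshold(probs, 0.4))
```

With these invented values the cutoff is 0.4 times 0.6, so only tags with a probability above 0.24 are kept.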
|
||
The options below are for advanced users. Please read the papers on the | ||
TreeTagger to fully understand their meaning. | ||
|
||
* -proto: If this option is specified, the tagger creates a file named | ||
"lexicon-protocol.txt", which contains information about the degree of | ||
ambiguity and about the other possible tags of a word form. The part of | ||
the lexicon in which the word form has been found is also indicated. 'f' | ||
means fullform lexicon and 's' means affix lexicon. 'h' means that the | ||
word contains a hyphen and that the part of the word following the | ||
hyphen has been found in the fullform lexicon. | ||
* -eps <epsilon>: Value which is used to replace zero lexical frequencies. | ||
This is the case if a word/tag pair is contained in the lexicon but not | ||
in the training corpus. The choice of this parameter has only minor | ||
influence on the tagging accuracy. | ||
* -base: If this option is specified, only lexical information is used | ||
for tagging but no contextual information about the preceding tags. | ||
This option is only useful in order to obtain a baseline result | ||
to which to compare the actual tagger output. | ||
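As a rough illustration of the calling convention above, the following Python sketch assembles the argument list for tree-tagger. It is a sketch under stated assumptions: the helper name build_tagger_command, the default program path and the file names lib/english.par and input.txt are hypothetical, and only option names listed in this README are used.

```python
def build_tagger_command(parameter_file, input_file=None, output_file=None,
                         options=(), program="bin/tree-tagger"):
    """Build the argument list: program, options, parameter file, then files.

    The program path and any file names passed in are assumptions of the
    caller; this helper only arranges them in the documented order.
    """
    if output_file is not None and input_file is None:
        # The output file is the argument after the input file, so it
        # cannot be given on its own.
        raise ValueError("an output file can only be given after an input file")
    command = [program, *options, parameter_file]
    if input_file is not None:
        command.append(input_file)
    if output_file is not None:
        command.append(output_file)
    return command


# Hypothetical file names; output goes to stdout because no output file is given.
command = build_tagger_command("lib/english.par", "input.txt",
                               options=["-token", "-lemma"])
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) from the Python standard library, provided the binary and parameter file actually exist at those paths.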
|
||
|
||
|
||
Training | ||
-------- | ||
|
||
Training is done with the *train-tree-tagger* program. It expects at least | ||
four command line arguments which are described below. | ||
|
||
train-tree-tagger {options} <lexicon> <open class file> <input file> <output file> | ||
|
||
Description of the command line arguments: | ||
|
||
* <lexicon>: name of a file which contains the fullform lexicon. Each line | ||
of the lexicon corresponds to one word form and contains the word form | ||
and a sequence of tag-lemma pairs. Each tag is preceded by a tab character | ||
and each lemma is preceded by a blank or tab character. | ||
Example: | ||
|
||
aback RB aback | ||
abacuses NNS abacus | ||
abandon VB abandon VBP abandon | ||
abandoned JJ abandoned VBD abandon VBN abandon | ||
abandoning VBG abandon | ||
|
||
Attention: Ordinal and cardinal numbers which consist of digits | ||
(like 1, 13, 1278 or 2. and 75.) should not be included in the | ||
lexicon. Otherwise, the tagger will not be able to learn how to tag | ||
numbers which are not listed in the lexicon. Numbers with unusual | ||
tags should be added to the lexicon, however. If the training | ||
program reports an error because the POS tag used for numbers is | ||
unknown, you should add a lexicon entry for one number. | ||
|
||
Remark: The tagger does not actually need the lemmas for tagging. If | ||
you do not have the lemma information or if you do not plan to | ||
annotate corpora with lemmas, you can replace the lemma with a dummy | ||
value, e.g. "-". | ||
|
||
* <open class file>: name of a file which contains a list of open class tags | ||
i.e. possible tags of unknown word forms separated by whitespace. | ||
The tagger will use this information when it encounters unknown words, | ||
i.e. words which are not contained in the lexicon. | ||
Example: (for Penn Treebank tagset) | ||
|
||
FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ | ||
|
||
* <input file>: name of a file which contains tagged training data. The data | ||
must be in one-word-per-line format. This means that each line contains | ||
one token and one tag in that order separated by a tab character. | ||
Punctuation marks are considered as tokens and must be tagged as well. | ||
The file should neither contain empty lines nor untagged SGML markup. | ||
Example: | ||
|
||
Pierre NP | ||
Vinken NP | ||
, , | ||
61 CD | ||
years NNS | ||
|
||
* <output file>: name of the file in which the resulting tagger parameters | ||
are stored. | ||
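The lexicon line format described above can be read with a few lines of Python. This is a minimal sketch, not code from the TreeTagger package: the function name parse_lexicon_line is hypothetical, and it assumes that tags themselves never contain blanks, so a blank inside a tab-separated field marks the start of the lemma.

```python
def parse_lexicon_line(line):
    """Split a lexicon line into (word form, [(tag, lemma), ...]).

    Each tag is preceded by a tab; each lemma is preceded by a blank or a
    tab. Assumption: tags contain no blanks.
    """
    fields = line.rstrip("\n").split("\t")
    word, rest = fields[0], fields[1:]
    pairs = []
    i = 0
    while i < len(rest):
        field = rest[i]
        if " " in field:
            # "TAG lemma": the lemma follows the tag after a blank.
            tag, lemma = field.split(" ", 1)
            i += 1
        else:
            # "TAG<tab>lemma": the lemma is the next tab-separated field.
            if i + 1 >= len(rest):
                raise ValueError("tag %r has no lemma" % field)
            tag, lemma = field, rest[i + 1]
            i += 2
        pairs.append((tag, lemma))
    return word, pairs


print(parse_lexicon_line("abandon\tVB abandon\tVBP abandon\n"))
```

If no lemma information is available, the dummy value "-" mentioned in the remark above parses like any other lemma.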
|
||
The following parameters are optional. Read the papers on the TreeTagger to | ||
fully understand their meaning. | ||
|
||
* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which | ||
is assigned to sentence punctuation like ".", "!", "?". | ||
Default is "SENT". It is important to set this option properly if your | ||
tag for sentence punctuation is not "SENT". | ||
* -cl <context length>: number of preceding words forming the statistical | ||
context. The default is 2 which corresponds to a trigram context. For | ||
small training corpora and/or large tagsets, it could be useful to reduce | ||
this parameter to 1. | ||
* -dtg <min. decision tree gain>: Threshold - If the information gain at a | ||
leaf node of the decision tree is below this threshold, the node is deleted. | ||
* -sw <weight>: A smoothing parameter, which determines how much the | ||
probability distribution of some decision tree node is smoothed with the | ||
probability distribution of the parent node. | ||
* -ecw <eq. class weight>: weight of the equivalence class based probability | ||
estimates. | ||
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an | ||
affix tree is below this threshold, it is deleted. The default is 1.2. | ||
|
||
The accuracy of the TreeTagger usually improves if different settings | ||
of the above parameters are tested and the best combination is chosen. | ||
|
||
|
||
Caveat: Make sure that the lexicon and the training corpus contain no | ||
extra blanks. If the word form, for instance, is followed by a blank | ||
and a tab character, the blank will be considered part of the word. | ||
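A training corpus can be checked for the problems mentioned in this README before it is passed to train-tree-tagger. The following Python sketch is one possible check, not a tool shipped with the TreeTagger; the function name find_corpus_problems and the wording of the messages are assumptions. It reports empty lines, lines without exactly one tab, and blanks directly next to the tab or at the line ends.

```python
def find_corpus_problems(lines):
    """Return (line number, message) pairs for suspicious training lines.

    Expected format: one token and one tag per line, separated by one tab,
    with no empty lines and no extra blanks around the token or the tag.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            problems.append((number, "empty line"))
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            problems.append((number, "expected exactly one tab"))
            continue
        token, tag = parts
        if not token or not tag:
            problems.append((number, "missing token or tag"))
        elif token != token.strip(" ") or tag != tag.strip(" "):
            # A blank before the tab would be treated as part of the word.
            problems.append((number, "extra blank around token or tag"))
    return problems


corpus = ["Pierre\tNP\n", "Vinken \tNP\n", "\n", "61 CD\n"]
for number, message in find_corpus_problems(corpus):
    print(number, message)
```

The same idea can be applied to the lexicon by checking the word form and each tag-lemma field for leading or trailing blanks.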
|