Merge branch 'master' into sync-ncbo
syphax-bouazzouni authored Feb 28, 2024
2 parents 63c9868 + 57204d8 commit 2d2c3af
Showing 81 changed files with 24,928 additions and 24 deletions.
4 changes: 2 additions & 2 deletions Gemfile
@@ -16,6 +16,6 @@ group :development do
 end
 
 # NCBO gems (can be from a local dev path or from rubygems/git)
-gem 'goo', github: 'ncbo/goo', branch: 'master'
-gem 'ontologies_linked_data', github: 'ncbo/ontologies_linked_data', branch: 'master'
+gem 'goo', github: 'ontoportal-lirmm/goo', branch: 'development'
 gem 'sparql-client', github: 'ncbo/sparql-client', branch: 'master'
+gem 'ontologies_linked_data', github: 'ontoportal-lirmm/ontologies_linked_data', branch: 'development'
30 changes: 15 additions & 15 deletions Gemfile.lock
@@ -1,7 +1,17 @@
 GIT
-  remote: https://github.com/ncbo/goo.git
-  revision: 75436fe8e387febc53e34ee31ff0e6dd837a9d3f
+  remote: https://github.com/ontoportal-lirmm/sparql-client.git
+  revision: fb4a89b420f8eb6dda5190a126b6c62e32c4c0c9
   branch: master
   specs:
+    sparql-client (1.0.1)
+      json_pure (>= 1.4)
+      net-http-persistent (= 2.9.4)
+      rdf (>= 1.0)
+
+GIT
+  remote: https://github.com/ontoportal-lirmm/goo.git
+  revision: dd9681d91873341850df5b49a0e7b8dd87a3d252
+  branch: development
+  specs:
     goo (0.0.2)
       addressable (~> 2.8)
@@ -14,9 +24,9 @@ GIT
       uuid
 
 GIT
-  remote: https://github.com/ncbo/ontologies_linked_data.git
-  revision: ee0013f0ee23876076bff9d9258b46371ec3b248
-  branch: master
+  remote: https://github.com/ontoportal-lirmm/ontologies_linked_data.git
+  revision: fa49a53a6f14569a8ad77bcd1baa503f2380c011
+  branch: development
   specs:
     ontologies_linked_data (0.0.1)
       activesupport
@@ -33,16 +43,6 @@ GIT
       rsolr
       rubyzip
 
-GIT
-  remote: https://github.com/ncbo/sparql-client.git
-  revision: d418d56a6c9ff5692f925b45739a2a1c66bca851
-  branch: master
-  specs:
-    sparql-client (1.0.1)
-      json_pure (>= 1.4)
-      net-http-persistent (= 2.9.4)
-      rdf (>= 1.0)
-
 GEM
   remote: https://rubygems.org/
   specs:
Binary file added lib/Lemmatizer/Lemmatizer.jar
Binary file not shown.
29 changes: 29 additions & 0 deletions lib/Lemmatizer/TreeTagger/FILES
@@ -0,0 +1,29 @@

This package contains the TreeTagger, a probabilistic part-of-speech
tagger developed by Helmut Schmid. All rights are reserved by the
Institute for Computational Linguistics at the University of Stuttgart.
The programs have been compiled for Sun Sparcstations with SunOS operating
system version 5.6 or higher.

Files contained in this package:

- FILES                   this file
- COPYRIGHT               Copyright notice
- README                  How to use the tagger
- bin/train-tree-tagger   training program
- bin/tree-tagger         tagger program
- cmd/lookup.perl         Perl script for pretagging
- doc/nemlap94.ps         paper describing the TreeTagger
- doc/sigdat95.ps         paper describing the TreeTagger

This package can be downloaded at
http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html

Also available at this URL:
- parameter files
- shell scripts which convert text to the format required by the tagger
- papers about the TreeTagger

The shell script package should be unpacked in the same directory as the
tagger package and the parameter files should be decompressed and moved
to the lib subdirectory.
61 changes: 61 additions & 0 deletions lib/Lemmatizer/TreeTagger/LICENSE
@@ -0,0 +1,61 @@

************************
** TreeTagger License **
************************

1. The Institut fuer maschinelle Sprachverarbeitung, Universitaet
Stuttgart, subsequently called ``the licenser'', grants you (the
licensee) the rights to use the TreeTagger software subsequently
called ``the system'' for evaluation, research and teaching
purposes. Usage of the system for commercial purposes is forbidden.

2. The licensee has no right to give or sell the system to third
parties without written permission from the licenser.

3. The licenser has no obligation to maintain the system.
Nevertheless the licensee is encouraged to report to the licenser
any problems with or suggestions for improvement of the system.

4. The licenser has no obligation to make new releases available to the
licensee, but were such updates are supplied they shall be governed by
the terms of this agreement.

NO WARRANTY

5. BECAUSE THE SYSTEM IS LICENSED FREE OF CHARGE, WE PROVIDE
ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE
LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE LICENSER PROVIDES THE
SYSTEM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK
AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH THE LICENSEE.
SHOULD THE SYSTEM PROVE DEFECTIVE, THE LICENSEE ASSUMES THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

6. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL THE LICENSER BE
LIABLE TO THE LICENSEE FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST
MONIES, OR OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS
OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAM)
THE PROGRAM, EVEN IF THE LICENSEE HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY.


The wording of this license agreement has been adapted from the
license of the ALF system by Michael Hanus, Max-Planck-Institut
Saarbruecken and the GnuEmacs General Public License (c) 1991 Free
Software Foundation.


Contact Adress:

Helmut Schmid
Institut fuer maschinelle Sprachverarbeitung (IMS)
Universitaet Stuttgart
Azenbergstr. 12
D-70174 Stuttgart, Germany

[email protected]


184 changes: 184 additions & 0 deletions lib/Lemmatizer/TreeTagger/README
@@ -0,0 +1,184 @@

/***************************************************************************/
/* How to use the TreeTagger */
/* Author: Helmut Schmid, University of Stuttgart, Germany */
/***************************************************************************/


The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a handtagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.

If either of the programs is called without arguments, it will print
information about its usage.


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If neither an input file nor an output file
is specified, the tagger will print to stdout.

tree-tagger {-options-} <parameter file> {<input file> {<output file>}}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
file has to be on a separate line. Tokens may contain blanks. It is possible
to override the lexical information contained in the parameter file of the
tagger by specifying a list of possible tags after a token. This list has
to be preceded by a tab character and the elements are separated by tab
characters. This pretagging feature could be used e.g. to ensure that
certain text-specific expressions are tagged properly.
Punctuation marks must be on separate lines as well. Clitics (like "'s",
"'re", and "'d" in English or "-la" and "-t-elle" in French) should be
separated if they were separated in the training data. (The French and
English parameter files available by ftp expect separation of clitics).
Sample input file:
He
moved
to
New York City NP
.
* <output file>: Name of the file to which the tagger should write its output.
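The input format and argument order described above can be sketched in Ruby. This is only an illustration under stated assumptions: the helper names `tagger_input` and `tagger_command` and the file names `english.par` and `input.txt` are hypothetical, and the sketch only builds the input text and the argument list without running the tagger. The -token and -lemma options and the ./bin location are taken from this README.

```ruby
# Build one-token-per-line input. Each entry is [token] or [token, [tags]];
# pretagging tags follow the token, each preceded by a tab character.
def tagger_input(entries)
  entries.map { |token, pretags|
    ([token] + Array(pretags)).join("\t")
  }.join("\n") + "\n"
end

# Assemble the argument list in the order given in the usage line:
# options, parameter file, optional input file, optional output file.
# Nothing is executed here.
def tagger_command(parameter_file, input_file = nil, output_file = nil, options: [])
  raise ArgumentError, 'an output file requires an input file' if output_file && input_file.nil?
  ['bin/tree-tagger', *options, parameter_file, input_file, output_file].compact
end

# The sample input file from above, with "New York City" pretagged as NP.
entries = [['He'], ['moved'], ['to'], ['New York City', ['NP']], ['.']]
text = tagger_input(entries)

# Hypothetical parameter and input file names.
cmd = tagger_command('english.par', 'input.txt', options: ['-token', '-lemma'])
```

The resulting array could be passed to `Open3.capture2(*cmd)` from Ruby's standard library once the files exist. Passing an array rather than a single string avoids shell quoting problems with file names that contain blanks.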

Further optional command line arguments:

* -token: tells the tagger to print the words also.
* -lemma: tells the tagger to print the lemmas of the words also.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
with '>' (SGML tags).
* -no-unknown: If an unknown word is encountered, emit the word form
as lemma. This was previously the default behaviour. Now, the default
behaviour is to print "<unknown>" as lemma.
* -threshold <p>: This option tells the tagger to print all tags of a
word with a probability higher than <p> times the largest probability.
(The tagger will use a different algorithm in this case and the set of
best tags might be different from the tags generated without this
option.)
* -prob: Print tag probabilities (in combination with option -threshold)
* -pt-with-prob: If this option is specified, then each pretagging tag
(see above) has to be followed by a whitespace and a tag probability
value.
* -pt-with-lemma: If this option is specified, then each pretagging tag
(see above) has to be followed by a whitespace and a lemma. Lemmas may
contain blanks.
If both -pt-with-prob and -pt-with-lemma have been specified, then each
pretagging tag is followed by a probability and a lemma in that order.
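The selection rule behind -threshold can be illustrated with a small Ruby sketch. It only shows the comparison "higher than <p> times the largest probability" on a hypothetical set of tag probabilities; it is not the tagger's own algorithm, which is described in the TreeTagger papers. The function name and the example values are assumptions.

```ruby
# Keep every tag whose probability is higher than p times the largest
# probability among the candidate tags of one word.
def tags_above_threshold(tag_probs, p)
  return {} if tag_probs.empty?

  cutoff = p * tag_probs.values.max
  tag_probs.select { |_tag, prob| prob > cutoff }
end

# Hypothetical probabilities for one word form.
probs = { 'NN' => 0.6, 'VB' => 0.3, 'JJ' => 0.05 }
kept = tags_above_threshold(probs, 0.1)
```

With these values the cutoff is 0.1 times 0.6, so NN and VB are kept and JJ is dropped.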

The options below are for advanced users. Please read the papers on the
TreeTagger to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
"lexicon-protocol.txt", which contains information about the degree of
ambiguity and about the other possible tags of a word form. The part of
the lexicon in which the word form has been found is also indicated. 'f'
means fullform lexicon and 's' means affix lexicon. 'h' means that the
word contains a hyphen and that the part of the word following the
hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
This is the case if a word/tag pair is contained in the lexicon but not
in the training corpus. The choice of this parameter has only minor
influence on the tagging accuracy.
* -base: If this option is specified, only lexical information is used
for tagging but no contextual information about the preceding tags.
This option is only useful in order to obtain a baseline result
to which to compare the actual tagger output.



Training
--------

Training is done with the *train-tree-tagger* program. It expects at least
four command line arguments which are described below.

train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
of the lexicon corresponds to one word form and contains the word form
and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
and each lemma is preceded by a blank or tab character.
Example:

aback RB aback
abacuses NNS abacus
abandon VB abandon VBP abandon
abandoned JJ abandoned VBD abandon VBN abandon
abandoning VBG abandon

Attention: Ordinal and cardinal numbers which consist of digits
(like 1, 13, 1278 or 2. and 75.) should not be included in the
lexicon. Otherwise, the tagger will not be able to learn how to tag
numbers which are not listed in the lexicon. Numbers with unusual
tags should be added to the lexicon, however. If the training
program reports an error because the POS tag used for numbers is
unknown, you should add a lexicon entry for one number.

Remark: The tagger does not actually need the lemmata for tagging. If
you do not have the lemma information or if you do not plan to
annotate corpora with lemmas, you can replace each lemma with a dummy
value, e.g. "-".

* <open class file>: name of a file which contains a list of open class tags
i.e. possible tags of unknown word forms separated by whitespace.
The tagger will use this information when it encounters unknown words,
i.e. words which are not contained in the lexicon.
Example: (for Penn Treebank tagset)

FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ

* <input file>: name of a file which contains tagged training data. The data
must be in one-word-per-line format. This means that each line contains
one token and one tag, in that order, separated by a tab character.
Punctuation marks are considered as tokens and must be tagged as well.
The file should neither contain empty lines nor untagged SGML markup.
Example:

Pierre NP
Vinken NP
, ,
61 CD
years NNS

* <output file>: name of the file in which the resulting tagger parameters
are stored.
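The lexicon line format described above can be read with a few lines of Ruby. This is a sketch, assuming that the word form is followed by a tab and that each lemma is separated from its tag by either a blank or a tab, as the description states; the function name `parse_lexicon_line` is hypothetical.

```ruby
# Parse one lexicon line into [word_form, [[tag, lemma], ...]].
# Each tag is preceded by a tab; each lemma is preceded by a blank or a tab.
def parse_lexicon_line(line)
  word, *fields = line.chomp.split("\t")
  pairs = []
  until fields.empty?
    field = fields.shift
    # "VB abandon" carries tag and lemma in one field (blank-separated).
    tag, lemma = field.split(' ', 2)
    # Otherwise the lemma is the next tab-separated field.
    lemma ||= fields.shift
    pairs << [tag, lemma]
  end
  [word, pairs]
end

entry = parse_lexicon_line("abandon\tVB abandon\tVBP abandon\n")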

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
is assigned to sentence punctuation like ".", "!", "?".
Default is "SENT". It is important to set this option properly if your
tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
context. The default is 2 which corresponds to a trigram context. For
small training corpora and/or large tagsets, it could be useful to reduce
this parameter to 1.
* -dtg <min. decision tree gain>: Threshold - If the information gain at a
leaf node of the decision tree is below this threshold, the node is deleted.
* -sw <weight>: A smoothing parameter, which determines how much the
probability distribution of some decision tree node is smoothed with the
probability distribution of the parent node.
* -ecw <eq. class weight>: weight of the equivalence class based probability
estimates.
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an
affix tree is below this threshold, it is deleted. The default is 1.2.

The accuracy of the TreeTagger usually improves if different settings
of the above parameters are tested and the best combination is chosen.


Caveat: Make sure that the lexicon and the training corpus contain no
extra blanks. If the word form, for instance, is followed by a blank
and a tab character, the blank will be considered part of the word.
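A quick check for the situation described in the caveat could look like the following Ruby sketch. It is a heuristic under stated assumptions, not part of the TreeTagger distribution: it flags lines where a blank sits directly next to a tab character or at the start or end of the line, and the function name is hypothetical.

```ruby
# Return the 1-based numbers of lines that contain a blank directly before
# or after a tab character, or a leading or trailing blank.
def lines_with_stray_blanks(text)
  text.each_line.with_index(1).select { |line, _number|
    line.chomp.match?(/ \t|\t | \z|\A /)
  }.map { |_line, number| number }
end

# Hypothetical training data; the second line has a blank before the tab.
sample = "Pierre\tNP\nVinken \tNP\n61\tCD\n"
suspicious = lines_with_stray_blanks(sample)
```

Blanks inside a token or between a tag and its lemma are not flagged, since both are allowed by the formats described above.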
