Merge branch 'master' into sync-ncbo

ontoportal-lirmm · Feb 28, 2024 · 2d2c3af · 2d2c3af
2 parents 63c9868 + 57204d8
commit 2d2c3af
Show file tree

Hide file tree

Showing 81 changed files with 24,928 additions and 24 deletions.
diff --git a/Gemfile b/Gemfile
@@ -16,6 +16,6 @@ group :development do
 end
 
 # NCBO gems (can be from a local dev path or from rubygems/git)
-gem 'goo', github: 'ncbo/goo', branch: 'master'
-gem 'ontologies_linked_data', github: 'ncbo/ontologies_linked_data', branch: 'master'
+gem 'goo', github: 'ontoportal-lirmm/goo', branch: 'development'
 gem 'sparql-client', github: 'ncbo/sparql-client', branch: 'master'
+gem 'ontologies_linked_data', github: 'ontoportal-lirmm/ontologies_linked_data', branch: 'development'
diff --git a/Gemfile.lock b/Gemfile.lock
@@ -1,7 +1,17 @@
 GIT
-  remote: https://github.com/ncbo/goo.git
-  revision: 75436fe8e387febc53e34ee31ff0e6dd837a9d3f
+  remote: https://github.com/ontoportal-lirmm/sparql-client.git
+  revision: fb4a89b420f8eb6dda5190a126b6c62e32c4c0c9
   branch: master
+  specs:
+    sparql-client (1.0.1)
+      json_pure (>= 1.4)
+      net-http-persistent (= 2.9.4)
+      rdf (>= 1.0)
+
+GIT
+  remote: https://github.com/ontoportal-lirmm/goo.git
+  revision: dd9681d91873341850df5b49a0e7b8dd87a3d252
+  branch: development
   specs:
     goo (0.0.2)
       addressable (~> 2.8)
@@ -14,9 +24,9 @@ GIT
       uuid
 
 GIT
-  remote: https://github.com/ncbo/ontologies_linked_data.git
-  revision: ee0013f0ee23876076bff9d9258b46371ec3b248
-  branch: master
+  remote: https://github.com/ontoportal-lirmm/ontologies_linked_data.git
+  revision: fa49a53a6f14569a8ad77bcd1baa503f2380c011
+  branch: development
   specs:
     ontologies_linked_data (0.0.1)
       activesupport
@@ -33,16 +43,6 @@ GIT
       rsolr
       rubyzip
 
-GIT
-  remote: https://github.com/ncbo/sparql-client.git
-  revision: d418d56a6c9ff5692f925b45739a2a1c66bca851
-  branch: master
-  specs:
-    sparql-client (1.0.1)
-      json_pure (>= 1.4)
-      net-http-persistent (= 2.9.4)
-      rdf (>= 1.0)
-
 GEM
   remote: https://rubygems.org/
   specs:

diff --git a/lib/Lemmatizer/Lemmatizer.jar b/lib/Lemmatizer/Lemmatizer.jar
diff --git a/lib/Lemmatizer/TreeTagger/FILES b/lib/Lemmatizer/TreeTagger/FILES
@@ -0,0 +1,29 @@
+
+This package contains the TreeTagger, a probabilistic part-of-speech
+tagger developed by Helmut Schmid. All rights are reserved by the 
+Institute for Computational Linguistics at the University of Stuttgart.
+The programs have been compiled for Sun Sparcstations with SunOS operating
+system version 5.6 or higher. 
+
+Files contained in this package:
+
+- FILES                 this file
+- COPYRIGHT             Copyright notice
+- README                How to use the tagger
+- bin/train-tree-tagger training program
+- bin/tree-tagger       tagger programm
+- cmd/lookup.perl       Perl script for pretagging
+- doc/nemlap94.ps       paper describing the TreeTagger
+- doc/sigdat95.ps       paper describing the TreeTagger
+
+This package can be downloaded at 
+http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html
+
+Also available at this URL:
+- parameter files
+- shell scripts which convert text to the format required by the tagger
+- papers about the TreeTagger
+
+The shell script package should be unpacked in the same directory as the
+tagger package and the parameter files should be decompressed and moved
+to the lib subdirectory.
diff --git a/lib/Lemmatizer/TreeTagger/LICENSE b/lib/Lemmatizer/TreeTagger/LICENSE
@@ -0,0 +1,61 @@
+
+                  ************************
+                  ** TreeTagger License **
+                  ************************
+
+1. The Institut fuer maschinelle Sprachverarbeitung, Universitaet
+   Stuttgart, subsequently called ``the licenser'', grants you (the
+   licensee) the rights to use the TreeTagger software subsequently
+   called ``the system'' for evaluation, research and teaching
+   purposes. Usage of the system for commercial purposes is forbidden.
+
+2. The licensee has no right to give or sell the system to third
+   parties without written permission from the licenser.
+
+3. The licenser has no obligation to maintain the system.
+   Nevertheless the licensee is encouraged to report to the licenser
+   any problems with or suggestions for improvement of the system.
+
+4. The licenser has no obligation to make new releases available to the
+   licensee, but were such updates are supplied they shall be governed by
+   the terms of this agreement.
+
+			   NO WARRANTY
+
+5. BECAUSE THE SYSTEM IS LICENSED FREE OF CHARGE, WE PROVIDE
+ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE
+LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE LICENSER PROVIDES THE
+SYSTEM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR
+IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK
+AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH THE LICENSEE.
+SHOULD THE SYSTEM PROVE DEFECTIVE, THE LICENSEE ASSUMES THE COST OF
+ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+6. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL THE LICENSER BE
+LIABLE TO THE LICENSEE FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST
+MONIES, OR OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS
+OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD
+PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAM)
+THE PROGRAM, EVEN IF THE LICENSEE HAS BEEN ADVISED OF THE POSSIBILITY
+OF SUCH DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY.
+
+
+The wording of this license agreement has been adapted from the
+license of the ALF system by Michael Hanus, Max-Planck-Institut
+Saarbruecken and the GnuEmacs General Public License (c) 1991 Free
+Software Foundation. 
+
+
+Contact Adress:
+
+Helmut Schmid
+Institut fuer maschinelle Sprachverarbeitung (IMS) 
+Universitaet Stuttgart
+Azenbergstr. 12
+D-70174 Stuttgart, Germany
+
+[email protected]
+
+
diff --git a/lib/Lemmatizer/TreeTagger/README b/lib/Lemmatizer/TreeTagger/README
@@ -0,0 +1,184 @@
+
+/***************************************************************************/
+/* How to use the TreeTagger                                               */
+/* Author: Helmut Schmid, University of Stuttgart, Germany                 */
+/***************************************************************************/
+
+
+The TreeTagger consists of two programs: train-tree-tagger is used to 
+create a parameter file from a lexicon and a handtagged corpus. 
+tree-tagger expects a parameter file and a text file as arguments and
+annotates the text with part-of-speech tags. The file formats are
+described below. By default, the programs are located in the ./bin
+sub-directory.
+
+If either of the programs is called without arguments, it will print
+information about its usage.
+
+
+Tagging
+-------
+
+Tagging is done with the tree-tagger program. It requires at least one
+command line argument, the parameter file. If no input file is specified,
+input will be read from stdin. If neither an input file nor an output file
+is specified, the tagger will print to stdout.
+
+tree-tagger {-options-} <parameter file> {<input file> {<output file>}}
+
+Description of the command line arguments:
+
+* <parameter file>: Name of a parameter file which was created with the 
+  train-tree-tagger program.
+* <input file>: Name of the file which is to be tagged. Each token in this 
+  file has to be on a separate line. Tokens may contain blanks. It is possible
+  to override the lexical information contained in the parameter file of the
+  tagger by specifying a list of possible tags after a token. This list has
+  to be preceded by a tab character and the elements are separated by tab 
+  characters. This pretagging feature could be used e.g. to ensure that
+  certain text-specific expressions are tagged properly.
+  Punctuation marks must be on separate lines as well. Clitics (like "'s",
+  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
+  separated if they were separated in the training data. (The French and
+  English parameter files available by ftp expect separation of clitics).
+  Sample input file:
+    He
+    moved
+    to
+    New York City	NP
+    .
+* <output file>: Name of the file to which the tagger should write its output.
+
+Further optional command line arguments:
+
+* -token: tells the tagger to print the words also.
+* -lemma: tells the tagger to print the lemmas of the words also.
+* -sgml: tells the tagger to ignore tokens starting with '<' and ending
+  with '>' (SGML tags).
+- -no-unknown: If an unknown word is encountered, emit the word form
+  as lemma. This was previously the default behaviour. Now, the default 
+  behaviour is to print "<unknown>" as lemma.
+- -threshold <p>: This option tells the tagger to print all tags of a
+  word with a probability higher than <p> times the largest probability.
+  (The tagger will use a different algorithm in this case and the set of
+  best tags might be different from the tags generated without this
+  option.)
+- -prob: Print tag probabilities (in combination with option -threshold)
+- -pt-with-prob: If this option is specified, then each pretagging tag
+  (see above) has to be followed by a whitespace and a tag probability 
+  value.
+- -pt-with-lemma: If this option is specified, then each pretagging tag
+  (see above) has to be followed by a whitespace and a lemma. Lemmas may 
+  contain blanks.
+  If both -pt-with-prob and -pt-with-lemma have been specified, then each
+  pretagging tag is followed by a probability and a lemma in that order.
+
+The options below are for advanced users. Please, read the papers on the 
+TreeTagger to fully understand their meaning.
+
+* -proto: If this option is specified, the tagger creates a file named
+  "lexicon-protocol.txt", which contains information about the degree of
+  ambiguity and about the other possible tags of a word form. The part of
+  the lexicon in which the word form has been found is also indicated. 'f'
+  means fullform lexicon and 's' means affix lexicon. 'h' means that the
+  word contains a hyphen and that the part of the word following the
+  hyphen has been found in the fullform lexicon.
+* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
+  This is the case if a word/tag pair is contained in the lexicon but not
+  in the training corpus. The choice of this parameter has only minor
+  influence on the tagging accuracy.
+* -base: If this option is specified, only lexical information is used
+  for tagging but no contextual information about the preceding tags.
+  This option is only useful in order to obtain a baseline result
+  to which to compare the actual tagger output.
+
+
+
+Training
+--------
+
+Training is done with the *train-tree-tagger* program. It expects at least
+four command line arguments which are described below.
+
+train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>
+
+Description of the command line arguments:
+
+* <lexicon>: name of a file which contains the fullform lexicon. Each line 
+  of the lexicon corresponds to one word form and contains the word form 
+  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
+  and each lemma is preceded by a blank or tab character.
+  Example:
+
+aback	RB aback
+abacuses	NNS abacus
+abandon	VB abandon	VBP abandon
+abandoned	JJ abandoned	VBD abandon	VBN abandon
+abandoning	VBG abandon
+
+  Attention: Ordinal and cardinal numbers which consist of digits
+  (like 1, 13, 1278 or 2. and 75.) should not be included in the
+  lexicon. Otherwise, the tagger will not be able to learn how to tag
+  numbers which are not listed in the lexicon. Numbers with unusual
+  tags should be added to the lexicon, however. If the training
+  program reports an error because the POS tag used for numbers is
+  unknown, you should add a lexicon entry for one number.
+
+  Remark: The tagger doesn't need the lemmata for tagging actually. If
+  you do not have the lemma information or if you do not plan to
+  annotate corpora with lemmas, you can replace the lemma with a dummy
+  value, e.g. "-".
+
+* <open class file>: name of a file which contains a list of open class tags
+  i.e. possible tags of unknown word forms separated by whitespace.
+  The tagger will use this information when it encounters unknown words,
+  i.e. words which are not contained in the lexicon.
+  Example: (for Penn Treebank tagset)
+
+FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ
+
+* <input file>: name of a file which contains tagged training data. The data
+  must be in one-word-per-line format. This means that each line contains 
+  one token and one tag in that order separated by a tabulator. 
+  Punctuation marks are considered as tokens and must be tagged as well.
+  The file should neither contain empty lines nor untagged SGML markup.
+  Example:
+
+Pierre  NP
+Vinken  NP
+,       ,
+61      CD
+years   NNS
+
+* <output file>: name of the file in which the resulting tagger parameters 
+  are stored.
+
+The following parameters are optional. Read the papers on the TreeTagger to 
+fully understand their meaning.
+
+* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
+  is assigned to sentence punctuation like ".", "!", "?". 
+  Default is "SENT". It is important to set this option properly, if your
+  tag for sentence punctuation is not "SENT".
+* -cl <context length>: number of preceding words forming the statistical
+  context. The default is 2 which corresponds to a trigram context. For
+  small training corpora and/or large tagsets, it could be useful to reduce
+  this parameter to 1.
+* -dtg <min. decision tree gain>: Threshold - If the information gain at a 
+  leaf node of the decision tree is below this threshold, the node is deleted.
+* -sw <weight>: A smoothing parameter, which determines how much the
+  probability distribution of some decision tree node is smoothed with the
+  probability distribution of the parent node.
+* -ecw <eq. class weight>: weight of the equivalence class based probability
+  estimates.
+* -atg <affix tree gain> Threshold - If the information gain at a leaf of an
+  affix tree is below this threshold, it is deleted. The default is 1.2.
+
+The accuracy of the TreeTagger usually improves, if different settings
+of the above parameters are tested and the best combination is chosen.
+
+
+Caveat: Make sure that the lexicon and the training corpus contain no
+extra blanks. If the word form, for instance, is followed by a blank
+and a tab character, the blank will be considered part of the word.
+