Joshua Q&A
- Two glue rules are created for each non-terminal X: S --> (X, X) and S --> (S X, S X)
- More information at [email protected]:jweese/thrax.wiki.git
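The pair of glue rules per non-terminal can be sketched as follows. This is a minimal Python illustration, not Thrax code: the function name `glue_rules` and the string layout are assumptions that simply mirror the notation used in this answer.

```python
def glue_rules(nonterminals, goal="S"):
    """Return the two glue rules for each non-terminal, in the notation above."""
    rules = []
    for nt in nonterminals:
        # S --> (X, X): start a derivation from a single subtree.
        rules.append(f"{goal} --> ({nt}, {nt})")
        # S --> (S X, S X): append one more subtree, in the same order on both sides.
        rules.append(f"{goal} --> ({goal} {nt}, {goal} {nt})")
    return rules


for rule in glue_rules(["X", "NN"]):
    print(rule)
```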
-
In order, according to the left-hand side of the rules:
- constituent, e.g. NN, JJ
- concat, e.g. A + B (double concat A+B+C is too costly in decoding)
- CCG, e.g. A/B or A\B
- X (used under various circumstances; see Jonny's GitHub check-in notes, reverse engineered from SAMT)
-
During decoding, these rules are treated equally and differ only in their features, e.g. ContainX = 0 or 1. Once X is involved, all of its parent nodes can only be X.
-
MERT works well with fewer than about 15 features; PRO is faster when there are many features.
-
For more about PRO, see the paper "Tuning as Ranking" by Mark Hopkins and Jonathan May.
-
The default of 300 is generally large enough to produce reasonable output.
-
You have to provide tuning data in order to get a reasonable model.
-
Component Features
tm_glue_0 is an indicator for the glue grammar (which contains rules like S -> S NP, and allows the decoder to just concatenate subtrees). Joshua generates prefixes for feature names: "tm_" + grammar name + "_" + feature name/index. So, tm_glue_0 refers to the first feature in the glue grammar (which is always 1 by default). This allows the decoder to learn a penalty (or preference) for using glue rules over "proper" SCFG derivations.
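The naming scheme can be written out in a few lines. This is only an illustration of the string pattern described above, not Joshua's own code; the helper name `feature_name` and the grammar name "pt" are hypothetical.

```python
def feature_name(grammar_name, feature):
    """Build a feature name as "tm_" + grammar name + "_" + feature name/index."""
    return f"tm_{grammar_name}_{feature}"


# First feature of the glue grammar.
print(feature_name("glue", 0))
# Fourth feature of a hypothetical grammar named "pt".
print(feature_name("pt", 3))
```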
-
WordPenalty: see the section on "Identity Rules".
-
Rule-level Features
-
OWNER is the "owner" of the rules in the grammar; this is used to determine which set of phrasal features applies to the grammar's rules. Having different owners allows different features to be applied to different grammars, and for grammars to share features across files.
-
Explained in more detail in the joshua.config file.
"Basically, I generated a list of adjectives etc. from the grammar I was using, and added a new rule for each that allowed for the deletion. Generally, you can take the lexical part of PPDB XL (which is what you are using, I think), grep for [RB] and [JJ] and create a new grammar with deletion rules and some deletion indicator feature. Joshua can handle multiple grammars (via the "tm" line in the config file)." - Juri
- Handled in the Joshua decoder, which implicitly creates identity rules of the form OOV --> (tok, tok), either for OOV words only or for every word. There is a flag called true_oovs_only in the pipeline (defined in ./src/joshua/decoder/chart_parser/Chart.java). These rules are created with empty rule features.
See src/joshua/decoder/chart_parser/Chart.java and src/joshua/decoder/ff/OOVPenaltyFF.java for details.
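The behaviour described in this answer can be sketched roughly as follows. This is a simplified Python illustration, not the code in Chart.java or OOVPenaltyFF.java: the `Rule` class, the function names, and the owner name "pt" are assumptions made for the example. Only the owner name "oov" and the empty feature string come from the discussion here.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    lhs: str
    source: str
    target: str
    owner: str
    sparse_features: str = ""  # OOV rules start with an empty feature string


def oov_rules(tokens, lhs="OOV"):
    """One identity rule per input token, owned by "oov", with no rule features."""
    return [Rule(lhs, tok, tok, owner="oov") for tok in tokens]


def oov_penalty(rules_used):
    """Count how often OOVPenalty fires: once per used rule whose owner is "oov"."""
    return sum(1 for rule in rules_used if rule.owner == "oov")


rules = oov_rules(["eleven", "elves"])
# A derivation that uses one OOV rule and one rule from a hypothetical grammar "pt".
derivation = [rules[0], Rule("NN", "elves", "elves", owner="pt")]
print(oov_penalty(derivation))
```

Because the feature keys on the owner rather than on a value stored in the rule, the identity rules themselves carry no features, which is the "hack" the discussion refers to.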
"In the chart, we add an OOV rule for every token in the input sentence. The rule is initialized with "" as the sparse feature string." - Jonny
"There is a feature that is triggered each time one is used, called OOVPenalty, so that is tunable. The OOVPenaltyFF feature (when activated with a line feature_function = OOVPenalty) will look at every rule used, and, if the rule's owner is "oov", will fire the feature. This is sort of a hack. As you suggest, and to be consistent, a better way (which is not currently used) to do it would be to have a feature named tm_oov_0 and a single value of "-1" in the sparse_features value when we initialize these OOV rules (which are already placed in their own grammar, created on-the-fly for each sentence, owned by "oov")." - Matt
"WordPenalty is a feature that assigns a cost to each word generated. OOVPenalty assigns a cost to OOVRules, which are identity rules that the decoder generates for every input word in case it's not part of its known vocabulary. " - Juri
- monolingual decoding problem (?)
input: "eleven"
the problem: the identity output "eleven" gets too low a score because it has no rule-level features
0 ||| 11 ||| (ROOT ([GOAL] <s> ([CD] 11) </s>)) ||| AGigaSim=0.545 CharCountDiff=-4.000
CharLogCR=-1.099 Lex(e|f)=60.049 Lex(f|e)=60.049 Lexical=1.000 LogCount=3.932 Monotonic=1.000
PhrasePenalty=1.000 SourceWords=1.000 TargetWords=1.000 WordLenDiff=-4.000 WordPenalty=-1.303
lm_0=-6.483 p(LHS|e)=0.059 p(LHS|f)=1.241 p(e|LHS)=5.356 p(e|f)=1.093 p(e|f,LHS)=0.417
p(f|LHS)=10.000 p(f|e)=4.555 p(f|e,LHS)=5.061 ||| -0.049
0 ||| elf ||| (ROOT ([GOAL] <s> ([NN] elf) </s>)) ||| AGigaSim=0.207 CharCountDiff=-3.000
CharLogCR=-0.693 GoogleNgramSim=0.067 Lex(e|f)=62.901 Lex(f|e)=62.901 Lexical=1.000 Monotonic=1.000
PhrasePenalty=1.000 RarityPenalty=0.368 SourceWords=1.000 TargetWords=1.000 WordLenDiff=-3.000
WordPenalty=-1.303 lm_0=-8.796 p(LHS|e)=1.563 p(LHS|f)=1.851 p(e|LHS)=15.246 p(e|f)=5.043 p(e|f,LHS)=6.750
p(f|LHS)=12.350 p(f|e)=1.859 p(f|e,LHS)=3.854 ||| -2.276
0 ||| eleven ||| (ROOT ([GOAL] <s> ([X] eleven) </s>)) ||| OOVPenalty=1.000 WordPenalty=-1.303 lm_0=-7.066 ||| -7.185
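To compare entries like these programmatically, the n-best lines can be parsed into feature dictionaries. The sketch below assumes each entry is on a single line with fields separated by "|||" (the listing above is wrapped for display). The function names are made up for this example, and the weights are hypothetical values for illustration only, not the tuned weights that produced the scores above.

```python
def parse_nbest_line(line):
    """Split an n-best entry into sentence id, output, tree, features, and score."""
    sent_id, output, tree, feats, score = [f.strip() for f in line.split("|||")]
    features = {}
    for pair in feats.split():
        # Feature names may contain "|" or "," (e.g. p(e|f,LHS)), so split on the last "=".
        name, value = pair.rsplit("=", 1)
        features[name] = float(value)
    return int(sent_id), output, tree, features, float(score)


def model_score(features, weights):
    """Linear model score: sum of weight * feature value (missing weights count as 0)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


line = ("0 ||| eleven ||| (ROOT ([GOAL] <s> ([X] eleven) </s>)) ||| "
        "OOVPenalty=1.000 WordPenalty=-1.303 lm_0=-7.066 ||| -7.185")
sent_id, output, tree, features, score = parse_nbest_line(line)
print(output, features, score)

# Hypothetical weights, for illustration only.
weights = {"OOVPenalty": -1.0, "WordPenalty": 0.5, "lm_0": 1.0}
print(model_score(features, weights))
```

The identity output only has three features to draw on, so its score depends almost entirely on the OOVPenalty and LM weights, whereas the other two candidates also collect the rule-level features.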
-
By default, the pipeline supports this: searching pipeline.pl for 'rerank' shows two options, but it only re-ranks using rule-level features plus LM features, etc.
-
How to add sentence-level features for re-ranking?