MWE numbering within sentence is inconsistent #42

nschneid · 2019-06-21T02:54:59Z

In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.

This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.
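To illustrate the point, here is a minimal sketch that compares two numberings of the same MWEs. The helper `mwe_groups` and the simplified representation (a dict from MWE number to token offsets) are assumptions made for this example, not the actual STREUSLE data structures.

```python
def mwe_groups(smwes, wmwes):
    """Return the MWEs as a set of (strength, token offsets), ignoring their numbers.

    smwes / wmwes are hypothetical simplified mappings: MWE number -> token offsets.
    """
    strong = {("s", tuple(sorted(toks))) for toks in smwes.values()}
    weak = {("w", tuple(sorted(toks))) for toks in wmwes.values()}
    return strong | weak


# The same two MWEs: a weak one over tokens 1-3 and a strong one over tokens 4-5.
# Numbering A puts the strong MWE first; numbering B follows token offset.
a_strong, a_weak = {1: [4, 5]}, {2: [1, 2, 3]}
b_strong, b_weak = {2: [4, 5]}, {1: [1, 2, 3]}

# The groupings are identical once the numbers are ignored...
same_semantics = mwe_groups(a_strong, a_weak) == mwe_groups(b_strong, b_weak)
# ...but the numbered representations differ.
same_surface = (a_strong, a_weak) == (b_strong, b_weak)
```

Here `same_semantics` is true while `same_surface` is false, which is the kind of superficial difference a normal form would eliminate.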

In the script for #41:

streusle/UDlextag2json.py

Lines 53 to 62 in 09014b4

    
    # Note that numbering of strong+weak MWEs doesn't follow a consistent order in the data!
    # Ordering by first token offset (tiebreaker to strong MWE):
    #xgroups = [(min(sg),'s',sg) for sg in sgroups] + [(min(wg),'w',wg) for wg in wgroups]
    # Putting all strong expressions before any weak expressions:
    xgroups = [(None,'s',sg) for sg in sgroups] + [(None,'w',wg) for wg in wgroups]
    # This means that the MWE columns are not *completely* determined by
    # the lextag in a way that matches the original data, but different MWE
    # orders does not matter semantically.
    # See also check in _postproc_sent(), which ensures that the MWE numbers
    # count from 1, but does not mandate an order.

streusle/UDlextag2json.py

Lines 124 to 129 in 09014b4

    
    # check that MWEs are numbered from 1
    # fix_mwe_numbering.py was written to correct this
    # However, this does NOT require a particular sort order of the MWEs in the sentence.
    # It just requires that they have unique numbers 1, ..., N if there are N MWEs.
    for i,(k,mwe) in enumerate(sorted(chain(sent['smwes'].items(), sent['wmwes'].items()), key=lambda x: int(x[0])), 1):
        assert int(k)==i,(sent['sent_id'],i,k,mwe)

nschneid · 2019-06-22T15:13:16Z

For a normal form, it probably makes the most sense to number MWEs in ascending order by start token, using strength only as a tiebreaker (strong before weak; note that the weak MWE's tokens will be a superset of the strong MWE's tokens). That way, if the strength of an MWE in isolation is modified, it won't require renumbering. And if the strength distinction is removed, it will mean collapsing some strong+weak combinations, but not reordering MWEs.
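The proposed normal form could be sketched roughly as follows. This is an illustration only, not the code of any script in the repository: the function name `renumber_mwes` and the simplified input (dicts from MWE number to token offsets) are assumptions made for the example.

```python
def renumber_mwes(smwes, wmwes):
    """Renumber MWEs 1..N by first token offset, strong before weak on ties.

    smwes / wmwes are hypothetical simplified mappings: MWE number -> token offsets.
    Returns new (smwes, wmwes) dicts with normalized numbers.
    """
    # Strength rank 0 (strong) sorts before 1 (weak) when start tokens are equal.
    groups = [(min(toks), 0, toks) for toks in smwes.values()]
    groups += [(min(toks), 1, toks) for toks in wmwes.values()]
    groups.sort(key=lambda g: (g[0], g[1]))

    new_smwes, new_wmwes = {}, {}
    for number, (_, strength, toks) in enumerate(groups, 1):
        target = new_smwes if strength == 0 else new_wmwes
        target[number] = toks
    return new_smwes, new_wmwes


# Input numbered strong-first: strong MWEs 1 and 2, then weak MWE 3.
# The weak MWE starts at token 2, the same token as strong MWE 1.
strong_in = {1: [2, 3], 2: [7, 8]}
weak_in = {3: [2, 3, 4]}
strong_out, weak_out = renumber_mwes(strong_in, weak_in)
```

In this example the weak MWE is renumbered from 3 to 2, because it starts at the same token as the first strong MWE and loses only the tiebreak, while the strong MWE starting at token 7 becomes number 3.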

nschneid · 2019-06-22T20:51:24Z

Numbering is renormalized in streusle.conllulex (not yet propagated to splits)

nschneid · 2019-06-23T21:00:30Z

Fully fixed in #47

nschneid mentioned this issue Jun 21, 2019

Evaluation script that unpacks lextag into remaining STREUSLE columns #41

Closed

nschneid added a commit that referenced this issue Jun 22, 2019

UDlextag2json.py: Go with token-based MWE ordering strategy per #42

048f40b

nschneid added a commit that referenced this issue Jun 22, 2019

normalize_mwe_numbering.py, which renumbers 626 MWEs in 241 sentences (…

6614b9d

…#42)

nschneid added a commit that referenced this issue Jun 22, 2019

Docs and validation to enforce normalized MWE numbering (#42)

0a3853e

nschneid closed this as completed Jun 23, 2019
