MWE numbering within sentence is inconsistent #42

Closed
nschneid opened this issue Jun 21, 2019 · 3 comments

@nschneid
Contributor

In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.

This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.

In the script for #41:

# Note that numbering of strong+weak MWEs doesn't follow a consistent order in the data!
# Ordering by first token offset (tiebreaker to strong MWE):
#xgroups = [(min(sg),'s',sg) for sg in sgroups] + [(min(wg),'w',wg) for wg in wgroups]
# Putting all strong expressions before any weak expressions:
xgroups = [(None,'s',sg) for sg in sgroups] + [(None,'w',wg) for wg in wgroups]
# This means that the MWE columns are not *completely* determined by
# the lextag in a way that matches the original data, but different MWE
# orders does not matter semantically.
# See also check in _postproc_sent(), which ensures that the MWE numbers
# count from 1, but does not mandate an order.
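To make the difference between the two orderings concrete, here is a minimal sketch. It assumes each group is simply a list of 1-based token offsets; `number_groups` and its `by_offset` flag are hypothetical names for illustration and are not part of the script.

```python
def number_groups(sgroups, wgroups, by_offset):
    """Assign MWE numbers 1..N to strong and weak groups (illustrative only).

    by_offset=True:  order by first token offset, strong before weak on ties.
    by_offset=False: all strong groups first, then all weak groups.
    Each group is assumed to be a list of 1-based token offsets.
    """
    if by_offset:
        xgroups = [(min(sg), 's', sg) for sg in sgroups] + \
                  [(min(wg), 'w', wg) for wg in wgroups]
        # 's' sorts before 'w', so strong wins a tie on the first offset
        xgroups.sort(key=lambda x: (x[0], x[1]))
    else:
        # no sort key: strong groups keep their input order, then weak groups
        xgroups = [(None, 's', sg) for sg in sgroups] + \
                  [(None, 'w', wg) for wg in wgroups]
    return {i: (strength, tuple(g))
            for i, (_, strength, g) in enumerate(xgroups, 1)}


# Same annotation, two different numberings:
sgroups = [[5, 6]]   # one strong MWE over tokens 5-6
wgroups = [[1, 3]]   # one weak MWE over tokens 1 and 3
print(number_groups(sgroups, wgroups, by_offset=True))
print(number_groups(sgroups, wgroups, by_offset=False))
```

With ordering by offset the weak MWE gets number 1 because it starts earlier; with strong-first ordering the strong MWE gets number 1. The annotation is the same in both cases, only the numbers differ.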

streusle/UDlextag2json.py

Lines 124 to 129 in 09014b4

# check that MWEs are numbered from 1
# fix_mwe_numbering.py was written to correct this
# However, this does NOT require a particular sort order of the MWEs in the sentence.
# It just requires that they have unique numbers 1, ..., N if there are N MWEs.
for i,(k,mwe) in enumerate(sorted(chain(sent['smwes'].items(), sent['wmwes'].items()), key=lambda x: int(x[0])), 1):
    assert int(k)==i,(sent['sent_id'],i,k,mwe)

@nschneid
Contributor Author

For a normal form, it probably makes the most sense to number MWEs in ascending order by start token, using strength only as a tiebreaker (strong before weak—note that weak will be a superset of strong tokens). That way if the strength of an MWE in isolation is modified it won't require renumbering. And if the strength distinction is removed, it will mean collapsing some strong+weak combinations, but not reordering MWEs.
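A rough sketch of that normal form, not the actual fix: it assumes strong and weak MWEs are given as dicts mapping an MWE number to a list of 1-based token offsets, which is a simplification of the real data structures. `renumber` is a hypothetical helper name.

```python
def renumber(smwes, wmwes):
    """Renumber MWEs 1..N by first token offset, strong before weak on ties.

    smwes / wmwes: dicts mapping an MWE number to a list of 1-based token
    offsets (an assumed, simplified representation).
    Returns new (smwes, wmwes) dicts with normalized numbers.
    """
    # (first token, strength rank, tokens); rank 0 = strong, 1 = weak
    entries = [(min(toks), 0, toks) for toks in smwes.values()]
    entries += [(min(toks), 1, toks) for toks in wmwes.values()]
    entries.sort(key=lambda e: (e[0], e[1]))

    new_s, new_w = {}, {}
    for n, (_, is_weak, toks) in enumerate(entries, 1):
        (new_w if is_weak else new_s)[n] = toks
    return new_s, new_w


# Strong MWEs over tokens 5-6 and 1-2; a weak MWE over tokens 1-3
# that contains the strong one at 1-2.
s, w = renumber({1: [5, 6], 2: [1, 2]}, {3: [1, 2, 3]})
print(s)  # {1: [1, 2], 3: [5, 6]}
print(w)  # {2: [1, 2, 3]}
```

If the MWE over tokens 5-6 is later changed from strong to weak, rerunning `renumber` still assigns it number 3, since its start token is unchanged and nothing else starts at token 5.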

@nschneid
Contributor Author

Numbering is renormalized in streusle.conllulex (not yet propagated to splits)

@nschneid
Contributor Author

Fully fixed in #47
