Either just add attested English usages for `c, `d, and `i, or import corpus examples with those annotations #142

Open
nschneid opened this issue Aug 11, 2018 · 3 comments

Comments

@nschneid

Importing the corpus examples would require heuristically recovering them from STREUSLE lexcat + POS information, because STREUSLE doesn't preserve the labels annotators originally used.

@nschneid changed the title from "Either just add attested usages for `c, `d, and `i, or import corpus examples with those annotations" to "Either just add attested English usages for `c, `d, and `i, or import corpus examples with those annotations" on Aug 11, 2018
@nschneid

(It would be easy to add `i because these correspond to lexcat INF. Is it worth having them or would they get in the way of browsing actual supersense annotations for TO and FOR?)

@nschneid

`c

A tquery.py search in STREUSLE shows that `c is easy to recover: simply look for lexcat=CC with an ADP in the lexical expression:

  • Single-word ADP
./tquery.py streusle.json lc=CC "ll!=[ ]" upos=ADP
reviews-159371-0002	me and my dad where in NJ for a kc chiefs ( my home team ) >> vs << the NY jets and for game 4 of the world series .
reviews-303922-0004	They are honest about ' immediate ' concerns >> versus << ' recommended ' repairs and have very fair prices .
  • Multiword ADP
./tquery.py streusle.json lc=CC "ll=[ ]" upos=ADP
reviews-047184-0005	We have attended A Ward Dance Centre for over a year and really enjoy the friendly and welcoming way we are taught Ballroom and Latin >> as well as << the fun filled social dance evening held every Saturday evening ....
reviews-235423-0012	I do nt know how it is possible to make orange chicken , sesame chicken and kung pao chicken >> as well as << cheese puffs taste THAT bad but China Delight accomplished that .
reviews-310018-0002	She has a great way with children and is able to capture their personality >> as well as << many spontaneous images .
reviews-335815-0002	There are a couple decent people working there , but the rest are VERY dishonest , >> as well as << rude , I have yet to hear the truth come out of their mouths .
reviews-376320-0008	I completely enjoyed my whole check in experience and was impressed with the friendliness and professionalism of the staff >> as well as << the accommodations themselves .
  • Multiword non-ADP: these should not be imported
reviews-079375-0009	I live nearly two hours away >> and yet << I will still make the drive to see them !
reviews-376320-0004	This can tend to be a stressful experience in itself >> let alone << adding crossing boarders for the first time .
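For anyone without tquery.py at hand, the `c filter above can be approximated in plain Python. This is only a sketch: it assumes each sentence in the JSON is a dict with "toks" (token dicts carrying "#" and "upos") and "swes"/"smwes" (lexical expressions carrying "lexcat" and "toknums"), and the function name `cc_adposition_candidates` and the toy sentence are made up for illustration. It follows the "ADP in the lexical expression" wording, so it accepts an ADP at any position.

```python
def cc_adposition_candidates(sentence):
    """Return (single_word, multiword) token-number lists for lexcat=CC
    expressions that contain at least one ADP token.

    Assumes the sentence layout described in the lead-in; this is not
    taken from tquery.py itself.
    """
    upos_by_num = {tok["#"]: tok["upos"] for tok in sentence["toks"]}
    single, multi = [], []
    for key, bucket in (("swes", single), ("smwes", multi)):
        for expr in sentence.get(key, {}).values():
            if expr["lexcat"] != "CC":
                continue
            if any(upos_by_num[n] == "ADP" for n in expr["toknums"]):
                bucket.append(expr["toknums"])
    return single, multi


# Hand-built toy sentence; the tags are illustrative, not copied from the corpus.
toy = {
    "toks": [
        {"#": 1, "word": "chiefs", "upos": "PROPN"},
        {"#": 2, "word": "vs", "upos": "ADP"},
        {"#": 3, "word": "jets", "upos": "PROPN"},
        {"#": 4, "word": "and", "upos": "CCONJ"},
        {"#": 5, "word": "yet", "upos": "ADV"},
    ],
    "swes": {"2": {"lexcat": "CC", "toknums": [2]}},
    "smwes": {"1": {"lexcat": "CC", "toknums": [4, 5]}},
}

single, multi = cc_adposition_candidates(toy)
print(single, multi)
```

In the toy data "vs" is kept as a single-word candidate, while "and yet" is dropped because neither token is an ADP.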

`d

For lexcat=DISC we have:

  • Single-word ADP, ADV, or SCONJ: we need a lexical whitelist of ADVs considered prepositions to exclude "next" and "too"
./tquery.py streusle.json lc=DISC "ll!=[ ]" "+upos=(ADP|ADV|SCONJ)"
reviews-004574-0006	('ADP',)	Secondly , the enchladas did not come with enchilada sauce .. but chili ... >> like << Hormel 's chili .. the cheese was American cheese !
reviews-004940-0004	('ADP',)	>> On << one order , we needed it rushed and shipped to a different state .
reviews-053248-0003	('ADP',)	The truth is , >> in << my and my dining partners ' experience , this is a fine little restaurant with some unique food .
reviews-109263-0003	('ADP',)	I had sonic in many other palces but >> for << some reason this sonic is always just covered in grease and not good ... :(
reviews-117115-0006	('ADV',)	>> Next << , they tried to force the window with a pry bar and then to break the window with a hammer .
reviews-185045-0001	('ADV',)	Rcommended by bees , >> too << !
reviews-228731-0001	('ADV',)	prepared the road test with a driving ... prepared the road test with a driving school in edmonton , but my instructor only trained me in a narrow street , >> hence << I took one 90 minute lesson from the Noble driving school to learn the skill of changing lane , and found them very friendly and professional .
reviews-268673-0005	('ADV',)	>> Besides << parking is a pain .. cramped and un-ruly with Kumon Parents next door .... gives me heebee gee bees'
reviews-294081-0006	('ADV',)	>> NEXT << , THERE SHOULD ONLY BE ONE PERSON BRINGING YOU YOUR FOOD .
reviews-333243-0014	('ADP',)	>> In << the words of my new accountant , THEY LET ME DOWN !
reviews-351561-0018	('ADP',)	17th , >> like << over by 16th and 15th YES , I say , one mile west of you .
reviews-351561-0021	('ADP',)	She says , Is that 17th >> like << over past broad .
reviews-366946-0004	('ADV',)	Filled up on too much beer and >> hence << can not comment on the food .
reviews-385281-0002	('SCONJ',)	I 'm 22 , and my hairdresser was great ( and not " old " >> like << one of the reviews says ) - she really listened to what I wanted and gave me tons of tips on how to style my hair so I could get it to look the way I wanted it to .
reviews-388799-0006	('ADP',)	I will sum it up >> with << , it was worth every penny !
  • Other single-word DISC: these are mostly NOUN ("thanks", "congratulations", "PS")

  • Multiword starting with ADP, ADV, or SCONJ: again, we need a whitelist to filter the ADVs, removing "short of that", "more often than not", "never mind", etc. "Not to mention", for which "not" is (erroneously?) tagged as ADV, is borderline: it's an infinitive verb with negation.

./tquery.py streusle.json lc=DISC "ll=[ ]" "+upos=(ADP|ADV|SCONJ)" | egrep "\('(ADP|ADV|SCONJ)', " | wc -l
      72
  • Multiword starting with PART: this includes infinitive "to" ("to boot", "to be clear") and negation ("not a _ to be found"). We should filter out the negations.
./tquery.py streusle.json lc=DISC "ll=[ ]" +upos | egrep "\('PART', " | wc -l
       7
  • Multiword not starting with ADP, ADV, SCONJ, or PART: 25 are "thank you"; another 23 are mostly NPs or VPs, and 4 of these are borderline candidates for adpositions:
reviews-196219-0002	('ADJ', 'ADP')	IT IS NOT A HIGH END STEAK HOUSE , >> MORE OF << THE CUISINE BRETT ENJOYS IN MISSISSIPPI .
reviews-290594-0008	('VERB', 'ADP')	It sounds ( >> according to << your own statement ) that they had a roomful of dogs , so they must be doing something right - and are keeping those dogs safe from potential problems .
reviews-291046-0004	('VERB', 'ADP')	>> According to << news accounts , the company is struggling .
reviews-364454-0005	('VERB', 'ADP')	>> Based on << the test results , the home inspector stated that the quality of their job was “ excellent ” .
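The `egrep ... | wc -l` counts above bucket multiword DISC expressions by the POS of their first word. A rough Python equivalent is sketched below, under the same assumed JSON layout ("toks" with "#" and "upos"; "smwes" with "lexcat" and "toknums"). The name `disc_first_upos_counts` and the toy sentences are hypothetical, and the tags are hand-assigned rather than read from the corpus.

```python
from collections import Counter


def disc_first_upos_counts(sentences):
    """Count multiword lexcat=DISC expressions by the UPOS of their first token."""
    counts = Counter()
    for sent in sentences:
        upos_by_num = {tok["#"]: tok["upos"] for tok in sent["toks"]}
        for expr in sent.get("smwes", {}).values():
            if expr["lexcat"] == "DISC":
                first = min(expr["toknums"])  # leftmost token of the expression
                counts[upos_by_num[first]] += 1
    return counts


# Hand-built toy sentences for illustration only.
toy_sentences = [
    {
        "toks": [
            {"#": 1, "word": "According", "upos": "VERB"},
            {"#": 2, "word": "to", "upos": "ADP"},
            {"#": 3, "word": "news", "upos": "NOUN"},
        ],
        "smwes": {"1": {"lexcat": "DISC", "toknums": [1, 2]}},
    },
    {
        "toks": [
            {"#": 1, "word": "to", "upos": "PART"},
            {"#": 2, "word": "boot", "upos": "NOUN"},
        ],
        "smwes": {"1": {"lexcat": "DISC", "toknums": [1, 2]}},
    },
    {
        "toks": [
            {"#": 1, "word": "thank", "upos": "VERB"},
            {"#": 2, "word": "you", "upos": "PRON"},
        ],
        "smwes": {"1": {"lexcat": "DISC", "toknums": [1, 2]}},
    },
]

counts = disc_first_upos_counts(toy_sentences)
print(dict(counts))
```

Running this over the real file should give one count per first-word POS, which makes it easy to see which buckets need a whitelist and which can be dropped outright.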

@nschneid

Summary of the above: we could consider importing

  • all lexcat=INF tokens as `i,
  • lexcat=CC tokens as `c provided that the first word is tagged as ADP,
  • lexcat=DISC tokens as `d provided that the first word is one of the following:
    • ADP
    • SCONJ
    • ADV, and lemma found in a whitelist of prepositions
    • PART, and lemma is "to"
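The rules in this summary can be written down as a single decision function. The sketch below is one possible reading, not an implementation from STREUSLE or the Xposition codebase: the name `special_label` is hypothetical, and `ADV_WHITELIST` is a placeholder, since the actual whitelist of preposition-like adverbs has not been decided in this thread.

```python
# Placeholder only: the real whitelist of preposition-like ADV lemmas is undecided.
ADV_WHITELIST = {"hence", "besides"}


def special_label(lexcat, first_upos, first_lemma, adv_whitelist=ADV_WHITELIST):
    """Map a lexical expression to "`i", "`c", "`d", or None (do not import).

    lexcat      -- STREUSLE lexcat of the expression
    first_upos  -- UPOS tag of the expression's first word
    first_lemma -- lemma of the expression's first word
    """
    if lexcat == "INF":
        return "`i"
    if lexcat == "CC":
        return "`c" if first_upos == "ADP" else None
    if lexcat == "DISC":
        if first_upos in ("ADP", "SCONJ"):
            return "`d"
        if first_upos == "ADV" and first_lemma in adv_whitelist:
            return "`d"
        if first_upos == "PART" and first_lemma == "to":
            return "`d"
    return None


print(special_label("CC", "ADP", "versus"))
print(special_label("DISC", "ADV", "next"))
print(special_label("DISC", "PART", "to"))
```

Keeping the whitelist as a parameter means "next" and "too" are excluded simply by leaving them out, and negation ("not" tagged PART) falls through to `None` without a special case.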
