Commit

Add report.pdf
Abhinav271828 committed Jul 29, 2021
1 parent b49a1a4 commit 9552dd4
Showing 2 changed files with 19 additions and 17 deletions.
36 changes: 19 additions & 17 deletions report.md
@@ -1,10 +1,12 @@
---
title: Question Generation from Hindi Stories
subtitle: Computational Linguistics – 1 | Project 2
header-includes:
- \newfontfamily\devanagarifont{Kohinoor Devanagari}
author:
- Abhinav S Menon
- Pratyaksh Gautam
- Shashwat Singh
- Abhinav S Menon (2020114001)
- Pratyaksh Gautam (2020114002)
- Shashwat Singh (2020114016)
---

# Project 2 [कष्टाध्यायी]{lang=hi}
@@ -26,6 +28,17 @@ question words.

This list of questions is then printed to an output file.

### Algorithm
We have followed the paper's algorithm in most cases; small extensions have been made in the following cases:

* Gender agreement for `k1s`: When `k1s` is an adjective, we try to guess its gender from the last matra and alter the gender of the question word [कैसा]{lang=hi} accordingly.
* Proper noun identification for `k5`: When the source is a place with a name (identified using the PoS tag `NNP`, assuming all and only place names are tagged as proper nouns), [कहाँ से]{lang=hi} is used instead of [किससे]{lang=hi}.
* Gender agreement for `r6`: The gender of the head noun is identified from the suffix on the possessive, and the gender of the question word [किसका]{lang=hi} is changed accordingly.
* `pof` relations: Substituted by [क्या]{lang=hi} in all cases.
* Adjectives, quantifiers and demonstratives for `nmod__adj`: Each case is dealt with separately; replacements are [कैसा]{lang=hi}, [कितना]{lang=hi} and [कौनसा]{lang=hi} respectively. Gender and number are identified using the same strategy as that for `k1s`.
* Coordinate clauses: When a sentence consists of two coordinate clauses (joined by [और]{lang=hi} or [तो]{lang=hi}), questions are generated from each clause separately.
* Subordinate clauses: Relative clauses are deleted before generating the questions.
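
The gender-agreement extensions for `k1s`, `r6` and `nmod__adj` can be illustrated with a minimal sketch. It is not the project's code: the function and table names are hypothetical, and it assumes regular inflection, where a final [ा]{lang=hi} marks masculine singular, [े]{lang=hi} masculine plural (it also marks the masculine oblique, which this sketch ignores) and [ी]{lang=hi} feminine.

```python
# Illustrative sketch only: all names here are hypothetical, not the project's code.

# Final vowel sign (matra) -> agreement class, assuming regular inflection.
MATRA_TO_AGREEMENT = {
    "\u093e": "m_sg",  # ा, as in अच्छा
    "\u0947": "m_pl",  # े, as in अच्छे
    "\u0940": "f",     # ी, as in अच्छी
}

AGREEMENT_TO_MATRA = {v: k for k, v in MATRA_TO_AGREEMENT.items()}


def guess_agreement(word, default="m_sg"):
    """Guess agreement from the last character of an adjective or possessive.

    Invariant forms such as लाल carry no informative matra, so they fall
    back to the default.
    """
    return MATRA_TO_AGREEMENT.get(word[-1:], default)


def inflect_question_word(question_word, agreement):
    """Swap the final ा of an inflecting question word (कैसा, कितना,
    कौनसा, किसका) for the matra matching the guessed agreement.

    Intended only for those inflecting words; the caller is assumed not
    to pass invariant words that happen to end in ा.
    """
    if not question_word.endswith("\u093e"):
        return question_word  # nothing to inflect
    return question_word[:-1] + AGREEMENT_TO_MATRA[agreement]
```

For example, `inflect_question_word("कैसा", guess_agreement("अच्छी"))` gives [कैसी]{lang=hi}, and passing the possessive [के]{lang=hi} with [किसका]{lang=hi} gives [किसके]{lang=hi}. A lookup on the last character is only a heuristic, which is why the list above says the gender is guessed.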

### Important Functions

- `tree.py`
@@ -39,24 +52,13 @@ This list of questions is then printed to an output file.
- `sentence_tokenize()`: Simply tokenize using regex.

## Analysis and Observations
We have followed the paper's algorithm in most cases; small extensions have been made in the cases of

* gender agreement for `k1s`
* proper noun identification for `k5`
* gender agreement for `r6`
* `pof` relations
* adjectives, quantifiers and demonstratives for `nmod__adj`
* coordinate and subordinate clauses



### Limitations of Parser
Some discrepancies in the output of the parser lead to correspondingly ill-formed or meaningless questions:

* Sentences were parsed as having multiple roots.
* Chunking was incorrect in some phrases.
* Sentences, especially long and complicated ones, were parsed as having multiple roots.
* Chunking was incorrect for some phrases.
* Quotes were unreliably parsed.
* Relative clauses were not parsed as dependent on the constituent they modify; instead, they were attached to the verb of the main clause.
* Rhetorical questions occur without a question mark and are therefore not eliminated from the corpus during the tokenisation process (unlike ordinary questions).
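
The last point can be illustrated with a minimal sketch of regex-based sentence tokenisation that drops sentences ending in a question mark. This is an assumption about how such a filter might look, not the project's `sentence_tokenize()`; the name `sentence_tokenize_sketch` and the choice of terminators ([।]{lang=hi}, `?`, `!`) are hypothetical.

```python
import re

# Hypothetical sketch, not the project's sentence_tokenize().
# A sentence is a run of non-terminator characters plus an optional
# terminator: the danda (।), a question mark or an exclamation mark.
SENTENCE_RE = re.compile(r"[^।?!]+[।?!]?")


def sentence_tokenize_sketch(text):
    """Split text into sentences and drop those ending in a question mark."""
    sentences = [m.group().strip() for m in SENTENCE_RE.finditer(text)]
    return [s for s in sentences if s and not s.endswith("?")]
```

With a filter of this kind, a rhetorical question that the story ends with a danda is indistinguishable from a statement, so it stays in the corpus and questions are generated from it.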

## Possible Future Versions
Some improvements that can be made to this project in the future are:
Binary file added report.pdf
Binary file not shown.
