Add report.pdf

sentient-bread · Jul 29, 2021 · 9552dd4
1 parent b49a1a4
commit 9552dd4
Showing 2 changed files with 19 additions and 17 deletions.
diff --git a/report.md b/report.md
@@ -1,10 +1,12 @@
 ---
 title: Question Generation from Hindi Stories
 subtitle: Computational Linguistics – 1 | Project 2
+header-includes:
+    - \newfontfamily\devanagarifont{Kohinoor Devanagari}
 author:
-- Abhinav S Menon
-- Pratyaksh Gautam
-- Shashwat Singh
+- Abhinav S Menon (2020114001)
+- Pratyaksh Gautam (2020114002)
+- Shashwat Singh (2020114016)
 ---
 
 # Project 2 [कष्टाध्यायी]{lang=hi}
@@ -26,6 +28,17 @@ question words.
 
 This list of questions is then printed to an output file.
 
+### Algorithm
+We have followed the paper's algorithm in most cases; small extensions have been made in the following cases:
+
+* Gender agreement for `k1s`: When `k1s` is an adjective, we try to guess its gender from the last matra and alter the gender of the question word [कैसा]{lang=hi} accordingly.
+* Proper noun identification for `k5`: When the source is a place with a name (identified using the PoS tag `NNP`, assuming all and only place names are tagged as proper nouns), [कहाँ से]{lang=hi} is used instead of [किससे]{lang=hi}.
+* Gender agreement for `r6`: The gender of the head noun is identified from the suffix on the possessive, and the gender of the question word [किसका]{lang=hi} is changed accordingly.
+* `pof` relations: Substituted by [क्या]{lang=hi} in all cases.
+* Adjectives, quantifiers and demonstratives for `nmod__adj`: Each case is dealt with separately; replacements are [कैसा]{lang=hi}, [कितना]{lang=hi} and [कौनसा]{lang=hi} respectively. Gender and number are identified using the same strategy as that for `k1s`.
+* Coordinate clauses: When a sentence consists of two coordinate clauses (joined by [और]{lang=hi} or [तो]{lang=hi}), questions are generated from each clause separately.
+* Subordinate clauses: Relative clauses are deleted before generating the questions.
+
 ### Important Functions
 
 - `tree.py`
@@ -39,24 +52,13 @@ This list of questions is then printed to an output file.
   - `sentence_tokenize()`: Simply tokenize using regex.
 
 ## Analysis and Observations
-We have followed the paper's algorithm in most cases; small extensions have been made in the cases of
-
-* gender agreement for `k1s`
-* proper noun identification for `k5`
-* gender agreement for `r6`
-* `pof` relations
-* adjectives, quantifiers and demonstratives for `nmod__adj`
-* coordinate and subordinate clauses
-
-
-
-### Limitations of Parser
 Some discrepancies in the output of the parser lead to correspondingly ill-formed or meaningless questions:
 
-* Sentences were parsed as having multiple roots.
-* Chunking was incorrect in some phrases.
+* Sentences were parsed as having multiple roots, especially long and complicated ones.
+* Chunking was incorrect for some phrases.
 * Quotes were unreliably parsed.
 * Relative clauses were not parsed as dependent on the constituent they modify; rather, they are dependent on the verb of the main clause.
+* Rhetorical questions occur without a question mark and are therefore not eliminated from the corpus during the tokenisation process (unlike ordinary questions).
 
 ## Possible Future Versions
 Some improvements that can be made to this project in the future are:

diff --git a/report.pdf b/report.pdf
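
Note on the `k1s` extension described in the `### Algorithm` hunk of `report.md`: the report says the gender of an adjective is guessed from its last matra and the question word [कैसा]{lang=hi} is altered to match. The following is a minimal sketch of that idea, not the project's actual code. The function name `kaisa_for`, the table `KAISA_FORMS`, and the fallback to the masculine singular form for adjectives without an inflecting ending are all assumptions made for illustration.

```python
# Hypothetical lookup: final matra of the adjective -> agreeing form of कैसा.
# The report only says the gender is guessed "from the last matra"; the exact
# table below is an assumption based on regular Hindi adjective inflection.
KAISA_FORMS = {
    "\u093e": "कैसा",  # ends in ा : masculine singular
    "\u0940": "कैसी",  # ends in ी : feminine
    "\u0947": "कैसे",  # ends in े : masculine plural / oblique
}


def kaisa_for(adjective: str, default: str = "कैसा") -> str:
    """Return the form of कैसा that agrees with the adjective's final matra.

    Adjectives that do not end in one of the three inflecting matras
    (for example सुंदर) carry no gender marking, so the default is returned.
    """
    word = adjective.strip()
    if not word:
        return default
    return KAISA_FORMS.get(word[-1], default)


if __name__ == "__main__":
    for adj in ["अच्छा", "अच्छी", "अच्छे", "सुंदर"]:
        print(adj, "->", kaisa_for(adj))
```

The same lookup could plausibly be reused for the `nmod__adj` replacements ([कितना]{lang=hi}, [कौनसा]{lang=hi}), since the report states that gender and number are identified there using the same strategy as for `k1s`.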