Editing READMEs

CodingTil · Oct 24, 2023 · 143d6fc · 143d6fc
1 parent 9110d6b
commit 143d6fc
Show file tree

Hide file tree

Showing 2 changed files with 112 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -9,24 +9,124 @@ pip install -e .
 ```
 
 ## Usage
-### CLI
-If installed as a python package, the following command is available:
+If installed locally, henceforth the command `py_css` is available. Otherwise, the following entrypoint shall be called:
 ```bash
-py_css cli
+python py_css/main.py
+# OR, if installed locally:
+py_css
 ```
 
-Otherwise, the equivalent can be achieved by navigating into the repository and running the following:
+A detailed help page will be presented using:
 ```bash
-python py_css/main.py cli
+py_css --help
+```
+
+### CLI Mode
+If installed as a python package, the following command is available:
+```bash
+py_css cli
 ```
 
 ### Run Queries File
 ```bash
-python py_css/main.py run_file --log=INFO --queries=data/queries_train.csv --output=output/train.txt
+py_css run_file --log=INFO --queries=data/queries_train.csv --output=output/train.txt
 ```
 
 ### Run Queries and Evaluate Performance
 ```bash
-python py_css/main.py eval --log=INFO --queries=data/queries_train.csv --qrels=data/qrels_train.txt
+py_css eval --log=INFO --queries=data/queries_train.csv --qrels=data/qrels_train.txt
+```
+
+### Create Kaggle Runfile Format
+```bash
+py_css kaggle --log=INFO --queries=data/queries_test.csv --output=output/kaggle-prf.csv
+```
+
+
+## Retrieval Pipelines
+As outlined in the paper, four retrieval pipelines were implemented:
+
+### Baseline
+Can be selected by specifying the following parameters:
+```bash
+--method=baseline
+--baseline-params=1000,1000,50
+```
+
+#### Indexing
+For indexing, the document collection has to be placed into the `data/` folder.
+<br>
+[Further Instructions](data/README.md)
+
+#### Parameters
+| Position | ID | Description | Constraints |
+| --- | --- | --- | --- |
+| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
+| 1 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
+| 2 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs <= duo_t5_docs` |
+
+### Baseline + `RM3`
+
+Can be selected by specifying the following parameters:
+```bash
+--method=baseline-prf
+--baseline-prf-params=1000,17,26,1000,50
 ```
 
+#### Indexing
+For indexing, the document collection has to be placed into the `data/` folder.
+<br>
+[Further Instructions](data/README.md)
+
+#### Parameters
+| Position | ID | Description | Constraints |
+| --- | --- | --- | --- |
+| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
+| 1 | `rm3_fb_docs` | The number of documents to be used for `RM3` query expansion. | |
+| 2 | `rm3_fb_terms` | The number of terms to expand the query with using `RM3`. | |
+| 3 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
+| 4 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs <= duo_t5_docs` |
+
+
+### `doc2query`
+Can be selected by specifying the following parameters:
+```bash
+--method=doc2query
+--doc2query-params=1000,1000,50
+```
+
+#### Indexing
+For indexing, the document collection has to be placed into the `data/` folder.
+Additionally, descriptive queries for each document have to be generated using [this script](scripts/doc2query-t5.py).
+<br>
+[Further Instructions](data/README.md)
+
+#### Parameters
+| Position | ID | Description | Constraints |
+| --- | --- | --- | --- |
+| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
+| 1 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
+| 2 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs <= duo_t5_docs` |
+
+### `doc2query` + `RM3`
+
+Can be selected by specifying the following parameters:
+```bash
+--method=doc2query-prf
+--doc2query-prf-params=1000,17,26,1000,50
+```
+
+#### Indexing
+For indexing, the document collection has to be placed into the `data/` folder.
+Additionally, descriptive queries for each document have to be generated using [this script](scripts/doc2query-t5.py).
+<br>
+[Further Instructions](data/README.md)
+
+#### Parameters
+| Position | ID | Description | Constraints |
+| --- | --- | --- | --- |
+| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
+| 1 | `rm3_fb_docs` | The number of documents to be used for `RM3` query expansion. | |
+| 2 | `rm3_fb_terms` | The number of terms to expand the query with using `RM3`. | |
+| 3 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
+| 4 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs <= duo_t5_docs` |
diff --git a/data/README.md b/data/README.md
@@ -1,4 +1,9 @@
 # Data
 
 The document collection is MS MARCO Passages and has to be stored in `collection.tsv`.
+Furthermore, for the `doc2query` based approaches, descriptive queries for each document in the collection must be stored in `doc2query.tsv`.
+This file can be automatically generated using [this script](scripts/doc2query-t5.py). :warning: May take several days.
+
+A MS MARCO document collection has been provided [here](https://gustav1.ux.uis.no/dat640/msmarco-passage.tar.gz).
+A pre-generated `doc2query.tsv` file has been made available [here](https://drive.google.com/file/d/1vGGGu0eprxG_iUm9Z5xkbsKEwjJoAf_A/view?usp=drive_link).