Editing READMEs
CodingTil committed Oct 24, 2023
1 parent 9110d6b commit 143d6fc
Showing 2 changed files with 112 additions and 7 deletions.
114 changes: 107 additions & 7 deletions README.md
Expand Up @@ -9,24 +9,124 @@ pip install -e .
```

## Usage
### CLI
If installed as a python package, the following command is available:
If the package is installed locally, the `py_css` command is available. Otherwise, call the entry point directly:
```bash
py_css cli
python py_css/main.py
# OR, if installed locally:
py_css
```

Otherwise, the equivalent can be achieved by navigating into the repository and running the following:
To display a detailed help page, run:
```bash
python py_css/main.py cli
py_css --help
```

### CLI Mode
If installed as a python package, the following command is available:
```bash
py_css cli
```

### Run Queries File
```bash
python py_css/main.py run_file --log=INFO --queries=data/queries_train.csv --output=output/train.txt
py_css run_file --log=INFO --queries=data/queries_train.csv --output=output/train.txt
```

### Run Queries and Evaluate Performance
```bash
python py_css/main.py eval --log=INFO --queries=data/queries_train.csv --qrels=data/qrels_train.txt
py_css eval --log=INFO --queries=data/queries_train.csv --qrels=data/qrels_train.txt
```

### Create Kaggle Runfile Format
```bash
py_css kaggle --log=INFO --queries=data/queries_test.csv --output=output/kaggle-prf.csv
```


## Retrieval Pipelines
As outlined in the paper, four retrieval pipelines were implemented:

### Baseline
Can be selected by specifying the following parameters:
```bash
--method=baseline
--baseline-params=1000,1000,50
```

#### Indexing
For indexing, the document collection has to be placed into the `data/` folder.
<br>
[Further Instructions](data/README.md)

#### Parameters
| Position | ID | Description | Constraints |
| --- | --- | --- | --- |
| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
| 1 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
| 2 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs >= duo_t5_docs` |
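
The three cutoffs describe a retrieve-then-rerank cascade. The following is a minimal sketch of that idea, not the repository's implementation: `cascade`, `rerank_top`, and the toy scoring functions are hypothetical names, the pointwise `duo` scorer merely stands in for `duoT5` (which compares documents pairwise), and keeping the non-reranked tail in its previous order is an assumption. It also assumes the stage sizes are non-increasing, as in the example values `1000,1000,50`.

```python
from typing import Callable, List

Scorer = Callable[[str], float]


def rerank_top(ranking: List[str], k: int, score: Scorer) -> List[str]:
    """Reorder only the first k entries by descending score; keep the tail as is."""
    head = sorted(ranking[:k], key=score, reverse=True)
    return head + ranking[k:]


def cascade(
    candidates: List[str],
    bm25_docs: int,
    mono_t5_docs: int,
    duo_t5_docs: int,
    first_stage: Scorer,  # stand-in for BM25 scoring (assumption)
    mono: Scorer,         # stand-in for monoT5 scoring (assumption)
    duo: Scorer,          # pointwise stand-in for duoT5 (assumption)
) -> List[str]:
    # Each stage can only rerank documents that the previous stage returned.
    if not (bm25_docs >= mono_t5_docs >= duo_t5_docs):
        raise ValueError("expected bm25_docs >= mono_t5_docs >= duo_t5_docs")
    # Stage 1: retrieve the best bm25_docs candidates.
    ranking = sorted(candidates, key=first_stage, reverse=True)[:bm25_docs]
    # Stage 2: rerank the top mono_t5_docs of them.
    ranking = rerank_top(ranking, mono_t5_docs, mono)
    # Stage 3: rerank the top duo_t5_docs of the result.
    return rerank_top(ranking, duo_t5_docs, duo)


if __name__ == "__main__":
    first = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
    mono_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
    duo_scores = {"a": 2.0, "b": 1.0, "c": 0.0, "d": 0.0}
    print(cascade(list(first), 3, 2, 1,
                  first.__getitem__, mono_scores.__getitem__, duo_scores.__getitem__))
```

Each later stage is more expensive per document, which is why it is applied to a smaller prefix of the ranking.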

### Baseline + `RM3`

Can be selected by specifying the following parameters:
```bash
--method=baseline-prf
--baseline-prf-params=1000,17,26,1000,50
```

#### Indexing
For indexing, the document collection has to be placed into the `data/` folder.
<br>
[Further Instructions](data/README.md)

#### Parameters
| Position | ID | Description | Constraints |
| --- | --- | --- | --- |
| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
| 1 | `rm3_fb_docs` | The number of documents to be used for `RM3` query expansion. | |
| 2 | `rm3_fb_terms` | The number of terms to expand the query with using `RM3`. | |
| 3 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
| 4 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs >= duo_t5_docs` |
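
As an illustration of how the positional value `1000,17,26,1000,50` maps onto the IDs in the table, here is a small parsing sketch. `BaselinePrfParams` and `parse_baseline_prf_params` are hypothetical names, not functions from `py_css`, and the checks assume that each reranking stage receives no more documents than the previous stage produced (consistent with the example values).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselinePrfParams:
    bm25_docs: int
    rm3_fb_docs: int
    rm3_fb_terms: int
    mono_t5_docs: int
    duo_t5_docs: int


def parse_baseline_prf_params(raw: str) -> BaselinePrfParams:
    """Parse a comma-separated parameter string in table order (positions 0-4)."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated integers, got {len(parts)}")
    params = BaselinePrfParams(*(int(p) for p in parts))
    # Assumed constraints: stage sizes must not grow along the pipeline.
    if params.bm25_docs < params.mono_t5_docs:
        raise ValueError("bm25_docs must be >= mono_t5_docs")
    if params.mono_t5_docs < params.duo_t5_docs:
        raise ValueError("mono_t5_docs must be >= duo_t5_docs")
    return params


if __name__ == "__main__":
    print(parse_baseline_prf_params("1000,17,26,1000,50"))
```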


### `doc2query`
Can be selected by specifying the following parameters:
```bash
--method=doc2query
--doc2query-params=1000,1000,50
```

#### Indexing
For indexing, the document collection has to be placed into the `data/` folder.
Additionally, descriptive queries for each document have to be generated using [this script](scripts/doc2query-t5.py).
<br>
[Further Instructions](data/README.md)

#### Parameters
| Position | ID | Description | Constraints |
| --- | --- | --- | --- |
| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
| 1 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
| 2 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs >= duo_t5_docs` |

### `doc2query` + `RM3`

Can be selected by specifying the following parameters:
```bash
--method=doc2query-prf
--doc2query-prf-params=1000,17,26,1000,50
```

#### Indexing
For indexing, the document collection has to be placed into the `data/` folder.
Additionally, descriptive queries for each document have to be generated using [this script](scripts/doc2query-t5.py).
<br>
[Further Instructions](data/README.md)

#### Parameters
| Position | ID | Description | Constraints |
| --- | --- | --- | --- |
| 0 | `bm25_docs` | The number of documents to be retrieved using `BM25`. | |
| 1 | `rm3_fb_docs` | The number of documents to be used for `RM3` query expansion. | |
| 2 | `rm3_fb_terms` | The number of terms to expand the query with using `RM3`. | |
| 3 | `mono_t5_docs` | The number of documents to be reranked by `monoT5` after retrieval. | `bm25_docs >= mono_t5_docs` |
| 4 | `duo_t5_docs` | The number of documents to be reranked by `duoT5` after `monoT5` reranking. | `mono_t5_docs >= duo_t5_docs` |
5 changes: 5 additions & 0 deletions data/README.md
@@ -1,4 +1,9 @@
# Data

The document collection is MS MARCO Passages and has to be stored in `collection.tsv`.
Furthermore, for the `doc2query` based approaches, descriptive queries for each document in the collection must be stored in `doc2query.tsv`.
This file can be automatically generated using [this script](scripts/doc2query-t5.py). :warning: May take several days.

An MS MARCO document collection has been provided [here](https://gustav1.ux.uis.no/dat640/msmarco-passage.tar.gz).
A pre-generated `doc2query.tsv` file has been made available [here](https://drive.google.com/file/d/1vGGGu0eprxG_iUm9Z5xkbsKEwjJoAf_A/view?usp=drive_link).
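
To sanity-check a downloaded `collection.tsv` before indexing, a short reader along the following lines may help. It is a sketch that assumes the usual MS MARCO Passages layout of one `passage_id<TAB>passage_text` pair per line without a header; `iter_passages` is a hypothetical helper, not part of `py_css`, and the layout of `doc2query.tsv` is not covered here.

```python
import csv
from typing import Iterator, Tuple


def iter_passages(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (passage_id, passage_text) pairs from a tab-separated file."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside passages as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            yield row[0], row[1]


if __name__ == "__main__":
    # Print the first three passages; the path assumes the data/ folder layout.
    for i, (pid, text) in enumerate(iter_passages("data/collection.tsv")):
        print(pid, text[:80])
        if i == 2:
            break
```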
