A Python project implementing semantic focused search on podcast documents from Spotify (TREC collection).
To run this project with a demonstration index, please follow these steps:
- Clone this repository
- Set up the Python virtual environment using IR_venv.yml (Windows users, please see the note at the bottom of this README).
- Launch the virtual environment
(Optional: to re-create the BM25 index using the files found in Sampled_docs)
- cd to the location /Spotify_Information_Retrieval/src/indexing/
- run python BM25_create_demo.py
- cd to the location /Spotify_Information_Retrieval/src/
(To launch the Spotify transcript search engine graphical user interface)
- run python main.py
(To run the evaluation scripts, with options for 4 types of search strategies)
- run python evaluation.py
(To run the unit testing)
- First edit the 'evaluation.py' file to comment out the final 2 lines. Don't forget to revert this change after running the unit testing.
- Replace the documents in /Spotify_Information_Retrieval/Sampled_docs with the files ts1.json and ts2.json that are in /Spotify_Information_Retrieval/Testing/Sampled_docs_testing.
- Move the 'testing_index.pkl' and 'unittest.metadata.csv' files into /Spotify_Information_Retrieval/Files/Local_pickles
- run python unit_testing.py
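The document swap in the unit-testing steps can be scripted so that the demo documents are easy to restore afterwards. The following is only a sketch, not part of the repository: the function names `stage_test_files` and `restore_demo_files` and the backup folder are hypothetical, and it assumes Sampled_docs and Testing/Sampled_docs_testing sit directly under the repository root as in the paths above. It does not move 'testing_index.pkl' or 'unittest.metadata.csv', since their starting location is not stated here.

```python
import shutil
from pathlib import Path

# File names taken from the unit-testing steps in this README.
TEST_DOCS = ("ts1.json", "ts2.json")


def stage_test_files(repo_root: Path, backup_dir: Path) -> None:
    """Swap the demo documents for the unit-test documents, keeping a backup.

    Hypothetical helper: backup_dir is any folder outside Sampled_docs.
    """
    sampled = repo_root / "Sampled_docs"
    testing = repo_root / "Testing" / "Sampled_docs_testing"

    # Move the current demo documents aside so they can be restored later.
    backup_dir.mkdir(parents=True, exist_ok=True)
    for doc in list(sampled.iterdir()):
        shutil.move(str(doc), str(backup_dir / doc.name))

    # Copy in the two test transcripts.
    for name in TEST_DOCS:
        shutil.copy2(testing / name, sampled / name)


def restore_demo_files(repo_root: Path, backup_dir: Path) -> None:
    """Undo stage_test_files: remove the test documents, put the demo ones back."""
    sampled = repo_root / "Sampled_docs"
    for name in TEST_DOCS:
        (sampled / name).unlink(missing_ok=True)  # missing_ok needs Python 3.8+
    for doc in list(backup_dir.iterdir()):
        shutil.move(str(doc), str(sampled / doc.name))
```

One possible use is to call `stage_test_files(Path("/Spotify_Information_Retrieval"), Path("/tmp/sampled_docs_backup"))` before the tests and `restore_demo_files` with the same arguments afterwards; both paths are placeholders to adjust to the local checkout.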
NOTE FOR WINDOWS USERS: If the .yml files do not work for creating a virtual environment, the following packages should be installed via conda or your installer of choice:
- pandas 1.5.2
- matplotlib 3.7.1
- nltk 3.7
- scikit-learn 1.2.0
- feedparser
- pysimplegui 4.60.4
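After a manual install, it can help to confirm that the environment matches the list above. The sketch below is not part of the repository: `REQUIRED` and `check_packages` are hypothetical names, and it assumes Python 3.8 or later (for `importlib.metadata`) and that each package is registered under the distribution name shown in the list. feedparser is checked for presence only, because no version is pinned for it.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; None means "any version is accepted".
REQUIRED = {
    "pandas": "1.5.2",
    "matplotlib": "3.7.1",
    "nltk": "3.7",
    "scikit-learn": "1.2.0",
    "feedparser": None,
    "pysimplegui": "4.60.4",
}


def check_packages(required, get_version=version):
    """Return a list of problem descriptions; an empty list means no mismatch.

    get_version is injectable so the check can be exercised without
    installing anything (hypothetical design choice for this sketch).
    """
    problems = []
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if wanted is not None and found != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems
```

Running `print(check_packages(REQUIRED))` inside the activated environment lists any missing or mismatched packages. A different version is not necessarily a problem, only a departure from the versions listed above.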