- Meet the data science cookiecutter requirements, in brief:
  - Install: `direnv` and `conda`
- Run `make install` to configure the development environment:
  - Set up the conda environment
  - Configure `pre-commit`
- Make sure you have a `.env` file with the following keys:
OPENAI_API_KEY = 'YOUR-KEY-HERE'
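
As a rough illustration of how the key could be picked up at runtime, the sketch below parses `KEY = 'value'` lines with the standard library only. It is an assumption about usage, not the project's own loader (the project may rely on `direnv` or another mechanism), and the helper names `read_env_file` and `get_openai_key` are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY = 'value' lines into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or read_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key
```

Failing early with a clear error is usually friendlier than letting a later API call fail with an authentication message.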
- Key hyperparameters are stored in `dsp_ai_eval/config/base.yaml`. This file contains: the research question for the project; hyperparameters for the topic modelling and GPT prompting; plus paths to relevant files in the S3 bucket.
- `dsp_ai_eval/getters/` contains functions for obtaining different raw and processed datasets and artefacts. This is a Nesta DS convention.
- There are three pipelines in `dsp_ai_eval/pipeline/`:
  - `generate_themes_with_gpt/`: pipeline for obtaining repeated GPT answers to the research question.
  - `process_abstracts/`: pipeline for performing text clustering on research abstracts.
  - `process_gpt_summaries/`: pipeline for performing text clustering on the summaries obtained with the `generate_themes_with_gpt/` pipeline.
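
The getter convention described above can be sketched as follows: pipeline code asks a getter for an artefact, and the getter resolves the location from the config instead of hard-coding it. This is only an illustration under stated assumptions. The dictionary stands in for the parsed `base.yaml`, every key except the `gpt_themes_pipeline` section name is a placeholder, the function `get_json_artefact` is hypothetical, and it reads from the local filesystem rather than from S3.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the parsed contents of base.yaml.
# Only "gpt_themes_pipeline" is a section name mentioned in this README;
# every other key and all paths are placeholders.
EXAMPLE_CONFIG = {
    "rq": "<your research question>",
    "abstracts_pipeline": {"path_raw_abstracts": "inputs/abstracts.json"},
    "gpt_themes_pipeline": {"path_raw_answers": "outputs/gpt_answers.json"},
}


def get_json_artefact(config, section, key, root="."):
    """Load the JSON file whose relative path is stored at config[section][key]."""
    path = Path(root) / config[section][key]
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Keeping paths in one config file means that switching to a new dataset only requires editing the config, not the pipeline scripts.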
At the moment the workflow is not fully reproducible - that work is forthcoming!
To update the pipeline to work with new research abstracts:
- Download your own research abstracts and upload to the S3 bucket
- Update relevant paths to your data in `dsp_ai_eval/config/base.yaml`
- Update the getters in `dsp_ai_eval/getters/scite.py` (you may have more or fewer data files to be concatenated than in the previous iteration of this project)
- Update the data deduplication steps in `dsp_ai_eval/pipeline/process_abstracts/embed_scite_abstracts.py`.
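
The concatenate-then-deduplicate step mentioned above might look roughly like the sketch below. It is not the logic in `embed_scite_abstracts.py`. The function name `concat_and_deduplicate` and the `doi` field are assumptions chosen for illustration, and your data may need a different key (for example a normalised title).

```python
def concat_and_deduplicate(batches, key="doi"):
    """Concatenate lists of record dicts, keeping the first record per identifier."""
    seen = set()
    unique = []
    for batch in batches:
        for record in batch:
            # Normalise the identifier so case and stray spaces do not hide duplicates.
            identifier = (record.get(key) or "").strip().lower()
            if identifier and identifier in seen:
                continue
            if identifier:
                seen.add(identifier)
            # Records without an identifier cannot be matched, so they are all kept.
            unique.append(record)
    return unique


# Two hypothetical data files, already loaded as lists of dicts.
file_a = [{"doi": "10.1/abc", "title": "First"}, {"doi": None, "title": "No DOI"}]
file_b = [{"doi": "10.1/ABC ", "title": "First (duplicate)"}, {"doi": "10.2/xyz", "title": "Second"}]
merged = concat_and_deduplicate([file_a, file_b])
```

Here `merged` holds three records: the duplicate DOI from the second file is dropped, while the record with no DOI is retained.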
If you wish to repeat the experiment prompting GPT repeatedly for answers to a research question:
- Update the RQ in `dsp_ai_eval/config/base.yaml`
- Update the filepaths under `gpt_themes_pipeline` in `dsp_ai_eval/config/base.yaml` as desired
- You should now be able to run `dsp_ai_eval/pipeline/generate_themes_with_gpt/ask_gpt+for_themes.py`
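
Conceptually, the repeated-prompting experiment asks the same research question several times and stores each answer separately. The sketch below shows that loop in isolation. It is not the project's script: `collect_repeated_answers` is a hypothetical name, and `ask` is any callable you supply (in practice it would wrap a call to the GPT API, which is deliberately left out here).

```python
def collect_repeated_answers(ask, question, n_samples=3):
    """Call ask(question) n_samples times and return one record per answer."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = []
    for sample_index in range(n_samples):
        # Each call is independent, so answers may differ between samples.
        answers.append({"sample": sample_index, "answer": ask(question)})
    return answers


def fake_ask(question):
    """Placeholder model used for illustration; it just echoes the question."""
    return f"Themes for: {question}"


records = collect_repeated_answers(fake_ask, "<your research question>", n_samples=2)
```

Passing the model call in as a function keeps the loop testable without an API key or network access.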
Technical and working style guidelines
Project based on Nesta's data science project template (Read the docs here).