Refactor clinicaltrials.gov processor
jim-sheldon committed May 13, 2024
1 parent 49ee45f commit 7825cf0
Showing 17 changed files with 3,625 additions and 0 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/stability_tests.yml
@@ -0,0 +1,31 @@
name: Test G.h subsystem

on:
  push:
    branches: [main]
    paths:
      - ".github/workflows/stability_tests.yml"
      - "src/*"
  pull_request:
    paths:
      - ".github/workflows/stability_tests.yml"
      - "src/*"
  workflow_dispatch:

env:
  PLATFORM: ${{ vars.PLATFORM }}

jobs:
  tests:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3

      - name: Install Cython dependencies
        run: sudo apt update && sudo apt install -y gcc build-essential python3-dev

      - name: Install python dependencies
        run: python3 -m pip install --user .

      - name: Run tests
        run: python3 -m pytest -v
25 changes: 25 additions & 0 deletions LICENSE
@@ -0,0 +1,25 @@
BSD 2-Clause License

Copyright (c) 2021, Benjamin M. Gyori
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
64 changes: 64 additions & 0 deletions README.md
@@ -0,0 +1,64 @@
# Trial Synth

Interpretation and integration of clinical trials data from multiple registries in a computable form


## Adding registries and processors

Data for each registry is retrieved through a corresponding `Processor` object, its components, and related methods and functions.
Users build a `Processor` by composition; the work required differs from one registry to another.

Some common components include:
- Fetcher: get raw data from a source (e.g. a REST API)
- Transformer: change the data into a desired format for downstream use
- Storer: save the data to some location(s)
- Validator: check the data for quality
- Configuration: management of processor behaviors
- CLI: entrypoint for downstream use

![Processor with common components](./composition.svg)
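
As a rough illustration of the composition described above, the sketch below wires three components into a `Processor`. Every class and method name here (`CannedFetcher`, `TitleCleaner`, `MemoryStorer`, `fetch`, `transform`, `store`, `run`) is an assumption made for this example and not the package's actual interface, and the trial ID is a placeholder. The fetcher returns canned data instead of calling a REST API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Fetcher(Protocol):
    def fetch(self) -> list[dict]: ...


class Transformer(Protocol):
    def transform(self, records: list[dict]) -> list[dict]: ...


class Storer(Protocol):
    def store(self, records: list[dict]) -> None: ...


class CannedFetcher:
    """Hypothetical fetcher that returns fixed records instead of calling an API."""

    def fetch(self) -> list[dict]:
        return [{"NCTId": "NCT00000000", "BriefTitle": " Example trial "}]


class TitleCleaner:
    """Hypothetical transformer that strips whitespace from titles."""

    def transform(self, records: list[dict]) -> list[dict]:
        return [{**r, "BriefTitle": r["BriefTitle"].strip()} for r in records]


@dataclass
class MemoryStorer:
    """Hypothetical storer that keeps records in memory."""

    saved: list[dict] = field(default_factory=list)

    def store(self, records: list[dict]) -> None:
        self.saved.extend(records)


@dataclass
class Processor:
    """Composes the components; each one can be swapped per registry."""

    fetcher: Fetcher
    transformer: Transformer
    storer: Storer

    def run(self) -> None:
        raw = self.fetcher.fetch()
        self.storer.store(self.transformer.transform(raw))


storer = MemoryStorer()
Processor(CannedFetcher(), TitleCleaner(), storer).run()
print(storer.saved)
```

Because the `Processor` only depends on the shape of each component, a new registry can reuse the storer and replace just the fetcher and transformer.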

A user can add a registry and its `Processor` through test-driven development. One pattern for doing this:
- Get a sample response from the API and save it to a file
- Write the test for API stability
  - This can detect unexpected API behavior (e.g. changes or outages)
- Add the file to the imposter to use as a stubbed response
- Add a new port for the stubbed service to the imposter and compose files
- Write the test for the processor's end-to-end behaviors
  - The input is the sample response
- Write the components and functions to make the test pass
- Set the `url` environment variable for the `Processor` to the stub's port
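
The API stability test in this pattern can be as simple as comparing the shape of a fresh response with the saved sample. The following is a minimal sketch, assuming JSON responses; the helper name `missing_keys` and both payloads are made up for illustration and are not taken from this repository. In a real test, `live` would come from the registry API (or from the stub) rather than being defined inline.

```python
import json


def missing_keys(sample: dict, live: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of keys present in the sample but absent from the live response."""
    missing = []
    for key, value in sample.items():
        path = f"{prefix}{key}"
        if key not in live:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(live[key], dict):
            # Recurse into nested objects so deeper changes are also reported.
            missing.extend(missing_keys(value, live[key], prefix=f"{path}."))
    return missing


# Hypothetical saved sample response (would normally be loaded from a file).
sample = json.loads('{"items": [], "next": "abc", "meta": {"total": 1}}')

# Hypothetical live response whose shape has drifted.
live = json.loads('{"items": [], "meta": {}}')


def test_api_shape_is_stable():
    # With pytest, this fails when the live response loses a key the sample has.
    assert missing_keys(sample, sample) == []


drift = missing_keys(sample, live)
print(drift)
```

Checking only for missing keys keeps the test tolerant of fields the API adds later, while still flagging removals that would break the transformer.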


## Installation

Run the following to install this repository's package in editable mode:

```
$ git clone https://github.com/gyorilab/trialsynth
$ cd trialsynth
$ pip install -e .
```


## Local run

Users can run all system components on their computers via the compose stack, wrapped by a shell script:
```
./run_e2e.sh
```


## Testing

Users can run the full suite of system tests on their computers via the compose stack, wrapped by a shell script:
```
./test_e2e.sh
```


## References

- [Test-driven development](https://tidyfirst.substack.com/p/canon-tdd)
- [Mountebank](https://www.mbtest.org/)

1 change: 1 addition & 0 deletions composition.svg
2,069 changes: 2,069 additions & 0 deletions poetry.lock

Large diffs are not rendered by default.

38 changes: 38 additions & 0 deletions pyproject.toml
@@ -0,0 +1,38 @@
[tool.poetry]
name = "trialsynth"
version = "0.1.0-alpha"
description = "Extracts clinical trial information from sources"
authors = ["jim-sheldon <[email protected]>"]
license = "BSD-2-Clause"
readme = "README.md"

packages = [
{ include = "trial_synth", from = "src" },
]

classifiers = [
"Programming Language :: Python",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
]

[tool.poetry.dependencies]
python = ">=3.10"
gilda = "^1.1.0"
pandas = "<2" # indra does not support >=2
tqdm = "^4.66.2"
indra = {git = "https://github.com/sorgerlab/indra.git"}
requests = "^2.31.0"
click = "^8.1.7"
addict = "^2.4.0"
pydantic = "^2.7.1"
adeft = {git = "https://github.com/gyorilab/adeft.git"}

[tool.poetry.group.test.dependencies]
pytest = "^8.1.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Empty file added setup.py
Empty file.
45 changes: 45 additions & 0 deletions src/trial_synth/clinical_trials_dot_gov/__main__.py
@@ -0,0 +1,45 @@
import logging

import click

from .config import Config, DATA_DIR
from .fetch import Fetcher
from .store import Storer
from .transform import Transformer
from .processor import Processor
from .util import ensure_output_directory_exists


logger = logging.getLogger(__name__)


@click.command()
def main():
    click.secho("Processing clinicaltrials.gov data", fg="green", bold=True)
    ensure_output_directory_exists()
    config = Config()
    fetcher = Fetcher(
        url=config.api_url,
        request_parameters=config.api_parameters
    )
    transformer = Transformer()
    storer = Storer(
        node_iterator=transformer.get_nodes,
        node_types=config.node_types,
        data_directory=DATA_DIR
    )
    clinical_trials_processor = Processor(
        config=config,
        fetcher=fetcher,
        storer=storer,
        transformer=transformer
    )
    clinical_trials_processor.ensure_api_response_data_saved()
    clinical_trials_processor.clean_and_transform_data()
    clinical_trials_processor.set_nodes_and_edges()
    clinical_trials_processor.validate_data()
    clinical_trials_processor.save_graph_data()


if __name__ == "__main__":
    main()
76 changes: 76 additions & 0 deletions src/trial_synth/clinical_trials_dot_gov/config.py
@@ -0,0 +1,76 @@
"""Clinicaltrials.gov processor configuration"""

from dataclasses import dataclass
import logging
import os
from pathlib import Path


LOGGING_LEVEL = os.environ.get("LOGGING_LEVEL", "INFO")

PROCESSOR_NAME = os.environ.get(
    "CLINICAL_TRIALS_DOT_GOV_PROCESSOR_NAME",
    "clinicaltrials"
)
API_URL = os.environ.get(
    "CLINICAL_TRIALS_DOT_GOV_API_URL",
    "https://clinicaltrials.gov/api/v2/studies"
)

HOME_DIR = os.environ.get("HOME_DIRECTORY", Path.home())
PARENT_DIR_STR = os.environ.get("BASE_DIRECTORY", ".data")
DATA_DIR_STR = os.environ.get("DATA_DIRECTORY", "clinicaltrials")
DATA_DIR = Path(HOME_DIR, PARENT_DIR_STR, DATA_DIR_STR)
UNPROCESSED_FILE_PATH_STR = os.environ.get(
    "CLINICAL_TRIALS_RAW_DATA",
    "clinical_trials.tsv.gz"
)
NODES_FILE_NAME_STR = os.environ.get("NODES_FILE", "nodes.tsv.gz")
NODES_INDRA_FILE_NAME_STR = os.environ.get("NODES_INDRA_FILE", "nodes.pkl")
EDGES_FILE_NAME_STR = os.environ.get("EDGES_FILE", "edges.tsv.gz")

FIELDS = [
    "NCTId",
    "BriefTitle",
    "Condition",
    "ConditionMeshTerm",
    "ConditionMeshId",
    "InterventionName",
    "InterventionType",
    "InterventionMeshTerm",
    "InterventionMeshId",
    "StudyType",
    "DesignAllocation",
    "OverallStatus",
    "Phase",
    "WhyStopped",
    "SecondaryIdType",
    "SecondaryId",
    "StartDate",  # Month [day], year: "November 1, 2023", "May 1984" or NaN
    "StartDateType",  # "Actual" or "Anticipated" (or NaN)
    "ReferencePMID"  # these are tagged as relevant by the author, but not necessarily about the trial
]

root = logging.getLogger()
root.setLevel(LOGGING_LEVEL)


@dataclass
class Config:
    """
    User-mutable properties of Clinicaltrials.gov data processing
    """

    name = PROCESSOR_NAME
    api_url = API_URL
    api_parameters = {
        "fields": ",".join(FIELDS),  # actually column names, not fields
        "pageSize": 1000,
        "countTotal": "true"
    }

    unprocessed_file_path = Path(DATA_DIR, UNPROCESSED_FILE_PATH_STR)
    nodes_path = Path(DATA_DIR, NODES_FILE_NAME_STR)
    nodes_indra_path = Path(DATA_DIR, NODES_INDRA_FILE_NAME_STR)
    edges_path = Path(DATA_DIR, EDGES_FILE_NAME_STR)
    node_types = ["BioEntity", "ClinicalTrial"]
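
The module-level constants above are read from the environment when `config` is imported, so an override such as a stub URL has to be set before that import happens. The sketch below mirrors the `os.environ.get` pattern with a hypothetical helper, `resolve_api_url`, and a made-up stub address (`http://localhost:4545`); neither is part of this commit.

```python
import os
from typing import Mapping

DEFAULT_API_URL = "https://clinicaltrials.gov/api/v2/studies"


def resolve_api_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the API URL, preferring the environment variable over the default."""
    return env.get("CLINICAL_TRIALS_DOT_GOV_API_URL", DEFAULT_API_URL)


# No override: the public API is used.
print(resolve_api_url({}))

# Override pointing at a local stub (the address is an assumption for this example).
print(resolve_api_url({"CLINICAL_TRIALS_DOT_GOV_API_URL": "http://localhost:4545"}))
```

Passing the environment in as a mapping keeps the lookup testable without mutating `os.environ`.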