Refactor clinicaltrials.gov processor
1 parent 49ee45f, commit 9d3b2bb
Showing 16 changed files with 3,624 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
name: Test G.h subsystem | ||
|
||
on: | ||
  push: | ||
    branches: [main] | ||
    paths: | ||
      - ".github/workflows/stability_tests.yml" | ||
      - "src/**" | ||
  pull_request: | ||
    paths: | ||
      - ".github/workflows/stability_tests.yml" | ||
      - "src/**" | ||
  workflow_dispatch: | ||
|
||
jobs: | ||
  tests: | ||
    runs-on: ubuntu-22.04 | ||
    steps: | ||
      - uses: actions/checkout@v3 | ||
|
||
      - name: Install Cython dependencies | ||
        run: sudo apt update && sudo apt install -y gcc build-essential python3-dev curl | ||
|
||
      - name: Install poetry | ||
        run: curl -sSL https://install.python-poetry.org | python3 - | ||
|
||
      - name: Install python dependencies | ||
        run: poetry install | ||
|
||
      - name: Run tests | ||
        run: poetry run pytest -v |
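Review note: the workflow above runs `pytest` from `stability_tests.yml`, and the README in this commit describes a test for API stability. A minimal sketch of what such a check could look like follows. The names `REQUIRED_TOP_LEVEL_KEYS` and `find_shape_problems`, the `"studies"` key, and the inline sample payload are all assumptions for illustration and are not taken from this commit; a real stability test would fetch the payload from the live API.

```python
"""Sketch of an API stability check (all names here are hypothetical)."""

# Assumption: the response is a JSON object holding a list under "studies".
REQUIRED_TOP_LEVEL_KEYS = {"studies"}


def find_shape_problems(payload: dict) -> list[str]:
    """Return readable problems; an empty list means the shape looks stable."""
    problems = []
    missing = REQUIRED_TOP_LEVEL_KEYS - set(payload)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    studies = payload.get("studies", [])
    if not isinstance(studies, list):
        problems.append("'studies' is not a list")
    elif not studies:
        problems.append("'studies' is empty")
    return problems


def test_api_response_shape_is_stable():
    # Placeholder payload; a real test would request it from the API.
    sample = {"studies": [{"id": "example"}]}
    assert find_shape_problems(sample) == []
```

Returning a list of problems instead of asserting inside the helper keeps the failure message specific when the API changes or is unavailable.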
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
BSD 2-Clause License | ||
|
||
Copyright (c) 2021, Benjamin M. Gyori | ||
All rights reserved. | ||
|
||
Redistribution and use in source and binary forms, with or without | ||
modification, are permitted provided that the following conditions are met: | ||
|
||
1. Redistributions of source code must retain the above copyright notice, this | ||
list of conditions and the following disclaimer. | ||
|
||
2. Redistributions in binary form must reproduce the above copyright notice, | ||
this list of conditions and the following disclaimer in the documentation | ||
and/or other materials provided with the distribution. | ||
|
||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | ||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE | ||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE | ||
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | ||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | ||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | ||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# Trial Synth | ||
|
||
Interpretation and integration of clinical trials data from multiple registries in a computable form | ||
|
||
|
||
## Adding registries and processors | ||
|
||
Data from each registry is retrieved through a corresponding `Processor` object, its components, and their methods and functions. | ||
Users build a `Processor` by composition; some registries require more work than others. | ||
|
||
Some common components include: | ||
- Fetcher: get raw data from a source (e.g. a REST API) | ||
- Transformer: change the data into a desired format for downstream use | ||
- Storer: save the data to some location(s) | ||
- Validator: check the data for quality | ||
- Configuration: management of processor behaviors | ||
- CLI: entrypoint for downstream use | ||
|
||
![Processor with common components](./composition.svg) | ||
|
||
A user can add a registry and `Processor` by test-driven development. A pattern for doing this looks like: | ||
- Get a sample response from the API and save it to a file | ||
- Write the test for API stability | ||
  - This can detect unexpected API behavior (e.g. changes or outages) | ||
- Add the file to the imposter to use as a stubbed response | ||
- Add a new port for the stubbed service to the imposter and compose files | ||
- Write the test for the processor's end-to-end behaviors | ||
  - The input is the sample response | ||
- Write the components and functions to make the test pass | ||
- Set the `url` environment variable for the `Processor` to the stub's port | ||
|
||
|
||
## Installation | ||
|
||
Run the following to install this repository's package in editable mode: | ||
|
||
``` | ||
$ git clone https://github.com/gyorilab/trialsynth | ||
$ cd trialsynth | ||
$ pip install -e . | ||
``` | ||
|
||
|
||
## Local run | ||
|
||
Users can run all system components on their computers via the compose stack, wrapped by a shell script: | ||
``` | ||
./run_e2e.sh | ||
``` | ||
|
||
|
||
## Testing | ||
|
||
Users can run the full suite of system tests on their computers via the compose stack, wrapped by a shell script: | ||
``` | ||
./test_e2e.sh | ||
``` | ||
|
||
|
||
## References | ||
|
||
- [Test-driven development](https://tidyfirst.substack.com/p/canon-tdd) | ||
- [Mountebank](https://www.mbtest.org/) | ||
|
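Review note: the README above describes testing a `Processor` end to end against a stubbed service that returns a saved sample response, with the processor's URL pointed at the stub. The sketch below illustrates that pattern using only the Python standard library as a stand-in for the Mountebank imposter; it is not how this repository wires its stubs. `SAMPLE_RESPONSE`, `StubHandler`, `fetch_from_configured_url`, and `run_against_stub` are hypothetical names, and the payload is a placeholder. The environment variable name is the one read in the configuration module of this commit.

```python
import json
import os
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ENV_VAR = "CLINICAL_TRIALS_DOT_GOV_API_URL"
# Placeholder payload; the README suggests saving a real sample response to a file.
SAMPLE_RESPONSE = {"studies": [{"id": "example-1"}]}


class StubHandler(BaseHTTPRequestHandler):
    """Serve the sample response for every GET request."""

    def do_GET(self):
        body = json.dumps(SAMPLE_RESPONSE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_from_configured_url() -> dict:
    """Stand-in for a fetcher: read the URL from the environment and GET it."""
    url = os.environ[ENV_VAR]
    # Empty ProxyHandler: reach the loopback stub directly, ignoring system proxies.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return json.load(response)


def run_against_stub() -> dict:
    """Start the stub on a free port, point the URL at it, and fetch once."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    previous = os.environ.get(ENV_VAR)
    port = server.server_address[1]
    os.environ[ENV_VAR] = f"http://127.0.0.1:{port}/api/v2/studies"
    try:
        return fetch_from_configured_url()
    finally:
        # Restore the environment and stop the stub, even if the fetch failed.
        if previous is None:
            del os.environ[ENV_VAR]
        else:
            os.environ[ENV_VAR] = previous
        server.shutdown()
        server.server_close()
```

Because the code under test only sees a URL, the same test body can run against this in-process stub, a Mountebank imposter, or the live service.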
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
[tool.poetry] | ||
name = "trialsynth" | ||
version = "0.1.0-alpha" | ||
description = "Extracts clinical trial information from sources" | ||
authors = ["jim-sheldon <[email protected]>"] | ||
license = "BSD-2-Clause" | ||
readme = "README.md" | ||
|
||
packages = [ | ||
{ include = "trial_synth", from = "src" }, | ||
] | ||
|
||
classifiers = [ | ||
"Programming Language :: Python", | ||
"Programming Language :: Python :: 3.10", | ||
"Programming Language :: Python :: 3.11", | ||
"Programming Language :: Python :: 3.12", | ||
] | ||
|
||
[tool.poetry.dependencies] | ||
python = ">=3.10" | ||
gilda = "^1.1.0" | ||
pandas = "<2" # indra does not support >=2 | ||
tqdm = "^4.66.2" | ||
indra = {git = "https://github.com/sorgerlab/indra.git"} | ||
requests = "^2.31.0" | ||
click = "^8.1.7" | ||
addict = "^2.4.0" | ||
pydantic = "^2.7.1" | ||
|
||
[tool.poetry.group.test.dependencies] | ||
pytest = "^8.1.1" | ||
|
||
[build-system] | ||
requires = ["poetry-core"] | ||
build-backend = "poetry.core.masonry.api" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
import logging | ||
|
||
import click | ||
|
||
from .config import Config, DATA_DIR | ||
from .fetch import Fetcher | ||
from .store import Storer | ||
from .transform import Transformer | ||
from .process import Processor | ||
from .util import ensure_output_directory_exists | ||
|
||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
@click.command() | ||
def main(): | ||
    click.secho("Processing clinicaltrials.gov data", fg="green", bold=True) | ||
    ensure_output_directory_exists() | ||
    config = Config() | ||
    fetcher = Fetcher( | ||
        url=config.api_url, | ||
        request_parameters=config.api_parameters | ||
    ) | ||
    transformer = Transformer() | ||
    storer = Storer( | ||
        node_iterator=transformer.get_nodes, | ||
        node_types=config.node_types, | ||
        data_directory=DATA_DIR | ||
    ) | ||
    clinical_trials_processor = Processor( | ||
        config=config, | ||
        fetcher=fetcher, | ||
        storer=storer, | ||
        transformer=transformer | ||
    ) | ||
    clinical_trials_processor.ensure_api_response_data_saved() | ||
    clinical_trials_processor.clean_and_transform_data() | ||
    clinical_trials_processor.set_nodes_and_edges() | ||
    clinical_trials_processor.validate_data() | ||
    clinical_trials_processor.save_graph_data() | ||
|
||
|
||
if __name__ == "__main__": | ||
    main() |
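Review note: the entrypoint above builds a `Processor` from a fetcher, a transformer, and a storer, then calls its stages in a fixed order. The sketch below illustrates that composition idea in isolation. It does not reproduce the `Processor` in this commit (its source is not shown here): `SketchProcessor`, the single `run` method, the `Protocol` interfaces, and the three fake components are hypothetical, and the `NCTId` value is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Fetcher(Protocol):
    def fetch(self) -> list[dict]: ...


class Transformer(Protocol):
    def transform(self, records: list[dict]) -> list[dict]: ...


class Storer(Protocol):
    def store(self, rows: list[dict]) -> None: ...


@dataclass
class SketchProcessor:
    """Hypothetical stand-in: each stage is delegated to an injected component."""

    fetcher: Fetcher
    transformer: Transformer
    storer: Storer
    rows: list[dict] = field(default_factory=list)

    def run(self) -> None:
        raw = self.fetcher.fetch()                   # get raw data
        self.rows = self.transformer.transform(raw)  # reshape for downstream use
        self.storer.store(self.rows)                 # persist the result


class FakeFetcher:
    def fetch(self) -> list[dict]:
        return [{"NCTId": " NCT00000001 "}]  # placeholder record


class StripTransformer:
    def transform(self, records: list[dict]) -> list[dict]:
        # Trim surrounding whitespace from every value.
        return [{key: value.strip() for key, value in r.items()} for r in records]


class MemoryStorer:
    def __init__(self) -> None:
        self.saved: list[dict] = []

    def store(self, rows: list[dict]) -> None:
        self.saved.extend(rows)
```

A usage example: `storer = MemoryStorer(); SketchProcessor(FakeFetcher(), StripTransformer(), storer).run()` leaves the cleaned record in `storer.saved`. Swapping any component for a fake like these is what makes the end-to-end behavior testable without network or disk access.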
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
"""Clinicaltrials.gov processor configuration""" | ||
|
||
from dataclasses import dataclass | ||
import logging | ||
import os | ||
from pathlib import Path | ||
|
||
|
||
LOGGING_LEVEL = os.environ.get("LOGGING_LEVEL", "INFO") | ||
|
||
PROCESSOR_NAME = os.environ.get( | ||
    "CLINICAL_TRIALS_DOT_GOV_PROCESSOR_NAME", | ||
    "clinicaltrials" | ||
) | ||
API_URL = os.environ.get( | ||
    "CLINICAL_TRIALS_DOT_GOV_API_URL", | ||
    "https://clinicaltrials.gov/api/v2/studies" | ||
) | ||
|
||
HOME_DIR = os.environ.get("HOME_DIRECTORY", Path.home()) | ||
PARENT_DIR_STR = os.environ.get("BASE_DIRECTORY", ".data") | ||
DATA_DIR_STR = os.environ.get("DATA_DIRECTORY", "clinicaltrials") | ||
DATA_DIR = Path(HOME_DIR, PARENT_DIR_STR, DATA_DIR_STR) | ||
UNPROCESSED_FILE_PATH_STR = os.environ.get( | ||
    "CLINICAL_TRIALS_RAW_DATA", | ||
    "clinical_trials.tsv.gz" | ||
) | ||
NODES_FILE_NAME_STR = os.environ.get("NODES_FILE", "nodes.tsv.gz") | ||
NODES_INDRA_FILE_NAME_STR = os.environ.get("NODES_INDRA_FILE", "nodes.pkl") | ||
EDGES_FILE_NAME_STR = os.environ.get("EDGES_FILE", "edges.tsv.gz") | ||
|
||
FIELDS = [ | ||
    "NCTId", | ||
    "BriefTitle", | ||
    "Condition", | ||
    "ConditionMeshTerm", | ||
    "ConditionMeshId", | ||
    "InterventionName", | ||
    "InterventionType", | ||
    "InterventionMeshTerm", | ||
    "InterventionMeshId", | ||
    "StudyType", | ||
    "DesignAllocation", | ||
    "OverallStatus", | ||
    "Phase", | ||
    "WhyStopped", | ||
    "SecondaryIdType", | ||
    "SecondaryId", | ||
    "StartDate",  # Month [day], year: "November 1, 2023", "May 1984" or NaN | ||
    "StartDateType",  # "Actual" or "Anticipated" (or NaN) | ||
    "ReferencePMID"  # these are tagged as relevant by the author, but not necessarily about the trial | ||
] | ||
|
||
root = logging.getLogger() | ||
root.setLevel(LOGGING_LEVEL) | ||
|
||
|
||
@dataclass | ||
class Config: | ||
    """ | ||
    User-mutable properties of Clinicaltrials.gov data processing | ||
    """ | ||
|
||
    name = PROCESSOR_NAME | ||
    api_url = API_URL | ||
    api_parameters = { | ||
        "fields": ",".join(FIELDS),  # actually column names, not fields | ||
        "pageSize": 1000, | ||
        "countTotal": "true" | ||
    } | ||
|
||
    unprocessed_file_path = Path(DATA_DIR, UNPROCESSED_FILE_PATH_STR) | ||
    nodes_path = Path(DATA_DIR, NODES_FILE_NAME_STR) | ||
    nodes_indra_path = Path(DATA_DIR, NODES_INDRA_FILE_NAME_STR) | ||
    edges_path = Path(DATA_DIR, EDGES_FILE_NAME_STR) | ||
    node_types = ["BioEntity", "ClinicalTrial"] |
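Review note: the attributes of `Config` above carry no type annotations, so `@dataclass` treats them as plain class attributes and not as dataclass fields. They still work as shared defaults, but they cannot be overridden through the constructor. The sketch below demonstrates the difference. `ClassAttributeConfig`, `FieldConfig`, `page_size`, and `from_env` are hypothetical names for illustration; only the environment variable name and the default values come from this commit.

```python
import os
from dataclasses import dataclass, field, fields


@dataclass
class ClassAttributeConfig:
    # No annotation: a plain class attribute, ignored by @dataclass.
    name = "clinicaltrials"


@dataclass
class FieldConfig:
    # Annotated: real dataclass fields that can be overridden per instance.
    name: str = "clinicaltrials"
    page_size: int = 1000
    # Mutable defaults need default_factory; a bare list here raises ValueError.
    node_types: list[str] = field(
        default_factory=lambda: ["BioEntity", "ClinicalTrial"]
    )

    @classmethod
    def from_env(cls) -> "FieldConfig":
        """Read overrides when called, not when the module is imported."""
        return cls(
            name=os.environ.get(
                "CLINICAL_TRIALS_DOT_GOV_PROCESSOR_NAME", "clinicaltrials"
            )
        )


# fields() lists what the generated __init__ accepts.
class_attribute_fields = [f.name for f in fields(ClassAttributeConfig)]
annotated_fields = [f.name for f in fields(FieldConfig)]
```

With annotated fields, a test can build `FieldConfig(name="stub")` directly, whereas `ClassAttributeConfig(name="stub")` raises `TypeError` because the generated constructor takes no arguments.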