A collection of clinical properties and vocabularies, relevant within the context of a namespace (typically a disease type being researched), used to annotate a biospecimen, a collection of biospecimens represented by a surgery event, or the patient record for biospecimens.
These clinical properties are selected from existing specifications, and are intended to be used in annotation of clinical data for research purposes in the LSP Sample Tracking System.
All data recorded in the Sample Tracker are de-identified and contain no personally identifiable information or personal health information (PHI). Accordingly, no clinical property values described by this terminology should contain such information.
The TSV (tab-separated value) files in this repository detail the clinical fields (properties) and controlled vocabularies in the collection, as well as metadata specifying how to annotate the clinical fields and vocabularies using identifiers and descriptive properties.
- clinical_properties.csv - the collection of clinical properties (fields) to be used to annotate the patients, surgeries, and biospecimens.
- clinical_vocabularies.csv - the collection of controlled vocabularies for the clinical properties.
- clinical_properties_summary.xlsx - (READ ONLY) a summary format, provided for information and collaboration with domain experts.
  - Excel format used to collect clinical property and clinical vocabulary terms from domain experts.
  - Separate sheet for each namespace and resource, with clinical properties listed in the first row and vocabulary terms listed in columns.
  - See workflow section below for more information.
- metadata_properties.csv - the list of metadata properties used to annotate clinical_properties.csv and clinical_vocabularies.csv.
- namespaces.csv - A list of namespaces (disease types) used to form the natural key namespace for each term and vocabulary.
- resources.csv - A list of resources (patient, surgery, biospecimen) used to form the natural key with the namespace for each term and vocabulary.
This terminology is organized such that each term is defined by a unique natural key.
A clinical property term is identified by a three-part natural key formed by the combination [resource, namespace, key].
NOTE: the combination of [namespace, key] must also be unique (a namespace and key combination may not be reused between resources).
- a resource is the entity record to be annotated, one of: patient, surgery, biospecimen
- a namespace is the identifier of the disease type being studied for the clinical property or vocabulary, one of the values in namespaces.csv
- a key is an identifier for the term that is unique for the namespace context.
  - a key is typically created by "normalizing" the title and follows the rules for normalization described below.
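The two uniqueness rules for clinical property keys can be checked mechanically. The following is a minimal sketch, not the project's validator: the helper name find_duplicate_keys and the sample namespace "sarcoma" are hypothetical, and rows are assumed to be dictionaries such as those produced by csv.DictReader.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in rows (sorted)."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical rows: the same key reused under two different resources.
rows = [
    {"resource": "patient", "namespace": "sarcoma", "key": "tumor_grade"},
    {"resource": "biospecimen", "namespace": "sarcoma", "key": "tumor_grade"},
]

# The three-part natural key is unique for these rows ...
print(find_duplicate_keys(rows, ["resource", "namespace", "key"]))
# ... but the [namespace, key] rule is violated.
print(find_duplicate_keys(rows, ["namespace", "key"]))
```

Running both checks matters because a set of rows can satisfy the three-part key and still reuse a namespace and key combination across resources.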
Additionally, each term is assigned a data type:
- string - a text string
- integer - a whole number
- float - a decimal number
- date - a date in the format YYYY-MM-DD
- boolean - a true/false value
- arraystring - a comma separated list of text strings
- arrayint - a comma separated list of whole numbers
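As an illustration of how these data types might map onto values, here is a small sketch that converts a raw text value into a Python object. The function name parse_value is hypothetical, and two details are assumptions not stated in this specification: booleans are spelled "true" or "false" (case-insensitive), and whitespace around list items is ignored.

```python
from datetime import datetime


def parse_value(raw: str, data_type: str):
    """Convert a raw text value to a Python object for the given data type."""
    if data_type == "string":
        return raw
    if data_type == "integer":
        return int(raw)
    if data_type == "float":
        return float(raw)
    if data_type == "date":
        # Parse a year-month-day date such as 2021-03-04.
        return datetime.strptime(raw, "%Y-%m-%d").date()
    if data_type == "boolean":
        # Assumption: booleans are spelled "true"/"false", case-insensitive.
        lowered = raw.strip().lower()
        if lowered not in ("true", "false"):
            raise ValueError(f"not a boolean: {raw!r}")
        return lowered == "true"
    if data_type == "arraystring":
        # Assumption: whitespace around each item is not significant.
        return [item.strip() for item in raw.split(",")]
    if data_type == "arrayint":
        return [int(item) for item in raw.split(",")]
    raise ValueError(f"unknown data type: {data_type!r}")
```

For example, parse_value("1, 2, 3", "arrayint") yields a list of three integers, while an unrecognized data type raises a ValueError instead of passing the value through silently.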
Additionally, each clinical property term will be assigned a unique identifier when it is registered in the LSP Sample Tracker, and will contain references to external identifiers.
A clinical vocabulary term is identified by a four-part natural key formed by the combination [resource, namespace, field_key, key], where:
- field_key is the key of the clinical property term (field).
- key is an identifier for the vocabulary term that is unique for the clinical property [resource, namespace, field_key] context.
  - a key is typically created by "normalizing" the title and follows the rules for normalization described below.
Additionally, each clinical vocabulary term will be assigned a unique identifier when it is registered in the LSP Sample Tracker, and will contain references to external identifiers.
The key for clinical property and vocabulary entries is formed by lowercasing the title, removing non-alphanumeric characters, and replacing spaces with underscores.
- Example: "Tumor Grade" -> "tumor_grade"
- Example: "Histologic Grade (WHO/ISUP)" -> "histologic_grade_who_isup"
- Example: "% Dedifferentiated" -> "percent_dedifferentiated"
- allowed characters: [a-z0-9_]
- may not start or end with an underscore
- may not contain two consecutive underscores
- certain terms may be manually normalized to avoid conflicts
- other conventions may be used, such as replacing the symbol % by percent or # by number
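The normalization rules can be sketched in a few lines of Python. This is an illustrative helper, not the code used by brttools; the names normalize_key, is_valid_key, and SYMBOL_WORDS are hypothetical. It also assumes that any run of non-alphanumeric characters (not only spaces) becomes a single underscore, which is what the "Histologic Grade (WHO/ISUP)" example suggests.

```python
import re

# Symbols replaced by words before normalization (from the conventions above).
SYMBOL_WORDS = {"%": " percent ", "#": " number "}

# Allowed characters [a-z0-9_], no leading, trailing, or doubled underscores.
VALID_KEY = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def normalize_key(title: str) -> str:
    """Derive a key from a title following the normalization rules."""
    text = title.lower()
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Assumption: every run of other characters collapses to one underscore.
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")


def is_valid_key(key: str) -> bool:
    """Check a key against the allowed-character and underscore rules."""
    return VALID_KEY.fullmatch(key) is not None


for title in ("Tumor Grade", "Histologic Grade (WHO/ISUP)", "% Dedifferentiated"):
    print(title, "->", normalize_key(title))
```

Keys that were normalized manually to avoid conflicts can still be checked with is_valid_key, since the character rules apply to them as well.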
Three workflows are envisioned for updating the specification:
- Direct update: edit and validate clinical_properties.csv and clinical_vocabularies.csv.
- Merge summary file data: new fields and vocabularies from a summary.xlsx file.
- Merge external data: new fields, vocabularies, and updates from externally generated clinical_properties.csv and clinical_vocabularies.csv files, e.g. from the LSP sample tracking database.
The brttools package provides tools to enable these workflows.
pip install .
Validate specification files and generate a summary file.
brttool -d path-to-files/
Requires the complete set of specification files in the BDRT repository: clinical_properties.csv, clinical_vocabularies.csv, metadata_properties.csv, resources.csv, and namespaces.csv.
Actions:
- Validate: column structure and data type using fields defined in metadata_properties.csv
- Validate resources and namespaces.
- Update the ordinal column
- Enforce unique constraints: using key columns and alternate key columns (titles instead of keys)
- Verify vocabulary terms are matched with property terms
- Output a summary (xlsx) file that lists each (resource, namespace) set of properties in separate sheets. Each property is listed as a column header, and each vocabulary is listed in the column values.
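To make the "vocabulary terms are matched with property terms" check concrete, the sketch below looks up each vocabulary row's [resource, namespace, field_key] among the property natural keys. It is an assumption about how such a check could work, not the brttool implementation; the function name unmatched_vocabularies is hypothetical, and rows are assumed to be dictionaries keyed by column name.

```python
def unmatched_vocabularies(properties, vocabularies):
    """Return vocabulary rows whose field has no matching property row."""
    property_keys = {
        (row["resource"], row["namespace"], row["key"]) for row in properties
    }
    return [
        row
        for row in vocabularies
        if (row["resource"], row["namespace"], row["field_key"]) not in property_keys
    ]
```

An empty result means every vocabulary term refers to a registered clinical property; any returned rows point at a missing or misspelled field_key.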
Import, merge and validate a summary file.
brttool -d path-to-files/ -s path-to-summary-file
Actions:
- Read summary file
- Merge with existing clinical_properties.csv and clinical_vocabularies.csv.
  - perform a left join from new data to existing data (ignore unmatched rows in existing data),
  - overwrite null values with non-null values from existing data
- Validate
- Output to clinical_properties_from_summary.csv and clinical_vocabularies_from_summary.csv
- Other:
  - interpret a single vocab value as a "prompt"
  - interpret extra vocab separated by a blank line at the end as a "description"
  - Set the property data_type to integer, float, or string based on title name patterns (if data_type is not set in the existing specification file).
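The merge step of this workflow (left join from new data to existing data, then fill null values from existing data) could look roughly like the following plain-Python sketch. It is not the brttool code: the function names are hypothetical, rows are assumed to be dictionaries keyed by column name, and treating both None and the empty string as null is an assumption.

```python
KEY_COLUMNS = ("resource", "namespace", "key")


def is_null(value):
    # Assumption: None and the empty string both count as null.
    return value is None or value == ""


def merge_new_with_existing(new_rows, existing_rows, key_columns=KEY_COLUMNS):
    """Left join new_rows to existing_rows, filling nulls from existing data."""
    existing_by_key = {
        tuple(row[col] for col in key_columns): row for row in existing_rows
    }
    merged = []
    for new_row in new_rows:
        # Existing rows with no match in the new data are never visited.
        match = existing_by_key.get(tuple(new_row[col] for col in key_columns), {})
        row = dict(new_row)
        for column, value in match.items():
            if is_null(row.get(column)) and not is_null(value):
                row[column] = value
        merged.append(row)
    return merged
```

Because the loop runs over the new rows only, the result has exactly one row per new row, and values already present in the new data are left untouched.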
Merge data from one specification file (either "-cp", "--clinical_properties" or "-cv", "--clinical_vocabularies") to another using natural keys.
brt_mergetool -f1 new_base_specification_file -f2 overlay_specification_file [--clinical_properties or --clinical_vocabularies]
Merge file2 specification data into file1:
- Left join file1 to file2 on natural keys [resource, namespace, key]
- Preserve non-null file1 values, merge non-null file2 values.
Note: uses pandas.DataFrame.update to perform the merge operation.
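The sketch below shows how a merge of this kind can be expressed with pandas.DataFrame.update. It is an illustration under stated assumptions rather than the brt_mergetool source: the function name merge_specs is hypothetical, and overwrite=False is chosen here to match "preserve non-null file1 values"; which options the tool actually passes is not stated in this document.

```python
import pandas as pd

NATURAL_KEY = ["resource", "namespace", "key"]


def merge_specs(file1: pd.DataFrame, file2: pd.DataFrame) -> pd.DataFrame:
    """Fill null cells of file1 with non-null file2 values on the natural key."""
    base = file1.set_index(NATURAL_KEY)
    overlay = file2.set_index(NATURAL_KEY)
    # update() aligns on the index and keeps only the rows of `base`,
    # which gives the left-join behavior; file2-only rows are dropped.
    # Assumption: overwrite=False, so non-null file1 values are preserved.
    base.update(overlay, overwrite=False)
    return base.reset_index()
```

DataFrame.update modifies the frame in place and never adds rows, so rows that exist only in file2 do not appear in the result. The sketch assumes the natural key is unique in both files.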