Merge pull request #133 from monarch-initiative/develop
Update docs, other changes for cookiecutter usage
Showing 47 changed files with 1,198 additions and 883 deletions.
```yaml
# Set update schedule for GitHub Actions

version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      # Check for updates to GitHub Actions every week
      interval: "weekly"
```
```yaml
name: publish on pypi

on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          pip install poetry && poetry install

      - name: Build
        run: |
          poetry version $(git describe --tags --abbrev=0)
          poetry build

      - name: Publish to PyPi
        env:
          PYPI_API_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
        run: |
          poetry config http-basic.pypi "__token__" "${PYPI_API_TOKEN}"
          poetry publish
```
<sub>
(For CLI usage, see the [CLI commands](../Usage/CLI.md) page.)
</sub>

Koza is designed to process and transform existing data into a target csv/json/jsonl format.

This process is internally known as an **ingest**. Ingests are defined by:

1. [Source config yaml](./source_config.md): Ingest configuration, including:
    - metadata, formats, required columns, any SSSOM files, etc.
1. [Map config yaml](./mapping.md): (Optional) Configures creation of a mapping dictionary
1. [Transform code](./transform.md): A Python script with specific transform instructions
Mapping with Koza is optional, but can be done in two ways:

- Automated mapping with SSSOM files
- Manual mapping with a map config yaml

### SSSOM Mapping

Koza supports mapping with SSSOM (Simple Standard for Sharing Ontological Mappings) files.
Simply add the path to the SSSOM file to your source config, the desired target prefixes,
and any prefixes you want to use to filter the SSSOM file.
Koza will then create a mapping lookup table that automatically attempts to map
any values in the source file to an ID with the target prefix.

```yaml
sssom_config:
  sssom_file: './path/to/your_mapping_file.sssom.tsv'
  filter_prefixes:
    - 'SOMEPREFIX'
    - 'OTHERPREFIX'
  target_prefixes:
    - 'OTHERPREFIX'
  use_match:
    - 'exact'
```

**Note:** Currently, only the `exact` match type is supported (`narrow` and `broad` match types will be added in the future).

### Manual Mapping / Additional Data

The map config yaml allows you to include data from other sources in your ingests,
which may have different columns or formats.

If you don't have an SSSOM file, or you want to manually map some values, you can use a map config yaml.
You can then add this map to your source config yaml in the `depends_on` property.

Koza will then create a nested dictionary with the specified key and values.
For example, the following map config yaml maps values from the `STRING` column to the `entrez` and `NCBI taxid` columns.

```yaml
# koza/examples/maps/entrez-2-string.yaml
name: ...
files: ...
columns:
  - 'NCBI taxid'
  - 'entrez'
  - 'STRING'
key: 'STRING'
values:
  - 'entrez'
  - 'NCBI taxid'
```
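To make the shape of that nested dictionary concrete, here is a rough sketch of what the map config above produces. The STRING protein IDs and entrez values below are made up for illustration; the key is the `key` column and each value is a dict of the `values` columns:

```python
# Hypothetical sketch of the nested lookup dict Koza builds from the map
# config above; identifiers and values are illustrative, not real data.
koza_map = {
    "9606.ENSP00000000001": {"entrez": "1234", "NCBI taxid": "9606"},
    "9606.ENSP00000000002": {"entrez": "5678", "NCBI taxid": "9606"},
}

# A transform script would then resolve a STRING ID to an entrez ID like so:
entrez_id = koza_map["9606.ENSP00000000001"]["entrez"]
print(entrez_id)
```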
The mapping dict will be available in your transform script from the `koza_app` object (see the Transform Code section below).

---

**Next Steps: [Transform Code](./transform.md)**
This YAML file sets properties for the ingest of a single file type from within a Source.

!!! tip "Paths are relative to the directory from which you execute Koza."

## Source Configuration Properties

| **Required properties** | |
| --------------------------- | ------------------------------------------------------------------------------------------------------ |
| `name` | Name of the data ingest, as `<data source>_<type_of_ingest>`, <br/>ex. `hpoa_gene_to_disease` |
| `files` | List of files to process |
| `node_properties` | List of node properties to include in output |
| `edge_properties` | List of edge properties to include in output |

**Note:** Either node or edge properties (or both) must be defined in the primary config yaml for your transform.

| **Optional properties** | |
| --------------------------- | ------------------------------------------------------------------------------------------------------ |
| `file_archive` | Path to a file archive containing the file(s) to process. <br/>Supported archive formats: zip, gzip |
| `format` | Format of the data file(s) (CSV or JSON) |
| `sssom_config` | Configures usage of SSSOM mapping files |
| `depends_on` | List of map config files to use |
| `metadata` | Metadata for the source: either a list of properties,<br/>or a path to a `metadata.yaml` |
| `transform_code` | Path to a Python file to transform the data |
| `transform_mode` | How to process the transform file |
| `global_table` | Path to a global translation table file |
| `local_table` | Path to a local translation table file |
| `field_type_map` | Dict of field names and their types (using the FieldType enum) |
| `filters` | List of filters to apply |
| `json_path` | Path within the JSON object containing the data to process (JSON only) |
| `required_properties` | List of properties that must be present in the output (JSON only) |

| **CSV-specific properties** | |
| --------------------------- | ------------------------------------------------------------------------------------------------------ |
| `delimiter` | Delimiter for csv files (**required for CSV format**) |
| `columns` | List of columns to include in output |
| `header` | Header row index for csv files |
| `header_delimiter` | Delimiter for the header row in csv files |
| `header_prefix` | Prefix for the header row in csv files |
| `comment_char` | Comment character for csv files |
| `skip_blank_lines` | Whether to skip blank lines in csv files |
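Putting a few of these properties together, a minimal CSV source config might look like the following sketch. The name, file paths, and column names here are hypothetical examples, not part of any real ingest:

```yaml
# Hypothetical minimal source config for a tab-delimited CSV ingest
name: 'example_gene_to_phenotype'
files:
  - './data/example_gene_to_phenotype.tsv'
format: 'csv'
delimiter: '\t'
columns:
  - 'gene_id'
  - 'phenotype_id'
edge_properties:
  - 'id'
  - 'subject'
  - 'predicate'
  - 'object'
transform_code: './src/example/transform.py'
```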
## Metadata Properties

Metadata is optional, and can be defined either as a list of properties and values or as a path to a `metadata.yaml` file,
for example: `metadata: "./path/to/metadata.yaml"`.
Remember that the path is relative to the directory from which you execute Koza.

| **Metadata properties** | |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| `name` | Name of the data source, ex. "FlyBase" |
| `description` | Description of the data/ingest |
| `ingest_title` | \*Title of the source of the data; maps to biolink name |
| `ingest_url` | \*URL of the source of the data; maps to biolink iri |
| `provided_by` | `<data source>_<type_of_ingest>`, ex. `hpoa_gene_to_disease` (see the config property `name`) |
| `rights` | Link to license information for the data source |

**\*Note**: For more information on `ingest_title` and `ingest_url`, see the [infores catalog](https://biolink.github.io/information-resource-registry/infores_catalog.yaml)

## Composing Configuration from Multiple Yaml Files

Koza's custom YAML loader supports importing/including other yaml files with an `!include` tag.

For example, if you had a file named `standard-columns.yaml`:

```yaml
- "column_1"
- "column_2"
- "column_3"
- "column_4": "int"
```

Then, in any ingest in which you wish to use these columns, you can simply `!include` them:

```yaml
columns: !include "./path/to/standard-columns.yaml"
```

---

**Next Steps: [Mapping and Additional Data](./mapping.md)**
Koza includes a `mock_koza` fixture (see `src/koza/utils/testing_utils`) that can be used to test your ingest configuration. This fixture accepts the following arguments:

| Argument | Type | Description |
| ---------------------- | ------------------------- | ------------------------------------- |
| **Required arguments** | | |
| `name` | `str` | The name of the ingest |
| `data` | `Union[Dict, List[Dict]]` | The data to be ingested |
| `transform_code` | `str` | Path to the transform code to be used |
| **Optional arguments** | | |
| `map_cache` | `Dict` | Map cache to be used |
| `filters` | `List[str]` | List of filters to apply to the data |
| `global_table` | `str` | Path to the global table |
| `local_table` | `str` | Path to the local table |

The `mock_koza` fixture returns the list of entities that would be generated by the ingest configuration.
This list can then be used to test the output of the transform script.

Here is an example of how to use the `mock_koza` fixture to test an ingest configuration:

```python
import pytest

from koza.utils.testing_utils import mock_koza

# Define the source name and transform script path
INGEST_NAME = "your_ingest_name"
TRANSFORM_SCRIPT = "./src/{{cookiecutter.__project_slug}}/transform.py"


# Define an example row to test (as a dictionary)
@pytest.fixture
def example_row():
    return {
        "example_column_1": "entity_1",
        "example_column_2": "entity_6",
        "example_column_3": "biolink:related_to",
    }


# Or a list of rows
@pytest.fixture
def example_list_of_rows():
    return [
        {
            "example_column_1": "entity_1",
            "example_column_2": "entity_6",
            "example_column_3": "biolink:related_to",
        },
        {
            "example_column_1": "entity_2",
            "example_column_2": "entity_7",
            "example_column_3": "biolink:related_to",
        },
    ]


# Define the mock koza transform
@pytest.fixture
def mock_transform(mock_koza, example_row):
    return mock_koza(
        INGEST_NAME,
        example_row,
        TRANSFORM_SCRIPT,
    )


# Or, for multiple rows
@pytest.fixture
def mock_transform_multiple_rows(mock_koza, example_list_of_rows):
    return mock_koza(
        INGEST_NAME,
        example_list_of_rows,
        TRANSFORM_SCRIPT,
    )


# Test the output of the transform
def test_single_row(mock_transform):
    assert len(mock_transform) == 1
    entity = mock_transform[0]
    assert entity
    assert entity.subject == "entity_1"


def test_multiple_rows(mock_transform_multiple_rows):
    assert len(mock_transform_multiple_rows) == 2
    entity_1 = mock_transform_multiple_rows[0]
    entity_2 = mock_transform_multiple_rows[1]
    assert entity_1.subject == "entity_1"
    assert entity_2.subject == "entity_2"
```
This Python script is where you'll define the specific steps of your data transformation.
Koza will load this script and execute it for each row of data in your source file,
applying any filters and mapping defined in your source config yaml,
and outputting the transformed data to the target csv/json/jsonl file.

When Koza is called, either from the command line or as a library using `transform_source()`,
it creates a `KozaApp` object for the specified ingest.
This `KozaApp` will be your entry point to Koza:

```python
from koza.cli_utils import get_koza_app

koza_app = get_koza_app('your-source-name')
```

The `KozaApp` object has the following methods, which can be used in your transform code:

| Method | Description |
| ------------------- | ------------------------------------------------- |
| `get_row()` | Returns the next row of data from the source file |
| `next_row()` | Skips to the next row in the data file |
| `get_map(map_name)` | Returns the mapping dict for the specified map |
| `process_sources()` | TBD |
| `process_maps()` | Initializes the KozaApp's map cache |
| `write(*args)` | Writes the transformed data to the target file |

Once you have processed a row of data and created a biolink entity node or edge object (or both),
you can pass these to `koza_app.write()` to output the transformed data to the target file.

??? tldr "Example Python Transform Script"

    ```python
    # Other imports, e.g. uuid, pydantic, etc.
    import uuid
    from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

    # Koza imports
    from koza.cli_utils import get_koza_app

    # This is the name of the ingest you want to run
    source_name = 'map-protein-links-detailed'
    koza_app = get_koza_app(source_name)

    # If your ingest depends_on a mapping file, you can access it like this:
    map_name = 'entrez-2-string'
    koza_map = koza_app.get_map(map_name)

    # This grabs the first/next row from the source data;
    # Koza will return the next row until it reaches EOF or the row limit
    while (row := koza_app.get_row()) is not None:
        # Now you can lay out your actual transformations, and define your output:
        gene_a = Gene(id='NCBIGene:' + koza_map[row['protein1']]['entrez'])
        gene_b = Gene(id='NCBIGene:' + koza_map[row['protein2']]['entrez'])

        pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
            id="uuid:" + str(uuid.uuid1()),
            subject=gene_a.id,
            object=gene_b.id,
            predicate="biolink:interacts_with"
        )

        # Finally, write the transformed row to the target file
        koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
    ```

If you pass nodes as well as edges to `koza_app.write()`, Koza will automatically create a node file and an edge file.
If you pass only nodes, Koza will create only a node file; if you pass only edges, Koza will create only an edge file.
::: src.koza.cli_utils |