Merge pull request #133 from monarch-initiative/develop
Update docs, other changes for cookiecutter usage
glass-ships authored May 13, 2024
2 parents c08b69b + 2b6c014 commit 7563b8f
Showing 47 changed files with 1,198 additions and 883 deletions.
10 changes: 10 additions & 0 deletions .github/dependabot.yaml
```yaml
# Set update schedule for GitHub Actions
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      # Check for updates to GitHub Actions every week
      interval: "weekly"
```
47 changes: 24 additions & 23 deletions .github/workflows/publish.yaml
The workflow was reformatted, and the Build step now stamps the package version from the latest git tag before building. The updated workflow:

```yaml
name: publish on pypi

on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          pip install poetry && poetry install

      - name: Build
        run: |
          poetry version $(git describe --tags --abbrev=0)
          poetry build

      - name: Publish to PyPi
        env:
          PYPI_API_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
        run: |
          poetry config http-basic.pypi "__token__" "${PYPI_API_TOKEN}"
          poetry publish
```
2 changes: 1 addition & 1 deletion README.md

[![Pyversions](https://img.shields.io/pypi/pyversions/koza.svg)](https://pypi.python.org/pypi/koza)
[![PyPi](https://img.shields.io/pypi/v/koza.svg)](https://pypi.python.org/pypi/koza)
```diff
-![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/build.yml/badge.svg)
+![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/test.yaml/badge.svg)
```

![pupa](docs/img/pupa.png)

12 changes: 12 additions & 0 deletions docs/Ingests/index.md
<sub>
(For CLI usage, see the [CLI commands](../Usage/CLI.md) page.)
</sub>

Koza is designed to process and transform existing data into a target CSV, JSON, or JSONL format.

This process is internally known as an **ingest**. Ingests are defined by:

1. [Source config yaml](./source_config.md): Ingest configuration, including:
- metadata, formats, required columns, any SSSOM files, etc.
1. [Map config yaml](./mapping.md): (Optional) configures creation of mapping dictionary
1. [Transform code](./transform.md): a Python script, with specific transform instructions
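
For orientation, a minimal source config might look like the following sketch (the ingest name, file names, and columns are all invented for illustration; see [Source config yaml](./source_config.md) for the full property list):

```yaml
name: 'examplesource_gene_to_gene'
files:
  - './data/example.tsv'
format: 'csv'
delimiter: '\t'
columns:
  - 'gene_a'
  - 'gene_b'
transform_code: './transform.py'
edge_properties:
  - 'id'
  - 'subject'
  - 'predicate'
  - 'object'
```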
62 changes: 62 additions & 0 deletions docs/Ingests/mapping.md

Mapping with Koza is optional, but can be done in two ways:

- Automated mapping with SSSOM files
- Manual mapping with a map config yaml

### SSSOM Mapping

Koza supports mapping with SSSOM (Simple Standard for Sharing Ontological Mappings) files.
Simply add the path to the SSSOM file to your source config, the desired target prefixes,
and any prefixes you want to use to filter the SSSOM file.
Koza will then build a mapping lookup table and automatically attempt to map any values
in the source file to an ID with the target prefix.

```yaml
sssom_config:
sssom_file: './path/to/your_mapping_file.sssom.tsv'
filter_prefixes:
- 'SOMEPREFIX'
- 'OTHERPREFIX'
target_prefixes:
- 'OTHERPREFIX'
use_match:
- 'exact'
```

**Note:** Currently, only the `exact` match type is supported (`narrow` and `broad` match types will be added in the future).
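
For reference, an SSSOM file is a TSV whose core columns are `subject_id`, `predicate_id`, and `object_id`; a toy row (the prefixes and IDs here are invented):

```tsv
subject_id	predicate_id	object_id	mapping_justification
SOMEPREFIX:0000001	skos:exactMatch	OTHERPREFIX:1234	semapv:ManualMappingCuration
```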

### Manual Mapping / Additional Data

The map config yaml allows you to include data from other sources in your ingests,
which may have different columns or formats.

If you don't have an SSSOM file, or you want to manually map some values, you can use a map config yaml.
You can then add this map to your source config yaml in the `depends_on` property.

Koza will then create a nested dictionary with the specified key and values.
For example, the following map config yaml maps values from the `STRING` column to the `entrez` and `NCBI taxid` columns.

```yaml
# koza/examples/maps/entrez-2-string.yaml
name: ...
files: ...
columns:
- 'NCBI taxid'
- 'entrez'
- 'STRING'
key: 'STRING'
values:
- 'entrez'
- 'NCBI taxid'
```


The mapping dict will be available in your transform script from the `koza_app` object (see the Transform Code section below).
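
Assuming the `entrez-2-string` map config above, the lookup Koza builds is a plain nested dict keyed by the `key` column (`STRING`), with the `values` columns nested under each key. A sketch of its shape, with invented IDs:

```python
# Illustrative shape of the mapping dict Koza builds from the
# entrez-2-string map config above (the IDs below are made up):
koza_map = {
    "9606.ENSP00000000233": {"entrez": "857", "NCBI taxid": "9606"},
    "9606.ENSP00000000412": {"entrez": "2936", "NCBI taxid": "9606"},
}

# In a transform, a row's STRING ID can then be mapped like so:
entrez_id = koza_map["9606.ENSP00000000233"]["entrez"]
print(entrez_id)  # prints 857
```

In an actual transform script this dict is obtained via `koza_app.get_map('entrez-2-string')` rather than constructed by hand.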

---

**Next Steps: [Transform Code](./transform.md)**
79 changes: 79 additions & 0 deletions docs/Ingests/source_config.md
This YAML file sets properties for the ingest of a single file type from within a Source.

!!! tip "Paths are relative to the directory from which you execute Koza."

## Source Configuration Properties

| **Required properties** | |
| --------------------------- | ------------------------------------------------------------------------------------------------------ |
| `name` | Name of the data ingest, as `<data source>_<type_of_ingest>`, <br/>ex. `hpoa_gene_to_disease` |
| `files` | List of files to process |
| | |
| `node_properties` | List of node properties to include in output |
| `edge_properties` | List of edge properties to include in output |
| **Note** | Either node or edge properties (or both) must be defined in the primary config yaml for your transform |
| | |
| **Optional properties** | |
| `file_archive` | Path to a file archive containing the file(s) to process <br/> Supported archive formats: zip, gzip |
| `format` | Format of the data file(s) (CSV or JSON) |
| `sssom_config` | Configures usage of SSSOM mapping files |
| `depends_on` | List of map config files to use |
| `metadata` | Metadata for the source, either a list of properties,<br/>or path to a `metadata.yaml` |
| `transform_code` | Path to a python file to transform the data |
| `transform_mode` | How to process the transform file |
| `global_table` | Path to a global translation table file |
| `local_table` | Path to a local translation table file |
| `field_type_map` | Dict of field names and their type (using the FieldType enum) |
| `filters` | List of filters to apply |
| `json_path` | Path within JSON object containing data to process |
| `required_properties` | List of properties that must be present in output (JSON only) |
| | |
| **CSV-Specific Properties** | |
| `delimiter` | Delimiter for csv files (**Required for CSV format**) |
| **Optional CSV Properties** | |
| `columns` | List of columns to include in output (CSV only) |
| `header` | Header row index for csv files |
| `header_delimiter` | Delimiter for header in csv files |
| `header_prefix` | Prefix for header in csv files |
| `comment_char` | Comment character for csv files |
| `skip_blank_lines` | Skip blank lines in csv files |
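
As an illustration, the CSV-specific properties above might combine like this (all values hypothetical):

```yaml
format: 'csv'
delimiter: '\t'
header: 0
comment_char: '#'
skip_blank_lines: True
columns:
  - 'gene_id'
  - 'taxon_id'
  - 'score': 'float'
```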

## Metadata Properties

Metadata is optional, and can be defined as a list of properties and values, or as a path to a `metadata.yaml` file,
for example - `metadata: "./path/to/metadata.yaml"`.
Remember that the path is relative to the directory from which you execute Koza.

| **Metadata Properties** | |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| name                    | Name of the data source, ex. "FlyBase"                                                      |
| description             | Description of the data/ingest                                                              |
| ingest_title            | \*Title of the data source; maps to the biolink `name`                                      |
| ingest_url              | \*URL of the data source; maps to the biolink `iri`                                         |
| provided_by             | `<data source>_<type_of_ingest>`, ex. `hpoa_gene_to_disease` (see the config property `name`) |
| rights                  | Link to license information for the data source                                             |

**\*Note**: For more information on `ingest_title` and `ingest_url`, see the [infores catalog](https://biolink.github.io/information-resource-registry/infores_catalog.yaml)
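
A standalone `metadata.yaml` might then look like the following (all values are illustrative only):

```yaml
name: 'FlyBase'
description: 'Gene and phenotype data from FlyBase'
ingest_title: 'FlyBase'
ingest_url: 'https://flybase.org'
provided_by: 'flybase_gene_to_phenotype'
rights: 'https://flybase.org/about/licensing'  # hypothetical license link
```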

## Composing Configuration from Multiple Yaml Files

Koza's custom YAML Loader supports importing/including other yaml files with an `!include` tag.

For example, if you had a file named `standard-columns.yaml`:

```yaml
- "column_1"
- "column_2"
- "column_3"
- "column_4": "int"
```

Then, in any ingest where you wish to use these columns, you can simply `!include` them:

```yaml
columns: !include "./path/to/standard-columns.yaml"
```

---

**Next Steps: [Mapping and Additional Data](./mapping.md)**
87 changes: 87 additions & 0 deletions docs/Ingests/testing.md
Koza includes a `mock_koza` fixture (see `src/koza/utils/testing_utils`) that can be used to test your ingest configuration. This fixture accepts the following arguments:

| Argument               | Type                      | Description                           |
| ---------------------- | ------------------------- | ------------------------------------- |
| **Required arguments** |                           |                                       |
| `name`                 | `str`                     | The name of the ingest                |
| `data`                 | `Union[Dict, List[Dict]]` | The data to be ingested               |
| `transform_code`       | `str`                     | Path to the transform code to be used |
| **Optional arguments** |                           |                                       |
| `map_cache`            | `Dict`                    | Map cache to be used                  |
| `filters`              | `List[str]`               | List of filters to apply to the data  |
| `global_table`         | `str`                     | Path to the global table              |
| `local_table`          | `str`                     | Path to the local table               |

The `mock_koza` fixture returns a list of entities that would be generated by the ingest configuration.
This list can then be used to test the output based on the transform script.

Here is an example of how to use the `mock_koza` fixture to test an ingest configuration:

```python
import pytest

from koza.utils.testing_utils import mock_koza

# Define the source name and transform script path
INGEST_NAME = "your_ingest_name"
TRANSFORM_SCRIPT = "./src/{{cookiecutter.__project_slug}}/transform.py"

# Define an example row to test (as a dictionary)
@pytest.fixture
def example_row():
return {
"example_column_1": "entity_1",
"example_column_2": "entity_6",
"example_column_3": "biolink:related_to",
}

# Or a list of rows
@pytest.fixture
def example_list_of_rows():
return [
{
"example_column_1": "entity_1",
"example_column_2": "entity_6",
"example_column_3": "biolink:related_to",
},
{
"example_column_1": "entity_2",
"example_column_2": "entity_7",
"example_column_3": "biolink:related_to",
},
]

# Define the mock koza transform
@pytest.fixture
def mock_transform(mock_koza, example_row):
return mock_koza(
INGEST_NAME,
example_row,
TRANSFORM_SCRIPT,
)

# Or for multiple rows
@pytest.fixture
def mock_transform_multiple_rows(mock_koza, example_list_of_rows):
return mock_koza(
INGEST_NAME,
example_list_of_rows,
TRANSFORM_SCRIPT,
)

# Test the output of the transform

def test_single_row(mock_transform):
assert len(mock_transform) == 1
entity = mock_transform[0]
assert entity
assert entity.subject == "entity_1"


def test_multiple_rows(mock_transform_multiple_rows):
assert len(mock_transform_multiple_rows) == 2
entity_1 = mock_transform_multiple_rows[0]
entity_2 = mock_transform_multiple_rows[1]
assert entity_1.subject == "entity_1"
assert entity_2.subject == "entity_2"
```
67 changes: 67 additions & 0 deletions docs/Ingests/transform.md
This Python script is where you'll define the specific steps of your data transformation.
Koza will load this script and execute it for each row of data in your source file,
applying any filters and mapping as defined in your source config yaml,
and outputting the transformed data to the target csv/json/jsonl file.

When Koza is called, either by command-line or as a library using `transform_source()`,
it creates a `KozaApp` object for the specified ingest.
This KozaApp will be your entry point to Koza:

```python
from koza.cli_utils import get_koza_app
koza_app = get_koza_app('your-source-name')
```

The KozaApp object has the following methods which can be used in your transform code:

| Method | Description |
| ------------------- | ------------------------------------------------- |
| `get_row()` | Returns the next row of data from the source file |
| `next_row()` | Skip to the next row in the data file |
| `get_map(map_name)` | Returns the mapping dict for the specified map |
| `process_sources()` | TBD |
| `process_maps()` | Initializes the KozaApp's map cache |
| `write(*args)` | Writes the transformed data to the target file |

Once you have processed a row of data, and created a biolink entity node or edge object (or both),
you can pass these to `koza_app.write()` to output the transformed data to the target file.

??? tldr "Example Python Transform Script"

```python
    # Other imports, e.g. uuid, pydantic, etc.
import uuid
from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

# Koza imports
from koza.cli_utils import get_koza_app

# This is the name of the ingest you want to run
source_name = 'map-protein-links-detailed'
koza_app = get_koza_app(source_name)

# If your ingest depends_on a mapping file, you can access it like this:
map_name = 'entrez-2-string'
koza_map = koza_app.get_map(map_name)

# This grabs the first/next row from the source data
# Koza will reload this script and return the next row until it reaches EOF or row-limit
while (row := koza_app.get_row()) is not None:
# Now you can lay out your actual transformations, and define your output:

gene_a = Gene(id='NCBIGene:' + koza_map[row['protein1']]['entrez'])
gene_b = Gene(id='NCBIGene:' + koza_map[row['protein2']]['entrez'])

pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
id="uuid:" + str(uuid.uuid1()),
subject=gene_a.id,
object=gene_b.id,
predicate="biolink:interacts_with"
)

# Finally, write the transformed row to the target file
koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
```

If you pass nodes, as well as edges, to `koza_app.write()`, Koza will automatically create a node file and an edge file.
If you pass only nodes, Koza will create only a node file, and if you pass only edges, Koza will create only an edge file.
1 change: 0 additions & 1 deletion docs/Usage/API.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/Usage/Module.md
::: src.koza.cli_utils