Merge pull request #133 from monarch-initiative/develop
Update docs, other changes for cookiecutter usage
glass-ships authored May 13, 2024
2 parents c08b69b + 2b6c014 commit 7563b8f
Showing 47 changed files with 1,198 additions and 883 deletions.
10 changes: 10 additions & 0 deletions .github/dependabot.yaml
```yaml
# Set update schedule for GitHub Actions
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      # Check for updates to GitHub Actions every week
      interval: "weekly"
```
47 changes: 24 additions & 23 deletions .github/workflows/publish.yaml
The workflow was reformatted, and the Build step now stamps the package version from the latest git tag before building. The updated workflow:

```yaml
name: publish on pypi

on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          pip install poetry && poetry install

      - name: Build
        run: |
          poetry version $(git describe --tags --abbrev=0)
          poetry build

      - name: Publish to PyPi
        env:
          PYPI_API_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
        run: |
          poetry config http-basic.pypi "__token__" "${PYPI_API_TOKEN}"
          poetry publish
```
2 changes: 1 addition & 1 deletion README.md

[![Pyversions](https://img.shields.io/pypi/pyversions/koza.svg)](https://pypi.python.org/pypi/koza)
[![PyPi](https://img.shields.io/pypi/v/koza.svg)](https://pypi.python.org/pypi/koza)
```diff
-![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/build.yml/badge.svg)
+![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/test.yaml/badge.svg)
```

![pupa](docs/img/pupa.png)

12 changes: 12 additions & 0 deletions docs/Ingests/index.md
<sub>
(For CLI usage, see the [CLI commands](../Usage/CLI.md) page.)
</sub>

Koza is designed to process and transform existing data into a target CSV, JSON, or JSONL format.

This process is internally known as an **ingest**. Ingests are defined by:

1. [Source config yaml](./source_config.md): Ingest configuration, including:
- metadata, formats, required columns, any SSSOM files, etc.
1. [Map config yaml](./mapping.md): (Optional) configures creation of mapping dictionary
1. [Transform code](./transform.md): a Python script, with specific transform instructions
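
For orientation, a minimal source config might look like the following sketch (the ingest name, file names, and columns are all invented for illustration; see [Source config yaml](./source_config.md) for the full property list):

```yaml
name: 'examplesource_gene_to_gene'
files:
  - './data/example.tsv'
format: 'csv'
delimiter: '\t'
columns:
  - 'gene_a'
  - 'gene_b'
transform_code: './transform.py'
edge_properties:
  - 'id'
  - 'subject'
  - 'predicate'
  - 'object'
```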
62 changes: 62 additions & 0 deletions docs/Ingests/mapping.md

Mapping with Koza is optional, but can be done in two ways:

- Automated mapping with SSSOM files
- Manual mapping with a map config yaml

### SSSOM Mapping

Koza supports mapping with SSSOM (Simple Standard for Sharing Ontological Mappings) files.
Simply add the path to the SSSOM file to your source config, the desired target prefixes,
and any prefixes you want to use to filter the SSSOM file.
Koza will then build a mapping lookup table and automatically attempt to map any values
in the source file to an ID with the target prefix.

```yaml
sssom_config:
sssom_file: './path/to/your_mapping_file.sssom.tsv'
filter_prefixes:
- 'SOMEPREFIX'
- 'OTHERPREFIX'
target_prefixes:
- 'OTHERPREFIX'
use_match:
- 'exact'
```

**Note:** Currently, only the `exact` match type is supported (`narrow` and `broad` match types will be added in the future).
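
For reference, an SSSOM file is a TSV whose core columns are `subject_id`, `predicate_id`, and `object_id`; a toy row (the prefixes and IDs here are invented):

```tsv
subject_id	predicate_id	object_id	mapping_justification
SOMEPREFIX:0000001	skos:exactMatch	OTHERPREFIX:1234	semapv:ManualMappingCuration
```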

### Manual Mapping / Additional Data

The map config yaml allows you to include data from other sources in your ingests,
which may have different columns or formats.

If you don't have an SSSOM file, or you want to manually map some values, you can use a map config yaml.
You can then add this map to your source config yaml in the `depends_on` property.

Koza will then create a nested dictionary with the specified key and values.
For example, the following map config yaml maps values from the `STRING` column to the `entrez` and `NCBI taxid` columns.

```yaml
# koza/examples/maps/entrez-2-string.yaml
name: ...
files: ...
columns:
- 'NCBI taxid'
- 'entrez'
- 'STRING'
key: 'STRING'
values:
- 'entrez'
- 'NCBI taxid'
```


The mapping dict will be available in your transform script from the `koza_app` object (see the Transform Code section below).
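
Assuming the `entrez-2-string` map config above, the lookup Koza builds is a plain nested dict keyed by the `key` column (`STRING`), with the `values` columns nested under each key. A sketch of its shape, with invented IDs:

```python
# Illustrative shape of the mapping dict Koza builds from the
# entrez-2-string map config above (the IDs below are made up):
koza_map = {
    "9606.ENSP00000000233": {"entrez": "857", "NCBI taxid": "9606"},
    "9606.ENSP00000000412": {"entrez": "2936", "NCBI taxid": "9606"},
}

# In a transform, a row's STRING ID can then be mapped like so:
entrez_id = koza_map["9606.ENSP00000000233"]["entrez"]
print(entrez_id)  # prints 857
```

In an actual transform script this dict is obtained via `koza_app.get_map('entrez-2-string')` rather than constructed by hand.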

---

**Next Steps: [Transform Code](./transform.md)**
79 changes: 79 additions & 0 deletions docs/Ingests/source_config.md
This YAML file sets properties for the ingest of a single file type from within a Source.

!!! tip "Paths are relative to the directory from which you execute Koza."

## Source Configuration Properties

| **Required properties** | |
| --------------------------- | ------------------------------------------------------------------------------------------------------ |
| `name` | Name of the data ingest, as `<data source>_<type_of_ingest>`, <br/>ex. `hpoa_gene_to_disease` |
| `files` | List of files to process |
| | |
| `node_properties` | List of node properties to include in output |
| `edge_properties` | List of edge properties to include in output |
| **Note** | Either node or edge properties (or both) must be defined in the primary config yaml for your transform |
| | |
| **Optional properties** | |
| `file_archive` | Path to a file archive containing the file(s) to process <br/> Supported archive formats: zip, gzip |
| `format` | Format of the data file(s) (CSV or JSON) |
| `sssom_config` | Configures usage of SSSOM mapping files |
| `depends_on` | List of map config files to use |
| `metadata` | Metadata for the source, either a list of properties,<br/>or path to a `metadata.yaml` |
| `transform_code` | Path to a python file to transform the data |
| `transform_mode` | How to process the transform file |
| `global_table` | Path to a global translation table file |
| `local_table` | Path to a local translation table file |
| `field_type_map` | Dict of field names and their type (using the FieldType enum) |
| `filters` | List of filters to apply |
| `json_path` | Path within JSON object containing data to process |
| `required_properties` | List of properties that must be present in output (JSON only) |
| | |
| **CSV-Specific Properties** | |
| `delimiter` | Delimiter for csv files (**Required for CSV format**) |
| **Optional CSV Properties** | |
| `columns` | List of columns to include in output (CSV only) |
| `header` | Header row index for csv files |
| `header_delimiter` | Delimiter for header in csv files |
| `header_prefix` | Prefix for header in csv files |
| `comment_char` | Comment character for csv files |
| `skip_blank_lines` | Skip blank lines in csv files |
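
As an illustration, the CSV-specific properties above might combine like this (all values hypothetical):

```yaml
format: 'csv'
delimiter: '\t'
header: 0
comment_char: '#'
skip_blank_lines: True
columns:
  - 'gene_id'
  - 'taxon_id'
  - 'score': 'float'
```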

## Metadata Properties

Metadata is optional, and can be defined as a list of properties and values, or as a path to a `metadata.yaml` file,
for example - `metadata: "./path/to/metadata.yaml"`.
Remember that the path is relative to the directory from which you execute Koza.

| **Metadata Properties** | |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| name                    | Name of the data source, ex. "FlyBase"                                                      |
| description             | Description of the data/ingest                                                              |
| ingest_title            | \*Title of the data source; maps to the biolink `name`                                      |
| ingest_url              | \*URL of the data source; maps to the biolink `iri`                                         |
| provided_by             | `<data source>_<type_of_ingest>`, ex. `hpoa_gene_to_disease` (see the config property `name`) |
| rights                  | Link to license information for the data source                                             |

**\*Note**: For more information on `ingest_title` and `ingest_url`, see the [infores catalog](https://biolink.github.io/information-resource-registry/infores_catalog.yaml)
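
A standalone `metadata.yaml` might then look like the following (all values are illustrative only):

```yaml
name: 'FlyBase'
description: 'Gene and phenotype data from FlyBase'
ingest_title: 'FlyBase'
ingest_url: 'https://flybase.org'
provided_by: 'flybase_gene_to_phenotype'
rights: 'https://flybase.org/about/licensing'  # hypothetical license link
```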

## Composing Configuration from Multiple Yaml Files

Koza's custom YAML Loader supports importing/including other yaml files with an `!include` tag.

For example, if you had a file named `standard-columns.yaml`:

```yaml
- "column_1"
- "column_2"
- "column_3"
- "column_4": "int"
```

Then, in any ingest where you wish to use these columns, you can simply `!include` them:

```yaml
columns: !include "./path/to/standard-columns.yaml"
```

---

**Next Steps: [Mapping and Additional Data](./mapping.md)**
87 changes: 87 additions & 0 deletions docs/Ingests/testing.md
Koza includes a `mock_koza` fixture (see `src/koza/utils/testing_utils`) that can be used to test your ingest configuration. This fixture accepts the following arguments:

| Argument               | Type                      | Description                           |
| ---------------------- | ------------------------- | ------------------------------------- |
| **Required arguments** |                           |                                       |
| `name`                 | `str`                     | The name of the ingest                |
| `data`                 | `Union[Dict, List[Dict]]` | The data to be ingested               |
| `transform_code`       | `str`                     | Path to the transform code to be used |
| **Optional arguments** |                           |                                       |
| `map_cache`            | `Dict`                    | Map cache to be used                  |
| `filters`              | `List[str]`               | List of filters to apply to the data  |
| `global_table`         | `str`                     | Path to the global table              |
| `local_table`          | `str`                     | Path to the local table               |

The `mock_koza` fixture returns a list of entities that would be generated by the ingest configuration.
This list can then be used to test the output based on the transform script.

Here is an example of how to use the `mock_koza` fixture to test an ingest configuration:

```python
import pytest

from koza.utils.testing_utils import mock_koza

# Define the source name and transform script path
INGEST_NAME = "your_ingest_name"
TRANSFORM_SCRIPT = "./src/{{cookiecutter.__project_slug}}/transform.py"

# Define an example row to test (as a dictionary)
@pytest.fixture
def example_row():
return {
"example_column_1": "entity_1",
"example_column_2": "entity_6",
"example_column_3": "biolink:related_to",
}

# Or a list of rows
@pytest.fixture
def example_list_of_rows():
return [
{
"example_column_1": "entity_1",
"example_column_2": "entity_6",
"example_column_3": "biolink:related_to",
},
{
"example_column_1": "entity_2",
"example_column_2": "entity_7",
"example_column_3": "biolink:related_to",
},
]

# Define the mock koza transform
@pytest.fixture
def mock_transform(mock_koza, example_row):
return mock_koza(
INGEST_NAME,
example_row,
TRANSFORM_SCRIPT,
)

# Or for multiple rows
@pytest.fixture
def mock_transform_multiple_rows(mock_koza, example_list_of_rows):
return mock_koza(
INGEST_NAME,
example_list_of_rows,
TRANSFORM_SCRIPT,
)

# Test the output of the transform

def test_single_row(mock_transform):
assert len(mock_transform) == 1
entity = mock_transform[0]
assert entity
assert entity.subject == "entity_1"


def test_multiple_rows(mock_transform_multiple_rows):
assert len(mock_transform_multiple_rows) == 2
entity_1 = mock_transform_multiple_rows[0]
entity_2 = mock_transform_multiple_rows[1]
assert entity_1.subject == "entity_1"
assert entity_2.subject == "entity_2"
```
67 changes: 67 additions & 0 deletions docs/Ingests/transform.md
This Python script is where you'll define the specific steps of your data transformation.
Koza will load this script and execute it for each row of data in your source file,
applying any filters and mapping as defined in your source config yaml,
and outputting the transformed data to the target csv/json/jsonl file.

When Koza is called, either by command-line or as a library using `transform_source()`,
it creates a `KozaApp` object for the specified ingest.
This KozaApp will be your entry point to Koza:

```python
from koza.cli_utils import get_koza_app
koza_app = get_koza_app('your-source-name')
```

The KozaApp object has the following methods which can be used in your transform code:

| Method | Description |
| ------------------- | ------------------------------------------------- |
| `get_row()` | Returns the next row of data from the source file |
| `next_row()` | Skip to the next row in the data file |
| `get_map(map_name)` | Returns the mapping dict for the specified map |
| `process_sources()` | TBD |
| `process_maps()` | Initializes the KozaApp's map cache |
| `write(*args)` | Writes the transformed data to the target file |

Once you have processed a row of data, and created a biolink entity node or edge object (or both),
you can pass these to `koza_app.write()` to output the transformed data to the target file.

??? tldr "Example Python Transform Script"

```python
    # Other imports, e.g. uuid, pydantic, etc.
import uuid
from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

# Koza imports
from koza.cli_utils import get_koza_app

# This is the name of the ingest you want to run
source_name = 'map-protein-links-detailed'
koza_app = get_koza_app(source_name)

# If your ingest depends_on a mapping file, you can access it like this:
map_name = 'entrez-2-string'
koza_map = koza_app.get_map(map_name)

# This grabs the first/next row from the source data
# Koza will reload this script and return the next row until it reaches EOF or row-limit
while (row := koza_app.get_row()) is not None:
# Now you can lay out your actual transformations, and define your output:

gene_a = Gene(id='NCBIGene:' + koza_map[row['protein1']]['entrez'])
gene_b = Gene(id='NCBIGene:' + koza_map[row['protein2']]['entrez'])

pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
id="uuid:" + str(uuid.uuid1()),
subject=gene_a.id,
object=gene_b.id,
predicate="biolink:interacts_with"
)

# Finally, write the transformed row to the target file
koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
```

If you pass nodes, as well as edges, to `koza_app.write()`, Koza will automatically create a node file and an edge file.
If you pass only nodes, Koza will create only a node file, and if you pass only edges, Koza will create only an edge file.
1 change: 0 additions & 1 deletion docs/Usage/API.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/Usage/Module.md
::: src.koza.cli_utils