Update documentation (#27)
* initial docs

* remove new use case text from docs, this now comes from pipeline.json

* update pipeline.json documentation
patrick-troy authored Sep 9, 2024
1 parent e7c323c commit 09ce56a
Showing 14 changed files with 552 additions and 3,718 deletions.
54 changes: 2 additions & 52 deletions README.md
@@ -11,22 +11,6 @@ Most of the utilities are centred around three core datasets:
* CIN Census
* Annex A

## CIN Census

The CIN Census file is provided as one or more XML files. The core tools allow you to validate either an entire CIN Census
file or individual Child elements, flagging or discarding those that do not conform to
the [CIN Census schema](liiatools/spec/cin/cin-2022.xsd).

In addition, the tool can detect non-ISO8601 dates and conform them to the correct format, and reject values that are
not in the permitted enumerations.



# liiatools

This document is a user guide for installing and configuring the liiatools PyPI package, released to the London
DataStore for processing the Children’s Services datasets deposited by London Local Authorities.

## Introduction to LIIA project

The LIIA (London Innovation and Improvement Alliance) project brings together Children’s Services data from all the
@@ -37,44 +21,10 @@ pan-London datasets.
Please see [LIIA Child Level Data Project](https://liia.london/liia-programme/targeted-work/child-level-data-project)
for more information about the project, its aims and partners.

## Purpose of liiatools package

The package is designed to process data deposited onto the London DataStore by each of the 33 London LAs so that it
can be used for analysis purposes. The processing carries out the following tasks:

* validates the deposited data against a specification
* removes information that does not conform to the specification
* degrades sensitive information to reduce the sensitivity of the data that is shared
* exports the processed data in an analysis-friendly format for:
* the LA to analyse at a single-LA level
* additional partners to analyse at a pan-London level

The package is designed to enable processing of data from several datasets routinely created by all LAs as part of
their statutory duties. In v0.1 the datasets that can be processed are:

* [Annex A](/liiatools/datasets/annex_a/README.md)
* CIN (Future Release)
* 903 (Future Release)

The package is designed to process data that is deposited by LAs into a directory to be created on the London
DataStore for this purpose.

### Installing liiatools
liiatools can be installed with pip:

`pip install liiatools`

or with Poetry:

`poetry add liiatools`

### Configuring liiatools

All of the functions in liiatools are accessed through CLI commands. Refer to the help command for more info:

`python -m liiatools --help`
## Purpose of liia-tools-pipeline package

The package is designed to process data deposited onto the data platform by local authorities such that it can be used for analysis purposes.

# liia-code-server
This is a library set up to be used as a Dagster code server.

## How to use:
66 changes: 66 additions & 0 deletions docs/FAQ.md
@@ -0,0 +1,66 @@
# Frequently Asked Questions

## 1. How do I add a new cleaning function?

* The cleaning functions themselves should be added to the [common/converters.py](/liiatools/common/converters.py) file. These are built to accept individual values and return a clean value. If there are any errors, we want to raise a ValueError with an appropriate error message (a hedged sketch of the whole pattern follows after this list).

* The new cleaning function can then be implemented in the conform_cell_types function in the [common/stream_filters.py](/liiatools/common/stream_filters.py) file. Here we want to add a new branch, `if column_spec.type == "new_type"`, that calls the corresponding new cleaning function.

* The new `column_spec.type` needs to align with the Column class in the [__data_schema.py](/liiatools/common/spec/__data_schema.py) file. This is where you will want to add the new type, following the pattern `new_type_name: possible_types`. The `possible_types` can range from simple string literals to more complex classes, such as the Numeric class or Category class.

* Once complete, be sure to add unit tests for the cleaning functions to the [common/test_converters.py](/liiatools/tests/common/test_converters.py) and [common/test_filter.py](/liiatools/tests/common/test_filter.py) files.
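
As a hedged illustration of the steps above, here is a minimal sketch that adds a hypothetical `percentage` column type. The type name, the conversion rules, and the exact call site are assumptions for illustration only, not part of liiatools:

```python
# Hypothetical example: a "percentage" type is not part of liiatools

# 1. In common/converters.py: accept an individual value, return the clean
#    value, and raise ValueError with an appropriate message otherwise
def to_percentage(value) -> float:
    try:
        number = float(value)
    except (TypeError, ValueError):
        raise ValueError(f"Invalid percentage: {value}")
    if not 0 <= number <= 100:
        raise ValueError(f"Percentage out of range: {value}")
    return number

# 2. In conform_cell_types in common/stream_filters.py, dispatch on the new
#    type (the surrounding function is elided; this call site is an assumption):
#
#     if column_spec.type == "percentage":
#         return to_percentage(cell_value)

# 3. In common/spec/__data_schema.py, extend the Column class so that
#    column_spec.type accepts "percentage", following the
#    new_type_name: possible_types pattern described above
```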

## 2. How do I apply a new cleaning function to .xml files?

* Once you have followed the steps outlined in question 1, you may need to create a new function that converts the .xsd schema into a Column class that is readable by the pipeline. Examples of these can be found in the [common/stream_filters.py](/liiatools/common/stream_filters.py) file. How you apply these will depend on the naming conventions you have used in the .xsd schema, but the examples should give you an idea. Below is a more detailed breakdown of the existing _create_category_spec function:

```python
# Imports as used at the top of common/stream_filters.py
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

# Category is defined in common/spec/__data_schema.py; exact import path assumed
from liiatools.common.spec.__data_schema import Category


def _create_category_spec(field: str, file: Path) -> List[Category] | None:
    """
    Create a list of Category classes containing the different categorical values of a given field to conform categories
    e.g. [Category(code='0', name='Not an Agency Worker'), Category(code='1', name='Agency Worker')]
    :param field: Name of the categorical field you want to find the values for
    :param file: Path to the .xsd schema containing possible categories
    :return: List of Category classes of categorical values and potential alternatives
    """
    category_spec = []

    # Use the xml.etree.ElementTree parse functionality to read the .xsd schema
    xsd_xml = ET.parse(file)
    # Use the field argument to search for a specific field; this will be a string you
    # have determined in the stream_filters.py file, e.g. "some_string_ending_in_type".
    # The simpleType here aligns with simpleType in the .xsd schema
    search_elem = f".//{{http://www.w3.org/2001/XMLSchema}}simpleType[@name='{field}']"
    # Search through the .xsd schema for this field
    element = xsd_xml.find(search_elem)

    if element is not None:
        # Find the 'code' parameter, which is within the .xsd enumeration node
        search_value = f".//{{http://www.w3.org/2001/XMLSchema}}enumeration"
        value = element.findall(search_value)
        if value:
            # The category element of the Column class is a list of Category classes,
            # so we append a Category class to the list
            for v in value:
                # Grab the value found in the .xsd schema
                category_spec.append(Category(code=v.get("value")))

        # Find the 'name' parameter, which is within the .xsd documentation node
        search_doc = f".//{{http://www.w3.org/2001/XMLSchema}}documentation"
        documentation = element.findall(search_doc)
        # Use enumerate to loop through the existing Category classes by index
        for i, d in enumerate(documentation):
            # Add a name value to the existing Category classes so each one has both code and name
            category_spec[i].name = d.text

        # Before returning the finished category_spec you could, for example, add
        # another loop to add potential regex patterns if necessary
        return category_spec
    else:
        return None
```
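
For orientation, a hedged usage sketch of the function follows; the field name and schema path are illustrative assumptions, not values taken from the real schemas:

```python
from pathlib import Path

# Illustrative inputs only: the field name and schema path are assumptions
schema_path = Path("liiatools/spec/cin/cin-2022.xsd")
spec = _create_category_spec("agencyworkertype", schema_path)

if spec:
    for category in spec:
        print(category.code, category.name)
```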

## 3. How do I add new enriching / degrading functions?

* The new enriching and degrading functions should be added to the [_transform_functions.py](/liiatools/common/_transform_functions.py) file. These are built to accept a row (pd.Series) of data and output transformed values. These functions can vary from being simple additions of metadata, such as the year or LA, to specific hashing values of a specific column.

* You will need to add the new function(s) and then include them in the corresponding enrich_functions/degrade_functions dictionaries. The keys in these dictionaries determine what key to put in the corresponding pipeline.json file, and the values determine which function is performed, e.g. `"year": add_year` will be called in the pipeline.json like this:

```json
{
  "id": "YEAR",
  "type": "integer",
  "enrich": "year",
  "sort": 0
}
```

* Here we have created a new column called YEAR, of type integer, populated by the enrich function `add_year`, and placed first in the sort order; a sketch of the registration follows below.
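
As a minimal sketch of that registration, assuming the row-in, value-out pattern described above (the exact signatures in _transform_functions.py may differ):

```python
import pandas as pd

# Assumed signature: the real enrich functions may receive additional
# context (e.g. pipeline metadata) alongside the row
def add_year(row: pd.Series, metadata: dict) -> int:
    # Return the YEAR value for this row, here taken from pipeline metadata
    return metadata["year"]

# Registering the function under the "year" key is what lets pipeline.json
# refer to it via "enrich": "year"
enrich_functions = {
    "year": add_year,
}
```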

