From 09ce56a8eb75b4147d0b83f94b65523fcf271b3a Mon Sep 17 00:00:00 2001 From: patrick-troy <58770937+patrick-troy@users.noreply.github.com> Date: Mon, 9 Sep 2024 13:12:24 +0300 Subject: [PATCH] Update documentation (#27) * initial docs * remove new use case text from docs, this now comes from pipeline.json * update pipeline.json documentation --- README.md | 54 +- docs/FAQ.md | 66 ++ docs/annex_a.md | 266 ----- docs/annex_a_pipeline.md | 299 ------ docs/cin-report-assessment-factors.ipynb | 1255 ---------------------- docs/cin-report-outcomes.ipynb | 619 ----------- docs/cin_census.md | 288 ----- docs/cin_census_questions.md | 574 ---------- docs/fix_episodes.md | 9 + docs/general_pipeline.md | 142 +++ docs/library-dependencies.md | 47 - docs/pipeline.md | 103 -- docs/pipeline_creation.md | 333 ++++++ docs/ssda903.md | 215 ---- 14 files changed, 552 insertions(+), 3718 deletions(-) create mode 100644 docs/FAQ.md delete mode 100644 docs/annex_a.md delete mode 100644 docs/annex_a_pipeline.md delete mode 100644 docs/cin-report-assessment-factors.ipynb delete mode 100644 docs/cin-report-outcomes.ipynb delete mode 100644 docs/cin_census.md delete mode 100644 docs/cin_census_questions.md create mode 100644 docs/fix_episodes.md create mode 100644 docs/general_pipeline.md delete mode 100644 docs/library-dependencies.md delete mode 100644 docs/pipeline.md create mode 100644 docs/pipeline_creation.md delete mode 100644 docs/ssda903.md diff --git a/README.md b/README.md index e2b88e14..0623d9fb 100644 --- a/README.md +++ b/README.md @@ -11,22 +11,6 @@ Most of the utilities are centred around three core datasets: * CIN Census * Annex A -## CIN Census - -The CIN Census file is provided as one or more XML files. The core tools allow you to validate both an entire CIN Census -file, or validate individual Child elements and flag or discard those that do not conform with -the [CIN Census schema](liiatools/spec/cin/cin-2022.xsd). - -In addition, the tool can detect and conform non-ISO8601 dates to the correct format and reject incorrect values from -enumerations. - - - -# liiatools - -This document is designed as a user guide for installing and configuring the liiatools PyPI package released to -London DataStore for the processing of Children’s Services datasets deposited by London Local Authorities - ## Introduction to LIIA project The LIIA (London Innovation and Improvement Alliance) project brings together Children’s Services data from all the @@ -37,44 +21,10 @@ pan-London datasets. Please see [LIIA Child Level Data Project](https://liia.london/liia-programme/targeted-work/child-level-data-project) for more information about the project, its aims and partners. -## Purpose of liiatools package - -The package is designed to process data deposited onto the London DataStore by each of the 33 London LAs such that it -can be used in for analysis purposes. The processing carries out the following tasks: - -* validates the deposited data against a specification -* removes information that does not conform to the specification -* degrades sensitive information to reduce the sensitivity of the data that is shared -* exports the processed data in an analysis-friendly format for: - * the LA to analyse at a single-LA level - * additional partners to analyse at a pan-London level - -The package is designed to enable processing of data from several datasets routinely created by all LAs as part of -their statutory duties. 
In v0.1 the datasets that can be processed are: - -* [Annex A](/liiatools/datasets/annex_a/README.md) -* CIN (Future Release) -* 903 (Future Release) - -The package is designed to process data that is deposited by LAs into a folder directory to be created on the London -DataStore for this purpose.Page Break - -### Installing liiatools -Liiatools can be installed using the following: - - Pip install liiatools -Or using poetry: - - poetry add liiatools - -### Configuring liiatools - -All of the functions in liiatools are accessed through CLI commands. Refer to the help function for more info - - python -m liiatools --help +## Purpose of liia-tools-pipeline package +The package is designed to process data deposited onto the data platform by local authorities such that it can be used for analysis purposes. -# liia-code-server This is a Dagster code server library which is setup to be used as a code server. ## How to use: diff --git a/docs/FAQ.md b/docs/FAQ.md new file mode 100644 index 00000000..bd607a90 --- /dev/null +++ b/docs/FAQ.md @@ -0,0 +1,66 @@ +# Frequently Asked Questions + +## 1. How do I add a new cleaning function? + +* The cleaning functions themselves should be added to the [common/converters.py](/liiatools/common/converters.py) file. These are built to accept individual values and return a clean value. If there are any errors we want to raise ValueError with an appropriate error message. + +* The new cleaning function can then be implemented in the conform_cell_types function in the [common/stream_filters.py](/liiatools/common/stream_filters.py) file. Here we want to add an new: `if column_spec.type == "new_type"` and the corresponding new cleaning function. + +* The new `column_spec.type` needs to align with the Column class in the [__data_schema.py](/liiatools/common/spec/__data_schema.py) file. Here is where you will want to add the new type will will follow the pattern `new_type_name: possible_types`. The `possible_types` can range from simple string literals to more complex classes, such as the Numeric class or Category class. + +* Once complete be sure to add unit tests for the cleaning functions to the [common/test_converters.py](/liiatools/tests/common/test_converters.py) and [common/test_filter.py](/liiatools/tests/common/test_filter.py) files. + +## 2. How do I apply a new cleaning function to .xml files? + +* Once you have followed the steps outlined in question 1, you may need to create a new function that converts the .xsd schema into a Column class that is readable by the pipeline. Examples of these can be found in the [common/stream_filters.py](/liiatools/common/stream_filters.py) file. How you apply these will depend on the naming conventions you have used in the .xsd schema but looking at the examples should give you an idea. Below please find a more detailed breakdown of the existing _create_category_spec function: + +```python +def _create_category_spec(field: str, file: Path) -> List[Category] | None: + """ + Create a list of Category classes containing the different categorical values of a given field to conform categories + e.g. 
[Category(code='0', name='Not an Agency Worker'), Category(code='1', name='Agency Worker')] + + :param field: Name of the categorical field you want to find the values for + :param file: Path to the .xsd schema containing possible categories + :return: List of Category classes of categorical values and potential alternatives + """ + category_spec = [] + + xsd_xml = ET.parse(file) # Use the xml.etree.ElementTree parse functionality to read the .xsd schema + search_elem = f".//{{http://www.w3.org/2001/XMLSchema}}simpleType[@name='{field}']" # Use the field argument to search for a specific field, this will be a string you have determined in the stream_filters.py file e.g. "some_string_ending_in_type". You can see the simpleType aligns with simpleType in the .xsd schema + element = xsd_xml.find(search_elem) # Search through the .xsd schema for this field + + if element is not None: + search_value = f".//{{http://www.w3.org/2001/XMLSchema}}enumeration" # Find the 'code' parameter which is within the .xsd enumeration node + value = element.findall(search_value) + if value: + for v in value: # The category element of the Column class is a list of Category classes, so we append a Category class to a list + category_spec.append(Category(code=v.get("value"))) # Grab the value found in the .xsd schema + + search_doc = f".//{{http://www.w3.org/2001/XMLSchema}}documentation" # Find the 'name' parameter which is within the .xsd documentation node + documentation = element.findall(search_doc) + for i, d in enumerate(documentation): # Use enumerate to correctly loop through the existing Category classes + category_spec[i].name = d.text # Add a name value to the existing Category classes so each one has both code and name + + # Before returning the finished category_spec you could for example add another loop to add potential regex patterns if necessary + return category_spec + else: + return +``` + +## 3. How do I add new enriching / degrading functions? + +* The new enriching and degrading functions should be added to the [_transform_functions.py](/liiatools/common/_transform_functions.py) file. These are built to accept a row (pd.Series) of data and output transformed values. These functions can vary from being simple additions of metadata, such as the year or LA, to specific hashing values of a specific column. + +* You will need to add the new function(s) and then include this in the corresponding enrich_functions/degrade_functions dictionaries. The key in this dictionaries determine what key to put in the corresponding pipeline.json file, and the value determines what function is performed. e.g. `"year": add_year` will be called in the pipeline.json like this: + +```json +"id": "YEAR", +"type": "integer", +"enrich": "year", +"sort": 0 +``` + +* Here we have created a new column called YEAR, of type integer, using the enrich function `add_year` and finally being the first column when it comes to sort order. 
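+
+* To make question 1 concrete, here is a minimal sketch of what a new converter could look like. The function name `to_uppercase_code` and the `uppercase_code` type literal are hypothetical and only illustrate the shape described above; the real converters and the `conform_cell_types` dispatch live in [common/converters.py](/liiatools/common/converters.py) and [common/stream_filters.py](/liiatools/common/stream_filters.py).
+
+```python
+def to_uppercase_code(value):
+    """
+    Hypothetical converter: accept a single raw value and return a cleaned value.
+    Raise ValueError with a helpful message when the value cannot be cleaned.
+    """
+    if value is None or str(value).strip() == "":
+        raise ValueError(f"Invalid code: {value}")
+    return str(value).strip().upper()
+
+
+# In conform_cell_types (stream_filters.py) the new type would then be dispatched
+# in the same way as the existing types, roughly:
+#   if column_spec.type == "uppercase_code":
+#       ... call to_uppercase_code on the cell value ...
+```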
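+
+* Similarly, for question 3, the sketch below shows the shape a new enrich function and its registration might take. `add_quarter`, the `"quarter"` key and the `SOME_DATE` column are made-up illustrations and the signature is simplified; the real functions and dictionaries are in [_transform_functions.py](/liiatools/common/_transform_functions.py).
+
+```python
+import pandas as pd
+
+
+def add_quarter(row: pd.Series):
+    """Hypothetical enrich function: derive a calendar quarter from a date already on the row."""
+    return pd.to_datetime(row["SOME_DATE"]).quarter
+
+
+# Register the function so pipeline.json can refer to it by key,
+# alongside existing entries such as "year": add_year.
+enrich_functions = {
+    # ...existing entries...
+    "quarter": add_quarter,
+}
+```
+
+* A column in pipeline.json could then use `"enrich": "quarter"` in the same way the YEAR example above uses `"enrich": "year"`.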
+ + diff --git a/docs/annex_a.md b/docs/annex_a.md deleted file mode 100644 index c0c9bc26..00000000 --- a/docs/annex_a.md +++ /dev/null @@ -1,266 +0,0 @@ -# Data Specification: Annex A - -Test Coverage: 70% - -Three CLI options: - -* cleanfile(input, la_code, la_log_dir, output) - Cleans input Annex A xlsx files according to config and outputs cleaned xlsx files -* la_agg(input, output) - Joins data from newly cleaned Annex A file (output of cleanfile()) to existing Annex A data for the depositing local authority -* pan_agg(input, la_code, output) - Merges data from newly merged Annex A file (output of la_agg()) to existing pan-London Annex A data - - -KWS: I have a some questions about the configuration mechanism. I believe it may originate in some code Celine and I developed to simplify processing -in Jupyter notebooks and have maintained the same approach. The key thing about that configuration is that it was NOT standardised and the -user (notebook developer) could override the defaults. As this is not supported by the pipeline, I think it adds a lot of unnecessary complexity -as well as means these pipelines are unneccesarily interdependent. Releasing a small change to one pipeline means the entire set need to be retested -due to the up-down nature of the configuration: - -* liiatools - * datasets - * annex_a - * spec - * annex_a - -Because of this, package names are also unnecessarily long and therefore relie on abbreviations making it confusing for reviewers and future developers. - -A simpler approach would be to have each pipeline as a standalone package with a single configuration file, e.g. - -* liiatools_annex_a - * clean - * local_authority_aggregation - * pan_london_aggregation - * config - -This could then be versioned independently and released independently allowing for simpler change management and less testing required. - -For some dataset configurations there is also a lot of duplication, and in some cases hardcoded duplication, where this could simply be combined into a single file. Most of the features here will be supported by the default `sfdata` format. - -## CLI COMMAND: Cleanfile - -* cleanfile - * Config() - - Tries very hard to be configurable - but doesn't seem to be used? - - Uses environment variables to set the config - do we need this? - - * la_name - looks up name based on code - - **WARNING**: uses a function called `flip_dict` which does not check for duplicate values - so could fail silently - - * check_file_type [shared] - check if input file is supported (REQ AASUPTYPE) - * Uses pathlib to get the file stem and suffix - * Returns None if suffix is an exact match (case sensitive) of one of the `file_types` - * Raises AssertionError if suffix is one of the `supported_file_types` (but not one of the `file_types`) - * If neither of those conditions are met: - * writes a warning to a dynamic file name **DAGSTER WARNING** - * returns the string "incorrect file type" - - * the return value of this is now checked and if the value is "incorrect file type" then the pipeline is stopped - - * **NEEDS REWRITE FOR DAGSTER** - - * read file using `parse_sheets` from sfdata_stream_parser - - **Uses open - DAGSTER WARNING** - * configure stream (`clean_config.configure_stream`) - * Uses: config["datasources"] and config["data_config"] - * identify_blank_rows (REQ AABLANKROW) - - Q: Why aren't these removed here as it would massively speed up the rest of the process? 
- * add_sheet_name (REQ AASHEETNAME) - * Add property `sheet_name` to all child events - * Add property `column_headers` to all child events - * Add property `column_header` based on the column index and the `column_headers` property - * Modified property `column_header` by trying to match it to a known header - - Q: Some slightly confusing code here - e.g. forcing the existing value to string when we have already set this - - Silently ignores any errors - * Looks up sheet & column in config["data_config"] and adds this to property `category_config` - * Looks up sheet & column in config["datasources"] and adds this to property `other_config` - - * clean stream (`cleaner.clean`) - * clean_cell_category - * clean_integers - * to_integer [local] - some strange code here - also returns empty string if it can't convert to integer - * clean_dates - * to_date [shared] - very strict date format - also may not support excel dates if not formatted - * clean_postcodes - * check_postcode [shared] - Uses a simplified postcode regex - ignores case but does not uppercase - - Check runs based on header name rather than type like the others - - * degrade - * degrade_postcodes - * Check runs based on header name rather than type like the others - * to_short_postcode [shared] - - Uses a simplified postcode regex - ignores case - - By this stage postcodes should already be correctly formatted - so don't need the regex - - Fails silently - - * degrade_dob - * Check runs based on header name rather than type like the others - * to_month_only_dob [shared] - - Replaces day with 01 - - Returns empty string if it can't convert to date - - Misleading name - - * log_errors - * create_error_table - - Emits an ErrorTable event **INSTEAD OF** the EndTable event (**EndTable is removed**) - * blank_error_check - - Checks if 'required' (not `canbeblank`) fields are filled - * create_error_list - - `formatting_error` - - Sets a property on the ErrorTable event with a collected set of errors - - The errors are just the column_headers where the 'error_name' is "1" - * create_error_list - - `blank_error` - - Sets a property on the ErrorTable event with a collected set of errors - - The errors are just the column_headers where the 'error_name' is "1" - - As this is just running code over the `blank_error_check` property - this could be done in one step - * inherit_error - - `extra_columns` - - Difficult to read, but think it copies the error_name from the starttable to the following error_table - - Not quite sure exactly what this does, but believe there are sequence bugs here - * duplicate_column_check - * _duplicate_columns - - Works - but quite verbose - could use the library https://iteration-utilities.readthedocs.io/en/latest/generated/duplicates.html - which is one of the fastests implementations around - - For some reason converts this list to a long descriptive string - would be better done at output - - Uses a hacky conversion to string, rather than join - * inherit_error - - `duplicate_columns` - - Could this not just be done as part of the previous step? - * create_file_match_error - - Uses try/except to check if `sheet_name` property exists - and adds long error message if not - * create_missing_sheet_error - - Checks against hardcoded list of sheet names rather than load from config and adds long error to error table event - - Would fail on multiple containers in single file - unlikely to happen but still bug - - * create_la_child_id - * Adds the LA code to the end of the existing child or adopter ids. 
- - Q: This is just preference - but why at the end rather than start? It would mean they sorted nice and follows the logical hierarchy of the data. - - * save_stream - * coalesce_row - * Creates a dict and **REPLACES** StartRow and EndRow with a RowEvent - - known cells are also removed, but unknown cells are kept potentially creating quite a confusing stream after this - * filter_rows - * Hardcoded list of sheets to remove - but doesn't remove them, only flags filter=0 or filter=1 (this time uses int not string) - * create_tables - * Condences the stream into a TableEvent holding a tablib version of the data - - REMOVES StartTable events - - Yields EndTable events and TableEvents - - If RowEvent and filter == 0 then adds to the table and REMOVES the RowEvent, otherwise yields the RowEvent - * save_tables - - StartContainer - create new DataBook - - EndContainer - save DataBook to file using open and hardcoded filename pattern - - StartTable - set the sheet_name - **StartTable IS REMOVED BY THIS STAGE?** - - TableEvent with data - add to DataBook - - **Uses open - DAGSTER WARNING** - - * save_errors_la - * Creates an error file - - This function is so complex I don't think I can understand it - - Also can't be unit tested - and is not tested in the test suite - - Uses a hardcoded filename pattern - - **Uses open - DAGSTER WARNING** - - - -Requirement | Description ---- | --- -AASUPTYPE | Must be one of xlsx or xlsm - however, supports also .xml and .csv? - check if this is correct | -AABLANKROW | Blank rows are flagged with `blank_row="1"` -AASHEETNAME | Match the loaded table against one of the Annex A sheet names using fuzzy matching with regex -AACAT | Clean categories -AAINT | Clean integers -AADATE | Clean dates -AAPCODE | Clean postcodes -AADEGPCODE | Degrade the postcode column -AACHILD_ID | Adds the Local Authority code to the Child Unique ID / Individual adopter identifier to create a unique identifier for each child - - -## CLI COMMAND: la_agg - -* la_agg - * Config() - - Uses environment variables to set the config - do we need this? - - * split_file - * Reads the input file into a dict of sheet_name: dataframe - - just an alias for pd.read_excel **DAGSTER WARNING** - - could just as well call read_excel directly? - - * sort_dict - * Sorts the sheets by config['sort_order'] - - **WARNING** - Python dicts are not ordered - so this is not guaranteed to work - from Python 3.6 onwards CPython dicts are ordered - but this is not a language feature - - * merge_la_files - * **Reads** hardcoded file name from **output** - this is confusing naming? - * Calls pd.read_excel **DAGSTER WARNING** - - Fails silently if file does not exist - - * _merge_dfs - * merges original file (from split_file) with the output file using pd.concat - * **WARNING**: If a sheet does not exist in the input file - it will not be created in the output file and any existin records dropped - bug? 
- - * deduplicate - * Uses config["dedup"] - this is a list of the primary key columns for each table - * Calls df.drop_duplicates on each table using the primary key columns from the config - - * convert_datetimes - * Uses config["dates"] - this is a list of the date columns for each table - - Calls pd.to_datetime on each table using the date columns from the config **uses hardcoded date format** - - * remove_old_data - * Takes index_date from config file (**DAGSTER WARNING - this date must be configurable**) - - Uses hardcoded table and column names - - Does some pandas magic that is difficult to follow - but silently swallows errors so could be dangerous - - **WARNING** As this is a IG protection function I would prefer it to be clearer and more explicit as well as well tested. The function - itself has 100% code coverage, but it depends on the `_remove_years` function which fails silently and the fail condition is not tested. - - * convert_dates - * Uses config["dates"] - this is a list of the date columns for each table - - This is a duplicate of `convert_datetimes` (**DRY**) with the addition of `.dt.date` on each column - - * export_file - * Exports the sheets to an excel file using pd.ExcelWriter (**DAGSTER WARNING**) - - Uses hardcoded filename - in fact the same hardcoded one as the input ouput file but without a constant - - Suggest the sorting of sheets could happen here as a list rather than the sorted dict above - - -There is no logging or sanity checks performed by this code. The modifications are relatively simple, but there are still obvious things that could go wrong. -The overwriting of the input file means that if a run fails, the original file is lost with no rollback mechanism. - -There is also a clear race condition where a second run starts before another is complete - this would lead to the loss of session data from whichever run finishes first. - -These are probably minor issue, but nonetheless without debugging or logging it would be difficult to diagnose or even be aware of any issues. - - -## CLI COMMAND: pan_agg - - -* pan_agg - * Config() - - Uses environment variables to set the config - do we need this? - - * flip_dict - see comments above - - * split_file - see comments above - also **DRY** (re-implemented) - * Removes two hardcoded sheets from the input file - * Removes columns from config by column name - could match columns from mutiple tables - - * merge_agg_files - * **Reads** hardcoded file name from **output** (comments as above - this is a constant filename) - - * _merge_dfs - * Takes original file, drops all entries for the current LA, merges with the current file, and then overwrites the original file - * Takes sheet names from 'pan london file' - means this needs to manually created and must exist - won't recover from error - - * convert_dates - * Uses config["dates"] - this is a list of the date columns for each table - - this is yet another duplicate of this function **DRY** - - why aren't dates preserved in the first instance? - this is just plastering over a bug - - * export_file - * Exports the sheets to an excel file using pd.ExcelWriter (**DAGSTER WARNING**) - * Same as the comments above for la_agg - -Same comments apply to this as the la_agg file process about risk of data loss. The issue is compounded by the fact that the merge function is called for each LA file and so chances of simultaneous runs are significantly higher. 
- -For this I would recommend that the merge function is moved to a central task that *ALWAYS* read the individual files from each LA's private stores. This would mean that the merge function would be idempotent and could be run as many times as required without risk of data loss. diff --git a/docs/annex_a_pipeline.md b/docs/annex_a_pipeline.md deleted file mode 100644 index d3ac6768..00000000 --- a/docs/annex_a_pipeline.md +++ /dev/null @@ -1,299 +0,0 @@ - -# Annex A: Pipeline - -The Annex A pipeline can only process one Annex A file at a time. This is a single Excel file with multiple sheets. - -The upload process should ideally only allow for one file at a time. No additional metadata is required. - -The overall flow of the pipeline is as follows: - -1. Prepfile - move file and collect metadata -2. Cleanfile - ensure file is in a consistent format -3. Apply Privacy Policy - degrade data to meet data minimisation rules -4. History 1 - Archive data -5. History 2 - Rollup data -6. Client reports -7. Prepare shareable reports -8. Data Retention 1 - Clear old history data -9. Data Retention 2 - Clear old session data - - -## Annex A - Prepfile - -Moves an incoming file to a session folder and removes from incoming. - -Inputs: - * A single .xlsx or .xlsm file containing the multiple Annex A sheets - -Outputs: - * Creates a new session folder containing the incoming file - -``` -session-/ -├── incoming-annex-a.xlsx -└── logs/ - ├── user.log - └── error.log -``` - -Process: - -* **1:** Is there a new file in the incoming folder? - * **1.1:** Yes: Continue to 2 - * **1.2:** No: Exit -* **2:** Create a new session folder -* **3:** Move the incoming file to the session folder - - -```mermaid -graph TD - START((Timed\nTrigger)) - A{Is there a new file\nin the incoming folder?} - B((Exit)) - C[Create a new session folder] - D[Move the incoming file to the session folder] - - START --> A - A -->|Yes| C - A -->|No| B - C --> D - -``` - - - - -## Annex A - Cleanfile - -Cleans the incoming file by normalising column headers and data types, and saves sheets as individual CSV files. - -This process uses the stream parser (in future SFDATA) and therefore has to run -as a single dagster operation. - -Inputs: - * Session folder with incoming Annex A file - -Outputs: - * Adds clean csv tables to session folder - - ``` -session-/ -├── incoming-annex-a.xlsx -├── clean-annex-a/ -│ ├── annex-a-list-1.csv -│ ├── annex-a-list-2.csv -│ └── ... etc -└── logs/ -``` - -Process: - -* **1:** Convert the file to a stream - * **1.1:** If error occurs, write to user log and error log then exit -* **2:** Remove blank rows -* **3:** Promote header to create table -* **4:** Identify the columns based on headers and sheet name -* **5:** Adds schema to cells based on the identified columns -* **6:** Clean the data - * **6.1:** Clean dates - * **6.2:** Clean categories - * **6.3:** Clean integers - * **6.4:** Clean postcodes -* **7:** Collect data into tables / dataframes -* **8:** Save tables to session folder -* **9:** Create data quality report and save to session folder - -Questions: -* Do we want to create a "cleaned" Annex A excel for the user folder? 
- -```mermaid -graph TD - START((Triggered by\nprevious job)) - A[Convert the file to a stream] - A1[\Write to user log\] - A2((Exit)) - B[Remove blank rows] - C[Promote header to create table] - D[Identify the columns based on headers and sheet name] - E[Adds schema to cells based on the identified columns] - F[Clean the data] - F1[Clean dates] - F2[Clean categories] - F3[Clean integers] - F4[Clean postcodes] - G[Collect data into tables / dataframes] - H[\Save tables to session folder\] - I[\Create data quality report and save to session folder\] - EXIT((Exit)) - - START --> A - A -->|Error| A1 - A1 --> A2 - A --> B - B --> C - C --> D - D --> E - E --> F - F --> F1 - F1 --> F2 - F2 --> F3 - F3 --> F4 - F4 --> G - G --> H - H --> I - I --> EXIT -``` - -## Annex A - Apply Privacy Policy - -Working on each of the tables in turn, this process will degrade the data to meet data minimisation rules: - * Dates all set to the first of the month - * Postcodes all set to the first 4 characters (excluding spaces) - * Some tables need rows deleted if there are blanks in a specific column - -**WARNING:** This assumes all of this code will be re-written to run in pandas -instead of using the stream parser. This should have minimal impact as the functions are reusable. - -Inputs: - * Session folder with individual table CSV files - -Outputs: - * Adds privacy minimised csv tables to session folder - -``` -session-/ -├── incoming-annex-a.xlsx -├── clean-annex-a/ -├── privacy-annex-a/ -│ ├── annex-a-list-1.csv -│ ├── annex-a-list-2.csv -│ └── ... etc -└── logs/ -``` - -Process: - -* **1:** Read each table in turn from the session folder making sure to use schema information to set the data types -* **2:** Degrade date columns - based on the schema find date columns that require degradation and degrade them -* **3:** Degrade postcode columns - based on the schema find postcode columns that require degradation and degrade them -* **3:** Remove rows with blank values in specific column: based on the schema filter the current table where there are nulls in the protected columns -* **4:** Save the tables to the privacy folder - -```mermaid -graph TD - START((Triggered by\nprevious job)) - A[Read each table in turn] - B[Degrade date columns] - C[Degrade postcode columns] - D[Remove rows with specific blank values] - E[\Save the tables to the privacy folder\] - EXIT((Exit)) - - START --> A - A --> B - B --> C - C --> D - D --> E - E --> EXIT -``` - - -## Annex A - History 1 - Archive data - -The purpose of this set of jobs is to create a historic view of all the data that has been uploaded. There is however a retention policy that sets a limit on how much historic data is kept. - -This job keeps a copy of every uploaded file and merges these, then deduplicates by only retaining the most recent record for each table row. - -This process is structured so that an archive of the history data is preserved in case of data corruption, and should allows the steps to be re-run to build as much -history as is retained. - -Inputs: - * Session folder with individual table CSV files - -Outputs: - * History folder if it does not exist - * Timestamped folder with current data - -``` -history-annex-a/ -├── 2021-01-01-12-00-00/ -│ ├── snapshot -│ │ ├── annex-a-list-1.csv -│ │ ├── annex-a-list-2.csv -│ │ └── ... etc -└── ... 
etc -``` - -Process: - * Create history folder if it does not exist - * Create timestamped folder and substructure - * Copies current tables to timestamped folder - -## Annex A - History 2 - Rollup data - -This is designed to maximise chances of recovering from data corruption. We have a -set of snapshot data. Some of these may contain history rollup files. We will not -touch rollup files if they exist. It will be a manual job to clean up any corrupted data. This job will create rollup files where needed, but if a rollup -file already exists, subsequent rollups will use this. - -Inputs: - * History folder with timestamped folders - * Retention configuration (years to retain) - -Outputs: - * Rollup (history) folders in each timestamp folder - * A current folder with the most recent rollup - -``` -history-annex-a/ -├── current/ -│ ├── annex-a-list-1.csv -│ ├── annex-a-list-2.csv -│ └── ... etc -├── 2021-01-01-12-00-00/ -│ ├── snapshot -│ │ ├── annex-a-list-1.csv -│ │ ├── annex-a-list-2.csv -│ │ └── ... etc -│ └── history -│ ├── annex-a-list-1.csv -│ ├── annex-a-list-2.csv -│ └── ... etc -└── ... etc -``` - -Process: - * Starting at the beginning, for each timestamped folder: - * For each table: - * Does the folder contain a history file for this table? - * Finds the previous history folder - * Concatenates the current tables with the previous history tables - * Deduplicates the full tables preserving the last record for each row - * Applies the retention policy to the full tables - * Saves the full tables to the history folder - * Copies the rollup data for the last timestamped event to the 'current' folder. - -## Annex A - Client reports - -Based on the current data, this process creates a set of reports for each client. These are saved to the client folder. - -## Annex A - Prepare shareable reports - -This process creates a set of reports for each client that can be shared with the Pan London group. - -These reports are the same as the client reports with two added columns: - * LA Child ID - * LA Name - -There could also be additional data retention/sharing policies applied here - -Copy to the "shared" folder for each client - this is a folder to which -the pan london central account has access. - -## Annex A - Data Retention 1 - -This process removes old timestamp folders from historic data. It is designed to be run after the rollup process or periodically and can clean either based on a time period, or the number of snapshots we wish to retain. - -## Annex A - Data Retention 2 - -This process removes old session folders. It is designed to be run periodically and can clean either based on a time period, or the number of snapshots we wish to retain. diff --git a/docs/cin-report-assessment-factors.ipynb b/docs/cin-report-assessment-factors.ipynb deleted file mode 100644 index f5516009..00000000 --- a/docs/cin-report-assessment-factors.ipynb +++ /dev/null @@ -1,1255 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "97670f58-8a4f-43d7-a29a-a8f1a942c9b3", - "metadata": {}, - "source": [ - "# Assessment Factors\n", - "\n", - "With this report we want to examine the assessment factors, and making it simpler to analyse by factors. \n", - "\n", - "In the incoming data schema, the factors are a list of text values:\n", - "\n", - "```\n", - "\n", - " 1970-06-03\n", - " 1970-06-22\n", - " 1971-07-18\n", - " \n", - " 2A\n", - " 2B\n", - " \n", - "\n", - "```\n", - "\n", - "The ingest tool converts these to a comma-separated list, e.g. 
\"2A,2B\".\n", - "\n", - "To make it easier to analyse we want to convert these to \n", - "[dummy-variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), or \n", - "[one-hot encoding](https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics) as it's often called. \n", - "\n", - "In the actual pipeline we expect the data to come in a very specific format, but as we really only need the factors column to make this work, we are not really concerned about the other columns.\n", - "\n", - "First we create a bit of a messy dataset of the type we're used to. A mix of values and non-values. \n", - "\n", - "For CHILD1 we se there are two rows - we may be looking at a subset of the data and there may be a natural key that has been removed from this subset. \n", - "\n", - "CHILD2 tests resilience with blanks and empties, and CHILD3 has repeat factors. " - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "98c8e635-2cd6-4324-8393-de890e4f1d16", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "from liiatools.cin_census_pipeline.reports import expanded_assessment_factors" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "b4b532d7-1098-4469-a6f1-0b5b334af607", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LAchildID Factors\n", - "0 CHILD1 A,B,C\n", - "1 CHILD1 A,B\n", - "2 CHILD2 A\n", - "3 CHILD2 B, C\n", - "4 CHILD2 None\n", - "5 CHILD2 \n", - "6 CHILD3 D,A,D" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = pd.DataFrame([\n", - " [\"CHILD1\", \"A,B,C\"],\n", - " [\"CHILD1\", \"A,B\"],\n", - " [\"CHILD2\", \"A\"],\n", - " [\"CHILD2\", \"B, C\"],\n", - " [\"CHILD2\", None],\n", - " [\"CHILD2\", \"\"],\n", - " [\"CHILD3\", \"D,A,D\"]\n", - "], columns=[\"LAchildID\", \"Factors\"])\n", - "df" - ] - }, - { - "cell_type": "markdown", - "id": "9ac41acf-7268-4278-bc0e-dce22e114bef", - "metadata": {}, - "source": [ - "Now, we are only interested in the Factors column - so let's just isolate that column. \n", - "\n", - "We can use `str.split` to convert the comma-separated values into a list." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "a2dee580-6d88-4faf-b817-6c96e73aaf1f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Factors\n", - "0 [A, B, C]\n", - "1 [A, B]\n", - "2 [A]\n", - "3 [B, C]\n", - "4 None\n", - "5 []\n", - "6 [D, A, D]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "factors = df[['Factors']].copy()\n", - "factors['Factors'] = factors['Factors'].str.split(\",\")\n", - "factors" - ] - }, - { - "cell_type": "markdown", - "id": "4f99541a-da4f-442a-a588-b2a79060be21", - "metadata": {}, - "source": [ - "Now, a few things to note. We have quite a lot of control over this dataset, but I have added some whitespace\n", - "to illustrate some of the issue we still may face. In row 3 we have whitespace between the \",\" and the \"C\". \n", - "\n", - "This will become important. " - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "dba9c350-5e5b-4cf5-9a08-7ffb412543bb", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Factors\n", - "0 A\n", - "0 B\n", - "0 C\n", - "1 A\n", - "1 B\n", - "2 A\n", - "3 B\n", - "3 C\n", - "4 None\n", - "5 \n", - "6 D\n", - "6 A\n", - "6 D" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "factors_exploded = factors.explode('Factors')\n", - "factors_exploded" - ] - }, - { - "cell_type": "markdown", - "id": "148e59ed-3ba8-4a5c-9424-ae6df9ba5016", - "metadata": {}, - "source": [ - "The `explode` method turns the list into individal rows. Notice how our index entries are now duplicated. We will use these index entries later to merge the final view back into the original dataset. \n", - "\n", - "We could now do `get_dummies` to get the dummy variables:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "62e4fbe9-0a04-4834-927e-4a9307b8d885", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " C A B C D\n", - "0 0 0 1 0 0 0\n", - "0 0 0 0 1 0 0\n", - "0 0 0 0 0 1 0\n", - "1 0 0 1 0 0 0\n", - "1 0 0 0 1 0 0\n", - "2 0 0 1 0 0 0\n", - "3 0 0 0 1 0 0\n", - "3 0 1 0 0 0 0\n", - "4 0 0 0 0 0 0\n", - "5 1 0 0 0 0 0\n", - "6 0 0 0 0 0 1\n", - "6 0 0 1 0 0 0\n", - "6 0 0 0 0 0 1" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.get_dummies(factors_exploded, columns=['Factors'], prefix=\"\", prefix_sep=\"\")" - ] - }, - { - "cell_type": "markdown", - "id": "7588a510-b535-4176-9567-3bf6266c0a56", - "metadata": {}, - "source": [ - "However, we see that we get both a column with no label, and two Cs. This is in fact a column for the empty string \"\" entry as well as one standard \"C\" and one \" C\" from the row with \"B, C\". So before we convert to dummies, let's strip whitespace and remove empty strings:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "bcf54d12-2261-4ba1-8b89-2af5f47fedf4", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " A B C D\n", - "0 1 0 0 0\n", - "0 0 1 0 0\n", - "0 0 0 1 0\n", - "1 1 0 0 0\n", - "1 0 1 0 0\n", - "2 1 0 0 0\n", - "3 0 1 0 0\n", - "3 0 0 1 0\n", - "4 0 0 0 0\n", - "6 0 0 0 1\n", - "6 1 0 0 0\n", - "6 0 0 0 1" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "factors_dummies = factors_exploded.copy()\n", - "factors_dummies['Factors'] = factors_dummies['Factors'].str.strip()\n", - "factors_dummies = factors_dummies[factors_dummies['Factors'] != '']\n", - "\n", - "factors_dummies = pd.get_dummies(factors_dummies, columns=['Factors'], prefix=\"\", prefix_sep=\"\")\n", - "factors_dummies" - ] - }, - { - "cell_type": "markdown", - "id": "39775300-b801-4045-9b8c-cbf6b69a1c70", - "metadata": {}, - "source": [ - "That looks better - but we want all the factors combined on one row so we can merge with original dataset. We can use `groupby` and `max` to effectively do a logical OR between the different rows. " - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "b0c97d23-ee01-428a-8565-1adaa231b984", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " A B C D\n", - "0 1 1 1 0\n", - "1 1 1 0 0\n", - "2 1 0 0 0\n", - "3 0 1 1 0\n", - "4 0 0 0 0\n", - "6 1 0 0 1" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "factors_grouped = factors_dummies.groupby(factors_dummies.index).max()\n", - "factors_grouped" - ] - }, - { - "cell_type": "markdown", - "id": "ab019276-cd8e-46f6-9009-be309edc3f2a", - "metadata": {}, - "source": [ - "Final step is to merge the data back together:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "824e8211-67e9-4651-850d-52d40b47661b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LAchildID Factors A B C D\n", - "0 CHILD1 A,B,C 1 1 1 0\n", - "1 CHILD1 A,B 1 1 0 0\n", - "2 CHILD2 A 1 0 0 0\n", - "3 CHILD2 B, C 0 1 1 0\n", - "4 CHILD2 None 0 0 0 0\n", - "5 CHILD2 0 0 0 0\n", - "6 CHILD3 D,A,D 1 0 0 1" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "factors_merged = df.merge(factors_grouped, how='left', left_index=True, right_index=True)\n", - "factors_merged[factors_grouped.columns] = factors_merged[factors_grouped.columns].fillna(0).astype(int)\n", - "factors_merged" - ] - }, - { - "cell_type": "markdown", - "id": "7718eee3-cdeb-40c9-a152-82990440a152", - "metadata": {}, - "source": [ - "We now have dummy variables to simplify further analysis.\n", - "\n", - "The implementation of this functionality can be found in [reports.py](../liiatools/cin_census_pipeline/reports.py). \n", - "\n", - "The function takes a dataframe and a column name (defaults to \"AssessmentFactor\") and returns the dataframe with these extra columns added." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "0d1a47d0-416e-4716-8461-ec9249ee1554", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LAchildID Factors A B C D\n", - "0 CHILD1 A,B,C 1 1 1 0\n", - "1 CHILD1 A,B 1 1 0 0\n", - "2 CHILD2 A 1 0 0 0\n", - "3 CHILD2 B, C 0 1 1 0\n", - "4 CHILD2 None 0 0 0 0\n", - "5 CHILD2 0 0 0 0\n", - "6 CHILD3 D,A,D 1 0 0 1" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from liiatools.cin_census_pipeline.reports import expanded_assessment_factors\n", - "expanded_assessment_factors(df, column_name='Factors')" - ] - }, - { - "cell_type": "markdown", - "id": "77c829b9-e1cb-4259-b191-e224a550d830", - "metadata": {}, - "source": [ - "We can also add a prefix to the dummy columns" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "5ec1218b-7b7d-40ac-997b-2cc084dd7fa3", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LAchildID Factors Factors_A Factors_B Factors_C Factors_D\n", - "0 CHILD1 A,B,C 1 1 1 0\n", - "1 CHILD1 A,B 1 1 0 0\n", - "2 CHILD2 A 1 0 0 0\n", - "3 CHILD2 B, C 0 1 1 0\n", - "4 CHILD2 None 0 0 0 0\n", - "5 CHILD2 0 0 0 0\n", - "6 CHILD3 D,A,D 1 0 0 1" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "expanded_assessment_factors(df, column_name='Factors', prefix=\"Factors_\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.17" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/cin-report-outcomes.ipynb b/docs/cin-report-outcomes.ipynb deleted file mode 100644 index 801c23d4..00000000 --- a/docs/cin-report-outcomes.ipynb +++ /dev/null @@ -1,619 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "83d121c0-a9a1-45f2-ae03-2f4fedab0acc", - "metadata": {}, - "source": [ - "# Referral Outcomes\n", - "\n", - "With this report we want to examine referral outcomes, in particular with regards to Sections 17 and 47 of the Children Act 1989:\n", - "\n", - "## Section 17 (S17) of the Children Act 1989:\n", - "\n", - "This section pertains to the provision of services to children in need, which includes their families and others.\n", - "\n", - "Local authorities have a general duty to safeguard and promote the welfare of children within their area who are in need and to promote the upbringing of such children, wherever possible, by their families through providing a range of services appropriate to those children's needs.\n", - "\n", - "\"Children in need\" are defined as children who:\n", - "a. Are unlikely to achieve or maintain a reasonable level of health or development without the provision of services; or\n", - "b. Their health or development is likely to be significantly impaired, or further impaired, without the provision of such services; or\n", - "c. Are disabled.\n", - "\n", - "## Section 47 (S47) of the Children Act 1989:\n", - "\n", - "This section relates to local authorities' duty to investigate situations where there is a reason to suspect that a child living in their area is suffering, or is likely to suffer, significant harm. 
\"Significant harm\" is a key concept in child protection and can encompass a wide range of adverse experiences, including neglect, physical, emotional, or sexual abuse.\n", - "\n", - "Where they have reasonable cause to suspect that a child who lives or is found in their area is suffering or likely to suffer significant harm, the authority is required to make inquiries (or cause inquiries to be made) to decide whether they should take action to safeguard or promote the child's welfare.\n", - "\n", - "These inquiries may lead to a child protection conference, where professionals come together to discuss concerns and decide on future actions, which may include a child protection plan.\n", - "\n", - "In the CIN Census data (abbreviated to only the fields we consider here for clarity), this will look like:\n", - "```\n", - "\n", - " 1970-10-06\n", - " \n", - " 1970-06-03\n", - " \n", - " \n", - " 1970-06-02\n", - " \n", - "\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "" - ], - "text/plain": [ - " LAchildID Date Type CINreferralDate ReferralSource \\\n", - "0 DfEX0000001 1970-10-06 CINreferralDate 1970-10-06 1A \n", - "\n", - " PrimaryNeedCode CINclosureDate ReasonForClosure DateOfInitialCPC \\\n", - "0 N4 1971-02-27 RC1 1970-12-06 \n", - "\n", - " ReferralNFA ... UPN FormerUPN UPNunknown PersonBirthDate \\\n", - "0 0 ... A123456789123 X98765432123B UN3 1966-03-24 \n", - "\n", - " ExpectedPersonBirthDate GenderCurrent PersonDeathDate PersonSchoolYear \\\n", - "0 1966-03-22 1 1980-10-08 NaN \n", - "\n", - " Ethnicity Disabilities \n", - "0 WBRI HAND,HEAR \n", - "\n", - "[1 rows x 34 columns]" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data = \"\"\"\n", - "LAchildID,Date,Type,CINreferralDate,ReferralSource,PrimaryNeedCode,CINclosureDate,ReasonForClosure,DateOfInitialCPC,ReferralNFA,CINPlanStartDate,CINPlanEndDate,S47ActualStartDate,InitialCPCtarget,ICPCnotRequired,AssessmentActualStartDate,AssessmentInternalReviewDate,AssessmentAuthorisationDate,Factors,CPPstartDate,CPPendDate,InitialCategoryOfAbuse,LatestCategoryOfAbuse,NumberOfPreviousCPP,UPN,FormerUPN,UPNunknown,PersonBirthDate,ExpectedPersonBirthDate,GenderCurrent,PersonDeathDate,PersonSchoolYear,Ethnicity,Disabilities\n", - "DfEX0000001,1970-10-06,CINreferralDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1971-02-27,CINclosureDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1970-06-03,AssessmentActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,1970-06-03,1970-06-22,1971-07-18,\"2A,2B\",,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1971-07-18,AssessmentAuthorisationDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,1970-06-03,1970-06-22,1971-07-18,\"2A,2B\",,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1971-01-24,CINPlanStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1971-01-24,1971-01-26,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1971-01-26,CINPlanEndDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1971-01-24,1971-01-26,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1970-06-02,S47ActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-06-17,0,,,1970-06-02,1970-06-23,0,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1970-02-17,CPPstartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1971-03-14,CPPendDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "DfEX0000001,1971-02-15,CPPreviewDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10,A123456789123,X98765432123B,UN3,1966-03-24,1966-03-22,1,1980-10-08,,WBRI,\"HAND,HEAR\"\n", - "\"\"\".strip()\n", - "data = pd.read_csv(io.StringIO(data), parse_dates=[\n", - " 
'Date', 'CINreferralDate', 'CINclosureDate', 'DateOfInitialCPC', \n", - " 'CINPlanStartDate', 'CINPlanEndDate', 'S47ActualStartDate', 'InitialCPCtarget',\n", - " 'AssessmentActualStartDate', 'AssessmentInternalReviewDate', 'AssessmentAuthorisationDate',\n", - " 'CPPstartDate', 'CPPendDate',\n", - " 'PersonBirthDate', 'ExpectedPersonBirthDate', 'PersonDeathDate'\n", - "])\n", - "data.head(n=1)" - ] - }, - { - "cell_type": "markdown", - "id": "f45711d9-fef9-4dc3-ac2a-a434b246be14", - "metadata": {}, - "source": [ - "So based on the big \"wide\" format, we can narrow things down to the few columns we are interested in. In terms of reports, each referral consists of a unique combination of LAchildID and CINreferralDate, and below that there will be a unique AssessmentActualStartDate for S17 Assessment or S47ActualStartDate for S47 Inquiry.\n", - "\n", - "Starting with S17, let's find the unique combinations of those fields, and calculate the duration between referral and assessment:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "367c6d33-c235-450b-8452-b8cf2c4b1aca", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
LAchildIDCINreferralDateAssessmentActualStartDatedays_to_s17
2DfEX00000011970-10-061970-06-03125
\n", - "
" - ], - "text/plain": [ - " LAchildID CINreferralDate AssessmentActualStartDate days_to_s17\n", - "2 DfEX0000001 1970-10-06 1970-06-03 125" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s17_dates = data[data[\"AssessmentActualStartDate\"].notna()][[\"LAchildID\", \"CINreferralDate\", \"AssessmentActualStartDate\"]].drop_duplicates()\n", - "s17_dates[\"days_to_s17\"] = s17_dates[\"CINreferralDate\"] - s17_dates[\"AssessmentActualStartDate\"]\n", - "s17_dates[\"days_to_s17\"] = s17_dates[\"days_to_s17\"].dt.days\n", - "\n", - "# Remove any that are less than zero - it shouldn't happen, but just in case\n", - "s17_dates = s17_dates[s17_dates[\"days_to_s17\"] >= 0]\n", - "\n", - "s17_dates" - ] - }, - { - "cell_type": "markdown", - "id": "169d1673-b8c0-47ca-b1bc-1e18c4a4073a", - "metadata": {}, - "source": [ - "We can do exactly the same for S47:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "4eabff9b-2246-4edd-9c87-ddeadc583958", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
LAchildIDCINreferralDateS47ActualStartDatedays_to_s47
6DfEX00000011970-10-061970-06-02126
\n", - "
" - ], - "text/plain": [ - " LAchildID CINreferralDate S47ActualStartDate days_to_s47\n", - "6 DfEX0000001 1970-10-06 1970-06-02 126" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s47_dates = data[data[\"S47ActualStartDate\"].notna()][[\"LAchildID\", \"CINreferralDate\", \"S47ActualStartDate\"]].drop_duplicates()\n", - "s47_dates[\"days_to_s47\"] = s47_dates[\"CINreferralDate\"] - s47_dates[\"S47ActualStartDate\"]\n", - "s47_dates[\"days_to_s47\"] = s47_dates[\"days_to_s47\"].dt.days\n", - "\n", - "# Remove any that are less than zero - it shouldn't happen, but just in case\n", - "s47_dates = s47_dates[s47_dates[\"days_to_s47\"] >= 0]\n", - "\n", - "s47_dates" - ] - }, - { - "cell_type": "markdown", - "id": "8f480403-eab3-441c-ba49-a9823d8cafe9", - "metadata": {}, - "source": [ - "We can now merge these back with the CIN record. Since we want to see referrals that led to neither S17 or S47, we create a unique view of all referrals as a base to merge the others into:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "6e95b6c5-cd75-4b03-8f60-f42690433e37", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
LAchildIDCINreferralDateAssessmentActualStartDatedays_to_s17S47ActualStartDatedays_to_s47
0DfEX00000011970-10-061970-06-031251970-06-02126
\n", - "
" - ], - "text/plain": [ - " LAchildID CINreferralDate AssessmentActualStartDate days_to_s17 \\\n", - "0 DfEX0000001 1970-10-06 1970-06-03 125 \n", - "\n", - " S47ActualStartDate days_to_s47 \n", - "0 1970-06-02 126 " - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "merged = data[[\"LAchildID\", \"CINreferralDate\"]].drop_duplicates()\n", - "merged = merged.merge(s17_dates, how=\"left\", on=[\"LAchildID\", \"CINreferralDate\"])\n", - "merged = merged.merge(s47_dates, how=\"left\", on=[\"LAchildID\", \"CINreferralDate\"])\n", - "merged" - ] - }, - { - "cell_type": "markdown", - "id": "b8cd0b8c-363e-4b0c-829e-770a20a58fbe", - "metadata": {}, - "source": [ - "Finally, we add a \"referral_outcome\" giving us one of NFA, S17, S47 or BOTH depending on which records were found. " - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "9f88430f-5d74-4b75-89b1-17a9f82275f8", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
LAchildIDCINreferralDateAssessmentActualStartDatedays_to_s17S47ActualStartDatedays_to_s47referral_outcome
0DfEX00000011970-10-061970-06-031251970-06-02126BOTH
\n", - "
" - ], - "text/plain": [ - " LAchildID CINreferralDate AssessmentActualStartDate days_to_s17 \\\n", - "0 DfEX0000001 1970-10-06 1970-06-03 125 \n", - "\n", - " S47ActualStartDate days_to_s47 referral_outcome \n", - "0 1970-06-02 126 BOTH " - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "neither = merged['AssessmentActualStartDate'].isna() & merged['S47ActualStartDate'].isna()\n", - "s17_set = merged['AssessmentActualStartDate'].notna() & merged['S47ActualStartDate'].isna()\n", - "s47_set = merged['AssessmentActualStartDate'].isna() & merged['S47ActualStartDate'].notna()\n", - "both_set = merged['AssessmentActualStartDate'].notna() & merged['S47ActualStartDate'].notna()\n", - "\n", - "merged['referral_outcome'] = np.select([neither, s17_set, s47_set, both_set], ['NFA', 'S17', 'S47', 'BOTH'], default=None)\n", - "merged" - ] - }, - { - "cell_type": "markdown", - "id": "d8827b83-e4a8-42a7-8b88-5aa4d6bc3c19", - "metadata": {}, - "source": [ - "We can import and run this as a report:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "4b835541-e068-4996-b89b-d1566e31424a", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
LAchildIDCINreferralDateAssessmentActualStartDatedays_to_s17S47ActualStartDatedays_to_s47referral_outcome
0DfEX00000011970-10-061970-06-031251970-06-02126BOTH
\n", - "
" - ], - "text/plain": [ - " LAchildID CINreferralDate AssessmentActualStartDate days_to_s17 \\\n", - "0 DfEX0000001 1970-10-06 1970-06-03 125 \n", - "\n", - " S47ActualStartDate days_to_s47 referral_outcome \n", - "0 1970-06-02 126 BOTH " - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from liiatools.cin_census_pipeline.reports import referral_outcomes\n", - "\n", - "referral_outcomes(data)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.17" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/cin_census.md b/docs/cin_census.md deleted file mode 100644 index f74f0c4b..00000000 --- a/docs/cin_census.md +++ /dev/null @@ -1,288 +0,0 @@ -# Data Specification: CIN Census - -Test Coverage: 10% - -Three CLI options: - -* cleanfile(input, la_code, la_log_dir, output) -* la_agg(input, flat_output, analysis_output) -* pan_agg(input, la_code, flat_output, analysis_output) - - -## CLI COMMAND: Cleanfile - -* cleanfile - * check_file_type [shared] - check if input file is supported (REQ AASUPTYPE) - * Uses pathlib to get the file stem and suffix - * Returns None if suffix is an exact match (case sensitive) of one of the `file_types` - * Raises AssertionError if suffix is one of the `supported_file_types` (but not one of the `file_types`) - * If neither of those conditions are met: - * writes a warning to a dynamic file name **DAGSTER WARNING** - * returns the string "incorrect file type" - - * the return value of this is now checked and if the value is "incorrect file type" then the pipeline is stopped - - * **NEEDS REWRITE FOR DAGSTER** - - * dom_parse **DAGSTER WARNING** - - * list(stream) - loads stream to memory ?!? - - * filename from inputfile using Path.stem - - * check_year(filename) - get year from filename - * Supports a terrifying number of year formats - is there a better way? - - * ON (AttributeError, ValueError): save_year_error - EXIT **DAGSTER WARNING** - - * years_to_go_back / year_start_month - BOTH HARDCODED TO 6 - - * check_year_within_range - * if False: save_incorrect_year_error - EXIT **DAGSTER WARNING** - - * Config() - - Tries very hard to be configurable - but doesn't seem to be used? - - Uses environment variables to set the config - do we need this? 
- - * flip_dict - see warnings - - * strip_text - * Removes leading and trailing whitespace from TextNode events as well as removing empty strings alltogether - - * add_context - * Adds the context 'path' to each event - - * add_schema - * Adds the schema to each event - - WARNING: Late failure if schema for year is not found - - * inherit_LAchildID - * Caches each series and attempts to find the LAchildID from the cached series - - WARNING: Inconsistent try/catch could lead to unpredictable results - - * validate_elements - * Validates the local context and adds validation errors - - This is presumably quite costly and repeats a lot of validation as it validates at each level of the tree - - * _get_validation_error - * Tries to extract the validation information from the error message - - WARNING: This is very fragile and will break if the error message changes when ET/lxml changes - - a lot of nested ifs - very difficult to understand - - * counter - * Counts validation errors - adds these to shared context - - * convert_true_false - * Converts text nodes 'true' and 'false' to "0" and "1" based on schema - - * remove_invalid - * Removes subtrees if they are not valid and the list is on a hardcoded list of tags - - Q: List of tags is passed into this function - but it's hardcoded - if it *HAS* to be hardcoded, could it at least sit in this function rather than - in the CLI code which now requires duplication? - - Believe this is quite inefficient due to the use of collector which will inspect multiple sub-trees (potentially) - - Several of these functions would be so much simpler using DOM instead of stream - - * message_collector - * Collects Header and CINEvent - - We know that this is confusing and difficult to follow - and should hopefully be possible to replace with `sfdata` once complete - - * export_table - * uses event_to_records to convert the stream to tablib - - * add_fields - transforms - WARNING: Misleading name - - * convert_to_dataframe - * Just an alias for tablib.export('df') - could be removed - - * get_year - sets the year column - WARNING: Misleading name - * Just an alias for dataframe['column'] = value - - - * convert_to_datetime - * hardcoded list of columns to convert to datetime - no error handling - - * add_school_year - * applies _get_person_school_year using hardcoded column names - - * _get_person_school_year - * Returns year or None - although I'm not sure when none could be returned - - * add_la_name - * Another one liner: `dataframe['LA'] = la_name` - - * la_prefix - * `data["LAchildID"] = data["LAchildID"] + "_" + la_code` - - * degrade_dob - * Uses hardcoded column name to apply `to_month_only_dob` - - * degrade_expected_dob - * Repetition of above - - * degrade_death_date - * Repetition of above - love that death_date calls dob function - - * export_file - * Hardcoded output file name - * Uses dataframe.to_csv **DAGSTER WARNING** - - * save_errors_la - * Uses open **DAGSTER WARNING** - -## CLI COMMAND: la_agg - -* la_agg - - * Config() - - * read_file - `pd.read_csv(input, parse_dates=dates, dayfirst=True)` **DAGSTER WARNING** - - * merge_la_files - * Reads the 'archive' file and merges it with the 'current' file - * Hardocded filename - * Uses pd.read_csv **DAGSTER WARNING** - - * deduplicate - * Removes duplicate entries based on primary keys - - * remove_old_data - * Removes data older than six years - * June hardcoded as month - * "today" hardcoded as reference - different from AA which uses "now" - * `year` and `years` as argument names are very 
dangerous - must use more descriptive names - - * export_flatfile - * Writes the dataframe to a csv file - alias for `dataframe.to_csv` **DAGSTER WARNING** - * Hardcoded filename - should at least be constant - - * filter_flatfile - * Filters the dataframe based on column 'Type' == 'AssessmentAuthorisationDate' and drops columns that now are empty - - * IF len(factors) > 0 - factors is output of filter_flatfile - * split_factors - * If I understand this correctly, there is a column called Factors which has a list of strings in it. This - function translates this column to a One-Hot encoded vuew with a columns for each factor. - - * export_factfile - * Writes the dataframe to a csv file - alias for `dataframe.to_csv` **DAGSTER WARNING** - * Hardcoded filename - - * referral_inputs - * Returns a tuple of three dataframes all having been individually filtered by filter_flatfile - * ref = "CINreferralDate", s17 = "AssessmentActualStartDate", s47 = "S47ActualStartDate" - - * IF len(s17) AND len(s47) - s17 and s47 are output of referral_inputs and both must have values to proceed - - * merge_ref_s17 - * Merges the two dataframes on the 'LAchildID' - * Calculates the dates between AssessmentActualStartDate and CINreferralDate - - * merge_ref_s47 - * Merges the two dataframes on the 'LAchildID' - * Calculates the dates between S47ActualStartDate and CINreferralDate - - *This is identical to merge_ref_s17 just acting on different columns* - - * ref_outcomes - - **Inputs**: - - `ref`: Primary dataframe. - - `ref_s17`: Dataframe for S17 outcomes. - - `ref_s47`: Dataframe for S47 outcomes. - - - **Operations**: - - Merge `ref` with `ref_s17` based on `Date` and `LAchildID`. - - Merge the resulting dataframe with `ref_s47` based on the same keys. - - Set a default outcome in a column named `referral_outcome` to "NFA" for all records. - - Change outcome to "S17" when an `AssessmentActualStartDate` is present. - - Change outcome to "S47" when a `S47ActualStartDate` is present. - - Set outcome to "Both S17 & S47" when both start dates are present. - - Calculate age of the child at the time of referral using the function `_time_between_date_series()` and store in `Age at referral`. - - - **Output**: - - Dataframe (`ref_outs`) containing merged views with outcomes and child's age at referral. - - * export_reffile - * Saves the merged file to a csv file - alias for `dataframe.to_csv` **DAGSTER WARNING** - - * journey_inputs - * Returns a tuple of two dataframes all having been individually filtered by filter_flatfile - * s47_j = "S47ActualStartDate", cpp = "CPPstartDate" - - * IF len(s47_j) AND len(cpp) - s47_j and cpp are output of journal_inputs and both must have values to proceed - - * journey_merge - - Merge `s47_j` with `CPPstartDate` from `cpp` based on `LAchildID` to get `s47_cpp`. - - Calculate days from ICPC to CPP start: - - Add a new column `icpc_to_cpp` to `s47_cpp`. - - Use helper function `_time_between_date_series` to calculate the days difference. - - Calculate days from S47 to CPP start: - - Add a new column `s47_to_cpp` to `s47_cpp`. - - Use helper function `_time_between_date_series` to calculate the days difference. - - Filter `s47_cpp` to keep only logically consistent events: - - Based on constraints defined for `icpc_to_cpp` and `s47_to_cpp` using the config variables `icpc_cpp_days` and `s47_cpp_days`. - - Merge filtered events from `s47_cpp` back to `s47_j` to get `s47_outs`: - - Keep columns ["Date", "LAchildID", "CPPstartDate", "icpc_to_cpp", "s47_to_cpp"]. 
- - Merge based on ["Date", "LAchildID"]. - - Return `s47_outs`. - - * s47_paths - - **Purpose**: Creates an output that can generate a Sankey diagram of outcomes from S47 events. - - - **Step 1: Define Date Window for S47 events** - - For each year in `s47_outs["YEAR"]`: - - Define the date for the 'cin_census_close' as March 31st of that year. - - Compute the 's47_max_date' by subtracting the 's47_day_limit' from the 'cin_census_close'. - - Compute the 'icpc_max_date' by subtracting the 'icpc_day_limit' from the 'cin_census_close'. - - - **Step 2: Setting up the Sankey diagram source for S47 events** - - Create a copy of `s47_outs` named `step1`. - - Set the "Source" column values to "S47 strategy discussion". - - Initialize the "Destination" column with NaN values. - - Update the "Destination" for rows where 'DateOfInitialCPC' is not null to "ICPC". - - Update the "Destination" for rows where 'DateOfInitialCPC' is null but 'CPPstartDate' is not null to "CPP start". - - Update the "Destination" for rows where 'S47ActualStartDate' is on or after 's47_max_date' to "TBD - S47 too recent". - - For remaining rows with null "Destination", set the value to "No ICPC or CPP". - - - **Step 3: Setting up the Sankey diagram source for ICPC events** - - Filter `step1` where the "Destination" is "ICPC" and assign to `step2`. - - Set the "Source" column values of `step2` to "ICPC". - - Initialize the "Destination" column with NaN values. - - Update the "Destination" for rows where 'CPPstartDate' is not null to "CPP start". - - Update the "Destination" for rows where 'DateOfInitialCPC' is on or after 'icpc_max_date' to "TBD - ICPC too recent". - - For remaining rows with null "Destination", set the value to "No CPP". - - - **Step 4: Merge the steps together** - - Concatenate `step1` and `step2` into `s47_journey`. - - - **Step 5: Calculate Age of Child at S47** - - Compute the child's age at the time of the S47 event by finding the difference between 'S47ActualStartDate' and 'PersonBirthDate' in terms of years. - - - **Return**: The function finally returns the `s47_journey` dataframe. - - * export_journeyfile - * Saves the merged file to a csv file - alias for `dataframe.to_csv` **DAGSTER WARNING** - -## CLI COMMAND: pan_agg - -* pan_agg - - * Config - - * read_file - * Reads the file from the filepath - * Alias for `pandas.read_csv` **DAGSTER WARNING** - - * merge_agg_files - * Reads the pan flatfile using pandas.read_csv **DAGSTER WARNING** - - * _merge_dfs - * Drops the LA column - * Merges the new columns to the pan flatfile - * *ONLY CALLED FROM PARENT* - should probably be inline - - * export_flatfile - * Saves the merged file to a csv file - alias for `dataframe.to_csv` **DAGSTER WARNING** - - - * filter_flatfile - - * At this stage it pretty much exactly follows the steps from la_agg diff --git a/docs/cin_census_questions.md b/docs/cin_census_questions.md deleted file mode 100644 index 6a65ec8f..00000000 --- a/docs/cin_census_questions.md +++ /dev/null @@ -1,574 +0,0 @@ -# Questions about de-duplicating CIN Census - -Whilst documenting these processes I came accross a lot of functions that seemed to be doing the same thing. And in porting the code to Dagster, any amount of deduplication -we can do will signficantly reduce the amount of work. - -We also have the race condition issue, and if we can simplify parts of the process we can further minimise that risk. - -The current CIN process (as well as the two other 'standard' ones) are all based on the same pattern: - -1. cleandata -2. la_agg -3. 
pan_agg - -**Step 1** loads the input file (XML) and ensures that it is clean and outputs as a single CSV file in broad format. - -**Step 2** loads this CSV file and creates a number of views of this: - -* 'CIN_Census_merged_flatfile' - this is -* 'CIN_Census_factors.csv' - a child-level one-high view of the risk factors found in the data -* 'CIN_Census_referrals' - a child-level view of the referrals found in the data -* 'CIN_Census_S47_journey' - a child-level view of the journeys - -The merged flatfile involves loading the merged flatfile from any previous run and merging it with the current input file, then deduplicating it. - * I'm inferring from this that the data provided in each run is not 'complete' and so we need to merge it with the previous run to get a complete picture. - * However, we overwrite this file on each run - so a corrupt run will remove any history and there is no backup facility. This could lead to data loss. - -Note also there is a fair amount of duplication in this data as these are all views of the same data and so include all the child-level data for each child instead of -having a child file and then much smaller views. - -Most analysis tools are able to do a join in child_id so this small change could potentially make a big difference to performance of any dashboards and later analysis, as -well as avoiding potentially conflicting data. - -**Step 3** loads the data from all las and merges the data together. This is then used to create a number of views: - -* 'pan_London_CIN_flatfile' - this is -* 'CIN_Census_factors.csv' - a child-level one-high view of the risk factors found in the data -* 'CIN_Census_referrals' - a child-level view of the referrals found in the data -* 'CIN_Census_S47_journey' - a child-level view of the journeys - -The pan_London_CIN_flatfile is created from the current LA file as well as the previous version of the pan_London_CIN_flatfile. Based on the LA Code (provided by CLI) all -columns for that LA are dropped and the new file merged in. - * The fact that the LA code is provided by the CLI leads me to suspect that, although not specified, the input file is the cleaned file for the LA, not the merged - flatfile as this has the la code in it. **CORRECTION:** The is intended to run from the merged flatfile. The LA code is used to drop the LA data from the pan-London file. - * This means that the pan london flatfile will only have the latest data - not the historic data. It's not specified if this is intentional or accidental. - The code will happily accept either the merged flatfile or the cleaned file as input, so it's not clear which is the correct one. - -Using the now merged view, this step proceeds to calculate all the child-level views as before. 
- -My impression is that these "aggregate" files really are "concatenated" files as the data is still child-level, and that rather than running all the code again an approach that is simpler and less like to suffer from race conditions would be to: - -* Clean the incoming LA file -* Using the clean filed - merged any history from timestamped 'state' files within the LA -* Create the child-level views for this LA - -Any change to the child-level views for the LA would trigger the 'pan london' task to run, which would then: - -* Load and concatenate all the child-level views for all the LAs - no calculation is required as the data is already in the right format -* Apply any privacy rules to the data that exist on the pan-london scale - -This is computationally (and therefore code-maintenance wise) much simpler and less likely to suffer from any overwriting of files as each part of the code writes to a unique file. -There is still a slight risk in the 'state' files - so if it's essential that we don't drop data from earlier runs, we probably need a locking mechanism there. - -Similarly the pan london files could be triggered by two different LAs at the same time, and thus produce slightly different outputs. Again, we could probably ensure only one merge task at the time can run, and the issue is less severe as the problem would "recover" on the next run as none of the source data is overwritten. - -Finally it would remove large amounts of duplicate code thus making the code much simpler to maintain and making the migration to Dagster much simpler. See below. - -## Examples - -Here are some examples based on the sample data file provided - -**<inputfile>_clean.csv** - -```csv -LAchildID,Date,Type,CINreferralDate,ReferralSource,PrimaryNeedCode,CINclosureDate,ReasonForClosure,DateOfInitialCPC,ReferralNFA,CINPlanStartDate,CINPlanEndDate,S47ActualStartDate,InitialCPCtarget,ICPCnotRequired,AssessmentActualStartDate,AssessmentInternalReviewDate,AssessmentAuthorisationDate,Factors,CPPstartDate,CPPendDate,InitialCategoryOfAbuse,LatestCategoryOfAbuse,NumberOfPreviousCPP,UPN,FormerUPN,UPNunknown,PersonBirthDate,ExpectedPersonBirthDate,GenderCurrent,PersonDeathDate,PersonSchoolYear,Ethnicity,Disabilities,YEAR,LA -DfEX0000001_BAD,1970-10-06,CINreferralDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-02-27,CINclosureDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1970-06-03,AssessmentActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,1970-06-03,1970-06-22,1971-07-18,"2A,2B",,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-07-18,AssessmentAuthorisationDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,1970-06-03,1970-06-22,1971-07-18,"2A,2B",,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-01-24,CINPlanStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1971-01-24,1971-01-26,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham 
-DfEX0000001_BAD,1971-01-26,CINPlanEndDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1971-01-24,1971-01-26,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1970-06-02,S47ActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-06-17,0,,,1970-06-02,1970-06-23,0,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1970-02-17,CPPstartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-03-14,CPPendDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-02-15,CPPreviewDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -``` - -**CIN_Census_merged_flatfile.csv** - -```csv -LAchildID,Date,Type,CINreferralDate,ReferralSource,PrimaryNeedCode,CINclosureDate,ReasonForClosure,DateOfInitialCPC,ReferralNFA,CINPlanStartDate,CINPlanEndDate,S47ActualStartDate,InitialCPCtarget,ICPCnotRequired,AssessmentActualStartDate,AssessmentInternalReviewDate,AssessmentAuthorisationDate,Factors,CPPstartDate,CPPendDate,InitialCategoryOfAbuse,LatestCategoryOfAbuse,NumberOfPreviousCPP,UPN,FormerUPN,UPNunknown,PersonBirthDate,ExpectedPersonBirthDate,GenderCurrent,PersonDeathDate,PersonSchoolYear,Ethnicity,Disabilities,YEAR,LA -DfEX0000001_BAD,1970-06-02,S47ActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-06-17,0,,,1970-06-02,1970-06-23,0.0,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1970-06-03,AssessmentActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,1970-06-03,1970-06-22,1971-07-18,"2A,2B",,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-07-18,AssessmentAuthorisationDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,1970-06-03,1970-06-22,1971-07-18,"2A,2B",,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1970-10-06,CINreferralDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-02-27,CINclosureDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-01-24,CINPlanStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1971-01-24,1971-01-26,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-01-26,CINPlanEndDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1971-01-24,1971-01-26,,,,,,,,,,,,,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham 
-DfEX0000001_BAD,1970-02-17,CPPstartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10.0,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-03-14,CPPendDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10.0,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -DfEX0000001_BAD,1971-02-15,CPPreviewDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,,,,,,,,,,1970-02-17,1971-03-14,PHY,PHY,10.0,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham -``` - -**CIN_Census_factors.csv** - -```csv -LAchildID,Date,Type,CINreferralDate,ReferralSource,PrimaryNeedCode,CINclosureDate,ReasonForClosure,DateOfInitialCPC,ReferralNFA,AssessmentActualStartDate,AssessmentInternalReviewDate,AssessmentAuthorisationDate,Factors,UPN,FormerUPN,UPNunknown,PersonBirthDate,ExpectedPersonBirthDate,GenderCurrent,PersonDeathDate,PersonSchoolYear,Ethnicity,Disabilities,YEAR,LA,2A,2B -DfEX0000001_BAD,1971-07-18,AssessmentAuthorisationDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,1970-06-03,1970-06-22,1971-07-18,"2A,2B",A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham,1,1 -``` - -**CIN_Census_referrals.csv** - -```csv -LAchildID,Date,Type,CINreferralDate,ReferralSource,PrimaryNeedCode,CINclosureDate,ReasonForClosure,DateOfInitialCPC,ReferralNFA,UPN,FormerUPN,UPNunknown,PersonBirthDate,ExpectedPersonBirthDate,GenderCurrent,PersonDeathDate,PersonSchoolYear,Ethnicity,Disabilities,YEAR,LA,AssessmentActualStartDate,days_to_s17,S47ActualStartDate,days_to_s47,referral_outcome,Age at referral -DfEX0000001_BAD,1970-10-06,CINreferralDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-12-06,0,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham,,,,,NFA,4 -``` - -**CIN_Census_S47_journey.csv** - -```csv -LAchildID,Date,Type,CINreferralDate,ReferralSource,PrimaryNeedCode,CINclosureDate,ReasonForClosure,DateOfInitialCPC,ReferralNFA,S47ActualStartDate,InitialCPCtarget,ICPCnotRequired,UPN,FormerUPN,UPNunknown,PersonBirthDate,ExpectedPersonBirthDate,GenderCurrent,PersonDeathDate,PersonSchoolYear,Ethnicity,Disabilities,YEAR,LA,CPPstartDate,icpc_to_cpp,s47_to_cpp,cin_census_close,s47_max_date,icpc_max_date,Source,Destination,Age at S47 -DfEX0000001_BAD,1970-06-02,S47ActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-06-17,0,1970-06-02,1970-06-23,0.0,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham,,,,2022-03-31,2022-01-30,2022-02-14,S47 strategy discussion,ICPC,4 -DfEX0000001_BAD,1970-06-02,S47ActualStartDate,1970-10-06,1A,N4,1971-02-27,RC1,1970-06-17,0,1970-06-02,1970-06-23,0.0,A123456789123,X98765432123B,UN3,1966-03-01,1966-03-01,1,1980-10-01,1965,WBRI,"HAND,HEAR",2022,Barking and Dagenham,,,,2022-03-31,2022-01-30,2022-02-14,ICPC,No CPP,4 -``` - -## Code Comparison - -### CLI commands - -Side by side comparison for the LA and PAN views. - -Beyond the initial loading steps, the code is identially apart from the package from which the functions are loaded. 
- -```sdiff -@cin_census.command() @cin_census.command() -@click.option( @click.option( - "--i", "--i", - "input", "input", - required=True, required=True, - type=str, type=str, - help="A string specifying the input file location, includ help="A string specifying the input file location, includ -) ) -@click.option( @click.option( - > "--la_code", - > required=True, - > type=click.Choice(la_list, case_sensitive=False), - > help="A three letter code, specifying the local authority - > ) - > @click.option( - "--flat_output", "--flat_output", - required=True, required=True, - type=str, type=str, - help="A string specifying the directory location for the help="A string specifying the directory location for the -) ) -@click.option( @click.option( - "--analysis_output", "--analysis_output", - required=True, required=True, - type=str, type=str, - help="A string specifying the directory location for the help="A string specifying the directory location for the -) ) -def la_agg(input, flat_output, analysis_output): | def pan_agg(input, la_code, flat_output, analysis_output): - """ """ - Joins data from newly cleaned CIN Census file (output of | Joins data from newly merged CIN Census file (output of l - :param input: should specify the input file location, inc :param input: should specify the input file location, inc - > :param la_code: should be a three-letter string for the l - :param flat_output: should specify the path to the folder :param flat_output: should specify the path to the folder - :param analysis_output: should specify the path to the fo :param analysis_output: should specify the path to the fo - :return: None :return: None - """ """ - - # Configuration # Configuration - config = agg_config.Config() | config = pan_config.Config() - - # Open file as Dataframe | # Create flat file - dates = config["dates"] dates = config["dates"] - flatfile = agg_process.read_file(input, dates) | flatfile = pan_process.read_file(input, dates) - - # Merge with existing data, de-duplicate and apply data r | # Merge with existing pan-London data - flatfile = agg_process.merge_la_files(flat_output, dates, | la_name = flip_dict(config["data_codes"])[la_code] - sort_order = config["sort_order"] | flatfile = pan_process.merge_agg_files(flat_output, dates - dedup = config["dedup"] < - flatfile = agg_process.deduplicate(flatfile, sort_order, < - flatfile = agg_process.remove_old_data(flatfile, years=6) < - - # Output flatfile # Output flatfile - agg_process.export_flatfile(flat_output, flatfile) | pan_process.export_flatfile(flat_output, flatfile) - - # Create and output factors file # Create and output factors file - factors = agg_process.filter_flatfile( | factors = pan_process.filter_flatfile( - flatfile, filter="AssessmentAuthorisationDate" flatfile, filter="AssessmentAuthorisationDate" - ) ) - if len(factors) > 0: if len(factors) > 0: - factors = agg_process.split_factors(factors) | factors = pan_process.split_factors(factors) - agg_process.export_factfile(analysis_output, factors) | pan_process.export_factfile(analysis_output, factors) - - # Create referral file # Create referral file - ref, s17, s47 = agg_process.referral_inputs(flatfile) | ref, s17, s47 = pan_process.referral_inputs(flatfile) - if len(s17) > 0 and len(s47) > 0: if len(s17) > 0 and len(s47) > 0: - ref_assessment = config["ref_assessment"] ref_assessment = config["ref_assessment"] - ref_s17 = agg_process.merge_ref_s17(ref, s17, ref_ass | ref_s17 = pan_process.merge_ref_s17(ref, s17, ref_ass - ref_s47 = agg_process.merge_ref_s47(ref, s47, 
ref_ass | ref_s47 = pan_process.merge_ref_s47(ref, s47, ref_ass - ref_outs = agg_process.ref_outcomes(ref, ref_s17, ref | ref_outs = pan_process.ref_outcomes(ref, ref_s17, ref - agg_process.export_reffile(analysis_output, ref_outs) | pan_process.export_reffile(analysis_output, ref_outs) - - # Create journey file # Create journey file - icpc_cpp_days = config["icpc_cpp_days"] icpc_cpp_days = config["icpc_cpp_days"] - s47_cpp_days = config["s47_cpp_days"] s47_cpp_days = config["s47_cpp_days"] - s47_j, cpp = agg_process.journey_inputs(flatfile) | s47_j, cpp = pan_process.journey_inputs(flatfile) - if len(s47_j) > 0 and len(cpp) > 0: if len(s47_j) > 0 and len(cpp) > 0: - s47_outs = agg_process.journey_merge(s47_j, cpp, icpc | s47_outs = pan_process.journey_merge(s47_j, cpp, icpc - s47_day_limit = config["s47_day_limit"] s47_day_limit = config["s47_day_limit"] - icpc_day_limit = config["icpc_day_limit"] icpc_day_limit = config["icpc_day_limit"] - s47_journey = agg_process.s47_paths(s47_outs, s47_day | s47_journey = pan_process.s47_paths(s47_outs, s47_day - agg_process.export_journeyfile(analysis_output, s47_j | pan_process.export_journeyfile(analysis_output, s47_j -``` - -### Process Comparison - -We can also look at the initial process files. Again we see that these files are identical beyond the initial loading of the files and a few comments that I have added. - -This violates the principles of DRY as well as making the code much harder to understand and maintain. - -```sdiff - > import logging -from pathlib import Path from pathlib import Path -import pandas as pd import pandas as pd -from datetime import datetime from datetime import datetime -import numpy as np import numpy as np -import logging < - -log = logging.getLogger(__name__) log = logging.getLogger(__name__) - - -def read_file(input, dates): def read_file(input, dates): - """ """ - Reads the csv file as a pandas DataFrame Reads the csv file as a pandas DataFrame - """ """ - flatfile = pd.read_csv(input, parse_dates=dates, dayfirst flatfile = pd.read_csv(input, parse_dates=dates, dayfirst - return flatfile return flatfile - - -def merge_la_files(flat_output, dates, flatfile): | def _merge_dfs(flatfile, old_df, la_name): - """ """ - Looks for existing file of the same type and merges with | Deletes existing data for new LA from pan file - > Merges new LA data to pan file - """ """ - old_file = Path(flat_output, f"CIN_Census_merged_flatfile | old_df = old_df.drop(old_df[old_df["LA"] == la_name].inde - if old_file.is_file(): | flatfile = pd.concat([flatfile, old_df], axis=0, ignore_i - old_df = pd.read_csv(old_file, parse_dates=dates, day < - merged_df = pd.concat([flatfile, old_df], axis=0) < - else: < - merged_df = flatfile < - return merged_df < - < - < -def deduplicate(flatfile, sort_order, dedup): < - """ < - Sorts and removes duplicate records from merged files fol < - """ < - flatfile = flatfile.sort_values(sort_order, ascending=Fal < - flatfile = flatfile.drop_duplicates(subset=dedup, keep="f < - return flatfile return flatfile - - -def remove_old_data(flatfile, years): | def merge_agg_files(flat_output, dates, la_name, flatfile): - """ """ - Removes data older than a specified number of years | Checks if pan file exists - > Passes old and new file to function to be merged - """ """ - year = pd.to_datetime("today").year | output_file = Path(flat_output, f"pan_London_CIN_flatfile - month = pd.to_datetime("today").month | if output_file.is_file(): - if month <= 6: | old_df = pd.read_csv(output_file, parse_dates=dates, - year = 
year - 1 | flatfile = _merge_dfs(flatfile, old_df, la_name) - flatfile = flatfile[flatfile["YEAR"] >= year - years] < - return flatfile return flatfile - - -def export_flatfile(flat_output, flatfile): def export_flatfile(flat_output, flatfile): - """ | output_path = Path(flat_output, f"pan_London_CIN_flatfile - Writes the flatfile output as a csv < - """ < - output_path = Path(flat_output, f"CIN_Census_merged_flatf < - flatfile.to_csv(output_path, index=False) flatfile.to_csv(output_path, index=False) - - -def filter_flatfile(flatfile, filter): def filter_flatfile(flatfile, filter): - """ """ - Filters rows to specified events Filters rows to specified events - Removes redundant columns that relate to other types of e Removes redundant columns that relate to other types of e - """ """ - filtered_flatfile = flatfile[flatfile["Type"] == filter] filtered_flatfile = flatfile[flatfile["Type"] == filter] - filtered_flatfile = filtered_flatfile.dropna(axis=1, how= filtered_flatfile = filtered_flatfile.dropna(axis=1, how= - return filtered_flatfile return filtered_flatfile - - -def split_factors(factors): def split_factors(factors): - """ """ - Creates a new set of columns from the flatfile with a col Creates a new set of columns from the flatfile with a col - Rows correspond the the rows of the flatfile and should h Rows correspond the the rows of the flatfile and should h - """ """ - factor_cols = factors.Factors factor_cols = factors.Factors - factor_cols = factor_cols.str.split(",", expand=True) factor_cols = factor_cols.str.split(",", expand=True) - factor_cols = factor_cols.stack() factor_cols = factor_cols.stack() - factor_cols = factor_cols.str.get_dummies() factor_cols = factor_cols.str.get_dummies() - factor_cols = factor_cols.groupby(level=0).sum() factor_cols = factor_cols.groupby(level=0).sum() - assert factor_cols.isin([0, 1]).all(axis=None) assert factor_cols.isin([0, 1]).all(axis=None) - factors = pd.concat([factors, factor_cols], axis=1) factors = pd.concat([factors, factor_cols], axis=1) - return factors return factors - - -def export_factfile(analysis_output, factors): def export_factfile(analysis_output, factors): - """ """ - Writes the factors output as a csv Writes the factors output as a csv - """ """ - output_path = Path(analysis_output, f"CIN_Census_factors. output_path = Path(analysis_output, f"CIN_Census_factors. 
- factors.to_csv(output_path, index=False) factors.to_csv(output_path, index=False) - - -def referral_inputs(flatfile): def referral_inputs(flatfile): - """ """ - Creates three inputs for referral journeys analysis file Creates three inputs for referral journeys analysis file - """ """ - ref = filter_flatfile(flatfile, filter="CINreferralDate") ref = filter_flatfile(flatfile, filter="CINreferralDate") - s17 = filter_flatfile(flatfile, filter="AssessmentActualS s17 = filter_flatfile(flatfile, filter="AssessmentActualS - s47 = filter_flatfile(flatfile, filter="S47ActualStartDat s47 = filter_flatfile(flatfile, filter="S47ActualStartDat - return ref, s17, s47 return ref, s17, s47 - - -# FIXME: This function defaults to returning nothing - should < -# or simply just returning the series and then doing ` < -def _time_between_date_series(later_date_series, earlier_date def _time_between_date_series(later_date_series, earlier_date - days_series = later_date_series - earlier_date_series days_series = later_date_series - earlier_date_series - days_series = days_series.dt.days days_series = days_series.dt.days - - if days == 1: if days == 1: - return days_series return days_series - - elif years == 1: elif years == 1: - years_series = (days_series / 365).apply(np.floor) years_series = (days_series / 365).apply(np.floor) - years_series = years_series.astype('Int32') years_series = years_series.astype('Int32') - return years_series return years_series - - -def _filter_event_series(dataset: pd.DataFrame, days_series: | def _filter_event_series(dataset, days_series, max_days): - """ < - Filters a dataframe to only include rows where the column < - is 0 <= days_series <= max_days < - - Args: < - dataset (pd.DataFrame): The dataset to filter < - days_series (str): The name of the column containing < - max_days (int): The maximum number of days between th < - < - Returns: < - pd.DataFrame: The filtered dataset < - < - """ < - dataset = dataset[ dataset = dataset[ - ((dataset[days_series] <= max_days) & (dataset[days_s ((dataset[days_series] <= max_days) & (dataset[days_s - ] ] - return dataset return dataset - - -def merge_ref_s17(ref, s17, ref_assessment): def merge_ref_s17(ref, s17, ref_assessment): - """ """ - Merges ref and s17 views together, keeping only logically Merges ref and s17 views together, keeping only logically - """ """ - # Merges referrals and assessments # Merges referrals and assessments - ref_s17 = ref.merge( ref_s17 = ref.merge( - s17[["LAchildID", "AssessmentActualStartDate"]], how= s17[["LAchildID", "AssessmentActualStartDate"]], how= - ) ) - - # Calculates days between assessment and referral # Calculates days between assessment and referral - ref_s17["days_to_s17"] = _time_between_date_series( ref_s17["days_to_s17"] = _time_between_date_series( - ref_s17["AssessmentActualStartDate"], ref_s17["CINref ref_s17["AssessmentActualStartDate"], ref_s17["CINref - ) ) - - # Only assessments within config-specifed period followin # Only assessments within config-specifed period followin - ref_s17 = _filter_event_series(ref_s17, "days_to_s17", re ref_s17 = _filter_event_series(ref_s17, "days_to_s17", re - - # Reduces dataset to fields required for analysis # Reduces dataset to fields required for analysis - ref_s17 = ref_s17[["Date", "LAchildID", "AssessmentActual ref_s17 = ref_s17[["Date", "LAchildID", "AssessmentActual - - return ref_s17 return ref_s17 - - -def merge_ref_s47(ref, s47, ref_assessment): def merge_ref_s47(ref, s47, ref_assessment): - """ """ - Merges ref and s47 views together, 
keeping only logically Merges ref and s47 views together, keeping only logically - """ """ - # Merges referrals and S47s # Merges referrals and S47s - ref_s47 = ref.merge( ref_s47 = ref.merge( - s47[["LAchildID", "S47ActualStartDate"]], how="left", s47[["LAchildID", "S47ActualStartDate"]], how="left", - ) ) - - # Calculates days between S47 and referral # Calculates days between S47 and referral - ref_s47["days_to_s47"] = _time_between_date_series( ref_s47["days_to_s47"] = _time_between_date_series( - ref_s47["S47ActualStartDate"], ref_s47["CINreferralDa ref_s47["S47ActualStartDate"], ref_s47["CINreferralDa - ) ) - - # Only S47s within config-specifed period following refer # Only S47s within config-specifed period following refer - ref_s47 = _filter_event_series(ref_s47, "days_to_s47", re ref_s47 = _filter_event_series(ref_s47, "days_to_s47", re - - # Reduces dataset to fields required for analysis # Reduces dataset to fields required for analysis - ref_s47 = ref_s47[["Date", "LAchildID", "S47ActualStartDa ref_s47 = ref_s47[["Date", "LAchildID", "S47ActualStartDa - - return ref_s47 return ref_s47 - - -def ref_outcomes(ref, ref_s17, ref_s47): def ref_outcomes(ref, ref_s17, ref_s47): - """ """ - Merges views together to give all outcomes of referrals i Merges views together to give all outcomes of referrals i - Outcomes column defaults to NFA unless there is a relevan Outcomes column defaults to NFA unless there is a relevan - Calculates age of child at referral Calculates age of child at referral - """ """ - # Merge databases together # Merge databases together - ref_outs = ref.merge(ref_s17, on=["Date", "LAchildID"], h ref_outs = ref.merge(ref_s17, on=["Date", "LAchildID"], h - ref_outs = ref_outs.merge(ref_s47, on=["Date", "LAchildID ref_outs = ref_outs.merge(ref_s47, on=["Date", "LAchildID - - # Set default outcome to "NFA" # Set default outcome to "NFA" - ref_outs["referral_outcome"] = "NFA" ref_outs["referral_outcome"] = "NFA" - - # Set outcome to "S17" when there is a relevant assessmen # Set outcome to "S17" when there is a relevant assessmen - ref_outs.loc[ ref_outs.loc[ - ref_outs["AssessmentActualStartDate"].notnull(), "ref ref_outs["AssessmentActualStartDate"].notnull(), "ref - ] = "S17" ] = "S17" - - # Set outcome to "S47" when there is a relevant S47 # Set outcome to "S47" when there is a relevant S47 - ref_outs.loc[ref_outs["S47ActualStartDate"].notnull(), "r ref_outs.loc[ref_outs["S47ActualStartDate"].notnull(), "r - - # Set outcome to "Both S17 & S47" when there are both # Set outcome to "Both S17 & S47" when there are both - ref_outs.loc[ ref_outs.loc[ - ( ( - (ref_outs["AssessmentActualStartDate"].notnull()) (ref_outs["AssessmentActualStartDate"].notnull()) - & (ref_outs["S47ActualStartDate"].notnull()) & (ref_outs["S47ActualStartDate"].notnull()) - ), ), - "referral_outcome", "referral_outcome", - ] = "Both S17 & S47" ] = "Both S17 & S47" - - # Calculate age of child at referral # Calculate age of child at referral - ref_outs["Age at referral"] = _time_between_date_series( ref_outs["Age at referral"] = _time_between_date_series( - ref_outs["CINreferralDate"], ref_outs["PersonBirthDat ref_outs["CINreferralDate"], ref_outs["PersonBirthDat - ) ) - - return ref_outs return ref_outs - - -def export_reffile(analysis_output, ref_outs): def export_reffile(analysis_output, ref_outs): - """ """ - Writes the referral journeys output as a csv Writes the referral journeys output as a csv - """ """ - output_path = Path(analysis_output, f"CIN_Census_referral output_path = 
Path(analysis_output, f"CIN_Census_referral - ref_outs.to_csv(output_path, index=False) ref_outs.to_csv(output_path, index=False) - - -def journey_inputs(flatfile): def journey_inputs(flatfile): - """ """ - Creates inputs for the journey analysis file | Creates the input for the journey analysis file - """ """ - # Create inputs from flatfile and merge them # Create inputs from flatfile and merge them - s47_j = filter_flatfile(flatfile, "S47ActualStartDate") s47_j = filter_flatfile(flatfile, "S47ActualStartDate") - cpp = filter_flatfile(flatfile, "CPPstartDate") cpp = filter_flatfile(flatfile, "CPPstartDate") - return s47_j, cpp return s47_j, cpp - - -def journey_merge(s47_j, cpp, icpc_cpp_days, s47_cpp_days): def journey_merge(s47_j, cpp, icpc_cpp_days, s47_cpp_days): - """ """ - Merges inputs to produce outcomes file Merges inputs to produce outcomes file - """ """ - s47_cpp = s47_j.merge( s47_cpp = s47_j.merge( - cpp[["LAchildID", "CPPstartDate"]], how="left", on="L cpp[["LAchildID", "CPPstartDate"]], how="left", on="L - ) ) - - # Calculate days from ICPC to CPP start # Calculate days from ICPC to CPP start - s47_cpp["icpc_to_cpp"] = _time_between_date_series( s47_cpp["icpc_to_cpp"] = _time_between_date_series( - s47_cpp["CPPstartDate"], s47_cpp["DateOfInitialCPC"], s47_cpp["CPPstartDate"], s47_cpp["DateOfInitialCPC"], - ) ) - - # Calculate days from S47 to CPP start # Calculate days from S47 to CPP start - s47_cpp["s47_to_cpp"] = _time_between_date_series( s47_cpp["s47_to_cpp"] = _time_between_date_series( - s47_cpp["CPPstartDate"], s47_cpp["S47ActualStartDate" s47_cpp["CPPstartDate"], s47_cpp["S47ActualStartDate" - ) ) - - # Only keep logically consistent events (as defined in co # Only keep logically consistent events (as defined in co - s47_cpp = s47_cpp[ s47_cpp = s47_cpp[ - ((s47_cpp["icpc_to_cpp"] >= 0) & (s47_cpp["icpc_to_cp ((s47_cpp["icpc_to_cpp"] >= 0) & (s47_cpp["icpc_to_cp - | ((s47_cpp["s47_to_cpp"] >= 0) & (s47_cpp["s47_to_cp | ((s47_cpp["s47_to_cpp"] >= 0) & (s47_cpp["s47_to_cp - ] ] - - # Merge events back to S47_j view # Merge events back to S47_j view - s47_outs = s47_j.merge( s47_outs = s47_j.merge( - s47_cpp[["Date", "LAchildID", "CPPstartDate", "icpc_t s47_cpp[["Date", "LAchildID", "CPPstartDate", "icpc_t - how="left", how="left", - on=["Date", "LAchildID"], on=["Date", "LAchildID"], - ) ) - - return s47_outs return s47_outs - - -def s47_paths(s47_outs, s47_day_limit, icpc_day_limit): def s47_paths(s47_outs, s47_day_limit, icpc_day_limit): - """ """ - Creates an output that can generate a Sankey diagram of o Creates an output that can generate a Sankey diagram of o - """ """ - # Dates used to define window for S47 events where outcom # Dates used to define window for S47 events where outcom - for y in s47_outs["YEAR"]: for y in s47_outs["YEAR"]: - s47_outs["cin_census_close"] = datetime(int(y), 3, 31 s47_outs["cin_census_close"] = datetime(int(y), 3, 31 - s47_outs["s47_max_date"] = s47_outs["cin_census_close"] - s47_outs["s47_max_date"] = s47_outs["cin_census_close"] - - s47_day_limit s47_day_limit - ) ) - s47_outs["icpc_max_date"] = s47_outs["cin_census_close"] s47_outs["icpc_max_date"] = s47_outs["cin_census_close"] - icpc_day_limit icpc_day_limit - ) ) - - # Setting the Sankey diagram source for S47 events # Setting the Sankey diagram source for S47 events - step1 = s47_outs.copy() step1 = s47_outs.copy() - step1["Source"] = "S47 strategy discussion" step1["Source"] = "S47 strategy discussion" - - # Setting the Sankey diagram destination for S47 events # Setting 
the Sankey diagram destination for S47 events - step1["Destination"] = np.nan step1["Destination"] = np.nan - - step1.loc[step1["DateOfInitialCPC"].notnull(), "Destinati step1.loc[step1["DateOfInitialCPC"].notnull(), "Destinati - - step1.loc[ step1.loc[ - step1["DateOfInitialCPC"].isnull() & step1["CPPstartD step1["DateOfInitialCPC"].isnull() & step1["CPPstartD - "Destination", "Destination", - ] = "CPP start" ] = "CPP start" - - step1.loc[ step1.loc[ - ( ( - (step1["Destination"].isnull()) (step1["Destination"].isnull()) - & (step1["S47ActualStartDate"] >= step1["s47_max_ & (step1["S47ActualStartDate"] >= step1["s47_max_ - ), ), - "Destination", "Destination", - ] = "TBD - S47 too recent" ] = "TBD - S47 too recent" - - step1.loc[step1["Destination"].isnull(), "Destination"] = step1.loc[step1["Destination"].isnull(), "Destination"] = - - # Setting the Sankey diagram source for ICPC events # Setting the Sankey diagram source for ICPC events - step2 = step1[step1["Destination"] == "ICPC"] step2 = step1[step1["Destination"] == "ICPC"] - step2["Source"] = "ICPC" step2["Source"] = "ICPC" - - # Setting the Sankey diagram destination for ICPC events # Setting the Sankey diagram destination for ICPC events - step2["Destination"] = np.nan step2["Destination"] = np.nan - - step2.loc[step2["CPPstartDate"].notnull(), "Destination"] step2.loc[step2["CPPstartDate"].notnull(), "Destination"] - - step2.loc[ step2.loc[ - ( ( - (step2["Destination"].isnull()) (step2["Destination"].isnull()) - & (step2["DateOfInitialCPC"] >= step2["icpc_max_d & (step2["DateOfInitialCPC"] >= step2["icpc_max_d - ), ), - "Destination", "Destination", - ] = "TBD - ICPC too recent" ] = "TBD - ICPC too recent" - - step2.loc[step2["Destination"].isnull(), "Destination"] = step2.loc[step2["Destination"].isnull(), "Destination"] = - - # Merge the steps together # Merge the steps together - s47_journey = pd.concat([step1, step2]) s47_journey = pd.concat([step1, step2]) - - # Calculate age of child at S47 # Calculate age of child at S47 - s47_journey["Age at S47"] = _time_between_date_series( s47_journey["Age at S47"] = _time_between_date_series( - s47_journey["S47ActualStartDate"], s47_journey["Perso s47_journey["S47ActualStartDate"], s47_journey["Perso - ) ) - - return s47_journey return s47_journey - - -def export_journeyfile(analysis_output, s47_journey): def export_journeyfile(analysis_output, s47_journey): - """ """ - Writes the S47 journeys output as a csv Writes the S47 journeys output as a csv - """ """ - output_path = Path(analysis_output, f"CIN_Census_S47_jour output_path = Path(analysis_output, f"CIN_Census_S47_jour - s47_journey.to_csv(output_path, index=False) s47_journey.to_csv(output_path, index=False) - -``` \ No newline at end of file diff --git a/docs/fix_episodes.md b/docs/fix_episodes.md new file mode 100644 index 00000000..3b0d64b1 --- /dev/null +++ b/docs/fix_episodes.md @@ -0,0 +1,9 @@ +|Stage|Rule|Action|Note +|---|---|---|--- +|Stage1|RULE_1|Set DEC = DECOM_next
\ No newline at end of file
diff --git a/docs/general_pipeline.md b/docs/general_pipeline.md
new file mode 100644
index 00000000..39043e96
--- /dev/null
+++ b/docs/general_pipeline.md
@@ -0,0 +1,142 @@
+# General Pipeline
+
+This general outline documents the steps that most of these pipelines will follow. For some pipelines, steps may be skipped or additional steps may be added; however, by sticking to these general steps we can ensure that the data is processed in a consistent way and that the codebase is easier to maintain.
+
+The high-level steps can be summarised as follows:
+
+1. Prep data - move file and collect metadata
+2. Clean data - ensure file is in a consistent format
+3. Enrich data - add additional data from other sources, e.g. LA or year
+4. Apply Privacy Policy - degrade data to meet data minimisation rules
+5. History - sessions data
+6. Concatenate data - concatenate data for each LA
+7. Prepare reports
+
+We will go into more detail below.
+
+For steps 1-6, there will be:
+
+* an 'input' file area, where the files are uploaded to.
+* a 'workspace' file area, containing 'current' and 'sessions' folders.
+  * the 'current' folder contains a copy of the processed data, appropriately cleaned and minimised.
+  * the 'sessions' folder contains a history of each session, including the incoming, cleaned, enriched and degraded files as well as an error report.
+  * these folders are only visible to the pipeline but can be accessed by technical staff in case of troubleshooting.
+* a 'shared' file area, containing 'current', 'concatenated' and 'error_report' folders.
+  * the 'current' folder contains a copy of the data from the 'workspace/current' folder.
+  * the 'concatenated' folder contains the concatenated data produced in step 6.
+  * the 'error_report' folder contains a copy of the error report from the 'input/sessions' folder.
+  * this folder can be seen by the LA account and accessed by central pipelines for creating reports.
+
+For step 7, there will be:
+
+* an 'input' file area, which will be the previous step's 'shared/concatenated' folder.
+* a 'workspace' file area, containing 'current' and 'sessions' folders.
+  * the 'current' folder contains a copy of the reports created for each use case.
+  * the 'sessions' folder contains a history of each session, including the incoming files.
+  * these folders are only visible to the pipeline but can be accessed by technical staff in case of troubleshooting.
+* a 'shared' file area, containing a copy of the reports created for each use case and an 'error_report' folder.
+  * the 'error_report' folder contains a copy of the error report from the 'input/sessions' folder from the previous steps.
+  * this folder can be seen by the Organisation account.
+
+## Prep data
+
+Initial setup and configuration, including creating a new session folder, moving the incoming file to the session folder, and collecting metadata.
+
+This is expected to be a standard task that covers all pipelines.
+
+If no new files are found, this step simply exits.
+
+Returns:
+
+* Session folder
+* List of files
+* Metadata - if provided in incoming folder, such as folder name
+
+## Clean data
+
+* Detects the year.
+* Checks the year is within the retention policy (see the sketch at the end of this section).
+* Reads and parses the incoming files.
+* Ensures that the data is in a consistent format and that all required fields are present.
+* Collects "error" information on any quality problems identified, such as:
+
+  * File older than retention policy
+  * Unknown files
+  * Blank files
+  * Missing headers
+  * Missing fields
+  * Unknown fields
+  * Incorrectly formatted data / categories
+  * Missing data
+
+* Creates dataframes for the identified tables
+* Applies retention policy to dataframes, including file names, headers and year.
+
+Inputs:
+
+  * Session folder
+  * Incoming files
+
+Outputs:
+
+  * Dataframes for retained tables
+  * Error report
+  * Relevant metadata
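+
+As an illustrative sketch only (the exact boundary handling lives in the pipeline configuration), the retention check on the detected year can be pictured as:
+
+```python
+def year_within_retention(year: int, retention_period: int, current_year: int) -> bool:
+    """Return True if a file's detected year falls inside the retention window."""
+    return year > current_year - retention_period
+
+
+assert year_within_retention(2023, 6, 2024) is True
+assert year_within_retention(2015, 6, 2024) is False
+```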
+
+## Enrich data
+
+Adds standard enrichments to the data; these include:
+
+  * Adds suffix to ID fields to ensure uniqueness
+  * Adds LA name
+  * Adds detected year
+
+Table and column names can be provided through configuration. Other functions can be added to the enrichment pipeline as required. A short illustrative sketch follows at the end of this section.
+
+Inputs:
+
+  * Dataframes for each table
+  * Metadata
+
+Outputs:
+
+  * Dataframes for each table
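+
+A minimal sketch of what these enrichments amount to is shown below; the column names (CHILD, LA, YEAR) and the function signature are illustrative, as the real enrichments are configured per table in pipeline.json:
+
+```python
+import pandas as pd
+
+
+def enrich_table(df: pd.DataFrame, la_code: str, la_name: str, year: int) -> pd.DataFrame:
+    """Illustrative enrichment: unique child IDs plus LA and year columns."""
+    df = df.copy()
+    # Suffix the LA code onto the child identifier so IDs stay unique across LAs
+    df["CHILD"] = df["CHILD"].astype(str) + "_" + la_code
+    df["LA"] = la_name
+    df["YEAR"] = year
+    return df
+```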
+
+## Apply Privacy Policy
+
+Removes sensitive columns and data, or masks / blanks / degrades the data to meet data minimisation rules.
+
+Working on each of the tables in turn, this process will degrade the data to meet data minimisation rules (see the sketch at the end of this section):
+
+  * Dates all set to the first of the month
+  * Postcodes all set to the first 4 characters (excluding spaces)
+
+Inputs:
+
+  * Dataframes for each table
+
+Outputs:
+
+  * Dataframes for each table
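+
+The sketch below illustrates the two rules above with pandas; it is an example only and not the pipeline's degrade functions (the configured functions are listed in _transform_functions.py):
+
+```python
+import pandas as pd
+
+
+def degrade_date(dates: pd.Series) -> pd.Series:
+    """Set every date to the first day of its month."""
+    return pd.to_datetime(dates).dt.to_period("M").dt.to_timestamp()
+
+
+def degrade_postcode(postcodes: pd.Series) -> pd.Series:
+    """Keep only the first four characters of each postcode, ignoring spaces."""
+    return postcodes.fillna("").str.replace(" ", "", regex=False).str[:4]
+```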
Set REC = 'E99' (LIIA fix)" Child ceases LAC but re-enters care later| Child ceases LAC but re-enters care later +|Stage1|RULE_2|"Set DEC = 31/03/YEAR
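+
+As an illustration of the concatenation step (the real implementation reads unique_key and sort from the pipeline schemas; the function below is a sketch, not pipeline code):
+
+```python
+import pandas as pd
+
+
+def concatenate_years(frames: list[pd.DataFrame], unique_key: list[str], sort: list[str]) -> pd.DataFrame:
+    """Stack several years of one table and keep one row per unique key."""
+    combined = pd.concat(frames, ignore_index=True)
+    combined = combined.sort_values(sort)
+    # keep="last" retains the highest-sorted (e.g. most recent) row for each key
+    return combined.drop_duplicates(subset=unique_key, keep="last")
+```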
Set REC = 'E99' (LIIA fix)" Child no longer LAC|Child no longer LAC +|Stage1|RULE_3|Delete episode|Duplicate episode +|Stage1|RULE_3A|Delete episode|Episode replaced in later submission +|Stage2|RULE_4|Set DEC = DECOM_next|Overlaps next episode +|Stage2|RULE_5|Set DEC = DECOM_next|Gap between episodes which should be continuous (X1) \ No newline at end of file diff --git a/docs/general_pipeline.md b/docs/general_pipeline.md new file mode 100644 index 00000000..39043e96 --- /dev/null +++ b/docs/general_pipeline.md @@ -0,0 +1,142 @@ +# General Pipeline + +This general outline documents the steps that most of these pipelines will follow. For some pipelines, some steps may be skipped or additional steps may be added, however, by sticking to these general steps we can ensure that the data is processed in a consistent way, as well as making it easier to maintain the codebase + +The high-level steps can be summarised as follows: + +1. Prep data - move file and collect metadata +2. Clean data - ensure file is in a consistent format +3. Enrich data - add additional data from other sources, e.g. LA or year +4. Apply Privacy Policy - degrade data to meet data minimisation rules +5. History - sessions data +6. Concatenate data - concatenate data for each LA +7. Prepare reports + +But we will go into more detail below. + +For steps 1-6, there will be: + +* an 'input' file area, where the files are uploaded to. +* a 'workspace' file area, containing 'current' and 'sessions' folders. + * the 'current' folder contains a copy of the processed data appropriately cleaned and minimised. + * the 'sessions' folder contains a history of each session, including the incoming, cleaned, enriched and degraded files as well as an error report. + * these folders are only visible to the pipeline but can be accessed by technical staff in case of troubleshoothing. +* a 'shared' file area, containing 'current', 'concatenated' and 'error_report' folders. + * the 'current' folder contains a copy of the data from the 'workspace/current' folder. + * the 'concatenated' folder contains the concatenated data produced in step 6. + * the 'error_report' folder contains a copy of the error report from the 'input/sessions' folder. + * this folder can seen by the LA account and accessed by central pipelines for creating reports. + +For step 7, there will be: + +* an 'input' file area, which will be the previous steps 'shared/concatenated' folder. +* a 'workspace' file area, containing 'current' and 'sessions' folders. + * the 'current' folder contains a copy of the reports created for each use case. + * the 'sessions' folder contains a history of each session, including the incoming files. + * these folders are only visible to the pipeline but can be accessed by technical staff in case of troubleshoothing. +* a 'shared' file area, containing a copy of the reports created for each use case and an 'error_report' folder. + * the 'error_report' folder contains a copy of the error report from the 'input/sessions' folder from the previous steps. + * this folder can seen by the Organisation account. + +## Prep data + +Initial setup & configuration including creating a new session folder, moving the incoming file to the session folder, and collecting metadata. + +Expect to be a standard task to cover all pipelines. + +If no new files are found, this step simply exits. + +Returns: + +* Session folder +* List of files +* Metadata - if provided in incoming folder, such as folder name + +## Clean data + +* Detects the year. 
-
-### sf-fons-pipeline-tools-dagster
-
-Dagster specific utilities - such as changing the main
-flow by adding stages etc.
-
-### sfdata
-
-Tools for additional data processing - allows use of stream parser to add detailed custom cleaning.
-
-### Client Implementation
-
-Holds configuration and implementation of the client pipeline.
-
-## sf-fons-runtime
-
-The runtime is the main entry point for the pipeline.
It is responsible for loading the configuration and running the pipeline within the sf-fons environment. - -``` mermaid - -graph TD; - sf-fons-pipeline-tools --> client - - sf-fons-pipeline-tools-dagster -..-> client - - sfdata -..-> client - - client[Client Implementation] --> sf-fons-runtime; - - -``` diff --git a/docs/pipeline.md b/docs/pipeline.md deleted file mode 100644 index 3905f1f5..00000000 --- a/docs/pipeline.md +++ /dev/null @@ -1,103 +0,0 @@ -# General Pipeline - -This general outline documents the steps that most of these pipelines will follow. For some pipelines, some steps may be skipped or additional steps may be added, however, by sticking to these general steps we can ensure that the data is processed in a consistent way, as well as making it easier to maintain the codebase - -The high-level steps can be summarised as follows: - -1. Prepfile - move file and collect metadata -2. Cleanfile - ensure file is in a consistent format -3. Enrich data - add additional data from other sources, e.g. LA or year -4. Apply Privacy Policy - degrade data to meet data minimisation rules -5. History 1 - Archive data -6. History 2 - Rollup data -7. Client reports -8. Prepare shareable reports -9. Data Retention 1 - Clear old history data -10. Data Retention 2 - Clear old session data -11. Data Retention 3 - Clear incoming data - -But we will go into more detail below. - -For all of these, there will be an 'incoming' file area, where the files are uploaded to, and a 'processed' file area, where the files are moved to after processing. In addition, there is a 'session' folder which is only visible to the pipeline but can be accessed by technical staff in case of troubleshoothing. - -To allow for data sharing, there is also a 'shareable' folder, which is a copy of the processed data appropriately minimised for sharing. This folder can only be seen by the client but also accessed by central pipelines for merging with other shared data. - - -## Prepfile - -Initial setup & configuration including creating a new session folder, moving the incoming file to the session folder, and collecting metadata. - -Expect to be a standard task to cover all pipelines. - -If no new files are found, this step simply exits. - -Returns: -* Session folder -* List of files -* Metadata - if provided in incoming folder, such as folder name - -Questions: -* What if there are multiple folders? E.g. say we use folder per-year, but what if there are multiple files accross several years, do we run this multiple times or do we handle multiple years? - -## Cleanfile - -* Reads and parses the incoming files. -* Detects the year if the format is year dependent. -* Ensures that the data is in a consistent format and that all required fields are present. -* Creates "issues" lists of any quality problems identified such as: - * Unknown files - * Missing fields - * Unknown fields - * Incorrectly formatted data / categories - * Missing data -* Creates dataframes for the indetified tables - -Inputs: - * Session folder - -Outputs: - * Dataframes for each table - * Issues lists (format TBC) - * Relevant metadata - -Recommend that return object follows a standard format so it can be serialized in a standard way for all pipelines - -## Enrich data - -Adds standard enrichments to the data, these include: - - * Adds suffix to ID fields to ensure uniqueness - * Adds LA name - * Adds detected year - -Tables and columns names can be provided through configuration. Other functions can be added to the enrichment pipeline as required. 
- -Inputs: - * Dataframes for each table - * Metadata - -Outputs: - * Dataframes for each table - -## Apply Privacy Policy - -Removes sensitive columns and data, or masks / blanks / degrades the data to meet data minimisation rules. - -Working on each of the tables in turn, this process will degrade the data to meet data minimisation rules: - * Dates all set to the first of the month - * Postcodes all set to the first 4 characters (excluding spaces) - * Some tables need rows deleted if there are blanks in a specific column - -Inputs: - * Dataframes for each table - -Outputs: - * Dataframes for retained tables - -## History 1 - Archive data -## History 2 - Rollup data -## Client reports -## Prepare shareable reports -## Data Retention 1 - Clear old history data -## Data Retention 2 - Clear old session data -## Data Retention 3 - Clear incoming data \ No newline at end of file diff --git a/docs/pipeline_creation.md b/docs/pipeline_creation.md new file mode 100644 index 00000000..4f0420f3 --- /dev/null +++ b/docs/pipeline_creation.md @@ -0,0 +1,333 @@ +# Creating a pipeline + +This creating a pipeline document outlines the steps required to create a pipeline compatible with the data platform. This document will go into detail of steps 1-7 (all steps) in the general pipeline document. It will outline how to achieve a data spanning several years across multiple LAs as highlighted in step 7, but will not include specific/bespoke analyses. + +## 1. Create a new pipeline folder with the following subfolders (example files have also been included) + +``` +liia_tools/ +├─ new_pipeline/ +│ ├─ __init__.py +│ ├─ stream_filters.py (only for xml) +│ ├─ stream_pipeline.py +│ ├─ stream_record.py (only for xml) +│ ├─ spec/ +│ │ ├─ __init__.py +│ │ ├─ pipeline.json +│ │ ├─ new_data_schema_2024.yml or new_data_schema_2024.xsd +│ │ ├─ new_data_schema_2025.diff.yml or new_data_schema_2025.xsd +│ │ ├─ samples/ +│ │ │ ├─ __init__.py +│ │ │ ├─ new_data_sample.csv +``` + +## 2. Create the schemas: use a .yml schema for .csv and .xlsx files, use an .xsd schema for .xml files + +The first .yml schema will be a complete schema for the easliest year of data collection. Afterwards you can create .yml.diff schemas which just contain the differences in a given year and will be applied to the initial .yml schema. \ +For .xml files there is no equivalent .xsd.diff so each year will need a complete schema. + +* The .yml schema should follow this pattern: + +```yaml +column_map: + file_name: + header_name: + cell_type: "property" + canbeblank: yes/no + header_name: + ... +``` + +* Details of the cell_type and "property" can be found in the Column class in the [__data_schema.py](/liiatools/common/spec/__data_schema.py) file and below are details of how to apply the available types: + +```yaml +string: "alphanumeric" + +string: "postcode" + +string: "regex" + +numeric: + type: "integer" + min_value: 0 (optional) + max_value: 10 (optional) + +numeric: + type: "float" + decimal_places: 2 (optional) + +date: "%d/%m/%Y" + +category: + - code: b) Female + name: F (optional) + cell_regex: (optional) + - /.*fem.*/i + - /b\).*/i + +header_regex: + - /.*child.*id.*/i +``` + +* Details of how the .diff.yml files are applied can be found in the load_schema function of the [spec/_\_init__.py](/liiatools/ssda903_pipeline/spec/__init__.py) file. 
The .diff.yml schema should follow this pattern: + +```yaml +column_map.header.uniquepropertyreferencenumber: + type: add + description: Adds new column for header + value: + numeric: + type: "integer" + min_value: 1 + max_value: 999999999999 + canbeblank: yes + +column_map.social_worker: + type: add + description: Adds new table for the 2024 return + value: + CHILD: + string: "alphanumeric" + canbeblank: no + +column_map.episodes.REC: + type: modify + description: Updated REC codes for 2019 schema + value: + category: + - code: "X1" + - code: "E11" + - code: "E12" + +column_map.uasc.sex: + type: rename + description: Rename column in uasc + value: + gender + +column_map.prev_perm: + type: remove + description: Remove column from prev_perm + value: + - DOB +``` + +* The .xsd schema should follow the standard xml schema patterns as described in this [XML Schema Tutorial](https://www.w3schools.com/xml/schema_intro.asp). The cleaning performed on the .xsd shchema mimics the .yml schema e.g. postcode, integer, float etc. Details of how the different cleaning functions are implemented can be found in the add_column_spec function in the [cin_pipeline/stream_filters.py](/liiatools/cin_census_pipeline/stream_filters.py) file and below are details of how to apply the available cleaning functions: + +```xml +Category + + + Male (optional) + + + Female(optional) + + + +Numeric (integer) + + + (optional) + + + +Numeric (float) + + + (optional) + (optional) + (optional) + + + +Regex + + + + + + +Date + + +String + +``` + +## 3. Create the pipeline.json file, these follow the same pattern across all pipelines + +* The .json schema should follow this pattern: + +```json +{ + "retention_columns": { + "year_column": "YEAR_COLUMN", + "la_column": "LA_COLUMN" + }, + "retention_period": { + "USE_CASE_1": 12, + "USE_CASE_2": 7, + }, + "la_signed": { + "LA_name": { + "USE_CASE_1": "Yes", + "USE_CASE_2": "No" + } + } + "table_list": [ + { + "id": "table_name", + "retain": [ + "USE_CASE_TO_RETAIN_FOR" (optional) + ] + "columns": [ + { + "id": "column_1", + "type": "type", + "unique_key": true, (optional) + "enrich": [ + "enrich_function_1", (optional) + "enrich_function_2" (optional) + ] + "degrade": "degrade_function" (optional) + "sort": sort_order_integer (optional) + "exclude": [ + "USE_CASE_TO_EXCLUDE_FROM" (optional) + ] + } + ] + } + ] +} +``` +* Details of the different available values can be found in the PipelineConfig class in the [__config.py](/liiatools/common/data/__config.py) file. +* Details of the different possible enrich and degrade functions can be found in the [_transform_functions.py](/liiatools/common/_transform_functions.py) file. +* The retention columns should align with the columns created for year and LA in the table_list section of this pipeline.json file. + +## 4. Create the new_pipeline/spec/_\_init__.py file which will load the schema and the pipeline + +* For .csv or .xlsx files use the [ssda903_pipeline/spec/_\_init__.py](/liiatools/ssda903_pipeline/spec/__init__.py) as a template. The only changes needed are two SSDA903 references in the load_schema function, these should be renamed to reflect the new use case name and therefore schema names. + +* For .xml files use the [cin_census_pipeline/spec/_\_init__.py](/liiatools/cin_census_pipeline/spec/__init__.py) as a template. Similarly the only changes needed are the two CIN references in the load_schema function, these should be renamed to reflect the new use case name and therefore schema names. + +## 5. 
Create the stream_pipeline.py file. These will vary slightly depending on the type of data you wish to produce a pipeline for
+
+* For .csv files use the [ssda903_pipeline/stream_pipeline.py](/liiatools/ssda903_pipeline/stream_pipeline.py) file. You can simply copy the file and update the docstrings to refer to the new dataset.
+
+* For .xlsx files use the [annex_a_pipeline/stream_pipeline.py](/liiatools/annex_a_pipeline/stream_pipeline.py) file. Again you can copy this file and update the docstrings to refer to the new dataset. For these file types you will also need to copy the [annex_a_pipeline/stream_filters.py](/liiatools/annex_a_pipeline/stream_filters.py) to the same location as the stream_pipeline.py file.
+
+* For .xml files use the [cin_pipeline/stream_pipeline.py](/liiatools/cin_pipeline/stream_pipeline.py) file. Again you can copy this file and update the docstrings to refer to the new dataset. For these file types you will also need to copy the [cin_pipeline/stream_filters.py](/liiatools/cin_pipeline/stream_filters.py) and [cin_pipeline/stream_record.py](/liiatools/cin_pipeline/stream_record.py) files. It will be easiest to copy over these files initially and make small adjustments as needed. These adjustments are described in detail below.
+
+## 6. Create stream_filters.py and stream_record.py files for .xml pipelines. These files will both vary slightly depending on the data within the pipeline
+
+* The stream_filters.py file will need changes in the add_column_spec function. This is where we convert the values in the .xsd schema to mimic the values in a .yml schema, allowing the cleaning of the pipelines to function in the same way. Examples of changes are:
+
+```python
+if config_type is not None:
+    if config_type[-4:] == "type":  # no change needed, looks in the .xsd for categories which should all end with "type"
+        column_spec.category = _create_category_spec(config_type, schema_path)
+    if config_type in ["positiveinteger"]:  # include all nodes that should be turned into numeric values
+        column_spec.numeric = _create_numeric_spec(config_type, schema_path)
+    if config_type in ["upn"]:  # include all nodes that should be cleaned with regex values
+        column_spec.string = "regex"
+        column_spec.cell_regex = _create_regex_spec(config_type, schema_path)
+    if config_type == "{http://www.w3.org/2001/XMLSchema}date":  # no change needed, looks in the .xsd for dates
+        column_spec.date = "%Y-%m-%d"
+    if config_type == "{http://www.w3.org/2001/XMLSchema}dateTime":  # no change needed, looks in the .xsd for datetimes
+        column_spec.date = "%Y-%m-%dT%H:%M:%S"
+    if config_type in [  # keep these values but may need to add other default integer values
+        "{http://www.w3.org/2001/XMLSchema}integer",
+        "{http://www.w3.org/2001/XMLSchema}gYear",
+    ]:
+        column_spec.numeric = Numeric(type="integer")
+    if config_type == "{http://www.w3.org/2001/XMLSchema}string":  # no change needed, looks in the .xsd for strings
+        column_spec.string = "alphanumeric"
+```
+
+* The stream_record.py will need changes to the classes and collectors. The current cin_pipeline/stream_record.py is specific to the cin census use case and should not be applied in general terms. This module tries to create one .csv of cin events; in general we want to create one .csv file per .xml node with just one layer of subnodes (no xs:complexType within an xs:complexType) e.g.
for Children's Social Work Workforce we have: + +```xml + + + + + + + +``` +* Here Header, LALevelVacancies and CSWWWorker just contain one layer of subnodes each, so we create a .csv for each of these (although Header doesn't actually contain any useful information so we ignore that). + +* Changes that are needed for the stream_record.py file include: + +```python +Classes +class CSWWEvent(events.ParseEvent): # rename this class to the align with the xs:element you are interested in, e.g. CSWWWorker -> CSWWEvent + @staticmethod + def name(): + return "worker" # rename this to a sensible name that will be used to name the .csv file + + pass + +Collector +def message_collector(stream): + """ + Collect messages from XML elements and yield events + + :param stream: An iterator of events from an XML parser + :yield: Events of type HeaderEvent, CSWWEvent or LALevelEvent + """ + stream = peekable(stream) + assert stream.peek().tag == "Message", f"Expected Message, got {stream.peek().tag}" + while stream: + event = stream.peek() + if event.get("tag") == "Header": # Rename this to match the .xsd schema + header_record = text_collector(stream) + if header_record: + yield HeaderEvent(record=header_record) # Rename this to match the class you have created + elif event.get("tag") == "CSWWWorker": # Rename this to match the .xsd schema + csww_record = text_collector(stream) + if csww_record: + yield CSWWEvent(record=csww_record) # Rename this to match the class you have created + elif event.get("tag") == "LALevelVacancies": # Rename this to match the .xsd schema + lalevel_record = text_collector(stream) + if lalevel_record: + yield LALevelEvent(record=lalevel_record) # Rename this to match the class you have created + else: + next(stream) + +Export +@generator_with_value +def export_table(stream): # This function should stay the same but is not currently used for the cin census so I have created a version here + """ + Collects all the records into a dictionary of lists of rows + + This filter requires that the stream has been processed by `message_collector` first + + :param stream: An iterator of events from message_collector + :yield: All events + :return: A dictionary of lists of rows, keyed by record name + """ + dataset = {} + for event in stream: + event_type = type(event) + dataset.setdefault(event_type.name(), []).append(event.as_dict()["record"]) + yield event + return dataset +``` +## 7. Adjust the [assets/common.py](/liiatools_pipeline/assets/common.py) file to include the new pipeline.json file you have created + +* You just need to add an import like so: + +```python +from liiatools.new_pipeline.spec import load_pipeline_config as load_pipeline_config_dataset +``` + +## 8. Adjust the [ops/common_la.py](/liiatools_pipeline/ops/common_la.py) file to include the new schema and task_cleanfile you have created + +* You just need to add an import like so: + +```python +from liiatools.new_pipeline.spec import load_schema as load_schema_dataset +from liiatools.new_pipeline.stream_pipeline import task_cleanfile as task_cleanfile_dataset +``` \ No newline at end of file diff --git a/docs/ssda903.md b/docs/ssda903.md deleted file mode 100644 index dde79bd9..00000000 --- a/docs/ssda903.md +++ /dev/null @@ -1,215 +0,0 @@ -# Data Specification: SSDA903 Return - -Test Coverage: 48% - -Structure here is slightly different in that the CLI just calls functions in `s903_main_functions.py`. I'm not sure why this is the case. 
- -Four CLI options: - -* cleanfile(input, la_code, la_log_dir, output) -* la_agg(input, output) -* pan_agg(input, la_code, output) -* suffiency_output(input, output) - -**QUESTIONS** - * Part of this process is to filter out tables that should not be shared. However, nothing removes these files if they already exist at source. Is this a problem? - - -## CLI COMMAND: Cleanfile - -* cleanfile - * delete_unrequired_files - * takes a list of filenames to delete - * loops over each configured filename, and if the current filename matches the configured filename, deletes the input file - * save_unrequired_file_error - * Opens a dyanmic filaneme based on the input filename **DAGSTER WARNING** - - **Q:** Why doesn't this just check if the input filename is IN the filelist? - - * check_blank_file - * Attempts to open the input file using pd.read_csv **DATSTER WARNING** - * If opening the files raises pandas.errors.EmptyDataError: - * Opens a dynamic filename based on the input filename **DAGSTER WARNING** - * **RETURNS**: "empty" (str) - - * IF the previous return == "empty", exits operation - - - **NOTE**: Although if successful the above file has now been read as a dataframe, it is not kept - - * drop_empty_rows - * Opens the input file using pd.read_csv **DAGSTER WARNING** - * Drops any rows that are completely empty - * Saves the file to the output filename using df.to_csv **DAGSTER WARNING** - - * Calls the `check_year` function to get the year from the filename (see comments in [cin_census.md](cin_census)) - * if the above fails, calls `save_year_error` which writes a single line to a dynamically named logfile **DAGSTER WARNING** - - * Calls `check_year_within_range` function to check if file should be processed (see comments in [cin_census.md](cin_census)) - * if the above fails, calls `save_incorrect_year_error` which writes a single line to a dynamically named logfile **DAGSTER WARNING** - - * Calls `check_file_type` function to check if file should be processed (see comments in [cin_census.md](cin_census)) - * this function is a bit more complex, and writes the logfile itself **DAGSTER WARNING** - - - **FROM HERE WE SWITCH FROM PANDAS TO STREAM** - - * parse_csv - * Uses tablib to open the input file **DAGSTER WARNING** - * Returns stream of table events - - * add_year_column - - * configure_stream - * add_table_name - * inherit_property(table_name) - * inherit_property(expected_columns) **NOTE** This is set by `add_table_name` - * match_config_to_cell - - * clean - * clean_dates - * clean_categories - * clean_integers - * clean_postcodes - - * degrade - * degrade_postcodes - * degrade_dob - - * log_errors - * blank_error_check - * create_formatting_error_count - * create_blank_error_count - * create_file_match_error - * create_extra_column_error - - * create_la_child_id - - * save_stream - * coalesce_row - * create_tables - * save_tables - - * save_errors_la - - -## CLI COMMAND: la_agg - -* la_agg - - * Config() - - * read_file - alias for pd.read_csv **DAGSTER WARNING** - - * match_load_file - * determines the current table name based on exactly mathing the column headers - - * merge_la_files - * Uses the table name to look for a file with the name `SSDA903_{table_name}_merged.csv` **DAGSTER WARNING** - * If this file exists - opens and merges it with the current dataframe - - * convert_datetimes - * Finds the expected date columns for the current table_name - * Converts these columns to datetime format using fixed format "%Y/%m/%d" - - NOTE: Would be more efficient if it did all 
columns in one go - - * deduplicate - * Sorts and drops duplicates based on the list of primary keys from the config file - - * remove_old_data - The function `remove_old_data` is designed to filter out rows from a dataframe based on the "YEAR" column value, removing any rows where the "YEAR" is older than a specified number of years as at a given reference date. - - Here's a breakdown of what the function does: - - 1. **Parameters:** - - `s903_df`: This is a pandas DataFrame, presumably containing data from a CSV file, and there's a column named "YEAR" that has year values in it. - - `num_of_years`: This specifies how many years of data you want to retain, counting back from a given reference date. - - `new_year_start_month`: This denotes the month which signifies the start of a new year for the data retention policy. It allows the function to handle cases where the start of the "data year" is not necessarily January. - - `as_at_date`: This is the reference date against which the valid range is determined. - - 2. **Calculate Current Year and Month:** The function extracts the year and month from the `as_at_date` using pandas' `to_datetime` function. - - 3. **Determine the Earliest Allowed Year:** - - If the current month is before the specified `new_year_start_month`, then the `earliest_allowed_year` is simply the current year minus `num_of_years`. - - Otherwise, it rolls forward one year by subtracting `num_of_years` from the current year and adding 1. This is to account for scenarios where, let's say, the data year starts in July and the `as_at_date` is in August. In that case, the current data year is considered to be in the retention period. - - 4. **Filter the DataFrame:** The function then filters the input dataframe `s903_df` to retain only the rows where the "YEAR" column is greater than or equal to the `earliest_allowed_year`. - - 5. **Return the Filtered DataFrame:** Finally, the filtered dataframe is returned. - - ### Example Usage: - - Consider you have a dataframe (`s903_df`) like this: - ``` - YEAR VALUE - 0 2019 10 - 1 2020 20 - 2 2021 30 - 3 2022 40 - ``` - And your data retention policy is such that the data year starts in July, and you want to keep only the last 2 years of data as of August 2022. - - If you call the function like this: - ```python - filtered_df = remove_old_data(s903_df, 2, 7, "2022-08-01") - ``` - The function would return a dataframe like this: - ``` - YEAR VALUE - 1 2020 20 - 2 2021 30 - 3 2022 40 - ``` - This means the function has removed data from 2019 since it's older than 2 years based on the retention policy starting from July. 
- - * IF FILE STILL HAS DATA - - * convert_dates - * Same as `convert_datetimes` with the addition of `.dt.date` - - * export_la_file - * Saves the dataframe to a CSV file using df.to_csv **DAGSTER WARNING** - - -## CLI COMMAND: pan_agg - -* pan_agg - - * Config() - - * read_file - alias for pd.read_csv **DAGSTER WARNING** - - * match_load_file - * DUPLICATE of `match_load_file` in `la_agg` - - * IF current file is a table to be kept (from config): - - * merge_agg_files - * Loads the old pan london file **DAGSTER WARNING** - - * _merge_dfs - - * Drops old entries by LA code - * Loads the current file - - * export_pan_file - - * Saves the dataframe to a CSV file using df.to_csv **DAGSTER WARNING** - - -## CLI COMMAND: sufficiency_output - -* sufficiency_output - - * Config() - - * read_file - alias for pd.read_csv **DAGSTER WARNING** - - * match_load_file - * DUPLICATE of `match_load_file` in `la_agg` - - * IF current file is a table to be kept (from config): - - * data_min - * Removes columns specified in config - - **Q:** Principles of data minimisation would say that one should specify columns to be preserved - - * export_suff_file - * Saves the dataframe to a CSV file using df.to_csv **DAGSTER WARNING** \ No newline at end of file