NHSDigital/ASC_LA_Peer_Groups

Peer Groups for Local Authorities

Calculates statistical neighbours (aka peers) for Local Authorities in England, for use in Adult Social Care statistics.

Contact

This repository is maintained by the NHS England Adult Social Care Statistics Team.

To contact us, raise an issue on GitHub or email [email protected]. See our (and our colleagues') other work here: NHS England Analytical Services.

Description

This repository was developed by the Data Science team for the Adult Social Care Statistics team, to provide a way of comparing statistics between 'similar' Local Authorities.

We have calculated a metric of similarity (Euclidean distance) based on standardised, normalised input features from Census 2021 data, including population demographics such as age, ethnicity and educational attainment.
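As a rough illustration of the similarity metric described above, the sketch below converts each feature to z-scores and then computes a Euclidean distance between two authorities. The authority names and feature values are made up, and applying a weight by multiplying the standardised feature is an assumption for illustration, not necessarily how src/model.py does it.

```python
import math
from statistics import mean, pstdev

# Made-up feature values for three hypothetical authorities (not real Census data).
features = {
    "LA_A": {"85 and over Population %": 2.0, "student %": 5.0},
    "LA_B": {"85 and over Population %": 3.0, "student %": 9.0},
    "LA_C": {"85 and over Population %": 4.0, "student %": 7.0},
}
# Assumed weighting scheme: each standardised feature is multiplied by its weight.
weights = {"85 and over Population %": 1.0, "student %": 0.5}


def standardise(table):
    """Convert each feature column to z-scores (mean 0, standard deviation 1)."""
    names = list(next(iter(table.values())))
    stats = {}
    for name in names:
        column = [row[name] for row in table.values()]
        stats[name] = (mean(column), pstdev(column))
    return {
        la: {name: (row[name] - stats[name][0]) / stats[name][1] for name in names}
        for la, row in table.items()
    }


def weighted_distance(a, b, weights):
    """Euclidean distance between two standardised feature rows."""
    return math.sqrt(sum((weights[k] * (a[k] - b[k])) ** 2 for k in weights))


z = standardise(features)
d_ab = weighted_distance(z["LA_A"], z["LA_B"], weights)
d_ac = weighted_distance(z["LA_A"], z["LA_C"], weights)
```

A smaller distance means two authorities are more alike on the chosen features; here LA_B comes out closer to LA_A than LA_C does.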

Setup

  • This project was developed using Python 3.10.5
  • Required Python libraries are listed in requirements.txt
  • Optional: Python libraries used for linting are included in dev_requirements.txt. See the developing the pipeline section for more details about linting configuration.

Set up a virtual environment

Clone this project and ensure you're in the root directory, ASC_LA_Peer_Groups. You can change your current directory in the terminal e.g.

cd ASC_LA_Peer_Groups

Set up a virtual environment and install requirements:

py -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Getting started

Configuring the pipeline

The configuration for the pipeline is defined in config.toml. If you want to adjust the weights of any of the input features (including adding or removing features), change the UTLA definitions, etc., make the required edits in config.toml.

There are five sections:

  1. [LOCATION] - locations used by the pipeline. This includes where the input data is stored, as well as where logs and outputs are saved. The locations of your input and output directories need to be set in config.toml.

    Notes on file paths
    • Data files should be downloaded and stored to the location specified in config.toml. This must be outside of this repository, and can be a shared drive location.
    • File paths should contain forward slashes, e.g. "C:/Users/username/Documents/data"
  2. [LOCAL_AUTHORITY] - defines the local authority codes to use

  3. [MODEL_OUTPUT] - defines changeable characteristics of the output. Currently this only includes n_peers, which limits the number of nearest neighbours output by the model.

    Model output defaults
    n_peers = 15
  4. [FEATURE_WEIGHTS] - lists features along with their associated weights. Note that a weight of zero reduces the effect of the feature to zero and thereby excludes it completely.

    Feature weight defaults
    "Over 15 Population" = 1
    "85 and over Population %" = 1
    "Aged 65 to 84 Population %" = 0
    "black african %" = 0.5
    "black caribbean %" = 0.5
    "bangladeshi %" = 0.5
    "indian %" = 0.5
    "chinese %" = 0.5
    "pakistani %" = 0.5
    "mixed %" = 0
    "white %" = 0
    "home_owners %" = 0
    "social_renters %" = 1
    "student %" = 1
    "routine_manual %" = 0
    "low_english_proficiency %" = 1
    "People per square km" = 1
    "higher_level_qualifications %" = 1
    "few_rooms %" = 0
    "Distance to Sea (km)" = 0.5
    "Sparsity (% population living in low density areas)" = 1
  5. [REMOVE_LAS] - lists UTLAs to be excluded from analysis. For example ["Isles of Scilly", "City of London"]. Please use the name of the Local Authority, using the la_name field defined in [LOCAL_AUTHORITY] above.

    Default removed UTLAs
    NHS England removes the Isles of Scilly and City of London from the default model, i.e.:
    las_to_remove = ["Isles of Scilly", "City of London"]
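To show how these sections fit together, here is a minimal sketch that parses an abbreviated stand-in for config.toml and keeps only the features with a non-zero weight. It is not the pipeline's own loading code: placing input_dir under [LOCATION] is an assumption, and tomllib is only in the standard library from Python 3.11 (on 3.10 the third-party tomli package offers the same API).

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Abbreviated stand-in for config.toml; the path is the example used above.
sample_config = """
[LOCATION]
input_dir = "C:/Users/username/Documents/data"

[MODEL_OUTPUT]
n_peers = 15

[FEATURE_WEIGHTS]
"Over 15 Population" = 1
"Aged 65 to 84 Population %" = 0
"black african %" = 0.5

[REMOVE_LAS]
las_to_remove = ["Isles of Scilly", "City of London"]
"""

config = tomllib.loads(sample_config)

# Features with a weight of zero contribute nothing to the distance.
active_features = {
    name: weight
    for name, weight in config["FEATURE_WEIGHTS"].items()
    if weight != 0
}
```

Backslashes are escape characters in TOML basic strings, which is one reason a Windows path written with forward slashes parses cleanly where "C:\Users\..." would not.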

Data

Data files should be downloaded and stored to the location specified in config.toml. This must be outside of this repository, and can be a shared drive location.

  1. Download ten CSV files and save them to the input location specified in config.toml. The names of the files and their sources are provided below. Some files are only available as part of a collection, in which case the source is listed as a zip file containing more than one CSV. Where this is the case, download and extract the zip file, then save the file whose name ends in 'lsoa'.
| Save as file name | Source | Details |
|-------------------|--------|---------|
| area_sqkm.csv | https://geoportal.statistics.gov.uk/datasets/a488cb8fc9a74accb63cb52961e456ef/about | Click the Download button at the top of the page. Within the subfolder "Measurements", rename the file "SAM_LSOA_DEC_2021_EW_in_KM" to "area_sqkm.csv" |
| distance_to_sea.csv | https://digital.nhs.uk/supplementary-information/2024/distance-to-sea-calculations | Download the csv data file and rename to distance_to_sea.csv |
| english_proficiency.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts029.zip | Download the zip folder. Rename the csv ending in "lsoa" to "english_proficiency" |
| ethnicity.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts021.zip | Download the zip folder. Rename the csv ending in "lsoa" to "ethnicity" |
| housing_tenure.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts054.zip | Download the zip folder. Rename the csv ending in "lsoa" to "housing_tenure" |
| ns-sec.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts062.zip | Download the zip folder. Rename the csv ending in "lsoa" to "ns-sec" |
| population_data.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts007a.zip | Download the zip folder. Rename the csv ending in "lsoa" to "population_data" |
| qualification_level.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts067.zip | Download the zip folder. Rename the csv ending in "lsoa" to "qualification_level" |
| rooms.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts051.zip | Download the zip folder. Rename the csv ending in "lsoa" to "rooms" |
| LSOA21_to_UTLA22.csv | https://www.data.gov.uk/dataset/14d8efd0-b14c-46ac-b2fe-a7892ea51ca5/lsoa-2021-to-utlas-december-2022-best-fit-lookup-in-ew-v2 | Under "Data links", click the CSV hyperlink to download the file. Rename this file to LSOA21_to_UTLA22 |
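
Before running the pipeline, it can help to confirm that all ten files are in place. The helper below is a hypothetical convenience, not part of the repository; it reports which of the file names listed in this section are missing from a given folder.

```python
from pathlib import Path

# File names as listed in this Data section.
EXPECTED_FILES = [
    "area_sqkm.csv",
    "distance_to_sea.csv",
    "english_proficiency.csv",
    "ethnicity.csv",
    "housing_tenure.csv",
    "ns-sec.csv",
    "population_data.csv",
    "qualification_level.csv",
    "rooms.csv",
    "LSOA21_to_UTLA22.csv",
]


def missing_inputs(input_dir):
    """Return the expected file names that are not present in input_dir."""
    folder = Path(input_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

For example, missing_inputs("C:/Users/username/Documents/data") returns an empty list once every file has been downloaded and renamed.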

A note on lookups: The final CSV file listed above, LSOA21_to_UTLA22.csv, maps LSOAs to UTLAs (local authorities).

E.g.:

| LSOA21_CODE | LSOA21_NAME        | UTLA_CODE | UTLA_NAME     |
|-------------|--------------------|-----------|---------------|
| E01012052   | Middlesbrough 014D | E06000002 | Middlesbrough |
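
To illustrate what the lookup is for, the sketch below sums LSOA-level counts up to UTLA level using the column names from the example row. The second LSOA code and both population counts are invented, and the pipeline's own aggregation in src/process.py may work differently.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the lookup and an LSOA-level data file.
# The second LSOA row and all population counts are invented for illustration.
lookup_csv = """LSOA21_CODE,LSOA21_NAME,UTLA_CODE,UTLA_NAME
E01012052,Middlesbrough 014D,E06000002,Middlesbrough
E01099999,Middlesbrough 099Z,E06000002,Middlesbrough
"""
lsoa_population = {"E01012052": 1500, "E01099999": 1700}

# Map each LSOA code to its parent UTLA name.
lsoa_to_utla = {
    row["LSOA21_CODE"]: row["UTLA_NAME"]
    for row in csv.DictReader(io.StringIO(lookup_csv))
}

# Sum LSOA-level counts up to UTLA level.
utla_population = defaultdict(int)
for lsoa_code, count in lsoa_population.items():
    utla_population[lsoa_to_utla[lsoa_code]] += count
```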

Running the pipeline

NOTE: Please edit the [LOCATION] section in config.toml before running the pipeline.

Once you've set up the virtual environment in the previous steps, ensure you're in the virtual environment by running .venv\Scripts\activate in the terminal.

Once you've activated your virtual environment, run the following code from the terminal:

python main.py

If you want to adjust the weights of any of the input features (including adding or removing features), change the UTLA definitions, etc., make the required edits in config.toml.

(Optional) Adding a custom hash:

To make your pipeline run easier to identify, it is possible to pass a custom hash to name your pipeline. This means log names and your output pipeline folder name will include the hash.

The hash length is set in config.toml. Supplying a shorter hash is fine, but a longer hash will be cropped to its first n characters, where n is the hash length.

python main.py --hash my_run

Where my_run is the custom hash you have supplied.
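
The cropping behaviour amounts to a string slice, as this sketch shows. The function name and the timestamp format are hypothetical; the pipeline's actual folder naming may differ.

```python
from datetime import datetime


def pipeline_folder_name(started, custom_hash, hash_length):
    """Build a run folder name from the start time and a cropped custom hash."""
    cropped = custom_hash[:hash_length]  # shorter hashes pass through unchanged
    return f"{started:%Y%m%d_%H%M%S}_{cropped}"


name = pipeline_folder_name(datetime(2024, 1, 31, 9, 30, 0), "my_long_run_name", 6)
```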

Outputs

This pipeline produces the following as final outputs, saved to the outputs directory:

  • features.csv - The final features used to produce the distances
  • distances.csv - Distance between each pair of local authorities
  • peers.csv - N most similar peers for each local authority (n defined in config.toml)
  • example_peers.csv - The above but limited to a subset of local authorities specified in src/params.py
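
The relationship between distances.csv and peers.csv can be sketched as follows: for each authority, sort the other authorities by distance and keep the first n. The distances and authority names here are made up, and the function is an illustration rather than the code in src/model.py.

```python
# Made-up pairwise distances between four hypothetical authorities.
distances = {
    ("LA_A", "LA_B"): 1.2,
    ("LA_A", "LA_C"): 0.4,
    ("LA_A", "LA_D"): 2.5,
    ("LA_B", "LA_C"): 0.9,
    ("LA_B", "LA_D"): 1.1,
    ("LA_C", "LA_D"): 3.0,
}


def closest_peers(la, distances, n_peers):
    """Return the n_peers authorities with the smallest distance to la."""
    candidates = []
    for (first, second), distance in distances.items():
        if la == first:
            candidates.append((distance, second))
        elif la == second:
            candidates.append((distance, first))
    return [name for _, name in sorted(candidates)[:n_peers]]


peers_of_b = closest_peers("LA_B", distances, n_peers=2)
```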

Reports to accompany these outputs, including details of correlation between features and feature distributions, are saved to the reports directory.

Final outputs and reports are saved to a pipeline folder within the output directory defined in config.toml. The name of each pipeline folder corresponds to the time the pipeline was initialised, plus any custom hash that was provided.

Interim data processing produces files saved to the data/ directory. These are NOT copied to the pipeline output location.

Updating the data/lookup files

New data and lookups can be added easily to the pipeline. All new data and lookups should be stored in the input directory, as specified in the config file as input_dir.

First check that the format of the new/updated file matches the old one (see the earlier Data section for links). Move the new file into the input directory (you may want to archive the old file).

In src/params.py, check the "Data File Names" section, and ensure the name of the replaced file matches the corresponding file name in the params. Further to this, check the column values in the "Columns" section of params for the feature you have changed, and ensure these match the columns within the new data.

If updating the lookup file, open config.toml and check that la_code and la_name point to the correct columns in the new lookup.

Example: updating the LSOA to UTLA lookup

As of 2024 the latest lookup can be found here: https://www.data.gov.uk/dataset/801d40f6-fa98-40ef-ba16-0193ef04cff0/lsoa-2021-to-utlas-april-2023-best-fit-lookup-in-ew

Copy across the new lookup to the input directory, ensure it has a unique name (e.g. LSOA21_to_UTLA23) and that it is saved as a CSV. Ensure the new lookup has no blank space above the headers, and make a note of the header names.

Navigate to src/params.py, go to the "Data File Names" section and change the value of LSOA_UTLA_lookup_file to the name of the new lookup file, e.g.

LSOA_UTLA_lookup_file = "LSOA21_to_UTLA23.csv"

If the LSOA code column name in the new lookup has changed, you will also need to update LSOA_code (in the "Pathway Parameters" section) to point to the correct column name.

Navigate to config.toml and ensure the la_code and la_name match the names of the relevant columns in your new lookup e.g.

la_code = "UTLA23CD"
la_name = "UTLA23NM"
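
A quick header check can catch a mismatch between config.toml and the new lookup before the pipeline runs. The helper below is a hypothetical sketch; the LSOA21CD and LSOA21NM header names in the sample data are assumptions about the new file.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file))
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory stand-in for the new lookup; LSOA21CD and LSOA21NM are assumed header names.
new_lookup = io.StringIO(
    "LSOA21CD,LSOA21NM,UTLA23CD,UTLA23NM\n"
    "E01012052,Middlesbrough 014D,E06000002,Middlesbrough\n"
)
problems = missing_columns(new_lookup, ["UTLA23CD", "UTLA23NM"])
```

With a real file, pass an open file object instead, e.g. open(path, newline=""). An empty list means both columns named in config.toml are present.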

You can now run the pipeline with the updated lookup.

Boundary changes

If there are boundary changes, LSOA_AREA_KM.csv will need updating (if an update is available), along with the LSOA to UTLA lookup. See the section above on updating the data/lookup files. The code will then use the new boundary definitions when calculating the Euclidean distances.

Project structure

| .gitignore                <- ignores data and virtual environment files
| config.toml               <- options for modelling, e.g. output location, k etc.
| requirements.txt          <- python libraries required
| dev_requirements.txt      <- python libraries required for development (optional, includes linting libraries)
| LICENSE                   <- license info for public distribution
|
+---reports                 <- This is a placeholder which the pipeline populates with report outputs (e.g. histograms showing feature distributions)
|
+---output                  <- This is a placeholder which the pipeline populates with output data
|
| main.py                   <- Runs the pipeline
|
+---data                    <- This is a placeholder which the pipeline populates with data
|   +---raw
|   +---interim
|   +---primary
|
+--- src                    <- Scripts with functions used in main.py
|   |   __init__.py         <- Makes the scripts importable python modules
|   |   params.py           <- configures column names, file paths etc.
|   |   load.py             <- Copies input files from location specified in config
|   |   clean.py            <- Cleans input data to LSOA level
|   |   process.py          <- Aggregates cleaned data to UTLA level
|   |   model.py            <- Calculates distance metric
|   |   report.py           <- Produces accompanying reports e.g. correlation
|   |   utils.py            <- Useful functions used across modules
|

Developing the pipeline

(Optional) Install dev requirements:

pip install -r dev_requirements.txt

You can also run the testing suite once these requirements have been installed:

pytest

Contributors

This codebase was originally developed by data scientists at NHS England: Harriet Sands and Will Poulett, with help from the Adult Social Care Team at NHS England.

Licence

This codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

Any HTML or Markdown documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
