Calculates statistical neighbours (aka peers) for Local Authorities in England, for use in Adult Social Care statistics.
This repository is maintained by the NHS England Adult Social Care Statistics Team.
To contact us, raise an issue on GitHub or email [email protected]. See our (and our colleagues') other work here: NHS England Analytical Services.
This repository was developed by the Data Science team for the Adult Social Care Statistics team, to provide a way of comparing statistics between 'similar' Local Authorities.
We have calculated a metric of similarity (Euclidean distance) based on standardised, normalised input features from Census 2021 data, including population demographics such as age, ethnicity and educational attainment.
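As a rough illustration of the idea (not the repository's own code), the sketch below standardises two features to z-scores and then measures the Euclidean distance between areas. The area names and values are invented, and the helper names `standardise` and `euclidean` are hypothetical. It uses only the standard library.

```python
import math
import statistics

# Hypothetical feature values for three made-up areas (not real Census data).
features = {
    "Area A": {"85 and over Population %": 2.0, "student %": 5.0},
    "Area B": {"85 and over Population %": 3.0, "student %": 7.0},
    "Area C": {"85 and over Population %": 4.0, "student %": 30.0},
}


def standardise(table):
    """Rescale each feature to mean 0 and standard deviation 1 (z-scores)."""
    names = list(next(iter(table.values())))
    scaled = {area: {} for area in table}
    for name in names:
        values = [row[name] for row in table.values()]
        mean = statistics.mean(values)
        # Population standard deviation; a constant feature would give zero here.
        spread = statistics.pstdev(values)
        for area, row in table.items():
            scaled[area][name] = (row[name] - mean) / spread
    return scaled


def euclidean(a, b):
    """Straight-line distance between two areas across all features."""
    return math.sqrt(sum((a[name] - b[name]) ** 2 for name in a))


scaled = standardise(features)
distance_ab = euclidean(scaled["Area A"], scaled["Area B"])
distance_ac = euclidean(scaled["Area A"], scaled["Area C"])
```

Here `distance_ab` is smaller than `distance_ac`, so Area B would count as the closer peer of Area A. Without standardisation, the feature with the largest raw range would dominate the distance.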
- This project was developed using Python 3.10.5
- Required Python libraries are listed in requirements.txt
- Optional: Python libraries used for linting are included in dev_requirements.txt. See the developing the pipeline section for more details about linting configuration.
Clone this project and ensure you're in the root directory, ASC_LA_Peer_Groups. You can change your current directory in the terminal e.g.
cd ASC_LA_Peer_Groups
Set up a virtual environment and install requirements:
py -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
The configuration for the pipeline is defined in config.toml. If you want to adjust the weights of any of the input features (including adding or removing features), change the UTLA definitions etc., make the required edits in config.toml.
There are five sections:
- [LOCATION] - locations used by the pipeline. This includes where the input data is stored, as well as where logs and outputs are saved. The locations of your input and output directories must be set in config.toml.
  Notes on file paths:
  - Data files should be downloaded and stored to the location specified in config.toml. This must be outside of this repository, and can be a shared drive location.
  - File paths should contain forward slashes, e.g. "C:/Users/username/Documents/data".
- [LOCAL_AUTHORITY] - defines the local authority codes to use.
- [MODEL_OUTPUT] - defines changeable characteristics of the output. Currently this only includes n_peers, which limits the number of nearest neighbours output by the model.
  Model output defaults:
  n_peers = 15
- [FEATURE_WEIGHTS] - lists features along with their associated weights. Note that a weight of zero reduces the effect of the feature to zero and thereby excludes it completely.
  Feature weight defaults:
  "Over 15 Population" = 1
  "85 and over Population %" = 1
  "Aged 65 to 84 Population %" = 0
  "black african %" = 0.5
  "black caribbean %" = 0.5
  "bangladeshi %" = 0.5
  "indian %" = 0.5
  "chinese %" = 0.5
  "pakistani %" = 0.5
  "mixed %" = 0
  "white %" = 0
  "home_owners %" = 0
  "social_renters %" = 1
  "student %" = 1
  "routine_manual %" = 0
  "low_english_proficiency %" = 1
  "People per square km" = 1
  "higher_level_qualifications %" = 1
  "few_rooms %" = 0
  "Distance to Sea (km)" = 0.5
  "Sparsity (% population living in low density areas)" = 1
- [REMOVE_LAS] - lists UTLAs to be excluded from analysis, for example ["Isles of Scilly", "City of London"]. Use the name of the Local Authority as it appears in the la_name field defined in [LOCAL_AUTHORITY] above.
  Default removed UTLAs:
  NHS England removes the Isles of Scilly and City of London from the default model, i.e.:
  las_to_remove = ["Isles of Scilly", "City of London"]
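To show how these settings could interact, here is a minimal sketch, not the pipeline's actual implementation. It assumes each standardised feature difference is multiplied by its weight before the distance is taken; the README does not confirm how weights are applied internally. The area names, values and helper functions are hypothetical.

```python
import math

# Hypothetical, already-standardised feature values for four made-up areas.
scaled = {
    "Area A": {"student %": 0.0, "white %": 2.0},
    "Area B": {"student %": 0.1, "white %": -2.0},
    "Area C": {"student %": 1.5, "white %": 2.0},
    "Area D": {"student %": 0.05, "white %": 2.0},
}

# Settings mirroring the config sections described above.
weights = {"student %": 1, "white %": 0}  # zero weight: "white %" is ignored
las_to_remove = ["Area D"]
n_peers = 1


def weighted_distance(a, b, weights):
    """Euclidean distance with each feature difference scaled by its weight.

    Assumption: weights multiply the difference; the real pipeline may differ.
    """
    return math.sqrt(sum((w * (a[f] - b[f])) ** 2 for f, w in weights.items()))


def nearest_peers(area, table, weights, n_peers):
    """Return the n_peers closest other areas, nearest first."""
    others = [name for name in table if name != area]
    others.sort(key=lambda name: weighted_distance(table[area], table[name], weights))
    return others[:n_peers]


# Excluded areas are dropped before any distances are calculated.
kept = {name: row for name, row in scaled.items() if name not in las_to_remove}
peers_of_a = nearest_peers("Area A", kept, weights, n_peers)
```

With these made-up numbers, Area D would be the closest match but is excluded, and Area B is returned even though it differs sharply on "white %", because that feature carries a weight of zero.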
Data files should be downloaded and stored to the location specified in config.toml. This must be outside of this repository, and can be a shared drive location.
- Download ten CSV files and save them to the input location specified in config.toml. The names of the files and their sources are provided below. Some files are only available as part of a collection, in which case the source is listed as a zip file containing more than one CSV. Where this is the case, download and extract the zip file, saving the file version which ends 'lsoa'.
Save as file name | Source | Details |
---|---|---|
area_sqkm.csv | https://geoportal.statistics.gov.uk/datasets/a488cb8fc9a74accb63cb52961e456ef/about | Click the Download button at the top of the page. Within the subfolder "Measurements", rename the file "SAM_LSOA_DEC_2021_EW_in_KM" to "area_sqkm.csv" |
distance_to_sea.csv | https://digital.nhs.uk/supplementary-information/2024/distance-to-sea-calculations | Download the csv data file and rename to distance_to_sea.csv |
english_proficiency.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts029.zip | Download the zip folder. Rename the csv ending in "lsoa" to "english_proficiency" |
ethnicity.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts021.zip | Download the zip folder. Rename the csv ending in "lsoa" to "ethnicity" |
housing_tenure.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts054.zip | Download the zip folder. Rename the csv ending in "lsoa" to "housing_tenure" |
ns-sec.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts062.zip | Download the zip folder. Rename the csv ending in "lsoa" to "ns-sec" |
population_data.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts007a.zip | Download the zip folder. Rename the csv ending in "lsoa" to "population_data" |
qualification_level.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts067.zip | Download the zip folder. Rename the csv ending in "lsoa" to "qualification_level" |
rooms.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts051.zip | Download the zip folder. Rename the csv ending in "lsoa" to "rooms" |
LSOA21_to_UTLA22.csv | https://www.data.gov.uk/dataset/14d8efd0-b14c-46ac-b2fe-a7892ea51ca5/lsoa-2021-to-utlas-december-2022-best-fit-lookup-in-ew-v2 | Under "Data links", click the CSV hyperlink to download the file. Rename this file to LSOA21_to_UTLA22 |
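Before running the pipeline it can help to confirm that all ten files are in place. The sketch below is one way to do that and is not part of the repository. The file names come from the table above; the function name `missing_input_files` is hypothetical, and a temporary folder stands in for your real input directory.

```python
import tempfile
from pathlib import Path

# File names taken from the "Save as file name" column above.
EXPECTED_FILES = [
    "area_sqkm.csv",
    "distance_to_sea.csv",
    "english_proficiency.csv",
    "ethnicity.csv",
    "housing_tenure.csv",
    "ns-sec.csv",
    "population_data.csv",
    "qualification_level.csv",
    "rooms.csv",
    "LSOA21_to_UTLA22.csv",
]


def missing_input_files(input_dir):
    """Return the expected file names that are not present in input_dir."""
    folder = Path(input_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Demonstration using a temporary folder containing only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "rooms.csv").touch()
    missing = missing_input_files(tmp)
```

In practice you would pass the input directory you set in config.toml instead of the temporary folder; an empty list means every file was found.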
A note on lookups: The final CSV file listed above, LSOA21_to_UTLA22.csv, maps LSOAs to UTLAs (local authorities).
E.g.:
| LSOA21_CODE | LSOA21_NAME | UTLA_CODE | UTLA_NAME |
|-------------|-------------------|-----------|---------------|
| E01012052 |Middlesbrough 014D | E06000002 | Middlesbrough |
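The lookup is what allows LSOA-level figures to be rolled up to local authority level. A minimal sketch of that roll-up follows, using the column names from the example row above. Apart from the Middlesbrough row, the codes, names and population figures are invented for illustration, and this is not the pipeline's own aggregation code.

```python
import csv
import io
from collections import defaultdict

# First data row is the example above; the others are made-up placeholders.
lookup_csv = """LSOA21_CODE,LSOA21_NAME,UTLA_CODE,UTLA_NAME
E01012052,Middlesbrough 014D,E06000002,Middlesbrough
X00000001,Example LSOA 1,E06000002,Middlesbrough
X00000002,Example LSOA 2,X10000001,Example UTLA
"""

# Hypothetical LSOA-level population counts.
population_by_lsoa = {"E01012052": 1500, "X00000001": 1200, "X00000002": 900}

# Build a mapping from LSOA code to UTLA name.
lsoa_to_utla = {
    row["LSOA21_CODE"]: row["UTLA_NAME"]
    for row in csv.DictReader(io.StringIO(lookup_csv))
}

# Sum the LSOA values within each UTLA.
population_by_utla = defaultdict(int)
for lsoa, population in population_by_lsoa.items():
    population_by_utla[lsoa_to_utla[lsoa]] += population
```

An LSOA code missing from the lookup would raise a `KeyError` here, which is one reason the lookup and the data files need to use the same LSOA boundaries.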
NOTE: Please edit the [LOCATION] section in config.toml before running the pipeline.
Once you've set up the virtual environment in the previous steps, ensure you're in the virtual environment by running .venv\Scripts\activate in the terminal.
Once you've activated your virtual environment, run the following code from the terminal:
python main.py
If you want to adjust the weights of any of the input features (including adding or removing features), change the UTLA definitions etc., make the required edits in config.toml.
(Optional) Adding a custom hash:
To make your pipeline run easier to identify, it is possible to pass a custom hash to name your pipeline. This means log names and your output pipeline folder name will include the hash.
The hash length is set in config.toml. Supplying a shorter hash is fine, but a longer hash will be cropped to its first n characters, where n is the hash length.
python main.py --hash my_run
Where my_run is the custom hash you have supplied.
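The cropping behaviour can be pictured with the short sketch below. It is an assumption-laden illustration: the function names, the timestamp format and the way the timestamp and hash are joined are all hypothetical, and the real folder and log names may be built differently.

```python
from datetime import datetime


def crop_hash(custom_hash, hash_length):
    """Keep at most the first hash_length characters of the supplied hash."""
    return custom_hash[:hash_length]


def pipeline_folder_name(started, custom_hash, hash_length):
    """Hypothetical naming scheme: start time, then the cropped hash."""
    stamp = started.strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{crop_hash(custom_hash, hash_length)}"


unchanged = crop_hash("my_run", 10)              # shorter than the limit
cropped = crop_hash("my_very_long_run_name", 10)  # longer than the limit
folder = pipeline_folder_name(datetime(2024, 1, 2, 3, 4, 5), "my_run", 10)
```

A hash shorter than the configured length passes through untouched, while a longer one keeps only its leading characters, so choose a hash whose first few characters are distinctive.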
This pipeline produces the following as final outputs, saved to the outputs directory:
- features.csv - The final features used to produce the distances
- distances.csv - Distance between each pair of local authorities
- peers.csv - N most similar peers for each local authority (n defined in config.toml)
- example_peers.csv - The above but limited to a subset of local authorities specified in src/params.py
Reports to accompany these outputs, including details of correlation between features and feature distributions, are saved to the reports directory.
Final outputs and reports are saved to a pipeline folder in the output directory defined in config.toml. The name of each pipeline folder corresponds to the time the pipeline was initialised, and any custom hash that was provided.
Interim data processing produces files saved to the data/ directory. These are NOT copied to the pipeline output location.
New data and lookups can be added easily to the pipeline. All new data and lookups should be stored in the input directory, as specified in the config file as input_dir.
First check that the format of the new/updated file matches the old one (see the earlier Data section for links). Move the new file into the input directory (you may want to archive the old file).
In src/params.py, check the "Data File Names" section, and ensure the name of the replaced file matches the corresponding file name in the params. Then check the column values in the "Columns" section of params for the feature you have changed, and ensure these match the columns within the new data.
If updating the lookup file, open config.toml and check that la_code and la_name point to the correct columns in the new lookup.
As of 2024 the latest lookup can be found here: https://www.data.gov.uk/dataset/801d40f6-fa98-40ef-ba16-0193ef04cff0/lsoa-2021-to-utlas-april-2023-best-fit-lookup-in-ew
Copy across the new lookup to the input directory, ensure it has a unique name (e.g. LSOA21_to_UTLA23) and that it is saved as a CSV. Ensure the new lookup has no blank space above the headers, and make a note of the header names.
Navigate to src/params.py, go to the "Data File Names" section and change the name of the LSOA_UTLA_lookup_file to the new lookup file e.g.
LSOA_UTLA_lookup_file = "LSOA21_to_UTLA23.csv"
If the LSOA code column name in the new lookup has changed, you will also need to update LSOA_code (in the "Pathway Parameters" section) to point to the correct column name.
Navigate to config.toml and ensure that la_code and la_name match the names of the relevant columns in your new lookup e.g.
la_code = "UTLA23CD"
la_name = "UTLA23NM"
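A quick way to catch a mismatch between these settings and a new lookup is to compare them against the file's header row. The sketch below does this on an in-memory example and is not part of the pipeline. The function name `missing_columns` is hypothetical, the sample row is invented, and the LSOA column name `LSOA21CD` and the old name `UTLA22CD` are assumptions about how such lookups are typically labelled.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [column for column in required if column not in header]


# Made-up lookup content; column names other than UTLA23CD/UTLA23NM are assumed.
lookup_text = (
    "LSOA21CD,LSOA21NM,UTLA23CD,UTLA23NM\n"
    "X00000001,Example LSOA,X10000001,Example UTLA\n"
)

# Settings as they would appear after updating config.toml and params.py.
missing = missing_columns(lookup_text, ["LSOA21CD", "UTLA23CD", "UTLA23NM"])

# A setting left pointing at a hypothetical old column name is reported.
stale = missing_columns(lookup_text, ["UTLA22CD"])
```

An empty `missing` list suggests the configured names line up with the new file, whereas `stale` shows what a leftover column name from a previous lookup would look like.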
You can now run the pipeline with the updated lookup.
If there are boundary changes, LSOA_AREA_KM.csv will need updating (if an update is available), along with the LSOA to UTLA lookup. See the above section on how to update the data/lookup. The code will then use the new boundary definitions when calculating the Euclidean distances between local authorities.
| .gitignore <- ignores data and virtual environment files
| config.toml <- options for modelling, e.g. output location, k etc.
| requirements.txt <- python libraries required
| dev_requirements.txt <- python libraries required for development (optional, includes linting libraries)
| LICENSE <- license info for public distribution
|
+---reports <- This is a placeholder which the pipeline populates with report outputs (e.g. histograms showing feature distributions)
|
+---output <- This is a placeholder which the pipeline populates with output data
|
| main.py <- Runs the pipeline
|
+---data <- This is a placeholder which the pipeline populates with data
| +---raw
| +---interim
| +---primary
|
+--- src <- Scripts with functions used in main.py
| | __init__.py <- Makes the scripts importable python modules
| | params.py <- configures column names, file paths etc.
| | load.py <- Copies input files from location specified in config
| | clean.py <- Cleans input data to LSOA level
| | process.py <- Aggregates cleaned data to UTLA level
| | model.py <- Calculates distance metric
| | report.py <- Produces accompanying reports e.g. correlation
| | utils.py <- Useful functions used across modules
|
(Optional) Install dev requirements:
pip install -r dev_requirements.txt
You can also run the testing suite once these requirements have been installed:
pytest
This codebase was originally developed by data scientists at NHS England: Harriet Sands and Will Poulett, with help from the Adult Social Care Team at NHS England.
This codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.
Any HTML or Markdown documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.