v1.0 Beta
robjharrison committed Aug 8, 2024
0 parents commit 8e68500
Showing 52 changed files with 2,782 additions and 0 deletions.
86 changes: 86 additions & 0 deletions .github/workflows/gh_refresh_gpage.yml
@@ -0,0 +1,86 @@
# Workflow runs the scrape process to refresh/update daily

# Run events
on:
  # On push
  push:
    branches: ["main"]
  # On pull
  pull_request:
    branches: ["main"]
  # Add manual trigger from Actions tab
  workflow_dispatch:
  # Schedule run at 9 AM UTC every day
  schedule:
    - cron: '0 9 * * *'

# Sets permissions of the GITHUB_TOKEN for deployment to GitHub Pages
permissions:
  contents: write # changed from read to allow repo updates
  pages: write
  id-token: write

# Define workflow job
jobs:
  build:
    # Runs on the latest version of Ubuntu
    runs-on: ubuntu-latest
    steps:
      # Checks out a copy of repo
      - name: Checkout
        uses: actions/checkout@v3

      # Set up Python
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      # Show Current Directory and List Files
      - name: Show Current Directory
        run: |
          echo "Current directory: $(pwd)"
          echo "Listing files:"
          ls -la

      # Install dependencies
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      # Ensure Script is Executable
      - name: Ensure Script is Executable
        run: chmod +x ofsted_childrens_services_inspection_scrape.py

      # Run the scrape
      - name: Run Python script
        run: |
          echo "Running scrape script"
          python ofsted_childrens_services_inspection_scrape.py

      # Configure Git and Commit changes
      - name: Commit and Push changes
        # if: github.event_name == 'schedule' # Use on testing, to avoid inf loop for on push workflow event trigger
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Action"
          git add index.html
          git commit -m "Update index.html via workflow" || echo "No changes to commit"
          git push

  # Deploy job
  deploy:
    # Run on the latest version of Ubuntu
    runs-on: ubuntu-latest
    # Build job must complete successfully
    needs: build
    steps:
      # Deploy to GitHub Pages
      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          # Directory deployed to GitHub Pages
          publish_dir: ./
51 changes: 51 additions & 0 deletions .github/workflows/static.yml
@@ -0,0 +1,51 @@
# Simple workflow for deploying static content to GitHub Pages
# name: Deploy static content to Pages

# # Run events
# on:
#   # On push
#   push:
#     branches: ["main"]
#   # On pull
#   pull_request:
#     branches: ["main"]
#   # Add manual trigger from Actions tab
#   workflow_dispatch:
#   # Schedule run at 9 AM UTC every day
#   schedule:
#     - cron: '0 9 * * *'
#   # Allows you to run this workflow manually from the Actions tab
#   workflow_dispatch:

# # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
# permissions:
#   contents: read
#   pages: write
#   id-token: write

# # Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# # However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
# concurrency:
#   group: "pages"
#   cancel-in-progress: false

# jobs:
#   # Single deploy job since we're just deploying
#   deploy:
#     environment:
#       name: github-pages
#       url: ${{ steps.deployment.outputs.page_url }}
#     runs-on: ubuntu-latest
#     steps:
#       - name: Checkout
#         uses: actions/checkout@v3
#       - name: Setup Pages
#         uses: actions/configure-pages@v3
#       - name: Upload artifact
#         uses: actions/upload-pages-artifact@v2
#         with:
#           # Upload entire repository
#           path: '.'
#       - name: Deploy to GitHub Pages
#         id: deployment
#         uses: actions/deploy-pages@v2
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 data-to-insight

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
73 changes: 73 additions & 0 deletions README.md
@@ -0,0 +1,73 @@
# Ofsted-SEND-Scrape-Tool
On-demand Ofsted SEND results summary, scraped from the inspection reports published on the Ofsted.gov pages.
Published: https://data-to-insight.github.io/ofsted-send-scrape-tool/
-
### The automated daily update of this SEND summary page is not currently running; in the interim we're running it manually on a weekly basis.

## Brief overview
This project began as a proof of concept, on a 'can we do this' basis. As such, it is supplied with the disclaimer that you should check the vitals before embedding it into anything more critical; likewise, please feel free to feed back into the project with suggestions. The structure of the code and processes has much scope for improvement, but some of the initial emphasis was on maintaining a level of readability so that others might have an easier time taking it further. That said, we needed to take some of the scrape/cleaning processes further than anticipated due to inconsistencies in the source site/data, and this has ultimately impacted the intended 're-usable MVP' approach to codifying a solution for the original problem.

The results structure and returned data are based almost entirely on the originating SEND Summary produced and refreshed periodically by the ADCS, which has previously underpinned several D2I projects. We're aware of several similar collections of longer-term work on and around the Ofsted results theme, and would be happy to hear from those who have bespoke ideas for changes here that would assist their own work.

The scrape process is completed by running a single Python script: ofsted_childrens_services_inspection_scrape.py


## Export(s)
There are currently three exports from the script.
### Results HTML page
Generated (as ./index.html) to display a refreshed subset of the SEND results summary.

### Results Overview Summary
The complete SEND overview spreadsheet, exported to the git project root (./) as an .xlsx file for ease, and also accessible via a download link from the generated results page (index.html).

### All CS inspections reports
During the scrape process we scan all the related CS inspection PDF reports for each LA, so these can be (and currently are) packaged up into tidy LA-named folders (urn_LAname) within the git repo (./export_data/inspection_reports/). There is a lot of data here, but if you download the entire export_data folder together with the overview summary sheet after the script has run, the active links in the local_inspection_reports column will work, and you can then easily access each LA's previous reports all in one place via the supplied hyperlink(s). *Note:* This is currently not an option when viewing the results on the web page/Git Pages.
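
As a rough illustration of the urn_LAname folder layout described above, the following minimal sketch builds a folder name and a relative report path. The function names and the name-cleaning rules (lower-casing, underscores for non-alphanumeric characters) are assumptions for illustration only; the actual script may name folders differently.

```python
from pathlib import PurePosixPath

# Root folder for saved reports, as named in this README.
REPORTS_ROOT = PurePosixPath("export_data/inspection_reports")


def la_folder_name(urn: str, la_name: str) -> str:
    """Build a folder name following the urn_LAname pattern.

    Assumption: the LA name is lower-cased and every non-alphanumeric
    character becomes an underscore; the real cleaning rules may differ.
    """
    cleaned = "".join(ch if ch.isalnum() else "_" for ch in la_name.strip().lower())
    # Collapse runs of underscores left behind by punctuation and spaces.
    while "__" in cleaned:
        cleaned = cleaned.replace("__", "_")
    return f"{urn}_{cleaned.strip('_')}"


def local_report_link(urn: str, la_name: str, filename: str) -> str:
    """Return a relative path that a local_inspection_reports link could point at."""
    return str(REPORTS_ROOT / la_folder_name(urn, la_name) / filename)
```

For example, `local_report_link("00000", "Example Council", "report.pdf")` gives a path under export_data/inspection_reports/, which only resolves when the whole export_data folder has been downloaded alongside the summary sheet. The URN and council name here are placeholders, not real values.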

## Known Bugs
Some LAs' inspection reports have PDF encoding problems or inconsistent data in the published reports, which causes extraction issues and null data.
We're working to address these; current known issues are:
- tbc


## Import(s)
There are currently two flat-file (.csv) imports used (/import_data/..).
### LA Lookup (/import_data/la_lookup/)
Allows us to add further LA-related data, such as the historic LA codes still in use for some areas, as well as enablers for further work, for example ONS region identifiers and which CMS system each LA is using.
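
A minimal sketch of how a lookup file like this could be joined onto scraped results, using only the standard library. The column names (la_name, historic_la_code, ons_region, cms_system) and the function names are hypothetical, not taken from the repo, and the CSV is read from a string here purely to keep the example self-contained.

```python
import csv
import io


def load_la_lookup(csv_text: str, key: str = "la_name") -> dict:
    """Index lookup rows by a normalised LA name (hypothetical key column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key].strip().lower(): row for row in reader}


def enrich(result: dict, lookup: dict, key: str = "la_name") -> dict:
    """Add lookup columns to one scraped result row.

    LAs missing from the lookup are returned unchanged, and values already
    present in the result take precedence over lookup values.
    """
    extra = lookup.get(result[key].strip().lower(), {})
    return {**extra, **result}


# Placeholder data: not real LA codes, regions or systems.
LOOKUP_CSV = """la_name,historic_la_code,ons_region,cms_system
Example Council,999,Region A,System X
"""

lookup = load_la_lookup(LOOKUP_CSV)
row = enrich({"la_name": "example council", "outcome": "placeholder"}, lookup)
```

Matching on a normalised name is only one option; joining on a stable identifier such as the URN would be more robust where the lookup file carries one.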
### Geospatial (/import_data/geospatial/)
This is part of some ongoing work to access data we can use to enrich the Ofsted data with location-based information, allowing us to visualise results on a map/choropleth. Some of the work towards this is complete; however, because LAs' geographical delineations don't always map to ONS data, we're in the process of finding some workarounds. The code and the reduced* GeoJSON data are there if anyone would like to fork the project and suggest solutions. *GeoJSON data has been pre-processed to reduce the usually large file size and enable it within this repo/processing.
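
This README does not say how the GeoJSON was reduced. As one possible approach (an assumption, not the project's documented method), the sketch below rounds coordinates to fewer decimal places and drops unused properties. It assumes simple geometry types that carry a `coordinates` array (not GeometryCollection), and the `keep_props` default is a hypothetical property name.

```python
def round_coords(coords, ndigits=4):
    """Recursively round a nested GeoJSON coordinate array."""
    if isinstance(coords, (int, float)):
        return round(coords, ndigits)
    return [round_coords(c, ndigits) for c in coords]


def shrink_geojson(geojson: dict, keep_props=("name",), ndigits=4) -> dict:
    """Return a smaller FeatureCollection: rounded coordinates, fewer properties."""
    features = []
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry")
        if geometry is not None:
            geometry = {
                "type": geometry["type"],
                "coordinates": round_coords(geometry["coordinates"], ndigits),
            }
        props = feature.get("properties") or {}
        features.append({
            "type": "Feature",
            # Keep only the properties named in keep_props.
            "properties": {k: v for k, v in props.items() if k in keep_props},
            "geometry": geometry,
        })
    return {"type": "FeatureCollection", "features": features}


# Tiny placeholder input, not real boundary data.
sample = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "Example", "area": 1},
        "geometry": {"type": "Point", "coordinates": [-1.234567, 52.987654]},
    }],
}
small = shrink_geojson(sample)
```

Rounding trades boundary precision for file size, so the number of decimal places would need checking against how closely the map is expected to zoom.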


## Future work

- Some of the in-progress efforts are included within the downloadable .xlsx file as a point of discussion or a stepping stone for others to develop. For example, a set of columns detailing simplistic inspection sentiment analysis based on the language used in the most recent report (ref cols: sentiment_score, inspectors_median_sentiment_score, sentiment_summary, main_inspection_topics). *Note that the inclusion of these columns does not mean that the scores are accurate; these additions are a starting point for discussion, suggestions and development.*

- Geographical/geospatial visualisations of results by region, LA etc. are in progress. The basis for this is already in place, but some anomalies in how LA/county boundary data is configured are an issue for some areas, and thus the representation requires a bit more thought.

- Improved automated workflow. We're still running the script manually until fixes can be applied to enable the Git workflow(s) to run automatically on a daily basis. We have the needed workflow scripts in place, but there is an ongoing issue in getting the py script to auto-run. Manual runs of the py script (plus push/pull action) do correctly initiate the refresh of the HTML/Git Page.

- Provide active link access to all previous reports via the web front end. This is currently only available when all post-script-run files/folders are downloaded (this is a very large download if all LA folders are included).

- Further development|bespoke work to improve potential tie-in with existing LA work that could use this tool or the resultant data.
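
To make the sentiment columns in the first item above more concrete, here is a deliberately simplistic sketch of how such values could be derived. It uses a tiny hand-made word list, which is an assumption for illustration; this README does not state which method or library the script uses. Reading inspectors_median_sentiment_score as a median across several reports is likewise an assumption.

```python
import re
from statistics import median

# Hypothetical word lists, for illustration only.
POSITIVE = {"good", "effective", "strong", "improved"}
NEGATIVE = {"poor", "weak", "inadequate", "delay"}


def sentiment_score(text: str) -> float:
    """Score in [-1, 1]: (positive - negative) / matched words, 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


def sentiment_summary(score: float, threshold: float = 0.2) -> str:
    """Bucket a score into a label; the threshold is an arbitrary choice."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def median_sentiment_score(report_texts: list) -> float:
    """Median score across a set of reports (one reading of the median column)."""
    return median([sentiment_score(t) for t in report_texts])
```

A word-count approach like this ignores negation and context ("not effective" scores as positive), which is one reason scores of this kind are best treated as a starting point for discussion.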


#### Contact via : datatoinsight.enquiries AT gmail.com


## Script admin notes
Simplified notes towards repo/script admin processes and enabling/instructions for non-admin running.
### Script run instructions (User)
If looking to obtain a full instant refresh of the SEND output, the ofsted_childrens_services_inspection_scrape.py script should be run. These instructions are for running in the cloud/GitHub.
- Create a new Codespace (on main)
- Run the following bash script at the Terminal prompt to set up: './setup.sh'
- Run the script (you can right-click the script file and select 'run in python....')
- Download the now refreshed ofsted_childrens_services_inspection_scrape.xlsx (right-click, download)
- Close the Codespace (GitHub will auto-remove unused spaces later)

### Run notes (Admin)
If you experience a permissions error when running the setup bash file, for example:

/workspaces/ofsted-send-scrape-tool (main) $ ./setup.sh
bash: ./setup.sh: Permission denied

then type the following, and try again:
chmod +x setup.sh
Empty file.
Binary file not shown.