This repository contains submissions pertaining to final project work for IDCE 376.
The assignment description can be viewed in the Final_Project-Rubric.pdf document.
Combine infomation from census and ASC data to create an HDI by census tract layer for Worcester, and compare this to metrics of urban canopy/greenness.
- How does urban canopy correlate with a combined measure of economic mobility, educational attainment, and population health?
- Has this relationship changed over time?
The study area is the city limits of Worcester, and the spatial scale of the analysis will be the 44 census tracts in Worcester, as visualized below:
For information on where data was obtained, please see DataSources.md
2010 and 2020 Census
These layers contain information about total population, and racial composition for each census tract.
See Cenus2010_cleaning.ipynb and Cenus2020_cleaning.ipynb in the Scripts folder.
This is a subset of the 2010 census data:
CDC Life Expectancy by Census Tract
No data for tracts comprising Clark University, College of the Holy Cross, and Great Brook Valley public housing.
See CDC_LE_Processing.ipynb
American Community Survey-- Percent of Population with Income below Federal Poverty Line
Both ASC layers were manually cleaned in QGIS, with further cleaning in python. The most tedious part of this entire process was merging census areas that were split betwen the 2010 and 2020 census.
See ACS_2020_Merge.ipynb
American Community Survey-- Percent of Population with Bachelor's Degrees
Tree Planting Points
Point layer of nearly 10,000 trees planted in Worcester County in 2010-2012 through a Massachussets Department of Conservation and Recreation (MA DCR) tree planting program.
Canopy Cover Maps
Worcester canopy cover vector layers from 2008, 2010, and 2015. See DataSources.md for information on how these were created.
Canopy cover 2015:
Canopy Cover 2015 detail:
Raster images include NDBI (Normalized Differenced Built Index), NDVI (Normalized Differenced Vegetation Index), and UVI (Urban Vegetation Index) from 5 different time points: 2007, 2011, 2015, 2019, and 2023, clipped to Worcester city limits.
Raster images were aquired with PySTAC. Each is a median composite image between March 15 and September 15 of the chosen year. See GetWorcesterImages.ipynb.
All data is kept in, or reprojected to EPSG 32619 WGS 84/ UTM Zone 19N.
Examples:
NDBI 2007
NDVI 2007
See ShellCommands.sh for more details
This is the script I used in CLI to batch import vector files into my database, which is called finalProj:
for shapefile in ./*.shp; do
filename=$(basename "$shapefile" .shp)
shp2pgsql -s 32619 -I "$shapefile" "$schema_name.$filename" >> vectorImport.sql
done
This is the script I used in CLI to batch import raster files into my database, which is called finalProj:
for raster in ./*.tif; do
filename=$(basename "$raster" .tif)
raster2pgsql -s 32619 -I -C -M "$raster" "$schema_name.$filename" >> rasterImport.sql
done
Once the rasters are added to the database they must be converted to vector point data in order to be used in spatial queries. This is the SQL batch command used to convert all ndbi, ndvi, and uvi images in the database to points. See finalProj_SQL.sql for more details.
DO $$
DECLARE
raster_table_name TEXT;
vector_table_name TEXT;
BEGIN
FOR raster_table_name IN SELECT table_name FROM information_schema.tables WHERE table_name LIKE 'ndvi_%' OR table_name LIKE 'ndbi_%' OR table_name LIKE 'uvi_%' LOOP
vector_table_name := raster_table_name || '_points';
EXECUTE FORMAT('CREATE TABLE %I AS
SELECT CAST((ST_PixelAsPoints(rast)).val AS DECIMAL) AS float_val,
(ST_PixelAsPoints(rast)).*
FROM %I;', vector_table_name, raster_table_name);
END LOOP;
END $$;
The result, at this point, is that we have quite a lot of tables, 41 to be exact.
Now we are going to use the life expectancy estimate, percent of population in poverty, and percent of population with a bachelor's degree to create an HDI measure for each tract. To do this we will normalize each measure to an index between 0 and 1, which will then be used to weight each evenly in our final composite index.
See FinalProj_SQL.sql for more details, but here is the fun part, combining the relevant columns to a single table:
CREATE TABLE hdi_calc AS
SELECT
l.tract,
l."life expec",
p.povper,
e.perbach
FROM
le_tracts l
JOIN
woo_poverty_2020 p ON l.tract = p.tract
JOIN
woo_education_2020 e ON l.tract = e.tract;
This is what our table looks like:
Now we will calculate a 0-1 normalized value for each
SELECT
MIN(life_exp) AS min_life_exp,
MAX(life_exp) AS max_life_exp,
MIN(povper) AS min_povper,
MAX(povper) AS max_povper,
MIN(perbach) AS min_perbach,
MAX(perbach) AS max_perbach
INTO
min_max_values
FROM
hdi_calc;
ALTER TABLE hdi_calc
ADD COLUMN pov_norm NUMERIC,
ADD COLUMN ed_norm NUMERIC,
ADD COLUMN le_norm NUMERIC;
UPDATE hdi_calc
SET
le_norm = (life_exp - (SELECT min_life_exp FROM min_max_values)) /
((SELECT max_life_exp FROM min_max_values) - (SELECT min_life_exp FROM min_max_values)),
pov_norm = (povper - (SELECT min_povper FROM min_max_values)) /
((SELECT max_povper FROM min_max_values) - (SELECT min_povper FROM min_max_values)),
ed_norm = (perbach - (SELECT min_perbach FROM min_max_values)) /
((SELECT max_perbach FROM min_max_values) - (SELECT min_perbach FROM min_max_values));
Which brings us to here:
Notice that there are a few tracts without life expectancy data. We could subsitute an average value in those cases, but for now we will leave them as NaN.
One thing we need to do is reverse the poverty index. Poverty is bad, higher rates should bring down HDI.
-- Update the hdi_calc table with reversed normalized poverty rates:
UPDATE hdi_calc AS h
SET
pov_norm = 1 - (h.pov_norm);
Now lets calculate our HDI index by tract:
ALTER TABLE hdi_calc
ADD COLUMN hdi NUMERIC;
UPDATE hdi_calc
SET hdi = (pov_norm + ed_norm + le_norm) / 3;
We now have HDI by tract calculated. Let's bring in canopy data.
Start by calculating percent canopy per tract, in a new table
-- Get canopy area by tract in meters2
CREATE TABLE canopy_cover_by_tract AS
SELECT
h.tract,
SUM(ST_Area(ST_Intersection(h.geom, canopy.geom))) AS total_canopy_area
FROM
hdi_calc h
LEFT JOIN
canopy_2015 canopy ON ST_Intersects(h.geom, canopy.geom)
GROUP BY
h.tract;
And bring in tract area
-- Create tract_area column in our new table with on the fly area calculation from hdi_calc 'geom'
UPDATE canopy_cover_by_tract AS c
SET tract_area = h.area_sqm
FROM (
SELECT
tract,
ST_Area(geom) AS area_sqm
FROM
hdi_calc
) AS h
WHERE c.tract = h.tract;
And calculate percent canopy by tract
ALTER TABLE canopy_cover_by_tract
ADD COLUMN per_canopy NUMERIC;
-- Update the per_canopy column with the calculated percent canopy cover
UPDATE canopy_cover_by_tract
SET per_canopy = (total_canopy_area / tract_area) * 100;
Now bring percent canopy in as a column in 'hdi_calc'
ALTER TABLE hdi_calc ADD COLUMN per_canopy NUMERIC;
UPDATE hdi_calc AS h
SET per_canopy = c.per_canopy
FROM canopy_cover_by_tract AS c
WHERE h.tract = c.tract;
And normalize it as a 0-1 index
SELECT
MIN(per_canopy) AS min_per_canopy,
MAX(per_canopy) AS max_per_canopy
INTO
min_max_per_canopy
FROM
canopy_cover_by_tract;
ALTER TABLE hdi_calc
ADD COLUMN per_canopy_norm NUMERIC;
Finally, use the percent canopy normalized index to creat H Tree I
-- Update hdi_calc with normalized per_canopy values
UPDATE
hdi_calc
SET
per_canopy_norm = (per_canopy - (SELECT min_per_canopy FROM min_max_per_canopy)) /
((SELECT max_per_canopy FROM min_max_per_canopy) - (SELECT min_per_canopy FROM min_max_per_canopy));
ALTER TABLE hdi_calc
ADD COLUMN h_tree_i NUMERIC;
UPDATE hdi_calc
SET h_tree_i = (pov_norm + ed_norm + le_norm+per_canopy_norm) / 4;
Brilliant!
HDI
H Tree I (HDI + normalized canopy cover percentage by tract)
HDI Diff
This image depicts how the HDI for each tract changes with the addition of canopy as 1/4 of the composite index.
Comparing this map to HDI (sans canopy) seems to suggest that areas with low initial HDI actually drop when adding the canopy index. Or in other words, that canopy cover is, on average, less equal than HDI, by tract in Worcester.
Let's check out a regression between HDI and HDI change:
And this initial impression is, on average, correct, though it is a weak correlation. As initial HDI increases, so does the amount that the composite index increases when you add canopy cover.
This suggests, again, that canopy cover is less equal than the population level metrics of well being that comprise HDI, on average.
How do we measure well-being, socio-economic or otherwise, by census tract? What do we care about? No matter what we use, it is a proxy for something that is very hard to define, human flourishing. Environmental justice criteria are often for this purpose. According to mass.gov, Environmental Justice communities are defined as a community where one or more of the following criteria are met:
- The annual median household income is 65 percent or less of the statewide annual median household income
- Minorities make up 40 percent or more of the population
- 25 percent or more of households identify as speaking English less than "very well"
- Minorities make up 25 percent or more of the population and the annual median household income of the municipality in which the neighborhood is located does not exceed 150 percent of the statewide annual median household income.
I am curious to what extent this lense on well being would correlate with the HDI measure that I calculated here. There is also the CDC's social vulerability index, which includes as many as 15 different metrics. It is a complex sociological question to consider which of these is the most 'valid' measure of community level well being, or even more generally what metrics we care about most and how we measure them. I want to keep thinking about this.
The biggest adjustment that I could make to my statistical methods to increase the robustness of this analysis is using z-scores (0-1 indexes) from a wider dataset than just Worcester.
For example:
The canopy index here compares only Worcester cenusus tracts with other Worcester census tracts. However, Worcester may have a higher canopy cover percentage than most comparable cities. In which case a canopy index score on the low end of this analysis may not be so bad, when stacked up against similar neighborhoods in other cities, when controlling for HDI.
In other words, low HDI areas that dropped even more when the local canopy z-score was factored in, may in fact be more leafy than comparable neighborhoods in other cities. An analysis bringing in data from other cities would contextualize the H Tree I index by comparing it to tracts beyond just Worcester.
I've been largely stymied in my attempts to directly analyze rasters in SQL. Null values pop up where there are in fact valid values in the pre-import raster. Being able to use rasters in this analysis would be illuminating, particularly being able to compare how closely NDBI, NDVI, or UVI lines up with canopy cover percentage, and potentially combining these into a composite urban greenery index. In the short term I will move forward with the analysis in Python, until I can understand what the issue is with SQL.