Skip to content

Latest commit

 

History

History

covid19-refine

covid19-refine.sh

This script automates the creation/population of a fully normalized, non-sparse, geo-enriched version of JHU's COVID-19 time-series data.

  1. Be sure to follow steps 1 through 4 of the main README.

  2. Run openrefine-batch.sh to initialize the environment. It will automatically download OpenRefine and openrefine-client. You need to do this only once.

./openrefine-batch.sh

  1. Modify the covid19-refine.sh file to add your Postgres connection parameters, and your Geocode.earth API key.

  2. Run covid19-refine.sh. It will automate several OpenRefine projects to normalize and enrich the latest JHU's time-series data and insert it into Timescale. It will be a relatively long-running automation (6 minutes) as it will run OpenRefine in headless mode, and geographically enrich the location data to support filtering on additional facets (continent, for the US - locality, county, state).

./covid19-refine.sh

  1. Slice and dice the data using Postgres/Timescale! Note these tables/hypertables/continuous aggregates:
  • covid19_loclookup we assign a loc_id to help with joins. We also geocode the data, adding continent; and for the US - locality, county and state for additional aggregrations (sample).
		CREATE TABLE IF NOT EXISTS covid19_loclookup (
		  loc_id INTEGER PRIMARY KEY,
		  province_state TEXT,
		  country_region TEXT NOT NULL,
		  latitude NUMERIC NOT NULL,
		  longitude NUMERIC NOT NULL,
		  us_locality TEXT,
		  us_state TEXT,
		  us_county TEXT,
		  continent TEXT,
		  location_geom geometry(POINT, 2163),
		  geocode_earth_json JSONB);
  • covid19_normalized_ts from the running totals compiled by JHU, we also derive the daily incidents for any specific date/location (e.g. how many confirmed, deaths, recoveries for each day/location). This will allow you to do aggregations for arbitrary date ranges, compute rates of confirmed/deaths/recoveries, and do benchmarking across locations.
		CREATE TABLE IF NOT EXISTS covid19_normalized_ts (
		  loc_id INTEGER NOT NULL,
		  observation_date TIMESTAMPTZ NOT NULL,
		  confirmed_total INTEGER NOT NULL DEFAULT 0,
		  deaths_total INTEGER NOT NULL DEFAULT 0,
		  recovered_total INTEGER NOT NULL DEFAULT 0,
		  confirmed_daily INTEGER NOT NULL DEFAULT 0,
		  deaths_daily INTEGER NOT NULL DEFAULT 0,
		  recovered_daily INTEGER NOT NULL DEFAULT 0,
		  PRIMARY KEY(loc_id, observation_date));
There are several Timescale-powered continuous aggregates as well:
		CREATE VIEW confirmed_3days
		WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
		AS
		SELECT
		  loc_id,
		  time_bucket('3 days', observation_date) as bucket,
		  max(confirmed_total) as running_total,
		  sum(confirmed_daily) as sum,
		  avg(confirmed_daily) as avg,
		  max(confirmed_daily) as max,
		  min(confirmed_daily) as min
		FROM
		  covid19_normalized_ts a
		GROUP BY loc_id, bucket;

		CREATE VIEW confirmed_weekly
		WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
		AS
		SELECT
		  loc_id,
		  time_bucket('7 days', observation_date) as bucket,
		  max(confirmed_total) as running_total,
		  sum(confirmed_daily) as sum,
		  avg(confirmed_daily) as avg,
		  max(confirmed_daily) as max,
		  min(confirmed_daily) as min
		FROM
		  covid19_normalized_ts a
		GROUP BY loc_id, bucket;

With the continuous aggregates, you can ask questions like:

		SELECT b.*,  province_state, country_region
		FROM confirmed_weekly b, covid19_loclookup a 
		WHERE a.loc_id = b.loc_id 
		   AND country_region = 'Mainland China'
		ORDER BY loc_id, bucket asc;

you can view the result here (as of Mar 5, noon EST).