States COVID-19 Time Series History 20210531

@space-buzzer space-buzzer released this 01 Jun 21:55
· 7336 commits to master since this release

States COVID-19 Time Series History 2021-05-31

This release includes the raw time series data fetched by The COVID Tracking Project from states that provide such data directly (through data portals, CSV/Excel files, etc.).

Description

This is a snapshot of the CTP States COVID-19 Time Series history dataset, taken on June 1, 2021 and including all data up to May 31, 2021.
This dataset includes full time series for cases, tests and death metrics, from states that provide such data, that are fetched daily.

This is an append-only dataset: when a time series is fetched, it is tagged with the date on which it was fetched, and that data is never overwritten. The next day, when the time series for the same metric is fetched again, it is tagged with a different fetch timestamp. This allows us to examine how daily values change as states revise previously reported values.
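As a minimal sketch of what the append-only layout makes possible, the snippet below pivots a few rows so that each data date has one column per fetch date, then measures the change between consecutive fetches. The rows and values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows: the same specimen-collection dates fetched on
# consecutive days (values are invented for illustration).
rows = [
    # state, timestamp,  fetch_timestamp, positive
    ("MA", "2021-02-01", "2021-02-02", 100),
    ("MA", "2021-02-01", "2021-02-03", 112),
    ("MA", "2021-02-01", "2021-02-04", 115),
    ("MA", "2021-02-02", "2021-02-03", 90),
    ("MA", "2021-02-02", "2021-02-04", 101),
]
df = pd.DataFrame(rows, columns=["state", "timestamp", "fetch_timestamp", "positive"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["fetch_timestamp"] = pd.to_datetime(df["fetch_timestamp"])

# One row per data date, one column per fetch date.
revisions = df.pivot(index="timestamp", columns="fetch_timestamp", values="positive")

# Change in each data date's value between consecutive fetches.
revision_deltas = revisions.diff(axis=1)
print(revision_deltas)
```

A data date that had not yet been published on a given fetch date shows up as NaN in that column.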

The data is tagged and organized into the same field names used by CTP APIs.

Content

This release comes in two variants: statescovid19.zip, with a single CSV file containing all data for all states, and statescovid19_by_state.zip, with the same data broken down into one file per state.

Avocado was the internal codename for the project of snapshotting historical data.
The files are:

  • avocado_schema.sql: DB schema for avocado table
  • avocado_complete.csv or {state}_avocado_complete.csv (per state files): data for avocado table

We stored the data in a relational database, and the schema is provided in avocado_schema.sql, but a database is not required to use the data. The data is in avocado_complete.csv (or the individual state file), which can be processed with any library or tool that supports CSV (pandas, uploading to BigQuery, etc.).
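For readers using pandas, one possible way to pull a single state out of the all-states file is to read it in chunks and filter as you go. This is only a sketch: `load_state` and its `chunksize` default are hypothetical helpers, and the in-memory sample stands in for avocado_complete.csv with made-up rows and an assumed date format.

```python
import io
import pandas as pd

# Stand-in for avocado_complete.csv; the rows and values are made up.
sample_csv = io.StringIO(
    "state,date_used,timestamp,fetch_timestamp,date,positive,death\n"
    "MA,Specimen Collection,2020-05-13,2020-05-14,20200513,100,\n"
    "MA,Specimen Collection,2020-05-13,2020-05-15,20200513,104,\n"
    "OH,Death,2020-05-13,2020-05-14,20200513,,7\n"
)

def load_state(source, state, chunksize=100_000):
    """Read the CSV in chunks and keep only the rows for one state."""
    chunks = pd.read_csv(
        source,
        parse_dates=["timestamp", "fetch_timestamp"],
        dtype={"date": str},  # keep the CTP-style date as a string
        chunksize=chunksize,
    )
    return pd.concat(
        [chunk[chunk["state"] == state] for chunk in chunks],
        ignore_index=True,
    )

ma = load_state(sample_csv, "MA")
print(ma)
```

With the real release, `source` would be the path to avocado_complete.csv; the per-state files avoid the filtering step entirely.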

CSV fields are:

state              -- 2 letter state abbreviation (e.g., MA)
date_used          -- string representing the dating scheme (e.g., Specimen Collection)
timestamp          -- date this data point refers to
fetch_timestamp    -- date this data point was fetched on
date               -- CTP-style string date (e.g., 20200513)
-- The rest of the fields are the same as CTP API 
positive
positiveCasesViral
probableCases
death
deathConfirmed
deathProbable
total
totalTestsAntibody
positiveTestsAntibody
negativeTestsAntibody
totalTestsViral
positiveTestsViral
negativeTestsViral
totalTestEncountersViral
totalTestsAntigen
positiveTestsAntigen
negativeTestsAntigen

Each metric value is tagged with its state, timestamp, date_used and fetch_timestamp.

  • date_used is the dating scheme that defines the metric. For testing, common dating schemes are Specimen Collection and Test Result; for cases, Specimen Collection, Test Result and Illness Onset; and for deaths, a common dating scheme is Death, which specifies the date of death.
  • timestamp is the state-assigned timestamp of this data point
  • fetch_timestamp is the timestamp when we collected the data from the state. For each fetch_timestamp we have the entire time series as it was fetched on that day.
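Because every fetch stores the whole time series, a common first step is to keep only the most recent fetch for each state and dating scheme. The sketch below shows one way to do that with a groupby; the rows are hypothetical and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows: Ohio fetched on two days, Washington on one.
rows = [
    # state, date_used, timestamp, fetch_timestamp, death
    ("OH", "Death", "2020-12-01", "2021-05-30", 10),
    ("OH", "Death", "2020-12-02", "2021-05-30", 22),
    ("OH", "Death", "2020-12-01", "2021-05-31", 12),
    ("OH", "Death", "2020-12-02", "2021-05-31", 25),
    ("WA", "Death", "2020-12-01", "2021-05-29", 5),
]
df = pd.DataFrame(
    rows, columns=["state", "date_used", "timestamp", "fetch_timestamp", "death"]
)
for col in ("timestamp", "fetch_timestamp"):
    df[col] = pd.to_datetime(df[col])

# Most recent fetch per state and dating scheme, aligned to every row.
newest = df.groupby(["state", "date_used"])["fetch_timestamp"].transform("max")

# Keep only the rows that belong to that most recent fetch.
latest = df[df["fetch_timestamp"] == newest]
print(latest)
```

Grouping per state matters when states are not all fetched on the same day; a single global maximum would drop any state missing from the last fetch.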

Processing

The processing that went into the metrics presented here was minimal:

  • Mapping states' metric names into CTP names
  • Calculating cumulative sums when the state provides only daily values. This limits accuracy for states that provide only daily numbers without the beginning of the time series (e.g., ID), because the cumulative sum then starts from an incomplete history
  • Cleanup of dates before 2019, which are likely data-entry mistakes; these are reported on 2020-01-01 (e.g., MO)
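The steps above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the project's actual pipeline: the state-side column name `daily_deaths`, the sample values, and the exact cutoff date are all hypothetical.

```python
import pandas as pd

# Hypothetical state feed with daily (not cumulative) values and one
# obviously mistyped date.
daily = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["1920-03-05", "2020-03-01", "2020-03-02", "2020-03-03"]
    ),
    "daily_deaths": [1, 2, 0, 3],  # hypothetical state metric name
})

# Dates before 2019 are treated as input mistakes and moved to 2020-01-01.
cutoff = pd.Timestamp("2019-01-01")
daily.loc[daily["timestamp"] < cutoff, "timestamp"] = pd.Timestamp("2020-01-01")

# Map the state's metric name to the CTP field name, combine rows that
# now share a date, and order by date.
cumulative = (
    daily.rename(columns={"daily_deaths": "death"})
         .groupby("timestamp", as_index=False)["death"].sum()
         .sort_values("timestamp")
)

# Turn daily values into a running total.
cumulative["death"] = cumulative["death"].cumsum()
print(cumulative)
```

If the feed starts partway through the pandemic, the running total begins at zero on the first available date, which is the limitation noted for states such as ID.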

Examples

Test Results

The number of daily tests and test results is a metric that updates continuously because of different lab reporting schedules, reporting delays, and processing times (the time needed to get the test results).

We can use this data to show the continuous updates to daily testing numbers. This example looks at PCR testing in Washington state.

from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib import cm
import numpy as np
import pandas as pd


df = pd.read_csv('wa_avocado_complete.csv', parse_dates=['fetch_timestamp', 'timestamp'])

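# Use only the tests dated by specimen collection; copy so the new column
# is added to an independent frame rather than a slice of df
tests = df[df['date_used'] == 'Specimen Collection'].copy()
tests['tests'] = tests['positiveTestsViral'] + tests['negativeTestsViral']

# Look at the data for February 2021, as reported in February and March 2021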
tests = tests[(tests['fetch_timestamp'] < datetime(2021,4,1)) &
              (tests['fetch_timestamp'] >= datetime(2021, 2, 1))
             ].pivot_table(index='timestamp', columns='fetch_timestamp', values='tests'
                          ).loc[datetime(2021,2,1):datetime(2021,3,1)]

# Animate the results
class LineAnimation:
  def __init__(self, ax, data):
    self.lines = [ax.plot([], [], color=cm.Blues_r(np.linspace(0, 1, 10)[l]))[0] for l in range(10)]

    self.ax = ax
    self.data = data
    self.x = data.index

    # Set up plot parameters
    self.ax.set_xlim(self.data.index.min(), self.data.index.max())
    self.ax.set_ylim(4000000, 6000000)
    self.fetch_timestamp = ax.text(0.05, 0.9, '', transform=ax.transAxes)

  def __call__(self, i):
    # fill all lines:
    for line_index, col_index in enumerate(range(i, max(i-10, -1), -1)):
      self.lines[line_index].set_data(self.data.index, self.data.iloc[:, col_index])

    self.fetch_timestamp.set_text(self.data.columns[i])
    # Return every artist that changed so blitting redraws all of them
    return (*self.lines, self.fetch_timestamp)

fig, ax = plt.subplots(figsize = (21, 9))
all_these_lines = FuncAnimation(fig, LineAnimation(ax, tests), frames=len(tests.columns),
                                                     interval=240, repeat=False, blit=True)
all_these_lines.save('wa_february_2021_testing.gif')

[Animation: wa_february_2021_testing.gif]

Death Reporting

Accurate death reporting takes time, and the real-time data states report is always incomplete.
We can compare this preliminary data reported by states and collected by the COVID Tracking Project to the revised data states publish.

from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.dates as mdates

import pandas as pd


ctp_df = pd.read_csv(
    'https://api.covidtracking.com/v1/states/oh/daily.csv',
    parse_dates=['date'], index_col='date', usecols=['date', 'death'])

latest_df = pd.read_csv(
    'oh_avocado_complete.csv',
    parse_dates=['timestamp', 'fetch_timestamp'], usecols=['fetch_timestamp', 'timestamp', 'death', 'date_used'])
# Get only the most recent time series for death by day of death
latest_df = latest_df[(latest_df['fetch_timestamp'] == latest_df['fetch_timestamp'].max()) & (latest_df['date_used'] == 'Death')
                     ].drop(columns=['fetch_timestamp', 'date_used']).set_index('timestamp')

# Concat the two series, sort by date, look at the daily diff, and use only 2020 data
df = pd.concat([latest_df, ctp_df], axis=1).sort_index().diff().loc[:datetime(2020, 12, 31)]
df.columns = ['Latest', 'Reported']

fig, ax = plt.subplots(figsize=(21, 9))
# Label each artist directly so the legend does not depend on plotting order
ax.bar(df.index, df['Reported'], width=1, linewidth=0, color=cm.Blues(0.3),
       label='Reported by the state and collected by CTP')
ax.plot(df.index, df['Latest'], lw=3, color=cm.Blues(0.8),
        label='Death by date of death (most recent)')

ax.set_title('Daily COVID-19 deaths in Ohio in 2020', ha='center', size=22)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
ax.legend(loc='upper left')
plt.margins(x=0)
plt.savefig('oh_deaths.png')

[Figure: oh_deaths.png]

Other Resources