Analyzing Power Outages in The Continental US

Matteo Perona

The dataset I will be using covers major outages observed in the continental U.S. These data contain geographical location, regional climatic info, land-use characteristics, electricity consumption patterns, and economic characteristis. I will be using these data to answer a fundamental question about outages in the continental United States.

Do states with a high number of outages per capita have higher instances of intentional attack?

Now, this question may sound a little bit contrived at first, but it attempts to distinguish a relevant issue from the umbrella-term power outage. It brings us one step closer to understanding why some states have disproportionately high instances of power outage for their population. I hope this incentivises you enough to read on.

Relevant Information About these Data

Shape: 1534 rows × 54 columns

Relevant Columns: U.S._STATE, CAUSE.CATEGORY

DF Head with Relevant Columns

YEAR	MONTH	U.S._STATE	CAUSE.CATEGORY
2011	7	Minnesota	severe weather
2014	5	Minnesota	intentional attack
2010	10	Minnesota	severe weather
2012	6	Minnesota	severe weather
2015	7	Minnesota	severe weather

Relevant Python Packages

import pandas as pd
import numpy as np
import os

import plotly.express as px
pd.options.plotting.backend = 'plotly'

I opted for plotly as pandas' plotting backend over matplotlib since it has a more modern look and it's functions are better suited to this project data visualization needs.

Data Cleaning and EDA

Fortunately, data cleaning was relatively simple for this project.

Step 1. Pruning by Hand

I downloaded all the data from this link and opened it with numbers on my Mac. Then, I simply removed the first column, which contained unnecessary descriptive information about the dataset, and the first three rows, which contained no data with the first column removed. Additionally, I removed the column beneath the row containing column names (column 7 in the unchanged file); it contained units for columns with quantitative data. Once I was done pruning the dataset by hand I exported it as a CSV file and saved it to my project folder.

Setp 2. Reformatting Outage Start and Restoration Dates and Times

I first imported the data like so:

outage_fp = os.path.join('data', 'outage.csv')
outage = pd.read_csv(outage_fp)

The relevant columns to modify were:

OUTAGE.START.DATE	OUTAGE.START.TIME	OUTAGE.RESTORATION.DATE	OUTAGE.RESTORATION.TIME
Friday, July 01, 2011	5:00:00 PM	Sunday, July 03, 2011	8:00:00 PM
Sunday, May 11, 2014	6:38:00 PM	Sunday, May 11, 2014	6:39:00 PM
Tuesday, October 26, 2010	8:00:00 PM	Thursday, October 28, 2010	10:00:00 PM
Tuesday, June 19, 2012	4:30:00 AM	Wednesday, June 20, 2012	11:00:00 PM
Saturday, July 18, 2015	2:00:00 AM	Sunday, July 19, 2015	7:00:00 AM

I wanted to combine the outage start and restoration dates with their respective times and convert them to timestamps assigning these new series to two columns -- 'OUTAGE.START' and 'OUTAGE.RESTORATION' respectively -- while dropping the old ones. The process went as follows.

First, I added the series' date and time strings together:

(outage['OUTAGE.START.DATE'] + ", " + outage['OUTAGE.START.TIME'])

Next, I applied a pd.Timestamp() to each element of the resulting series to convert each string to a timestamp:

.apply(lambda x: pd.Timestamp(x))

Finally, we assign the new steries to new columns and drop the old columns from out original dataframe:

outage['OUTAGE.START'] = ...
outage['OUTAGE.RESTORATION'] = ...
outage = outage.drop(columns=['OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE', 'OUTAGE.RESTORATION.TIME'])

Overall, the operations follow like so:

outage['OUTAGE.START'] = (outage['OUTAGE.START.DATE'] + ", " + outage['OUTAGE.START.TIME']) \
    .apply(lambda x: pd.Timestamp(x))

outage['OUTAGE.RESTORATION'] = (outage['OUTAGE.RESTORATION.DATE'] + ", " + outage['OUTAGE.RESTORATION.TIME']) \
    .apply(lambda x: pd.Timestamp(x))

outage = outage.drop(columns=['OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE', 'OUTAGE.RESTORATION.TIME'])

Output of these operations:

OUTAGE.RESTORATION	OUTAGE.START
1.30972e+18	1.30954e+18
1.39983e+18	1.39983e+18
1.2883e+18	1.28812e+18
1.34023e+18	1.34008e+18
1.43729e+18	1.43718e+18

The values above are of type datetime64[ns].

Step 3. Checking Column Dtypes and Assessing Null Values

This final step of cleaning was the most painless because the data were already formatted very cleanly. I first looked through each column's datatype using outage.dtypes:

	0
OBS	int64
YEAR	int64
MONTH	float64
U.S._STATE	object
...
PCT_WATER_TOT	float64
PCT_WATER_INLAND	float64
OUTAGE.START	datetime64[ns]
OUTAGE.RESTORATION	datetime64[ns]

I went through each column and made sure that the data types were appropriate for the values in each. I did not have to make any changes to datatypes. Next, I looked through the unique values for each column using outage.apply(lambda col: col.unique()) to find any nan placeholders like '-' and to, again, make sure that the data types for each column suited the values. I would include the output here, but it was far too long and unappealing (please visit the notebook if you are curious).

EDA

The EDA I will cover here is abridged to focus on points relevant to the question posed above. If you are curious please check out the notebook in my repo.

Warm Up

To start our EDA let's do some univariate analysis on columns that aren't necessarily related to out question to warm up our analytical muscles.

Outages by State

Here we will break down outage data for each state.

Outages by Cause Category

Breaking down the distributions for the causes of each observed power outage.

States and Causes Together

For this section we need to create a pivot table.

cause_by_state = pd.pivot_table(outage, columns=['CAUSE.CATEGORY'], index=['U.S._STATE'], values='OBS', aggfunc='count')
cause_by_state = cause_by_state.assign(total=cause_by_state.sum(axis=1))
cause_by_state = cause_by_state.sort_values(by='total').drop(columns=['total'])

Output dataframe head:

U.S._STATE	equipment failure	fuel supply emergency	intentional attack	islanding	public appeal	system operability disruption
Alaska	1	0	0	0	0	0
South Dakota	0	0	0	2	0	0
North Dakota	0	1	0	0	1	0
Montana	0	0	1	2	0	0
Mississippi	0	0	3	0	0	1

Assessment of Missingness

In this section, I will describe the missingness of a few key columns in our data.

Visualize Missingness

# counts all null values in each column outages
null_counts = outage.isna().sum().sort_values()
# plots hbar of missing value conts for each column with missing values
fig = null_counts[null_counts > 0].plot(
    kind='barh', 
    width=1100, 
    height=600,
    title='Number of Missing Values by Column in the Outage Dataframe',
)

NMAR Analysis

You might be asking yourself: Why are there so many missing values in the HURRICANE.NAMES column? Well, one phenomenon that could explain this is NMAR (not missing at random): when reason for missingness in a column can be determined by looking at the column itself. The HURRICANE.NAMES column jumps out as being NMAR since, when collecting the data, if the outage did not occur as a result of a hurricane there will not be a name to report. In this way, the missingness is dependent on the column itself. Note: It is also on columns like CAUSE.CATEGORY.DETAILS of CAUSE.CATEGORY which both contain information about the cause of the outage, but -- crucially -- the missing values in this column can be described by looking at the column itself.

Missingness Dependency

In this section, we will try to show that the CAUSE.CATEGORY.DETAILS column is MAR dependent on the CAUSE.CATEGORY column using permutation tests. We will also attempt and fail at finding a column that it is independent from CAUSE.CATEGORY.DETAILS using permutation tests.

Testing Dependent Case

First we will attempt to show depencency between CAUSE.CATEGORY.DETAILS and CAUSE.CATEGORY.

Hypothesis

Null Hypothesis: The distribution of CAUSE.CATEGORY is the same when CAUSE.CATEGORY.DETAIL name is missing and when it is not missing
Alt Hypothesis: The distribution of CAUSE.CATEGORY is different when CAUSE.CATEGORY.DETAIL name is missing as opposed to when it is not missing
Significance Level: 0.05

Code for generating the observed distributions

# Variables for columns we are testing to enable hot switching
indep_var = 'CAUSE.CATEGORY'
q_var = 'CAUSE.CATEGORY.DETAIL'

# generate pivot table to display missingness of q_var with relation to indep_var
df_indep = (
    outage
    .assign(missing = outage[q_var].isna())
    .pivot_table(index=indep_var, columns='missing', aggfunc='size')
).fillna(0)

# normalize the df
df_indep = df_indep / df_indep.sum()
# plot the distribution
df_indep.plot(
    kind='barh', 
    title='Observed Distribution of Hurricane Names Conditional on Cause Category Detail',
    barmode='group',
    height=1100  
    )

Output DF

CAUSE.CATEGORY	False	True
equipment failure	0.0451552	0.0254777
fuel supply emergency	0.0301035	0.0403397
intentional attack	0.348071	0.101911
islanding	0	0.0976645
public appeal	0	0.146497
severe weather	0.541863	0.397028
system operability disruption	0.0348071	0.191083

Output Plot

Here, we can see the observed distributions of both missing and non missing CAUSE.CATEGORY.DETAILS across CAUSE.CATEGORY.

Calculate Observed TVD

observed_tvd = df_indep.diff(axis=1).iloc[:, -1].abs().sum() / 2
observed_tvd

We use total variation distance as our test statistic since it is an effective way to measure the difference between two distributions. Read more about TVD here Observed TVD: 0.41067323382726845

Simulation

The python script below shuffles the CAUSE.CATEGORY column n_repetitions times. Each time, it calculates the permutation's TVD and adds it to the tvds list. This gives us an idea of what the TVD's would look like if the distrinution of null/non-null values were completely random.

# number of times to repeat permutations and calculate TVD
n_repetitions = 500
shuffled = outage.copy()

tvds = []
for _ in range(n_repetitions):
    # Create permutation
    shuffled[indep_var] = np.random.permutation(shuffled[indep_var])
    
    # Computing and storing the TVD.
    pivoted = (
        shuffled
        .assign(missing = shuffled[q_var].isna())
        .pivot_table(index=indep_var, columns='missing', aggfunc='size')
        .apply(lambda x: x / x.sum())
    )
    
    tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)

Plotting Empirical Distribution of Calculated TVDs

Calculating p-value

np.mean(np.array(tvds) >= observed_tvd)

Out: 0.0

Conclusions

P-val is less than our significance level of 0.05, so we reject the null hypothesis. The CAUSE.CATEGORY.DETAILS column is likely to be dependent on CAUSE.CATEGORY.

Testing Independent Case

I could not find a single independent case for CAUSE.CATEGORY.DETAILS with any other column in the outage dataframe.

Hypothesis Testing

In this section I will be attempting to answer my original question: Do states with a high number of outages per capita have higher instances of intentional attack? I will be using a permutation show that it is very unlikely for outages per capita by state and proportion of outages cause by intentional attack by state to come from the same distribution. Hence, there is a likely a relationship between the number of outages per capita and the proportion of outages cause by intentional attack in each state.
Note: I already went over how permutation tests work in the previous section so I will let the plots do most of the talking in this last section.

Hypothesis

Null: The number of outages by state per capita comes from the same distribution as the proportion of outages that are caused by intentional attacks by state.
Alt: The number of outages by state per capita and the proportion of outages that are caused by intentional attacks by state come from different distributions.
Significance level: 0.05

Generate Observed Distributions

Code

# number of outages by state per capita
X = outage.groupby(by='U.S._STATE').count().OBS / outage.groupby(by='U.S._STATE').mean().POPULATION

# proportion of outages that are caused by intentional attack by state 
cause_by_state = pd.pivot_table(outage, columns=['CAUSE.CATEGORY'], index=['U.S._STATE'], values='OBS', aggfunc='count').fillna(0)
Y = cause_by_state['severe weather'] / cause_by_state.sum(axis=1)

# put X and Y into a df
df = pd.DataFrame().assign(out_per_cap=X, prop_attack=Y)
# normalize the df
df = df / df.sum(axis=0)

# plot the distributions
fig = df.plot(kind='barh', height=1000, barmode='group', title='Observed Distributions of Outages by State Per Capita and Prop Outages Caused by Intentional Attack')
fig.write_html('./assets/hyp-test-observed.html', include_plotlyjs='cdn')
fig

Out Graph

The plot above shows the distributions of outages per capita and proportion of outages caused by intentional attack by state (each normalized).

Out Table Head

U.S._STATE	out_per_cap	prop_attack
Alabama	0.00414147	0.0367439
Alaska	0.0051098	0
Arizona	0.0142132	0.00629895
Arkansas	0.0274285	0.0176371
California	0.0181408	0.0146976

Observed TVD

observed_tvd = df.diff(axis=1).iloc[:, -1].abs().sum() / 2

Out: 0.48123557430777736

Permutations

Code

n_repetitions = 500
shuffled = outage.copy()

tvds = []
for _ in range(n_repetitions):
    shuffled['U.S._STATE'] = np.random.permutation(shuffled['U.S._STATE'])

    # number of outages by state per capita
    X = shuffled.groupby(by='U.S._STATE').count().OBS / shuffled.groupby(by='U.S._STATE').mean().POPULATION

    # proportion of outages that are caused by intentional attack by state 
    cause_by_state = pd.pivot_table(shuffled, columns=['CAUSE.CATEGORY'], index=['U.S._STATE'], values='OBS', aggfunc='count').fillna(0)
    Y = cause_by_state['severe weather'] / cause_by_state.sum(axis=1)

    # put X and Y into a df
    df = pd.DataFrame().assign(out_per_cap=X, prop_attack=Y)
    # normalize the df
    df = df / df.sum(axis=0)
    
    tvd = df.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)

Empirical Distribution of TVD

The graph above

p-value

np.mean(np.array(tvds) >= observed_tvd)

Out: 0.0

Conclusion

Our p-value 0.0 is less than our significance level of 0.05, so we reject the null hypothesis. We can conclude that the number of outages per capita and the proportion of outages caused by intentional attack in each state are likely drawn from different distributions. What does this tell us about out original question?

Do states with a high number of outages per capita have higher instances of intentional attack?

While we can say nothing for certain, the permutation tests above indicate that there is likely a relationship between the number of outages per capita and seeing higher instances of intentional attack. In the future I hope to explore this subject in greater depth, and hopefully gain some insight into why Delaware's proportion of outages caused by intentional attack and their outage per capita are both so high compared to their respecitve means.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.vscode		.vscode
assets		assets
.DS_Store		.DS_Store
PowerOutageAnalysis.pdf		PowerOutageAnalysis.pdf
README.md		README.md
_config.yml		_config.yml

MatteoPerona/usa-power-outage-analysis

Folders and files

Latest commit

History

Repository files navigation

Analyzing Power Outages in The Continental US

Matteo Perona

Do states with a high number of outages per capita have higher instances of intentional attack?

Relevant Information About these Data

DF Head with Relevant Columns

Relevant Python Packages

Data Cleaning and EDA

Step 1. Pruning by Hand

Setp 2. Reformatting Outage Start and Restoration Dates and Times

The relevant columns to modify were:

First, I added the series' date and time strings together:

Next, I applied a pd.Timestamp() to each element of the resulting series to convert each string to a timestamp:

Finally, we assign the new steries to new columns and drop the old columns from out original dataframe:

Overall, the operations follow like so:

Output of these operations:

Step 3. Checking Column Dtypes and Assessing Null Values

EDA

Warm Up

Outages by State

Outages by Cause Category

States and Causes Together

Output dataframe head:

Assessment of Missingness

Visualize Missingness

NMAR Analysis

Missingness Dependency

Testing Dependent Case

Hypothesis

Code for generating the observed distributions

Output DF

Output Plot

Calculate Observed TVD

Simulation

Plotting Empirical Distribution of Calculated TVDs

Calculating p-value

Conclusions

Testing Independent Case

Hypothesis Testing

Hypothesis

Generate Observed Distributions

Code

Out Graph

Out Table Head

Observed TVD

Permutations

Code

Empirical Distribution of TVD

p-value

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages