ELEX-2763-estimandizer #59

Merged
merged 48 commits into from
Sep 13, 2023
Changes from 14 commits
Commits
0d523c3
set up class
rishasurana Jun 29, 2023
68719ef
set-up for linter
rishasurana Jun 29, 2023
25b2311
update cli
rishasurana Jun 29, 2023
bc5aea6
base concept
rishasurana Jul 5, 2023
144d403
test updates
rishasurana Aug 2, 2023
4ed0b0f
Merge branch 'develop' into estimandizer
rishasurana Aug 2, 2023
83f9e66
update pre-commit
rishasurana Aug 2, 2023
dd2688b
Update cli.py
rishasurana Aug 3, 2023
fe66b0b
Merge branch 'estimandizer' of https://github.com/washingtonpost/elex…
rishasurana Aug 3, 2023
3760bf4
test update
rishasurana Aug 3, 2023
b4b1c21
updates
rishasurana Aug 4, 2023
bf93a44
column updates
rishasurana Aug 7, 2023
d32648e
spacing
rishasurana Aug 7, 2023
24ff8f2
pre-commit
rishasurana Aug 7, 2023
85e6fcd
adding to client
rishasurana Aug 9, 2023
e3d56ad
int tests
rishasurana Aug 9, 2023
9ad2aea
naming and type updates
rishasurana Aug 10, 2023
72d459b
Merge branch 'develop' into estimandizer
rishasurana Aug 14, 2023
ea8be2a
cli updates
rishasurana Aug 14, 2023
45062de
linter
rishasurana Aug 14, 2023
76440cb
cli given func
rishasurana Aug 14, 2023
865ed78
linter
rishasurana Aug 14, 2023
85393e2
election type updates
rishasurana Aug 18, 2023
f1f1427
comments
rishasurana Aug 18, 2023
722be3c
Merging in changes from develop
dmnapolitano Sep 5, 2023
aac2db0
Fixing unit tests in tests/handlers/test_config.py
dmnapolitano Sep 5, 2023
50086d8
Fixing bad formatting
dmnapolitano Sep 5, 2023
36ba5c3
Setting a seed on the shuffle of the data happening during test_winso…
dmnapolitano Sep 5, 2023
acc8a97
Removing stray print statement in cli.py
dmnapolitano Sep 6, 2023
3754661
Removing test_winsorize_intervals() from tests/test_client.py since i…
dmnapolitano Sep 7, 2023
717147d
Squashing some tox warnings
dmnapolitano Sep 7, 2023
321ab4d
Removing estimand_fns
dmnapolitano Sep 7, 2023
32b3577
Silencing a tox warning in src/elexmodel/cli.py
dmnapolitano Sep 7, 2023
c858ff9
Cut down on redundancy by using class members in LiveData and Preproc…
dmnapolitano Sep 7, 2023
9d5c101
Some progress consolidating estimand-creation and estimand-checking c…
dmnapolitano Sep 7, 2023
0954f84
Some flake8 code cleanup
dmnapolitano Sep 7, 2023
a9a2354
Now creating estimands from functions :D
dmnapolitano Sep 7, 2023
ff1ec13
Cleaning up comments; adding Estimandizer unit tests that reflect thi…
dmnapolitano Sep 7, 2023
f51bf1c
Adding the Custom Estimands section to the README
dmnapolitano Sep 7, 2023
51bd46f
Reducing the ValueError for when unknown estimands appear that aren't…
dmnapolitano Sep 7, 2023
39ed3bc
Correct leftover spelling mistake in tests/test_client.py
dmnapolitano Sep 8, 2023
90e959d
Getting the new Estimandizer stuff to work with historical elections …
dmnapolitano Sep 8, 2023
1772f08
Fixed runs with CombinedDataHandler
dmnapolitano Sep 8, 2023
9b2deeb
Rolling back changes to test_combined_data and test_featurizer made i…
dmnapolitano Sep 11, 2023
b512324
If we had to add the baseline column (for example when merging two da…
dmnapolitano Sep 11, 2023
99b0a52
Fixing logic in the columns being created during estimandization
dmnapolitano Sep 12, 2023
aacf4cc
Handle case of the estimand existing but not the required results_est…
dmnapolitano Sep 12, 2023
77a14d8
Correcting some estimandizer instructions in the README
dmnapolitano Sep 13, 2023
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -36,4 +36,4 @@ repos:
   # these are errors that will be ignored by flake8
   # definitions here
   # https://flake8.pycqa.org/en/latest/user/error-codes.html
-  - "--ignore=E266,E501,W503"
+  - "--ignore=E266,E501,W503,F811,C901"
2 changes: 1 addition & 1 deletion setup.cfg
@@ -7,7 +7,7 @@ max-line-length = 120
 [pylint]
 max-line-length = 120
 good-names= on, x, df, NonparametricElectionModel, GaussianElectionModel,
-    BaseElectionModel, qr, X, y, f, LiveData, n, Featurizer, fe, PreprocessedData, CombinedData,
+    BaseElectionModel, qr, X, y, f, LiveData, n, Featurizer, Estimandizer, fe, PreprocessedData, CombinedData,
     ModelResults, GaussianModel, MODEL_THRESHOLD, LOG, w, df_X, df_y, v, n, g, a, b
 disable=missing-function-docstring, missing-module-docstring, missing-class-docstring, #missing
     too-many-arguments, too-many-locals, too-many-branches, too-many-instance-attributes, too-many-statements, #structure: too-many
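For context on the `.pre-commit-config.yaml` change above: F811 is pyflakes' "redefinition of unused name" check and C901 is the McCabe "function is too complex" check. The diff does not say why they were silenced. A small illustrative sketch of the pattern F811 reports, using a hypothetical function name:

```python
def greet():
    return "first"


# flake8 reports F811 on the next definition: `greet` is redefined before
# the first definition was ever used, so the first one is dead code.
def greet():
    return "second"


# Python itself accepts this silently; the later definition wins.
print(greet())
```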
184 changes: 184 additions & 0 deletions src/elexmodel/handlers/data/Estimandizer.py
@@ -0,0 +1,184 @@
import numpy as np


class Estimandizer:
    """
    Estimandizer. Generate estimands explicitly.
    """

    def __init__(self, data_handler, estimands, given_function_dict=None):
        self.data_handler = data_handler
        self.estimands = estimands
        self.transformations = []
        # Build a fresh dict per instance instead of sharing a mutable default argument
        self.given_function_dict = given_function_dict if given_function_dict is not None else {}
        self.transformation_map = {
            "margin": [self.calculate_margin],
            "voter_turnout_rate": [self.calculate_voter_turnout_rate],
            "standardized_income": [self.standardize_median_household_income],
            "age_groups": [self.create_age_groups],
            "party_vote_share": [self.calculate_party_vote_share],
            "education_impact": [self.calculate_party_vote_share, self.analyze_education_impact],
            "gender_turnout_disparity": [self.investigate_gender_turnout_disparity],
            "ethnicity_voting_patterns": [self.calculate_party_vote_share, self.examine_ethnicity_voting_patterns],
            "income_impact": [
                self.calculate_party_vote_share,
                self.standardize_median_household_income,
                self.explore_income_impact,
            ],
            "candidate": [self.candidate],
        }

    def pre_check_estimands(self, election_id):
        """
        Ensure estimand isn't one of the pre-specified values that are already included
        """
        standard = ["dem_votes", "gop_votes", "total_votes"]
        if not self.check_input_columns(standard, election_id):
            self.create_estimand(None, self.standard)

    def check_input_columns(self, columns, election_id):
        """
        Check that input columns contain all necessary values for a calculation
        """
        missing_columns = []
        if election_id == "G":
            missing_columns = [col for col in columns if col not in self.data_handler.data.columns]
        elif election_id == "P":
            missing_columns = [
                col for col in columns if col not in self.data_handler.data[self.data_handler.election_id]
            ]
        return len(missing_columns) == 0

    def verify_estimand(self, estimand, election_id):
        """
        Verify which estimands can be formed given a dataset and a list of estimands we would like to create
        """
        if estimand not in self.transformation_map:
            raise ValueError(f"Estimand '{estimand}' is not supported.")
        self.transformations = self.transformation_map[estimand]

        if not self.check_input_columns(
            [col for transform in self.transformations for col in transform.__code__.co_varnames[1:]], election_id
        ):
            return []

        return self.transformations

    def create_estimand(self, estimand=None, given_function=None):
        """
        Create an estimand. You must give either an estimand name or a pre-written function.
        """
        if estimand is None and given_function is not None:
            given_function()
        elif given_function is None and estimand is not None:
            if estimand in self.transformation_map:
                if self.transformation_map[estimand][0] in self.transformations:
                    transformation_func = self.transformations[
                        self.transformations.index(self.transformation_map[estimand][0])
                    ]
                    transformation_func()

    def generate_estimands(self, election_id):
        """
        Main function to generate estimands
        """
        if election_id == "G":
            self.pre_check_estimands(election_id)

        # Option 1: Pass in a dict of new functions of estimands we want to build
        if self.given_function_dict != {}:
            for estimand, function in self.given_function_dict.items():
                self.verify_estimand(estimand, election_id)
                self.create_estimand(None, function)

        # Option 2: Pass in a list of estimands we want to build from a pre-set list
        for estimand in self.estimands:
            self.verify_estimand(estimand, election_id)
            self.create_estimand(estimand, None)
        return self.data_handler

    # Transformation methods
    def standard(self):
        """
        Create/overwrite the standard estimands: ["dem_votes", "gop_votes", "total_votes"]
        """
        if "results_turnout" in self.data_handler.data.columns:
            self.data_handler.data["dem_votes"] = self.data_handler.data["results_dem"]
            self.data_handler.data["gop_votes"] = self.data_handler.data["results_gop"]
            self.data_handler.data["total_votes"] = self.data_handler.data["results_turnout"]
        else:
            self.data_handler.data["dem_votes"] = None
            self.data_handler.data["gop_votes"] = None
            self.data_handler.data["total_votes"] = None

    def calculate_margin(self):
        self.data_handler.data["margin"] = self.data_handler.data["dem_votes"] - self.data_handler.data["gop_votes"]

    def calculate_voter_turnout_rate(self):
        self.data_handler.data["voter_turnout_rate"] = (
            self.data_handler.data["total_votes"] / self.data_handler.data["total_gen_voters"]
        )

    def standardize_median_household_income(self):
        mean_income = self.data_handler.data["median_household_income"].mean()
        std_income = self.data_handler.data["median_household_income"].std()
        self.data_handler.data["standardized_income"] = (
            self.data_handler.data["median_household_income"] - mean_income
        ) / std_income

    def create_age_groups(self):
        self.data_handler.data["age_group_under_30"] = np.where(self.data_handler.data["age_le_30"] == 1, 1, 0)
        self.data_handler.data["age_group_30_45"] = np.where(self.data_handler.data["age_geq_30_le_45"] == 1, 1, 0)
        self.data_handler.data["age_group_45_65"] = np.where(self.data_handler.data["age_geq_45_le_65"] == 1, 1, 0)
        self.data_handler.data["age_group_over_65"] = np.where(self.data_handler.data["age_geq_65"] == 1, 1, 0)

    def calculate_party_vote_share(self):
        self.data_handler.data["party_vote_share_dem"] = (
            self.data_handler.data["dem_votes"] / self.data_handler.data["total_votes"]
        )
        self.data_handler.data["party_vote_share_gop"] = (
            self.data_handler.data["gop_votes"] / self.data_handler.data["total_votes"]
        )

    def analyze_education_impact(self):
        self.data_handler.data["education_impact_dem"] = (
            self.data_handler.data["percent_bachelor_or_higher"] * self.data_handler.data["party_vote_share_dem"]
        )
        self.data_handler.data["education_impact_gop"] = (
            self.data_handler.data["percent_bachelor_or_higher"] * self.data_handler.data["party_vote_share_gop"]
        )

    def investigate_gender_turnout_disparity(self):
        self.data_handler.data["gender_turnout_disparity"] = (
            self.data_handler.data["gender_f"] - self.data_handler.data["gender_m"]
        )

    def examine_ethnicity_voting_patterns(self):
        ethnicities = [
            "east_and_south_asian",
            "european",
            "hispanic_and_portuguese",
            "likely_african_american",
            "other",
            "unknown",
        ]
        for ethnicity in ethnicities:
            self.data_handler.data[f"vote_share_{ethnicity}"] = (
                self.data_handler.data[f"ethnicity_{ethnicity}"] * self.data_handler.data["total_votes"]
            )

    def candidate(self):
        election_data = self.data_handler.data[self.data_handler.election_id][0]
        candidate_data = election_data["baseline_pointer"]
        # cand_set = set(candidate_data)
        for cand_name in candidate_data:
            if cand_name != "turnout":
                election_data[cand_name] = candidate_data[cand_name]

    def explore_income_impact(self):
        self.data_handler.data["income_impact_dem"] = (
            self.data_handler.data["standardized_income"] * self.data_handler.data["party_vote_share_dem"]
        )
        self.data_handler.data["income_impact_gop"] = (
            self.data_handler.data["standardized_income"] * self.data_handler.data["party_vote_share_gop"]
        )
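The `transformation_map` in the file above pairs each estimand name with an ordered list of transformation steps. The following is a minimal, standalone sketch of that dispatch pattern, not the code in this PR: plain dictionaries of lists stand in for the data handler's DataFrame, and the names `PIPELINES`, `build_estimands`, and the `add_*` helpers are hypothetical. Unlike `create_estimand`, which calls only the first entry of the list, the sketch runs every step in order; that is an assumption about the intended behaviour.

```python
from typing import Callable, Dict, List

# Column name -> values. A stand-in (assumption) for the pandas DataFrame
# held by the data handler in the PR.
Table = Dict[str, List[float]]


def add_margin(table: Table) -> None:
    """Democratic minus Republican votes, row by row."""
    table["margin"] = [d - g for d, g in zip(table["dem_votes"], table["gop_votes"])]


def add_vote_share(table: Table) -> None:
    """Each party's share of the total vote, row by row."""
    totals = table["total_votes"]
    table["party_vote_share_dem"] = [d / t for d, t in zip(table["dem_votes"], totals)]
    table["party_vote_share_gop"] = [g / t for g, t in zip(table["gop_votes"], totals)]


def add_education_impact(table: Table) -> None:
    """Depends on the vote-share columns, so it must run after add_vote_share."""
    edu = table["percent_bachelor_or_higher"]
    table["education_impact_dem"] = [e * s for e, s in zip(edu, table["party_vote_share_dem"])]
    table["education_impact_gop"] = [e * s for e, s in zip(edu, table["party_vote_share_gop"])]


# Hypothetical registry: estimand name -> ordered list of steps.
PIPELINES: Dict[str, List[Callable[[Table], None]]] = {
    "margin": [add_margin],
    "party_vote_share": [add_vote_share],
    "education_impact": [add_vote_share, add_education_impact],
}


def build_estimands(table: Table, estimands: List[str]) -> Table:
    """Run every step of every requested estimand, in order."""
    for estimand in estimands:
        if estimand not in PIPELINES:
            raise ValueError(f"Estimand '{estimand}' is not supported.")
        for step in PIPELINES[estimand]:
            step(table)
    return table


example = {
    "dem_votes": [60.0, 30.0],
    "gop_votes": [40.0, 70.0],
    "total_votes": [100.0, 100.0],
    "percent_bachelor_or_higher": [0.5, 0.2],
}
build_estimands(example, ["margin", "education_impact"])
```

Keeping prerequisites inside each list means a caller can request `education_impact` without knowing it depends on the vote-share columns.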
134 changes: 134 additions & 0 deletions tests/handlers/test_estimandizer.py
@@ -0,0 +1,134 @@
from elexmodel.handlers.data.CombinedData import CombinedDataHandler
from elexmodel.handlers.data.Estimandizer import Estimandizer
from elexmodel.handlers.data.LiveData import MockLiveDataHandler
from elexmodel.handlers.data.PreprocessedData import PreprocessedDataHandler


def test_create_estimand_margin_preprocessed(va_governor_county_data):
    """
    Tests margin estimand generation (for preprocessed data only)

    Structure of a "G" election:
    (['postal_code', 'state_fips', 'county_fips', 'geographic_unit_name',
    'geographic_unit_fips', 'geographic_unit_type', 'county_classification',
    'results_turnout', 'results_dem', 'results_gop', 'baseline_turnout',
    'baseline_dem', 'baseline_gop', 'age_le_30', 'age_geq_30_le_45',
    'age_geq_45_le_65', 'age_geq_65', 'ethnicity_east_and_south_asian',
    'ethnicity_european', 'ethnicity_hispanic_and_portuguese',
    'ethnicity_likely_african_american', 'ethnicity_other',
    'ethnicity_unknown', 'gender_f', 'gender_m', 'gender_unknown',
    'median_household_income', 'percent_bachelor_or_higher',
    'total_age_voters', 'total_eth_voters', 'total_gen_voters'],
    dtype='object')
    """
    va_data_copy = va_governor_county_data.copy()
    election_id = "2017-11-07_VA_G"
    office = "G"
    geographic_unit_type = "county"
    estimands = []
    estimand_baseline = {}

    preprocessed_data_handler = PreprocessedDataHandler(
        election_id, office, geographic_unit_type, estimands, estimand_baseline, data=va_data_copy
    )
    new_estimands = ["margin"]

    estimandizer = Estimandizer(preprocessed_data_handler, new_estimands)
    new_data_handler = estimandizer.generate_estimands("G")

    assert "margin" in new_data_handler.data


def test_create_estimand_voter_turnout_rate(va_governor_county_data):
    """
    Tests voter turnout rate estimand generation on preprocessed data of the VA general
    """
    va_data_copy = va_governor_county_data.copy()
    election_id = "2017-11-07_VA_G"
    office = "G"
    geographic_unit_type = "county"
    estimands = []
    estimand_baseline = {}

    preprocessed_data_handler = PreprocessedDataHandler(
        election_id, office, geographic_unit_type, estimands, estimand_baseline, data=va_data_copy
    )

    new_estimands = ["voter_turnout_rate"]

    estimandizer = Estimandizer(preprocessed_data_handler, new_estimands)
    new_data_handler = estimandizer.generate_estimands("G")

    assert "voter_turnout_rate" in new_data_handler.data


def test_create_estimand_age_combined(va_governor_county_data):
    """
    Tests age bracket estimand generation on a combined data handler
    """
    va_data_copy = va_governor_county_data.copy()
    election_id = "2017-11-07_VA_G"
    office = "G"
    geographic_unit_type = "county"
    estimands = []
    estimand_baseline = {}

    preprocessed_data_handler = PreprocessedDataHandler(
        election_id, office, geographic_unit_type, estimands, estimand_baseline, data=va_data_copy
    )

    live_data_handler = MockLiveDataHandler(election_id, office, geographic_unit_type, estimands, data=va_data_copy)

    current_data = live_data_handler.get_n_fully_reported(n=va_data_copy.shape[0])

    combined_data_handler = CombinedDataHandler(
        preprocessed_data_handler.data,
        current_data,
        estimands,
        "county",
        handle_unreporting="drop",
    )

    new_estimands = ["age_groups"]

    estimandizer = Estimandizer(combined_data_handler, new_estimands)
    new_data_handler = estimandizer.generate_estimands("G")

    assert "age_group_30_45" in new_data_handler.data


def test_candidate(tx_primary_governor_config):
    """
    Tests `{candidate_last_name}_{polID}` estimand generation on a preprocessed data handler for tx primaries

    Structure of a "P" election:
    {'2018-03-06_TX_R': [{
        'office': 'G',
        'states': ['TX'],
        'geographic_unit_types': ['county'],
        'baseline_results_year': 2014,
        'historical_election': [],
        'features': ['age_le_30', 'age_geq_30_le_45', 'age_geq_35_le_65', 'age_geq_65', 'ethnicity_east_and_south_asian', 'ethnicity_hispanic_and_portuguese', 'ethnicity_european', 'ethnicity_likely_african_american', 'ethnicity_other', 'ethnicity_unknown', 'median_household_income', 'percent_bachelor_or_higher'],
        'aggregates': ['postal_code', 'county_classification'], 'fixed_effect': [],
        'baseline_pointer': {'abbott_41404': 'abbott_41404', 'krueger_66077': 'abbott_41404', 'kilgore_57793': 'abbott_41404',
        'turnout': 'turnout'}}]}

    This function adds the combined values for each candidate (ex: all abbott_41404) to the main list under '2018-03-06_TX_R'
    """
    tx_data_copy = tx_primary_governor_config.copy()
    election_id = "2018-03-06_TX_R"
    office = "G"
    geographic_unit_type = "county"
    estimands = []
    estimand_baseline = {}

    preprocessed_data_handler = PreprocessedDataHandler(
        election_id, office, geographic_unit_type, estimands, estimand_baseline, data=tx_data_copy
    )

    new_estimands = ["candidate"]

    estimandizer = Estimandizer(preprocessed_data_handler, new_estimands)
    new_data_handler = estimandizer.generate_estimands("P")

    assert "abbott_41404" in new_data_handler.data[new_data_handler.election_id][0]
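The `candidate` estimand exercised by `test_candidate` copies every entry of `baseline_pointer` except `turnout` into the election's config entry. A minimal sketch of that step on a plain dictionary shaped like the docstring above; the function name `add_candidate_estimands` is hypothetical, and the config is trimmed to the keys the step touches.

```python
def add_candidate_estimands(config, election_id):
    """Copy each non-turnout baseline pointer onto the election's config entry."""
    election = config[election_id][0]
    for cand_name, pointer in election["baseline_pointer"].items():
        if cand_name != "turnout":
            election[cand_name] = pointer
    return config


# Trimmed copy of the "P" election structure from the test docstring.
tx_config = {
    "2018-03-06_TX_R": [
        {
            "office": "G",
            "baseline_pointer": {
                "abbott_41404": "abbott_41404",
                "krueger_66077": "abbott_41404",
                "kilgore_57793": "abbott_41404",
                "turnout": "turnout",
            },
        }
    ]
}

election = add_candidate_estimands(tx_config, "2018-03-06_TX_R")["2018-03-06_TX_R"][0]
```

After the call, `election` carries one key per candidate alongside the original `baseline_pointer`, which is what the final assertion in `test_candidate` checks for `abbott_41404`.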