Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to Featurizer #69

Merged
merged 11 commits into from
Sep 15, 2023
Merged

Updates to Featurizer #69

merged 11 commits into from
Sep 15, 2023

Conversation

lennybronner
Copy link
Collaborator

Description

In order to get the upcoming bootstrap model working I had to make changes to how the Featurizer works. But these changes need to be compatible with our nonparametric and gaussian models. This PR is only for the necessary changes to the Featurizer and changes in the BaseElectionModel to work with the new Featurizer (and unit test changes).

Beyond centering and scaling the features and adding an intercept, the core problem that the Featurizer needed to deal with was generating fixed effects. Specifically that fixed effects in the fitting data might not appear in the holdout data and fixed effect values in the holdout data might not appear in the fitting data.

In the past we manually added and subtracted the columns. Instead we now generate the fixed effects for all units and instead differentiate between expanded fixed effects (all fixed effects that haven't been dropped to avoid multicolinearity) and active fixed effects (expanded fixed effects that appear in the fitting data).

Jira Ticket

Test Steps

These commands should still work as expected:

elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=nonparametric --percent_reporting=20 --aggregates=postal_code --fixed_effects=postal_code
elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=nonparametric --percent_reporting=20 --aggregates=postal_code --fixed_effects=postal_code --fixed_effects=county_classification

also tox for unit tests

@lennybronner lennybronner requested a review from a team as a code owner August 25, 2023 17:10
Copy link
Contributor

@jchaskell jchaskell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. I'm a bit hesitant about the FE approach but don't have any other ideas about how to solve the problem.

src/elexmodel/handlers/data/CombinedData.py Show resolved Hide resolved
src/elexmodel/handlers/data/Featurizer.py Show resolved Hide resolved
src/elexmodel/handlers/data/Featurizer.py Show resolved Hide resolved
src/elexmodel/handlers/data/Featurizer.py Show resolved Hide resolved
src/elexmodel/handlers/data/Featurizer.py Show resolved Hide resolved
src/elexmodel/models/BaseElectionModel.py Outdated Show resolved Hide resolved
"a": [2.0, 2.0, 2.0, 4.0],
"b": [1.0, 1.0, 1.0, 3.0],
"c": [0.5, 0.5, 0.5, 2.5],
"d": [np.inf, np.inf, np.inf, np.inf],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize we don't expect non FE features to ever be all the same but do we want it to error out in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so? since it will be linearly dependent with the intercept and cause a matrix inversion error either way.

tests/handlers/test_featurizer.py Outdated Show resolved Hide resolved
tests/handlers/test_featurizer.py Outdated Show resolved Hide resolved
tests/handlers/test_featurizer.py Show resolved Hide resolved
Copy link
Contributor

@dmnapolitano dmnapolitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my first time doing code reviews in github and I'm not sure if I'm done but I don't want to lose the comments I made so far 😬 If there's a way to "post comments but keep reviewing" I'd be very interested in knowing how (Google is no help here) 😅

src/elexmodel/models/BaseElectionModel.py Outdated Show resolved Hide resolved
src/elexmodel/models/BaseElectionModel.py Outdated Show resolved Hide resolved
src/elexmodel/handlers/data/CombinedData.py Show resolved Hide resolved
@lennybronner
Copy link
Collaborator Author

Mostly looks good. I'm a bit hesitant about the FE approach but don't have any other ideas about how to solve the problem.

What part are you hesitant about?

# beta_0 + beta_u * indic{u}
# and the fixed effect estimate for the dropped value is beta_0, so the average is:
# beta_0 + (beta_r / 3) + (beta_u / 3)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just saw this comment. And thanks for adding it; it helps me understand the math better 😄 👍🏻

But I have to admit I'm not familiar with this 😞 Do you have another example or some literature you can share?

Copy link
Contributor

@jchaskell jchaskell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Contributor

@dmnapolitano dmnapolitano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same! 🎉

@lennybronner lennybronner merged commit 8bff4bc into develop Sep 15, 2023
3 checks passed
@lennybronner lennybronner deleted the updates-to-featurizer branch September 15, 2023 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants