ELEX-3469 save aggregate predictions to s3 #100

dmnapolitano · 2024-09-03T20:51:54Z

Description

Hi! The changes in this PR allow us to save aggregate model output (national summary estimates) to s3. It also provides a new CLI argument, --national_summary, which will produce aggregate model output via CLI 🎉 Hopefully I did this right 😬 Thanks!

Jira Ticket

ELEX-3469

Test Steps

Set APP_ENV=dev and DATA_ENV=dev, then run any command with --national_summary, e.g.:

elexmodel 2023-11-07_VA_G --estimands=margin --features=baseline_normalized_margin --office_id=Y --geographic_unit_type=county-district --save_output results --pi_method bootstrap --national_summary

…have them written out to s3, with command line argument

…use it won't work if get_estimands() hasn't been called yet

…other subclasses

lennybronner

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

Talk to the live team, because this changes how they call the aggregate model
Make corresponding changes in the testbed
Make sure we pass through the relevant arguments.

dmnapolitano · 2024-09-04T13:05:32Z

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

Talk to the live team, because this changes how they call the aggregate model

Make corresponding changes in the testbed

Make sure we pass through the relevant arguments.

Yes! This is great, I was wondering about all of this 🎉 I moved it because self.model is instantiated in get_estimates() so if one called get_national_summary_votes_estimates() first, it would fail, and I wasn't sure there was a scenario where we would ever want to call get_national_summary_votes_estimates() without having called get_estimates() 🤔 So this is good. I'll check with the team when Jen is online and see what they say 😄 👍🏻

dmnapolitano · 2024-09-04T13:08:26Z

Also I just realized this is PR 100 😄 🎉 💯

dmnapolitano · 2024-09-04T16:50:37Z

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

Talk to the live team, because this changes how they call the aggregate model

Make corresponding changes in the testbed

Make sure we pass through the relevant arguments.

Yes! This is great, I was wondering about all of this 🎉 I moved it because self.model is instantiated in get_estimates() so if one called get_national_summary_votes_estimates() first, it would fail, and I wasn't sure there was a scenario where we would ever want to call get_national_summary_votes_estimates() without having called get_estimates() 🤔 So this is good. I'll check with the team when Jen is online and see what they say 😄 👍🏻

Alright so the feedback from Jen is please don't change anything 😂 So I'll add the method back in and I have some ideas for how to accomplish this... 🤔

…e making sure the results get written where they need to be written

dmnapolitano · 2024-09-04T17:45:06Z

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

Talk to the live team, because this changes how they call the aggregate model

Make corresponding changes in the testbed

Make sure we pass through the relevant arguments.

Yes! This is great, I was wondering about all of this 🎉 I moved it because self.model is instantiated in get_estimates() so if one called get_national_summary_votes_estimates() first, it would fail, and I wasn't sure there was a scenario where we would ever want to call get_national_summary_votes_estimates() without having called get_estimates() 🤔 So this is good. I'll check with the team when Jen is online and see what they say 😄 👍🏻

Alright so the feedback from Jen is please don't change anything 😂 So I'll add the method back in and I have some ideas for how to accomplish this... 🤔

@lennybronner alright I just pushed some changes that I think should accomplish this and preserve the current get_national_summary_votes_estimates() 🙏🏻 Note that the writing of national summary estimates to s3 still only happens in get_estimates() and not in get_national_summary_votes_estimates(), which might be fine given the behavior of get_national_summary_votes_estimates() that live/site/elections is expecting 🤔

lennybronner

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

src/elexmodel/client.py

dmnapolitano · 2024-09-04T20:30:30Z

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?
We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);
I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;
I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

lennybronner · 2024-09-04T21:08:33Z

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?

We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);

I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;

I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

I don't really like re-running the national summary estimates, I'm not sure we've thought about whether something changes. I understand why you just return the national summary estimates as they exist now, but that only runs the aggregate model once per instantiation -- which we do not want. Maybe we add a parameter to get_national_summary_votes_estimates that tells us whether we should re-run or just return the current version.

…been re-run

dmnapolitano · 2024-09-05T14:13:22Z

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?

We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);

I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;

I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

I don't really like re-running the national summary estimates, I'm not sure we've thought about whether something changes. I understand why you just return the national summary estimates as they exist now, but that only runs the aggregate model once per instantiation -- which we do not want. Maybe we add a parameter to get_national_summary_votes_estimates that tells us whether we should re-run or just return the current version.

Ok, I just pushed a possible solution for this based on this comment and the one you have on the code 🙌🏻 The thinking is this:

The model only gets instantiated/updated/re-run when get_estimates() is called;
get_national_summary_votes_estimates() requires a model, so without having called get_estimates() first, it will fail;
But, if the flow is, get_estimates(), then get_national_summary_votes_estimates(), then get_national_summary_votes_estimates() again, the model hasn't changed so there's no reason to re-compute the national summary votes;
So, if we set a flag (_model_updated) to True whenever get_estimates() has been called and False otherwise, then check the flag, that should take care of scenario # 3 🎉

The other option is to re-compute the national summary votes every time get_national_summary_votes_estimates() is called. If that's a pretty fast thing to do, it might be worth it compared to this logic 😅 🤔

dmnapolitano · 2024-09-05T14:16:42Z

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?

We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);

I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;

I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

I don't really like re-running the national summary estimates, I'm not sure we've thought about whether something changes. I understand why you just return the national summary estimates as they exist now, but that only runs the aggregate model once per instantiation -- which we do not want. Maybe we add a parameter to get_national_summary_votes_estimates that tells us whether we should re-run or just return the current version.

Ok, I just pushed a possible solution for this based on this comment and the one you have on the code 🙌🏻 The thinking is this:

The model only gets instantiated/updated/re-run when get_estimates() is called;

get_national_summary_votes_estimates() requires a model, so without having called get_estimates() first, it will fail;

But, if the flow is, get_estimates(), then get_national_summary_votes_estimates(), then get_national_summary_votes_estimates() again, the model hasn't changed so there's no reason to re-compute the national summary votes;

So, if we set a flag (_model_updated) to True whenever get_estimates() has been called and False otherwise, then check the flag, that should take care of scenario # 3 🎉

The other option is to re-compute the national summary votes every time get_national_summary_votes_estimates() is called. If that's a pretty fast thing to do, it might be worth it compared to this logic 😅 🤔

Although either way, I just realized if you call that method and you're calling get_estimates() with national_summary = False, those updated national summary estimates won't get written to s3 🤦🏻‍♀️ Blargh, sorry, let me continue to think about this.... 🤔

… options to write output data in get_national_summary_votes_estimates(), don't write out all the data all over again if we don't have to

… own version of that

dmnapolitano · 2024-09-05T15:55:57Z

Ok @lennybronner take a look at the changes I just pushed. It borrows some logic / variables from get_estimands() to write specifically the national summary estimates to s3. The only thing that doesn't happen here is, self.model.get_national_summary_estimates() is re-called every time, which might be fine since this seems to be a pretty fast method, although I'd be curious to hear your and live/elections's experience with that 😬 If it ends up being too slow, I can add the logic bad in to avoid re-computing the national summary estimates. Anyway, thanks and LMK what you think about this approach 🤔

lennybronner · 2024-09-05T17:45:55Z

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.

Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

src/elexmodel/handlers/data/ModelResults.py

dmnapolitano · 2024-09-05T18:28:52Z

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.

Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

Thanks! 🎉

What are the changes that need to be made to the model testbed? 🤔
What are the changes that the live team will need to make? I tried to make it so they don't have to change anything, but maybe I failed somehow 🤔
Sorry, I explained that poorly. If you do (1) get_estimates(), (2) get_national_summary_votes_estimates(), then (3) get_national_summary_votes_estimates(), that second call (3) will re-compute the national vote estimates and return the same results as (2). I was trying to check for this before but I took it out because it's a bit complicated and might not be worth the extra effort. So I figured I'd revisit it if we got complaints about speed lol 😄

lennybronner · 2024-09-06T15:36:56Z

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.
Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

Thanks! 🎉

What are the changes that need to be made to the model testbed? 🤔

What are the changes that the live team will need to make? I tried to make it so they don't have to change anything, but maybe I failed somehow 🤔

Sorry, I explained that poorly. If you do (1) get_estimates(), (2) get_national_summary_votes_estimates(), then (3) get_national_summary_votes_estimates(), that second call (3) will re-compute the national vote estimates and return the same results as (2). I was trying to check for this before but I took it out because it's a bit complicated and might not be worth the extra effort. So I figured I'd revisit it if we got complaints about speed lol 😄

(1) and (2) The structure of the response of get_national_summary_votes_estimates changed, it's not a dataframe rather than a list, right?

(3) that's what we want.

dmnapolitano · 2024-09-06T15:43:35Z

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.
Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

Thanks! 🎉

What are the changes that need to be made to the model testbed? 🤔

What are the changes that the live team will need to make? I tried to make it so they don't have to change anything, but maybe I failed somehow 🤔

Sorry, I explained that poorly. If you do (1) get_estimates(), (2) get_national_summary_votes_estimates(), then (3) get_national_summary_votes_estimates(), that second call (3) will re-compute the national vote estimates and return the same results as (2). I was trying to check for this before but I took it out because it's a bit complicated and might not be worth the extra effort. So I figured I'd revisit it if we got complaints about speed lol 😄

(1) and (2) The structure of the response of get_national_summary_votes_estimates changed, it's not a dataframe rather than a list, right?

(3) that's what we want.

(3) Great! 🎉

(1) and (2), no, get_national_summary_votes_estimates() in ModelClient returned something like this:

{"margin" : [agg_pred, agg_lower, agg_upper]}

I made sure to preserve that. When writing to s3, I'm currently converting that to a CSV but that can easily be a JSON or whatever is best 🤔

lennybronner · 2024-09-06T16:48:53Z

amazing, thank you! And just to confirm, you've run this from the testbed (including writing to s3) and it works?

dmnapolitano · 2024-09-06T17:29:32Z

amazing, thank you! And just to confirm, you've run this from the testbed (including writing to s3) and it works?

Thanks!! I will test that now! 🤔

dmnapolitano · 2024-09-06T19:07:41Z

amazing, thank you! And just to confirm, you've run this from the testbed (including writing to s3) and it works?

Thanks!! I will test that now! 🤔

@lennybronner ok, so this testbed command works as we'd expect it to:

python run.py 2022-11-08_USA_G redo --office_id S_county --fixed_effects "['county_classification']" --geographic_unit_type county --pi_method bootstrap --estimands "['margin']" --features "['baseline_normalized_margin', 'education_bachelors_or_higher']" --end_timestamp "2022-11-10 22:00:29-05:00" --agg_model_preds

🎉 but for writing to s3 from the testbed, there's no way to do that via the CLI currently, although I could add the options. And if I did, it would just write through ModelClient and then overwrite every set of predictions produced. Is that ok or would writing everything we save in output/ be better? 🤔

lennybronner · 2024-09-06T22:42:33Z

Yes, this is why I meant this needs a corresponding PR in the testbed. It's ok to overwrite the predictions produced in dev in s3.

dmnapolitano · 2024-09-09T13:13:25Z

Yes, this is why I meant this needs a corresponding PR in the testbed. It's ok to overwrite the predictions produced in dev in s3.

Oh sorry, I misunderstood! Ok, I'll work on that now 😅 👍🏻

dmnapolitano · 2024-09-09T14:36:56Z

Yes, this is why I meant this needs a corresponding PR in the testbed. It's ok to overwrite the predictions produced in dev in s3.

Oh sorry, I misunderstood! Ok, I'll work on that now 😅 👍🏻

@lennybronner this is ready for review 🎉 washpost/elex-live-model-testbed#25

…et_national_vote_summary_estimates() and running pre-commit

lennybronner

Great, and now just to confirm, you ran this with your testbed branch?

dmnapolitano · 2024-09-11T15:00:36Z

Great, and now just to confirm, you ran this with your testbed branch?

Yup! 🎉

dmnapolitano added 4 commits September 3, 2024 15:59

Adding ability to add national summary estimates to ModelResults and …

2456f33

…have them written out to s3, with command line argument

Remove get_national_summary_votes_estimates() method from client beca…

22a1c3d

…use it won't work if get_estimands() hasn't been called yet

Fixing warning about missing alpha parameter in model superclass and …

00f6354

…other subclasses

Adding client unit test for getting national summary estimates

31834d4

dmnapolitano requested a review from a team as a code owner September 3, 2024 20:51

Adding new national_summary CLI flag to README

e426e9e

lennybronner reviewed Sep 4, 2024

View reviewed changes

dmnapolitano added the 💭 wip label Sep 4, 2024

dmnapolitano mentioned this pull request Sep 4, 2024

Elex 4455 add race call contest agg #101

Merged

dmnapolitano marked this pull request as draft September 4, 2024 16:50

Preserve the pre-existing national summary method in ModelClient whil…

2480ff6

…e making sure the results get written where they need to be written

dmnapolitano marked this pull request as ready for review September 4, 2024 17:41

lennybronner reviewed Sep 4, 2024

View reviewed changes

src/elexmodel/client.py Outdated Show resolved Hide resolved

Adding a flag to the ModelClient to help indicate when the model has …

5e87a35

…been re-run

dmnapolitano marked this pull request as draft September 5, 2024 14:16

dmnapolitano added 3 commits September 5, 2024 11:06

Remove national_summary kwarg logic from get_estimates(), make use of…

6662cb8

… options to write output data in get_national_summary_votes_estimates(), don't write out all the data all over again if we don't have to

Updating unit test

4190fc6

Switching to pd.DataFrame.from_dict() rather than trying to create my…

86b3b74

… own version of that

dmnapolitano marked this pull request as ready for review September 5, 2024 15:56

lennybronner reviewed Sep 5, 2024

View reviewed changes

src/elexmodel/handlers/data/ModelResults.py Show resolved Hide resolved

dmnapolitano removed the 💭 wip label Sep 6, 2024

dmnapolitano added the 💭 wip label Sep 9, 2024

dmnapolitano added 3 commits September 9, 2024 13:42

Merging in changes from develop

5785aed

Updating the cli call now that called_states is no longer an arg to g…

e85cddd

…et_national_vote_summary_estimates() and running pre-commit

Making sure unit tests pass again

f5af27d

dmnapolitano removed the 💭 wip label Sep 9, 2024

lennybronner reviewed Sep 11, 2024

View reviewed changes

lennybronner approved these changes Sep 11, 2024

View reviewed changes

dmnapolitano merged commit d766bcd into develop Sep 11, 2024
3 checks passed

dmnapolitano deleted the ELEX-3469-save-aggregate-predictions-to-s3 branch September 11, 2024 19:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ELEX-3469 save aggregate predictions to s3 #100

ELEX-3469 save aggregate predictions to s3 #100

dmnapolitano commented Sep 3, 2024 •

edited by jira bot

Loading

lennybronner left a comment

dmnapolitano commented Sep 4, 2024 •

edited

Loading

dmnapolitano commented Sep 4, 2024

dmnapolitano commented Sep 4, 2024

dmnapolitano commented Sep 4, 2024

lennybronner left a comment

dmnapolitano commented Sep 4, 2024

lennybronner commented Sep 4, 2024

dmnapolitano commented Sep 5, 2024 •

edited

Loading

dmnapolitano commented Sep 5, 2024

dmnapolitano commented Sep 5, 2024

lennybronner commented Sep 5, 2024

dmnapolitano commented Sep 5, 2024

lennybronner commented Sep 6, 2024

dmnapolitano commented Sep 6, 2024

lennybronner commented Sep 6, 2024

dmnapolitano commented Sep 6, 2024

dmnapolitano commented Sep 6, 2024

lennybronner commented Sep 6, 2024

dmnapolitano commented Sep 9, 2024

dmnapolitano commented Sep 9, 2024

lennybronner left a comment

dmnapolitano commented Sep 11, 2024

ELEX-3469 save aggregate predictions to s3 #100

ELEX-3469 save aggregate predictions to s3 #100

Conversation

dmnapolitano commented Sep 3, 2024 • edited by jira bot Loading

Description

Jira Ticket

Test Steps

lennybronner left a comment

Choose a reason for hiding this comment

dmnapolitano commented Sep 4, 2024 • edited Loading

dmnapolitano commented Sep 4, 2024

dmnapolitano commented Sep 4, 2024

dmnapolitano commented Sep 4, 2024

lennybronner left a comment

Choose a reason for hiding this comment

dmnapolitano commented Sep 4, 2024

lennybronner commented Sep 4, 2024

dmnapolitano commented Sep 5, 2024 • edited Loading

dmnapolitano commented Sep 5, 2024

dmnapolitano commented Sep 5, 2024

lennybronner commented Sep 5, 2024

dmnapolitano commented Sep 5, 2024

lennybronner commented Sep 6, 2024

dmnapolitano commented Sep 6, 2024

lennybronner commented Sep 6, 2024

dmnapolitano commented Sep 6, 2024

dmnapolitano commented Sep 6, 2024

lennybronner commented Sep 6, 2024

dmnapolitano commented Sep 9, 2024

dmnapolitano commented Sep 9, 2024

lennybronner left a comment

Choose a reason for hiding this comment

dmnapolitano commented Sep 11, 2024

dmnapolitano commented Sep 3, 2024 •

edited by jira bot

Loading

dmnapolitano commented Sep 4, 2024 •

edited

Loading

dmnapolitano commented Sep 5, 2024 •

edited

Loading