Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ELEX-3469 save aggregate predictions to s3 #100

Merged
merged 13 commits into from
Sep 11, 2024

Conversation

dmnapolitano
Copy link
Contributor

@dmnapolitano dmnapolitano commented Sep 3, 2024

Description

Hi! The changes in this PR allow us to save aggregate model output (national summary estimates) to s3. It also provides a new CLI argument, --national_summary, which will produce aggregate model output via CLI 🎉 Hopefully I did this right 😬 Thanks!

Jira Ticket

ELEX-3469

Test Steps

Set APP_ENV=dev and DATA_ENV=dev, then run any command with --national_summary, e.g.:

elexmodel 2023-11-07_VA_G --estimands=margin --features=baseline_normalized_margin --office_id=Y --geographic_unit_type=county-district --save_output results --pi_method bootstrap --national_summary

@dmnapolitano dmnapolitano requested a review from a team as a code owner September 3, 2024 20:51
Copy link
Collaborator

@lennybronner lennybronner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

  1. Talk to the live team, because this changes how they call the aggregate model
  2. Make corresponding changes in the testbed
  3. Make sure we pass through the relevant arguments.

@dmnapolitano
Copy link
Contributor Author

dmnapolitano commented Sep 4, 2024

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

  1. Talk to the live team, because this changes how they call the aggregate model
  2. Make corresponding changes in the testbed
  3. Make sure we pass through the relevant arguments.

Yes! This is great, I was wondering about all of this 🎉 I moved it because self.model is instantiated in get_estimates() so if one called get_national_summary_votes_estimates() first, it would fail, and I wasn't sure there was a scenario where we would ever want to call get_national_summary_votes_estimates() without having called get_estimates() 🤔 So this is good. I'll check with the team when Jen is online and see what they say 😄 👍🏻

@dmnapolitano
Copy link
Contributor Author

Also I just realized this is PR 100 😄 🎉 💯

@dmnapolitano
Copy link
Contributor Author

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

  1. Talk to the live team, because this changes how they call the aggregate model
  2. Make corresponding changes in the testbed
  3. Make sure we pass through the relevant arguments.

Yes! This is great, I was wondering about all of this 🎉 I moved it because self.model is instantiated in get_estimates() so if one called get_national_summary_votes_estimates() first, it would fail, and I wasn't sure there was a scenario where we would ever want to call get_national_summary_votes_estimates() without having called get_estimates() 🤔 So this is good. I'll check with the team when Jen is online and see what they say 😄 👍🏻

Alright so the feedback from Jen is please don't change anything 😂 So I'll add the method back in and I have some ideas for how to accomplish this... 🤔

@dmnapolitano dmnapolitano marked this pull request as draft September 4, 2024 16:50
…e making sure the results get written where they need to be written
@dmnapolitano dmnapolitano marked this pull request as ready for review September 4, 2024 17:41
@dmnapolitano
Copy link
Contributor Author

I'm not opposed to moving the call for self.model.get_national_summary_estimates(None, None, 0, 0.99) to within get_estimates. At the moment it is executed outside the model which then calls get_national_summary_votes_estimates in the client. But if we want to make this change we need to:

  1. Talk to the live team, because this changes how they call the aggregate model
  2. Make corresponding changes in the testbed
  3. Make sure we pass through the relevant arguments.

Yes! This is great, I was wondering about all of this 🎉 I moved it because self.model is instantiated in get_estimates() so if one called get_national_summary_votes_estimates() first, it would fail, and I wasn't sure there was a scenario where we would ever want to call get_national_summary_votes_estimates() without having called get_estimates() 🤔 So this is good. I'll check with the team when Jen is online and see what they say 😄 👍🏻

Alright so the feedback from Jen is please don't change anything 😂 So I'll add the method back in and I have some ideas for how to accomplish this... 🤔

@lennybronner alright I just pushed some changes that I think should accomplish this and preserve the current get_national_summary_votes_estimates() 🙏🏻 Note that the writing of national summary estimates to s3 still only happens in get_estimates() and not in get_national_summary_votes_estimates(), which might be fine given the behavior of get_national_summary_votes_estimates() that live/site/elections is expecting 🤔

Copy link
Collaborator

@lennybronner lennybronner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

src/elexmodel/client.py Outdated Show resolved Hide resolved
@dmnapolitano
Copy link
Contributor Author

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

  1. Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?
  2. We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);
  3. I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;
  4. I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

@lennybronner
Copy link
Collaborator

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

  1. Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?
  2. We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);
  3. I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;
  4. I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

I don't really like re-running the national summary estimates, I'm not sure we've thought about whether something changes. I understand why you just return the national summary estimates as they exist now, but that only runs the aggregate model once per instantiation -- which we do not want. Maybe we add a parameter to get_national_summary_votes_estimates that tells us whether we should re-run or just return the current version.

@dmnapolitano
Copy link
Contributor Author

dmnapolitano commented Sep 5, 2024

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

  1. Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?
  2. We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);
  3. I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;
  4. I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

I don't really like re-running the national summary estimates, I'm not sure we've thought about whether something changes. I understand why you just return the national summary estimates as they exist now, but that only runs the aggregate model once per instantiation -- which we do not want. Maybe we add a parameter to get_national_summary_votes_estimates that tells us whether we should re-run or just return the current version.

Ok, I just pushed a possible solution for this based on this comment and the one you have on the code 🙌🏻 The thinking is this:

  1. The model only gets instantiated/updated/re-run when get_estimates() is called;
  2. get_national_summary_votes_estimates() requires a model, so without having called get_estimates() first, it will fail;
  3. But, if the flow is, get_estimates(), then get_national_summary_votes_estimates(), then get_national_summary_votes_estimates() again, the model hasn't changed so there's no reason to re-compute the national summary votes;
  4. So, if we set a flag (_model_updated) to True whenever get_estimates() has been called and False otherwise, then check the flag, that should take care of scenario # 3 🎉

The other option is to re-compute the national summary votes every time get_national_summary_votes_estimates() is called. If that's a pretty fast thing to do, it might be worth it compared to this logic 😅 🤔

@dmnapolitano
Copy link
Contributor Author

I'm not sure I understand why so much change was necessary? Why can't we just write the data to s3 from get_national_summary_votes_estimates?

We could, but,

  1. Is that something we want to happen every time this method is called? Is that a side effect the live team is expecting?
  2. We'll have to add additional arguments to get_national_summary_votes_estimates() to handle the logic of where the file gets written to (could be optional though);
  3. I'd have to write something else to write this out to s3, rather than reuse the code writing out other results files;
  4. I had no way of testing this without running the testbed or adding the CLI argument (which you are also adding in PR Elex 4455 add race call contest agg #101 🙌🏻 );

I think those were my main thoughts on this. I'll take a look again to see what can be simplified, but, 🤔

I don't really like re-running the national summary estimates, I'm not sure we've thought about whether something changes. I understand why you just return the national summary estimates as they exist now, but that only runs the aggregate model once per instantiation -- which we do not want. Maybe we add a parameter to get_national_summary_votes_estimates that tells us whether we should re-run or just return the current version.

Ok, I just pushed a possible solution for this based on this comment and the one you have on the code 🙌🏻 The thinking is this:

  1. The model only gets instantiated/updated/re-run when get_estimates() is called;
  2. get_national_summary_votes_estimates() requires a model, so without having called get_estimates() first, it will fail;
  3. But, if the flow is, get_estimates(), then get_national_summary_votes_estimates(), then get_national_summary_votes_estimates() again, the model hasn't changed so there's no reason to re-compute the national summary votes;
  4. So, if we set a flag (_model_updated) to True whenever get_estimates() has been called and False otherwise, then check the flag, that should take care of scenario # 3 🎉

The other option is to re-compute the national summary votes every time get_national_summary_votes_estimates() is called. If that's a pretty fast thing to do, it might be worth it compared to this logic 😅 🤔

Although either way, I just realized if you call that method and you're calling get_estimates() with national_summary = False, those updated national summary estimates won't get written to s3 🤦🏻‍♀️ Blargh, sorry, let me continue to think about this.... 🤔

@dmnapolitano dmnapolitano marked this pull request as draft September 5, 2024 14:16
… options to write output data in get_national_summary_votes_estimates(), don't write out all the data all over again if we don't have to
@dmnapolitano
Copy link
Contributor Author

Ok @lennybronner take a look at the changes I just pushed. It borrows some logic / variables from get_estimands() to write specifically the national summary estimates to s3. The only thing that doesn't happen here is, self.model.get_national_summary_estimates() is re-called every time, which might be fine since this seems to be a pretty fast method, although I'd be curious to hear your and live/elections's experience with that 😬 If it ends up being too slow, I can add the logic bad in to avoid re-computing the national summary estimates. Anyway, thanks and LMK what you think about this approach 🤔

@dmnapolitano dmnapolitano marked this pull request as ready for review September 5, 2024 15:56
@lennybronner
Copy link
Collaborator

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.

Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

@dmnapolitano
Copy link
Contributor Author

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.

Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

Thanks! 🎉

  1. What are the changes that need to be made to the model testbed? 🤔
  2. What are the changes that the live team will need to make? I tried to make it so they don't have to change anything, but maybe I failed somehow 🤔
  3. Sorry, I explained that poorly. If you do (1) get_estimates(), (2) get_national_summary_votes_estimates(), then (3) get_national_summary_votes_estimates(), that second call (3) will re-compute the national vote estimates and return the same results as (2). I was trying to check for this before but I took it out because it's a bit complicated and might not be worth the extra effort. So I figured I'd revisit it if we got complaints about speed lol 😄

@lennybronner
Copy link
Collaborator

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.
Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

Thanks! 🎉

  1. What are the changes that need to be made to the model testbed? 🤔
  2. What are the changes that the live team will need to make? I tried to make it so they don't have to change anything, but maybe I failed somehow 🤔
  3. Sorry, I explained that poorly. If you do (1) get_estimates(), (2) get_national_summary_votes_estimates(), then (3) get_national_summary_votes_estimates(), that second call (3) will re-compute the national vote estimates and return the same results as (2). I was trying to check for this before but I took it out because it's a bit complicated and might not be worth the extra effort. So I figured I'd revisit it if we got complaints about speed lol 😄

(1) and (2) The structure of the response of get_national_summary_votes_estimates changed, it's not a dataframe rather than a list, right?

(3) that's what we want.

@dmnapolitano
Copy link
Contributor Author

This generally looks good now, thanks so much for figuring this out. I left one small question. This will necessitate making changes to the model testbed, so please make those (and then run the 2020 election through the testbed to make sure it all still works as expected). Also, the live will need to make small changes to how they interact with the national summary data, so as long as they aware aware of this, this seems fine.
Where does re-calling get_national_summary_estimates happen? I only see it happen once, where it should be.

Thanks! 🎉

  1. What are the changes that need to be made to the model testbed? 🤔
  2. What are the changes that the live team will need to make? I tried to make it so they don't have to change anything, but maybe I failed somehow 🤔
  3. Sorry, I explained that poorly. If you do (1) get_estimates(), (2) get_national_summary_votes_estimates(), then (3) get_national_summary_votes_estimates(), that second call (3) will re-compute the national vote estimates and return the same results as (2). I was trying to check for this before but I took it out because it's a bit complicated and might not be worth the extra effort. So I figured I'd revisit it if we got complaints about speed lol 😄

(1) and (2) The structure of the response of get_national_summary_votes_estimates changed, it's not a dataframe rather than a list, right?

(3) that's what we want.

(3) Great! 🎉

(1) and (2), no, get_national_summary_votes_estimates() in ModelClient returned something like this:

{"margin" : [agg_pred, agg_lower, agg_upper]}

I made sure to preserve that. When writing to s3, I'm currently converting that to a CSV but that can easily be a JSON or whatever is best 🤔

@lennybronner
Copy link
Collaborator

amazing, thank you! And just to confirm, you've run this from the testbed (including writing to s3) and it works?

@dmnapolitano
Copy link
Contributor Author

amazing, thank you! And just to confirm, you've run this from the testbed (including writing to s3) and it works?

Thanks!! I will test that now! 🤔

@dmnapolitano
Copy link
Contributor Author

amazing, thank you! And just to confirm, you've run this from the testbed (including writing to s3) and it works?

Thanks!! I will test that now! 🤔

@lennybronner ok, so this testbed command works as we'd expect it to:

python run.py 2022-11-08_USA_G redo --office_id S_county --fixed_effects "['county_classification']" --geographic_unit_type county --pi_method bootstrap --estimands "['margin']" --features "['baseline_normalized_margin', 'education_bachelors_or_higher']" --end_timestamp "2022-11-10 22:00:29-05:00" --agg_model_preds

🎉 but for writing to s3 from the testbed, there's no way to do that via the CLI currently, although I could add the options. And if I did, it would just write through ModelClient and then overwrite every set of predictions produced. Is that ok or would writing everything we save in output/ be better? 🤔

@lennybronner
Copy link
Collaborator

Yes, this is why I meant this needs a corresponding PR in the testbed. It's ok to overwrite the predictions produced in dev in s3.

@dmnapolitano
Copy link
Contributor Author

Yes, this is why I meant this needs a corresponding PR in the testbed. It's ok to overwrite the predictions produced in dev in s3.

Oh sorry, I misunderstood! Ok, I'll work on that now 😅 👍🏻

@dmnapolitano
Copy link
Contributor Author

Yes, this is why I meant this needs a corresponding PR in the testbed. It's ok to overwrite the predictions produced in dev in s3.

Oh sorry, I misunderstood! Ok, I'll work on that now 😅 👍🏻

@lennybronner this is ready for review 🎉 washpost/elex-live-model-testbed#25

Copy link
Collaborator

@lennybronner lennybronner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, and now just to confirm, you ran this with your testbed branch?

@dmnapolitano
Copy link
Contributor Author

Great, and now just to confirm, you ran this with your testbed branch?

Yup! 🎉

@dmnapolitano dmnapolitano merged commit d766bcd into develop Sep 11, 2024
3 checks passed
@dmnapolitano dmnapolitano deleted the ELEX-3469-save-aggregate-predictions-to-s3 branch September 11, 2024 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants