
Use cached data for tests? #164

Open
tfardet opened this issue Jul 18, 2023 · 14 comments
Labels: question (Further information is requested)

@tfardet (Collaborator) commented Jul 18, 2023

Right now the tests take a lot of time and probably use a lot of bandwidth downloading everything.
It's also a problem when the servers are down (as they are at the moment).

Suggestion:

  • move all non-download-related tests to cached data
  • even for tests where download is necessary, make a fallback using cached data so the test does not fail when servers are down

Pros: faster tests, lower impact on the servers, and no failures from tests that are irrelevant to a PR
Cons: we'd need to check manually that the relevant tests actually ran for the relevant PRs

tfardet added the question label on Jul 18, 2023
@hadrilec (Contributor):

If you have ideas on how to do it, feel free to make a PR.

@tgrandje (Collaborator):

I have some. @tfardet, do you want to do it, or should I assign myself?

@tgrandje (Collaborator) commented Aug 8, 2023

I haven't had the time to work on it yet (and I'm leaving for 15 days). @tfardet, if you have time to give it a try before I'm back, don't hesitate :-)

tgrandje self-assigned this on Aug 25, 2023
@tgrandje (Collaborator) commented Aug 25, 2023

Ok, I think my solution is working (on my local machine for now). I mock requests.Session and use requests_cache.Session from the requests-cache module instead (by the way, that might be a good way to simplify pynsee's own caching, but that's another subject).

I had some trouble because get_population uses multiprocessing, so I also had to mock multiprocessing.Pool, but now this seems to work.
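
For illustration, the monkeypatching could look something like the pytest fixture below; the cache path, expiration and fixture name are assumptions, not the actual setup:

```python
# Illustrative conftest.py fixture, not the actual pynsee test setup:
# the cache path and expiration below are assumptions.
import pytest
import requests
import requests_cache


@pytest.fixture(autouse=True)
def cached_session(monkeypatch):
    """Swap requests.Session for a cached session so tests read from the
    SQLite cache instead of hitting the INSEE servers whenever possible."""

    def _make_session(*args, **kwargs):
        # 15 days in seconds, as discussed below for the default timeout
        return requests_cache.CachedSession(
            "tests/cache/pynsee_tests", expire_after=15 * 24 * 3600
        )

    # Only covers code that calls requests.Session(); anything importing
    # Session directly would need to be patched where it is used.
    monkeypatch.setattr(requests, "Session", _make_session)
```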

I have some questions:

  • the sqlite database generated by the caching process is going to be huge (for an object stored on GitHub). Retrospectively, I should have guessed that 🙄... Testing only get_population (tests\localdata\test_pynsee_localdata.py) results in a 372 MB database. I can of course use multiple databases (say, one per unit test), but that would mean a lot more work on artifact storage. @hadrilec, do you know what kind of account InseeFrLab is using? The size limit depends on it...
  • what caching timeout should we set? For now I started with 15 days, which could be a good compromise for some databases (SIRENE for instance); we could also set a different timeout for each webservice (6 months for the COG, for instance?); see the sketch after this list. @hadrilec @tfardet, do you have suggestions?
  • I'm just starting to use CI/CD. I might need a hand to rewrite the pipelines to store artifacts 😄!
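
Regarding the per-webservice timeouts mentioned in the second item, requests-cache can map URL patterns to different expirations via urls_expire_after; a minimal sketch, with placeholder patterns rather than the real INSEE endpoints:

```python
from datetime import timedelta

import requests_cache

# Placeholder URL patterns: the real INSEE endpoints would go here.
session = requests_cache.CachedSession(
    "tests/cache/pynsee_tests",
    expire_after=timedelta(days=15),                   # default (e.g. SIRENE)
    urls_expire_after={
        "*/metadonnees/*": timedelta(days=180),        # e.g. COG, changes rarely
        "*/entreprises/sirene/*": timedelta(days=15),
    },
)
```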

@tgrandje (Collaborator):

@hadrilec, just one more question. For testing purposes, I'm caching results by monkeypatching the .get methods; that means the caching happens after the calls to _wait_api_query_limit. I see that _request_insee also handles the 429 error code (which IMHO should be enough to handle the bottleneck).

Is there a technical reason to have both?

Do you think it would cause trouble to patch _wait_api_query_limit (and nullify its effect) while testing? If we adjust the caching timeout correctly, we should (mostly) rely on the caching backend (except for once every 15 days...).
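
For illustration, nullifying _wait_api_query_limit in the test suite could be done with a fixture like the sketch below (the import path is an assumption to be checked against the actual pynsee layout):

```python
# Sketch only: the import path below is a guess and should be checked
# against the actual pynsee layout; if _request_insee imports the function
# directly, it should be patched where it is used instead.
import pytest


@pytest.fixture(autouse=True)
def no_query_throttling(monkeypatch):
    """Turn _wait_api_query_limit into a no-op so cached tests don't sleep;
    uncached calls then rely on the 429 handling in _request_insee."""
    monkeypatch.setattr(
        "pynsee.utils._wait_api_query_limit._wait_api_query_limit",
        lambda *args, **kwargs: None,
    )
```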

@hadrilec (Contributor) commented Aug 30, 2023

> @hadrilec, just one more question. For testing purposes, I'm caching results by monkeypatching the .get methods; that means the caching happens after the calls to _wait_api_query_limit. I see that _request_insee also handles the 429 error code (which IMHO should be enough to handle the bottleneck).
>
> Is there a technical reason to have both?
>
> Do you think it would cause trouble to patch _wait_api_query_limit (and nullify its effect) while testing? If we adjust the caching timeout correctly, we should (mostly) rely on the caching backend (except for once every 15 days...).

I think either you or @tfardet modified _request_insee to catch error 429. The main objective of _wait_api_query_limit is to compute the minimum waiting time required to avoid a 429 error: the function tracks when the last queries were made and then sleeps long enough for the number of queries made in the last 60 seconds to stay at 30 at most. I don't know whether the API counts a new request towards the total when you have already made too many in the last minute. If it does, and _wait_api_query_limit is disabled, the waiting time might be unbounded, because _request_insee only sleeps for 10 seconds, while a longer time might be needed for the number of queries made in the last minute to drop below 30.

In a nutshell, if I understood correctly how this works, I would say that yes, we could indeed remove _wait_api_query_limit. But in that case either the minimum sleeping time should be computed (and _wait_api_query_limit comes back), or the default sleep should be 60 seconds, which might mean a longer average waiting time for the user. All this should be properly tested to make sure the rationale described above makes sense. I would say it could be done in another PR.

EDIT: On second thought, what is written above is not accurate, sorry. Since _request_insee sleeps 10 seconds after a 429 error, the number of queries made in the last minute will automatically fall below 30. Consequently, _wait_api_query_limit is not needed anymore. But the package should be tested without it to check that everything is fine without this function.
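
To make that rationale concrete, here is a minimal sketch of the rolling-window logic described above (at most 30 queries per 60 seconds); it is illustrative and not the actual pynsee code:

```python
# Illustrative only: a minimal version of the rolling-window throttling
# described above (at most 30 queries per 60 seconds), not the actual
# pynsee implementation.
import time
from collections import deque

MAX_QUERIES = 30
WINDOW = 60.0  # seconds
_timestamps = deque(maxlen=MAX_QUERIES)


def wait_for_query_slot():
    """Sleep just long enough so that at most 30 queries fall in the last minute."""
    now = time.monotonic()
    if len(_timestamps) == MAX_QUERIES:
        elapsed = now - _timestamps[0]  # time since the oldest tracked query
        if elapsed < WINDOW:
            # Minimum sleep: wait until the oldest query leaves the window.
            time.sleep(WINDOW - elapsed)
    _timestamps.append(time.monotonic())
```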

@hadrilec (Contributor):

> Ok, I think my solution is working (on my local machine for now). I mock requests.Session and use requests_cache.Session from the requests-cache module instead (by the way, that might be a good way to simplify pynsee's own caching, but that's another subject).
>
> I had some trouble because get_population uses multiprocessing, so I also had to mock multiprocessing.Pool, but now this seems to work.
>
> I have some questions:
>
>   • the sqlite database generated by the caching process is going to be huge (for an object stored on GitHub). Retrospectively, I should have guessed that 🙄... Testing only get_population (tests\localdata\test_pynsee_localdata.py) results in a 372 MB database. I can of course use multiple databases (say, one per unit test), but that would mean a lot more work on artifact storage. @hadrilec, do you know what kind of account InseeFrLab is using? The size limit depends on it...
>   • what caching timeout should we set? For now I started with 15 days, which could be a good compromise for some databases (SIRENE for instance); we could also set a different timeout for each webservice (6 months for the COG, for instance?). @hadrilec @tfardet, do you have suggestions?
>   • I'm just starting to use CI/CD. I might need a hand to rewrite the pipelines to store artifacts 😄!

@linogaliana or @avouacr, could you please tell us what kind of GitHub account Insee has? I understand @tgrandje's question is: how much data can we store?

The documentation of pynsee relies on Jupyter notebooks that are a bit heavy; @RLesur complained about it in #30. Consequently, @tgrandje, I guess the account is quite limited in terms of storage. To be confirmed by @avouacr.
By the way, @tgrandje, if you have the time and the willingness, #30 could be a very nice use case for playing with CI/CD 😄

For the caching timeout, I think 15 days is ok.

@avouacr (Member) commented Aug 31, 2023

@hadrilec @tgrandje We use the GitHub Free plan for each of our organizations, which is obviously pretty limited by default. But for various purposes (mainly Onyxia-related storage), we buy a Git LFS data pack each year, which gives us 50 GB of LFS storage and 600 GB of bandwidth per 15 days. However, there is no guarantee that we'll continue to buy this storage long-term.

A more sustainable solution could be to use the S3/MinIO storage of the SSP Cloud to store your caching DB, and make it public so that the package/CI job can access it via a public URL. I think @linogaliana uses this strategy for the cartiflette package.
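
As a sketch of that approach, the test setup could download the cache database from a public URL when it is missing; the URL and paths below are placeholders:

```python
# Sketch: fetch the shared test cache from a public MinIO/S3 URL if it is
# not already present. The URL and local path are placeholders.
from pathlib import Path

import requests

CACHE_URL = "https://minio.lab.sspcloud.fr/<bucket>/pynsee-tests/cache.sqlite"
CACHE_PATH = Path("tests/cache/cache.sqlite")


def fetch_test_cache():
    """Download the cache database once, before the test session starts."""
    if CACHE_PATH.exists():
        return
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(CACHE_URL, timeout=60)
    response.raise_for_status()
    CACHE_PATH.write_bytes(response.content)
```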

@tgrandje (Collaborator):

Good idea @avouacr. I'll try to set something up: with everything cached, I've ended up with 1.7 GB of SQLite files. Do you think this can be deployed as is (until I manage to work with MinIO)?

Can you also confirm that you can upload to the public MinIO without the (daily) tokens? For cartiflette, I think it's either upload from inside the SSP Cloud (or with tokens), or download from the public cache...

@linogaliana (Contributor) commented Sep 1, 2023

+1 for the S3 solution

@tgrandje It is possible to create a service account that would not expire and could be provided via GitHub Actions secrets. We are not doing that in cartiflette yet, because we are still producing the files manually, but as you know we will improve that later on.
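
For illustration, a CI job could then push the cache back to MinIO with such a service account, reading credentials from environment variables fed by GitHub Actions secrets; the endpoint, variable and bucket names are placeholders:

```python
# Sketch: push the cache database to MinIO from CI, using service-account
# credentials exposed as GitHub Actions secrets. The endpoint, environment
# variable names and bucket are placeholders.
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.lab.sspcloud.fr",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
s3.upload_file(
    "tests/cache/cache.sqlite",       # local file produced by the test run
    "<bucket>",                       # placeholder bucket name
    "pynsee-tests/cache.sqlite",      # object key served via the public URL
)
```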

@avouacr (Member) commented Sep 1, 2023

@tgrandje I think it would be preferable to work with MinIO directly, as putting large files in a git project can end up being tedious to manage. Maybe you could start a PR with the current state of your work on caching, and I could help set up the communication with MinIO?

@tfardet (Collaborator, Author) commented Sep 1, 2023

> I think it would be preferable to work with MinIO directly, as putting large files in a git project can end up being tedious to manage.

Yeah: @tgrandje, if the "deployed as is" in your comment meant "tracked as files in git", then I really suggest finding another way, as this would be extremely impractical to manage.

@tgrandje (Collaborator) commented Sep 4, 2023

What I meant was storing the files through GitHub Actions' artifact cache, not through git. That might have been practical enough and might have reduced data transfer, but I understand the cost is not trivial (as a matter of fact, I haven't found how to evaluate it with their pricing calculator).

The current branch is here. I first started to split the SQLite files between modules because I thought the 500 MB threshold displayed in the GitHub Actions documentation was per file; they can easily be re-merged (I'll do that asap) to simplify the exchange with MinIO.

It should also be easy to add a try/except in a pytest_sessionstart/pytest_sessionfinish hook to interact with MinIO. @avouacr, I'll set up the functions and let you know where to work.
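
A sketch of what those hooks could look like; fetch_test_cache and upload_test_cache are hypothetical helpers standing in for the MinIO I/O functions:

```python
# conftest.py sketch: fetch_test_cache and upload_test_cache are hypothetical
# helpers standing in for the MinIO I/O functions mentioned above.
def pytest_sessionstart(session):
    """Try to fetch the shared cache before the tests run."""
    try:
        fetch_test_cache()  # hypothetical helper (e.g. download from MinIO)
    except Exception:
        pass  # no cache available: tests fall back to live requests


def pytest_sessionfinish(session, exitstatus):
    """Push the (possibly refreshed) cache back after the tests, if allowed."""
    try:
        upload_test_cache()  # hypothetical helper, needs write credentials
    except Exception:
        pass  # read-only environment (e.g. a fork's CI): nothing to do
```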

@tgrandje (Collaborator) commented Sep 5, 2023

Ok, so @avouacr, I've just given you rights on my repo. You should be able to work on the I/O with MinIO here. What do you think?
