Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration for credentials, proxies, and progress #162

Closed
wants to merge 63 commits into from

Conversation

tfardet
Copy link
Collaborator

@tfardet tfardet commented Jul 18, 2023

Use configuration dictionary:

  • credentials and proxies are taken from the config dict
  • priority order is environment, then config for both
  • hide_progress option to disable progress bars from tqdm

Question 1: where should I document that?

Removed all writing to environment except for the token.
Question 2: should I also move it to the config dict to avoid writing to env altogether?

Cleaned up some bare excepts and imports.

Fixes #152

tfardet added 2 commits July 18, 2023 13:13
- credentials are stored in the dict
- proxies are also taken from there

Removed all writing to environment except for the token.
Cleaned up bare excepts and imports.
@tfardet tfardet added the enhancement New feature or request label Jul 18, 2023
@tfardet tfardet changed the title Configuration dictionary for credentials, proxies, and progress Configuration for credentials, proxies, and progress Jul 18, 2023
@tgrandje
Copy link
Collaborator

About Q1: I'd say either here or there?

Maybe we should also document the fact that tokens are stored to a CSV on the current machine if not stored through environment variables (I seem to remember I moved to environment variables for this very reason).

About Q2: I like the idea about the _config dict! I'm totally ok with dropping the env storage myself, bearing in mind that I'm not in favour of storing uncrypted keys on the disk. Maybe we should add that behaviour in the config dict also.
There are other variables stored in the env also (like pynsee_file_list); I didn't check all of those (I just encountered one while checking your code). That would make sense to remove those also, doesn't it? @hadrilec what do you think?

@InseeFrLab InseeFrLab deleted a comment from codecov-commenter Jul 19, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jul 19, 2023

Codecov Report

Attention: 217 lines in your changes are missing coverage. Please review.

Comparison is base (6dec8af) 68.38% compared to head (977b67d) 67.43%.

Files Patch % Lines
pynsee/localdata/get_geo_list.py 14.28% 84 Missing ⚠️
pynsee/localdata/_get_geo_list_simple.py 27.77% 26 Missing ⚠️
pynsee/utils/config.py 80.55% 21 Missing ⚠️
pynsee/sirene/_make_dataframe.py 29.41% 12 Missing ⚠️
pynsee/localdata/get_included_area.py 35.29% 11 Missing ⚠️
pynsee/localdata/get_local_data.py 41.17% 10 Missing ⚠️
pynsee/metadata/get_legal_entity.py 79.16% 10 Missing ⚠️
pynsee/utils/_request_insee.py 67.85% 9 Missing ⚠️
pynsee/metadata/get_definition.py 90.16% 6 Missing ⚠️
pynsee/localdata/_find_latest_local_dataset.py 58.33% 5 Missing ⚠️
... and 10 more

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage   68.38%   67.43%   -0.95%     
==========================================
  Files         117      115       -2     
  Lines        3957     3894      -63     
==========================================
- Hits         2706     2626      -80     
- Misses       1251     1268      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

"!!! Token is missing, please check insee_key and insee_secret are correct !!!"
)
else:
logger.info("Token has been created")
Copy link
Collaborator

@tgrandje tgrandje Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we keep a trace of this in log (I've a project which was repeating those calls and this was helpful)? Would be better inside _get_token_from_insee instead, though.

@tgrandje
Copy link
Collaborator

tgrandje commented Jul 25, 2023

Ok, so the extensive values stored in environ by pynsee are:

  • pynsee_file_list: this is only used in tests, should we keep it or store it in _config?
  • pynsee_use_sdmx: I'm not sure but this one seems to be a "real" argument stored in environ for conveniance's sake?
  • insee_date_test: this is only used in tests; should not be stored in _config (but might be linked to the refactorization of tests)
  • pynsee_idbank_file: I'm not sure this one is used outside tests?
  • pynsee_idbank_loop_url (and PYNSEE_IDBANK_LOOP_URL): this is only used in one function (_dwn_idbank_files) and defaults to True. Is this used outside of tests (also not documented, I think) ?
  • pynsee_print_url : this one is related to a missed print in _request_insee and this is it. Should be set to logging info/debug and dropped altogether (this is also used in a notebook which I'm not convinced is where it should be).

EDIT: just took care of the last one

@hadrilec
Copy link
Contributor

Thanks a lot for the comprehensive list of env variables. They are used mainly as workarounds, I do agree it should be cleaned up, so thanks for raising this topic. Below, my take on the topic for each case:

  • pynsee_file_list: it can be included in a config dictionary
  • pynsee_use_sdmx: it can be included in a config dictionary
  • insee_date_test: this is used only for testing purposes, if you know a better way to perform the test related to this variable, let me know, otherwise I would prefer to keep it as it is
  • pynsee_idbank_file: it can be included in a config dictionary
  • pynsee_idbank_loop_url: this is not really useful, the code snippet using it should be removed and the variable not used anymore
  • pynsee_print_url: this is used mainly for development to check that the queries created are in line with expectations, it can be also useful for some users, I think it can be included in a config dictionary

@tfardet
Copy link
Collaborator Author

tfardet commented Jul 28, 2023

I'm totally ok with dropping the env storage myself, bearing in mind that I'm not in favour of storing uncrypted keys on the disk. Maybe we should add that behaviour in the config dict also.

Sure, we could set a variable for that, but in that case it would probably make sense to retain the config over multiple runs and that would require a config file. Though it's probably a good idea, maybe this could be done in a later PR.

In the meantime, we could probably write the credentials in a file with restricted access using 0o600 with an opener.

@hadrilec : noted, I'll do that during the 2nd week of August, when I'm back from holiday

@tgrandje
Copy link
Collaborator

@hadrilec I'll try to think on something with tests to reduce the need for config in a separate branch.

About pynsee_print_url, that's precisely why we introduced the logging feature in the first place : in the last commit I did, you just have to switch the logging level to get the "print"s. If you really want to I'll revert the commit, but I don't think this is needed.

@hadrilec
Copy link
Contributor

@tgrandje ok, I am fine for having the queries printed in the log but it should not go under INFO but under DEBUG, otherwise the user will be overwhelmed. Consequently, pynsee_print_url should be removed.

@tfardet
Copy link
Collaborator Author

tfardet commented Aug 8, 2023

I'm wondering about the _get_envir_token function: is it really necessary to have this test of the token (this is going to be repeated a lot)?
Can't we just work based on the error code from the requests if they fail?

@tgrandje
Copy link
Collaborator

tgrandje commented Aug 8, 2023

I'm not sure this is really slowing things (when not working from cache), though I'm saying this on a hunch and haven't time to check. In my experience, the major impact to speed is the API's rate (30requests/minute if I'm not mistaken) and there is not much to do about it...

But maybe we could store a session through the dict config altogether ?

@tfardet
Copy link
Collaborator Author

tfardet commented Aug 10, 2023

Storing a session in the dict is definitely a great idea, will do!

Regarding the token, besides the time it might take, it's also more of a "why is this needed?" question: the API is going to reply with a 401 unauthorized code if we provide incorrect headers (token) so I don't see the point of checking the token.

@tfardet tfardet marked this pull request as draft August 17, 2023 10:06
@tfardet
Copy link
Collaborator Author

tfardet commented Aug 20, 2023

@hadrilec just to clarify something: I still plan to allow some configs to be provided via environment variables but in the list that you provided, I'm not sure whether I should support all of them or not.

Could you tell me if you want to keep the possibility for people to provide these via env variables?

  • pynsee_file_list
  • pynsee_use_sdmx
  • pynsee_idbank_file
  • pynsee_print_url (this one I would propose to keep only as config var for sure)

@tfardet
Copy link
Collaborator Author

tfardet commented Aug 20, 2023

@hadrilec can you also confirm that I should remove these lines (as you said pynsee_idbank_loop_url should be removed) or whether I need to keep the while loop (i.e. should the removed variable default to True or False).

@hadrilec
Copy link
Contributor

@tfardet, indeed the while loop should be kept but the usage of the variable pynsee_idbank_loop_url should be removed

@hadrilec
Copy link
Contributor

@hadrilec just to clarify something: I still plan to allow some configs to be provided via environment variables but in the list that you provided, I'm not sure whether I should support all of them or not.

Could you tell me if you want to keep the possibility for people to provide these via env variables?

  • pynsee_file_list
  • pynsee_use_sdmx
  • pynsee_idbank_file
  • pynsee_print_url (this one I would propose to keep only as config var for sure)

I would say that the user should be able to modify the value of pynsee_idbank_file through an env variable, indeed the download of this file could fail in case insee decides to change the url and then this feature would be a backup at the python session initialization. Same for this variable pynsee_file_list

@tgrandje
Copy link
Collaborator

@tfardet I've started building the test refactorization from your branch. It was easier, given the fact that you simplified the configuration storage. It also makes sense, as you said you'll try to store the session object and I'm basically patching session with a CachedSession object.

Are you ok with me requesting merge directly in you branch ? (In which case, it should speedup the tests on your side, that's precisely how I found the error I mentionned today).

I'll wait for advice regarding the storage volume here beforehand of course.

@tfardet
Copy link
Collaborator Author

tfardet commented Aug 31, 2023

@tgrandje sorry, I did not have time to work on this lately, other stuff came up, probably won't have time to touch it before next week...

I'm not sure what you mean by "requesting merge": do you mean that you would send specific patches for the config or do you want to merge your whole CI caching PR?

If it's the former, it's completely fine; if it's the later, unless you manage to make your PR really small, I'm not sure it's a good idea.
In the 2nd case it's probably better to make it dependent on this PR (also as a draft) and we'll just merge them in order, no?

hadrilec and others added 3 commits September 2, 2023 23:35
- credentials are stored in the dict
- proxies are also taken from there

Removed all writing to environment except for the token.
Cleaned up bare excepts and imports.
@hadrilec
Copy link
Contributor

yes, I agree, I should have made the changes elsewhere, let's see if I time later on to decouple the changes

I deleted the function _wait_api_query_limit, because we drew the conclusion that is was not needed anymore, and after having performed several tests it is indeed the case.

regarding, the issue on "metadonnees" I understand that the connection is lost at some point, I an going to add a by default user agent to the header of the requests to see if it solves the problem as it is the advice described here:
https://stackoverflow.com/questions/61827856/python-connectionerror-httpsconnectionpoolhost-api-foursquare-com-port-443

pynsee/utils/_request_insee.py Outdated Show resolved Hide resolved
pynsee/utils/config.py Outdated Show resolved Hide resolved
pynsee/utils/config.py Outdated Show resolved Hide resolved
@hadrilec
Copy link
Contributor

hadrilec commented Nov 19, 2023

Hi @tfardet,

I executed the code below and whenever pynsee is imported queries are triggered to check whether the credentials are working well. I have the feeling that this snippet is also used by many other functions, hence the number of sessions and requests created might increase a lot without fulfilling a user need. I believe this explains why the tests are failing.

Could you please make sure the function _register_token is used only by init_conn and is not used by any other function?

import logging
import sys
logging.basicConfig(stream=sys.stdout,
                    level=logging.DEBUG,
                    format="%(message)s")

from pynsee import *

@tfardet
Copy link
Collaborator Author

tfardet commented Nov 19, 2023

@hadrilec ah, sorry, maybe I missed something, will check it out!

EDIT: could you post what you see? I don't get anything when using a registered token, only when using env variables, and only the necessary part:

Starting new HTTPS connection (1): api.insee.fr:443
https://api.insee.fr:443 "POST /token HTTP/1.1" 200 None
Obtained token: ...
Starting new HTTPS connection (1): api.insee.fr:443
https://api.insee.fr:443 "GET /series/BDM/V1/data/CLIMAT-AFFAIRES HTTP/1.1" 200 None
https://api.insee.fr:443 "GET /series/BDM/V1/dataflow/FR1/all HTTP/1.1" 200 None
https://api.insee.fr:443 "GET /metadonnees/V1/codes/cj/n3/5599 HTTP/1.1" 200 None
https://api.insee.fr:443 "GET /entreprises/sirene/V3/siret?q=activitePrincipaleUniteLegale:86.10*&nombre=1000 HTTP/1.1" 200 None
https://api.insee.fr:443 "GET /donnees-locales/V0.1/donnees/geo-SEXE-DIPL_19@GEO2020RP2017/FE-1.all.all HTTP/1.1" 200 None
Existing environment variables used, instead of locally saved credentials

@tfardet
Copy link
Collaborator Author

tfardet commented Nov 19, 2023

@hadrilec it runs just fine locally, and the code seems OK... can you post what seems wrong on your side?

@hadrilec
Copy link
Contributor

well, I have the same but I think this should not be the targeted behaviour. From a user point of view, I think it is better to start testing the connection with the api through the init_conn function and not with the import package. In addition, I suspect that as many functions seem to rely on set_config, the function testing the connection to all apis might be triggered all over again and again, is that right or is there something I misunderstand?

@tfardet
Copy link
Collaborator Author

tfardet commented Nov 19, 2023

  1. Regarding the current behavior:
    • I think that the whole point of providing environment variables is to not have to call init_conn so we do have to check and initialize the token from them on import
    • if the env variables are not set and you ran init_conn in the past, you can check that no requests are made
  2. Regarding get/set_config:
    • you can check that get_config never makes requests
    • set_config only calls _register_token if a valid-looking "token" entry or a "key/secret" pair is passed

@tfardet
Copy link
Collaborator Author

tfardet commented Nov 19, 2023

That being said, tests are now taking 4h to run, so there's definitely something wrong, I just don't see what it is, I'll investigate

EDIT: I haven't been able to find out what went wrong yet but the last run in less than 1h is 49bd3a9 while 2b9b436 is above 4h. @hadrilec you changed a lot of files between these two commits, any idea if you could have changed something relevant (the fact that the entire files appear changed makes it very hard to parse... I might actually have to restart a new PR from master to workaround that issue)

Last 1h run: https://github.com/InseeFrLab/pynsee/actions/runs/5655467051/usage (49bd3a9)
New 4h run: https://github.com/InseeFrLab/pynsee/actions/runs/6888365365/usage (2b9b436)

@tfardet
Copy link
Collaborator Author

tfardet commented Nov 19, 2023

@hadrilec honestly, this PR has become unmanageable IMO ^^"
Could you please redo your fix for the tests and move to 3.8/3.11 in another PR?

Once we merge that (hopefully the tests will be back to less than 1h), I'll redo a PR from that one.
Otherwise it's going to take forever to debug this.

@hadrilec
Copy link
Contributor

Thanks a lot @tfardet for the second PR!
Indeed, you are right, I guess it is the backup solution in case the multiprocessing fails that takes forever to run.

@hadrilec
Copy link
Contributor

hadrilec commented Mar 2, 2024

@tfardet @tgrandje, guys let me know if I can do anything to move forward

@tfardet
Copy link
Collaborator Author

tfardet commented Mar 2, 2024

@hadrilec given the amount of changes to the code since I started, I need to start this again from scratch.
Unfortunately, I have too many things on my plate at the moment, so I won't be able to work on this before April at best, and more probably May.

@tfardet tfardet closed this Mar 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve handling of progress bars
4 participants