Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AL-2409] Enable local dataset usage with Indra #2573

Merged
merged 25 commits into from
Sep 12, 2023
Merged

Conversation

adolkhan
Copy link
Contributor

@adolkhan adolkhan commented Sep 1, 2023

🚀 🚀 Pull Request

Impact

  • Bug fix (non-breaking change which fixes expected existing functionality)
  • Enhancement/New feature (adds functionality without impacting existing logic)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

In this PR I am doing 2 main changes:

  1. Enabling running indra locally if either user is logged in or have specified token during vectorstore init
  2. Removing duplicate backend calls from bugout reporter. Previously when doing VectorStore init, bugout was sending 2 backend requests, one during actual VectorStore init another when the dataset creation/loading logic was called on a lower level. Now we are not doing backend requests to fetch a username, instead we just decode it directly from token that was either specified as token, exposed to env or user has logged in using CLI. This resulted in 2s reduction during dataset creation/loading:
  3. If you're logged in use compute_engine, while if you log out automatically change to python

Some Benchmarking on dataset creation using VecotrStore:

type of dataset Before the change (s) After the change (s)
local datasets without token 0.022420883178710938 0.037683963775634766
local datasets with token 2.718925952911377 0.037683963775634766
hub datasets with indra 16.12098789215088 12.905412197113037
hub datasets with tensor_db 15.676331996917725 12.73162317276001

Things to be aware of

May be its worth testing thoroughly bugout username reporting with different configurations

Things to worry about

Additional Context

@adolkhan adolkhan marked this pull request as draft September 1, 2023 13:11
@adolkhan adolkhan changed the title first changes [AL-2409] Enable local dataset usage with Indra Sep 1, 2023
@adolkhan adolkhan marked this pull request as ready for review September 7, 2023 12:50
@codecov
Copy link

codecov bot commented Sep 7, 2023

Codecov Report

Patch coverage: 85.71% and project coverage change: -3.97% ⚠️

Comparison is base (78109ea) 83.93% compared to head (84700a6) 79.96%.
Report is 20 commits behind head on main.

❗ Current head 84700a6 differs from pull request most recent head 84d12cc. Consider uploading reports for the commit 84d12cc to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2573      +/-   ##
==========================================
- Coverage   83.93%   79.96%   -3.97%     
==========================================
  Files         224      224              
  Lines       24504    24643     +139     
==========================================
- Hits        20567    19707     -860     
- Misses       3937     4936     +999     
Flag Coverage Δ
unittests 79.96% <85.71%> (-3.97%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
deeplake/client/client.py 73.03% <33.33%> (-15.54%) ⬇️
deeplake/util/bugout_reporter.py 81.52% <33.33%> (-2.00%) ⬇️
deeplake/core/vectorstore/vector_search/utils.py 77.23% <88.13%> (-20.28%) ⬇️
deeplake/api/dataset.py 81.18% <100.00%> (-0.75%) ⬇️
deeplake/client/config.py 100.00% <100.00%> (ø)
deeplake/core/dataset/dataset.py 90.16% <100.00%> (-2.69%) ⬇️
deeplake/core/vectorstore/deeplake_vectorstore.py 89.58% <100.00%> (-7.92%) ⬇️
.../core/vectorstore/vector_search/dataset/dataset.py 95.25% <100.00%> (-2.38%) ⬇️

... and 72 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@FayazRahman FayazRahman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, will also need to add more coverage

@@ -361,6 +363,7 @@ def empty(
lock_enabled: Optional[bool] = True,
lock_timeout: Optional[int] = 0,
verbose: bool = True,
username: str = "public",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have username parameter when it is inferred from credentials?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does token counts as creds?

@@ -4684,3 +4684,6 @@ def _temp_write_access(self):

def _get_storage_repository(self) -> Optional[str]:
return getattr(self.base_storage, "repository", None)

def get_user_name(self) -> Optional[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

@adolkhan adolkhan merged commit 424c4e0 into main Sep 12, 2023
6 checks passed
@adolkhan adolkhan deleted the vectorstore_indra_local branch September 12, 2023 10:51
@sonarcloud
Copy link

sonarcloud bot commented Sep 12, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

89.1% 89.1% Coverage
0.0% 0.0% Duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants