Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reducing the number of API calls on large batch translations #198

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

NorthOC
Copy link

@NorthOC NorthOC commented Feb 23, 2023

An API call is made for each item in translate_batch, even if the same item is repeated a hundred of times. This is slow. That is why my pull request aims to fix that:

  1. When a new batch item is translated, it is stored in a dictionary of already_translated items. The key for the translation in this dictionary is the previously untranslated sentence.
  2. The next item in the batch is then checked against the keys of already_translated items and, if an identical match is found, uses the existing translation.
  3. If there is no identical match, an API call is made for the translation.

Even with hundreds of items in a dictionary, checking against its keys is significantly faster than an API call.

The best part is that the already_translated dictionary will be used for the next big batch because it is initialized with the BaseTranslator class. In short, this means that only unique items will require an API call, while repeated items will be translated swiftly.

example:

translator = LibreTranslator(source='auto', target='es')
batch = ['Hello world', 'I am a python', 'Hello world', 'Hello world', 'I am a python']

translated_batch = translator.translate_batch(batch)
# API call (slow)
# API call (slow)
# Already translated (fast)
# Already translated (fast)
# Already translated (fast)
# Batch translation finished

# translator._already_translated = {
#   'Hello world' : 'Hola mundo',
#   'I am a python' : 'Soy un pitón'
# }

…ng against a dictionary of already translated identical sentences
for i, text in enumerate(batch):
translated = self.translate(text, **kwargs)
for _i, text in enumerate(batch):
if text in self._already_translated.keys:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it .keys or .keys()?
btw I think you can even remove the .keys since it checks keys by default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nidhaloff @NorthOC , Just a thought, wouldn't it be nice to use the python builtin lru_cache system here ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiaulislam Great idea, yes it would be nice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it .keys or .keys()? btw I think you can even remove the .keys since it checks keys by default

Removed redundant keys method on dict when iterating.

@nidhaloff @NorthOC , Just a thought, wouldn't it be nice to use the python builtin lru_cache system here ?

No idea how caching works :/

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiaulislam @NorthOC Alright, I think it's ok to merge this.
Maybe @jiaulislam can take a look at caching in a separate PR, if you have time of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiaulislam @NorthOC Alright, I think it's ok to merge this. Maybe @jiaulislam can take a look at caching in a separate PR, if you have time of course.

yes, certainly. I will do it after PR #202

@@ -27,6 +27,7 @@ def __init__(
payload_key: Optional[str] = None,
element_tag: Optional[str] = None,
element_query: Optional[dict] = None,
already_translated: dict[str, str] = {},
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you need this declaration here?

Copy link
Author

@NorthOC NorthOC Mar 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, I ran the deep-translator for days.

In my case, I ran into API errors due to LibreTranslator mirror server resets (to prevent bots from running 24/7).

I had already translated many batches, so instead of doing API calls again, when the server reset, I added the translated batches into already_translated and saved a shit ton of time (from 5 days of runtime for a full translation to less than 24 hours of runtime).

We're talking like hundreds of API calls per hour.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but is it really needed to pass already_translated in the constructor? wouldn't it be sufficient to hide this from the user and just define it inside the class?

Copy link
Owner

@nidhaloff nidhaloff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NorthOC unit tests are failing. I also asked a question about passing already_translated as a parameter and if this is needed

@NorthOC
Copy link
Author

NorthOC commented Mar 6, 2023

@nidhaloff Moved already_translated into the function for batch translations. I ran pytest, seems to pass all checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants