Maximum size of job attributes (increase from 64K?) #14702

Open
jmarshall opened this issue Sep 26, 2024 · 1 comment

jmarshall commented Sep 26, 2024

We recently encountered a batch submission that eventually failed after numerous errors like this one — but nonetheless submitted a new batch containing zero jobs.

[…]
  File "/usr/local/lib/python3.10/site-packages/hailtop/utils/utils.py", line 792, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/hailtop/utils/utils.py", line 834, in retry_transient_errors_with_debug_string
    st = ''.join(traceback.format_stack())
. The most recent error was <class 'hailtop.httpx.ClientResponseError'> 500, message='Internal Server Error', url=URL('http://batch.hail/api/v1alpha/batches/485962/updates/1/jobs/create') body='500 Internal Server Error\n\nServer got itself in trouble'. 
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/hailtop/utils/utils.py", line 809, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/hailtop/aiocloud/common/session.py", line 117, in _request_with_valid_authn
    return await self._http_session.request(method, url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/hailtop/httpx.py", line 148, in request_and_raise_for_status
    raise ClientResponseError(
hailtop.httpx.ClientResponseError: 500, message='Internal Server Error', url=URL('http://batch.hail/api/v1alpha/batches/485962/updates/1/jobs/create') body='500 Internal Server Error\n\nServer got itself in trouble'
2024-09-25 01:54:55,288 - hailtop.utils 835 - WARNING - A transient error occured. We will automatically retry. We have thus far seen 50 transient errors (next delay: 60.0s).

The corresponding server-side error was

pymysql.err.DataError: (1406, \"Data too long for column 'value' at row 106\")

coming from the INSERT INTO job_attributes … query in insert_jobs_into_db().

We write a list of the samples being processed as a job attribute, and it turned out that for at least some of the jobs of this batch this list had grown to longer than 64K of text.

The job_attributes.value database field is of type TEXT, which limits each individual attribute value to 65,535 bytes (just under 64 KiB).
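
As a stopgap, a submitter could check attribute sizes on the client before creating jobs. The following is a minimal sketch, not part of hailtop: the helper name `oversized_attributes` and the example attribute dictionary are hypothetical, and it assumes the column stores values as UTF-8, so the limit applies to encoded bytes rather than characters.

```python
# MySQL TEXT columns hold at most 2**16 - 1 bytes (bytes, not characters).
TEXT_MAX_BYTES = 2**16 - 1


def oversized_attributes(attributes: dict[str, str], limit: int = TEXT_MAX_BYTES) -> dict[str, int]:
    """Return {name: size_in_bytes} for every attribute value larger than limit.

    Hypothetical helper for illustration; assumes values are stored as UTF-8.
    """
    sizes = {name: len(value.encode("utf-8")) for name, value in attributes.items()}
    return {name: size for name, size in sizes.items() if size > limit}


# Hypothetical job attributes: a short name and a very long sample list.
attrs = {
    "name": "joint-call",
    "samples": ",".join(f"SAMPLE{i:06d}" for i in range(6000)),
}
too_big = oversized_attributes(attrs)
print(too_big)
```

Running a check like this before submission would surface the problem as a clear client-side message instead of a retried 500 from the server.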

While writing a long list of sample ids as an attribute may or may not be a great idea 😄 it is fair to say that 64K is not a large maximum for user-supplied data here in the 21st century!

It may be worth adding a database migration to change the job_attributes.value column type (and perhaps also that of job_group_attributes.value) from TEXT to MEDIUMTEXT, which would raise the limit to 16 MiB. The cost appears to be one extra byte per row, because the stored length prefix grows from 2 bytes to 3.
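
For reference, the limits of MySQL's TEXT family follow directly from the size of the length prefix stored with each value. The short sketch below just works through that arithmetic; the variable names are illustrative.

```python
# Each TEXT-family type stores a length prefix of n bytes,
# so the largest value it can hold is 2**(8 * n) - 1 bytes.
PREFIX_BYTES = {"TINYTEXT": 1, "TEXT": 2, "MEDIUMTEXT": 3, "LONGTEXT": 4}

max_bytes = {name: 2 ** (8 * n) - 1 for name, n in PREFIX_BYTES.items()}

for name, limit in max_bytes.items():
    print(f"{name:<10} prefix={PREFIX_BYTES[name]} byte(s)  max={limit:,} bytes")

# Per-row overhead of moving from TEXT to MEDIUMTEXT.
extra_bytes_per_row = PREFIX_BYTES["MEDIUMTEXT"] - PREFIX_BYTES["TEXT"]
print(f"extra bytes per row: {extra_bytes_per_row}")
```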

@ehigham ehigham added batch needs-triage A brand new issue that needs triaging. labels Sep 26, 2024
@cjllanwarne cjllanwarne self-assigned this Sep 30, 2024
@cjllanwarne

Hi @jmarshall, the team talked about this issue in our standup today. We had some concerns about the appropriateness of using this table as a long-term storage area for larger metadata, and about the likely developer effort and system downtime needed to perform the migration. So we currently don't plan on prioritizing this in the immediate future, but do let us know if you have any concerns about that, or if it ends up being impossible for you to work around this, and we might be able to reconsider (or maybe come up with alternative solutions). Thanks!

@cjllanwarne cjllanwarne removed the needs-triage A brand new issue that needs triaging. label Sep 30, 2024
@cjllanwarne cjllanwarne removed their assignment Sep 30, 2024