Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] A procedure to rename job_groups_cancelled.id -> job_groups_cancelled.batch_id #14672

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

ichengchang
Copy link

@ichengchang ichengchang commented Sep 6, 2024

The rename_job_groups_cancelled_columnsql file renames the job_groups_cancelled.id column to job_groups_cancelled.batch_id. The sql also updates all constraints that reference the original column to reflect the new column name.

I have reviewed other tables and found no foreign keys referencing the job_groups_cancelled table.

All queries that previously used job_groups_cancelled.id have been updated to reference job_groups_cancelled.batch_id accordingly.

Resolve #14646

Copy link
Member

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ivan! Thanks so much for picking this up :) I haven't much experience on the batch system but I'll try my best to give accurate feedback. I have a few questions/observations up front:

  1. Your changes to stored procedures under batch/sql make me a little nervous.

Most of these are migrations applied in the order defined in the build step mentioned in [NOTE 1] except estimated-current.sql [NOTE 2].

I don't think changing these will have the desired effect and may make it impossible for someone to reproduce the database.
The only changes to existing sql you'll need to make are in the sql strings in python code.

  1. This needs to be written as a migration and maybe could be simplified?

I think this needs to be done as a database migration. We'll have no need for a stored procedure once complete.
You can assume current columns and constraints exist, dispense with the error checking and simplify.
Can you convert this to a sql script and add it to the end of the list of migrations in build.yaml? You'll probably want online: false too.
I fear you'll have to take inspiration from rename-job-groups-tables.sql by applying one ALTER TABLE command then drop and recreate EVERYTHING that references that name (constraints, triggers, procedures etc).
This will likely involve copy+paste and rename.
Alternatively, create, execute then drop the procedure within rename-job-groups-cancelled.

[NOTE 1] migration applied in build.yaml

The relevant build step in build.yaml can be found by searching for the entry starting with the yaml below. This controls which migrations are applied and in what order.

kind: createDatabase2
name: batch_database
databaseName: batch

[NOTE 2] estimated-current.yaml

I don't agree with why we have this. It would be nice to generate this automatically. Anyway, please keep your changes to this file as it's meant for documentation purposes only. None of it is applied and who knows how much of it works.

@ichengchang
Copy link
Author

Hi Ivan! Thanks so much for picking this up :) I haven't much experience on the batch system but I'll try my best to give accurate feedback. I have a few questions/observations up front:

  1. Your changes to stored procedures under batch/sql make me a little nervous.

Most of these are migrations applied in the order defined in the build step mentioned in [NOTE 1] except estimated-current.sql [NOTE 2].

I don't think changing these will have the desired effect and may make it impossible for someone to reproduce the database. The only changes to existing sql you'll need to make are in the sql strings in python code.

  1. This needs to be written as a migration and maybe could be simplified?

I think this needs to be done as a database migration. We'll have no need for a stored procedure once complete. You can assume current columns and constraints exist, dispense with the error checking and simplify. Can you convert this to a sql script and add it to the end of the list of migrations in build.yaml? You'll probably want online: false too. I fear you'll have to take inspiration from rename-job-groups-tables.sql by applying one ALTER TABLE command then drop and recreate EVERYTHING that references that name (constraints, triggers, procedures etc). This will likely involve copy+paste and rename. Alternatively, create, execute then drop the procedure within rename-job-groups-cancelled.

[NOTE 1] migration applied in build.yaml

The relevant build step in build.yaml can be found by searching for the entry starting with the yaml below. This controls which migrations are applied and in what order.

kind: createDatabase2
name: batch_database
databaseName: batch

[NOTE 2] estimated-current.yaml

I don't agree with why we have this. It would be nice to generate this automatically. Anyway, please keep your changes to this file as it's meant for documentation purposes only. None of it is applied and who knows how much of it works.

Got it! I wasn't sure how Hail usually does schema update. Based on your above description the process becomes clearer ro me. Here's my second try:

  • Updated build.yaml in the batch database migrations section.
  • Simplified the sql in rename-job-groups-cancelled-column.sql.

Do you mean estimated-current.sql rather than estimated-current.yaml above?

@ichengchang
Copy link
Author

@ehigham Also another question is how does the schema update enforce certain order of operations.

The rename-job-groups-cancelled-column sql should run before other sqls that depend on the modified column name in job_groups_cancelled table, correct?

batch/sql/finalize-job-groups.sql Outdated Show resolved Hide resolved
batch/sql/rename-job-groups-tables.sql Outdated Show resolved Hide resolved
@ehigham
Copy link
Member

ehigham commented Sep 9, 2024

Do you mean estimated-current.sql rather than estimated-current.yaml above?

Yes, sorry for the confusion

Also another question is how does the schema update enforce certain order of operations.

The rename-job-groups-cancelled-column sql should run before other sqls that depend on the modified column name in job_groups_cancelled table, correct?

Migrations are applied successively. You cannot edit a previous migration or the order in which they're applied as they've already been applied to the production database.
That's why I said this:

I fear you'll have to take inspiration from rename-job-groups-tables.sql by applying one ALTER TABLE command then drop and recreate EVERYTHING that references that name (constraints, triggers, procedures etc). This will likely involve copy+paste and rename.

I think you need to find any trigger or stored procedure that references that column, drop it and recreate it with the field renamed. It's a little scary.

@ichengchang
Copy link
Author

Do you mean estimated-current.sql rather than estimated-current.yaml above?

Yes, sorry for the confusion

Also another question is how does the schema update enforce certain order of operations.
The rename-job-groups-cancelled-column sql should run before other sqls that depend on the modified column name in job_groups_cancelled table, correct?

Migrations are applied successively. You cannot edit a previous migration or the order in which they're applied as they've already been applied to the production database. That's why I said this:

I fear you'll have to take inspiration from rename-job-groups-tables.sql by applying one ALTER TABLE command then drop and recreate EVERYTHING that references that name (constraints, triggers, procedures etc). This will likely involve copy+paste and rename.

I think you need to find any trigger or stored procedure that references that column, drop it and recreate it with the field renamed. It's a little scary.

@ehigham Thanks for your comments above. I’ve added the triggers and stored procedures referencing the job_groups_cancelled table in rename-job-groups-cancelled-column.sql.

I was initially confused by estimate-current.sql; I thought it was a system-generated file to track the latest batch DDLs after a schema update, rather than a file that is manually updated. After reading this thread, I completely agree with your point. In other organizations I've worked with, we maintained schema changes in a separate folder, identified by release versions (e.g., semver) and the DLLs are ordered by sequence number. This way, we had a clear history of DDLs and the order they were applied, eliminating the need for files like estimate-current.sql.

I just have one question: Do we need to manually update estimate-current.sql with the schema changes from rename-job-groups-cancelled-column.sql?

@ehigham
Copy link
Member

ehigham commented Sep 10, 2024

I just have one question: Do we need to manually update estimate-current.sql with the schema changes from rename-job-groups-cancelled-column.sql?

Yes. estimated-current.sql is an estimated current schema for documentation purposes only. Please update it to reflect the state of the database once your migration has been applied.

@ehigham
Copy link
Member

ehigham commented Sep 10, 2024

I think you missed reference in the python function delete_prev_cancelled_job_group_cancellable_resources_records

@ichengchang
Copy link
Author

I think you missed reference in the python function delete_prev_cancelled_job_group_cancellable_resources_records

I think you missed reference in the python function delete_prev_cancelled_job_group_cancellable_resources_records

Good catch! I’ve fixed it and also updated the id field referenced by the alias cancelled_t to batch_id in a few places.

@ehigham
Copy link
Member

ehigham commented Sep 13, 2024

Still seeing this error in the deploy_batch job:

utils.py	retry_long_running:923	in delete_prev_cancelled_job_group_cancellable_resources_records	
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 915, in retry_long_running
    return await f(*args, **kwargs)\n  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 959, in loop
    await f(*args, **kwargs)\n  File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 1485, in delete_prev_cancelled_job_group_cancellable_resources_records
    async for target in targets:\n  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 334, in execute_and_fetchall
    async for row in tx.execute_and_fetchall(sql, args, query_name):\n  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 257, in execute_and_fetchall
    await cursor.execute(sql, args)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 239, in execute
    await self._query(query)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 457, in _query
    await conn.query(q)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 469, in query
    await self._read_query_result(unbuffered=unbuffered)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 683, in _read_query_result
    await result.read()\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 1164, in read
    first_packet = await self.connection._read_packet()\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 652, in _read_packet
    packet.raise_for_error()\n  File "/usr/local/lib/python3.9/dist-packages/pymysql/protocol.py", line 219, in raise_for_error
    err.raise_mysql_exception(self._data)\n  File "/usr/local/lib/python3.9/dist-packages/pymysql/err.py", line 150, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1054, "Unknown column 'cancelled.id' in 'on clause'")

@ichengchang
Copy link
Author

ichengchang commented Sep 16, 2024

Still seeing this error in the deploy_batch job:

utils.py	retry_long_running:923	in delete_prev_cancelled_job_group_cancellable_resources_records	
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 915, in retry_long_running
    return await f(*args, **kwargs)\n  File "/usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py", line 959, in loop
    await f(*args, **kwargs)\n  File "/usr/local/lib/python3.9/dist-packages/batch/driver/main.py", line 1485, in delete_prev_cancelled_job_group_cancellable_resources_records
    async for target in targets:\n  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 334, in execute_and_fetchall
    async for row in tx.execute_and_fetchall(sql, args, query_name):\n  File "/usr/local/lib/python3.9/dist-packages/gear/database.py", line 257, in execute_and_fetchall
    await cursor.execute(sql, args)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 239, in execute
    await self._query(query)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/cursors.py", line 457, in _query
    await conn.query(q)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 469, in query
    await self._read_query_result(unbuffered=unbuffered)\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 683, in _read_query_result
    await result.read()\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 1164, in read
    first_packet = await self.connection._read_packet()\n  File "/usr/local/lib/python3.9/dist-packages/aiomysql/connection.py", line 652, in _read_packet
    packet.raise_for_error()\n  File "/usr/local/lib/python3.9/dist-packages/pymysql/protocol.py", line 219, in raise_for_error
    err.raise_mysql_exception(self._data)\n  File "/usr/local/lib/python3.9/dist-packages/pymysql/err.py", line 150, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1054, "Unknown column 'cancelled.id' in 'on clause'")

This error strikes me as odd because cancelled.id has been updated to cancelled.batch_id in delete_prev_cancelled_job_group_cancellable_resources_records:

batch/driver/main.py

Based on the error, it looks like the main.py being executed at /usr/local/lib/python3.9/dist-packages/batch/driver/main.py is still using the old version of the code, the changes from the PR were not correctly reflected in the environment. Is it possible that we might be missing a pip install step to ensure the latest code is deployed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rename job_groups_cancelled.id -> job_groups_cancelled.batch_id
2 participants