Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use natural keys for foreign key relationships during serialization/deserialization #33999

Merged
merged 8 commits into from
Jan 26, 2024

Conversation

gherceg
Copy link
Contributor

@gherceg gherceg commented Jan 21, 2024

Product Description

Technical Summary

I added what is hopefully a thorough description of this change in the comment for the dump method, but I'll rephrase that here and provide some additional proof for why this is necessary.

When loading domain data into a fresh environment from a file generated by the existing dump_domain_data command, we hit a ForeignKey violation when loading SQLUserData, which foreign keys to the Django auth.User object via the primary key.

The problem is that when we serialize our dumped data, we set use_natural_primary_keys=True to use natural keys for our sharded objects (the only place we defined natural_key methods in our code). However the auth.User defines a natural_key method, which ensures that the primary key is not included when serializing this object (docs here). Since SQLUserData foreign keys to this object, and we have use_natural_foreign_keys=False in our serializer, the foreign key relationship is serialized using the auth.User's primary key, instead of its natural key.

Setting use_natural_foreign_keys=True ensures that the serialized SQLUserData will reference auth.User's natural key. Also have glossed over it, but auth.User also defines a get_by_natural_key method which is used when deserializing data, enabling SQLUserData to foreign key to the correct object at that point.

Here's an example of what a serialized SQLUserData object looks like before and after this change, looking most closely at the value for the "django_user" field:

Before

{"model": "users.sqluserdata", "pk": 1, "fields": {"domain": "gherceg", "user_id": "32ea2287870244ca8c448b1664be4f41", "django_user": 2, "modified_on": "2023-12-12T03:09:48.924945Z", "profile": null, "data": {}}}

After

{"model": "users.sqluserdata", "pk": 1, "fields": {"domain": "gherceg", "user_id": "32ea2287870244ca8c448b1664be4f41", "django_user": ["[email protected]"], "modified_on": "2023-12-12T03:09:48.924945Z", "profile": null, "data": {}}}

The reason this came up recently is because SQLUserData is the first object we include in dumped data that foreign keys to auth.User. All other objects that foreign key to auth.User are excluded from the dump.

Feature Flag

Safety Assurance

Safety story

This code is run manually by developers and does not have any direct user facing impact.

I confirmed the following:

  • this does not impact the output of sharded objects that foreign key using form and case ids, since those foreign keys do not rely on the default model's primary key.
  • verified that natural_key is only used on sharded objects in our code
  • verified that get_by_natural_key is not used anywhere in our codej

The worst case is that this change corrupts the load_domain_data command, which is already broken because of this issue so 🤷🏻.

Automated test coverage

This area doesn't have the greatest test coverage. It is a bit tricky to ensure that all models are dumped and loaded properly, but I intend to spend time improving the test suite here. When debugging this specific issue, I used some simple unit tests, but those didn't feel worth committing as they really just verify django serialization behavior as it is documented.

QA Plan

N/A

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@gherceg gherceg changed the title Set use_natural_foreign_keys to True Set use_natural_foreign_keys to True when dumping domain data Jan 22, 2024
Copy link
Member

@dannyroberts dannyroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any equivalent change that needs to be made to the load code, to make it know to interpret these as the "natural" key?

(My two comments are not blocking—I agree that this won't break anything that currently works)

@gherceg
Copy link
Contributor Author

gherceg commented Jan 22, 2024

Danny asked a couple of important questions that have led to some more discoveries.

Is there any equivalent change that needs to be made to the load code, to make it know to interpret these as the "natural" key?

Through testing this and the behavior I observed, my understanding was that this change would only result in a different output for the SQLUserData model, and not for Form and Case related models. My unjustified rationale was that the foreign key fields for Form and Case related models didn't reference the primary key directly, and that this somehow impacted how they were serialized, whereas the SQLUserData foreign key field to auth.User depended on the primary key.

However this is not the case. I think this behavior is due to how our natural_key methods for form and case related models are defined. As I mentioned in the thread to Danny's other question, the django docs explicitly mention:

[The natural_key] method should always return a natural key tuple – in this example, (first name, last name).

So the natural_key method is being used for foreign key fields to CommCareCase and XFormInstance based on the Django serialization code, but since CommCareCase and XFormInstance object managers do not define a get_by_natural_key method and the natural_key method does not return a tuple, Django's deserialization code does not identify this field's value as a natural key.

All that to say, I don't foresee any immediate issues with this change, but we are not aligned with Django's recommended approach to using natural keys which could lead to unknown issues in future Django upgrades. Perhaps it is worth aligning with Django's recommendation just to be safe.

@gherceg
Copy link
Contributor Author

gherceg commented Jan 22, 2024

Actually, it looks like we do return a tuple for LedgerTransaction, which means as it stands right now, data generated with this change would not be successfully imported. I'm leaning towards doing what I said above, which will entail returning tuples for all natural_key methods and defining get_by_natural_key methods for each model that uses natural keys.

This updated any natural_key methods to ensure tuples were returned
which ensures the serialized keys will be usable in get_by_natural_key.
@dimagimon dimagimon added the Risk: High Change affects files that have been flagged as high risk. label Jan 22, 2024
@gherceg gherceg changed the title Set use_natural_foreign_keys to True when dumping domain data Use natural keys for foreign key relationships during serialization/deserialization Jan 22, 2024
This goes against Django's recommendation and has the potential to
change behavior in future Django upgrades. However it is worth doing to
both accommodate our custom SQL load code, and to avoid needing to do DB
lookups for cases and forms for every object that foreign keys to
either.
@gherceg
Copy link
Contributor Author

gherceg commented Jan 24, 2024

More updates! I ended up reverting back to returning the case_id and form_id for CommCareCase and XFormInstance instead of a tuple in 0d29b4e. After looking closer at Django's deserializer code for foreign keys with Daniel, the implications of returning a tuple as the natural key for CommCareCase and XFormInstance became more clear. Given that code, we would be incurring an additional DB lookup for a case or form when loading in large tables like CaseTransaction. There doesn't seem to be any reason we need to return tuples at the current time, but as I mentioned in comments in that change, Django upgrades could potentially break this in the future.

So the remaining piece was to ensure foreign key fields for both iterable natural keys and non-iterable natural keys are loaded correctly. I added tests in 65cdf6c testing both of those paths which confirm data is loaded as expected. I figured I could use a full "end to end" test to ensure SQLUserData is read from our DB, serialized, deserialized, and loaded as expected so added that in ce3cb54.

Includes some general cleanup as well
corehq/form_processor/models/cases.py Outdated Show resolved Hide resolved
corehq/form_processor/models/cases.py Outdated Show resolved Hide resolved
@gherceg gherceg merged commit d3468c9 into master Jan 26, 2024
13 checks passed
@gherceg gherceg deleted the gh/data-dump/use-natural-foreign-keys branch January 26, 2024 13:56
@gherceg gherceg mentioned this pull request Feb 1, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Risk: High Change affects files that have been flagged as high risk.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants