Fix uploader to work without -replace flag #27
Comments
I've been thinking about potential solutions to this, but I've had no real luck so far. However, would it be possible to save the original _id in a new field for each document and base the links on that field rather than on the _id field? This would allow the _id fields to be updated while still maintaining the integrity of the connections between courses/profs/sections. In theory this sounds like it would work, but I'm worried that something major could break down in execution.

The other solution I could think of would replace the old _id values used for the links with the new _id values, but implementing that seems unwieldy and unnecessarily difficult, so I would much prefer the first option.

A variation on the first solution: instead of using _id fields to link courses/profs/sections, we could link on multiple identification fields, the same ones we use for merging. For example, courses would be identified by catalog_year, course_number, and subject_prefix. I'm not sure how viable this option is, but I wanted to throw it out there as an alternative to using _id fields altogether.

I'd love anyone's feedback on this, and I would also love to see others' solutions!
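To make the last idea concrete, here is a minimal sketch of linking by identification fields instead of by _id. Only `catalog_year`, `course_number`, and `subject_prefix` come from the comment above. The `course_key` field stored on the section, the `section_number` field, and all helper names are hypothetical, and plain dicts stand in for the real documents.

```python
from typing import Optional

# (catalog_year, subject_prefix, course_number)
CourseKey = tuple[str, str, str]


def course_key(course: dict) -> CourseKey:
    """Build the composite key from the fields already used for merging."""
    return (course["catalog_year"], course["subject_prefix"], course["course_number"])


def index_courses(courses: list[dict]) -> dict[CourseKey, dict]:
    """Index courses by composite key; the key has to be unique to act as a link."""
    index: dict[CourseKey, dict] = {}
    for course in courses:
        key = course_key(course)
        if key in index:
            raise ValueError(f"duplicate course key: {key}")
        index[key] = course
    return index


def resolve_course(section: dict, index: dict[CourseKey, dict]) -> Optional[dict]:
    """Follow a section's link using its stored composite key (hypothetical field)."""
    return index.get(tuple(section["course_key"]))


# The same course parsed in two separate runs receives two different _id values.
first_run = [{"_id": "a1", "catalog_year": "22", "subject_prefix": "CS", "course_number": "1337"}]
second_run = [{"_id": "b7", "catalog_year": "22", "subject_prefix": "CS", "course_number": "1337"}]

# The section never stores an _id, so its link survives either run.
section = {"section_number": "001", "course_key": ("22", "CS", "1337")}

found_before = resolve_course(section, index_courses(first_run))
found_after = resolve_course(section, index_courses(second_run))
```

The trade-off in this sketch is that every link lookup depends on the composite key staying unique and stable, so a correction to a course number would break links in the same way a changed _id does today.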
This is what I was thinking, maybe changing the primary key to a new field calculated like |
I think letting Mongo auto-generate the |
The uploader is now functional; however, it currently only works when the `-replace` flag is provided. This means that we can only ever replace all of the data in the DB rather than updating just the things that have changed. This is undesirable for several reasons, including that it makes the DB far more mutable than it needs to be and performs an enormous number of unnecessary writes.

Luckily, the main cause of this is fairly simple: when replacing old documents with a `$merge` pipeline, we cannot modify the immutable `_id` field of the original document. In other words, we cannot change the `_id` between the old and new versions.

There is a problem that comes with this, however, which I will outline with an overview of how the data collection process works:

[...] `_id` references)
[...] `$merge` aggregate

The problem lies in the second point above: any new "links" that have been created between newly parsed courses/profs/sections will be using new `_id`s, not the original ones. Thus, if we were to simply ignore the new `_id`s when performing the `$merge`, we would end up with countless invalid links.

Thoughts on how to resolve this are welcome; there are multiple ways that we could implement a solution.
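One possible direction, sketched here under stated assumptions rather than as the project's actual approach, is to translate freshly generated `_id`s back to the stored ones before uploading. Documents are matched on the merge fields mentioned in the comments (`catalog_year`, `subject_prefix`, `course_number`). The `course_reference` link field, the sample data, and every helper name are hypothetical, and plain dicts stand in for the real collections.

```python
from typing import Callable, Hashable


def course_key(doc: dict) -> tuple:
    """Merge key for courses, built from the fields named in the comments."""
    return (doc["catalog_year"], doc["subject_prefix"], doc["course_number"])


def build_id_remap(existing: list[dict], parsed: list[dict],
                   key_fn: Callable[[dict], Hashable]) -> dict:
    """Map each freshly generated _id to the stored _id that shares its merge key."""
    stored = {key_fn(doc): doc["_id"] for doc in existing}
    remap = {}
    for doc in parsed:
        old_id = stored.get(key_fn(doc))
        if old_id is not None:
            remap[doc["_id"]] = old_id
    return remap


def apply_remap(doc: dict, remap: dict, link_fields: list[str]) -> dict:
    """Return a copy of doc whose own _id and link fields use stored _ids."""
    out = dict(doc)
    out["_id"] = remap.get(doc["_id"], doc["_id"])
    for name in link_fields:
        value = out.get(name)
        if isinstance(value, list):
            # A link field may hold several references.
            out[name] = [remap.get(v, v) for v in value]
        elif value is not None:
            out[name] = remap.get(value, value)
    return out


existing_courses = [
    {"_id": "old-c1", "catalog_year": "22", "subject_prefix": "CS", "course_number": "1337"},
]
parsed_courses = [
    {"_id": "new-c1", "catalog_year": "22", "subject_prefix": "CS", "course_number": "1337"},
    {"_id": "new-c2", "catalog_year": "22", "subject_prefix": "CS", "course_number": "2336"},
]
parsed_sections = [
    {"_id": "new-s1", "course_reference": "new-c1"},
    {"_id": "new-s2", "course_reference": "new-c2"},
]

remap = build_id_remap(existing_courses, parsed_courses, course_key)
fixed_courses = [apply_remap(c, remap, []) for c in parsed_courses]
fixed_sections = [apply_remap(s, remap, ["course_reference"]) for s in parsed_sections]
```

In this sketch a course that already exists keeps its stored `_id`, so a later `$merge` matching on `_id` would not try to change it, while a genuinely new course keeps its fresh `_id`. Sections and professors would need their own remap built from their own merge keys, and the stored `_id`s would have to be read from the DB first, which is the extra cost of this approach.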