Unable to search for cassini LDD attributes in ISS datasets
#148
Comments
Confirmed that the document is in the index, and that the key is missing from the index mappings. Harvest date/time is 2022-06-28T23:09:48.274461Z, so my soft assumption is that this is the result of a bug or missing feature in harvest which has since been implemented. I would suggest re-harvesting that product and re-testing to confirm that the expected entries are added to the index mappings. @jordanpadams this will be pretty delicate and (computationally) expensive to fix with repairkit if it isn't a fairly isolated issue, because it requires either non-noop updates to the relevant fields or deletion/reinsertion once the mapping entries are added. The cleanest way to do it would probably be for repairkit to:
This should be idempotent and avoid any potential for data loss, and could be run from a local env to avoid blowing out the cloud-sweeper task runtime. What's the source for the mapping types? That DD you pointed me to a little while back?
After a little searching, it looks like there may be a slightly easier solution: apparently ES/OS documents are immutable, and therefore any meaningful (non-noop) update to a document will trigger a re-index of the entire document. Ergo, it should be sufficient to add all missing properties to the index, and then write a metadata flag value (showing that the document has been checked) to all unchecked documents, with no need to play around with temporary indices. EDIT: Yep, this is the case, tested and confirmed.
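For illustration, a minimal sketch of that approach using opensearch-py is below. This is not the actual repairkit/sweeper code: the index name, the mapped field and its mapping type, and the "ops:reindex_checked" flag field are placeholders chosen for the example.

```python
# Minimal sketch of the approach described above (NOT the actual repairkit/sweeper
# code). The index name, mapped field, mapping type, and the "ops:reindex_checked"
# flag field are placeholders for this example.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("user", "pass"))

# 1. Add the missing property mapping so (re-)indexed values become searchable.
client.indices.put_mapping(
    index="registry",
    body={"properties": {
        "cassini:ISS_Specific_Attributes/cassini:image_number": {"type": "keyword"},
    }},
)

# 2. Write a non-noop metadata flag to every document that hasn't been flagged yet.
#    Because ES/OS documents are immutable, this forces a full re-index of each
#    matched document, which picks up the newly added mapping. Re-running the sweep
#    skips already-flagged documents, so the operation is idempotent.
client.update_by_query(
    index="registry",
    body={
        "query": {"bool": {"must_not": {"exists": {"field": "ops:reindex_checked"}}}},
        "script": {
            "lang": "painless",
            "source": "ctx._source['ops:reindex_checked'] = params.checked_at",
            "params": {"checked_at": "2024-01-01T00:00:00Z"},
        },
    },
)
```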
Going well! Details? See above.
Implemented. The typename to use for the missing property mappings is TBD (@jordanpadams please weigh in on that). @jordanpadams @tloubrieu-jpl the sweeper queries twice - once to generate the set of missing mappings, then again to generate/write the doc updates once the mappings have been ensured. These two queries need to return consistent results; otherwise an old version of harvest could write new documents in the middle of a sweep, which would get picked up in the second stage but not the first. In that (theoretically-possible but shockingly-unlikely) event, those documents would erroneously be marked as fixed and excluded from future sweeps, and the only way to detect them would be to manually run the sweeper with the redundant-work filter disabled. Pick an option, in increasing order of rigor:
3 is the most correct option, but may not be compatible with our dockerized registry, so I'd prefer to go with 2, or 1 if you're absolutely sure no-one will run a pre-2023 version of harvest at just the wrong time.
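The numbered options themselves aren't reproduced above, but as an illustration of the general kind of guard being discussed, both passes could be bounded by a cutoff timestamp captured before the first query, so that documents written mid-sweep fall outside the sweep entirely. The harvest-timestamp field name below is an assumption, not necessarily what the registry actually uses.

```python
# Illustrative only - not necessarily one of the numbered options above, just the
# general idea of bounding both passes by the same cutoff. Field name is assumed.
from datetime import datetime, timezone

HARVEST_TS_FIELD = "ops:Harvest_Info/ops:harvest_date_time"  # assumed field name

def bounded(base_query: dict, cutoff_iso: str) -> dict:
    """Restrict a query to documents harvested strictly before the cutoff."""
    return {"bool": {
        "must": [base_query],
        "filter": [{"range": {HARVEST_TS_FIELD: {"lt": cutoff_iso}}}],
    }}

cutoff = datetime.now(timezone.utc).isoformat()
first_pass = bounded({"match_all": {}}, cutoff)   # discover missing mappings
second_pass = bounded({"match_all": {}}, cutoff)  # generate/write the doc updates
```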
Per @jordanpadams, log earliest/latest harvest timestamp for affected files, and a unique list of harvest versions. Pull these from the docs themselves.
Per @jordanpadams,
Status: implemented, in review.
Status: testing against MCP rms is in progress. The initial run went about 40 minutes and got roughly halfway through, then AOSS throttled. I doubt this is something we care to address if sweeper initialization is the only thing that hits whatever limits are imposed. @jordanpadams @tloubrieu-jpl once it checks out, want me to run it against all the other nodes, and include all the sweepers (not just the reindexer)? EDIT: The sweeper is exhibiting the same result-skipping behaviour as repairkit, which I should've seen coming. I'll implement the same fix as was applied there.
For rms, problems were detected for harvest version 3.8.1, and harvest timestamps 2022-06-28 through 2024-03-13. Logs were long due to many documents not having harvest versions and throwing warnings, so I haven't sent them through - @jordanpadams let me know if you'd like me to strip those out and send them.
Confirmed with
Status: running against large indices appears to overload those indices. Need to figure out a way to consistently page through the documents. Given the way it works:
PIT search is only available as of OpenSearch 2.4, and while AWS OpenSearch Service supports OpenSearch 2.15, AWS Serverless Collections currently uses OpenSearch 2.0, so point-in-time search is not available to us at this point and a stopgap solution must be implemented.
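As a sketch of the kind of stopgap available without PIT (not necessarily what ended up being implemented), search_after on a stable sort key supports deep pagination on OpenSearch 2.0. The client setup and the assumption that lidvid is a unique keyword field are mine.

```python
# PIT-free deep pagination via search_after - a sketch, not necessarily the stopgap
# that was actually implemented. Assumes an opensearch-py client and that "lidvid"
# is a unique keyword field usable as a stable sort key. Without PIT, documents
# written mid-iteration can still be skipped or seen twice.
def scan_index(client, index, query, page_size=500):
    last_sort = None
    while True:
        body = {"size": page_size, "query": query, "sort": [{"lidvid": "asc"}]}
        if last_sort is not None:
            body["search_after"] = last_sort
        hits = client.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        last_sort = hits[-1]["sort"]
```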
Status: continuing to test this more rigorously; ran into some issues on production indexes. Working on improving the algorithm to support this.
Status: the current run against ATM was terminated, as the cluster is having to do a bunch of redundant work (indexing is slow, resulting in duplicated updates being written, resulting in more requests, which probably affect indexing performance even if they're no-ops). The sweeper will be updated to pause until ~95(?)% of the pending updates have processed and been reflected in the remaining hits count. ATM will require reindexing once @sjoshi-jpl is back, because reasons.
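Roughly, that pause behaviour could look like the sketch below. The timeout is an added safeguard against a hit count that never changes, not something the sweeper is stated to do; the client and unswept_query are placeholders.

```python
# Rough sketch of "pause until ~95% of pending updates are reflected in the
# remaining hits count". The timeout is an added safeguard, not part of the
# described design; unswept_query and client are placeholders.
import time

def wait_for_updates_to_settle(client, index, unswept_query, batch_size,
                               threshold=0.95, poll_interval=30, timeout=1800):
    start = client.count(index=index, body={"query": unswept_query})["count"]
    target = start - int(threshold * batch_size)
    deadline = time.time() + timeout
    remaining = start
    while time.time() < deadline:
        remaining = client.count(index=index, body={"query": unswept_query})["count"]
        if remaining <= target:
            break
        time.sleep(poll_interval)
    return remaining  # caller can log and decide how to proceed if we timed out
```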
Status: flow management code is tested and operational. Completing ATM sweep, will review logs with @jordanpadams before running against other nodes.
Status: flow management code imperfect, sometimes waits for a hits change which never comes.
@alexdunnjpl is looking at how not to overload OpenSearch.
Development complete, running successfully against ATM (which will need manual reindexing when @sjoshi-jpl returns), will run now against another larger node to fully validate.
Some bugs needed to be fixed while testing on IMG. That will be ready for merge shortly.
Alex is running the sweepers in production.
PSA is in progress, should finish in the next day or two. IMG will need to be re-run after migration.
@tloubrieu-jpl PSA is complete. PR is ready for review/merge - will close that out now.
Checked for duplicates
Yes - I've already checked
Describe the bug
When I try to search by cassini:ISS_Specific_Attributes.cassini:image_number via the API, I get no results, when I should get many.
Expected behavior
I expected to be able to search by this field and get a result.
To Reproduce
https://pds.mcp.nasa.gov/api/search/1/products?q=(cassini:ISS_Specific_Attributes.cassini:image_number%20eq%20%221454725799%22) should return 1 result: https://pds-rings.seti.org/pds4/bundles/cassini_iss_saturn//data_raw/14547xxxxx/1454725799n.xml
Same thing in Kibana Discover, no go.
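One way to confirm the missing-mapping root cause directly is sketched below; the index name is a placeholder and the "/"-separated field path is an assumption about how the registry stores the attribute.

```python
# Diagnostic sketch: check whether the attribute has any mapping entry at all
# (the root cause identified in the comments). Index name is a placeholder; the
# "/"-separated field path is an assumption.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("user", "pass"))
field = "cassini:ISS_Specific_Attributes/cassini:image_number"
resp = client.indices.get_field_mapping(fields=field, index="registry")
for index_name, info in resp.items():
    mapping = info["mappings"].get(field)
    print(index_name, "->", mapping if mapping else "NO MAPPING ENTRY (unsearchable)")
```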
Environment Info
...
Version of Software Used
No response
Test Data / Additional context
No response
Related requirements
NASA-PDS/registry-api#539
Engineering Details
I am more broadly concerned that attributes throughout the system are randomly unsearchable because harvest is not, or was not, properly creating fields in the schema prior to loading them into the index. Not sure how we can scrub this, but a sweeper may be necessary to continually scan for and fix this.
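As a sketch of what such a sweeper's detection step might look like (illustrative only, not the registry-sweepers implementation):

```python
# Illustrative only (not the actual registry-sweepers code): one way a sweeper
# could detect top-level document fields that have no mapping entry, so they can
# be added and the document flagged for re-indexing.
def find_unmapped_fields(client, index, doc_source):
    mappings = client.indices.get_mapping(index=index)
    mapped = set()
    for index_mapping in mappings.values():  # resolves aliases to concrete indices
        mapped.update(index_mapping["mappings"].get("properties", {}).keys())
    # Only top-level keys are compared here; nested objects would need recursion.
    return {field for field in doc_source if field not in mapped}
```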
Integration & Test
No response