Unable to search for cassini LDD attributes in ISS datasets #148

Closed
jordanpadams opened this issue Sep 16, 2024 · 23 comments · Fixed by #149

@jordanpadams
Member

jordanpadams commented Sep 16, 2024

Checked for duplicates

Yes - I've already checked

πŸ› Describe the bug

When I tried to search by cassini:ISS_Specific_Attributes.cassini:image_number via the API, I get no results, when I should get many.

πŸ•΅οΈ Expected behavior

I expected to be able to search by this field and get a result

📜 To Reproduce

https://pds.mcp.nasa.gov/api/search/1/products?q=(cassini:ISS_Specific_Attributes.cassini:image_number%20eq%20%221454725799%22) should return 1 result: https://pds-rings.seti.org/pds4/bundles/cassini_iss_saturn//data_raw/14547xxxxx/1454725799n.xml

The same query in Kibana Discover also returns nothing.
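For convenience, a hedged reproduction sketch of the failing request above, in Python with `requests`; it assumes the API lists matching products under a `data` key in the response body.

```python
import requests

# Same query as the URL above, with the URL-encoding decoded.
resp = requests.get(
    "https://pds.mcp.nasa.gov/api/search/1/products",
    params={"q": '(cassini:ISS_Specific_Attributes.cassini:image_number eq "1454725799")'},
)
products = resp.json().get("data", []) if resp.ok else []  # assumption: matches come back under "data"
print(resp.status_code, len(products), "product(s) returned")  # expected: 1, observed: 0
```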

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 NASA-PDS/registry-api#539

βš™οΈ Engineering Details

I am concerned more broadly that attributes throughout the systems are randomly unsearchable because harvest is not or was not properly creating fields in the schema prior to loading them into the index. Not sure how we can scrub this, but a sweeper may be necessary to somehow scan and fix this all the time.

🎉 Integration & Test

No response

@alexdunnjpl
Contributor

Confirmed that the document is in rms-registry and contains the relevant key and value.

Confirmed that the key is missing from the rms-registry _mapping (and in fact there is no mapping for any attribute referencing "cassini").
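A sketch of those two checks, using opensearch-py (assumed available) against a placeholder connection; the index name and field handling are illustrative, not the exact commands run.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection
INDEX = "registry"  # illustrative index name

# The document is present and carries the value in question...
total = client.search(index=INDEX, body={
    "query": {"query_string": {"query": '"1454725799"'}},
    "size": 0,
})["hits"]["total"]["value"]
print("documents containing the value:", total)

# ...but nothing cassini-related appears in the index mapping.
props = client.indices.get_mapping(index=INDEX)[INDEX]["mappings"]["properties"]
print("cassini mappings:", [name for name in props if "cassini" in name])  # [] as reported
```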

Harvest date/time is 2022-06-28T23:09:48.274461Z, so my soft assumption is that this is the result of a bug or missing feature in harvest which has since been addressed.

I would suggest re-harvesting that product and re-testing to confirm that the expected entries are added to the index mappings.

@jordanpadams this will be pretty delicate and (computationally) expensive to fix with repairkit if it isn't a fairly isolated issue, because it requires either non-noop updates to the relevant fields or deletion/reinsertion, once the mapping entries are added. The cleanest way to do it would probably be for repairkit to

  • iterate through the doc corpus and for each doc
    • update the mappings
    • flag documents requiring re-indexing using a metadata property
  • re-index flagged documents to a temporary index
  • delete flagged documents from the source index
  • reindex the temp index back to the source index
  • delete the temp index

This should be idempotent and avoid any potential for data loss, and could be run from a local env to avoid blowing out the cloud-sweeper task runtime.
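A rough sketch of that temp-index route, using opensearch-py (assumed); the index names and the flag field are placeholders, not the real repairkit metadata.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection
SRC, TMP = "registry", "registry-repair-tmp"                      # illustrative index names
flagged = {"term": {"ops:needs_reindex": True}}                   # hypothetical flag property

# Copy flagged docs out, delete them from the source, copy them back, drop the temp index.
client.reindex(body={"source": {"index": SRC, "query": flagged}, "dest": {"index": TMP}},
               wait_for_completion=True)
client.delete_by_query(index=SRC, body={"query": flagged})
client.reindex(body={"source": {"index": TMP}, "dest": {"index": SRC}},
               wait_for_completion=True)
client.indices.delete(index=TMP)
```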

What's the source for the mapping types? That DD you pointed me to a little while back?

@alexdunnjpl
Contributor

alexdunnjpl commented Sep 19, 2024

After a little searching, it looks like there may be a slightly easier solution: apparently ES/OS documents are immutable, so any meaningful (non-noop) update to a document triggers a re-index of the entire document.

Ergo, it should be sufficient to add all missing properties to the index mapping and then write a metadata flag value (showing that the document has been checked) for all unchecked documents, with no need to play around with temporary indices.

EDIT: Yep, this is the case, tested and confirmed.
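A minimal sketch of that approach with opensearch-py (assumed): ensure the missing mapping entries exist, then write a sweep flag to every unchecked document so each one gets rewritten, and therefore re-indexed, in place. The flag name, mapping entry, and field-name flattening are illustrative.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection
INDEX = "registry"                                                # illustrative index name
FLAG = "ops:reindexed"                                            # hypothetical sweeper metadata property

# 1. Add the missing property mappings (typenames resolved elsewhere).
client.indices.put_mapping(index=INDEX, body={"properties": {
    "cassini:ISS_Specific_Attributes/cassini:image_number": {"type": "keyword"},  # illustrative flattening
}})

# 2. Flag every unchecked document; the non-noop update rewrites (and re-indexes) each doc
#    against the newly added mappings.
client.update_by_query(index=INDEX, body={
    "query": {"bool": {"must_not": {"exists": {"field": FLAG}}}},
    "script": {"source": f"ctx._source['{FLAG}'] = true", "lang": "painless"},
}, conflicts="proceed")
```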

@nutjob4life
Member

Going well! 🎉 Details? See above ↑

@alexdunnjpl alexdunnjpl transferred this issue from NASA-PDS/registry-api Oct 2, 2024
@alexdunnjpl
Contributor

Implemented in index-mapping-repair, with two outstanding wrinkles:

Resolution of the missing property mapping typenames is TBD (@jordanpadams please weigh in on that).

@jordanpadams @tloubrieu-jpl the sweeper queries twice - once to generate the set of missing mappings, then again to generate/write the doc updates once the mappings have been ensured. These two queries need to return consistent results, otherwise an old version of harvest could write new documents in the middle of a sweep which would get picked up in the second stage but not the first.

In that (theoretically-possible but shockingly-unlikely) event, those documents would erroneously be marked as fixed and excluded from future sweeps, and the only way to detect them would be to manually run the sweeper with the redundant-work filter disabled. Pick an option, in increasing order of rigor:

  1. The likelihood of someone running an obsolete version of harvest at exactly the wrong time is functionally zero - don't guard against it.

  2. Instead of filtering to "documents which haven't been swept before", apply an additional constraint of "harvest time is earlier than sweeper execution start".

  3. Use a point-in-time search.

3 is the most-correct option, but may not be compatible with our dockerized registry, so I'd prefer to go with 2, or 1 if you're absolutely sure no-one will run a pre-2023 version of harvest at just the wrong time.
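A sketch of option 2's filter: both sweep passes constrain to "not yet swept AND harvested before the sweep started", so a document written mid-sweep by an old harvest is simply deferred to the next run. The flag and harvest-timestamp field names here are assumptions.

```python
from datetime import datetime, timezone

sweep_start = datetime.now(timezone.utc).isoformat()

unswept_before_start = {
    "bool": {
        "must_not": [{"exists": {"field": "ops:reindexed"}}],  # hypothetical sweep flag
        "filter": [{"range": {
            "ops:Harvest_Info/ops:harvest_date_time": {"lt": sweep_start}  # assumed field name
        }}],
    }
}
# Use `unswept_before_start` as the query for both the mapping-discovery pass
# and the update-generation pass.
```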

@alexdunnjpl
Contributor

alexdunnjpl commented Oct 3, 2024

Resolve missing types by cracking open the doc's blob, extracting the DD url, and reading it.

Cache downloaded DDs, cache cracked blobs, and avoid cracking for mappings which have already been resolved by the sweeper.
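A sketch of that caching, under the assumption that the DDs are fetchable as JSON over HTTP; the type-resolution walk itself is omitted and defaulted.

```python
from functools import lru_cache

import requests

@lru_cache(maxsize=None)
def fetch_dd(dd_url: str) -> dict:
    """Download and parse each data dictionary at most once per sweep."""
    resp = requests.get(dd_url, timeout=30)
    resp.raise_for_status()
    return resp.json()

resolved_types: dict[str, str] = {}  # property name -> mapping typename, shared across the sweep

def resolve_type(property_name: str, dd_url: str) -> str:
    """Skip blob-cracking/DD lookups for properties already resolved this sweep."""
    if property_name not in resolved_types:
        dd = fetch_dd(dd_url)
        # Walking `dd` for the attribute's declared type is omitted here; default to keyword.
        resolved_types[property_name] = "keyword"
    return resolved_types[property_name]
```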


@alexdunnjpl
Contributor

alexdunnjpl commented Oct 3, 2024

Per @jordanpadams: log the earliest/latest harvest timestamp for the affected files, and a unique list of harvest versions. Pull these from the docs themselves.
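A sketch of gathering those values in a single aggregation query (opensearch-py assumed; the flag and harvest field names are best-guess placeholders).

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection

aggs = client.search(index="registry", body={
    "size": 0,
    "query": {"bool": {"must_not": {"exists": {"field": "ops:reindexed"}}}},  # hypothetical flag
    "aggs": {
        "earliest": {"min": {"field": "ops:Harvest_Info/ops:harvest_date_time"}},  # assumed field
        "latest": {"max": {"field": "ops:Harvest_Info/ops:harvest_date_time"}},
        "versions": {"terms": {"field": "ops:Harvest_Info/ops:harvest_version", "size": 100}},
    },
})["aggregations"]

print("earliest:", aggs["earliest"].get("value_as_string"))
print("latest:", aggs["latest"].get("value_as_string"))
print("versions:", [b["key"] for b in aggs["versions"]["buckets"]])
```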

@alexdunnjpl
Contributor

Per @jordanpadams,

I think you can use the -dd indexes in the registry for tracking down these classes/attributes.
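A hedged sketch of looking an attribute up in a *-dd index; the index name and field names here are assumptions about the data-dictionary index schema, not verified.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection

hits = client.search(index="registry-dd", body={  # assumed name of the data-dictionary index
    "query": {"match": {"attribute_name": "cassini:ISS_Specific_Attributes.cassini:image_number"}},
    "size": 1,
})["hits"]["hits"]
print(hits[0]["_source"] if hits else "not found in registry-dd")
```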

@alexdunnjpl alexdunnjpl moved this from ToDo to ⚙ Review / QA in EN Portfolio Backlog Oct 10, 2024
@alexdunnjpl alexdunnjpl moved this from Release Backlog to ⚙ Review / QA in B15.1 Oct 10, 2024
@alexdunnjpl
Contributor

Status: implemented, in review.
Per @jordanpadams, review/live-test is postponed until next week, after the current site demos.

@alexdunnjpl
Contributor

alexdunnjpl commented Oct 14, 2024

Status: testing against MCP rms is in progress.

The initial run ran for about 40 minutes and got roughly halfway through before AOSS throttled it. I doubt this is something we care to address if sweeper initialization is the only thing that hits whatever limits are imposed.

@jordanpadams @tloubrieu-jpl the query referred to in the OP now successfully hits a single document. 1830h EDIT: Well, it did... it doesn't appear to now. I'll need to investigate further. EDIT 2: aaand it's working again. Probably just some weird reindexing behaviour.

Once it checks out, want me to run it against all the other nodes, and include all the sweepers (not just the reindexer)?

EDIT: Sweeper is exhibiting the same result-skipping behaviour as repairkit, which I should've seen coming. I'll implement the same fix as was applied there.

@alexdunnjpl
Contributor

For rms, problems were detected for harvest version 3.8.1, and harvest timestamps 2022-06-28 through 2024-03-13.

Logs were long due to many documents not having harvest versions and throwing warnings, so I haven't sent them through - @jordanpadams let me know if you'd like me to strip those out and send them.

@alexdunnjpl
Contributor

Confirmed with EN that a single run is sufficient to reindex all documents. Currently running manually against all nodes, storing logs for later analysis.

@alexdunnjpl
Contributor

Status: running against large indices appears to overload those indices. Need to figure out a way to consistently page through the documents.

Given the way it works:

  • the workload can be chunked without issue if the first/second query can be guaranteed to return the same result set (this means PIT, most likely, or sorting by harvest timestamp if that's infeasible), since any product which has yet to be processed will eventually be updated/reindexed, and any product which is updated/reindexed is guaranteed to have had its appropriate mappings created already (see the sketch after this list)

  • if that is difficult or impossible, a naive approach which pages blindly could work iff the update generation step also checks that the mapping is present, not yielding an update if a missing mapping exists at update-creation-time
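A sketch of that deterministic paging: sort on the harvest timestamp plus a unique tiebreaker and walk the corpus with search_after, so both sweep passes see the same ordering without PIT (opensearch-py assumed; field names are placeholders).

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection
SORT = [
    {"ops:Harvest_Info/ops:harvest_date_time": "asc"},  # assumed field name
    {"lidvid": "asc"},                                  # unique tiebreaker, assumed keyword field
]

def iter_unswept(index: str, page_size: int = 500):
    """Page deterministically through unswept docs so chunks remain consistent between passes."""
    search_after = None
    while True:
        body = {
            "size": page_size,
            "sort": SORT,
            "query": {"bool": {"must_not": {"exists": {"field": "ops:reindexed"}}}},  # hypothetical flag
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = client.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
```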

@alexdunnjpl
Contributor

PIT search is only available as of OpenSearch 2.4, and while AWS OpenSearch Service supports OpenSearch 2.15, AWS OpenSearch Serverless collections are currently on OpenSearch 2.0:

Serverless collections currently run OpenSearch version 2.0.x. As new versions are released, OpenSearch Serverless will automatically upgrade your collections to consume new features, bug fixes, and performance improvements.

so point-in-time search is not available to us at this point and a stopgap solution must be implemented.

@jordanpadams
Member Author

Status: continuing to test this more rigorously; ran into some issues on production indexes. Working on improving the algorithm to support this.

@alexdunnjpl
Contributor

Status: the current run against ATM was terminated, as the cluster is having to do a bunch of redundant work (indexing is slow, resulting in duplicated updates being written, resulting in more requests which probably affect indexing performance even when they're no-ops).

The sweeper will be updated to pause until 95(?)% of the pending updates have been processed and are reflected in the remaining-hits count.
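A sketch of that pause logic: after submitting a batch of updates, poll the remaining-hits count until roughly the configured fraction of the submitted updates is reflected. The names are placeholders; the client is any opensearch-py client.

```python
import time

def wait_for_updates_to_land(client, index: str, remaining_query: dict,
                             baseline_hits: int, submitted: int,
                             threshold: float = 0.95, poll_seconds: float = 10.0) -> int:
    """Block until the remaining-hits count reflects ~threshold of the submitted updates."""
    target = baseline_hits - int(submitted * threshold)
    while True:
        remaining = client.count(index=index, body={"query": remaining_query})["count"]
        if remaining <= target:
            return remaining
        time.sleep(poll_seconds)
```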

ATM will require reindexing once @sjoshi-jpl is back because reasons

@alexdunnjpl
Contributor

Status: flow management code is tested and operational. Completing ATM sweep, will review logs with @jordanpadams before running against other nodes.

@alexdunnjpl
Contributor

Status: the flow management code is imperfect; it sometimes waits for a hits change which never comes.
Next step: implement a check which stops stalling if the same hits count is returned n times in a row. The stall functionality will need to be extracted into its own class at this point - it's getting sufficiently complicated.
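A sketch of that extraction: a small stall helper that waits on the remaining-hits count but gives up once the same count comes back n times in a row (names and thresholds are placeholders).

```python
import time
from typing import Callable

class HitsCountStall:
    """Stall until the remaining-hits count drops to a target, but stop waiting if the
    count stops changing for `max_repeats` consecutive polls."""

    def __init__(self, max_repeats: int = 5, poll_seconds: float = 10.0):
        self.max_repeats = max_repeats
        self.poll_seconds = poll_seconds

    def wait(self, get_remaining_hits: Callable[[], int], target: int) -> int:
        previous, repeats = None, 0
        while True:
            remaining = get_remaining_hits()
            if remaining <= target:
                return remaining
            repeats = repeats + 1 if remaining == previous else 0
            if repeats >= self.max_repeats:
                return remaining  # the count has stopped moving; stop stalling and proceed
            previous = remaining
            time.sleep(self.poll_seconds)
```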

@tloubrieu-jpl
Member

@alexdunnjpl is looking at how not to overload OpenSearch.

@alexdunnjpl
Contributor

Development is complete and running successfully against ATM (which will need manual reindexing when @sjoshi-jpl returns); will now run against another, larger node to fully validate.

@tloubrieu-jpl
Member

Some bugs needed to be fixed while testing on IMG. That will be ready for merge shortly.

@tloubrieu-jpl
Member

Alex is running the sweepers in production.

@alexdunnjpl
Contributor

alexdunnjpl commented Nov 13, 2024

PSA is in progress and should finish in the next day or two.

IMG will need to be re-run after migration.

@tloubrieu-jpl tloubrieu-jpl removed their assignment Nov 19, 2024
@alexdunnjpl
Contributor

alexdunnjpl commented Nov 26, 2024

@tloubrieu-jpl PSA is complete. PR is ready for review/merge - will close that out now

@github-project-automation github-project-automation bot moved this from ⚙ Review / QA to 🏁 Done in B15.1 Nov 26, 2024
@github-project-automation github-project-automation bot moved this from ⚙ Review / QA to 🏁 Done in EN Portfolio Backlog Nov 26, 2024