Add inner hits support to hybrid query #776

@martin-gaievski (Member) commented Jun 6, 2024

Description

This PR adds support for inner hits to the hybrid query. Inner hits are an OpenSearch feature that is already available for other query types but was not previously supported by the hybrid query.

Inner hits will be tracked the same way they are tracked for all other queries. They will contain details of inner hits for nested fields and for parent/child relationships between documents. The only catch is the score of an inner hit: it is the raw score from before normalization. Producing a normalized score is technically difficult because inner hits are processed in the fetch phase, which runs after the normalization processor has finished its work.
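For orientation, a request that exercises this feature could look roughly like the sketch below. It is an assumption rather than something taken from this PR: it wraps a standard nested query carrying an empty inner_hits object inside a hybrid query, and the field names (user, user.firstname) are borrowed from the nested example response in this description. The helper name build_hybrid_nested_request is hypothetical.

```python
import json


def build_hybrid_nested_request(path, field, value):
    """Build a search body: a hybrid query with one nested sub-query
    that asks for inner hits (hypothetical helper, for illustration only)."""
    nested_clause = {
        "nested": {
            "path": path,
            "query": {"match": {f"{path}.{field}": value}},
            # An empty object requests inner hits with default settings.
            "inner_hits": {},
        }
    }
    # The hybrid query takes its sub-queries in a "queries" array.
    return {"query": {"hybrid": {"queries": [nested_clause]}}}


body = build_hybrid_nested_request("user", "firstname", "john")
print(json.dumps(body, indent=2))
```

Normalized top-level scores would additionally need a search pipeline with a normalization processor; that part is left out of the sketch.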

The following are example responses that contain an inner hits section, first for a nested query and then for a parent/child query:

{
    "took": 79,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.540445,
        "hits": [
            {
                "_index": "index-test",
                "_id": "Sogqp48BjNYyAI8a4z9u",
                "_score": 1.540445,
                "_source": {
                    "doc_price": 100,
                    "doc_index": 4976,
                    "doc_location": {
                        "coordinates": [
                            [
                                -111.15,
                                45.12
                            ],
                            [
                                -109.83,
                                44.12
                            ]
                        ],
                        "type": "envelope"
                    },
                    "doc_location_2": "81.15, 44.12",
                    "doc_date": "02/03/2014",
                    "doc_point": {
                        "lon": 74.0,
                        "lat": 40.71
                    },
                    "id": "7ebe00c8-9858-11ee-b9d1-0242ac120002",
                    "doc_keyword": "workable",
                    "category": "permission",
                    "title": "Writing a list of random sentences is harder than I initially thought it would be.",
                    "user": {
                        "firstname": "john",
                        "age": 1,
                        "lastname": "black"
                    }
                },
                "inner_hits": {
                    "user": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 1.540445,
                            "hits": [
                                {
                                    "_index": "index-test",
                                    "_id": "Sogqp48BjNYyAI8a4z9u",
                                    "_nested": {
                                        "field": "user",
                                        "offset": 0
                                    },
                                    "_score": 1.540445,
                                    "_source": {
                                        "firstname": "john",
                                        "age": 1,
                                        "lastname": "black"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

{
    "took": 134,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "10",
                "_score": 1.0,
                "_routing": "1",
                "_source": {
                    "my_id": "10",
                    "text": "This is an answer",
                    "my_join_field": {
                        "name": "answer",
                        "parent": "5"
                    }
                },
                "inner_hits": {
                    "question": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 1.2039728,
                            "hits": [
                                {
                                    "_index": "index-test",
                                    "_id": "5",
                                    "_score": 1.2039728,
                                    "_source": {
                                        "my_id": "5",
                                        "text": "This is a question",
                                        "my_join_field": "question"
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "index-test",
                "_id": "11",
                "_score": 1.0,
                "_routing": "1",
                "_source": {
                    "my_id": "11",
                    "text": "This is second answer",
                    "my_join_field": {
                        "name": "answer",
                        "parent": "5"
                    }
                },
                "inner_hits": {
                    "question": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 1.2039728,
                            "hits": [
                                {
                                    "_index": "index-test",
                                    "_id": "5",
                                    "_score": 1.2039728,
                                    "_source": {
                                        "my_id": "5",
                                        "text": "This is a question",
                                        "my_join_field": "question"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
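To compare the score of each top-level hit with the scores of its inner hits, a client could walk the response as sketched below. This is a minimal illustration, not part of the PR: the function name collect_inner_hit_scores is hypothetical, and the sample is a trimmed copy of the parent/child response above that keeps only the fields the function reads.

```python
def collect_inner_hit_scores(response):
    """Return one (parent_id, section, inner_id, parent_score, inner_score)
    tuple per inner hit found in a search response dict."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        # "inner_hits" maps a section name (e.g. "question") to a nested hits block.
        for section, block in hit.get("inner_hits", {}).items():
            for inner in block["hits"]["hits"]:
                rows.append(
                    (hit["_id"], section, inner["_id"], hit["_score"], inner["_score"])
                )
    return rows


# Trimmed copy of the parent/child response shown above.
sample = {
    "hits": {
        "hits": [
            {
                "_id": "10",
                "_score": 1.0,
                "inner_hits": {
                    "question": {"hits": {"hits": [{"_id": "5", "_score": 1.2039728}]}}
                },
            },
            {
                "_id": "11",
                "_score": 1.0,
                "inner_hits": {
                    "question": {"hits": {"hits": [{"_id": "5", "_score": 1.2039728}]}}
                },
            },
        ]
    }
}

rows = collect_inner_hit_scores(sample)
for row in rows:
    print(row)
```

In this sample the inner hit score (1.2039728) is higher than the score of the hit that contains it (1.0), so the two values should not be compared as if they were on the same scale.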

Issues Resolved

#718

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski added the Features and backport 2.x labels Jun 6, 2024
@martin-gaievski changed the title from "Add inner_hits to hybrid query" to "Add inner hits support to hybrid query" Jun 6, 2024
@martin-gaievski force-pushed the poc_inner_hits_in_hybrid_query branch 3 times, most recently from b5ef151 to f4804af, Jun 7, 2024 00:52
@navneet1v (Collaborator)

> The only catch is the score of the inner hit: such scores will be before normalization. Having a normalized score is technically difficult because inner hits processing is done in the Fetch phase, which occurs after the normalization processor has finished its work.

@martin-gaievski do we have a path forward for resolving this technical challenge? Have we started a discussion about this with the Core team?

Also, I would like to take a step back here and ask: what does a normalized score mean for inner hits?

@navneet1v (Collaborator)

@martin-gaievski is this feature scoped for 2.15?

@martin-gaievski added the Enhancements label and removed the Features label Jun 7, 2024
@martin-gaievski (Member, Author) commented Jun 7, 2024

> @martin-gaievski is this feature scoped for 2.15?

There is no hard requirement for the version.

@martin-gaievski (Member, Author)

> The only catch is the score of the inner hit: such scores will be before normalization. Having a normalized score is technically difficult because inner hits processing is done in the Fetch phase, which occurs after the normalization processor has finished its work.
>
> @martin-gaievski do we have a path forward for resolving this technical challenge? Have we started a discussion about this with the Core team?
>
> Also, I would like to take a step back here and ask: what does a normalized score mean for inner hits?

For now there is no clear path forward. I'll be working on summarizing the technical hurdles we have. The short list is:

  • inner hits are collected in two steps: the query phase and the fetch phase
  • there is a separate query and query phase builder for inner hits; this inner query builder calls the core TopDocsCollector to get scores
  • hits are collected as part of the fetch phase, which runs after our normalization processor, so we cannot manipulate the inner scores and normalize them as part of the existing processor
  • there is no way in core today to skip score calculation in the fetch phase, so whatever we come up with in normalization will be overridden in the fetch phase
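The ordering problem in the list above can be illustrated with a toy model. This is a sketch under stated assumptions, not OpenSearch or plugin code: the function names, document IDs, and scores are invented, and min-max normalization stands in for whichever technique the normalization processor is configured with.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; map every document to 1.0 to avoid dividing by zero.
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}


def run_search(top_level_raw, inner_raw):
    """Toy pipeline: query phase, then normalization, then fetch phase."""
    # 1. Query phase: raw top-level scores are collected.
    query_phase_scores = dict(top_level_raw)
    # 2. The normalization processor runs between the query and fetch phases.
    normalized = min_max_normalize(query_phase_scores)
    # 3. Fetch phase: inner hits are scored here by their own query,
    #    after step 2 has already finished, so they keep raw scores.
    ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
    return [
        {"_id": doc, "_score": score, "inner_hit_score": inner_raw[doc]}
        for doc, score in ranked
    ]


# Hypothetical raw scores for three documents and their inner hits.
results = run_search(
    {"a": 2.0, "b": 1.0, "c": 1.5},
    {"a": 2.0, "b": 1.0, "c": 1.5},
)
for row in results:
    print(row)
```

In this model the top-level scores come out as 1.0, 0.5, and 0.0, while the inner hit scores stay at 2.0, 1.5, and 1.0, because nothing that runs in step 2 can see or adjust what step 3 computes later.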

@martin-gaievski force-pushed the poc_inner_hits_in_hybrid_query branch from e475041 to 895ae31, June 8, 2024 06:34
@martin-gaievski (Member, Author)

Closing for now, as this needs some additional investigation.
