
_msearch support when logging features #305

Open
mooreniemi opened this issue May 9, 2020 · 5 comments

@mooreniemi

Howdy! When logging features it sure would be faster to get to use the _msearch API, though what's documented in the examples is just _search. When I tried to use msearch, I hit:

~/.local/lib/python3.7/site-packages/elasticsearch/connection/base.py in _raise_error(self, status_code, raw_data)
    176
    177         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
--> 178             status_code, error_message, additional_info
    179         )

RequestError: RequestError(400, 'illegal_argument_exception', 'key [ext] is not supported in the metadata section')
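
For context, a hypothetical reconstruction of the kind of call in question (index, ids, and featureset names are made up; the sltr query and ext/ltr_log section follow the feature-logging examples in the LTR plugin docs):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Feature-logging search body per the LTR plugin docs (names assumed)
log_query = {
    "query": {"bool": {"filter": [
        {"terms": {"_id": ["7555", "1370"]}},
        {"sltr": {"_name": "logged_featureset",
                  "featureset": "more_movie_features",
                  "params": {"keywords": "rambo"}}},
    ]}},
    "ext": {"ltr_log": {"log_specs": {
        "name": "log_entry1",
        "named_query": "logged_featureset"}}},
}

# Works as a plain search:
es.search(index="tmdb", body=log_query)

# Attempting the same via msearch (header line + body line) is where the
# RequestError above appeared:
es.msearch(body=[{"index": "tmdb"}, log_query])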
@nathancday
Member

That's an interesting idea. What makes you think this would be faster?

It'd be cool to test; switching out the underlying function from elasticsearch doesn't seem like it would be too much work.

@nathancday
Member

To show this problem better, could you write out what it would look like in Elasticsearch's REST commands (to separate it from the repo code)? e.g. a series of POSTs to /index/feature_store/ {...}

@nathancday
Member

More of a note to myself: using msearch would cut down on the number of network connections made (1 vs. n-features), so there may be some performance gain from that.

@mooreniemi
Author

mooreniemi commented May 30, 2020

@nathancday yeah, sorry I didn't go into more explanation. Given that parallel_bulk gives a speedup in index/update etc. operations, and that the searching equivalent is msearch, which as you point out at the very least cuts down on network round trips, I would assume from first principles it's faster...

Given that logging features depends on iterating through (hopefully massive) qrel files, speeding up this operation is helpful. I have done the equivalent at the python level, but my guess is msearch would still be faster.

I can try to provide a benchmark if that would help motivate this.
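
Roughly the shape of benchmark I have in mind, as a minimal sketch against a local cluster (the index name and queries are placeholders; this is illustrative, not a rigorous measurement):

import time
from elasticsearch import Elasticsearch

es = Elasticsearch()
queries = [{"query": {"match": {"message": f"term {i}"}}} for i in range(100)]

# n round trips: one _search per query
start = time.time()
for q in queries:
    es.search(index="twitter", body=q)
print(f"search x100: {time.time() - start:.3f}s")

# 1 round trip: a single _msearch with alternating header/body lines
start = time.time()
body = []
for q in queries:
    body.append({"index": "twitter"})
    body.append(q)
es.msearch(body=body)
print(f"msearch x1: {time.time() - start:.3f}s")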

My rough sketch of the process is that qrels are then turned into queries and chunked up to push through as normal:

GET _msearch/template
{"index" : "twitter"}
{ "source" : "{ \"query\": { \"match\": { \"message\" : \"{{keywords}}\" } } } }", "params": { "query_type": "match", "keywords": "some message" } }
{"index" : "twitter"}
{ "source" : "{ \"query\": { \"match_{{template}}\": {} } }", "params": { "template": "all" } }

Again to be clear I'd just use the python msearch support directly on these files. :)
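
At the python level that would be roughly the following minimal sketch (the qrel-to-params step, index name, and chunk size are all hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch()

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# One dict of template params per qrel line (shape assumed for illustration)
qrel_params = [{"query_type": "match", "keywords": "some message"}]

template = '{ "query": { "match": { "message": "{{keywords}}" } } }'

for chunk in chunks(qrel_params, 100):
    body = []
    for params in chunk:
        body.append({"index": "twitter"})
        body.append({"source": template, "params": params})
    responses = es.msearch_template(body=body)["responses"]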

@nathancday
Member

I agree the change to msearch should be minimal at the python level.

I've used parallel_bulk before and ran into new errors (like timeouts), so I just wanted to get a better idea of what we'd gain.

I do like your sketched workflow, and msearch_template looks like the right tool to use. Would love to migrate this to an (in-progress) PR; I'm happy to help with benchmarking msearch vs. search.
