
_msearch support when logging features #305

Open
mooreniemi opened this issue May 9, 2020 · 5 comments

@mooreniemi

Howdy! When logging features it sure would be faster to get to use the _msearch API, though what's documented in the examples is just _search. When I tried to use msearch, I hit:

~/.local/lib/python3.7/site-packages/elasticsearch/connection/base.py in _raise_error(self, status_code, raw_data)
    176
    177         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
--> 178             status_code, error_message, additional_info
    179         )

RequestError: RequestError(400, 'illegal_argument_exception', 'key [ext] is not supported in the metadata section')
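
For context, a hypothetical reconstruction of the kind of call in question (index, ids, and featureset names are made up; the sltr query and ext/ltr_log section follow the feature-logging examples in the LTR plugin docs):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Feature-logging search body per the LTR plugin docs (names assumed)
log_query = {
    "query": {"bool": {"filter": [
        {"terms": {"_id": ["7555", "1370"]}},
        {"sltr": {"_name": "logged_featureset",
                  "featureset": "more_movie_features",
                  "params": {"keywords": "rambo"}}},
    ]}},
    "ext": {"ltr_log": {"log_specs": {
        "name": "log_entry1",
        "named_query": "logged_featureset"}}},
}

# Works as a plain search:
es.search(index="tmdb", body=log_query)

# Attempting the same via msearch (header line + body line) is where the
# RequestError above appeared:
es.msearch(body=[{"index": "tmdb"}, log_query])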
@nathancday
Member

That's an interesting idea. What makes you think this would be faster?

It'd be cool to test; switching out the underlying function from elasticsearch doesn't seem like it would be too much work.

@nathancday
Member

To show this problem better, could you write out what it would look like in Elasticsearch's REST commands (to separate it from the repo code)? e.g. a series of POSTs to /index/feature_store/ {...}

@nathancday
Member

More of a note to myself: using msearch would cut down on the number of network connections made (1 vs. n-features), so there may be some performance gain from that.

@mooreniemi
Author

mooreniemi commented May 30, 2020

@nathancday yeah, sorry I didn't go into more explanation. Given that parallel_bulk gives a speedup in index/update etc. operations, and that the searching equivalent is msearch, which as you point out at the very least cuts down on network round trips, I would assume from first principles it's faster...

Given that logging features depends on iterating through (hopefully massive) qrel files, speeding up this operation is helpful. I have done the equivalent at the python level, but my guess is msearch would still be faster.

I can try to provide a benchmark if that would help motivate this.
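
Roughly the shape of benchmark I have in mind, as a minimal sketch against a local cluster (the index name and queries are placeholders; this is illustrative, not a rigorous measurement):

import time
from elasticsearch import Elasticsearch

es = Elasticsearch()
queries = [{"query": {"match": {"message": f"term {i}"}}} for i in range(100)]

# n round trips: one _search per query
start = time.time()
for q in queries:
    es.search(index="twitter", body=q)
print(f"search x100: {time.time() - start:.3f}s")

# 1 round trip: a single _msearch with alternating header/body lines
start = time.time()
body = []
for q in queries:
    body.append({"index": "twitter"})
    body.append(q)
es.msearch(body=body)
print(f"msearch x1: {time.time() - start:.3f}s")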

My rough sketch of the process is that qrels are then turned into queries and chunked up to push through as normal:

GET _msearch/template
{"index" : "twitter"}
{ "source" : "{ \"query\": { \"match\": { \"message\" : \"{{keywords}}\" } } } }", "params": { "query_type": "match", "keywords": "some message" } }
{"index" : "twitter"}
{ "source" : "{ \"query\": { \"match_{{template}}\": {} } }", "params": { "template": "all" } }

Again to be clear I'd just use the python msearch support directly on these files. :)
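
At the python level that would be roughly the following minimal sketch (the qrel-to-params step, index name, and chunk size are all hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch()

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# One dict of template params per qrel line (shape assumed for illustration)
qrel_params = [{"query_type": "match", "keywords": "some message"}]

template = '{ "query": { "match": { "message": "{{keywords}}" } } }'

for chunk in chunks(qrel_params, 100):
    body = []
    for params in chunk:
        body.append({"index": "twitter"})
        body.append({"source": template, "params": params})
    responses = es.msearch_template(body=body)["responses"]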

@nathancday
Member

I agree the change to msearch should be minimal at the python level.

I've used parallel_bulk before and ran into new errors (like timeouts), so I just wanted to get a better idea of what we'd gain.

I do like your sketched workflow, and msearch_template looks like the right tool to use. Would love to migrate this to an (in-progress) PR; I'm happy to help with benchmarking msearch vs. search.
