[8.11] Adds ELSER V2 PS and PA benchmarks (backport #2579) #2580

Merged Oct 27, 2023
Binary file removed docs/en/stack/ml/nlp/images/ml-nlp-elser-v1-v2.png
114 changes: 58 additions & 56 deletions docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -386,6 +386,30 @@
hardware and compares the model performance to {es} BM25 and other strong
baselines such as Splade or OpenAI.


[discrete]
[[version-overview]]
=== Version overview

ELSER V2 comes in two variants: a **platform specific** version that is
designed to run only on Linux with an x86-64 CPU architecture, and a
**platform agnostic or portable** version that can run on any platform.


[discrete]
==== ELSER V2

Besides the performance improvements, the biggest change in ELSER V2 is the
introduction of the first platform specific ELSER model - that is, a model that
can run only on Linux with an x86-64 CPU architecture. The platform specific
model is designed to work best on newer Intel CPUs, but it works on AMD CPUs as
well. New users of ELSER are recommended to use the platform specific
Linux-x86-64 model, as it is significantly faster than the platform agnostic
(portable) model that can run on any platform. ELSER V2 produces significantly
higher quality embeddings than ELSER V1. Regardless of which ELSER V2 model you
use (platform specific or portable), the embeddings produced for a given input
are identical.
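
For example, the variant you get is determined by the model ID you download.
The following console sketch shows how the platform specific variant could be
downloaded and deployed; the `.elser_model_2_linux-x86_64` model ID matches the
8.11 naming scheme, and the deployment ID is an arbitrary placeholder:

[source,console]
----
# Download the platform specific (Linux x86-64) variant of ELSER V2
PUT _ml/trained_models/.elser_model_2_linux-x86_64
{
  "input": {
    "field_names": ["text_field"]
  }
}

# Start a deployment once the model download has completed
POST _ml/trained_models/.elser_model_2_linux-x86_64/deployment/_start?deployment_id=my_elser_v2
----

To use the portable variant instead, substitute the `.elser_model_2` model ID
in both requests.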


[discrete]
[[elser-qualitative-benchmarks]]
=== Qualitative benchmarks
@@ -395,30 +419,11 @@
Discounted Cumulative Gain (NDCG) which can handle multiple relevant documents
and fine-grained document ratings. The metric is applied to a fixed-sized list
of retrieved documents which, in this case, is the top 10 documents (NDCG@10).
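
For reference, a commonly used formulation of the metric is shown below (a
reminder of the standard definition, not a statement about the exact evaluation
tooling used for these benchmarks):

[latexmath]
++++
\text{NDCG@10} = \frac{\text{DCG@10}}{\text{IDCG@10}},
\qquad
\text{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}
++++

where latexmath:[\mathrm{rel}_i] is the graded relevance of the document at
rank latexmath:[i] and IDCG@10 is the DCG@10 of an ideal ordering of the same
documents.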

The table below shows the performance of ELSER v2 compared to ELSER v1. ELSER v2
has 10 wins, 1 draw, 1 loss and an average improvement in NDCG@10 of 2.5%.

image::images/ml-nlp-elser-v1-v2.png[alt="ELSER v2 benchmarks compared to ELSER v1",align="center"]
_NDCG@10 for BEIR data sets for ELSER v2 and ELSER v1 (higher values are better)_

The next table shows the performance of ELSER v1 compared to {es} BM25 with an
English analyzer broken down by the 12 data sets used for the evaluation. ELSER
v1 has 10 wins, 1 draw, 1 loss and an average improvement in NDCG@10 of 17%.

image::images/ml-nlp-elser-ndcg10-beir.png[alt="ELSER v1 benchmarks",align="center"]
_NDCG@10 for BEIR data sets for BM25 and ELSER v1 (higher values are better)_
The table below shows the performance of ELSER V2 compared to BM25. ELSER V2
has 10 wins, 1 draw, 1 loss and an average improvement in NDCG@10 of 18%.

The following table compares the average performance of ELSER v1 to some other
strong baselines. The OpenAI results are separated out because they use a
different subset of the BEIR suite.

image::images/ml-nlp-elser-average-ndcg.png[alt="ELSER v1 average performance compared to other baselines",align="center"]
_Average NDCG@10 for BEIR data sets vs. various high quality baselines (higher_
_is better). OpenAI chose a different subset, ELSER v1 results on this set_
_reported separately._

To read more about the evaluation details, refer to
https://www.elastic.co/blog/may-2023-launch-information-retrieval-elasticsearch-ai-model[this blog post].
image::images/ml-nlp-bm25-elser-v2.png[alt="ELSER V2 benchmarks compared to BM25",align="center"]
_NDCG@10 for BEIR data sets for BM25 and ELSER V2 (higher values are better)_


[discrete]
@@ -435,36 +440,33 @@
realistic view of the model performance for your use case.
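
For a quick spot check before running a full benchmark, you can time single
{infer} calls against a deployed model with the `_infer` endpoint. A minimal
sketch, assuming the portable `.elser_model_2` model is already deployed and
using placeholder sample text; measure the latency from the client side:

[source,console]
----
# Run a single inference call against a deployed ELSER model
POST _ml/trained_models/.elser_model_2/_infer
{
  "docs": [
    { "text_field": "A sample passage representative of your own documents." }
  ]
}
----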


[discrete]
==== ELSER v1

Two data sets were utilized to evaluate the performance of ELSER v1 in different
hardware configurations: `msmarco-long-light` and `arguana`.

|==============================================================================================================
| **Data set** ^| **Data set size** ^| **Average count of tokens / query** ^| **Average count of tokens / document**
| `msmarco-long-light` ^| 37367 documents ^| 9 ^| 1640
| `arguana` ^| 8674 documents ^| 238 ^| 202
|==============================================================================================================

The `msmarco-long-light` data set contains long documents with an average of
over 512 tokens, which provides insights into the performance implications
of indexing and {infer} time for long documents. This is a subset of the
"msmarco" dataset specifically designed for document retrieval (it shouldn't be
confused with the "msmarco" dataset used for passage retrieval, which primarily
consists of shorter spans of text).

The `arguana` data set is a https://github.com/beir-cellar/beir[BEIR] data set.
It consists of long queries with an average of 200 tokens per query. It can
represent an upper limit for query slowness.

The table below presents benchmarking results for ELSER using various hardware
configurations.

|==================================================================================================================================================================================
| 3+^| `msmarco-long-light` 3+^| `arguana` |
| ^.^| inference ^.^| indexing ^.^| query latency ^.^| inference ^.^| indexing ^.^| query latency |
| **ML node 4GB - 2 vCPUs (1 allocation * 1 thread)** ^.^| 581 ms/call ^.^| 1.7 doc/sec ^.^| 713 ms/query ^.^| 1200 ms/call ^.^| 0.8 doc/sec ^.^| 169 ms/query |
| **ML node 16GB - 8 vCPUs (7 allocations * 1 thread)** ^.^| 568 ms/call ^.^| 12 doc/sec ^.^| 689 ms/query ^.^| 1280 ms/call ^.^| 5.4 doc/sec ^.^| 159 ms/query |
| **ML node 16GB - 8 vCPUs (1 allocation * 8 threads)** ^.^| 102 ms/call ^.^| 9.7 doc/sec ^.^| 164 ms/query ^.^| 220 ms/call ^.^| 4.5 doc/sec ^.^| 40 ms/query |
| **ML node 32 GB - 16 vCPUs (15 allocations * 1 thread)** ^.^| 565 ms/call ^.^| 25.2 doc/sec ^.^| 608 ms/query ^.^| 1260 ms/call ^.^| 11.4 doc/sec ^.^| 138 ms/query |
|==================================================================================================================================================================================
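
The allocation and thread counts in the table map directly onto deployment
start parameters. For example, the 1 allocation * 8 threads configuration in
the third row could be started as follows (a sketch using the ELSER V1 model
ID):

[source,console]
----
# Start an ELSER V1 deployment with 1 allocation and 8 threads per allocation
POST _ml/trained_models/.elser_model_1/deployment/_start?number_of_allocations=1&threads_per_allocation=8
----
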
==== ELSER V2

Overall, the platform specific V2 model ingested at a maximum rate of 26
docs/s, compared with the maximum rate of 14 docs/s measured in the ELSER V1
benchmark, resulting in a 90% increase in throughput.

The performance gain from virtual cores (that is, when the number of
allocations is greater than half of the available vCPUs) has also improved.
Previously, the increase in performance between 8 and 16 allocations was around
7%; it is now 17% (for ELSER V1 on 8.11) and 20% (for the ELSER V2 platform
specific model). These tests were performed on a 16 vCPU machine, with all
documents containing exactly 256 tokens.
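
The number of allocations can be changed on a running deployment, which makes
this kind of comparison straightforward to reproduce. A sketch, assuming a
deployment of the platform specific model that uses the model ID as its
deployment ID:

[source,console]
----
# Scale a running deployment from 8 to 16 allocations
POST _ml/trained_models/.elser_model_2_linux-x86_64/deployment/_update
{
  "number_of_allocations": 16
}
----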

IMPORTANT: The length of the documents in your particular dataset will have a
significant impact on your throughput numbers.

image::images/ml-nlp-elser-bm-summary.png[alt="Summary of ELSER V1 and V2 benchmark reports",align="center"]

**The platform specific** results show nearly linear growth up to 8
allocations, after which the performance improvements become smaller. In this
case, the throughput at 8 allocations was 22 docs/s, while the throughput at 16
allocations was 26 docs/s, indicating a 20% performance increase due to virtual
cores.

image::images/ml-nlp-elser-v2-ps-bm-results.png[alt="ELSER V2 platform specific benchmarks",align="center"]

**The platform agnostic** model's throughput at 8 and 16 allocations was 14
docs/s and 16 docs/s respectively, indicating a 12% performance improvement due
to virtual cores.

image::images/ml-nlp-elser-v2-pa-bm-results.png[alt="ELSER V2 platform agnostic benchmarks",align="center"]