diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-bm25-elser-v2.png b/docs/en/stack/ml/nlp/images/ml-nlp-bm25-elser-v2.png new file mode 100644 index 000000000..157f1af6e Binary files /dev/null and b/docs/en/stack/ml/nlp/images/ml-nlp-bm25-elser-v2.png differ diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-average-ndcg.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-average-ndcg.png deleted file mode 100644 index d0b07440d..000000000 Binary files a/docs/en/stack/ml/nlp/images/ml-nlp-elser-average-ndcg.png and /dev/null differ diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-bm-summary.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-bm-summary.png new file mode 100644 index 000000000..493a64369 Binary files /dev/null and b/docs/en/stack/ml/nlp/images/ml-nlp-elser-bm-summary.png differ diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-ndcg10-beir.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-ndcg10-beir.png deleted file mode 100644 index 1befc5df5..000000000 Binary files a/docs/en/stack/ml/nlp/images/ml-nlp-elser-ndcg10-beir.png and /dev/null differ diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-v1-v2.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-v1-v2.png deleted file mode 100644 index cc811eaeb..000000000 Binary files a/docs/en/stack/ml/nlp/images/ml-nlp-elser-v1-v2.png and /dev/null differ diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-v2-pa-bm-results.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-v2-pa-bm-results.png new file mode 100644 index 000000000..c5eeac8f2 Binary files /dev/null and b/docs/en/stack/ml/nlp/images/ml-nlp-elser-v2-pa-bm-results.png differ diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-v2-ps-bm-results.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-v2-ps-bm-results.png new file mode 100644 index 000000000..7f51b5a65 Binary files /dev/null and b/docs/en/stack/ml/nlp/images/ml-nlp-elser-v2-ps-bm-results.png differ diff --git a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc 
b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc index e0b54aa9f..a23776378 100644 --- a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc +++ b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc @@ -386,6 +386,30 @@ hardwares and compares the model performance to {es} BM25 and other strong baselines such as Splade or OpenAI. +[discrete] +[[version-overview]] +=== Version overview + +ELSER V2 has a **platform specific** version that is designed to run only on +Linux with an x86-64 CPU architecture, and a **platform agnostic or portable** +version that can be run on any platform. + + +[discrete] +==== ELSER V2 + +Besides the performance improvements, the biggest change in ELSER V2 is the +introduction of the first platform specific ELSER model - that is, a model that +can run only on Linux with an x86-64 CPU architecture. The platform specific +model is designed to work best on newer Intel CPUs, but it works on AMD CPUs as +well. All new ELSER users are recommended to use the platform specific +Linux-x86-64 model, as it is significantly faster than the platform agnostic +(portable) model. ELSER V2 produces significantly higher quality embeddings +than ELSER V1. Regardless of which ELSER V2 model you use, platform specific or +platform agnostic (portable), the embeddings produced are the same. + + [discrete] [[elser-qualitative-benchmarks]] === Qualitative benchmarks @@ -395,30 +419,11 @@ Discounted Cumulative Gain (NDCG) which can handle multiple relevant documents and fine-grained document ratings. The metric is applied to a fixed-sized list of retrieved documents which, in this case, is the top 10 documents (NDCG@10). -The table below shows the performance of ELSER v2 compared to ELSER v1. ELSER v2 -has 10 wins, 1 draw, 1 loss and an average improvement in NDCG@10 of 2.5%.
- -image::images/ml-nlp-elser-v1-v2.png[alt="ELSER v2 benchmarks compared to ELSER v1",align="center"] -_NDCG@10 for BEIR data sets for ELSER v2 and ELSER v1 - higher values are better)_ -The next table shows the performance of ELSER v1 compared to {es} BM25 with an -English analyzer broken down by the 12 data sets used for the evaluation. ELSER -v1 has 10 wins, 1 draw, 1 loss and an average improvement in NDCG@10 of 17%. - -image::images/ml-nlp-elser-ndcg10-beir.png[alt="ELSER v1 benchmarks",align="center"] -_NDCG@10 for BEIR data sets for BM25 and ELSER v1 - higher values are better)_ +The table below shows the performance of ELSER V2 compared to BM25. ELSER V2 +has 10 wins, 1 draw, 1 loss and an average improvement in NDCG@10 of 18%. -The following table compares the average performance of ELSER v1 to some other -strong baselines. The OpenAI results are separated out because they use a -different subset of the BEIR suite. - -image::images/ml-nlp-elser-average-ndcg.png[alt="ELSER v1 average performance compared to other baselines",align="center"] -_Average NDCG@10 for BEIR data sets vs. various high quality baselines (higher_ -_is better). OpenAI chose a different subset, ELSER v1 results on this set_ -_reported separately._ - -To read more about the evaluation details, refer to -https://www.elastic.co/blog/may-2023-launch-information-retrieval-elasticsearch-ai-model[this blog post]. +image::images/ml-nlp-bm25-elser-v2.png[alt="ELSER V2 benchmarks compared to BM25",align="center"] _NDCG@10 for BEIR data sets for BM25 and ELSER V2 (higher values are better)_ [discrete] @@ -435,36 +440,33 @@ realistic view on the model performance for your use case. [discrete] -==== ELSER v1 - -Two data sets were utilized to evaluate the performance of ELSER v1 in different -hardware configurations: `msmarco-long-light` and `arguana`.
- -|============================================================================================================== -| **Data set** ^| **Data set size** ^| **Average count of tokens / query** ^| **Average count of tokens / document** -| `msmarco-long-light` ^| 37367 documents ^| 9 ^| 1640 -| `arguana` ^| 8674 documents ^| 238 ^| 202 -|============================================================================================================== - -The `msmarco-long-light` data set contains long documents with an average of -over 512 tokens, which provides insights into the performance implications -of indexing and {infer} time for long documents. This is a subset of the -"msmarco" dataset specifically designed for document retrieval (it shouldn't be -confused with the "msmarco" dataset used for passage retrieval, which primarily -consists of shorter spans of text). - -The `arguana` data set is a https://github.com/beir-cellar/beir[BEIR] data set. -It consists of long queries with an average of 200 tokens per query. It can -represent an upper limit for query slowness. - -The table below present benchmarking results for ELSER using various hardware -configurations. 
- -|================================================================================================================================================================================== -| 3+^| `msmarco-long-light` 3+^| `arguana` | -| ^.^| inference ^.^| indexing ^.^| query latency ^.^| inference ^.^| indexing ^.^| query latency | -| **ML node 4GB - 2 vCPUs (1 allocation * 1 thread)** ^.^| 581 ms/call ^.^| 1.7 doc/sec ^.^| 713 ms/query ^.^| 1200 ms/call ^.^| 0.8 doc/sec ^.^| 169 ms/query | -| **ML node 16GB - 8 vCPUs (7 allocation * 1 thread)** ^.^| 568 ms/call ^.^| 12 doc/sec ^.^| 689 ms/query ^.^| 1280 ms/call ^.^| 5.4 doc/sec ^.^| 159 ms/query | -| **ML node 16GB - 8 vCPUs (1 allocation * 8 thread)** ^.^| 102 ms/call ^.^| 9.7 doc/sec ^.^| 164 ms/query ^.^| 220 ms/call ^.^| 4.5 doc/sec ^.^| 40 ms/query | -| **ML node 32 GB - 16 vCPUs (15 allocation * 1 thread)** ^.^| 565 ms/call ^.^| 25.2 doc/sec ^.^| 608 ms/query ^.^| 1260 ms/call ^.^| 11.4 doc/sec ^.^| 138 ms/query | -|================================================================================================================================================================================== +==== ELSER V2 + +Overall, the platform specific V2 model ingested at a max rate of 26 docs/s, +compared with the max rate of 14 docs/s from the ELSER V1 benchmark, resulting +in a 90% increase in throughput. + +The performance of virtual cores (that is, when the number of allocations is +greater than half of the vCPUs) has increased. Previously, the increase in +performance between 8 and 16 allocations was around 7%. It has increased to 17% +(ELSER V1 on 8.11) and 20% (for ELSER V2 platform specific). These tests were +performed on a 16 vCPU machine, with all documents containing exactly 256 tokens. + +IMPORTANT: The length of the documents in your particular dataset will have a +significant impact on your throughput numbers.
+ +image::images/ml-nlp-elser-bm-summary.png[alt="Summary of ELSER V1 and V2 benchmark reports",align="center"] + +**The platform specific** model's results show nearly linear growth up to 8 +allocations, after which performance improvements become smaller. In this case, +the performance at 8 allocations was 22 docs/s, while the performance at 16 +allocations was 26 docs/s, indicating a 20% performance increase due to virtual +cores. + +image::images/ml-nlp-elser-v2-ps-bm-results.png[alt="ELSER V2 platform specific benchmarks",align="center"] + +**The platform agnostic** model's performance at 8 and 16 allocations is 14 +docs/s and 16 docs/s respectively, indicating a 12% performance improvement due +to virtual cores. + +image::images/ml-nlp-elser-v2-pa-bm-results.png[alt="ELSER V2 platform agnostic benchmarks",align="center"] \ No newline at end of file
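For reviewers unfamiliar with the metric used in the qualitative benchmarks above, NDCG@10 can be sketched in a few lines of Python. This is only an illustration of the metric's definition; the graded relevance values below are hypothetical and are not taken from the BEIR evaluation.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted Cumulative Gain: each document's graded relevance is
    discounted by log2 of its (1-based) rank position + 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the actual ranking normalized by the DCG of the
    ideal (best possible) ordering of the same documents."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded relevance (0-3) of the top 10 documents for one query.
ranking = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
print(round(ndcg_at_k(ranking, 10), 3))  # → 0.955
```

Because the metric is normalized per query, scores across the BEIR data sets are comparable and can be averaged, which is how the "average improvement in NDCG@10" figures above are reported.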