Search Architecture

Jeff Mataya edited this page Mar 22, 2017 · 5 revisions

Overview

We use ElasticSearch as our main source of caching and as a search engine across the FoxCommerce platform. To use it correctly, we must create and update indices, manage mappings per scope, and push information to ES. This document outlines how we approach those problems.

Approach

This section covers how we create and manage mappings and indices in ElasticSearch. At the most basic level, we want to create views of data and import them into ElasticSearch so that they can be queried in fast, structured ways. Then, we need to manage the evolution of that search over time by changing the structure of mappings and reindexing.

Definitions

ElasticSearch Terms

search_view

One or more Postgres tables that contain the raw data that can be searched. This data is consumed by Green River and pushed to ElasticSearch.

mapping

A mapping is a fundamental part of ElasticSearch. It consists of two parts:

  1. a definition of every field in the mapping and how ElasticSearch will analyze the contents of those fields
  2. the indexed data that we import from the search_view

We consider mapping definitions to be immutable, though that's not a strict requirement of ElasticSearch, and each search will consist of multiple versioned mappings.

A few rules govern what happens when an ElasticSearch mapping is updated:

  1. Adding a field to a mapping does not require data to be reindexed
  2. Changing the type or analyzer used on a field requires a complete reindex of the mapping
  3. Renaming a field requires a complete reindex of the mapping
  4. Removing a field requires a complete reindex of the mapping

The limitations in items 2-4 are what led us to create the search concept.
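The rules above can be sketched as a small check that compares two versions of a mapping definition. This is an illustration only: the dictionary layout (field name to a dict of settings) and the function name `requires_reindex` are assumptions made for the example, not part of the platform or of ElasticSearch.

```python
from typing import Dict

# Hypothetical representation: field name -> settings such as type and analyzer.
Mapping = Dict[str, Dict[str, str]]


def requires_reindex(old: Mapping, new: Mapping) -> bool:
    """Return True if moving from `old` to `new` needs a complete reindex."""
    for field, old_settings in old.items():
        if field not in new:
            # Field removed. A rename looks like a removal plus an addition.
            return True
        if new[field] != old_settings:
            # Type or analyzer changed on an existing field.
            return True
    # Only additions remain, and additions do not require a reindex.
    return False


v1 = {"title": {"type": "string", "analyzer": "standard"}}
added = {**v1, "sku": {"type": "string", "analyzer": "keyword"}}
retyped = {"title": {"type": "integer"}}
renamed = {"name": {"type": "string", "analyzer": "standard"}}

print(requires_reindex(v1, added))    # False
print(requires_reindex(v1, retyped))  # True
print(requires_reindex(v1, renamed))  # True
```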

alias

Think of an alias as a symlink in ElasticSearch. In short, an alias is a named shortcut for a mapping.
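To make the symlink analogy concrete, here is a toy in-memory model of an alias. It does not call ElasticSearch; the class `AliasTable` and its methods are hypothetical names used only to show why repointing an alias lets clients keep using one stable name.

```python
from typing import Dict


class AliasTable:
    """Toy model of aliases as named pointers. Not the ElasticSearch API."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def point(self, alias: str, mapping: str) -> None:
        # Repointing is a single assignment, so the alias always
        # resolves to exactly one mapping.
        self._targets[alias] = mapping

    def resolve(self, alias: str) -> str:
        return self._targets[alias]


aliases = AliasTable()
aliases.point("products", "products_search_view__v1")
# After a new version has been fully indexed, switch the alias over.
aliases.point("products", "products_search_view__v2")
print(aliases.resolve("products"))  # products_search_view__v2
```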

Fox Platform Terms

search

Search is a FoxCommerce-defined concept that allows us to aggregate data, analyze that data, and query against it. It rolls up a number of ElasticSearch primitives, such as aliases and mappings, to get around some of the inherent limitations of ES and give us zero-downtime deployments and updates.

scope

Scope is a data primitive that we use for permissioning across the platform. While many of the details of scopes are outside the scope (ha!) of this document, it's of note here because for searches that contain private data, we create an index per scope.

To illustrate this concept, let's look at the following example. In this example, we have three scopes: 1, 2, and 3. 1 is the parent of 2 and 3, and they form the following tree:

  1
 / \
2   3

Since all scopes are stored using the Postgres ltree format, we write each scope as a '.'-delimited path. Here are the three scopes written in the format that we will use going forward:

  • 1 = 1
  • 2 = 1.2
  • 3 = 1.3

Because of the parent-child nature of the relationship, scope 1 has access to all data under 1.2 and 1.3, while 1.2 and 1.3 only have access to their own data.
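The access rule amounts to a prefix check on the scope path. The following is a minimal sketch; the function name `can_access` is an assumption for illustration. It compares path segments rather than raw strings, so that a scope such as 1 is not mistaken for an ancestor of a scope such as 12.

```python
def can_access(user_scope: str, data_scope: str) -> bool:
    """True if data in `data_scope` is visible to a user in `user_scope`."""
    user = user_scope.split(".")
    data = data_scope.split(".")
    # A scope sees its own data and the data of every descendant,
    # so the user's path must be a prefix of the data's path.
    return data[: len(user)] == user


print(can_access("1", "1.2"))    # True: 1 is the parent of 1.2
print(can_access("1.2", "1.3"))  # False: siblings are isolated
print(can_access("1.2", "1"))    # False: children cannot see the parent
```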

Search Architecture

As mentioned earlier, a search is a structure that allows us to create and manage the lifecycle of a mapping. Here's a brief overview of the architecture of the search and how it fits into the overall context of the system.

         ┌─────────────────────┐
         │ API Gateway [nginx] │
         └──────────┬──────────┘  
                    │
         ┌──────────┴─────────┐           
┌────────┴────────┐   ┌───────┴───────┐
│ Application API │   │ ElasticSearch │
└────────┬────────┘   │ [index]       │
         │            └───────┬───────┘
         │                    │   ┌──────────┐  ┌──────────────────────────┐
         │                    ├──→│ products ├─→│ products_search_view__v2 │
         │                    │ ┌→│ [alias]  │  │ [mapping]                │
         │                    │ │ └──────────┘  └──────────────────────────┘
         │                    │ │               ┌──────────────────────────┐
         │                    │ │               │ products_search_view__v1 │
         │                    │ │               │ [mapping]                │
         │                    │ │               └──────────────────────────┘
         │                    │ │
         │                    │ │
         │                    │ │ ┌──────────┐  ┌──────────────────────────┐
         │                    └─+→│ orders   ├─→│ orders_search_view__v2   │
         │                      ├→│ [alias]  │  │ [mapping]                │
         │                      │ └──────────┘  └──────────────────────────┘
         │                      │               ┌──────────────────────────┐
         │                      │               │ orders_search_view__v1   │
         │                      │               │ [mapping]                │
         │                      │               └──────────────────────────┘
      ╭──┴─╮                    │
      │    │   ┌───────┐   ┌────┴────────┐
      │ DB ├──→│ Kafka │──→│ Green River │
      │    │   └───────┘   └─────────────┘
      ╰────╯

Search Results

As the simplified diagram above shows, an ElasticSearch index is populated by numerous search instances, each represented as an alias. There may be multiple mappings for each alias, but an alias only pulls results from the most recent version of its mapping.
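As a sketch of what "most recent version" means, the helper below picks the highest-numbered mapping for a search view. The `<view>__v<number>` naming convention is taken from the diagram; the function name `latest_mapping` is hypothetical, and the real alias is repointed explicitly rather than computed on each request.

```python
import re
from typing import Iterable, Optional

# Naming convention assumed from the diagram: <view>__v<number>
_VERSIONED = re.compile(r"^(?P<view>.+)__v(?P<version>\d+)$")


def latest_mapping(view: str, mappings: Iterable[str]) -> Optional[str]:
    """Return the highest-versioned mapping name for `view`, or None."""
    best = None
    best_version = -1
    for name in mappings:
        match = _VERSIONED.match(name)
        if match and match.group("view") == view:
            # Compare versions numerically so that v10 beats v9.
            version = int(match.group("version"))
            if version > best_version:
                best, best_version = name, version
    return best


names = [
    "products_search_view__v1",
    "products_search_view__v2",
    "orders_search_view__v1",
]
print(latest_mapping("products_search_view", names))  # products_search_view__v2
```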

This means that when a client makes a request to /api/v1/search/products, there are a number of things happening in the background.

  1. Nginx routes the request to the appropriate index (see Scoping for more details)
  2. Once inside the index, the request is routed to an alias
  3. The alias retrieves results from the specific mapping to which it's linked

Indexing

A process similar to the one described above is used to insert, update, and delete data in ElasticSearch.

  1. Postgres instances contain tables that match the schemas of the ElasticSearch mappings
  2. These tables are updated via Postgres triggers as changes occur in the system
  3. Those changes are streamed through Kafka and picked up by Green River
  4. Based on scope and an internal mapping, Green River decides what alias should be updated
  5. Green River makes a PUT against the appropriate alias
  6. The mapping that is linked to the alias gets updated
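Steps 4 and 5 might look roughly like the sketch below. It is not Green River's code: the `VIEW_TO_ALIAS` table stands in for the internal mapping, the row is assumed to carry an `id` field, and the `/<index>/<alias>/<id>` path layout is an assumption modeled on the `admin_1/products` form used in the scoping example. The function only builds the request; it does not send it.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical "internal mapping" from search view name to alias.
VIEW_TO_ALIAS = {
    "products_search_view": "products",
    "orders_search_view": "orders",
}


def build_put(index: str, view: str, row: Dict[str, Any]) -> Tuple[str, str, str]:
    """Return (method, path, body) for writing one row through an alias."""
    alias = VIEW_TO_ALIAS[view]
    # Assumed path layout: /<index>/<alias>/<document id>
    path = f"/{index}/{alias}/{row['id']}"
    return "PUT", path, json.dumps(row, sort_keys=True)


request = build_put("admin_1.2", "products_search_view", {"id": 7, "title": "Shoe"})
print(request)
```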

Scoping

Partitioning Data

As noted above, scopes are the primitives that we use to manage roles and permissions throughout the platform. We manage permissions in ElasticSearch by creating an index per scope, then letting Green River insert, update, and delete rows in the correct index or indexes.

All searches that can be scoped have a scope field in the Postgres table that backs the search. When Green River processes an update on one of those tables, it reads the scope field in the search view and writes the row to the correct indices.

Example

Suppose we have the scopes listed above (1, 1.2, and 1.3) and a search named products. Let's go through what happens when a new product is created in scope 1.2.

  1. Since we have three scopes, we have three admin indices: admin_1, admin_1.2, and admin_1.3.
  2. Green River picks up an event in Kafka when the product is created that has a scope of 1.2.
  3. Based on the format of 1.2, Green River creates a record in admin_1/products and admin_1.2/products.

Note that one of the side effects of this implementation is that we duplicate data in both the admin_1 and admin_1.2 indices. This gives us a very simple permissions model in which a user can only access the index that exactly matches their scope.
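The fan-out in this example can be sketched as follows. This is a minimal illustration, not Green River's actual code: the function name `target_indices` and the `prefix` parameter are assumptions, with the `admin_` default taken from the index names in the example.

```python
from typing import List


def target_indices(scope: str, prefix: str = "admin_") -> List[str]:
    """Return the index for `scope` and for each of its ancestors."""
    parts = scope.split(".")
    # Build one index name per prefix of the scope path: 1, then 1.2, ...
    return [prefix + ".".join(parts[: i + 1]) for i in range(len(parts))]


print(target_indices("1.2"))  # ['admin_1', 'admin_1.2']
print(target_indices("1"))    # ['admin_1']
```

A product created in scope 1.2 is therefore written twice, while admin_1.3 is never touched.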