Upgrade to ElasticSearch 6 #5609

Closed
EnTeQuAk opened this issue Jun 13, 2018 · 26 comments · Fixed by mozilla/addons-server#14206

EnTeQuAk (Contributor) commented Jun 13, 2018

  • Add ElasticSearch 6 to Travis
  • Figure out exact ElasticSearch version
EnTeQuAk changed the title from "Add ElasticSearch 6 to Travis / Upgrade to ElasticSearch 6" to "Upgrade to ElasticSearch 6" on Jun 13, 2018
EnTeQuAk self-assigned this on Jun 14, 2018
EnTeQuAk added this to the 2018.06.28 milestone on Jun 14, 2018

EnTeQuAk (Contributor, Author) commented:

Slightly blocked by travis-ci/apt-source-safelist#379 - not a hard block though, we can get around it.

EnTeQuAk modified the milestones: 2018.06.28, 2018.07.05 on Jun 26, 2018

EnTeQuAk (Contributor, Author) commented:

Added an upstream patch: travis-ci/apt-source-safelist#385

EnTeQuAk (Contributor, Author) commented:

@bqbn are there any ops requirements regarding the timing of when this has to happen on our end?

bqbn (Contributor) commented Nov 21, 2018

We haven't started any work on the ES6 upgrade yet. From an ops perspective, we can probably target Q1/Q2 2019.

We'd certainly want to do this after we are done with the UTF8mb4 upgrade (https://bugzilla.mozilla.org/show_bug.cgi?id=1479111), which is slated to happen early Q1.

EnTeQuAk modified the milestones: 2019.01.23, 2019.01.31 on Jan 22, 2019
EnTeQuAk removed this from the 2019.01.31 milestone on Jan 30, 2019
diox self-assigned this on Mar 6, 2019

diox (Member) commented Mar 6, 2019

Going to give this a try on Travis to see where we are. I think we want to do this soon, but after Python 3 and possibly after Django 2.2 too.

bqbn (Contributor) commented Mar 6, 2019

> I think we want to do this soon, but after Python 3 and possibly after Django 2.2 too.

When is the earliest that those are going to happen?

And it looks like ops may not be able to start working on this in Q1 after all.

diox (Member) commented Mar 6, 2019

Python 3: We were thinking this week or the next. Sadly we missed the tag, so it's probably going to be next week.
Django 2.2: When it's done. It's currently in beta, and the port seems to be going well on our side, so we'll probably be ready when it's released sometime in April this year. Ideally we'd want to migrate as soon as it's released.

Maybe we can try to target early Q2 for Elasticsearch 6? Not sure yet how much work there is to do on both the dev and ops sides.

diox (Member) commented Mar 8, 2019

Good news: there aren't a lot of changes to make, so my branch is almost done; just a couple of tests to adjust, and the code is ready.

Bad news: the removal of mapping types hurts us. My branch has a refactor of the stats code to deal with this, but it requires splitting the stats index into 2 separate indexes, each containing a single type (one with update counts, one with download counts). This means we need to come up with a transition plan. A brutal approach would be to do a full reindex of stats on a new cluster with new code while serving the pages with the old cluster and old code (how long does that take?). An alternative would be to have ops figure out a way to move the mapping to the new indexes without requiring a full reindex, but I'm not sure what's possible here.
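
For illustration, here is a minimal sketch of the split described above using the elasticsearch Python client. The index names, field names and trimmed-down mappings are made up for the example and are not the actual addons-server schema.

```python
# Hedged sketch: ES6 only allows one mapping type per index, so the former
# "update_counts" and "download_counts" types each get their own index.
# Index/field names here are illustrative, not the real addons-server schema.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a reachable cluster on localhost:9200


def single_type_index(type_name):
    """Build an index body holding exactly one mapping type."""
    return {
        "mappings": {
            type_name: {  # ES6 still nests properties under a single type name
                "properties": {
                    "addon": {"type": "long"},
                    "date": {"type": "date"},
                    "count": {"type": "long"},
                }
            }
        }
    }


# One index per former mapping type instead of a single shared "stats" index.
es.indices.create(index="stats_update_counts", body=single_type_index("update_counts"))
es.indices.create(index="stats_download_counts", body=single_type_index("download_counts"))
```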

bqbn (Contributor) commented Mar 9, 2019

> A brutal approach would be to do a full reindex of stats on a new cluster with new code while serving the pages with the old cluster and old code (how long does that take?).

There will be a new cluster for sure. Because the current cluster is running v5, we don't plan to upgrade it in place; instead we plan to build a new cluster that runs v6.

Last time, when we upgraded from v1 to v5, we created a new cluster running v5 and ran ./manage.py reindex and ./manage.py reindex --with-stats to create the 2 indices on the new cluster. The two commands took ~12 hours to finish, and we actually promoted the code the next day (a Friday).

> An alternative would be to have ops figure out a way to move the mapping to the new indexes without requiring a full reindex, but I'm not sure what's possible here.

We're not sure how to do that yet. It sounds like it would require migrating data from the current cluster (v5) to the new cluster (v6). Nevertheless, we'd prefer the way we did it last time, i.e. running ./manage.py reindex and ./manage.py reindex --with-stats to create the indices on the new cluster.
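
For completeness, a hedged sketch of what the data-migration alternative could look like: ES6's reindex-from-remote API can pull documents straight from the v5 cluster and split the old two-type stats index into new single-type indexes. Host, index and type names below are illustrative, and the v5 host would have to be added to reindex.remote.whitelist on the v6 nodes; the ./manage.py reindex route described above avoids all of this.

```python
# Hedged sketch of the alternative (reindex-from-remote), not the preferred
# manage.py route. All host/index/type names below are illustrative.
from elasticsearch import Elasticsearch

es6 = Elasticsearch(["http://new-es6-cluster:9200"])  # run against the new cluster

for doc_type, dest_index in [
    ("update_counts", "stats_update_counts"),
    ("download_counts", "stats_download_counts"),
]:
    es6.reindex(
        body={
            "source": {
                "remote": {"host": "http://old-es5-cluster:9200"},
                "index": "stats",
                "type": doc_type,  # copy only documents of this mapping type
            },
            "dest": {"index": dest_index},
        },
        wait_for_completion=False,  # run as a background task on the cluster
    )
```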

diox (Member) commented Mar 9, 2019

> Last time, when we upgraded from v1 to v5, we created a new cluster running v5 and ran ./manage.py reindex and ./manage.py reindex --with-stats to create the 2 indices on the new cluster. The two commands took ~12 hours to finish, and we actually promoted the code the next day (a Friday).

OK, that works; I didn't know how much time it would take. Worth noting that ES6 is (mostly) compatible with ES5 clusters as long as you don't create a new index, but I'm not sure that buys us anything here.

EnTeQuAk removed their assignment on Mar 12, 2019

EnTeQuAk (Contributor, Author) commented Feb 5, 2020

@muffinresearch @diox @bqbn I just found this issue again and wondered what its priority is, especially since support for the latest ES 5 version ended in March last year.

Looking at https://www.elastic.co/support/eol, upgrading to ES6 will only be a stepping stone toward eventually landing on ES7, as ES6 itself won't be supported for much longer either.

diox (Member) commented Feb 5, 2020

I'd like to do it as soon as possible, we just need to find the time.

We'd need to refresh my branch, but it was almost good to go. If we wait for https://github.com/orgs/mozilla/projects/116, it might greatly simplify the migration: we might end up removing all individual stats storage on our side, which would mean not having to care about the mapping vs. types issue I mentioned earlier (that would leave only add-ons in our ES cluster).

bqbn (Contributor) commented Feb 5, 2020

How does late March sound to everyone? I also need to do some research and testing to see how different the new version is, configuration-wise.

Also, version-wise, how about we upgrade to v6.8.x this time? I mainly don't want to skip a version at this point.

bqbn (Contributor) commented Feb 19, 2020

We've built a v6.8.6 ES cluster in the -dev environment. Any suggestions on how we should proceed to switch to that cluster?

Hopefully, there is a way for us to run the new version in -dev and -stage for a week or two before upgrading -prod. :)

diox (Member) commented Feb 19, 2020

I was hoping the new stats storage would be ready in time, but that looks unlikely at this point, since we haven't started the work on it.

Because we need a new data structure, it's going to be difficult to have this on dev & stage for more than a week: keeping the code compatible with both data structures is not trivial, and I don't think we have the developer time to do it.

My plan was:

  • Finish up the branch, make a PR, have it reviewed
  • Merge it on a Tuesday after tagging
  • Deploy it on dev (*)
  • Let it sit on dev for a week
  • If we notice bugs before tagging, fully revert the code, switch dev to the old cluster and reindex
  • Deploy it on stage (*)
  • Deploy it on prod 2 days later (*)

(*) requires doing what you were talking about in a comment above

bqbn (Contributor) commented Feb 20, 2020

OK, the plan is about the same as what I have in mind, except that it has a tighter schedule. I think it's still reasonable though, and we should try it.

One question I have: is it correct that only the cron jobs trigger writes to the ES cluster? For example, if I disable the cron jobs on the admin instance, is it right that no other tasks will write to the ES cluster?

Meanwhile, I wrote a draft deployment plan based on our last upgrade:

https://docs.google.com/document/d/1pP7KK6RWXBKTjLkKh5NnfrUaHbS1mAGonOij-8zGJ8U/edit?usp=sharing

@diox @EnTeQuAk can you take a look and comment as needed?

Thanks.

diox (Member) commented Feb 20, 2020

> One question I have: is it correct that only the cron jobs trigger writes to the ES cluster? For example, if I disable the cron jobs on the admin instance, is it right that no other tasks will write to the ES cluster?

No, writes to the ES cluster are done by celery tasks, which can be triggered by a bunch of things, like someone saving an add-on. Shutting down the cron helps though, as it disables add-on auto-approvals (a source of add-on changes) and disables stats indexing (which only happens once a day anyway).
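
To make that write path concrete, a minimal sketch using hypothetical task, signal and index names (this is not the actual addons-server code) of how saving a model can fan out to an ES write through Celery:

```python
# Hedged sketch of the write path described above: saving an add-on fires a
# post_save signal, which queues a Celery task, which writes to Elasticsearch.
# Task, model and index names are hypothetical, not the real addons-server code.
from celery import shared_task
from django.db.models.signals import post_save
from django.dispatch import receiver
from elasticsearch import Elasticsearch


@shared_task
def index_addon(addon_id):
    """Index (or re-index) a single add-on document."""
    es = Elasticsearch()
    doc = {"id": addon_id}  # real code would serialize the whole add-on here
    es.index(index="addons", doc_type="addons", id=addon_id, body=doc)


@receiver(post_save, sender="addons.Addon")  # lazy string reference to the model
def addon_saved(sender, instance, **kwargs):
    # Queue the ES write asynchronously; cron jobs are only one of many triggers.
    index_addon.delay(instance.pk)
```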

diox (Member) commented Feb 25, 2020

Because I haven't had much time to work on this yet, and the stats project has started getting some traction, we're back to waiting on it for a while to see if that helps us.

Being able to get rid of the stats-related indexes would greatly simplify the migration (I suspect it would make updating my branch and fixing tests almost trivial) and give us the ability to roll back if things go wrong.

AlexandraMoga commented:

@diox this requires some regression testing on search, I presume?

diox (Member) commented May 29, 2020

Yes. I already did some when it landed yesterday and everything seemed to work. One area that needs special attention is statistics.

AlexandraMoga commented:

Search seems to work as before, and add-on stats (i.e. users, ratings - if available) are still present in search results and on the add-on detail page.

@diox the statistics dashboard shows a continuous loading indicator for each add-on I've checked on -dev - see example. Do you think the upgrade might have caused this?

diox (Member) commented May 29, 2020

@AlexandraMoga did those add-ons have working statistics before? Stats on dev come from actual requests from users who have their Firefox configured to hit dev's versioncheck, so very few add-ons will have working stats (and those that do will have very low numbers).

AlexandraMoga commented:

@diox as far as I remember, even if the add-on had no stats, the graphs would still load, although empty. Now they are all stuck in a loading state.

diox (Member) commented May 29, 2020

This was caused by #7615, which has been fixed by #7630.

You can now see it working on add-ons that do have some data, like https://addons-dev.allizom.org/en-US/firefox/addon/awesome-screenshot-plus-/statistics/usage/?last=7

AlexandraMoga commented:

Yep, stats are showing up now. Here's another add-on on -dev that has some actual stats to show: https://addons-dev.allizom.org/en-US/firefox/addon/view-page-archive-cache/statistics/?last=90

bqbn (Contributor) commented May 30, 2020

Just an FYI, the following deprecation warnings are logged by the new ES cluster:

[2020-05-29T17:52:32,692][WARN ][o.e.d.i.a.AnalysisModule ] [i-019b7ddc2f87459fc] The [standard] token filter is deprecated and will be removed in a future version.
[2020-05-29T21:01:06,479][WARN ][o.e.d.i.q.f.RandomScoreFunctionBuilder] [i-019b7ddc2f87459fc] As of version 7.0 Elasticsearch will require that a [field] parameter is provided when a [seed] is set

Something to consider for the next upgrade. :)
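
For whoever picks this up for ES7, a hedged sketch of the two adjustments those warnings point at; the analyzer and query bodies below are illustrative, not the actual addons-server settings:

```python
# Hedged sketch addressing both deprecation warnings ahead of ES7.
# Analyzer/query names are illustrative, not the real addons-server config.

# 1) Custom analyzers should stop listing the deprecated "standard" token
#    filter (it was a no-op); the standard tokenizer alone is enough.
analysis_settings = {
    "analysis": {
        "analyzer": {
            "my_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],  # previously: ["standard", "lowercase"]
            }
        }
    }
}

# 2) From ES7 on, random_score with a seed must also name a field; _seq_no is
#    the usual cheap choice for reproducible random sorting.
random_sort_query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "functions": [{"random_score": {"seed": 42, "field": "_seq_no"}}],
        }
    }
}
```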

KevinMind added the migration:no-jira and repository:addons-server labels on May 4, 2024
KevinMind transferred this issue from mozilla/addons-server on May 4, 2024