We’re actually hoping you didn’t notice, but recently Marktplaats replaced its legacy search engine (called Voyager) with Elasticsearch.
Search lies at the core of our user experience, and over the years our users have become accustomed to a certain behavior when they interact with the platform. At the same time, features have been introduced that use search in various ways. Introducing a new technology with completely different characteristics for such an essential component poses a massive risk of disrupting all of this.
In this article I’m going to address the key decisions that we believe made this a smooth and rather uneventful process. We completed this migration with zero downtime and a small team (2-3 developers depending on how you count) in 3 months.
Voyager was developed internally by eBay in the USA and was introduced at Marktplaats 6-7 years ago. It is very fast and scales quite well (just add hardware), but over the years its limited feature set has become more and more of a hindrance in bringing the best search experience to our users. eBay itself moved on from Voyager several years ago, so development has stopped, while the number of people with in-depth knowledge of the system has dropped to the point where it has almost become a black box for us.
So Marktplaats and other eBay Classifieds Group (eCG) companies facing similar problems have been working together to evaluate technologies to replace Voyager and the other legacy systems used for search. We decided that Elasticsearch was the best option; its rich feature set and the availability of support from both an active online community and commercial parties were key to us.
Keep it simple
Our approach to replacing Voyager can be summarized as “keep it simple”. Elasticsearch offers some very exciting options for our search experience, and we will definitely try them out in the (near) future, but for the migration we decided to replicate the things we were already doing with Voyager as much as possible. This would reduce development time and allow us to compare the old and new platforms better. We believed that with relatively little effort we could get Elasticsearch to produce the same results as Voyager and thus keep any disturbances to a minimum.
We were helped in this by our service oriented architecture. We have several services that do things like manage users, send email, index advertisements or perform search. These services communicate with each other using Apache Thrift and asynchronous messaging via ActiveMQ or Kafka. By introducing new services that interact with Elasticsearch while keeping the same interfaces as their legacy equivalents, we were confident that we could keep the changes to the rest of the platform to an absolute minimum. This way we could also reuse the several hundred integration and acceptance tests we have built up over the years. Any failures could then be easily attributed to our new search platform.
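To illustrate the idea (in hypothetical Python rather than our actual Thrift-defined services, and with placeholder class names), keeping the interface identical means a single factory can swap the backend without any caller noticing:

```python
# Hypothetical sketch of the interface-compatible swap. In reality the
# interface is defined in Thrift IDL; the names below are illustrative only.
from abc import ABC, abstractmethod


class SearchService(ABC):
    """The shared search interface that the rest of the platform depends on."""

    @abstractmethod
    def search(self, query: str) -> list[str]: ...


class VoyagerSearchService(SearchService):
    def search(self, query: str) -> list[str]:
        # Placeholder for a call into the legacy Voyager engine.
        return [f"voyager-result-for-{query}"]


class ElasticsearchSearchService(SearchService):
    def search(self, query: str) -> list[str]:
        # Placeholder for a call into the new Elasticsearch cluster.
        return [f"es-result-for-{query}"]


def get_search_service(use_elasticsearch: bool) -> SearchService:
    # One switch selects the backend; every caller stays unchanged,
    # so the existing test suites exercise both implementations as-is.
    return ElasticsearchSearchService() if use_elasticsearch else VoyagerSearchService()
```

Because both implementations honor the same contract, the existing integration and acceptance tests can run against either one unmodified.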
Build what you need first
Secondly, we decided to take things slowly by first covering the bare essentials before expanding our work to include all edge cases. For example, we first built the mechanisms to just get advertisements into an Elasticsearch index, without worrying too much yet about which fields should be indexed or how they should be analyzed. This allowed us to quickly explore what would be needed to set up an index from scratch, and we did the first basic (re)index on production 3 weeks after we started development.
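That bare-essentials first step amounts to bulk-loading documents with a default mapping. As a rough sketch (the index name and document fields here are made up), building a request body for the standard Elasticsearch `_bulk` API looks like this:

```python
import json


def build_bulk_body(index: str, ads: list[dict]) -> str:
    """Build an Elasticsearch _bulk request body (newline-delimited JSON).

    Each document is preceded by an action line naming the target index
    and document id; no explicit mapping is supplied, so Elasticsearch
    infers field types dynamically -- fine for a first rough index.
    """
    lines = []
    for ad in ads:
        lines.append(json.dumps({"index": {"_index": index, "_id": ad["id"]}}))
        lines.append(json.dumps({k: v for k, v in ad.items() if k != "id"}))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


# Illustrative documents; real advertisements have many more fields.
ads = [
    {"id": 1, "title": "Vintage bike", "price": 75},
    {"id": 2, "title": "Garden chair", "price": 10},
]
body = build_bulk_body("advertisements", ads)
# This body would be POSTed to the cluster's /_bulk endpoint with
# Content-Type: application/x-ndjson.
```

Decisions about per-field analysis can then be layered on later by adding an explicit mapping, without changing the ingestion mechanism.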
Test both old and new
Once it was established we had a good foundation, we started to include more functionality by just looking at the old code and replicating its functionality in the new services.
Once we believed we had covered everything, we ran all our existing integration and acceptance tests against them. We ran them on the same integration environment as the rest of the platform and by just flipping a switch we could direct all search traffic to our new services and Elasticsearch or back to Voyager again.
This proved to be quite a shock at first. We made a friendly wager among ourselves, and let's just say the most pessimistic (or experienced?) of us won, as failure was abundant. After we fixed these issues we repeated the process a couple of times, reducing the number of failures every time. In the end it turned out a lot of failures were caused by all kinds of hidden functionality in other parts of the system that we couldn't detect in the code of the legacy services.
After we made our own test suites pass, we diverted our attention to the real test: our users. We had frequent debates on how we would ensure that our users would get the same results from the old and the new system.
One solution would be to run live queries against both systems in parallel on production and record the differences. This would capture accurate user behavior, but it would also make any issue very hard to reproduce, since the contents of the index would more than likely have changed between the original request and the time we would be able to have a look at it.
We settled on a solution where we would record actual user queries and run them against indices made from a copy of the production database. We also did not run all those queries at once, but in small random batches every 5 minutes. This way we would be able to quickly get an overview of potential issues and not be flooded by a large set of differences at once.
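A minimal sketch of that replay-and-diff loop might look like the following. This is an illustration under assumptions, not our actual tooling: the function names and the dict shape of the diff report are invented, and the two backends are passed in as plain callables standing in for the real search services.

```python
import random


def replay_batch(queries, legacy_search, candidate_search, batch_size=50, seed=None):
    """Run a small random batch of recorded queries against both backends
    and report the queries whose results differ.

    Because both backends query indices built from the same database
    snapshot, any difference is reproducible later.
    """
    rng = random.Random(seed)  # seedable for reproducible batches
    batch = rng.sample(list(queries), min(batch_size, len(queries)))
    diffs = []
    for query in batch:
        old_results = legacy_search(query)
        new_results = candidate_search(query)
        if old_results != new_results:
            diffs.append({
                "query": query,
                "legacy": old_results,
                "candidate": new_results,
            })
    return diffs


# Example run with stub backends that deliberately disagree:
recorded = ["fiets", "bank", "laptop", "tuinstoel"]
report = replay_batch(recorded,
                      legacy_search=lambda q: [q],
                      candidate_search=lambda q: [q.upper()],
                      batch_size=2, seed=42)
```

Running such a batch every few minutes, as described above, keeps the stream of differences small enough to investigate one by one.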
Two and a half months after we started development we put 1% of our production traffic on Elasticsearch on a Friday. We kept it at that during the weekend and increased traffic gradually every day the next week until we were running at 50% at the end of the week.
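The article does not describe how we split traffic, but one common way to implement such a gradual rollout is to hash a stable request attribute (a user or session id, say) into a bucket, so that each user consistently lands on the same backend and raising the percentage only ever moves users in one direction. A hypothetical sketch:

```python
import hashlib


def routes_to_elasticsearch(user_id: str, percentage: int) -> bool:
    """Deterministically route a stable slice of traffic to the new backend.

    Hashing the user id into one of 100 buckets means the same user always
    gets the same backend, so their experience doesn't flicker between
    systems, and increasing `percentage` only adds users to the new side.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

With this scheme, moving from 1% on Friday to 50% a week later is a one-line configuration change, and the old platform keeps serving the remainder untouched.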
All the time we were monitoring the performance of the new cluster as well as user metrics like page views and complaints received by customer support. Two minor bugs were identified this way and fixed quickly. We then flipped the switch to go to 100% while keeping the old platform running.
During the next couple of weeks we experienced very few issues; the most serious was a single node that started to eject items from the field data cache. We fixed it by increasing the size of the cache and performing a rolling restart. Users were not impacted as far as we could determine. At no point did we consider going back to the legacy system.
All in all this has been a very smooth process for us. If you're planning to replace your legacy search engine with something different and are anxious about any impact it may have on your overall metrics, our approach might help you. But as always, your mileage may vary.
For us the next steps are exploring the possibilities we now have with Elasticsearch.