Evolution of Promotion Search at Trendyol

Ertuğrul Gamgam
Trendyol Tech
8 min read · Sep 28, 2021


Promotions are discounts applied to the shopping cart. When you modify the contents of your cart (add a new item, remove an item, increase the quantity of an item, etc.), promotions must be recalculated. Before recalculation, a promotion search step is required to identify the promotions related to the shopping cart. In this article, I'll talk about how promotion search has evolved at Trendyol over the last two years.

Promotions applied to the shopping cart

In 2019, promotions were calculated by the Legacy Promotion API. It was a .NET Framework application running on IIS. It periodically fetched all promotions from Trendyol DB and cached them in memory. When a promotion computation was required, the Legacy Promotion API searched its in-memory cache for the promotions relevant to the shopping cart.

The Problems We Encountered with Legacy Promotion API

Scalability

When we wanted to scale the Legacy Promotion API, we had to deploy it to new IIS machines. Unfortunately, this was not an easy task: new machines had to be provisioned and the Legacy Promotion API then deployed to them via Octopus, which required a lot of manual work. I recall that for the 2019 Black Friday, we scaled the apps running on Kubernetes in 5 seconds, but scaling the Legacy Promotion API took us three days.

Issues with a Legacy Code Base

The Legacy Promotion API had a messy codebase. It was part of a monolith that was developed before Trendyol Tech started applying extreme programming practices and microservices. It also didn't have enough reliable tests.

Shared Database

The Legacy Promotion API used a shared database called Trendyol DB, which was also used by nearly all of Trendyol's legacy apps. Trendyol DB was under heavy load, and whenever an issue occurred on it, the Legacy Promotion API was affected as well.

Replatforming

2019 was the replatforming year for Trendyol Tech, so we began to replatform the Legacy Promotion API. After many spikes and performance tests, we came up with the following system design:

Promotion Search API (PSA)

When we were designing the new system, we did spikes with different programming languages and different search algorithms. We picked Go because of the performance of the Go spike. We also stuck with a brute-force algorithm, because more complex data structures and algorithms did not bring a performance advantage worth the complexity they introduced. We decided to postpone optimizing the search algorithm until we actually needed it.

This application is responsible for promotion search. PSA periodically retrieves all promotions from Promotion DB and caches them in memory. When PSA receives a search request, it finds the related promotions in its in-memory cache. With this design, PSA's response time is roughly a millisecond.
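
A minimal sketch of this periodic refresh idea, assuming a hypothetical loadAllPromotions function that reads every promotion from Promotion DB; the Promotion type and the interval are illustrative, not PSA's actual code:

import (
    "log"
    "sync/atomic"
    "time"
)

// Promotion stands in for PSA's domain type; the real one has more fields.
type Promotion struct {
    ID         int64
    SellerID   int64
    CampaignID int64
}

// cachedPromotions holds the snapshot that search requests read from.
var cachedPromotions atomic.Value // stores []Promotion

// refreshLoop periodically reloads the full promotion list and swaps the
// in-memory snapshot, so searches never hit the database directly.
func refreshLoop(interval time.Duration, loadAllPromotions func() ([]Promotion, error)) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        promotions, err := loadAllPromotions()
        if err != nil {
            log.Printf("promotion cache refresh failed: %v", err)
            continue // keep serving the previous snapshot
        }
        cachedPromotions.Store(promotions)
    }
}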

Promotion API

Promotion API is a Java Spring Boot application that implements the core business logic. Because of the complexity of the business rules, we applied domain-driven design with an OOP language for Promotion API; an OOP language was a better fit for domain-driven design. The main responsibility of Promotion API is to calculate discounts on the shopping cart. In addition, many CRUD operations are executed by Promotion API. With this design, the response time of Promotion API is around 8 milliseconds.

How This System Works

Promotion API calls PSA with the shopping cart in the request body, and PSA returns the promotions applicable to that cart. Finally, Promotion API calculates the possible combinations of applied promotions and picks the one that is most advantageous for the customer. If you want to learn more about how this calculation works, click here.
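
Promotion API is written in Java, but staying with Go like the other snippets in this article, here is a very rough sketch of the "pick the most advantageous combination" idea. ShoppingCart, the candidate combinations, and totalDiscount are illustrative placeholders, not Trendyol's real domain model:

// ShoppingCart is an illustrative cart model shared by these sketches.
type ShoppingCart struct {
    SellerIDs   []int64
    CampaignIDs []int64
}

// totalDiscount is a placeholder; the real calculation applies each
// promotion's rules to the cart contents.
func totalDiscount(combo []Promotion, cart ShoppingCart) float64 {
    return float64(len(combo)) // dummy value, only here to keep the sketch compilable
}

// pickBestCombination evaluates candidate combinations of applicable
// promotions and keeps the one with the highest total discount.
func pickBestCombination(candidates [][]Promotion, cart ShoppingCart) []Promotion {
    var best []Promotion
    bestDiscount := -1.0
    for _, combo := range candidates {
        if d := totalDiscount(combo, cart); d > bestDiscount {
            bestDiscount = d
            best = combo
        }
    }
    return best
}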

Changing PSA Search Strategy

This design worked great with about 5,000 promotions. PSA cached all promotions in a slice like the one below and searched them with a brute-force loop.

type Cache struct {
    Promotions []Promotion
}
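
A brute-force lookup over that cache might look roughly like this, reusing the illustrative Promotion and ShoppingCart types from the earlier sketches; isApplicable stands in for PSA's real condition checks:

// isApplicable stands in for PSA's real condition matching logic.
func isApplicable(p Promotion, cart ShoppingCart) bool {
    for _, id := range cart.SellerIDs {
        if p.SellerID == id {
            return true
        }
    }
    for _, id := range cart.CampaignIDs {
        if p.CampaignID == id {
            return true
        }
    }
    return false
}

// Search walks every cached promotion and checks it against the cart.
func (c *Cache) Search(cart ShoppingCart) []Promotion {
    var result []Promotion
    for _, p := range c.Promotions {
        if isApplicable(p, cart) {
            result = append(result, p)
        }
    }
    return result
}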

Unfortunately, a newly adopted business model brought 12x more promotions into the system. PSA's response time increased by about 7 ms, which showed us that the problem would not go away by scaling horizontally. We did not waste any time and performed a load test; the performance of the service was not acceptable for us. Worse, there was a huge event two weeks later, and PSA had to work more efficiently by then.

Fortunately, we were able to find a solution quickly. Promotions can be created with fifteen different condition types, and a promotion can have more than one condition. We realized that most promotions included sellerId and campaignId conditions, so we decided to index promotions by sellerId and campaignId with hash maps. The promotion cache struct ended up looking like this:

type Cache struct {
    SellerPromotions   map[int64][]Promotion
    CampaignPromotions map[int64][]Promotion
    OtherPromotions    []Promotion
}

Brute-force searching was no longer required. PSA looked up the promotion lists by the sellerId and campaignId of the items in the cart and then found the promotions applicable to the shopping cart among them. As a result of this change, PSA's response time decreased to around a millisecond again.
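
With the indexed cache, the lookup only has to touch the buckets for the sellers and campaigns in the cart, plus the promotions without those conditions. A rough sketch, with the same illustrative helpers as above:

// Search collects candidates from the seller and campaign indexes instead of
// scanning every promotion, then filters them against the cart.
func (c *Cache) Search(cart ShoppingCart) []Promotion {
    var candidates []Promotion
    for _, sellerID := range cart.SellerIDs {
        candidates = append(candidates, c.SellerPromotions[sellerID]...)
    }
    for _, campaignID := range cart.CampaignIDs {
        candidates = append(candidates, c.CampaignPromotions[campaignID]...)
    }
    // Promotions without seller/campaign conditions still need to be checked.
    candidates = append(candidates, c.OtherPromotions...)

    var result []Promotion
    for _, p := range candidates {
        if isApplicable(p, cart) {
            result = append(result, p)
        }
    }
    return result
}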

Promotion search ran with this design during the 2020 Black Friday and successfully handled 400k requests per minute, with an average response time of 20 ms at the peak.

Many More Promotions

In the third quarter of 2020, we developed a promotion creation page for sellers. After this, the number of promotions increased rapidly; the database was storing 1.5 million promotions when I wrote this article. This means that in just two years, the promotion count grew from 5k to 1.5M (300x more data).

As a result of this increase, some problems occurred in PSA. First, refreshing the PSA cache was taking too much time. Also, when we scaled out PSA (increased the number of pods on Kubernetes), all of the PSA instances put a heavy load on the replica nodes of Promotion DB, and this became a bottleneck for scaling PSA. So we decided to retire PSA before the 2021 Black Friday, and we needed to find an alternative solution.

Elasticsearch

After lots of tests, we decided to use Elasticsearch to search promotions. But first of all, we needed a solution to feed data into Elasticsearch and manage the reindexing process. It also had to be possible to change the data model in Elasticsearch during reindexing.

We decided to use the Couchbase Elasticsearch Connector (CBES) and deploy it to Kubernetes.

The Couchbase Elasticsearch Connector replicates your documents from Couchbase Server to Elasticsearch in near real time. The connector uses the high-performance Database Change Protocol (DCP) to receive notifications when documents change in Couchbase. You can find the CBES repository here.

But promotion data is stored in SQL Server, so first we had to feed the promotion data into Couchbase. Fortunately, Promotion API already produces promotion created/updated events to Kafka, so we developed a consumer app that consumes these events and writes the promotion data to Couchbase.
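
A hedged sketch of what such a consumer could look like in Go, using the segmentio/kafka-go and gocb client libraries. The topic name, consumer group, bucket name, credentials, and document key are all placeholders, and the real Trendyol consumer is not shown in this article:

package main

import (
    "context"
    "encoding/json"
    "log"

    "github.com/couchbase/gocb/v2"
    "github.com/segmentio/kafka-go"
)

func main() {
    // Kafka reader for the (assumed) promotion events topic.
    reader := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"kafka:9092"},
        GroupID: "promotion-couchbase-feeder",
        Topic:   "promotion-events",
    })
    defer reader.Close()

    // Couchbase connection; address, credentials, and bucket name are placeholders.
    cluster, err := gocb.Connect("couchbase://couchbase", gocb.ClusterOptions{
        Authenticator: gocb.PasswordAuthenticator{Username: "user", Password: "pass"},
    })
    if err != nil {
        log.Fatal(err)
    }
    collection := cluster.Bucket("promotions").DefaultCollection()

    for {
        msg, err := reader.ReadMessage(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        var doc map[string]interface{}
        if err := json.Unmarshal(msg.Value, &doc); err != nil {
            log.Printf("skipping malformed event: %v", err)
            continue
        }
        // Upsert keyed by the Kafka message key (assumed to be the promotion id),
        // so replayed events stay idempotent.
        if _, err := collection.Upsert(string(msg.Key), doc, nil); err != nil {
            log.Printf("couchbase upsert failed: %v", err)
        }
    }
}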

However, the events for previously created promotions did not exist on Kafka, so we also wrote a one-off migration app that copied the existing promotion data from SQL Server to Couchbase.

The final design is shown below:

Data Feed to Couchbase
Promotion Search with Elasticsearch

It's worth noting that CBES doesn't have to feed data to Elasticsearch with the same schema it has in Couchbase. We made some changes in the CBES codebase to make the schema more convenient for searching. This gives us great flexibility: if the search schema needs to change in the future, we just make some changes in the CBES codebase and feed the data into a new Elasticsearch index with a new CBES instance.
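
To give a feel for the search side, here is a hedged sketch of querying such an index with a bool query over sellerId and campaignId. The index name, field names, and response handling are assumptions, since the article doesn't show the real mapping, and the equivalent of OtherPromotions would need an extra clause in practice:

import (
    "bytes"
    "encoding/json"
    "io"
    "net/http"
)

// searchPromotionIndex queries an assumed "promotions" alias for documents
// whose sellerId or campaignId matches the ids in the cart.
func searchPromotionIndex(esURL string, sellerIDs, campaignIDs []int64) ([]byte, error) {
    query := map[string]interface{}{
        "query": map[string]interface{}{
            "bool": map[string]interface{}{
                "should": []interface{}{
                    map[string]interface{}{"terms": map[string]interface{}{"sellerId": sellerIDs}},
                    map[string]interface{}{"terms": map[string]interface{}{"campaignId": campaignIDs}},
                },
                "minimum_should_match": 1,
            },
        },
    }
    body, err := json.Marshal(query)
    if err != nil {
        return nil, err
    }

    resp, err := http.Post(esURL+"/promotions/_search", "application/json", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body) // hits would then be mapped back to domain promotions
}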

Managing Reindexing With CBES

Reindexing may be necessary at times: we may want to change the schema of the data, or configuration changes may be required for the Elasticsearch index. We manage this process as follows:

Reindexing

New CBES instances start feeding data from Couchbase into a new index. Meanwhile, the old CBES instances continue feeding the old index. Promotion API uses an Elasticsearch alias that points to the promotions index. When all of the available data has been fed into the new index, we switch the alias to the new index. After that, Promotion API starts searching promotions on the new index.
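
The alias switch itself can be done atomically with Elasticsearch's standard _aliases endpoint. Below is a minimal sketch in Go; the alias and index names are placeholders:

import (
    "bytes"
    "fmt"
    "net/http"
)

// switchAlias atomically moves the alias from oldIndex to newIndex in a single
// _aliases call, so searches never see a moment without a backing index.
func switchAlias(esURL, alias, oldIndex, newIndex string) error {
    body := fmt.Sprintf(`{
      "actions": [
        { "remove": { "index": "%s", "alias": "%s" } },
        { "add":    { "index": "%s", "alias": "%s" } }
      ]
    }`, oldIndex, alias, newIndex, alias)

    resp, err := http.Post(esURL+"/_aliases", "application/json", bytes.NewBufferString(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("alias switch failed: %s", resp.Status)
    }
    return nil
}

For example, switchAlias("http://elasticsearch:9200", "promotions", "promotions-v1", "promotions-v2") would cut all searches over to the new index in one request.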

Test

Lots of unit, integration, and acceptance tests were written during the development of this system. But promotion calculation is a sensitive area, so we also did some additional testing in the production environment.

When development was completed, Promotion API sent parallel requests to both Elasticsearch and PSA and compared the results, while the discounts were still calculated from PSA's response. If a difference existed, it was logged, and an alert message was sent to Slack by Kibana. This comparison didn't run for all users: the last two digits of the userId were used to select test participants. If the comparison was disabled for a user, Promotion API only sent a request to PSA. It could also be enabled for specific userIds. We followed these logs for about a month.
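
Promotion API is a Java application, but sticking with Go as in the rest of this article, the userId-based sampling and shadow comparison could be sketched roughly like this. The percentage, allow-list, and the searchViaPSA/searchViaElasticsearch/samePromotions helpers are all illustrative assumptions, reusing the ShoppingCart and Promotion types from the earlier sketches:

import "log"

// isTestParticipant decides from the last two digits of the userId whether the
// Elasticsearch shadow comparison should run for this request.
func isTestParticipant(userID, percentage int64, allowList map[int64]bool) bool {
    if allowList[userID] {
        return true
    }
    return userID%100 < percentage
}

// Placeholder clients and comparison; the real ones are Promotion API internals.
func searchViaPSA(cart ShoppingCart) []Promotion           { return nil }
func searchViaElasticsearch(cart ShoppingCart) []Promotion { return nil }
func samePromotions(a, b []Promotion) bool                 { return len(a) == len(b) }

// searchWithShadowCompare always answers from PSA, but for test participants it
// also queries Elasticsearch in the background and logs any difference.
func searchWithShadowCompare(userID int64, cart ShoppingCart) []Promotion {
    psaResult := searchViaPSA(cart)

    if isTestParticipant(userID, 10, nil) {
        go func() {
            esResult := searchViaElasticsearch(cart)
            if !samePromotions(psaResult, esResult) {
                log.Printf("promotion mismatch for user %d: psa=%d es=%d",
                    userID, len(psaResult), len(esResult))
            }
        }()
    }
    return psaResult
}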

After that, Promotion API continued comparing responses for test users, but the discounts were calculated from Elasticsearch's response. Finally, after following the logs and observing the whole ordering process for a long time, we took PSA down.

Result

Promotion API reached 1.6 million rpm in our most recent load test (4x the load of the 2020 Black Friday). We made many performance optimizations on Elasticsearch, and 1.6M rpm is the figure for a single data center; Promotion API runs on two different data centers. In future articles, I will cover how Promotion API runs on two data centers and what kind of performance optimizations we did for Elasticsearch.
