Author: Shubam Kumar, SDE-2 LimeChat
Welcome to the first post of our new blog series, where we’ll deep dive into how our tech team members plan, experiment, and resolve issues that come up as we grow and scale in the world of conversational commerce.
With each post, we hope you learn something new!
To kick off the series, I want to start with part one of how our team of engineers was able to scale our servers to handle 10x the load.
We were growing increasingly frustrated: server requests kept timing out, servers kept going down, and messages kept piling up in RabbitMQ.
To recover from these failures, we always had to intervene manually, restarting servers and workers, and sometimes even deleting all the messages in the queue because the workers simply couldn't process them.
So to prepare ourselves and our systems to handle more clients, we knew it was high time that we took on a project to scale our systems.
In this post, we’ll show you how we optimized our processes and servers to handle a 10x scale, what we did wrong while approaching the problem and how we found a way that works for us.
Let’s take a look at the earlier framework.
A Python app serves web traffic through a web server.
A web server cannot integrate with your app automatically. To bridge the two, there is a standard interface for all Python web apps: you expose a WSGI application from your project.
This WSGI app can be handed to any server built independently of our Python project, so we can use any WSGI server out there to run it.
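As a quick illustration (a toy example, not our production code), a WSGI application is just a callable that takes the request environ and a start_response callback:

```python
# minimal_wsgi.py - a toy WSGI application, for illustration only.
# Any WSGI server (gunicorn, cheroot/CherryPy, Bjoern, ...) can run it.

def application(environ, start_response):
    # environ is a dict describing the request (path, method, headers, ...).
    body = f"Hello from {environ.get('PATH_INFO', '/')}".encode()
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    # The return value is an iterable of bytes chunks.
    return [body]
```

Django's `app.wsgi` module exposes exactly such a callable, which is what `gunicorn app.wsgi` picks up.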
This is the configuration we were using to run our django app with gunicorn:
gunicorn app.wsgi -w 1
This config launches one gunicorn sync worker, and it processes each request synchronously, that is, one at a time. So if you make two requests at the same time, one request will have to wait until the other is finished.
Not a great experience for the user.
We then launched 30 Kubernetes pods (autoscaling up to 60), each running the Django app with this configuration. With 30 gunicorn sync workers running in parallel, we could handle 30 requests at a time.
But since each pod handles its requests synchronously, if one request takes a long time to execute (because a query or another API call is slow, or the connection to RabbitMQ is blocked), every other request routed to that pod has to wait until it finishes.
Imagine this happening across the pods!
This configuration came with additional problems as well, and it became clear that it was not working. Here's what we tried next.
gunicorn app --worker-class=gevent --workers=2 --worker-connections=500
The above configuration launches 2 async workers with 500 greenlets.
Now each pod could handle 500*2=1000 connections concurrently, thanks to gevent's monkey patching. However, Django did not play well with monkey patching: it was leaking database connections.
Launching 30 pods using this configuration to handle 30,000 concurrent requests sounds easy, but in reality, it was not.
Because many requests were being handled concurrently, we needed just as many database connections.
Our database ran out of connections once it hit its connection limit, and every request started failing.
We also saw errors like:
OperationalError at /admin/ Remaining slots are reserved for non-replication superusers
These errors mean the database has no more connections left to give to you.
How did we fix that?
Next, we introduced PgBouncer (a PostgreSQL connection pooler) for global pooling, since Django does not support connection pooling natively.
(FYI: pooling reduces PostgreSQL resource consumption and supports online restart/upgrade without dropping client connections.)
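To make the idea concrete, here is a sketch of a PgBouncer configuration; the hosts, database names, and limits below are hypothetical examples, not our production values:

```ini
; pgbouncer.ini - hypothetical sketch; hosts, names, and limits are examples.
[databases]
; Clients connect to PgBouncer, which fans in to the real Postgres host.
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; Transaction pooling releases a server connection as soon as a
; transaction ends, so many clients can share few server connections.
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
```

The application's database host/port then point at PgBouncer (port 6432 here) instead of Postgres directly.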
We then stress-tested our servers again. Here’s what we noticed:
This latency is not ideal. Connection pooling is supposed to remove the hassle of establishing a new connection for each request and in turn reduce latency; that was not the case here.
We also ran into other problems while using gunicorn async workers, the main one being connection leaks.
Database connections would never close, at random times. Gunicorn async workers use greenlets, and if you spawned another greenlet manually in your code, or called a Celery task, it caused problems.
One of those problems was that the Django request_finished signal was never fired, so the database connection opened for that request was never closed.
After this experiment, where we tried every permutation of configuration possible, we realized that scaling gunicorn for our use case just wasn’t working.
It was time to move on.
After running these experiments, we had a clear picture of what we needed from a new server.
We tried two servers: Bjoern and CherryPy.
Bjoern is a very lightweight, single-threaded WSGI server built on a libev event loop.
You scale it the way you would scale a Node.js server: launch multiple processes to utilize all CPU cores.
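That per-core scaling pattern can be sketched with the standard library; the `bjoern.run` call mentioned in the comment is the assumption here, and `serve` is a stub so the sketch stays self-contained:

```python
# Launch one worker process per CPU core, Node.js-cluster style.
# In production each worker would call something like
#   bjoern.run(application, "0.0.0.0", 8000)
# instead of the stub below.
import multiprocessing
import os

def serve(results):
    # Stand-in for the blocking server loop: just record our pid.
    results.put(os.getpid())

def fork_workers(num_workers):
    results = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=serve, args=(results,))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [results.get() for _ in range(num_workers)]

if __name__ == "__main__":
    pids = fork_workers(multiprocessing.cpu_count())
    print(f"served from {len(pids)} processes")
```

Each process runs its own event loop, so the OS spreads incoming connections across all cores.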
Bjoern is great on paper and awesome based on some of the reviews we read. But when we stress-tested our app running under Bjoern, the results weren't great: latency took a big hit.
The numbers didn't convince us, and we decided it was too slow for our use case.
CherryPy is a multithreaded, thread-pooled server written in pure Python. Its creators claim it is one of the most performant Python servers out there.
Despite being written in pure Python, it handles a large number of concurrent connections with high RPS and low latency.
We wrote a small management command to launch our Django WSGI app with CherryPy and started stress testing:
# core/management/commands/runcpserver.py
from cheroot import wsgi
from limekit.wsgi import application
from django.core.management.base import BaseCommand


def run_cherrypy_server(
    host: str = '127.0.0.1',
    port: int = 8000,
    numthreads: int = 30,
    max_threads: int = 40,
    request_queue_size: int = 20,
):
    addr = host, port
    server = wsgi.Server(
        addr,
        application,
        numthreads=numthreads,  # minimum number of threads to keep in the thread pool
        max=max_threads,  # maximum number of threads
        request_queue_size=request_queue_size,
    )
    server.start()


class Command(BaseCommand):
    help = 'Run Django WSGI application using CherryPy Server'

    def add_arguments(self, parser):
        parser.add_argument('-p', '--port', type=int, help='Port to run server on')
        parser.add_argument('-hip', '--host_ip', type=str, help='Host IP')
        parser.add_argument('-nt', '--numthreads', type=int, help='Min threads in thread pool for CherryPy server')
        parser.add_argument('-mt', '--maxthreads', type=int, help='Max worker threads')
        parser.add_argument('-rqs', '--request_queue_size', type=int, help='Max queued connections')

    def handle(self, *args, **options):
        port = options['port'] or 8000
        host = options['host_ip'] or '127.0.0.1'
        numthreads = options['numthreads'] or 20
        max_threads = options['maxthreads'] or 40
        request_queue_size = options['request_queue_size'] or 20
        run_cherrypy_server(
            port=port,
            host=host,
            numthreads=numthreads,
            max_threads=max_threads,
            request_queue_size=request_queue_size,
        )
Launch server script
./manage.py runcpserver -nt 100 -mt 150 -rqs 1000 -hip 0.0.0.0 -p 4999
We ran the server with 100 threads in the thread pool (growing to 150 if needed), a request queue size of 1000, and launched 10 pods. We also removed PgBouncer for the Django server.
Here are the stress test results:
Finally! Everything is as we wanted.
Celery is the batteries-included distributed task queue we use with Django for background processing.
Those batteries started bothering us as we scaled. We started by moving from Redis to RabbitMQ as the broker.
This is the configuration we had for launching each celery worker:
celery -A app.gcelery worker -Q queue_name -Ofair --pool=gevent --concurrency=500 --without-gossip --without-heartbeat --without-mingle --loglevel INFO
Note the use of gevent again; app.gcelery exists to do the monkey patching. Here's its code:
from gevent import monkey
monkey.patch_all(httplib=False)

import app.celery
from app.celery import app
It worked smoothly until we started receiving more traffic than usual. Then all the celery worker pods would just die after raising this error:
Couldn't ack 1682, reason:BrokenPipeError(32, 'Broken pipe')
If we look through the traceback, we see this:
packages/gevent/_socketcommon.py", line 722, in send return self._sock.send(data, flags) BrokenPipeError: [Errno 32] Broken pipe
With this configuration (+ prefetch_count=100) each worker could prefetch 500*100=50000 tasks.
This meant all the tasks would go to just one or two workers at most; those workers couldn't process that many tasks, so they would die and the tasks would be redelivered.
The same scenario repeated until every worker was dead and tasks piled up in the queue, which caused RabbitMQ connections to become blocked. This, in turn, frequently left the Django server stuck.
So the goal became to distribute tasks evenly between celery pods, so that no single worker would prefetch all the tasks.
To tackle this, we reduced the prefetch count to 3 and concurrency to 300, so each worker in a pod could now prefetch 300*3=900 tasks.
In total, with 5 pods launched, we could prefetch 900*5=4500 tasks from RabbitMQ.
But while even distribution now worked as expected, the celery pods were still dying with the same BrokenPipe error!
To solve this error, we ditched gevent for celery and moved to native threads.
This is the configuration we used:
celery -A app worker -Q queue_name -Ofair -P threads -c 100 --without-gossip --without-heartbeat --without-mingle --loglevel INFO
Each celery worker then launched a thread pool of 100 to process tasks, with a prefetch count of 4, so each worker could prefetch 100*4=400 tasks from RabbitMQ.
We launched 10 pods of celery workers, so the total prefetch became 10*400=4000 tasks.
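The prefetch arithmetic behind the configurations above boils down to concurrency times prefetch multiplier times pods, which a quick calculation confirms:

```python
# Effective prefetch = per-worker concurrency * prefetch multiplier * pods.
def total_prefetch(concurrency, prefetch_multiplier, pods=1):
    return concurrency * prefetch_multiplier * pods

# gevent pool, one pod: 500 greenlets * prefetch 100
assert total_prefetch(500, 100) == 50_000
# gevent pool tuned down: 300 * 3, across 5 pods
assert total_prefetch(300, 3, pods=5) == 4_500
# native thread pool: 100 threads * prefetch 4, across 10 pods
assert total_prefetch(100, 4, pods=10) == 4_000
```

Keeping this product small relative to worker throughput is what spreads tasks evenly across pods.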
With this many tasks in one queue, all of them making database queries, we also had to be careful not to run out of database connections, which is easy to do with the default way Django + Celery obtain DB connections.
To solve this, we introduced PgBouncer for the celery workers only: the tasks were holding connections without using them all the time, so the connections sat idle most of the time.
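A sketch of the Django database settings for routing workers through PgBouncer follows; the host, port, and names are hypothetical, and `DISABLE_SERVER_SIDE_CURSORS` is Django's setting for working behind a transaction-mode pooler:

```python
# settings.py fragment - hypothetical values, for illustration only.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "appdb",
        # Point at PgBouncer instead of Postgres itself.
        "HOST": "pgbouncer.internal",
        "PORT": 6432,
        # Transaction pooling may hand each query a different server
        # connection, so server-side cursors must be disabled.
        "DISABLE_SERVER_SIDE_CURSORS": True,
        # Keep client connections short-lived; pooling happens in PgBouncer.
        "CONN_MAX_AGE": 0,
    }
}
```

With this in place, idle task connections cost PgBouncer a client slot rather than Postgres a server connection.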
So now Django runs with CherryPy, and celery runs with a native thread pool plus PgBouncer. That's the final configuration.
It was time to stress test, and the results were:
Our goal was to optimize our system and processes to manage a 10x scale. Trying these experiments and constantly iterating got us there!
Currently, we receive a max of 60 requests per second, while our server can handle 1200. We also currently receive 20 messages per second in a single queue while it has a capacity to process 600.
While we have optimized the system, very soon, we’ll have to build on it and work on scaling it again.
We’ll need to answer two key things-
The only thing we know as of now is that service endpoint name resolution fails across Kubernetes services under heavy load.
We are still looking into the cause. One possibility is that "AWS EC2 instances have a hard limit on the number of packets being transferred in a second, which is 1200", but we are not sure.
That’s the next step – figuring it out and starting such experiments again. Stay tuned for how and what we learn in Part 2 of this coming soon!