Learning from a web app scalability nightmare
A little over a year ago, I co-led an effort to rebuild a web app from the ground up. It wasn’t our original intent to rebuild it, but after diagnosing and attempting multiple improvements, we determined a rewrite was the best option. Here’s what we learned.
The problem
The web app we were working on was an outreach and engagement tool. Our company used it to acquire new users while engaging existing ones. It hosted weekly, hour-long events where users could come, claim a reward, and share on social media. The more people who claimed at one of these events, the bigger the reward pot would get.
Each week the number of claims soared higher than the week before. A few engineers and some marketers would hop on a call while the event ran to hang out and monitor it. We’d place bets on how many claims would come through, always optimistic the total would beat the previous week’s. All the while, our company’s social media following was growing and customers were tagging us in happy posts.
But before long, we started to notice something interesting: the posts we were being tagged in were going from positive, energetic, and enthusiastic to frustrated, annoyed, and downright angry. We were seeing more and more posts like “couldn’t log in this week either 🙄”, or “is the page loading for anyone else?”, or one of my favorites: “SCAM”. Usually there were other words mixed in there that don’t bear repeating.
We’d also see screenshots on Twitter of some of the problems users were facing. A common one was a blank white page with “504 - Gateway Timeout” written neatly across the top. Another was our web app’s error page. We couldn’t find any evidence of the 504s in our own logs or traces, suggesting those requests were failing at the gateway in front of our app before they ever reached it. We did, however, see plenty of evidence for the error pages, with a common culprit being database connection issues or timeouts.
In all, we were facing a fantastic problem to have: usage had scaled beyond what our existing architecture could support. It was time to make some changes.
Starting small before starting over
Let me start by saying this: we didn’t jump straight into a rebuild. Every engineer loves the idea of starting over; “oh, if only we had made XYZ decision at the start, everything would be so much better for us now”. And honestly, that’s true in many cases. Architecting well at the outset can make your life dramatically better in two, or five, or ten years if that code or that system is still in use.
But we need to be really careful with this thinking. Will this herculean effort actually give us a return on our investment? Or will the opportunity cost of all that engineering time outweigh the value? Let me be clear: starting from scratch takes longer than many of us expect, especially given the many “hidden” dependencies and dependents at play.
Consider, for example, all of the eventing and reporting most systems do for business intelligence. There may be dozens of graphs and dashboards that will break if you build from scratch, and many stakeholders who won’t be pleased when they go to check their reports. Trying to keep continuity between the historical data and the data you’re going to start generating is also a pain. You end up tied to the same event names you’ve always used, or you have to manually stitch data together, or you just accept that there will be two disjoint data sets. None of those solutions are ideal, but it’s likely something you’ll have to deal with if you start from scratch, and it will take more time than you expect.
So what did we do first? We started by figuring out exactly what the problems were.
504s galore
I mentioned users were getting 504s. After some digging, we hypothesized that we didn’t see logs or traces for these 504s because they were being returned before the requests ever made it to our application. A little more digging revealed that our servers were hitting more than 100% CPU and nearly 100% memory usage at the start of our events. Most likely, the resource-starved servers couldn’t accept a request before the gateway in front of our app gave up waiting, and the gateway handled that by returning a 504 to our users.
We thought to ourselves, well that’s an easy issue to solve: just give it more CPU and memory! If we gave the app enough resources to handle the max load, there wouldn’t be any dropped requests.
We quickly realized this wasn’t going to work though. For this particular application, the max load was orders of magnitude higher than the average load. If we provisioned for max load, we would be using those servers (and the money we spent on them) extremely inefficiently. Further, this application was just one acquisition channel in a portfolio of acquisition channels, and the business was always tweaking the portfolio to improve their return on investment. If the cost of this app increased but the value of the users being acquired didn’t change, then ROI would decrease. If the ROI decreased below the ROI of other channels, there’d be no point in continuing to use this app. We might as well spend the same money on ads or other acquisition methods! So provisioning for max load wasn’t going to cut it.
Since provisioning for max load wasn’t an option due to our cost constraints, we decided to use autoscaling to only run at full capacity when we needed it. We could scale down when an event wasn’t going on to save money. An autoscaling policy was already in place, but we decided to tune it to scale up more aggressively. For anyone not familiar, autoscaling works by looking at metrics like CPU and memory, and when it sees those metrics cross a certain threshold, it spins up more server instances to bring the average resource consumption back down below the threshold. (There are other policies out there, but this threshold-based approach is one of the most common). We set it to scale at 20% CPU consumption rather than 75% and figured all would be well.
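To make the threshold idea concrete, here’s a minimal sketch of that kind of scaling decision. This isn’t our actual policy (that lived in our cloud provider’s autoscaler); the function name and numbers are purely illustrative.

```rust
/// Decide how many instances we need so that average resource usage falls
/// back under the threshold. This mirrors the threshold-based policies most
/// cloud autoscalers offer: grow the fleet proportionally to how far the
/// observed metric sits above the target.
fn desired_instances(current_instances: u32, avg_cpu_percent: f64, target_cpu_percent: f64) -> u32 {
    let desired = (current_instances as f64 * avg_cpu_percent / target_cpu_percent).ceil() as u32;
    desired.max(1) // never scale below a single instance
}

fn main() {
    // With 4 instances averaging 80% CPU and a 20% target,
    // the policy asks for 16 instances.
    println!("{}", desired_instances(4, 80.0, 20.0));
}
```

With the threshold dropped to 20%, the same load that used to leave us sitting at 80% CPU would now request roughly four times as many instances.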
It wasn’t quite that simple though. This application wasn’t experiencing heavy loads constantly, nor was traffic ramping up gradually. Rather, the load on our servers was spiking tremendously right at the start of the event. This was due to our traffic pattern: many users would arrive promptly at the start from messaging or posts on social media. It was this spike that was killing us.
Spikes are a problem because most autoscaling policies can’t respond to them quickly enough. For instance, the metrics our autoscaling policies were watching, CPU and memory, report in one-minute intervals. To trigger a scale-up, those metrics must stay above the threshold for a few consecutive intervals. Then, even after more servers have been requested, it can take a few minutes for the application to start and be ready to receive traffic. At minimum, there was about a five-minute delay before we had the additional servers we needed. In those five minutes, another 100k+ requests could come through, further swamping our already overloaded servers. It seemed that autoscaling wasn’t an option for us either.
Since we couldn’t blindly add more resources, and autoscaling wasn’t going to play nicely with our spiky traffic pattern, we were left with a few options:
- smooth out the traffic spike so autoscaling would have time to kick in
- somehow handle far more requests with the same hardware
- find a way to reduce the number of requests
Or we could do some combination of the above. Spending some time looking at our other major problem helped us decide which approach to take.
Database connection issues
The other major problem with the app was its overuse of the database. This application was built during a two-day hackathon in Ruby on Rails, and it showed. When users could successfully reach the app, they were still greeted with slow-loading screens and error pages as the app would time out waiting for the database to respond. Perhaps unsurprisingly, our database CPU and memory usage was reaching 100% as well.
Discovering this info helped us refine our vision for how to handle our 504s. One thing it told us was that horizontally scaling our servers wasn’t going to help if our database instance was a bottleneck. Adding more servers would just make the issue worse by trying to throw more queries at an already overloaded database. We considered using a larger database instance with more CPU and memory, but we couldn’t due to the same cost constraints discussed earlier.
To try to work around our cost constraints, we did experiment with a “serverless” database. It could change CPU and memory allocations on the fly, so we had it scale down when an event ended and scale back up shortly before the next one started. That let us cut costs during off hours while paying for a larger database instance only during events.
This was a brittle solution though, as it required timing the database scaling around when an event was happening. If the scaling failed to kick off for any reason—and this did happen to us once—then the event would be a total disaster. Users would be arriving to absolute gridlock until we got paged and someone went in to manually increase the resource allocations. That generated some really nasty tweets for us, and we decided to can the whole “serverless” database idea shortly after that.
After that debacle, it was clear we had to stop relying on hacks. We needed to actually find ways to reduce the number of database queries we were making, speed up the ones we had to make, and remove synchronous dependencies wherever possible. In short, we had to intelligently rearchitect our data access.
A quick note about developer experience
One other consideration was that since this app was getting a lot of investment from the company, we knew we’d be working in it quite a bit over the next few quarters. We wanted to make sure any changes we made improved the development experience, making it faster and safer to crank out iterations on the application. Specifically, we wanted to be able to deploy on demand while keeping our number of escaped defects low. There are few things more annoying than doing manual deploys, or having to hotfix bugs. If we could take those two things off our plate we knew our quality of life and our capacity to deliver on stories was going to be much higher.
To deploy on demand, I knew I wanted to use trunk-based development, have excellent test coverage, and set up a single CI/CD pipeline. However, I also knew that with the poor Ruby on Rails codebase and the overall lack of tests, trying to do continuous deployment was going to be a challenge. Ruby (like other dynamically-typed languages) makes it hard to catch bugs early in the software development lifecycle. Errors surface at runtime, so you’re forced to write excellent unit tests to catch things you normally wouldn’t even consider business logic. For instance, you have to test that you don’t have typos when you reference your instance variables. You have to verify that variables you’re concatenating onto a string aren’t `nil`, since `nil` can’t be implicitly converted into a `String`. You have to make sure the arguments to a function actually implement the methods you’re calling on them, since there’s no mechanism to enforce the types of your arguments. (Also, yes, these are all real production issues we’ve had.) Relying on fallible humans to write tests to catch all those errors is a recipe for disaster. I wanted to automate problems away wherever possible. Overall, we wanted to make sure we weren’t just building a performant and cost-efficient application; we also wanted to be building a safe, easy-to-test, readable, and maintainable application.
Considered options
To recap, the main problem with the application was that the infrastructure we were using to serve it was not sufficient for the load created during the initial traffic spike. The servers didn’t have enough CPU and memory, and couldn’t autoscale in time. Further, even if they could scale quickly enough, we were seeing that our database was also running out of CPU and memory. There was a limit to how many servers we could add because of the database bottleneck. Since we couldn’t increase infrastructure costs due to the business constraints on us, we had to either reduce the number of requests and database queries, handle the requests and queries more efficiently, or find a way to smooth out the traffic spike to reduce our max load.
The first option we considered was simply doing longer events and spacing out our invites. This was easy enough to implement: we switched from a one-hour event to a three-hour event and decreased the number of invites we sent per minute. While this helped a little bit, it wasn’t an ideal solution because users were still seeing the invite to the event on social media, and also had future events saved in their calendars. Despite spacing out invite emails and messages, we were still seeing a spike at the start of the event.
The next option we considered was separating the frontend portion of the application from the backend. While this wouldn’t alleviate our database bottleneck, it would allow us to horizontally scale the frontend and backend independently. In the fullstack, Ruby on Rails version of the application, requests were coming to the servers to fetch data and interact with the event, but requests were also coming in for HTML, CSS, and JS files (as well as fonts, images, and so on). This was the root of our 504 problem: requests for the frontend assets would fail with a 504 under load, so our users’ browsers would simply show them a 504 page. By separating the frontend from the backend, we were confident we could at least give the user the HTML/CSS/JS they needed to get something to display in their browser. That way, even if a subsequent call to the backend failed, we could show them a gracefully degraded version of the page. That alone would be a huge improvement over showing them a blank white 504 page! We also suspected that removing all the frontend requests from our backend would reduce its load considerably and allow it to work better.
Our last considerations were around data consistency. We realized something important about the data we were serving: for the most part, data fetched during an event was not changing. Users would fetch event info like the name, start time, and end time, all of which remained constant for any given event. The only data that was changing was the reward amount, since it would increase as more users came to the event. Because of this, we realized we could statically build the page with most of the event info in it, skip the database calls, and then only hit the database during the event to claim a reward for a user and to show the new reward pot amount.
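As a rough sketch of that idea (with hypothetical names and data; our real static builds lived in the frontend tooling), the unchanging event info gets rendered into the page at build time, so serving it never touches the database:

```rust
// Hypothetical sketch: render the unchanging event details into static HTML
// at build time so the page never needs to query the database for them.
struct Event {
    name: &'static str,
    starts_at: &'static str,
    ends_at: &'static str,
}

fn render_event_page(event: &Event) -> String {
    format!(
        "<html><body>\
           <h1>{}</h1>\
           <p>Runs from {} to {}</p>\
           <!-- The reward amount is the only dynamic value; the page polls for it. -->\
           <div id=\"reward-amount\">Loading…</div>\
         </body></html>",
        event.name, event.starts_at, event.ends_at
    )
}

fn main() {
    // Placeholder event data for illustration only.
    let event = Event {
        name: "Weekly reward drop",
        starts_at: "2024-06-01T17:00Z",
        ends_at: "2024-06-01T20:00Z",
    };
    // In a real pipeline this output would be written to a static file and pushed to the CDN.
    println!("{}", render_event_page(&event));
}
```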
Further, we noted that since the page was polling every three seconds for the reward amount, we could cache that amount for up to three seconds without impacting the user experience. That would save us from having to make a database query for every user on the page. Instead, we could query once every three seconds, populate the cache, and then serve users from the cache. Given the size of our events, this change alone could reduce our query volume by five orders of magnitude, from 100k+ queries every three seconds down to one query every three seconds.
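A minimal sketch of that kind of short-lived cache, using only the standard library and a stand-in closure where the real database query would go, might look like this:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hypothetical sketch of a 3-second cache for the reward pot amount.
// Every request reads from the cache; only one query per TTL window
// actually reaches the database.
struct RewardCache {
    ttl: Duration,
    cached: Mutex<Option<(Instant, u64)>>,
}

impl RewardCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, cached: Mutex::new(None) }
    }

    fn reward_amount(&self, query_db: impl Fn() -> u64) -> u64 {
        let mut cached = self.cached.lock().unwrap();
        // Serve from the cache while the value is still fresh.
        if let Some((fetched_at, amount)) = *cached {
            if fetched_at.elapsed() < self.ttl {
                return amount;
            }
        }
        // Otherwise refresh from the database and cache the result.
        let amount = query_db();
        *cached = Some((Instant::now(), amount));
        amount
    }
}

fn main() {
    let cache = RewardCache::new(Duration::from_secs(3));
    // The closure stands in for the real database query.
    let amount = cache.reward_amount(|| 12_500);
    println!("reward pot: {amount}");
}
```

Every request inside the TTL window is answered from memory; only the first request after the cache expires actually hits the database.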
Lastly, because of the polling mechanism, we also realized we could tolerate some eventual consistency when users claimed rewards. Rather than writing synchronously to the database and letting database load scale in tandem with the number of concurrent users, we could queue up database writes and throttle them. Not only would queueing smooth out the load on our database, it would also give us the ability to retry writes if they failed.
The final design
In the end we decided to rewrite the application. Our small changes weren’t doing enough to smooth out our traffic pattern, nor could we simply scale our servers to infinity and beyond. We had come up against our resource constraints, and because of the business use case for this service, we came up against cost constraints too. We simply had to be smarter about how we were handling the load we were getting, including caching better and removing synchronous data dependencies where possible.
To start, we decided to separate the frontend and backend. Rather than a fullstack Rails app, we wanted a standalone backend and a static frontend served from a CDN. The standalone backend would receive far fewer requests with the frontend load taken off of it, and given our realization that most of the event data wasn’t changing, we didn’t need most of those data requests either. The event data could be baked right into the page at build time and then cached in our CDN. Using a CDN also sped up delivery of assets to our users, and because it was fully managed, we didn’t need to worry about autoscaling our own servers. Thanks to the CDN’s generous pricing model, this was also a cheaper way to handle the requests than running a fleet of servers.
We set our backend up to trigger a new frontend build any time event data changed. This allowed us to statically build pages for new events without a frontend code change. It also saved our frontend from needing to fetch that data at runtime: it was simply loaded into the CDN and ready to go.
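In sketch form (the build-hook URL and function names here are made up; our real trigger was a call to the build hook exposed by our frontend’s CI), the backend just fires the hook whenever event data is saved:

```rust
// Hypothetical sketch of the "rebuild on data change" hook.
fn on_event_saved(event_id: u64, data_changed: bool) {
    if data_changed {
        trigger_frontend_build(event_id);
    }
}

fn trigger_frontend_build(event_id: u64) {
    // In production this was an HTTP POST to the CI build hook; here we just log it.
    println!("POST https://ci.example.internal/build-hooks/frontend?event={event_id}");
}

fn main() {
    on_event_saved(42, true);
}
```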
On the backend, we put our realizations about data consistency to use by breaking database writes out into an asynchronous task that could run on its own server during an event. One service in the backend cluster served the REST API our frontend was hitting; it would fetch data from the cache and put any writes onto a queue. A background worker node would then pull tasks off the queue and write to the database at its leisure. This let us throttle our write rate and keep database CPU and memory usage in check.
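Here’s a heavily simplified sketch of that shape, using an in-process channel and a fixed sleep in place of our real queue and rate limiting, just to show how the API layer and the writer are decoupled:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical claim record queued by the API layer.
struct ClaimWrite {
    user_id: u64,
    amount_cents: u64,
}

fn main() {
    let (queue, worker_inbox) = mpsc::channel::<ClaimWrite>();

    // Background worker: pulls writes off the queue and persists them at a
    // throttled rate, so database load no longer tracks concurrent users.
    let worker = thread::spawn(move || {
        for claim in worker_inbox {
            // Stand-in for the real database insert (which could also be retried on failure).
            println!("writing claim: user={} amount={}", claim.user_id, claim.amount_cents);
            // Crude throttle: at most ~20 writes per second.
            thread::sleep(Duration::from_millis(50));
        }
    });

    // The API handler only has to enqueue the write and can respond immediately.
    for user_id in 1..=5 {
        queue.send(ClaimWrite { user_id, amount_cents: 500 }).unwrap();
    }

    drop(queue); // Closing the queue lets the worker drain and exit.
    worker.join().unwrap();
}
```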
Another detail we didn’t get into, but could probably devote a whole article to, is the dramatic underperformance of Ruby on Rails in both CPU-bound and IO-bound contexts. Ruby is interpreted, so it eats up more memory and CPU cycles than a compiled language, and its older concurrency model relies on OS threads, making context switches far more expensive than modern coroutines or task scheduling. Given that, plus all the developer experience points I mentioned above, our whole team agreed that we would not be using Ruby going forward.
Instead of Ruby, we decided to use Rust on our backend. We had recently prototyped Rust with a simpler service and found it to be fast, reliable, and surprisingly ergonomic! Despite its learning curve, Rust is actually pretty great from a developer experience perspective. It eliminates whole classes of language-level bugs, so you can rely on the compiler to catch failure modes you would’ve had to write unit tests for in Ruby. With its memory safety model, you also get thread safety (useful in a highly concurrent distributed environment) and protection against a number of common security flaws you might hit in C or C++. And did I mention that it’s extremely fast? In practice that means fewer CPU cycles per request, so we can do more with the same CPU budget (or reduce the budget and save some money). It also has a small memory footprint, which translates into further cost savings.
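As a tiny illustration of the difference (a made-up example, not code from our app): the `nil`-in-a-string class of bug I complained about earlier simply can’t get past the Rust compiler, because a possibly-missing value has to be handled before you can use it.

```rust
// A display name might legitimately be missing; in Rust that fact lives in the type.
fn greeting(display_name: Option<&str>) -> String {
    // The compiler forces us to decide what "missing" means here.
    // Forgetting this case is a compile error, not a production incident.
    match display_name {
        Some(name) => format!("Welcome back, {name}!"),
        None => "Welcome back!".to_string(),
    }
}

fn main() {
    println!("{}", greeting(Some("Ada")));
    println!("{}", greeting(None));
}
```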
Overall, the rewrite took about 2 months. We projected 1 month originally, but due to some blockers with dependencies on other internal systems our timeline had to be extended. We saw that we were transferring our load spike (albeit a smaller spike) onto these other systems. Our app ended up giving our auth system a pretty good load test for example! We had to spend a sprint with them sorting that all out. We also ran into issues like the one I mentioned at the start, where reporting and tracing had to be rebuilt from the ground up. We accepted that it wouldn’t match the older data—which was acceptable in our case—but that’s a call that will differ from project to project.
Hindsight is (sometimes) 20/20, and for this project I can confidently say the rewrite was worth it. We’ve had only three production issues in the last year and a half, one of which was invisible to users and corrected asynchronously the next day. We’ve handled our load without any problems; in fact, we’ve had more issues with our company’s routing layer and auth systems than with our own application. Separating the frontend and backend has let us iterate on them independently: the frontend has since undergone a total redesign without a single backend change. Both the frontend and backend can be deployed on demand as well, allowing us to work quickly and painlessly. In all, the codebase has become a real joy to work on. While a full rewrite shouldn’t be our first instinct as engineers, I’m hopeful these learnings will help us build well from the outset and avoid needing one altogether.