A Distributed Denial of Service (DDoS) attack is an attempt to disrupt services for all of your customers and take you offline. We recently experienced our first DDoS attack against one of our customers, which demanded a swift response: the customer's production environment was taken offline for a period of time. This blog post walks through what happened and how AccelByte responded to resolve the issue.
On May 4, 2022, at around 8:30 AM Pacific time, we started receiving alerts in our main incident-reporting Slack channel about inaccessible health checks and status pages for this game's services.
After some investigation, we determined the cause to be a massive surge of incoming traffic, peaking at almost 4M requests/sec; normal traffic volume for this game was closer to 10 requests/sec.
With numerous system components reporting failures, and all services either scaling, already scaled to their maximum limits, or failing to recover, we knew that we were being attacked.
For many of our services, we put limits in place to prevent the system from scaling above certain levels, protecting our datastore components. Every current customer's deployment uses components dedicated exclusively to that customer, sized based on historical demand and agreed-upon projected capacity. This approach is somewhat inflexible and more expensive, but it gives each customer complete confidence that we keep their data separate and secure from our other customers, and it ensures they pay only for the capacity they need rather than for idle headroom. The downside is that we often don't have larger datastore servers deployed that could cushion the shock of a small DDoS, and many of these components cannot be scaled without some brief user impact.
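In Kubernetes terms, one way such a scaling ceiling can be expressed is a hard `maxReplicas` cap on the HorizontalPodAutoscaler, sized to what the dedicated datastore behind the service can absorb. This is an illustrative sketch, not our actual configuration; the service name and numbers are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lobby-service-hpa   # service name is illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lobby-service
  minReplicas: 2
  maxReplicas: 20            # hard ceiling: protects the datastore from runaway scale-out
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```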
The AccelByte live operations team quickly stepped in to recover the system from the attack. The following outlines the steps we took to remedy the situation and get everything running smoothly again.
This unfortunate attack resulted in a few lessons learned.
First, we should always have a WAF configuration deployed and ready to go, in a state where it can be enabled within seconds or minutes, regardless of whether we currently need it.
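As a rough sketch of what "ready to go" can look like on AWS, a rate-based WAFv2 rule can be kept pre-provisioned in Terraform so that blocking kicks in as soon as the ACL is attached. The names and the request threshold below are illustrative, not our production values.

```hcl
resource "aws_wafv2_web_acl" "ddos_shield" {
  name  = "ddos-shield"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  rule {
    name     = "rate-limit-per-ip"
    priority = 1

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000   # requests per 5-minute window per source IP
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit-per-ip"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "ddos-shield"
    sampled_requests_enabled   = true
  }
}
```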
We also learned that we had made some incorrect assumptions about our Istio configurations, which caused a few different memory-related crashes; we will be taking steps to validate and adjust these across our environments.
We need to review all of our system configurations around limits to make sure they are correctly captured, documented, and applied. We are in the process of rebuilding all of our environments using a fully IaC and GitOps approach based on Flux and Terraform. Converting every environment is a fairly lengthy process and will take some time to complete, but it will let us make changes from a central location and know that every environment receives that configuration change. All new client environment builds already use this methodology, which has accelerated the speed at which we can deploy environments from days or weeks to hours and provides one place to adjust configurations.
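The central-configuration idea can be sketched with a Flux Kustomization: each environment continuously reconciles against a path in one Git repository, so a change committed there rolls out everywhere. The repository and path names here are hypothetical.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: customer-env
  namespace: flux-system
spec:
  interval: 10m          # how often Flux re-syncs against Git
  prune: true            # remove resources deleted from the repo
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./environments/customer-a
```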
We also learned that we need more automation around pod sizing. No two customers are the same, and we've found that the same service deployed for two different customers may not behave the same way. Using technology like the Vertical Pod Autoscaler (VPA) ensures that pod sizing is based on actual environment behavior rather than on predetermined limit values.
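A minimal VPA configuration for a single service might look like the following. The target deployment and the min/max bounds are illustrative; the key point is that the autoscaler derives resource requests from observed usage instead of static values.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: lobby-service-vpa   # name is illustrative
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lobby-service
  updatePolicy:
    updateMode: "Auto"      # VPA evicts and resizes pods as observed usage changes
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 128Mi     # floor so recommendations never starve the pod
        maxAllowed:
          memory: 2Gi       # ceiling so a misbehaving pod can't claim the node
```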
Finally, our load-test team will spend more time verifying that HPA configurations always trigger horizontal scaling before a service goes into an OOM kill or crash loop.
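The property the load tests need to verify can be sketched as an HPA whose memory target sits well below the container's limit, so new replicas are added long before any pod approaches an OOM kill. Names and percentages below are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: matchmaking-hpa     # name is illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: matchmaking
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 60   # scale out well before the memory limit is hit
```

Under load, the test then asserts that replica count rises while no pod is ever OOM-killed; if a pod dies first, the target utilization or the request-to-limit ratio needs adjusting.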
Interested in learning about our services? Request a demo here.