
Responding to a DDoS Attack on Game Multiplayer Services

A Distributed Denial of Service (DDoS) attack is an attempt to overwhelm a service with traffic so that it becomes unavailable to its legitimate users. We recently experienced our first DDoS attack, which targeted one of our customers and required a swift response. The customer's production environment was taken offline for a period of time. This blog post walks through what happened and how AccelByte responded to resolve the issue.

Summarizing the DDoS Attack

On May 4, 2022, at around 8:30 AM Pacific time, alerts began arriving in our main incident reporting Slack channel about inaccessible health checks and status pages for this game's services.

After some investigation, we determined the cause to be a massive volume of incoming traffic, reaching almost 4M req/sec. Normal traffic for this game was closer to 10 req/sec, so the attack was several hundred thousand times the usual load.

With numerous system components reporting failures, and every service either still scaling or already at its maximum limit yet failing to recover, we knew that we were being attacked.

For many of our services, we set limits that prevent the system from scaling above certain levels, in order to protect our datastore components. All our current customers' deployments use dedicated components exclusively for that customer. These backends are sized based on historical demand and projected, agreed-upon capacity needs. This approach is somewhat inflexible and more expensive, but it gives each customer complete confidence that we keep their data separate and secure from our other customers. The downside is that we often don't have larger datastore servers deployed that could absorb or cushion the shock of even a small DDoS. Many of these components cannot be scaled without causing some brief user impact, so we size each environment to ensure our customers pay only for what they need and not for idle capacity.
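To make the effect of such a scaling cap concrete, here is a minimal sketch in Python. It uses the replica formula documented for the Kubernetes Horizontal Pod Autoscaler and then clamps the result to configured bounds. The replica counts, per-pod targets, and the function name are illustrative assumptions, not values from our environments.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """HPA-style replica calculation, clamped to the configured bounds."""
    # Documented HPA formula: ceil(currentReplicas * currentMetric / targetMetric)
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # The cap is what protects downstream components such as datastores.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical numbers: 3 pods, each seeing 10 req/sec against a target of 5.
normal = desired_replicas(3, 10, 5, min_replicas=2, max_replicas=20)

# Hypothetical attack: 20 pods, each seeing 200,000 req/sec against the same target.
attack = desired_replicas(20, 200_000, 5, min_replicas=2, max_replicas=20)

print(normal, attack)
```

Under normal load the formula is free to add pods. Under attack the uncapped result would be enormous, and the cap holds the deployment at its maximum, which matches the "scaled to the max limits" state we observed.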

DDoS Attack Recovery Effort

The AccelByte live operations team quickly stepped in to recover the system from the attack. The following outlines the steps we took to remedy the situation and get everything running smoothly again.

  1. We manually scaled the Istio ingress controller pods. This triggered the system to scale up and add more instances as Kubernetes worker nodes to handle the increased load.
  2. We manually scaled the website service pods to see if they could handle the traffic. This didn't help, as the pods continued to crash after hitting memory limits.
  3. At this point, we initiated a rapid deployment of AWS WAF with a basic ruleset.
  4. With the WAF in place, we were able to block all traffic.
  5. We then opened up traffic to some internal ranges and a few select geographic areas, and we could see gameplay starting to recover.
  6. We enabled some additional areas, including where our operations team was located. This caused the system to become unstable for a short period, so we had to pause and reverse course.
  7. We changed direction and instead focused on enabling Europe and South America.
  8. In parallel, we created and deployed a new rate-based rule to ensure our services remained stable in case our attacker found new hosts to exploit.
  9. With the rate filter in place, we began to remove our geographic filter. By this point the attacker appeared to have stopped sending traffic, so we thought we were in the clear.
  10. We had collected some evidence of the attack and were able to craft a custom rule to block the traffic, but we were unable to validate it against real traffic.
  11. One of our engineers noticed that our rate-based rule was blocking real traffic and, assuming the attacker had given up, turned it off.
  12. The following day the attacker came back with far more traffic than the first time. A different engineer from the previous day responded initially and was confused as to why the custom rule didn't work, so they re-enabled the rate filter that had previously been disabled, which quickly allowed the system to stabilize.
  13. With the attack ongoing and traffic increasing, the engineer who wrote the custom rule found the mistake in its regex and corrected it.
  14. With the previously untested rule now corrected and validated against real traffic, the vast majority of the attack traffic was blocked.
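The rate-based rule mentioned in the steps above can be pictured as a per-source sliding window counter. The sketch below is a simplified Python model of that idea, not AWS WAF's implementation. The class name, the limit of 100 requests, the 300-second window, and the IP addresses are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Simplified model: block a source once it exceeds `limit` requests per window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # source IP -> timestamps of recent requests

    def allow(self, source_ip: str, now: float) -> bool:
        hits = self._hits[source_ip]
        # Forget requests that have aged out of the trailing window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)
        return len(hits) <= self.limit


rule = RateBasedRule(limit=100, window_seconds=300)

# A hypothetical player sending one request per second for a minute.
player_ok = all(rule.allow("198.51.100.7", float(t)) for t in range(60))

# A hypothetical attacking host sending 1,000 requests within one second.
attacker = [rule.allow("203.0.113.9", i / 1000) for i in range(1000)]
blocked = attacker.count(False)

print(player_ok, blocked)
```

The limit is the sensitive parameter. Set too high, it lets attack traffic through; set too low, it blocks legitimate clients, which is the trade-off we ran into when the rule started blocking real traffic.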

The Resolution and Lessons Learned

This attack taught us a few lessons.

First, we should always have a WAF configuration deployed and ready to go in a state where it can be turned on within seconds or minutes, regardless of whether we need it.
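Part of keeping a WAF configuration ready is knowing that its custom rules actually match what they are meant to match; the untested regex in the timeline above is the motivating case. A minimal Python sketch of replaying captured samples against a candidate pattern before deploying it follows. The patterns and request strings are hypothetical, since this post does not describe what the real rule matched.

```python
import re
from typing import Dict, List


def evaluate_rule(pattern: str, attack_samples: List[str],
                  legit_samples: List[str]) -> Dict[str, List[str]]:
    """Report attack samples the pattern misses and legitimate samples it blocks."""
    rule = re.compile(pattern)
    missed = [s for s in attack_samples if not rule.search(s)]
    false_positives = [s for s in legit_samples if rule.search(s)]
    return {"missed": missed, "false_positives": false_positives}


# Hypothetical captured request targets (not taken from the real incident).
attack_samples = ["/?cb=83412", "/?cb=7", "/?cb=1234"]
legit_samples = ["/", "/status", "/api/items?cb=12"]

# A first draft that wrongly assumes the number is always four digits.
draft = evaluate_rule(r"^/\?cb=\d{4}$", attack_samples, legit_samples)

# The corrected pattern accepts any number of digits.
fixed = evaluate_rule(r"^/\?cb=\d+$", attack_samples, legit_samples)

print(draft)
print(fixed)
```

A check like this can run against logged samples the moment evidence is collected, so a mistake in the pattern surfaces before the next wave of traffic rather than during it.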

We also learned that we had made some incorrect assumptions about our Istio configurations, which caused a few different memory-related crash issues. We will be taking steps to validate and adjust these configurations in our environments.

We need to review all our system configurations around limits to make sure they are correctly captured, documented, and applied. We are in the process of gradually rebuilding all our environments using a full infrastructure-as-code (IaC) and GitOps approach based on Flux and Terraform. Converting all our existing environments is a fairly lengthy process and will take some time to complete. Once done, it will allow us to make changes from a central location and know that all environments will get that config change. All our new client environment builds already use this methodology. It has allowed us to cut the time needed to deploy an environment from days or weeks to hours, and it provides one place to adjust configurations.

We also learned that we need something in place to make pod sizing more automated. No two customers are the same, and we've found that the same service deployed for two different customers may not always behave the same way. Using technology like the Vertical Pod Autoscaler (VPA) will ensure that pod sizing is based on actual environment behavior rather than on predetermined limit values.

Finally, our load test team will spend more time verifying that HPA configurations always trigger horizontal scaling before a service runs out of memory (OOM) or enters a crash loop.
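As a rough illustration of sizing from observed behavior, the Python sketch below derives a memory request from usage samples using a percentile plus a safety margin. This is a simplified stand-in and not the recommender the Vertical Pod Autoscaler actually runs. The function name, the 90th percentile, the 15% margin, and the sample values are all assumptions.

```python
def recommend_memory_request(samples_mib, percentile=90, margin_pct=15):
    """Nearest-rank percentile of observed usage plus a safety margin, in MiB."""
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mib)
    # Nearest-rank percentile using integer arithmetic (ceiling division).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    base = ordered[rank - 1]
    # Add the margin and round up to a whole MiB.
    return (base * (100 + margin_pct) + 99) // 100


# Hypothetical memory usage (MiB) of the same service at two customers.
customer_a = [200, 210, 220, 230, 240, 250, 260, 270, 280, 290]
customer_b = [400, 420, 450, 470, 500, 520, 540, 560, 600, 640]

rec_a = recommend_memory_request(customer_a)
rec_b = recommend_memory_request(customer_b)

print(rec_a, rec_b)
```

The same service ends up with very different requests in the two hypothetical environments, which is the point: a single predetermined value would be wrong for at least one of them.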
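One way to reason about that check is to compare how long a pod has between the HPA trigger point and its memory limit against how long new replicas take to become ready. The Python sketch below assumes memory grows linearly under load and that the HPA targets memory utilization relative to the pod's request. All names and numbers are hypothetical.

```python
def scales_before_oom(request_mib: float, limit_mib: float,
                      target_utilization_pct: float,
                      growth_mib_per_s: float, scale_delay_s: float) -> bool:
    """True if new replicas should be ready before a pod reaches its memory limit."""
    # HPA utilization targets are expressed as a percentage of the request.
    trigger_mib = request_mib * target_utilization_pct / 100
    if trigger_mib >= limit_mib:
        return False  # the pod would be OOM-killed before scaling even triggers
    if growth_mib_per_s <= 0:
        return True  # usage is not growing, so the limit is never reached
    # Time between crossing the HPA trigger and hitting the limit.
    headroom_s = (limit_mib - trigger_mib) / growth_mib_per_s
    return headroom_s > scale_delay_s


# Hypothetical pod: request equals limit, so there is little room above the trigger.
tight = scales_before_oom(512, 512, 80, growth_mib_per_s=5, scale_delay_s=60)

# Same pod with a higher limit, leaving time for new replicas to start.
roomy = scales_before_oom(512, 1024, 80, growth_mib_per_s=5, scale_delay_s=60)

print(tight, roomy)
```

In the first case the pod has about 20 seconds of headroom against a 60-second scale-up delay, so it crashes first. A load test can measure the real growth rate and delay and feed them into a check like this.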

Interested in learning about our services? Request a demo here.
