
Scaling Matchmaking to One Million Players

Written by Steve Cawood | Nov 12, 2024

Introduction

The definition of launch preparedness has evolved for AccelByte customers. Over the last few years, more games have reached one million concurrent users (CCU) than ever before. Because of this, scaling up to a large player load is now a requirement for many of our customers’ projects. In this blog, we’ll describe how we tested AccelByte Gaming Services (AGS) to run a game smoothly with one million CCU.

Test Goals

Before running this test, one question we pondered internally was, “What should be our target CCU?” We chose one million as a nice round number, but ultimately, our goal was to identify chokepoints in our systems and ensure that AGS can scale horizontally and efficiently. Another aspect we had to address was the validity of the load-testing results. Since video games have dramatically different backend profiles, one million CCU means something different for each customer and each of their games. Because it isn’t possible to discuss every configuration in this article, we picked a real-world matchmaking scenario and flow that some of our customers’ games are already using.

Service Architecture

Three services are involved in our AGS one million CCU matchmaking load test: Matchmaking, Session, and Lobby. The Matchmaking service collects match tickets from players and parties and puts them in queues. The service then spawns worker processes to group tickets together into matches based on the matchmaking ruleset. The Session service stores and allows updates to various player sessions, as well as facilitates party formation (players matchmaking as a group) and party operations. Finally, the Lobby service provides a constant connection between game clients and the backend.

Fig. 1. AGS high-level architecture for one million CCU load test

Matchmaking Load Testing Scenario

To make the load test representative, we simulated an extensive matchmaking scenario that encompassed parties, skill-based matching, and team assignment. For this test, we connected virtual users to the Lobby service over roughly an hour until one million players were connected. To test stability as well, we then maintained the load of one million CCU for a sustained period of one hour.
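As a rough illustration of this load profile, here is a minimal Python sketch of a ramp-and-hold schedule; the stage format and the linear ramp are assumptions for illustration, not the actual load-test tooling configuration.

import math

LOAD_PROFILE = [
    {"stage": "ramp-up", "duration_minutes": 60, "target_ccu": 1_000_000},
    {"stage": "sustain", "duration_minutes": 60, "target_ccu": 1_000_000},
]

def target_ccu_at(minute):
    # Assumed linear ramp over the first hour, then a one-hour hold at one million CCU.
    if minute <= 60:
        return math.floor(1_000_000 * minute / 60)
    return 1_000_000

print(target_ccu_at(30), target_ccu_at(90))  # 500000 1000000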

In this test, simulated users formed parties of various sizes before submitting a matchmaking ticket request to the Matchmaking service. When a match session was formed, the simulated users entered the “playing” state for a random duration between 8 and 13 minutes, chosen to reflect a typical session length in first-person shooter (FPS) games. It’s key to highlight that every simulated user in this test was in a joinable party, either as a party leader or as a party member. We used this configuration to explore the boundaries of our system.

After each matchmaking session ended, the simulated users either disbanded their current party and formed new parties or kept the existing party together. This was decided randomly for each party, with a 50% chance of the party being disbanded. The simulated client then submitted a new matchmaking ticket request to replicate searching for a new game session.
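The per-party behavior loop can be sketched as follows. This is a simplified illustration in Python, not the actual test harness: the AGS calls are replaced with placeholder functions, and only the control flow (submit ticket, play 8-13 minutes, 50% chance to disband, re-queue) mirrors the scenario described above.

import random

def submit_match_ticket(party):   # placeholder for the Matchmaking ticket request
    return {"party_id": party["id"]}

def play_match(minutes):          # placeholder for the "playing" state
    pass

def leave_session(party):         # placeholder for the Leave Session call
    pass

def run_party(party, rounds=3):
    for _ in range(rounds):
        submit_match_ticket(party)             # queue the whole party for a match
        play_match(random.uniform(8, 13))      # play for 8-13 minutes
        leave_session(party)
        if random.random() < 0.5:              # 50% chance the party disbands
            party = {"id": party["id"] + 1, "members": party["members"]}  # stand-in for regrouping
        # otherwise the same party submits a new ticket on the next iteration

run_party({"id": 1, "members": ["p1", "p2", "p3"]})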

The matchmaking ruleset used for this load test was for a 5v5 game and used a single match pool. A single pool is the worst-case scenario for the Matchmaking service; most games will (and should) have multiple match pools, which makes the load easier to handle. Furthermore, the matchmaking rules were set up to match players based on Matchmaking Rating (MMR) while maintaining low latency. We also flexed both the MMR and latency rules over time to ensure reasonable wait times. Each ticket contained an MMR value for each player in the party and an array of latency values (for example, “[us-east-1:29,eu-central-1:72,ap-southeast-1:125]”). The MMR values were assigned to players randomly but followed a normal distribution (bell curve) similar to what you would observe in a large population.

Fig. 2. MMR distribution of the simulated player population
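To illustrate what one ticket in this scenario might carry, here is a minimal Python sketch that draws party MMR values from a normal distribution and attaches a per-region latency map. The mean, standard deviation, and latency ranges are illustrative assumptions, not the exact values used in the test.

import random

REGIONS = ["us-east-1", "eu-central-1", "ap-southeast-1"]

def make_ticket(member_ids):
    return {
        "members": [
            # bell-curve MMR; the mean of 1500 and stddev of 300 are assumed values
            {"id": m, "mmr": round(random.gauss(1500, 300))}
            for m in member_ids
        ],
        # latency in ms to each region, e.g. [us-east-1:29, eu-central-1:72, ap-southeast-1:125]
        "latencies": {region: random.randint(20, 250) for region in REGIONS},
    }

print(make_ticket(["p1", "p2", "p3"]))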

Based on this configuration, the Matchmaking service initially attempted to group players into 10-player match sessions (two teams of 5) with a maximum skill spread of 40 and where no player had a latency higher than 50ms to the datacenter. If 10 players were not immediately available (which is rare at these population levels), the service would initially allow games as small as 3v3.

Various flexing rules were also configured to kick in over time. Flexing on MMR started at a spread of 40 and widened to a spread of up to 100 MMR points after 60 seconds of waiting. This guaranteed a match for the few players with very high or very low MMRs. Flexing on latency increased linearly, adding 50ms of allowable latency for every 5 seconds of waiting, up to a maximum of 250ms. In this test, auto backfill was enabled, so matchmaking automatically created a backfill ticket to fill incomplete match sessions with new tickets.
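To make the flexing behavior concrete, here is a small Python sketch of how the allowed MMR spread and latency grow with a ticket’s wait time under this ruleset, assuming each flex step applies once its stated wait duration has elapsed (the step values come from the rule shown below).

def allowed_mmr_spread(wait_seconds):
    # 40 initially, flexing to 60 after 20s, 80 after 30s, and 100 after 60s
    if wait_seconds >= 60:
        return 100
    if wait_seconds >= 30:
        return 80
    if wait_seconds >= 20:
        return 60
    return 40

def allowed_latency_ms(wait_seconds):
    # starts at 50ms and adds 50ms every 5 seconds of waiting, capped at 250ms
    return min(50 + 50 * (wait_seconds // 5), 250)

for t in (0, 10, 25, 45, 70):
    print(t, allowed_mmr_spread(t), allowed_latency_ms(t))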

For more details about the configuration, here’s the full AGS matchmaking rule:

{
  "name": "match-load-test",
  "enable_custom_match_function": false,
  "data": {
    "auto_backfill": true,
    "alliance": {
      "min_number": 2,
      "max_number": 2,
      "player_min_number": 3,
      "player_max_number": 5
    },
    "alliance_flexing_rule": [
      {
        "min_number": 2,
        "max_number": 2,
        "player_min_number": 2,
        "player_max_number": 5,
        "duration": 240
      }
    ],
    "matching_rule": [
      {
        "attribute": "mmr",
        "criteria": "distance",
        "reference": 40
      }
    ],
    "flexing_rule": [
      {
        "duration": 20,
        "attribute": "mmr",
        "criteria": "distance",
        "reference": 60
      },
      {
        "duration": 30,
        "attribute": "mmr",
        "criteria": "distance",
        "reference": 80
      },
      {
        "duration": 60,
        "attribute": "mmr",
        "criteria": "distance",
        "reference": 100
      }
    ],
    "match_options": {
      "options": [
        {
          "name": "cross_platform",
          "type": "disable"
        },
        {
          "name": "DS_READY",
          "type": "disable"
        }
      ]
    },
    "region_expansion_rate_ms": 5000,
    "region_expansion_range_ms": 50,
    "region_latency_initial_range_ms": 50,
    "region_latency_max_ms": 250
  }
}

Fig. 3. Simulated user actions

Load Testing Results

The high-level summary of our load-testing results is that the Party service, Session service, Lobby service, and matchmaking system successfully handled the load of one million players.

Throughout the load test, the connection curve to the Lobby service exhibited a smooth progression indicative of stable performance even as user numbers increased. At the peak of the test, we observed nearly 100,000 concurrent active match sessions and over one million CCU.

Overall, the load testing indicates robust performance and reliability of the game's infrastructure. AGS was proven capable of handling a substantial number of concurrent users with high efficiency and minimal latency. This establishes a strong foundation for scaling our services to accommodate growing user demand while also maintaining a quality gaming experience.

Concurrent Users

We saw a smooth curve of connected players to the Lobby service throughout the load test.

Fig. 4. Total users online

At peak, there were close to 100,000 concurrent active match sessions. Each active match session can contain up to 10 players, so the number of active users was naturally much higher than the concurrent match sessions.

Fig. 5. Rolling count of match sessions created over the trailing 15 minutes.
Session length is 8-13 minutes, so this represents a bit under 100k concurrent sessions.
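As a back-of-the-envelope check of how a trailing-15-minute creation count relates to concurrency, here is a small Python sketch based on Little’s law. The creation count used is a hypothetical reading, not a number taken from the test.

# Little's law: concurrent sessions ~= creation rate x average session duration
created_last_15_min = 135_000            # hypothetical reading from a graph like Fig. 5
avg_session_minutes = (8 + 13) / 2       # sessions last 8-13 minutes
creation_rate_per_min = created_last_15_min / 15
concurrent_sessions = creation_rate_per_min * avg_session_minutes
print(round(concurrent_sessions))        # ~94,500, i.e. a bit under 100k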

Matchmaking Speed

In terms of matchmaking speed, the 99th percentile (p99) of successful matchmaking time remained under 35 seconds for simulated test users, demonstrating a timely matchmaking process. The percentiles of time to find a match are shown in the diagram below.

Fig. 6. p5, p90, p95, and p99 of successful matchmaking time.

Matchmaking Quality

The matchmaking pool was configured to create matches based on MMR while keeping latency low, so we’ll use MMR and latency to judge the quality of the matchmaking. 

From an MMR standpoint, the primary metric we use to judge quality is the MMR delta within each game session (i.e., the difference in MMR between the highest and lowest-skilled players in the game). Our observations in this run indicate an average spread of about 50 MMR points, which is higher than we’d like it to be. However, we believe this is due not to a failure of matchmaking but to the fact that we created parties that had large internal MMR deltas. We have plans to address this in future test runs. 

Fig. 7.  Heatmap showing the delta of MMR within matched game sessions. Darker colors represent a higher occurrence of that MMR delta.
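The metric behind Fig. 7 can be expressed in a few lines of Python; the session MMR values below are synthetic, for illustration only.

def mmr_delta(session_mmrs):
    # spread between the highest- and lowest-skilled players in one match session
    return max(session_mmrs) - min(session_mmrs)

sessions = [
    [1480, 1495, 1510, 1452, 1523, 1467, 1500, 1489, 1511, 1470],
    [1320, 1355, 1340, 1310, 1362, 1331, 1348, 1322, 1359, 1345],
]
deltas = [mmr_delta(s) for s in sessions]
print(deltas, sum(deltas) / len(deltas))  # per-session deltas and the average spread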

From a latency standpoint, our primary metric examines what percentage of parties were matched into a game that would be played in a region that was not the party’s best region. For example, an Asia-Pacific party being matched into a game hosted in North America would be bad. The test produced excellent results here, likely due to the large population of players in all regions. 

Ticket’s lowest-latency (best) region | Count of tickets matched to their best region | Count of tickets matched into another region | Percentage of tickets matched into another region (%)
us-east-2 | 2,933,704 | 673 | 0.02
eu-central-1 | 1,444,781 | 774 | 0.05
ap-southeast-1 | 423,875 | 2,904 | 0.6

Chart. 1. Ticket region vs. game region results
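The check behind Chart 1 can be sketched as follows; this is illustrative Python, not the analysis code used for the test. A ticket counts as matched into “another region” when the game’s assigned region is not the ticket’s lowest-latency region.

def best_region(latencies_ms):
    # the region with the lowest measured latency for this ticket
    return min(latencies_ms, key=latencies_ms.get)

def matched_to_best_region(ticket_latencies_ms, assigned_region):
    return assigned_region == best_region(ticket_latencies_ms)

ticket = {"us-east-2": 29, "eu-central-1": 72, "ap-southeast-1": 125}
print(best_region(ticket))                             # us-east-2
print(matched_to_best_region(ticket, "eu-central-1"))  # False -> counted as "another region"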

Parties

Throughout the test, we saw healthy throughput and latency for Party service operations. All Party operations had a latency below one second, except for an increase during the initial ramp-up.

Fig. 8. Duration of Party API operations

Additionally, all Party operations were executed swiftly, completing within one second for the 90th percentile of requests.

Fig. 9.  Party API operations rate

Success Rate for Operations

The success rates for various operations were also exceptionally high. Most were nearly perfect (> 99.9%), except the Leave Session operation, which had a slightly lower success rate of 98.73%. However, this minor drop does not impact players because our software development kit (SDK) automatically retries, handling the conflict errors that occur when players leave a session simultaneously.
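As an illustration of this retry behavior, here is a hedged Python sketch (not the actual AGS SDK implementation): a Leave Session call that fails with a conflict caused by simultaneous departures is retried with a short backoff.

import time

class ConflictError(Exception):
    """Stand-in for a conflict error (e.g. HTTP 409) during simultaneous session departures."""

def leave_session_with_retry(leave_call, max_attempts=3, backoff_seconds=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return leave_call()
        except ConflictError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff between retries

# Example: a call that conflicts once, then succeeds on the retry.
responses = iter([ConflictError(), "ok"])
def flaky_leave():
    result = next(responses)
    if isinstance(result, Exception):
        raise result
    return result

print(leave_session_with_retry(flaky_leave))       # "ok"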

In addition, the p99 latency for all operations in the Matchmaking and Session services was under one second outside of the initial ramp period, where the p99 was under 2 seconds. These success rates and low latencies give players a smooth gameplay experience.

Operation | Success Rate (%)
Create ticket | 99.89
Match found | 99.92
Create party | 100
Join party | 99.99
Get my party | 100
Get my party by ID | 100
Leave party | 100
Update party | 100
Get session by ID | 100
Leave session | 98.73
Get my user attributes | 99.99
Update user attributes | 100

Chart. 2.  Success rates for various operations

Conclusion

Today’s video games require both innovation and reliability to meet the challenge of scaling to one million CCU. Supporting dynamic player growth at this scale takes both advanced technology and strategic planning. We’ve been working for years to develop AccelByte Gaming Services (AGS) so that game studios can focus on their gameplay and not worry about scaling their backend infrastructure.

In this load test, we demonstrated the scalability of our end-to-end matchmaking system. During the test, we sustained a load of one million connected players while AGS successfully processed matchmaking tickets for parties and formed match sessions throughout. These results show that we can meet our load-testing goal of one million CCU, a level of player load that allows us to support even the most successful modern game launches.

It’s worth noting that right-sizing databases and other service configurations is as critical as the load-testing activity itself. For these tests to succeed, we needed to adjust database sizes and configurations, and having done so in this one million CCU load test has helped us establish the right settings to apply for upcoming launches. We also learned that having the right observability metrics and graphs is key to validating the results of our tests.

If you have any questions or feedback about our matchmaking load test, we’d love to hear from you. Send us a message at hello@accelbyte.io.