Despite the lengthy periods of downtime, the Fortnite player count keeps on rising, with a new milestone of over 3.4 million concurrent players active during last Saturday. (Yes, that’s the same weekend during which the game was experiencing problems!)
Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn’t come without issues! This blog post aims to share technical details about the challenges of rapidly scaling a game and its online services far beyond our wildest growth expectations.
Epic Games shared this interesting new stat in a new blog post. In addition to boasting about just how successful the game has become, the developers also provided a “postmortem” of the Saturday and Sunday outages.
The report on what exactly went wrong is incredibly detailed. If you want to read the full thing, you can do so here. For this article, however, I’ll only be providing the highlights and the planned fixes.
The extreme load caused 6 different incidents between Saturday and Sunday, with a mix of partial and total service disruptions to Fortnite.
NEXT STEPS AND UPDATES
Our top focus right now is to ensure service availability. Our next steps are below:
- Identify and resolve the root cause of our DB performance issues. We’ve flown Mongo experts on-site to analyze our DB and usage, as well as provide real-time support during heavy load on weekends.
- Optimize, reduce, and eliminate all unnecessary calls to the backend from the client or servers. Some examples are periodically verifying user entitlements when this is already happening implicitly with each game service call. Registering and unregistering individual players on a game play session when these calls can be done more efficiently in bulk, Deferring XMPP connections to avoid thrashing during login/logout scenarios. Social features recovering quickly from ELB or other connectivity issues. When 3.4 million clients are connected at the same time these inefficiencies add up quickly.
- Optimize how we store the matchmaking session data in our DB. Even without a root cause for the current write queue issue we can improve performance by changing how we store this ephemeral data. We’re prototyping in-memory database solutions that may be more suited to this use case, and looking at how we can restructure our current data in order to make it properly shardable.
- Improve our internal operation excellence focus in our production and development process. This includes building new tools to compare API call patterns between builds, setting up focused weekly reviews of performance, expanding our monitoring and alerting systems, and continually improving our post-mortem processes.
- Improve our alerting and monitoring of known cloud provider limits, and subnet IP utilization.
- Reducing blast radius during incidents. A number of our core services are globally impacting to all players. While we operate game servers all over the world, expanding to additional cloud providers and supporting core services in multiple geographical locations will help reduce player impact when services fail. Expanding our footprint also increases our operational overhead and complexity. If you have experience in running large worldwide multi cloud services and/or infrastructure we would love to hear from you.
- Rearchitecting our core messaging stack. Our stack wasn’t architected to handle this scale and we need to look at larger changes in our architecture to support our growth.
- Digging deeper into our data and DB storage. We hit new and interesting limits as our services grow and our data sets and usage patterns grow larger and larger every day. We’re looking for experienced DBAs to join our team and help us solve some of the scaling bottlenecks we run into as our games grow.
- Scaling our internal infrastructure. When our game services grow in size so do our internal monitoring, metrics, and logging along with other internal needs. As our footprint expands our needs for more advanced deployment, configuration tooling and infrastructure also increases. If you have experience scaling and improving internal systems and are interested in what is going on here at Epic, let’s have a chat.
- Performance at scale. Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience. If you have experience with large scale performance tuning and want to come make improvements that directly impact players please reach out to us.
- MCP Re-architecture – Move specific functionality out of MCP to microservices – Event sourcing data models for user data – Actor based modeling of user sessions
Here’s hoping this weekend will go a bit more smoothly!
In other Fortnite news, Epic Games is slowing down the development of Paragon to focus on Fortnite, here’s the latest community stats showcase, and get a look at the future updates coming to the game.
Source: Epic Games