26 Aug 2012

The Launch Day Blues

When was the last time you saw an MMO launch that went off without a hitch? From login issues preventing access to major features being disabled, it seems that almost every modern MMO has a rocky start in life. Even complex online games from experienced teams have suffered similar problems, from days of inaccessibility to weeks of poor performance.

But why does it happen? Why do experienced companies who are used to building and maintaining these types of systems continually run into these kinds of problems? Why are they, in many cases, unable to thoroughly test these kinds of systems? And what can be done to make sure that gamers still have a good experience, even if some features are unavailable?

I’m going to try to tackle a handful of these issues, explaining what companies can do, and why those measures still don’t cover every case. And while I have experience of designing massively multi-user systems myself, the usage patterns of MMO gaming are somewhat different to an online shop or customer service portal. Don’t start lambasting developers for what I write here, ‘kay?

The ‘Stress’ Test

MMO games are in a bit of a bind at the moment. Many of us play for the story or questing experience, which means that if you provide long beta sessions, players can experience the best parts of your game without paying a penny towards it. This isn’t great, as it can eat into your release-day sales. On top of that, some actively avoid betas so that their launch-day experience isn’t ruined by knowing all the spoilers.

Which is where beta weekend events and stress tests have come in.

These are great in theory: you can look at player behaviour, analyse how they interact with your underlying systems, and come up with reasonable predictions on how heavily these systems will be used on launch day. Which is great, for a given normalised scenario. If something doesn’t work, players tend to shrug, maybe file a report, and move on. Because hey, it’s beta.

The problem is, these tests don’t help with judging system performance under a live load. You have a small subset of users who actively test systems, who are like gold dust to you (especially if they file reports). And you might funnel them through a limited set of hardware in order to stress performance more, such as having a limited number of worlds up and running. But they don’t help with testing the limits of shared and centralised services such as A&R (access and registration, more commonly known as signing up and logging in), or item shops.

If you want to heavily test a service like A&R, it’s actually pretty difficult to do. The database cluster might be rated for so many thousands of operations per second, while the application cluster above might be rated in a similar way. Then there’s proxies and firewalls on top of that, as well as intrusion and DDOS detection and management. On top of that, MMO games are geographically distributed platforms. Your A&R service might need to support queries from a number of game servers located in data centres around the world.
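To see why the stack matters, consider that a layered service can never go faster than its slowest layer. Here’s a minimal sketch of that idea; every rating below is an invented example, not a real system’s numbers:

```python
# Hypothetical per-layer throughput ratings for an A&R stack (ops/sec).
# The end-to-end capacity is bounded by the bottleneck layer.
layer_capacity_ops = {
    "database cluster": 40_000,
    "application cluster": 55_000,
    "proxy / firewall": 70_000,
    "ddos inspection": 30_000,
}

def effective_capacity(layers: dict) -> tuple:
    """Return the bottleneck layer name and its rating."""
    name = min(layers, key=layers.get)
    return name, layers[name]

bottleneck, ops = effective_capacity(layer_capacity_ops)
print(f"Bottleneck: {bottleneck} at {ops} ops/sec")
# → Bottleneck: ddos inspection at 30000 ops/sec
```

It’s a trivial calculation, but it’s why rating each cluster in isolation tells you little: the DDOS inspection box nobody benchmarked can quietly cap the whole platform.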

The easiest battlecry is to say “Simulate everything!”. Build test harnesses that bombard your A&R service with millions of requests for account creation, payment and login. But:

  • How do you generate the test data? Random?
  • How do you perform random queries across the dataset?
  • Can you build a test system that can test – and validate – millions of simulated, geographically disparate, user transactions, without spending a ridiculous amount of money on it?

This third point’s the killer. Anyone can generate masses of data, but it’s being able to shunt queries in at such a high speed that they actually stress the gargantuan service you’re trying to overload. That’s hard.
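To make the shape of the problem concrete, here’s a toy load harness against a stubbed service. Everything here is invented for illustration; the real difficulty is that a genuine harness must outrun the service under test, which a single process like this never will:

```python
import concurrent.futures
import random
import string
import time

def make_account() -> dict:
    """Generate a random test account. Real harnesses need realistic,
    non-colliding data: emails, payment details, regions, and so on."""
    name = "".join(random.choices(string.ascii_lowercase, k=12))
    return {"user": name, "region": random.choice(["eu", "us", "ap"])}

def stub_login(account: dict) -> bool:
    """Stand-in for a network round trip to the A&R service."""
    time.sleep(0.001)  # pretend 1 ms of service latency
    return True

def run_load(n_requests: int, workers: int) -> float:
    """Fire n_requests concurrent logins; return achieved requests/sec."""
    accounts = [make_account() for _ in range(n_requests)]
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(stub_login, accounts))
    elapsed = time.perf_counter() - start
    assert all(results)
    return n_requests / elapsed

print(f"~{run_load(500, 50):.0f} req/sec from one harness process")
```

One process like this tops out at a few tens of thousands of requests per second at best; stressing a service rated for far more means fleets of harness machines in multiple regions, which is exactly the “ridiculous amount of money” problem.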

The Button Masher Scenario

So I have my A&R service, and I flick the switch and allow gamers to start logging in. Word gets around and traffic starts quickly building up. But what happens when the number of requests coming in becomes greater than my service can handle?

  • Player presses the login button
  • Request gets sent to A&R platform
  • Request times out
  • Player presses the login button
  • and so on…

It’s likely that the application layer of the A&R service has solid buffer management that ditches stale requests older than the client timeout, but the point I wanted to demonstrate is how unresponsive systems generate further requests against themselves. From both a simulation standpoint and an engineering one, this is trickier still.
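The feedback loop above can be sketched as a toy discrete-time model (all numbers invented): each second the service handles a fixed number of requests, and every unserved player presses login again, so leftover demand rolls forward and compounds with new arrivals.

```python
def simulate(capacity: int, arrivals_per_sec: int, seconds: int) -> list:
    """Toy model of a retry storm: unserved requests become retries
    next second. Returns the backlog size after each second."""
    backlog = 0
    queue_sizes = []
    for _ in range(seconds):
        backlog += arrivals_per_sec      # new players plus prior retries
        served = min(backlog, capacity)  # service drains what it can
        backlog -= served
        queue_sizes.append(backlog)
    return queue_sizes

# Demand at 120% of capacity: the backlog grows every second.
print(simulate(capacity=1000, arrivals_per_sec=1200, seconds=5))
# → [200, 400, 600, 800, 1000]

# Demand at 80% of capacity: the queue never forms.
print(simulate(capacity=1000, arrivals_per_sec=800, seconds=5))
# → [0, 0, 0, 0, 0]
```

Dropping stale requests caps how big the queue can get, but it doesn’t change the underlying arithmetic: once retries push demand past capacity, the service stays saturated until demand falls or capacity grows.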

Setting the Engineering Bar

Here’s a question for you: do I design a system that can handle three times the anticipated normal traffic load, or do I design it for three times the predicted peak load?

As a supplementary point: I will only ever see that peak load at launch. After that, I expect load to fall within the normal 3× threshold for the rest of the life of the game. I know that MMOs typically fall back to around 40% of their subscriber base within two months of launch, so I’ll have even more capacity.

Oh, and building the system to handle three times predicted peak will cost ten times more.
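The arithmetic behind that tradeoff is worth spelling out. Here’s a back-of-envelope sketch; every figure is invented, and the 10× cost multiplier is just the rule of thumb from above:

```python
normal = 20_000          # anticipated steady-state logins/sec (invented)
launch_peak = 60_000     # predicted launch-window spike (invented)

plan_normal = 3 * normal        # provision for 3x normal: 60,000 ops/sec
plan_peak = 3 * launch_peak     # provision for 3x peak: 180,000 ops/sec

cost_normal = 1.0               # normalise the cheaper plan's cost to 1
cost_peak = 10.0                # roughly ten times more, per the text

# Two months in, the population settles to ~40% of the launch base,
# so steady-state demand sits comfortably inside the cheaper plan:
steady_state = launch_peak * 0.4
print(steady_state <= plan_normal)
# → True
```

In other words, the expensive plan buys headroom you need for a window of weeks, then pays for idle hardware for years.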

This is one of those difficult tradeoffs. On the one hand, I want to deliver the best possible experience to gamers. But on the other, that’s a truckload of money that I’d like to spend on creating new content instead. This is the producer or project manager’s decision, and it’s not an easy one to make. Developers do not have a bottomless bucket of money to throw at a problem, especially after spending five years creating something.

The Solution?

There’s no perfect play here. You’re either going to spend a heap of cash on something that benefits you for a couple of months at most, or you’re going to run tight systems with a mass of fallback plans. In 99% of cases, the second option is the one that gets picked. And that fallback plan is everything from leasing temporary additional capacity, through tweaking system and queue parameters, to community management and communication strategies.

From a design standpoint though, you decouple as much as humanly possible. A broken item shop shouldn’t bring down your game. Some in-game services might be made unavailable so that your databases prioritise key requests (such as logging in). Ensure that your core systems are up and running, and everything else can be toggled off until things stabilise. Other than that, keep talking to your players and let them know what’s going on.
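That “toggle everything else off” idea is usually implemented with feature flags. Here’s a minimal sketch of the pattern (not any specific studio’s system; the flag names are made up):

```python
# Feature flags: core services stay on, non-essential ones can be
# switched off at runtime to shed load while things stabilise.
FEATURE_FLAGS = {
    "login": True,          # core - never toggled off
    "item_shop": False,     # disabled while the databases recover
    "auction_house": False,
    "guild_search": True,
}

def handle_request(feature: str) -> dict:
    """Degrade gracefully instead of timing out or taking the game down."""
    if not FEATURE_FLAGS.get(feature, False):
        return {
            "status": "unavailable",
            "message": f"{feature} is temporarily offline; core play is unaffected",
        }
    return {"status": "ok"}

print(handle_request("item_shop")["status"])  # → unavailable
print(handle_request("login")["status"])      # → ok
```

The design point is that the check happens at the boundary of each non-core service, so a flipped flag turns a potential cascade failure into a polite “come back later” message.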

And finally, for those asking why MMOs became this complex: it’s because we ask for them to be like this. We want flexible services that let us play with our friends in an easy and frictionless manner. We want to be treated as part of a global community. We want barriers to be broken down. What’s happening is a step in the right direction, but these aren’t easy things to deliver.
