A great anecdote in the 'we can't test this code' discussion -- for an aggressive two-week release cycle of this massively popular game, this level of testing has probably paid for itself many times over.
It definitely has. League of Legends is considered a serious e-sport game. Individual tournaments for League of Legends have $2,000,000+ prize pools. If a new bug decided a round of the tournament, it would undermine LOL as a serious contender for this type of attention. There is a terrific business case for investing the time to make this game as stable as possible.
This has actually happened before at official League of Legends tournaments (that is, a fairly serious bug in a serious match). I believe their policy in the case of a serious bug is to re-do the match, but I am not 100% sure.
It depends. If the bug was caught before game of record, you just restart the game. Afterwards, it must qualify as a game deciding bug. If so, the game gets scrapped and the match is redone.
This actually happened last year at their biggest tournament. A well-known bug caused several champions to be disabled midway through the tournament. Champions are regularly disabled due to bugs at international events.
In this case? Probably. But the idea that catching a bug in production is a huge problem is a myth, pushed by research that was later completely withdrawn.
When customers discover bugs before you do, it reflects poorly on your reputation and harms the trust you have established with the consumer of your application. That's not a myth.
Or you might also work in a hyper-regulated marketplace dealing with huge amounts of trust and money, and so finding a bug in production can be a very, very, very big deal.
I could publish a study saying smoking is bad for you, intentionally make some mistakes, and have the study withdrawn using strong words like 'fabricated' or 'lied'.
The result would be a dimino-compliant way of establishing that smoking is not bad for you. Or at least, that "smoking is bad for you" is a myth.
Of course, for smoking we have dozens of studies and a strong understanding of the mechanisms, so this trick wouldn't fool anyone. But, just be careful about letting fraudsters determine your opinion on a topic, whether for or against, whether they're caught or not.
As for bugs specifically, there are huge benefits to being able to reason about systems as if they are bug-free. It's a PITA when your libraries, kernel, compiler, upstream API, etc. don't behave as expected. As professional software developers, we can work around them; users can sometimes get really confused when things don't work as expected. Sometimes they're scared to tell you.
Smaller companies can probably bear more bugs, since they can give individual users more attention. Larger ones tend to need to work more reliably.
I don't think the characterization you're making of my argument is accurate, and I don't get the sense you're interested in understanding what I'm explaining here, so I'm probably done.
I'll leave you with a recommendation to read The Startup Owner's Handbook, and listen to the Stanford lectures given by Sam Altman and friends, who talk a lot more about this.
Depends on the bug. Sticking to League of Legends as an example, is it really the worst thing if they have to disable a champion for a day or two to fix a bug? Not really. It still happens fairly often, and the impact is inconvenient but it's not the end of the world. Now how about a bug that prevents logins or crashes the backend? This completely takes the game offline.
This sort of downtime hurts customer satisfaction and the bottom line. If serious issues arise often enough, customers lose confidence and patience and may leave your product for a competitor. And while you're offline trying desperately to rush a fix, you're losing revenue.
If the service/product you're offering has competitors, quality matters. Sometimes it's not even the measurable quality, but the perception your customers have. Find me a business owner who thinks major bugs in production are not "a huge problem".
On the other hand, Valve tends to treat bugs differently with Dota 2.
They have a test client in which a small group of players can test out new patches and find bugs. Recently when patch 6.87 was released, they practically skipped the test client and deployed it straight to the main client. The result was that there were all sorts of bugs in the game, but they worked really quickly to fix them so that after a few days, they were almost all gone. I think Valve's mentality is that it is more efficient for their large player base to beta test for them instead.
Also, there have been a few times when the game was literally unplayable for one or two hours because Valve pushed a random update.
>> We used to have this famous mantra... and the idea here is that as developers, moving quickly is so important that we were even willing to tolerate a few bugs in order to do it. What we realized over time is that it wasn't helping us to move faster because we had to slow down to fix these bugs and it wasn't improving our speed.
> But the idea that catching a bug in production is a huge problem is a myth
You clearly have never worked in fintech, banks or payments. I personally witnessed the moment we caught a race condition which cost the company n-k euros in missed transactions.
And this isn't even considering mission critical software where bugs can (literally) kill.
So, excluding life-critical software, it's still a business decision.
Engineering will naturally tend to geek out on infrastructure projects like this and they need to be kept focused on the business case. There has to be some push-back.
What is the cost of even one engineer working full-time on build verification?
The thing you really want to avoid is getting blind-sided by devastating bugs or systemic process problems. So it's a balance.
I've just seen many cases where projects got bogged down as developers built their super-uber-build-test framework. It's easy for people to push these projects through without rationally investigating the cost / benefit, because ... because ... you're not seriously suggesting that we shouldn't test more, you monster!?
Relax, I don't think I said bugs are never bad, or devastating. I said the idea that catching a bug in production is a huge problem is a myth. It is not, by default, true. It is true given additional information for a specific case, but is not the general rule.
Bug impacts are highly non-linearly and non-Gaussianly distributed. It doesn't matter if the "average" bug is not that big a deal to find in production, what matters is what the worst thing you can find in production is. Even if the study is not literally correct in the average case, it doesn't take much modeling or much real-world experience to see one is still wise to develop with very similar ideas in mind, if not an even more intense focus on getting bugs identified before production because the paper probably understates the impact given real bug distributions.
(I say "identified" because you don't always fix identified bugs. But you really, really want to have as much as possible identified before production. Production is a terrible place to find bugs for the first time. Yeah, every once in a while an identified bug might have a far worse than expected impact, I've had that happen, but at least in my experience the completely unidentified bugs are what kill you the hardest.)
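To make the average-versus-worst-case point concrete, here is a small sketch. It assumes, purely for illustration, that bug costs follow a Pareto distribution with shape 1.1 and a minimum cost of 1. Nothing in this thread or in the study being discussed supplies real numbers, and `pareto_costs` and its parameters are made-up names.

```python
import statistics


def pareto_costs(n, alpha=1.1):
    """Return n deterministic 'bug costs' taken at evenly spaced quantiles
    of a Pareto(alpha) distribution with minimum cost 1 (an assumed model)."""
    # Inverse CDF of the Pareto distribution: x = (1 - u) ** (-1 / alpha).
    return [(1 - (i + 0.5) / n) ** (-1 / alpha) for i in range(n)]


costs = pareto_costs(10_000)
median_cost = statistics.median(costs)
mean_cost = statistics.fmean(costs)
worst_cost = max(costs)

# Share of the total cost contributed by the most expensive 1% of bugs.
top_1pct_share = sum(sorted(costs)[-100:]) / sum(costs)

print(f"median={median_cost:.1f} mean={mean_cost:.1f} worst={worst_cost:.0f}")
print(f"top 1% of bugs account for {top_1pct_share:.0%} of total cost")
```

Under this assumed model the typical bug is cheap, while a small fraction of bugs carries a large share of the total cost, which is the shape of argument being made above. A different assumed distribution would give different numbers.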
This is an older style of thinking that is, while safe, slower. A competitor who doesn't follow this model and pushes fast, failing often, will generally get to market faster than a company that follows this line of thinking. Their software might be buggier, but the concept of "good enough" applies.
If you're building rocket engines that carry people, you think like this. If you're building a social networking website, or a food ordering website, or a home sharing website, the time it takes to go from "oh there's a bug" to "that bug is fixed in production" is a matter of hours, if not minutes.
The problem with bugs is not the bugs. It's what the bugs do. Bork up your database and the bug fix is not "minutes". (Especially if you don't notice for a while.) Piss off your customers and the bug fix is not "minutes". Screw up your money handling and the fix is not "minutes". (Well, your part of the fix may be just one minute, though that minute will be "you're fired".)
If you've only ever encountered bugs that can be fixed in "minutes", you're either very lucky, or not working on anything all that important. Or possibly, not very perceptive and you've actually got a mess on your hands and you haven't realized it yet. I've inherited systems run by such people; they think everything's hunky dory but it turns out you can hardly figure out how to connect two records in the database together correctly because they were too smart for academic bullshit like referential integrity, and, lo, their database was low on the integrity. Tends to work, except the amount of elbow grease required increases without bound until it exceeds the capabilities of the developer in question, and suddenly, one day, they wake up and their job is in serious jeopardy because of what's going down and what they can't get back up.
You assert that this is the "older" style of thinking, which shows a lack of understanding of your computer programming history. What you advocate is "cowboy" thinking, and it predates what I'm discussing by quite a bit. The entire 1970s was basically run this way, and it shows in those handful of remaining technologies that are still with us, and their tendency to just pound through problems as if errors are an admission of failure.
I think you're having a reaction to the higher risk profile that comes with breaking in production. That's understandable, but wrongheaded. Plenty of companies are very successful with the mindset as I have described.
Netflix runs their Simian Army in production, for example. They sometimes pull infrastructure pieces offline intentionally to test their resilience. A good production infrastructure is incredibly resilient.
I am on mobile, so it's hard to provide links, but I'm referring to the "cost of bug in production" chart that's one of the more famous studies, from a citations perspective. Completely fabricated research.
Wow, this is pretty sophisticated and commendable. I'd love to have the resources to do something similar. In contrast to 'standard' application development, automated testing is really rare in the non-AAA games industry, at least in terms of logic/active gameplay testing. It's pure luxury; you can only do it if you can afford it. In a project-based work-for-hire game shop this is an almost unthinkable thing to do, because you can't sell it to your contractors/customers. They just won't pay for effort you don't put directly into the game. The only thing you can do is to develop your own automated testing framework over time and over projects, which is a tedious thing to do because you cannot really focus on it (because it's not a first-class citizen in your project schedule).
I hope it started as a much more humble tool that kept growing in size as the company did. Each release it gets a bit more useful, until you hit that critical mass point where it's a pillar of development.
On the other hand, just because you have a great testing framework, it doesn't mean you should launch new game features/changes left and right.
(LoL is structured around roughly yearly 'seasons' for competitive play and for ranked play rewards. This season they have released a far greater amount of game mechanic changes, which make it difficult to plan competitive strategy and difficult for regular players to keep up.)
Actually, the launcher was replaced under 2 years ago. The matchmaking client has entered alpha, and is accompanied by rewrites of a lot of Riot systems to help it all work together.
Nice writeup! Sounds pretty similar to the way serious web apps are being tested (using Selenium & co.). I find it interesting that they built their own testing system though - couldn't they have used some existing framework?
A third party solution would be as hard as, or harder than, developing an in-house solution. The challenge is setting up the interactions between all gameplay elements- it's going to be very unique to each game.
The amount you'd need to abstract to make it reusable for other developers would make it nearly useless.
Wow this is the first time I've seen automation testing applied to a video game. 2 things struck me as really great from this:
- Staging area for their tests. So many times I've lost confidence in our test suite because of a flaky test, and we then delete that test forever. Adding tests to a staging area to verify that they are stable is a great compromise.
- Each test is a class, with 'setup', 'execute', and 'verify', which makes the test a lot easier to read and refactor.
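Both points can be sketched in a few lines of Python. This is only an illustration of the shape being described, not Riot's actual framework: `FakeGame`, `MinionKillGrantsGold`, `run`, `ready_to_promote`, the 20 gold reward, and the 50-pass threshold are all hypothetical.

```python
class FakeGame:
    """Stand-in for a running game; a real test would drive the game itself."""

    def __init__(self):
        self.gold = 0

    def kill_minion(self):
        self.gold += 20  # made-up reward value


class MinionKillGrantsGold:
    """One test per class, split into setup / execute / verify phases."""

    def setup(self):
        self.game = FakeGame()
        self.gold_before = self.game.gold

    def execute(self):
        self.game.kill_minion()

    def verify(self):
        gained = self.game.gold - self.gold_before
        assert gained == 20, f"expected 20 gold for a minion kill, got {gained}"


def run(test_cls):
    """Run the three phases in order; return True on pass, False on failure."""
    test = test_cls()
    try:
        test.setup()
        test.execute()
        test.verify()
    except AssertionError as err:
        print(f"{test_cls.__name__} FAILED: {err}")
        return False
    return True


def ready_to_promote(staging_results, required_passes=50):
    """A staged test graduates only after an unbroken run of recent passes."""
    recent = staging_results[-required_passes:]
    return len(recent) == required_passes and all(recent)


# Run the new test repeatedly in "staging" before trusting it in the main suite.
staging_results = [run(MinionKillGrantsGold) for _ in range(50)]
print(ready_to_promote(staging_results))
```

Keeping the phases separate means a failure points at a specific step, and the promotion rule keeps a flaky test from ever reaching the suite people rely on.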
A release cycle of two weeks looks like a miracle to an outsider like me. Good to see what's going on internally. I loved the video showing automated champion moves, show me more! Also,
> In Wood 5 we don't use wards anyway, so I see no problem with this critical failure
100% agreed. There's no point in changing to lens because nobody buys a ward!
Are these all the automated tests they run? The article doesn't seem to mention unit tests for example. Do they write and run unit tests? The test class example seems to assert a bunch of stuff that might be easier (and cleaner) to test in a set of unit tests.
Update: In particular this is weird:
> "Tests make use of remote procedure call (RPC) endpoints exposed on the client and the server in order to issue commands and monitor game state. For the most part, tests consist of a fairly linear set of instructions and queries—existing tests cover everything from champion abilities to vision rules to the expected rewards for a minion kill. "
They test "rules for expected rewards" with an out-of-process Python test program that connects to the client and server via RPC. That seems like an unnecessarily complex way to test specific game rules.
Is this a symptom of a separate QA team and no developer-written unit-tests?
No, it's because game engines don't unit test well. There is too much I/O, latency dependent behavior, and shared mutable state to make it feasible outside of some math and low level format or protocol tests. Many game bugs are purely data bugs and result from a misconfiguration or a malformed asset. Manual testing of the result, following a checklist and creatively hammering on the system, is thus the default method.
What Riot is using is a form of integration test where manual user input is emulated to produce a result data set. What you aren't seeing in the test code is "everything else" that was needed to set up a running game state. This technique makes it easy for QA to jump in and see what's happening visually when a failure occurs, eliminating the need to slave away at a checklist.
It's useful as a framework for testing all game interactions. When you think about the amount of unique abilities for the hundred(?) unique characters and how they interact against each other, the test matrix that results is massive.
Additionally, you want to use production server/client code as much as possible, while getting through the tests as quickly as possible (skip front end flow, matchmaking, etc.).
Using client/server endpoint calls is great because it allows these integration tests to be a list of instructions (the same instructions used by production code) to create a very deep test suite.
Integration testing is critical for games, where you have so many independent units interacting and changing each other's states.
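As a rough sketch of what "a list of instructions" against an endpoint can look like, here is a toy version in Python. The endpoint is an in-process fake, and the command names (`spawn_unit`, `deal_damage`, `get_health`) and state layout are invented for illustration; the real RPC interface isn't described in enough detail here to reproduce.

```python
class FakeRpcEndpoint:
    """In-process stand-in for an RPC endpoint exposed by a game server.
    Command names and state layout are assumptions, not a real API."""

    def __init__(self):
        self.state = {"units": {}}

    def call(self, command, **params):
        units = self.state["units"]
        if command == "spawn_unit":
            units[params["name"]] = {"health": params["health"]}
        elif command == "deal_damage":
            unit = units[params["target"]]
            unit["health"] = max(0, unit["health"] - params["amount"])
        elif command == "get_health":
            return units[params["name"]]["health"]
        else:
            raise ValueError(f"unknown command: {command}")
        return None


# The test itself is a linear script: issue commands, then query game state.
server = FakeRpcEndpoint()
server.call("spawn_unit", name="target_dummy", health=100)
server.call("deal_damage", target="target_dummy", amount=30)
remaining = server.call("get_health", name="target_dummy")
assert remaining == 70, f"expected 70 health after 30 damage, got {remaining}"
```

Because the script only talks to the endpoint, the same instructions could in principle be pointed at a real server build without changing the test body.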
>> When you think about the amount of unique abilities for the hundred(?) unique characters and how they interact against eachother, the test matrix that results is massive.
Definitely. They're at 130 champions right now. Each with four spell cast abilities, as well as a passive ability, and unique auto attack mechanics on some of them. There are indeed many interactions with specific abilities between champions. Add in interaction with the map and terrain itself, like unit collision and NPC minions and monsters. Then deal with the 154 items (the current count on the main map) players can add to their champion, many of which also interact with abilities, and some of which introduce additional abilities.
It must be both a nightmare and yet very interesting to handle creating a new champion for the game. Once they get past the step of even deciding how the new champion's abilities should interact with other champions, they then need to code it all and manage to verify that everything works according to expectations. Every once in awhile, there are still bugs with how one ability interacts with another. You can plan it all out, but it has got to be easy to miss something. Too many interactions! :)
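For a sense of scale, here is a back-of-the-envelope count using only the numbers quoted above (130 champions, four castable abilities plus a passive each, 154 items). It is a deliberately crude lower bound: it ignores unique auto attacks, terrain, minions, monsters, and any interaction involving three or more things at once.

```python
from math import comb

champions = 130
abilities_per_champion = 5  # four castable abilities plus one passive
items = 154

# Pairs of distinct champions, times every ability of one against every
# ability of the other.
champion_pairs = comb(champions, 2)
ability_pairs = champion_pairs * abilities_per_champion ** 2

# Every single ability checked against every single item.
ability_item_pairs = champions * abilities_per_champion * items

print(champion_pairs)      # 8385
print(ability_pairs)       # 209625
print(ability_item_pairs)  # 100100
```

Even this simplified count lands in the hundreds of thousands of pairings, far more than anyone could check by hand.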
That's why the BVS they've built becomes so important. The number of test cases you have to satisfy goes beyond the human mind's ability to hold.
But if you have a farm that tests v1 of your new champ's abilities against all others, you know exactly where the 20% of problems will lie. This is where designers earn their paycheck: two abilities that won't resolve by design, and where the programmer puts 80% of their work.
With LoL being the most popular video game in the world right now, it's also the most relevant game of all time. How they build and test it is fascinating.
I've played this game for 7+ years now and as a programmer love to hear about all this. A Rioter said a couple years ago that they used Erlang[0] and as a longtime Erlang (and now Elixir) admirer and tinkerer that was great to know my favorite game uses it.
I'd love to hear more about automated stress testing and LoL's release schedule. Do they have beta releases or code freeze before the two week release dates?
They have a public beta environment that is set up for feature testing, rather than stress testing. The PBE does see code rollbacks (if features are incomplete) before releases, so I believe the game logic on the PBE represents basically what will be moved to the live server.
If using an svn/p4 source control method, they probably branch mainline into a release branch, test, then branch to a staging, test against production data, etc, all the while fixing any necessary bugs. This allows development work to continue on mainline while keeping the release testing continuous and stable.
Oh man with that kind of control of the game they could /certainly/ create some kind of sandbox mode.
The APIs that they have for testing alone are pretty incredible; I'd love to be able to play in a mode where you could change level on-demand, create dummy enemies, etc.
Not to say that DotA wouldn't benefit from more rigorous testing, but in Valve's defense I'd say that the complexity of interactions in DotA is far beyond LoL. Some of the bugs you're seeing might only arise in extremely rare situations.
DotA is balanced with creativity in mind. When an exploitative strategy is found, the game is not changed to remove it; instead, through some combination of tweaks and modifications, often to seemingly unrelated aspects, it's made into something that's only situationally viable.
The reason the game is so deep/interesting is because this philosophy has been in place for many (10+) years now, resulting in a game that's much more complex than pretty much anything else out there. DotA 2 retains edge case behavior that was present in the original Warcraft 3 mod. Things that were once engine limitations are now essential parts of gameplay. Preserving these weird interactions instead of attempting to make the game more regular/comprehensible is one of the things that allows such a high level of creativity in competitive play. There's always a way you can outplay your opponent that's clever. In go parlance: there's always new tesuji to be found.
All that being said I think Riot's efforts here are quite laudable and DotA could certainly find benefit from more rigorous testing.
This is a silly comment. A cursory search of "League of Legends Bugs" yields similar results: lots of bugs. In fact, the Mac client was rendered completely unplayable less than a month ago, though that may have been fixed by now. Valve employs people who build automated testing (and who may very well have built a system used by Dota 2 devs), so I think your last statement could be incorrect, but either way it is speculation, unless you have a source for that.
Maybe we extend the benefit of the doubt and perceive the comment as a reminder that having code that passes all the tests doesn't necessarily equate to having a good product?
It's obviously a stretch as that guy's username is the name of a DotA hero, but hey, rose colored glasses are fun sometimes.
They'd soon be drowning in individual tests. And it wouldn't necessarily fit to separate them, considering all those 3 have to be tested on each skill usage. One shot has to obey those 3 rules.
Making each an individual test would triple the test time, pollute the test codebase and not bring any advantage.
I think it is true. I mean, tripling the number of test cases when you already have 5500 of them is going to slow down your dev process in a significant way.
The advantage in splitting up the assertions is that you get more visibility. If one test fails and not the other two, that's extra information you can use to debug that you wouldn't have if all 3 were in one test.
If that test fails, the dev will just run it locally with a debugger attached. The cost of extra time in the test for the normal passing case isn't worth it.
Right, but consider if all three test cases fail. Then devs (who didn't write those tests) need to determine what those three test cases are and if they're all the same issue or not, which takes time.
Ideally, if just one assert fails, it will produce an error message/logs with enough info to tell what went wrong. Having tests produce great error messages is way more valuable than arbitrary rules about assertion density.
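One middle ground between "one assertion per test" and "stop at the first failed assert" is to evaluate every check after a single expensive setup and report all failures with descriptive messages. A minimal sketch, where `collect_failures`, the `result` values, and the three checks are hypothetical stand-ins for whatever the three rules actually are:

```python
def collect_failures(checks):
    """Evaluate every (description, passed, detail) check and return a
    message for each one that failed, instead of stopping at the first."""
    return [f"{desc}: {detail}" for desc, passed, detail in checks if not passed]


# Hypothetical outcome of casting one skill shot after a single costly setup.
result = {"damage": 80, "mana_spent": 50, "cooldown": 0.0}

failures = collect_failures([
    ("damage dealt", result["damage"] == 80,
     f"got {result['damage']}, expected 80"),
    ("mana spent", result["mana_spent"] == 50,
     f"got {result['mana_spent']}, expected 50"),
    ("cooldown started", result["cooldown"] > 0,
     f"got {result['cooldown']}, expected > 0"),
])

for failure in failures:
    print(failure)  # cooldown started: got 0.0, expected > 0
```

In a real test the last step would be something like `assert not failures, "; ".join(failures)`, so one run pays the setup cost once and still shows which of the rules broke.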
I tend to agree, but sometimes in these kinds of instrumented test environments the setup cost of each test case might be very high, so maybe the team found it reasonable to group all the assertions into a single test (especially given that the suite is already taking 1-2 hours to give feedback to developers).
I am well aware of that and I agree on being pragmatic, but more often than not, the reason for multiple assertions per test is developer laziness and/or unawareness of the benefits of isolating failures by having just one assertion per test.
Seems to me you actually want all three assertions checked in the same run. Otherwise you could get three runs where only the first assertion was true in the first run; only the second assertion succeeding in the second run with the others failing; etc.
I bet the tests aren't pixel-by-pixel frame-by-frame reproducible (they are talking about a one week staging period for tests to prove themselves, so surely there's some non-deterministic wiggle room in remote controlling the full graphical client), so you really want to cover all assertions together for each run.
They should be tested independently, each with a completely reproducible setup that does only what is required for the assertion that will be made. I'm not sure what's hard to understand here... Are you trying to say that they need to be run together? Because it doesn't seem like that to me.