Every so often, AWS offers a Partner “Game Day” event, built around a fictional solution deployed on Amazon Web Services… Unicorn Rentals.
The idea is to artificially inject failure into what is normally a very reliable platform, build team experience in resolving the torrent of frustrating issues that occur, and keep your service alive and your Unicorn-renting customers happy.
Each team was given a half-built AWS environment. The task was ultimately to deploy (from some deliberately brief and shaky deployment instructions) three microservices, all of which are really just variants of simple transaction-based services.
One is hosted on AWS Elastic Beanstalk, one on Fargate, and another on API Gateway/Lambda; three different flavours of hosting a microservice on AWS (automated, containers, serverless).
There is also a service router, already deployed on EC2 and managed by an Auto Scaling Group. You are tipped off to make sure this is always available, otherwise your score will suffer heavily. Watching it is important, as it’s easy to miss if you are only paying attention to your three published services in the service marketplace.
A dashboard shows the team scores. The score is generated from the number of successful transactions processed by your microservices, with penalties for things like the service router being down.
Play commences, and the race at first is to get those three microservices deployed pronto… There is an interesting dynamic where you can utilise other teams’ services, which are published on a service marketplace along with stats about their latency and error rate. You ultimately benefit from doing this more than the team whose services you use, but there are still some tactical tricks to picking which services you consume.
Then things start failing… through some automation in the account and a couple of rogue AWS employees with root access. Bad things start happening; the experience varies between teams, and the better you do, the more you seem to be attacked.
For the sake of not providing spoilers, I won’t go into detail here. Suffice it to say, don’t underestimate the dastardly ingenuity of the attackers, or even of the automated attacks.
Thanks to the immutable nature of the deployments, sometimes it’s quicker to just redeploy, but even the deployment assets aren’t immune from tampering. Neither is lower-level infrastructure configuration such as route tables, which can impact all three services simultaneously.
Just as we came out of one service event and started creeping up the scoreboard, something else would break. At the same time, we were frantically trying to adjust a DynamoDB table, a directory of the other teams’ microservices we were depending on… and those teams were all having similar issues.
All of us ended up in teams that spent some time at the top of the scoreboard… and even if we weren’t on the team that ultimately won, we still had lots of fun.
What I’d do differently and what worked
Setting up a comms channel really helped. I insisted we did it in our team, and after the inevitable 10-minute debate about which tool to settle on, I just set up Amazon Chime. We used this to orchestrate tasks and to share config and code.
Our team took the three microservices and one “really important router” as a sign: four of us in the team, four logical service points. The benefit of a microservices architecture, after all, is task-specific, manageable services. So we made each person responsible for one, which helped keep us on task when everything was breaking at once.
Those with unbroken services either maintained the directory of other services we were consuming, hardened their own service against the last attack, or looked for ways to reduce our service response time in the marketplace.
Some teams automated the population of the service directory table… I think that was a good call. We felt it needed more tactical play than that, so didn’t bother, but maintaining it manually became a huge distraction in the end.
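For a sense of what that automation might look like, here is a minimal sketch in Python. The table name, field names, and marketplace feed shape are all my own assumptions for illustration, not the actual Game Day resources: the idea is simply to pick, per service type, the lowest-latency endpoint whose error rate is acceptable, then write the winners to the directory table.

```python
# Hypothetical sketch of automating the service directory table.
# All names (table, fields, marketplace feed) are illustrative assumptions.

def pick_best_endpoints(marketplace, max_error_rate=0.05):
    """For each service type, choose the published endpoint with the
    lowest latency whose error rate is under the threshold."""
    best = {}
    for entry in marketplace:
        if entry["error_rate"] > max_error_rate:
            continue  # skip endpoints that are erroring too often
        current = best.get(entry["service"])
        if current is None or entry["latency_ms"] < current["latency_ms"]:
            best[entry["service"]] = entry
    return best


def update_directory(table_name, best):
    """Write the chosen endpoints to DynamoDB (needs AWS credentials)."""
    import boto3  # assumed available in the Game Day environment
    table = boto3.resource("dynamodb").Table(table_name)
    for service, entry in best.items():
        table.put_item(Item={"service": service, "endpoint": entry["endpoint"]})


if __name__ == "__main__":
    marketplace = [
        {"service": "basket", "endpoint": "https://a.example", "latency_ms": 120, "error_rate": 0.01},
        {"service": "basket", "endpoint": "https://b.example", "latency_ms": 80, "error_rate": 0.20},
        {"service": "rental", "endpoint": "https://c.example", "latency_ms": 200, "error_rate": 0.02},
    ]
    print(pick_best_endpoints(marketplace))
```

Run on a schedule (or in a loop), something like this would have freed us from hand-editing the table, at the cost of losing the tactical choice of whose services to consume.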
I quickly built some dashboards in CloudWatch, one for each microservice, plus a couple of very rudimentary alarms. This helped, but it was hard to keep an eye on them and do real work on a single laptop screen… so next time I’d set myself up with dual monitors.
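As a flavour of how rudimentary those alarms were, here is a sketch using boto3’s `put_metric_alarm`. The metric (ALB 5XX counts), namespace, and threshold are assumptions for illustration; the actual Game Day services may expose different metrics:

```python
# Sketch of a rudimentary CloudWatch alarm, one per microservice.
# Metric name, namespace, and threshold are illustrative assumptions.

def error_alarm_definition(service_name, threshold=10):
    """Build the arguments for cloudwatch.put_metric_alarm(): fire when a
    service's 5XX count exceeds `threshold` in a one-minute period."""
    return {
        "AlarmName": f"{service_name}-5xx-errors",
        "Namespace": "AWS/ApplicationELB",        # assumes an ALB in front
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(service_name):
    """Actually create the alarm (needs AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**error_alarm_definition(service_name))
```

Even something this crude beats refreshing consoles by hand, though as noted above, it only pays off if you have screen space to watch it.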
It’d be easy to critique this Game Day with statements such as “the environment doesn’t match anything we’d actually design ourselves” or “the ‘failures’ weren’t realistic”, given that most of them fell into the realm of sabotage rather than genuine failure.
This, I think, misses the point. The “Game” is about the benefits of microservices and how best to organise your team to manage and improve them.
Following the “Single Responsibility Principle” right through to how you develop and manage your microservices at a team level is one of the key benefits over more monolithic architectures. It also became very clear on the day that the closer you move to true serverless and cloud-native components, the easier your microservice is to manage. The Microservice Madness Game Day did a great job of demonstrating both in a frantic, demanding, but ultimately fun-filled day.