I’m Kolton Andrus, co-founder of Gremlin, where we help companies build more reliable software and prepare for the moments that matter with Chaos Engineering. In today's world, where systems and people alike are becoming more distributed, it's more important than ever that we proactively test how we respond to failure.
In the age of cloud and microservices, no single engineer or engineering team can truly comprehend their complex systems without doing some proactive testing. We believe that Chaos Engineering is the best way to understand how our systems behave. I began doing Chaos Engineering a decade ago at Amazon to improve the performance of their retail website. Then I joined Netflix and built the second, more sophisticated generation of their fault injection tooling.
Amazon and Netflix led the way and many companies want to emulate them, but their tools are specific to their problems—problems you may not have and certainly don’t want to inherit. I started Gremlin to make Chaos Engineering accessible to everyone. To do this, we’ve built a platform that makes it safe and easy to inject failure into your systems and applications, in order to learn more about them.
Happy to answer your questions about Chaos Engineering, incident response, SRE, and building more reliable software.
Submit your questions below and tune in on Tuesday, April 14th from 9am to 10am PDT (12pm–1pm EDT) when I’ll be answering questions live.
Talk soon,
Kolton
P.S. Atlassian is an official partner of Failover Conf—a free, online conference focused on building reliable systems, hosted by Gremlin on April 21. You can find more information and register for free here.
Thanks for doing this Kolton!
I have a question. When injecting chaos into a system, how do you gauge what's the right amount of chaos? I'd love to hear if you have any experiences related to this.
When first getting started with Chaos Engineering, the main concern is mitigating risk. How do I keep my experimentation from causing the very issue I’m trying to prevent? First, I recommend you limit the blast radius of your experiment. Start small, in staging, with the smallest experiment that will teach you something. Confirm that the system responds in the way you expect and that you are able to detect the failure. As you gain trust and confidence in your systems and processes, expand the blast radius to include more hosts, traffic, or environments. At small scale you’re testing for correctness; at large scale you’re testing how the system responds under duress.
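To make that concrete, here's a minimal sketch in plain Python (not Gremlin's API; the decorator, the knobs, and the 1% rate are all hypothetical) of gating a latency injection behind an explicit blast-radius setting:

```python
import random
import time
from functools import wraps

# Hypothetical knobs: start tiny in staging, widen deliberately.
EXPERIMENT_ENABLED = True     # kill switch: flip to False to halt instantly
BLAST_RADIUS_PERCENT = 1.0    # percentage of calls affected
INJECTED_LATENCY_SECS = 0.3   # delay added to affected calls

def inject_latency(func):
    """Delay a small, configurable percentage of calls to mimic a slow dependency."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if EXPERIMENT_ENABLED and random.uniform(0, 100) < BLAST_RADIUS_PERCENT:
            time.sleep(INJECTED_LATENCY_SECS)
        return func(*args, **kwargs)
    return wrapper

@inject_latency
def fetch_recommendations(user_id):
    # Stand-in for a real downstream call.
    return ["item-1", "item-2"]
```

The properties to notice mirror the advice above: the impact is bounded (1% of calls), observable (the added latency should show up on your dashboards), and trivially reversible via the kill switch.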
Hi Kolton, thanks for joining us. I have a couple of questions.
1. What is your advice to companies or people who generally say, "We are not ready to run Chaos Experiments in Production"?
2. Any tips for improving monitoring, alerting, and incident management processes by running Chaos experiments?
When companies tell me they’re not ready for Chaos Experiments in production, it often stems from a fear of uncertainty—they don’t know what will happen in production because they’ve never done Chaos Experiments in any environment. Just like with code, starting Chaos Engineering in a development environment, then moving to staging will prepare you for what will happen in production and allow you to take less risk. And there’s a lot that you can learn about your applications by testing in development and staging.
But ultimately you need to graduate to production. In the same way that code is never truly deployed until it’s in production, Chaos Experiments are never really valid until they’ve been run in production because no two environments are ever the same. Your second question is a great example of this: your monitoring, alerting, and incident management processes will be drastically different in production than in staging.
When it comes to observability and Chaos, I like to think of it as boosting signals and reducing noise. Consider what sort of signals (dashboards, alerts, etc.) would give you more visibility into the Chaos experiment or scenario that you’re running. Equally important, look for places to reduce noise: evaluate old monitors that are no longer providing useful information, and test alerts to ensure they trigger on the right criteria (e.g. degradation of application performance rather than just increased CPU) and go to the appropriate people.
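As a sketch of what "the right criteria" can look like (hypothetical thresholds and names, not any particular monitoring product's API), alert on the symptom users feel, such as p99 latency breaching your SLO, rather than on a cause like CPU:

```python
import statistics

P99_SLO_SECS = 0.5  # hypothetical service-level objective for p99 latency

def p99(latencies_secs):
    """99th-percentile request latency from a list of timings."""
    return statistics.quantiles(latencies_secs, n=100)[98]

def should_page(latencies_secs, cpu_percent):
    # Symptom-based: page when users actually experience degradation.
    # Cause-based signals like CPU are context, not paging criteria;
    # high CPU with healthy latency usually means the system is coping.
    return p99(latencies_secs) > P99_SLO_SECS
```

During a CPU Chaos experiment, an alert like this should stay quiet as long as latency holds, which is exactly the kind of behavior a Game Day lets you verify.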
Chaos experiments or fire drills are a great way to improve incident management. A key aspect of this is practicing them regularly. Code and infrastructure changes happen frequently, so it’s easy for documentation and runbooks to get out of date. Regular fire drills give you an opportunity to validate them and update them if necessary.
Hi Kolton:
I have some questions related to Security Chaos Engineering:
1. Is there a list of common attacks to consider when experimenting with the security of systems?
2. Should we have any special considerations when the attacks involve security rather than infrastructure?
Thanks in advance!
Reliability is a core pillar of security testing and offensive security testers (Red Teams) will often try to exploit reliability failures to penetrate systems. Although penetration testing and Chaos experiments share some parallels, they have different goals. Ultimately, Chaos Engineering is focused on making systems more reliable in any situation, regardless of whether it’s a real event or one simulated by a hacker trying to find a weakness in your systems.
Hi Kolton, thanks for joining us :)
How would a team or organization know if they need to do chaos engineering? Are there any prerequisites for teams looking to get their chaos engineering journey started?
Everyone needs some Chaos Engineering, but it's important to define how much is the right amount for your organization depending on your goals.
I'll start by backing up the first comment: "Everyone needs some Chaos Engineering." Across nearly every industry, we're moving to distributed systems, migrating to the cloud, and adopting microservices. Each of these helps us innovate faster, but increases the challenge of keeping our systems reliable. Traditional testing methods aren't enough, and that's where Chaos Engineering comes in. Chaos Engineering allows us to safely test the behavior of complex systems and build confidence in their reliability.
Ok, so how much Chaos Engineering is enough for your organization?
Spending a couple of hours each month will likely save you 10x that time in incidents and toil over the course of the year. You want to start by building a foundational SRE program. Create a formal Incident Management process, and validate it by running a fire drill. Validate your monitoring and alerting with some simple Chaos Experiments to make sure you're not missing alerts and that your mean time to detect (MTTD) and mean time to resolve (MTTR) are as low as possible. Basically: spend an hour each week reviewing incidents as a team, run monthly Game Days, and run quarterly fire drills.
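If you want to track whether those drills are paying off, MTTD and MTTR are just averages over your incident records. A minimal sketch, with hypothetical field names:

```python
from datetime import datetime

# Hypothetical incident records: when the fault started, when it was
# detected (alert fired or the drill team noticed), and when it was resolved.
incidents = [
    {"started": "2020-03-02T10:00", "detected": "2020-03-02T10:07", "resolved": "2020-03-02T10:42"},
    {"started": "2020-03-19T14:30", "detected": "2020-03-19T14:33", "resolved": "2020-03-19T15:01"},
]

def minutes_between(start, end):
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mttd = sum(minutes_between(i["started"], i["detected"]) for i in incidents) / len(incidents)
mttr = sum(minutes_between(i["started"], i["resolved"]) for i in incidents) / len(incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Watching these numbers shrink drill over drill is a simple, honest measure of progress.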
From here, you can invest more time to further increase your availability. Reduce the noise in your monitoring and alerting by regularly tuning thresholds. Roll out new features using canary deploys to a small number of users. Begin failure testing with the services feeling the most pain from incidents, or with the most critical ones. Test delay and failure between your internal services. Add in regional redundancy. By having each team invest an hour a week, you can keep your speed of innovation without impacting your uptime.
How much of your approach to Chaos Engineering relies on machine learning?
This is where the future lies! How do we apply machine learning to Chaos Engineering to learn from past incidents, or uncover unknown unknowns? I was able to collaborate with academia to test some of these early ideas while at Netflix: Automating Failure Testing Research at Internet Scale. This is where we plan to push the boundaries of Chaos Engineering over the coming months and years. Can we safely explore and validate our systems in an automated fashion? That is the holy grail.
Hello Kolton, I have one more question. What are the core technical competencies that one should have control over to practice Chaos Engineering effectively using Gremlin?
Every engineer needs basic operational skills: running system commands, understanding resource utilization, debugging network traffic, etc. Leveraging your monitoring and logging to debug issues as they arise is critical. As with exercise, if one does not practice these skills, they will atrophy. Chaos Engineering is a good forcing function to teach new team members the skills they need to be on-call and feel comfortable operating their systems. We’ve seen teams’ confidence grow drastically by running fire drills and having the opportunity to ask questions and learn in advance. As you grow into your Site Reliability Engineering skill set, you will continually find newer and more useful things to test in your system.
Is there still a strict no feeding the Gremlins after midnight policy in place?
There are many types of gremlins. Some are larger, some are smaller. Some will cause you significant problems if you feed them after midnight or expose them to water. All gremlins are creatures of chaos, and wherever they appear you can expect a flurry of activity, usually of a destructive nature. That isn’t always the result of malicious intent, but just an inherent need to break things on purpose. We at Gremlin have found a way to leverage these gremlins toward positive uses. In fact, because we feed them on time, our gremlins are happy to unleash the right amount of chaos on demand to help strengthen our systems.
With Gremlin, we need to install Gremlin agents on hosts in order to inject failure. Is there any plan to build a feature that connects to a host only when introducing failure and disconnects after the experiment? It might be a safer way to connect to hosts than having Gremlin agents running on them.
I love seeing engineers considering safety and security when implementing solutions. An agent is required to realistically simulate the failures that can and do occur on real systems. The Gremlin agent works similarly to how you describe: it polls the Gremlin platform and only activates when an attack is run, then deactivates when the attack is completed. This deactivated-by-default design ensures that the agent will never begin an attack without your explicit command and will halt attacks should any external failure occur (e.g. being cut off from the Gremlin platform). From a security perspective, an agent allows limited remote access without opening unneeded avenues for attack (vs connecting directly to a host).
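To illustrate the shape of that design, here is a simplified sketch of the polling pattern described above. To be clear, this is not Gremlin's actual agent or API; the function names and command fields are all hypothetical:

```python
import time

POLL_INTERVAL_SECS = 5
active_attack = None

def poll_control_plane():
    """Stub standing in for an authenticated outbound call to the control plane.
    Returns an attack command dict, or None when there is no work."""
    return None

def halt_attack(attack):
    print(f"halting attack {attack['id']}")

while True:
    try:
        command = poll_control_plane()
    except ConnectionError:
        # Fail safe: if we are cut off from the control plane,
        # stop any running attack rather than leave it unattended.
        if active_attack:
            halt_attack(active_attack)
            active_attack = None
    else:
        if command and command.get("action") == "start":
            active_attack = command       # activate only on an explicit command
        elif active_attack and command and command.get("action") == "stop":
            halt_attack(active_attack)    # deactivate when the attack is done
            active_attack = None
        # No command means stay idle: the agent never starts anything itself.
    time.sleep(POLL_INTERVAL_SECS)
```

Because all connections are outbound polls, the host never has to expose an inbound port, which is the security property the question is getting at.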
Thank you everyone for your time and questions! I love sharing what I've learned along the way and helping engineers sleep peacefully through the night. If you have further questions that weren’t answered, or want to dive deeper, come find me in our public Chaos Engineering Slack. And if you’re new and getting started with Chaos Engineering, check out Gremlin Free!
Hey Kolton! Thanks for doing this.
Seriously: how did you get the gremlin.com URL? There has to be a story there.