Preventing the Next Capital One Cloud Breach

Configuration mistakes. This is not a new issue. IT and Security Operations teams have struggled with managing configurations for as long as they have existed. As organizations start down the cloud path, the problem becomes more acute: there are simply too many opportunities to make errors in a massively complex cloud environment. Over the years I’ve met some of the security team at Capital One, and they are outstanding professionals who fought every day to prevent what happened. In fact, Capital One is widely known to have some of the most mature and sophisticated cloud operational capabilities, which just underscores how hard a problem configuration management is at scale. This time the problem was an SSRF vulnerability that exposed the AWS EC2 metadata service and allowed the attacker to extract S3 access keys, but honestly that’s one of a couple dozen configuration issues that could have exposed data.
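The metadata service matters here because an SSRF bug lets an attacker relay simple, token-less HTTP requests from inside the instance. One mitigation is requiring session tokens for every metadata request (AWS's IMDSv2 option), which a typical SSRF relay cannot obtain. As a minimal sketch of auditing for that, the function below walks a response shaped like boto3's `describe_instances` output and flags instances still accepting token-less requests; the helper name `audit_imds` is ours, not an AWS API.

```python
# Sketch: flag EC2 instances whose metadata service still accepts
# token-less (IMDSv1-style) requests, the request style an SSRF can forge.
# Input mirrors the boto3 EC2 describe_instances response shape.

def audit_imds(describe_instances_response):
    """Return IDs of instances not enforcing metadata session tokens."""
    exposed = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            # 'required' means every metadata request needs a session
            # token, which blocks this class of SSRF credential theft.
            if options.get("HttpTokens") != "required":
                exposed.append(instance["InstanceId"])
    return exposed


sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "MetadataOptions": {"HttpTokens": "optional"}},
            {"InstanceId": "i-0bbb", "MetadataOptions": {"HttpTokens": "required"}},
        ]}
    ]
}
print(audit_imds(sample))  # ['i-0aaa']
```

In practice you would feed this the live `describe_instances` output for every account and region, which is exactly the kind of sweep that needs automation.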

The Idea of Prevention

This idea of ‘prevention’ of an ‘attack’ like this is a little weird for me. We always strive to prevent unforced errors, but unintended configuration mistakes are going to happen. It reminds me of the networking discussion a couple decades ago around uptime. When the network would lose a box, outages would happen and, for a while, the conversation was about how we could identify failing hardware before the box went down and quickly swap it out. But eventually another concept won out: resiliency. It made more sense to architect for failure, rather than try desperately to respond quickly enough to minimize every outage.

In some ways resiliency meant giving up on the idea that we could keep our single critical paths open 100% of the time.  Instead we’d create multiple paths, and the odds of losing them all at the same time were dramatically diminished. We solved an intractable problem with a different approach.

I feel the same way about preventing the next configuration mistake. Perfection is not achievable. So instead of focusing on preventing all configuration errors, I think the best answer is continuous monitoring and automated response. To be clear, these approaches are not mutually exclusive: you can work to prevent errors while also monitoring for issues and responding accordingly. For security professionals this idea of monitoring and detection is familiar. It’s the automated response piece we’ve been more reluctant to adopt, because of the potential negative consequences (i.e., downtime).

Continuous Monitoring and Automated Response

Here at DisruptOps, our Guardrails are designed to bridge that gap. Monitoring and detection are a must, and we can do both at scale across all your accounts and regions with an integrated user experience. Then it’s about getting comfortable with automation to fix the errors.

Ultimately, security in the cloud at scale is only possible through automation, but trusting full automation isn’t going to happen in a day. You gain comfort by incrementally automating changes. Start by picking a fairly low-risk situation, like rolling back an unauthorized security group change. Worst case, a network path is impacted, and that’s an easy fix. Initially you trigger alerts, which show exactly what changes would happen and when. Once you’ve seen what the tool would do, you can decide to activate automation to run without human intervention.
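To make the security-group example concrete, here is a minimal sketch of the detection half: diff the live ingress rules against an approved baseline and report what an automated rollback would revoke. The rule dicts mirror boto3's `IpPermissions` shape; the function name is illustrative, not a DisruptOps API.

```python
# Sketch: find ingress rules present on a live security group but
# absent from the approved baseline. In alert mode you report them;
# in enforce mode you would revoke them.

def unauthorized_rules(live_rules, baseline_rules):
    """Return live ingress rules that are not in the approved baseline."""
    def key(rule):
        # Normalize a rule to a hashable identity for comparison.
        return (rule["IpProtocol"], rule.get("FromPort"), rule.get("ToPort"),
                tuple(sorted(r["CidrIp"] for r in rule.get("IpRanges", []))))
    approved = {key(r) for r in baseline_rules}
    return [r for r in live_rules if key(r) not in approved]


baseline = [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]
live = baseline + [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]

to_revoke = unauthorized_rules(live, baseline)
# In enforce mode, to_revoke would be passed to EC2's
# revoke_security_group_ingress call; in alert mode, only reported.
print(to_revoke)
```

The alert shows exactly this diff, so when you later flip on enforcement you already know precisely what the automation will do.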

DisruptOps allows you to embrace automation at your own pace. You can alert when there is an issue, and then give your Ops team a window to address the issue. If they don’t get it done, you can automatically enforce the guardrail. Or you could just run in alert mode for a while, until you decide to activate the fix. It’s not an all-or-nothing proposition. We call this customer-controlled automation, and that comfort can make all the difference in the success of your cloud security strategy.
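The escalation path above can be sketched as a tiny decision function: each guardrail runs in one of three modes, and enforcement only fires after the Ops team's grace window expires. The mode names, `Finding` class, and 24-hour default are our illustration of the idea, not DisruptOps' implementation.

```python
# Sketch of "customer-controlled automation": alert-only, alert with a
# grace window before enforcement, or immediate enforcement.

from dataclasses import dataclass

ALERT_ONLY = "alert"
ALERT_THEN_ENFORCE = "alert_then_enforce"
ENFORCE = "enforce"

@dataclass
class Finding:
    issue: str
    age_hours: float  # time since the issue was first detected

def decide(mode, finding, grace_hours=24):
    """Return the action a guardrail takes for a finding."""
    if mode == ALERT_ONLY:
        return "notify"
    if mode == ALERT_THEN_ENFORCE:
        # Give the Ops team a window to fix it before auto-remediation.
        return "notify" if finding.age_hours < grace_hours else "remediate"
    return "remediate"


fresh = Finding("public S3 bucket", age_hours=2)
stale = Finding("public S3 bucket", age_hours=30)
print(decide(ALERT_ONLY, fresh))           # notify
print(decide(ALERT_THEN_ENFORCE, fresh))   # notify (inside the window)
print(decide(ALERT_THEN_ENFORCE, stale))   # remediate
```

Ratcheting a guardrail from `alert` to `alert_then_enforce` to `enforce` is the incremental trust-building the post describes.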

To learn more about what we do, check out some of the Guardrails we’ve already published. We are building more every day, but these guardrails represent mistakes that you no longer have to worry about once you’ve deployed DisruptOps. And if you’re ready, check out our free trial. We can be monitoring your environment, finding issues, and automating at your comfort level in less than 30 minutes.