
Forget the checkboxes, make your security BLISS


Dorota Parad · 11 min read

When it comes to cybersecurity approaches, there are two main camps. One is rooted in the traditional defense-focused mindset, very common among security professionals. We have to make our internal assets impenetrable to attackers! We have to close every gap they can sneak through, eliminate all vulnerabilities! This mindset leads to a heavy-handed process that is poorly understood by engineering teams. Security becomes a concern for the security team only. It's us vs. them. It doesn't help that the software engineers themselves are often seen as a potential threat, with corresponding, very obvious protections to counter it.

The second camp recognizes that this traditional approach to security is costly. It slows everything down and adds extra steps no one really understands. And for what? Most businesses can get away without thinking about security until their first serious incident - which may never happen, and if it does, maybe not this year. It's a risk that a lot of companies are willing to take. But then those pesky regulators demand that businesses protect their users' data, and insurance companies offer enticing discounts for those who implement certain traditional cybersecurity practices. What ends up happening is some flavor of security theater: we do things that sure look good on paper but don't actually make us more secure. Security becomes a set of checkboxes to file and forget.

Neither camp is right. With the sheer number of attacks carried out on everything that's connected to the internet, and with the resources poured into cybercrime by certain nation states, every software system will get compromised eventually. Ignoring this may pay off in the short term, and keeps paying off when the alternative is a team of security specialists and practices that slow every engineer down. But what if there was a third way, a pragmatic way, that actually improves your security while being easy for every engineer to grasp? And what if that way was so cheap to implement that it changes the traditional risk equation? What if you could make your security BLISS?

This may be disappointing, but it's not the latest, coolest product, nor a service you can pay for and then keep ignoring the topic. BLISS is a framework that helps you think about security in a healthy, pragmatic way so that you don't have to spend a whopping 13% of your IT budget on security with mixed results. It stands for Bulkheads, Levels, Impact, Simplicity, and pit of Success. Let's dive into the details of what this means.

Bulkheads - limiting the blast radius

Bulkheads are a feature of all submarines (well, maybe not Titan): if a section of the ship suffers a catastrophic failure, it gets sealed so that the problem doesn't spread to other sections. We can use the same principle in software security – if a part of our system gets compromised, we should be able to seal that part off or remove it so that the problem is contained. Think of it as limiting the blast radius.

The ability to isolate, turn off, and swap or redeploy a component of a system in the event of an attack could make a difference between a total hard down of production and only a partial loss of functionality. It is also something that costs almost nothing to introduce, as long as it’s been considered when designing the system. Good security starts with architecture and this is a great example. The extent to which this principle can be applied and its effectiveness are going to be constrained by your architecture choices.
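The seal-off capability can be sketched as a small kill-switch registry. This is only an illustrative sketch - the class and component names are hypothetical, and a real system would gate traffic at the load balancer or feature-flag layer rather than in application code:

```python
from dataclasses import dataclass, field


@dataclass
class Bulkheads:
    """Hypothetical kill-switch registry: each component can be sealed
    off independently when it is suspected to be compromised."""
    _sealed: set = field(default_factory=set)

    def seal(self, component: str) -> None:
        self._sealed.add(component)

    def unseal(self, component: str) -> None:
        self._sealed.discard(component)

    def call(self, component: str, handler, *args):
        # Fail fast instead of letting traffic reach a compromised part.
        if component in self._sealed:
            raise RuntimeError(f"{component} is sealed off")
        return handler(*args)


bulkheads = Bulkheads()
bulkheads.seal("payments")  # incident response: isolate one suspect service

# The rest of the system keeps working while "payments" is contained.
result = bulkheads.call("search", lambda q: f"results for {q}", "cat memes")
```

The point of the sketch is the shape of the interface: sealing is a one-line operation performed per component, so the blast radius of an incident is whatever sits behind that one switch, not the whole system.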

In the case of microservice architectures, where each service is deployed independently and services are communicating solely through REST APIs, the service boundaries are our natural bulkheads. The risk of the whole system going down when one microservice gets compromised is very small and the ability to redeploy the service quickly further minimizes the risk. In the case of a monolith, we don’t have this capability available, so we need to get more creative with placing our bulkheads. Things like deploying separate instances for different customers or using containers would be good examples.

Regardless of architecture choices, it's worth adding isolation to the infrastructure to limit the impact when some part of it inevitably gets compromised. Consider having separate git repositories for different projects, and separate cloud accounts for deploying different environments and modules, each with separate access controls authorizing only the people involved in the respective work. That sort of separation can be added without great effort, even for legacy software.

It's important to keep in mind that adding bulkheads is just one tool out of many for improving security. As such, it makes no sense to overdo it. If your software is already serverless, meaning each function runs in isolation by definition, then deploying a separate instance for each customer would be overkill, at least for the purpose of limiting the blast radius of security incidents. If you have a big-ball-of-mud-style monolith where everything is stored in a single git repository and has to be deployed all together, trying to add bulkheads may not be the most prudent first step.

Levels - protection proportionate to the risk

Protecting data and operations necessarily means adding little obstacles – extra login factors or additional steps to perform to either prove who the user is or reduce the exposure. However, not all data is equally sensitive and not all operations equally critical. It makes no sense to have the same level of protection applied to all of them because that ends up in a compromise that is either too strict for regular, daily operations or too lax for the sensitive ones. Instead, it’s more practical to introduce different levels of protection proportionate to the security risk involved.

Having multiple levels in your protection strategy enables fine-tuning the approach to each situation. That way, the people doing the work and operating the systems don't get overburdened with security where it's not needed, and are more likely to pay attention in areas where targeted security measures indicate heightened risk. The last thing we want is people getting desensitized to security and becoming lax or complacent. Most successful attacks happen due to human error, after all.

When defining security measures, it makes sense to look at your whole scope holistically as well as from the user or operator perspective. Consider groups of actions and the data involved - are there sets that get performed frequently together? What is the sensitivity of the underlying data? Are there distinct steps that can be treated separately? What is the security risk associated with each? Once you have that granular view, consider:

  • Which groupings can get away with just a simple corporate login (hint: that should be the majority of them)
  • Which need access restricted to only specific people
  • Which require a second factor
  • Which may need a second set of credentials or manual approval

It may be tempting to put a single, reinforced gate in front of everything and call it a day; resist that temptation. If an operation consists of multiple steps, it's rare that the first step already requires the strictest protection, so consider layering different means throughout the process.
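Such a layered policy can be sketched as a simple table mapping operations to the level of protection they require. The operation names and levels below are hypothetical, chosen only to illustrate the shape of the idea:

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical protection levels, strictest last."""
    SSO = 1            # plain corporate login
    SECOND_FACTOR = 2  # SSO plus an MFA prompt
    APPROVAL = 3       # SSO plus MFA plus manual approval


# Illustrative policy: most day-to-day operations sit at the lowest
# level, and strictness increases progressively along the workflow.
POLICY = {
    "view_source": Level.SSO,
    "commit_code": Level.SSO,
    "merge_commits": Level.SECOND_FACTOR,
    "deploy_production": Level.APPROVAL,
}


def required_level(operation: str) -> Level:
    # Anything not explicitly classified defaults to the strictest level,
    # so forgetting to classify an operation fails safe.
    return POLICY.get(operation, Level.APPROVAL)
```

Note the fail-safe default: an unclassified operation gets the strictest treatment, which nudges whoever adds a new operation to think about its actual risk rather than inheriting a lax setting by accident.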

Let's use a developer machine as an example: treating the whole machine as the most precious thing deserving ultimate protection would be a mistake. Instead, it makes sense to protect individual data types and operations. After all, accessing cat memes on the internet is not the same as accessing your company's compensation data, even though both may use the browser and HTTPS as the protocol. One of them likely happens every day and should be possible to perform freely. The other happens rarely and should require strict identity checks and access control behind the scenes. Likewise, viewing source code is not the same as committing code – which is not the same as merging the commits – which is not the same as deploying those commits to production. Each of these steps can have its own protection, with the strictness and corresponding obstacles increasing progressively at each step.

Notice how the step or action with the highest risk in these examples - accessing salaries and deploying code - is also the one that happens less frequently than the others? That's no coincidence. In normal business operations, the vast majority of things we do aren't inherently risky, so trying to protect all of them is a waste.

Impact - minimizing the consequences of an incident

The traditional mindset among security professionals is to protect, with a heavy focus on preventing attacks. This was a very good approach in the early days of the internet, when there was still a lot of low-hanging fruit in securing our software and infrastructure. Nowadays, many of the measures intended to reduce the likelihood of a breach are built into the technology we're using and are so ubiquitous that we don't even pay attention to them: cloud technologies, Single Sign-On (SSO), and deployment automation, to name a few. Attempting to further strengthen prevention yields diminishing returns. It makes more sense to focus on reducing the impact of a successful attack when it inevitably happens.

Think of what happens when something gets compromised. Are there bulkheads in place that could contain the incident to a relatively small scope? What can the attacker do with their access - are there more levels of protection further down the line before they get to the juicy bits? What's the worst that can happen? How much will it cost the company or your team, including the cleanup? What can be done to further reduce this cost? There are a lot of techniques that help limit the impact of a security incident. Beyond the bulkheads and levels mentioned above, there is encryption – in transit and at rest – granular permissions, and restricting who can grant other people access, and when. Let's not ignore measures typically used to ensure high reliability, such as backups, redundancies, and rollback strategies - they also serve to minimize the damage done in an attack.

Let's take an example of a common threat: a malicious actor getting their hands on one of our engineer's cloud account credentials. What's the worst that can happen here? If we applied the principle of bulkheads, then those credentials are limited in scope to a single cloud account, which may or may not host our production environment. Even if it does, if we utilize modern software development practices, the access would be limited to read-only and innocuous configuration changes, since resource creation and deletion would be automated and only permitted as part of the CI/CD pipeline. Let's say that account includes access to a database containing user data: if we applied proper encryption, the data is effectively useless to the attacker.

Another common threat that could have business-ending consequences is a ransomware attack. If we make full use of bulkheads combined with reliability techniques, then our production environment is distributed, meaning only part of our system would be affected. What’s more, we would be able to roll back the deployment that included the malware (if that was the attack vector), restore a database backup in a matter of minutes, or quickly switch to an unaffected alternative region/server.
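The rollback part of this recovery story can be sketched as a versioned snapshot store. This is a toy model with hypothetical names - in practice the "snapshots" would be database backups, immutable deployment artifacts, or infrastructure state, not an in-memory list:

```python
class SnapshotStore:
    """Toy versioned store: every write keeps history, so the state can
    be rolled back past a compromised change in a single step."""

    def __init__(self, initial):
        self._history = [initial]

    def write(self, state) -> None:
        self._history.append(state)

    @property
    def current(self):
        return self._history[-1]

    def rollback(self, steps: int = 1):
        # Discard the suspect versions; older snapshots stay untouched.
        if steps >= len(self._history):
            raise ValueError("cannot roll back past the initial snapshot")
        del self._history[-steps:]
        return self.current


store = SnapshotStore({"release": "v1"})
store.write({"release": "v2"})
store.write({"release": "v2.1-compromised"})
store.rollback()  # drop the tampered release; v2 is serving again
```

The property doing the security work here is that history is append-only until an explicit rollback: a ransomware payload that lands in the newest version cannot reach the older snapshots, so recovery is a cheap, well-rehearsed operation instead of a crisis.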

Trying to simply prevent the incidents in the above examples would be a prohibitively costly endeavor, involving multiple remediations to account for all the different attack vectors. Limiting the impact, on the other hand, is a holistic solution that doesn’t take as much effort or cost, and often comes with additional benefits of increased robustness. This isn’t to say it’s ok to completely ignore preventing the attacks. There may still be places where investing a little in prevention will return outsized benefits. In most cases though, looking at impact first is more pragmatic.

Simplicity - simpler means more secure

The more complex something is, the more errors and accidents we can expect. If a process is difficult to follow, people find workarounds. More tools added to the tool stack mean more vulnerabilities to patch. A complicated security strategy filled with technical jargon is going to be difficult to execute for anyone outside the security team. If a security practice causes friction for the users, it's less likely to be correctly applied. As a rule of thumb – the simpler something is, the easier it is to secure.

Biometric login is going to be more secure than forcing users to type in their passwords, because there's nothing to write down on a sticky note, nothing to reset, and stealing biometric credentials is difficult to pull off. Having to log in once a day is going to be more secure than forcing the user to log in every hour, because it creates fewer points at which the credentials can be sniffed or hijacked. Having a basic CI/CD pipeline is going to be more secure than adding every possible vulnerability scanner, end-to-end test, and manual gate, because there are fewer opportunities for human errors that could be exploited and fewer temptations to just log onto the server and make "just this little fix" directly.

Notice there is some tension between the principle of simplicity and the other principles of the BLISS framework. Adding bulkheads adds potential complexity: we have to deal with multiple parts instead of a single one. On the other hand, which system is more complex - one big thing that does everything, or multiple small ones that each do one thing? The answer depends on the situation. Similarly, multiple levels of protection make things more complicated, because we have multiple strategies to apply in different situations. On the other hand, having those levels helps keep things simple for, hopefully, the majority of day-to-day operations. Simple doesn't mean simplistic. The best way to think about it is in terms of tradeoffs - when there's a choice, the simpler option is better.

Pit of Success - security by default

A pit of success is when it's so easy to do the right, secure thing that it happens almost by accident - people simply fall into it. It should be obvious what the correct thing is; the default option should be the right one. To put it differently, doing the wrong thing should be so difficult that people have to go out of their way to make it happen. Obstacles should be reserved for attackers, or for operations that are very risky and should happen only rarely.

You can’t have security without considering the user experience. This includes developer experience, because software engineers have the power to bring the whole system down and introduce new security vulnerabilities. If maintaining security requires people to take extra steps or makes their daily jobs even a little harder, no one is going to do it. Good security should be invisible, built into the way a company operates.

A very good yet underrated way to create a pit of success in security is automation. Companies rarely recognize all the things we do to make our code deployments smoother and more reliable as improving security. But continuous integration, infrastructure as code, and serverless technologies - just to name a few - all make it difficult to make the mistakes that would result in severe security incidents. Continuous integration means a robust, repeatable process to integrate new code into the codebase without breaking anything along the way, which means a malicious piece of code has to survive multiple testing and review steps to remain undetected - it can't be included by accident. Infrastructure as code means that getting new code into production is fully automated and standardized, which means that tampering with production servers and potentially planting backdoors requires a conscious effort. Going serverless means there is no server to install malware on, nothing for an engineer to log in to and accidentally infect. While it's not enough to simply eliminate the ways in which engineers' negligence could cause a problem, it goes a long way. Social engineering is one of the most common attack vectors, and people are likely to think twice when asked to do something that requires them to go out of their way.

If you want people to follow your security policies, make them easy to read and understand. If you want good password hygiene, don’t force arbitrary complexity requirements or password rotations. If you don’t want your test environments to be an open invitation to a breach, make them short-lived, created and torn down automatically, and scoped to a single pull request.
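The short-lived test environment idea can be sketched with a context manager, so that teardown is structurally impossible to forget. The provisioning calls here are stand-ins (a real version would drive your cloud or CI API); the `events` list exists only to make the lifecycle visible:

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_environment(pull_request: int, events: list):
    """Hypothetical helper: the environment exists only for the duration
    of the `with` block, so it cannot linger as an attack surface."""
    name = f"pr-{pull_request}-env"
    events.append(f"created {name}")  # stand-in for real provisioning
    try:
        yield name
    finally:
        # Teardown runs even if the tests inside the block fail.
        events.append(f"destroyed {name}")


log: list = []
with ephemeral_environment(42, log) as env:
    log.append(f"ran tests in {env}")
```

This is the pit of success in miniature: the secure behavior (tearing the environment down) is not a policy anyone has to remember - it's the only way the code path can exit.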

Wherever there is potential for a data leak, credential theft, or other security incident, consider: can this process be simplified? What is the impact in the worst case, and what can be done to reduce it? Is this the right level at which to introduce protection? Are there bulkheads in place that would limit the scope of the problem? This is all in contrast to the more typical, adversarial approach, where security teams focus on making it harder for people to do the wrong thing. Instead of putting all the effort toward creating obstacles that prevent engineers and other employees from compromising security, investing that effort in creating a pit of success yields better outcomes.

Summary

To make security BLISS instead of checkbox-filled theater, use Bulkheads to introduce isolation that limits the blast radius of incidents, apply protection on multiple Levels proportionate to the risk, focus on reducing the Impact of security incidents when they inevitably happen, keep your tools and processes Simple, and create a pit of Success for your team members so that they do the right thing by default.

Will all of this help you pass your next certification audit? Not necessarily. But it’s not going to make it harder and, at the same time, it will make everyone’s lives easier. This is how you create a foundation for security by default and how you can stay secure without spending a fortune on fancy tools, extra headcount, or engineering productivity loss.