A Guide to Building a Resilient Security Operations Program

By Brian Magner, Vice President of Solutions Architecture

Estimated Reading Time: 11 minutes

Our vision at Deepwatch is to be the cybersecurity partner every enterprise relies on to deliver mission-critical cyber resilience in an increasingly digital world. To that end, we run one of the largest Security Operations Centers (SOCs) in the business, giving us incredible insight into countless security operations approaches, models, and program processes. I mention this because our scale, visibility, and experience provide important lessons for any enterprise building a cyber resilient security operations program.

In this guide, we’ll review key considerations and recommendations for building a resilient security operations center: your logging strategy, the priority of defining your unique risk profile, ways to be smarter with your detection content, how to evaluate your response maturity, and finally how to measure, adapt, and improve your program.

Start with the RIGHT data logging strategy

Developing a resilient Security Operations program has to start at the ground level: with the data. In the modern enterprise there are simply too many security tools generating too much data for us to reasonably think we can make use of it all in a meaningful way. For this reason, every organization needs to align on a centralized data logging strategy that makes sense given the current technology, resource, and budget constraints of your business. In other words, don’t log all the things if you can’t make use of all the things.

The data generated across your environment supports varying facets of security action: detection, enrichment, investigation, response, and hunting. It’s important to understand what outcome you are trying to achieve and then prioritize your available data sources accordingly. Stack rank them by importance and create a data source roadmap so that you have a strategy for what data to incorporate into the SOC next as you bring on resources and mature your ability to make use of that data.
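The stack-ranking step above can be sketched in code. This is a minimal, hypothetical example: the source names, value and effort scores, and weights are all illustrative assumptions, not a standard scoring model.

```python
# Hypothetical sketch: stack-rank candidate data sources into a logging roadmap.
# Scores and weights are illustrative placeholders; tune them to your business.

def rank_sources(sources, value_weight=0.7, effort_weight=0.3):
    """Score each source by detection value minus onboarding effort, descending."""
    def score(src):
        return value_weight * src["detection_value"] - effort_weight * src["onboarding_effort"]
    return sorted(sources, key=score, reverse=True)

candidates = [
    {"name": "edr_telemetry", "detection_value": 9, "onboarding_effort": 3},
    {"name": "dns_logs",      "detection_value": 6, "onboarding_effort": 2},
    {"name": "netflow",       "detection_value": 5, "onboarding_effort": 8},
    {"name": "auth_logs",     "detection_value": 8, "onboarding_effort": 2},
]

roadmap = rank_sources(candidates)
print([s["name"] for s in roadmap])
```

The "onboarding effort" term captures the custom-parsing-versus-native-integration question discussed below: a moderately valuable source with a native integration can outrank a richer source that needs custom collection work.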

As you prioritize these data sources to create your logging strategy, consider the level of effort to collect and log that data as well as any overlap there may be between that data and other sources you already collect from. Does onboarding the data to the SOC require custom parsing and collection methods or do native integrations exist? 

As you gain a level of comfort with your data logging strategy, you can take it to a more granular level and execute the same approach within each data source you onboard. Rather than prioritizing one source against another, you can begin prioritizing specific events within the same data source. Data sources don’t have to be all or nothing, and with so many permutations of how this can be done, defining your strategy is foundational to building a resilient security operations program.

Prioritize based on your RISK profile

Understanding what makes you unique and how that translates to risk within your security program is critical. A resilient SOC must prioritize security measures based on the unique risk profile of the organization. To accomplish this you must first have a foundational understanding of the assets and identities within your environment. Which would you deem critical, and why? This will depend on the core functions of your business and the industry in which you operate. Crown jewels, domain controllers, file servers, production databases, web servers, ecommerce platforms, privileged users, service accounts, high-ranking executives… all of these need to be considered and tracked through healthy Configuration Management Database (CMDB) and asset inventory practices.

With key assets and identities outlined, you can then put a system in place to dynamically assign inherent risk values to those entities. Because risks within the organization continuously evolve, those inherent risk values create a baseline for all downstream detection and response decisions. Adjusting risk in real time based on factors such as alert frequency, severity, relevant threat intelligence, and behavioral analytics allows for innate contextualization of detections and a more tailored experience for the analysts working those investigations.

At Deepwatch, we approach this through a proprietary method we call Dynamic Risk Scoring. The goal is simple: decrease noise, increase fidelity. The correlation engine analyzes and aggregates these risk values across a customer’s environment over extended periods of time to assess the aggregate risk that a series of events poses to the business.

Assigning and aggregating risk is one thing, but to operationalize such a system you must also define a minimum risk threshold for your environment. At what point does the risk of the activity become great enough that you must take action? Implementing a threshold or tolerance level is where you really start to see efficiency gains: the volume of alerts that require action drops, because an alert or series of alerts must now meet the minimum risk level. The outcome sounds simple, but it can be challenging to attain without a system aligned to your risk profile.
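The inherent-risk-plus-threshold idea can be sketched as follows. This is a hypothetical illustration only: the entity risk values, multipliers, and threshold are invented, and Deepwatch’s actual Dynamic Risk Scoring is proprietary and not reproduced here.

```python
from collections import defaultdict

# Hypothetical sketch of risk aggregation gated by a minimum action threshold.
# All numbers below are illustrative assumptions, not recommended values.

INHERENT_RISK = {"domain_controller": 40, "workstation": 10}
ACTION_THRESHOLD = 100  # minimum aggregate risk before the SOC takes action

def alert_risk(entity_type, severity_multiplier, threat_intel_hit):
    """Score one alert: inherent entity risk, scaled by severity and intel context."""
    risk = INHERENT_RISK.get(entity_type, 5) * severity_multiplier
    if threat_intel_hit:
        risk *= 1.5  # corroborating threat intelligence raises confidence
    return risk

def aggregate(alerts):
    """Sum alert risk per entity and return only entities above the threshold."""
    totals = defaultdict(float)
    for a in alerts:
        totals[a["entity"]] += alert_risk(a["type"], a["severity"], a["intel"])
    return {entity: r for entity, r in totals.items() if r >= ACTION_THRESHOLD}

alerts = [
    {"entity": "dc01", "type": "domain_controller", "severity": 2.0, "intel": True},
    {"entity": "ws17", "type": "workstation",       "severity": 1.0, "intel": False},
]
print(aggregate(alerts))  # only dc01 crosses the action threshold
```

Note how the workstation alert never surfaces for action on its own; it still contributes risk, so a later series of events on the same host could push it over the threshold.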

Be SMARTER with your detection content 

It’s important to approach detections within the SOC differently than detections within your independent security tools. Each security tool you have deployed has its own set of targeted detections designed specifically for that tool. Detections within the SOC need to be focused more holistically on the environment and built for flexibility. You will undoubtedly add new tools and replace legacy tooling with more modern tooling on a fairly regular basis. Your SOC should be built with this in mind.

Proper field mapping, data modeling, and framework standardization, such as the AWS-led Open Cybersecurity Schema Framework (OCSF), also become critical to the way you build your detections. When done right, detections are built against a framework or model and the underlying technology becomes far less relevant. What matters is the type of data: endpoint, cloud, authentication, network, and so on. With detections written against the framework, you can swap security tools in and out as needed within your organization with little to no impact on your SOC detection coverage.
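A toy sketch of that decoupling is below. The vendor names, field mappings, and process list are invented for illustration, and the normalized field names are only loosely OCSF-inspired rather than an exact rendering of the schema.

```python
# Hypothetical sketch: normalize vendor events into common field names so a
# detection targets the data model, not the tool that produced the event.
# Vendor names, field names, and the process blocklist are all invented.

FIELD_MAPS = {
    "vendor_a_edr": {"host": "device_hostname", "user": "user_name", "proc": "process_name"},
    "vendor_b_edr": {"host": "computer",        "user": "account",   "proc": "image"},
}

def normalize(source, raw_event):
    """Map a raw vendor event into normalized field names."""
    mapping = FIELD_MAPS[source]
    return {norm: raw_event[raw] for norm, raw in mapping.items()}

def detect_suspicious_proc(event):
    """A detection written once against the normalized model."""
    return event["proc"].lower() in {"mimikatz.exe", "psexec.exe"}

# The same detection evaluates events from either tool without modification.
ev_a = normalize("vendor_a_edr",
                 {"device_hostname": "ws01", "user_name": "alice", "process_name": "mimikatz.exe"})
ev_b = normalize("vendor_b_edr",
                 {"computer": "ws02", "account": "bob", "image": "notepad.exe"})
print(detect_suspicious_proc(ev_a), detect_suspicious_proc(ev_b))
```

Replacing vendor A with vendor C then only requires a new entry in the field map; the detection logic itself is untouched.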

Another important piece of detection development is creating and defining the steps an analyst needs to take when that detection triggers, in the playbook associated with that detection. Most SOCs treat these as separate functions, but they need to be considered as one and the same. A detection that triggers without a defined investigation process can be as inefficient, and borderline useless, as an investigation playbook without a defined detection to trigger it. Creating and defining the proper investigation steps as part of building and deploying the detection ensures you have a thorough and consistent investigation from the first time that detection triggers. No gaps, no inconsistencies, and a more programmatic approach to the investigation.

Lastly, as you think about the future and your detection roadmap, it’s nearly impossible to define the right targets and goals without knowing where the holistic coverage gaps are within the organization. To identify gaps you must have a framework to map against. MITRE ATT&CK seems to be the de facto adversarial behavior framework that many SOCs leverage, but it isn’t the only one (CIS Controls, NIST CSF, etc.). Pick the one that meets your needs and map your detection content to it. This will help guide your future content development decisions, and will even help inform your data logging strategy. It can identify blockers to building detections, such as data not yet collected and sent to the SOC.

Start thinking about response MATURITY

Response is always a spirited topic within the SOC and among Security Operations technology and service providers; “response” means different things to different people and different vendors. It’s important to tackle it at the most basic level, though: when I have high confidence in a confirmed threat in my environment, what action(s) am I going to take to mitigate and eradicate that threat? While speed can be a critical element of your response, proper planning, preparedness, and testing are your best friends.

Similar to building and mapping analyst playbooks and workflows to every detection, I would challenge you to also map a defined response or containment action to each detection. If a detection triggers AND it either has an adequate historical true positive rate OR analysis has confirmed it to be a true positive, take that action immediately. As you map this out, you’ll find that many of your detections feed a small subset of response actions at the endpoint, network, and identity level: isolate a host, update a policy, reset a password. Start there, and as you mature your SOC you can expand upon those on a more granular scale.
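That gating rule can be written down directly. The detection names, action names, and true-positive threshold below are illustrative assumptions, not a recommended configuration:

```python
# Hypothetical sketch: gate a predefined containment action on either a strong
# historical true-positive rate OR explicit analyst confirmation.

RESPONSE_MAP = {
    "ransomware_behavior": "isolate_host",
    "password_spray":      "reset_password",
    "c2_beacon":           "block_destination",
}
TP_RATE_THRESHOLD = 0.9  # illustrative; set from your own alert history

def choose_action(detection, historical_tp_rate, analyst_confirmed):
    """Return the mapped containment action if either gate is met, else None."""
    if historical_tp_rate >= TP_RATE_THRESHOLD or analyst_confirmed:
        return RESPONSE_MAP.get(detection)
    return None

print(choose_action("ransomware_behavior", 0.95, False))  # high TP rate gate
print(choose_action("c2_beacon", 0.4, False))             # neither gate met
print(choose_action("password_spray", 0.1, True))         # analyst-confirmed gate
```

Notice the map is small: three actions cover three detections, which mirrors the point that many detections funnel into a handful of endpoint, network, and identity responses.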

This is also a great opportunity to introduce automation into the SOC. You’ll need to reach a certain comfort and accuracy level first, but automation is where you’ll get your speed and efficiency. To that end, here are three considerations to keep in mind:

1. Start small – identify lower risk endpoints and users to start with.
2. Build an exit strategy – for any automated action you build, you’ll want an “undo” action as well. (Trust me, you’ll need it.)
3. Don’t try to automate out the people – when we hear “automation” in security, our brains immediately jump to end-to-end automation, and that’s just not the reality. You are automating very targeted steps of a much larger workflow, so ensure you have the proper human checks, balances, and validations wrapped around your automations.
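The “exit strategy” consideration in particular lends itself to code. A minimal sketch, assuming invented action names; the actions here only mutate a local dict, where real ones would call your EDR or identity provider APIs:

```python
# Hypothetical sketch: every automated action must register an "undo" before it
# runs, so a bad automation can be rolled back in reverse order.

class AutomationRunner:
    def __init__(self):
        self.log = []
        self._undo_stack = []

    def run(self, name, action, undo):
        """Refuse to execute an automation that has no paired undo action."""
        if undo is None:
            raise ValueError(f"{name}: no exit strategy defined")
        self._undo_stack.append((name, undo))
        action()
        self.log.append(f"ran {name}")

    def rollback(self):
        """Undo completed actions, most recent first."""
        while self._undo_stack:
            name, undo = self._undo_stack.pop()
            undo()
            self.log.append(f"undid {name}")

# Toy state standing in for a real EDR isolation API call.
state = {"ws17_isolated": False}
runner = AutomationRunner()
runner.run(
    "isolate_ws17",
    action=lambda: state.update(ws17_isolated=True),
    undo=lambda: state.update(ws17_isolated=False),
)
runner.rollback()
print(state["ws17_isolated"])
```

The human check-and-balance from consideration 3 fits naturally here too: the `run` call is the targeted automated step, while the decision to invoke it, or `rollback`, stays with an analyst.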

Define metrics & KPIs that drive IMPROVEMENT

Security Operations leaders must become better at presenting clear narratives of security improvement to stakeholders. While it may be impressive to cite the thousands of alerts and potential threats their teams see, the SOC must do more. It must define metrics that show threat readiness, offer accountability metrics that are reasonable, and weave those metrics into narratives that various stakeholders can understand and get behind.

One area in particular that I find to be of great value is focusing on how prepared your security operations program is to handle a particular type of threat. At Deepwatch we call these “Threat Ready” metrics. Let’s use ransomware as an example. You can determine, with a high degree of confidence, what data sources you would want visibility into, what types of behaviors you would want to look for or detect and what your immediate response would be should you see those behaviors within those data sources. 

Map that out specific to a ransomware event and compare the current state of your SOC to it. Do you have what you would need? How prepared or “threat ready” are you? Repeat the exercise for the threats that concern you most: phishing, insider threat, the latest zero-day, etc. This creates a metric tied to a type of threat that non-security stakeholders can understand, while also forcing you to think beyond the standard SOC metrics.
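A threat-ready score can be as simple as a coverage ratio. The ransomware requirements below are invented placeholders, not a checklist standard; the point is the comparison of required capabilities against current state:

```python
# Hypothetical sketch of a "threat ready" metric: what fraction of the data
# sources, detections, and response actions a scenario calls for do we have?

def readiness(required, current):
    """Fraction of required capabilities currently in place."""
    have = sum(1 for item in required if item in current)
    return have / len(required)

# Illustrative ransomware scenario requirements; define your own per threat.
ransomware_required = {"edr_telemetry", "backup_audit_logs", "smb_logs",
                       "detect_mass_encryption", "isolate_host_action"}
soc_current = {"edr_telemetry", "smb_logs", "detect_mass_encryption"}

score = readiness(ransomware_required, soc_current)
print(f"ransomware threat-ready: {score:.0%}")
```

Repeating this for phishing, insider threat, and so on yields one digestible number per threat, and the missing set items double as the roadmap for closing the gap.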

Another great measure of the health of your security operations processes or workflows is to look for what we call “Accountability Metrics”. These are metrics that come in pairs, with an independent owner for each half of the pair. A great example is Mean Time to Notify (how long did it take the SOC to notify the business owner of a threat) and Mean Time to Acknowledge (how long did it take the business owner to acknowledge that notification). This is very much a yin and yang, where a focus on improving one without the other doesn’t yield the desired outcome. To improve the process, both ends of the Accountability Metric must be addressed. Measuring and tracking these allows for quick identification of exactly where the process is broken.
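The MTTN/MTTA pair is straightforward to compute. A minimal sketch, with invented incident timestamps expressed as minutes since detection for simplicity:

```python
from statistics import mean

# Hypothetical sketch of a paired Accountability Metric:
#   MTTN (SOC-owned): detection -> notification
#   MTTA (business-owner-owned): notification -> acknowledgement

incidents = [
    {"detected": 0, "notified": 12, "acknowledged": 45},
    {"detected": 0, "notified": 8,  "acknowledged": 20},
    {"detected": 0, "notified": 10, "acknowledged": 70},
]

mttn = mean(i["notified"] - i["detected"] for i in incidents)
mtta = mean(i["acknowledged"] - i["notified"] for i in incidents)

print(f"MTTN: {mttn:.1f} min, MTTA: {mtta:.1f} min")
```

In this invented data the SOC notifies within minutes while acknowledgement lags well behind, which points the improvement conversation at the business-owner side of the pair rather than the SOC.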

Set Expectations and Build Resilience

Cyber attacks are inevitable, so organizations must focus on more than firewalls, antivirus protection, or endpoint detection. Enterprise security teams must set these expectations and aim for cyber resilience. To anticipate, withstand, recover from, and adapt to threats, the SOC must establish effective and dynamic logging strategies, with detection and response processes prioritized and aligned to business risks. Teams must measure progress and continuously mature through metrics that can be communicated to stakeholders throughout the organization.

Finally, whatever metrics and KPIs you choose to make paramount to your security operations program, it is critical to learn how to communicate them effectively in a narrative the business can easily understand. Something I often say is “practice translating one level up.” Analysts communicate to SOC Managers, who communicate to VPs, who communicate to CISOs, and on up to the board. Use the data to build a narrative that effectively captures your audience.

In the end, to build a cyber resilient SecOps program, start with the right logging strategy, prioritize actions based on unique risk profiles, be holistic and flexible with detection content, and work to measure and improve the maturity of your team and effort.

Brian Magner, Vice President of Solutions Architecture

Brian has over a decade of experience across software, technology, and cybersecurity organizations ranging in focus from risk management to network security to security operations.
