Automated Sensitive Data Leak Detection

The average multinational spends several million dollars a year on compliance, while in highly regulated industries, such as financial services and defense, the costs can run to tens or even hundreds of millions. Despite these rigorous assessments, we still wake up to data breach announcements seemingly on an hourly basis.

Why is tracking sensitive data a hard problem?

Redacting sensitive data might seem simple at first: just overwrite the most or least significant bits of data with zeros or XXXs and carry on, right? A closer analysis raises important questions, including:

  • Imagine that we have two variables, X and Y, which are defined to be “sensitive.” We then declare and initialize another local variable Z with a value derived from some function that combines X and Y.
    Is Z “sensitive”? Does Z need to be redacted/obfuscated?
  • Exactly when should redaction/obfuscation be performed? The answer depends on the scope and lifetime of data objects, which are in turn closely tied to how a particular programming language’s model (functional, object-oriented) organizes and (de-)allocates data.
  • How can we regulate the flow of sensitive data and its derivatives throughout the running application? Can sensitive and regulated data be sent to third-party analytics services to measure DAUs and MAUs without the consent of the consumer who owns that data?
  • How do we verify that redaction/obfuscation really has been performed correctly, to the satisfaction of ourselves, our customers, and regulators?
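The first question above (is Z sensitive?) can be made concrete with a small sketch. It assumes a conservative rule, namely that any value derived from a sensitive input is itself sensitive. The names `Tracked`, `sensitive`, `combine`, and `redact` are hypothetical and exist only for illustration; they are not part of any tool mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value plus the names of the sensitive inputs it was derived from."""
    value: str
    sources: frozenset = frozenset()

    @property
    def sensitive(self) -> bool:
        # A value is sensitive if at least one sensitive input contributed to it.
        return bool(self.sources)


def sensitive(name: str, value: str) -> Tracked:
    """Mark a value as sensitive under the given (hypothetical) label."""
    return Tracked(value, frozenset({name}))


def combine(a: Tracked, b: Tracked, sep: str = " ") -> Tracked:
    # Assumed conservative rule: the result inherits the union of both inputs' sources.
    return Tracked(a.value + sep + b.value, a.sources | b.sources)


def redact(t: Tracked, keep_last: int = 4) -> str:
    """Overwrite all but the last `keep_last` characters of a sensitive value."""
    if not t.sensitive:
        return t.value
    visible = t.value[-keep_last:] if keep_last > 0 else ""
    return "X" * (len(t.value) - len(visible)) + visible


x = sensitive("X", "4111111111111111")
y = sensitive("Y", "12/27")
z = combine(x, y)

print(z.sensitive)        # True: Z is derived from X and Y
print(sorted(z.sources))  # ['X', 'Y']
print(redact(z))
```

Under this rule Z is sensitive because it carries both labels. Note that the redacted form of Z still exposes the tail of Y, which hints at why "when and how to redact" is harder than overwriting a few characters.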

The various shapes of data

Data originates when consumers subscribe to interact with value-added services. When a consumer registers or signs in to the service, data objects are created to represent the customer persona. The lifetime of these objects is restricted to the scope of the customer session. A typical customer session triggers various functional flows to serve the customer’s needs, leading to the creation of many communication paths, both within the core application and across its boundary to other SaaS applications.

  • have the ability to classify these detected types as sensitive, based on a supervised natural language processing model trained on a corpus of compliance mandates;
  • track all transformations, lineage, and provenance of such sensitive types;
  • finally, measure whether such sensitive types violate any current (SOC-2, GDPR) or forthcoming (CCPA) compliance constraints.
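The classify-then-measure steps in the list above can be sketched as follows. This is a simplified stand-in, not the approach described in the list: a few regular expressions replace the supervised NLP model, and the categories, patterns, destinations, and policy table are all hypothetical assumptions made up for the example.

```python
import re

# Hypothetical keyword patterns standing in for a trained classifier.
CATEGORY_PATTERNS = {
    "payment": re.compile(r"card|cvv|iban|account_?number", re.I),
    "identity": re.compile(r"ssn|passport|date_?of_?birth|dob", re.I),
    "contact": re.compile(r"email|phone|address", re.I),
}

# Hypothetical policy: the only destinations each category may reach.
ALLOWED_DESTINATIONS = {
    "payment": {"payment_gateway"},
    "identity": {"kyc_service"},
    "contact": {"crm", "email_provider"},
}


def classify(field_name):
    """Return the first matching sensitive category, or None if nothing matches."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(field_name):
            return category
    return None


def violations(flows):
    """Given (field_name, destination) pairs, list flows that break the policy."""
    found = []
    for field, destination in flows:
        category = classify(field)
        if category and destination not in ALLOWED_DESTINATIONS[category]:
            found.append((field, category, destination))
    return found


observed_flows = [
    ("customerEmail", "crm"),
    ("cardNumber", "analytics"),
    ("page_views", "analytics"),
]
print(violations(observed_flows))  # [('cardNumber', 'payment', 'analytics')]
```

A real classifier would need to cope with field names that carry no obvious keyword, which is presumably why a model trained on compliance text is preferred over a fixed pattern list.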

Prevent compliance metrics from going astray

So how do we create models that can credibly evaluate the impact of a data-driven compliance program?

  • Write operations — The termination point (or endpoint) of an ordered data flow.
  • Transformations — In addition to read and write interface interactions, we identify data transformations, for example, encryption/decryption, redaction, escape routines, etc.
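One way to picture these building blocks is as nodes in a small data-flow graph, where a flow starts at a read, may pass through transformations, and terminates at a write. The sketch below is only an illustration under that assumption: the `Node` type, the node names, and the choice of `encrypt` and `redact` as the "protective" transformations are hypothetical.

```python
from collections import namedtuple

# kind is one of "read", "transform", "write"
Node = namedtuple("Node", "name kind")


def unprotected_paths(graph, nodes, start, protective=("encrypt", "redact")):
    """Return every path from `start` to a write node that passes through
    no protective transformation."""
    results = []
    stack = [(start, [start])]
    while stack:
        current, path = stack.pop()
        node = nodes[current]
        if node.kind == "transform" and node.name in protective:
            continue  # data is protected from here on, so stop following this path
        if node.kind == "write":
            results.append(path)  # reached an endpoint without protection
            continue
        for nxt in graph.get(current, []):
            if nxt not in path:  # skip nodes already on this path to avoid cycles
                stack.append((nxt, path + [nxt]))
    return results


nodes = {
    "form": Node("read_signup_form", "read"),
    "fmt": Node("format", "transform"),
    "enc": Node("encrypt", "transform"),
    "db": Node("write_db", "write"),
    "log": Node("write_log", "write"),
}
graph = {"form": ["fmt", "enc"], "fmt": ["log"], "enc": ["db"]}

print(unprotected_paths(graph, nodes, "form"))  # [['form', 'fmt', 'log']]
```

Here the flow into the database is encrypted first, while the flow into the log is only formatted, so the log write is the one reported.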

Compliance Engineering using Code-as-Data

ShiftLeft’s Ocular is an application security platform built on the foundational Code Property Graph. It is uniquely positioned to deliver a specification model for querying sensitive data leaks that might exist in your application’s codebase.

Engineer, InfoSec tinkerer, Seed Investor, Founder/CTO of ShiftLeft Inc., (Opinions, my own)
