DSPM Scanning Limits and Constraints

Understanding how DSPM scanning engines process and report sensitive data

Overview

This document describes the processing limits and constraints applied by the DLP scanning engines. Understanding these limits helps set accurate expectations for how sensitive data is detected, reported, and surfaced within your environment.

Two scanning engines are used depending on the deployment context: the Normalyze (NZ) Engine (HyperScan-based) and the UDLP Engine (Java RegEx-based). Each has distinct limits for data volume, match counts, and result reporting.

Quick Reference

Limit                      NZ Engine                 UDLP Engine
-------------------------  ------------------------  ------------------------------------
Maximum data scanned       1 MB per scan             16 MB per scan
Maximum total matches      1,000 entities            1,000 matched detectors
Maximum matches per type   Not independently capped  30 per detector (SmartID)
Dictionary match limit     N/A                       255 per dictionary
Maximum snippets captured  20 per entity type        Distributed across detectors
Scan technology            HyperScan (C++)           Java RegEx with HyperScan pre-filter

NZ Engine

How It Works

The NZ Engine uses HyperScan, a high-performance regular expression library, to scan text data. It processes content in chunks and applies limits to both the volume of data scanned and the number of results returned.

Limits

  • Text chunk size: 1 MB — only the first 1 MB of content is scanned
  • Maximum results: 1,000 entities across all sensitive data types combined
  • Snippet capture: Up to 20 snippets per entity type, drawn from the 1,000-entity result pool

Behavior

The NZ Engine scans the first 1 MB of text data and returns the first 1,000 matching entities found, regardless of type. Results are reported in the order they appear in the content.

Example: If the first 1,000 entities found are all dates, only dates will be reported — even if other sensitive data types (such as SSNs or emails) appear later in the document.
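The first-come, first-served behavior described above can be illustrated with a minimal Python sketch. It is not the engine's implementation: the function name `nz_report`, the `(entity_type, text)` tuple shape, and the choice to keep the first 20 snippets per type are assumptions made for illustration. Only the two limits (1,000 entities, 20 snippets per entity type) come from this document.

```python
from collections import defaultdict

MAX_ENTITIES = 1_000        # total entities returned per scan (documented limit)
MAX_SNIPPETS_PER_TYPE = 20  # snippets kept per entity type (documented limit)

def nz_report(matches):
    """Hypothetical model: `matches` yields (entity_type, text) in document order."""
    counts = defaultdict(int)
    snippets = defaultdict(list)
    for index, (entity_type, text) in enumerate(matches):
        if index >= MAX_ENTITIES:
            break  # the result pool is full; later matches are never reported
        counts[entity_type] += 1
        # Assumption: snippets are taken from the earliest matches of each type.
        if len(snippets[entity_type]) < MAX_SNIPPETS_PER_TYPE:
            snippets[entity_type].append(text)
    return dict(counts), dict(snippets)

# 1,000 dates appear first, followed by one SSN and one email address.
stream = [("DATE", f"2024-01-{d % 28 + 1:02d}") for d in range(1_000)]
stream += [("SSN", "123-45-6789"), ("EMAIL", "a@example.com")]

counts, snippets = nz_report(stream)
print(counts)                  # {'DATE': 1000}
print(len(snippets["DATE"]))   # 20
```

Because the dates fill the 1,000-entity pool, the SSN and email address that follow never reach the report.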

UDLP Engine

How It Works

The UDLP Engine uses a Java-based regular expression library with a HyperScan pre-filter. HyperScan first scans text to identify potential match candidates; only those candidates are then evaluated by the Java RegEx engine. This two-step approach allows for advanced RegEx features such as lookahead assertions, which HyperScan alone does not support.
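The two-step flow can be sketched in Python as follows. This is an illustration only: Python's `re` module stands in for both HyperScan and Java RegEx, and the two patterns and the function name `two_stage_scan` are hypothetical, not the engine's actual detectors.

```python
import re

# Stage 1 (stand-in for the HyperScan pre-filter): a cheap pattern with no
# lookaround that over-matches on purpose.
PREFILTER = re.compile(r"\d{3}-\d{2}-\d{4}")

# Stage 2 (stand-in for the Java RegEx engine): a stricter, hypothetical
# pattern that uses negative lookahead to reject implausible values.
CONFIRM = re.compile(r"(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}")

def two_stage_scan(text):
    """Return only the candidates that the stricter second stage confirms."""
    confirmed = []
    for candidate in PREFILTER.finditer(text):
        if CONFIRM.fullmatch(candidate.group()):
            confirmed.append(candidate.group())
    return confirmed

text = "valid 123-45-6789, invalid 000-12-3456, invalid 123-00-4567"
print(two_stage_scan(text))  # ['123-45-6789']
```

The expensive pattern only runs on the small set of candidate strings, which is the point of placing a fast pre-filter in front of it.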

Limits

  • Text data: Up to 16 MB scanned per document (enforced by the UDLP text extractor)
  • Matched detectors: Up to 1,000 detectors can be reported per scan
  • Matches per SmartID (detector): Capped at 30 per type
  • Dictionary matches: Capped at 255 per dictionary
  • Find limit: 10x the per-detector match cap, that is, 300 find attempts per detector, which gives the engine more chances to reach 30 confirmed matches

Behavior

The UDLP Engine scans the first 16 MB of document content. It independently caps matches at 30 per detector type, regardless of how many total instances exist in the document.

Example: If a document contains 900 SSNs and 500 email addresses, the engine reports 30 SSNs and 30 emails. If a document contains 10,000 SSNs followed by just 5 emails, it reports 30 SSNs and 5 emails.
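The per-detector cap and find limit can be modeled with a short Python sketch. The values 30 and 300 come from the limits above; everything else is an assumption for illustration, including the function name `scan_detector`, the `validate` hook, the sample patterns, and the idea that each regex hit counts as one find attempt.

```python
import re

MATCH_CAP = 30               # confirmed matches reported per detector (documented)
FIND_LIMIT = 10 * MATCH_CAP  # 300 find attempts per detector (documented)

def scan_detector(pattern, text, validate=lambda s: True):
    """Hypothetical model of one detector: stop at 30 confirmed matches
    or after 300 find attempts, whichever comes first."""
    confirmed = []
    for attempt, match in enumerate(re.finditer(pattern, text), start=1):
        if attempt > FIND_LIMIT:
            break
        if validate(match.group()):
            confirmed.append(match.group())
            if len(confirmed) == MATCH_CAP:
                break
    return confirmed

# 10,000 SSN-like strings followed by 5 email addresses.
doc = " ".join(["123-45-6789"] * 10_000
               + [f"user{i}@example.com" for i in range(5)])

ssns = scan_detector(r"\d{3}-\d{2}-\d{4}", doc)
emails = scan_detector(r"[\w.]+@[\w.]+\.\w+", doc)
print(len(ssns), len(emails))  # 30 5
```

Each detector is capped independently, so the large number of SSNs does not crowd out the five email addresses.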

Snippet Distribution

In the UDLP Engine, snippets (the text excerpts surfaced alongside match results) are distributed evenly across all detectors that produce matches in a given scan.

Detectors with Matches  Snippets per Detector
----------------------  ------------------------------------------------
1,000 detectors         1 snippet each
100 detectors           10 snippets each
10 detectors            Up to 100 snippets each (subject to overall cap)
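The figures above are consistent with an even split of a fixed snippet pool. The sketch below assumes a pool of 1,000 snippets per scan, which is inferred from the table rather than stated in this document, and assumes integer division when the split is uneven. The name `snippets_per_detector` is hypothetical.

```python
SNIPPET_POOL = 1_000  # assumed total, inferred from "1,000 detectors -> 1 snippet each"

def snippets_per_detector(detectors_with_matches, pool=SNIPPET_POOL):
    """Even split of the assumed pool across detectors that produced matches."""
    if detectors_with_matches <= 0:
        return 0
    return pool // detectors_with_matches

for n in (1_000, 100, 10):
    print(n, snippets_per_detector(n))
# 1000 1
# 100 10
# 10 100
```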

Pre-Filter Mode & SmartID 2.0

Current Architecture

In the current implementation, HyperScan acts as a pre-filter: it performs a fast initial scan to identify candidate text regions. Those regions are then passed to the Java RegEx engine for precise matching. This allows the use of advanced RegEx constructs (such as lookahead) that HyperScan does not natively support.

SmartID 2.0

With the release of SmartID 2.0, all regular expressions will be rewritten to be natively compatible with HyperScan. This removes the need for the two-stage design (HyperScan pre-filter followed by Java RegEx matching) entirely. Key implications:

  • No pre-filter stage: HyperScan will handle all matching directly.
  • Improved performance: Removing the Java RegEx layer reduces processing overhead.
  • Full HyperScan compatibility required: All detectors must conform to HyperScan’s RegEx syntax.
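As a rough aid for the compatibility requirement, the sketch below flags a few constructs that Hyperscan's documentation lists as unsupported (lookahead, lookbehind, backreferences, atomic groups). It is a text heuristic under stated assumptions, not a validator: the list is partial, escaped parentheses can cause false positives, and the name `hyperscan_blockers` is hypothetical. Compiling the pattern with HyperScan itself remains the authoritative check.

```python
import re

# Partial list of constructs that Hyperscan does not support.
# Each entry is a simple textual probe, not a full regex parser.
UNSUPPORTED = {
    "lookahead": re.compile(r"\(\?[=!]"),
    "lookbehind": re.compile(r"\(\?<[=!]"),
    "backreference": re.compile(r"\\[1-9]"),
    "atomic group": re.compile(r"\(\?>"),
}

def hyperscan_blockers(pattern):
    """Return the names of flagged constructs found in a detector pattern."""
    return [name for name, probe in UNSUPPORTED.items() if probe.search(pattern)]

print(hyperscan_blockers(r"(?!000)\d{3}-\d{2}-\d{4}"))  # ['lookahead']
print(hyperscan_blockers(r"\d{3}-\d{2}-\d{4}"))         # []
```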

Key Takeaways

  • Only the first 1 MB (NZ) or 16 MB (UDLP) of a document is scanned — content beyond these limits is not evaluated.
  • NZ Engine results are first-come, first-served across all types; a single prevalent type can exhaust the 1,000-entity limit.
  • UDLP Engine enforces per-type caps of 30 matches, ensuring multiple data types are represented in results even when one type is extremely common.
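  • In the UDLP Engine, snippet availability depends on how many detectors match: more matching detectors means fewer snippets per detector. The NZ Engine captures up to 20 snippets per entity type.
  • SmartID 2.0 will simplify the UDLP architecture by removing the Java RegEx stage, so that HyperScan performs all matching directly.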