DSPM Scanning Limits and Constraints

Understanding how DSPM scanning engines process and report sensitive data

Overview

This document describes the processing limits and constraints applied by the DLP scanning engines. Understanding these limits helps set accurate expectations for how sensitive data is detected, reported, and surfaced within your environment.

Two scanning engines are used depending on the deployment context: the Normalyze (NZ) Engine (HyperScan-based) and the UDLP Engine (Java RegEx-based). Each has distinct limits for data volume, match counts, and result reporting.

Quick Reference

Limit	NZ Engine	UDLP Engine
Maximum data scanned	1 MB per scan	16 MB per scan
Maximum total matches	1,000 entities	1,000 matched detectors
Maximum matches per type	Not independently capped	30 per detector (SmartID)
Dictionary match limit	N/A	255 per dictionary
Maximum snippets captured	20 per entity type	Distributed across detectors
Scan technology	HyperScan (C++)	Java RegEx with HyperScan pre-filter

NZ Engine

How It Works

The NZ Engine uses HyperScan, a high-performance regular expression library, to scan text data. It processes content in chunks and applies limits to both the volume of data scanned and the number of results returned.

Limits

Text chunk size: 1 MB — only the first 1 MB of content is scanned
Maximum results: 1,000 entities across all sensitive data types combined
Snippet capture: Up to 20 snippets per entity type, drawn from the 1,000-entity result pool

Behavior

The NZ Engine scans the first 1 MB of text data and returns the first 1,000 matching entities found, regardless of type. Results are reported in the order they appear in the content.

Example: If the first 1,000 entities found are all dates, only dates will be reported — even if other sensitive data types (such as SSNs or emails) appear later in the document.
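
The first-come, first-served behavior can be illustrated with a small simulation. This is a minimal sketch, not the engine's implementation: the function name nz_report and the (entity_type, snippet) input shape are hypothetical, and the input is assumed to contain only entities found within the first 1 MB, in document order.

```python
from collections import defaultdict
from itertools import islice

MAX_ENTITIES = 1000          # total entities reported per scan (from the limits above)
MAX_SNIPPETS_PER_TYPE = 20   # snippets kept per entity type (from the limits above)


def nz_report(entities):
    """Simulate NZ-style reporting.

    `entities` is an iterable of (entity_type, snippet) pairs in document
    order. This shape is an assumption made for illustration only.
    """
    counts = defaultdict(int)
    snippets = defaultdict(list)
    # Keep only the first 1,000 entities, regardless of type.
    for entity_type, snippet in islice(entities, MAX_ENTITIES):
        counts[entity_type] += 1
        # Snippets come from the same 1,000-entity pool, up to 20 per type.
        if len(snippets[entity_type]) < MAX_SNIPPETS_PER_TYPE:
            snippets[entity_type].append(snippet)
    return dict(counts), dict(snippets)


# 1,000 dates appear before any SSN, so the SSNs are never reported.
stream = [("DATE", f"date-{i}") for i in range(1000)] + [("SSN", "ssn")] * 50
counts, snippets = nz_report(stream)
print(counts)                  # only the DATE type is present
print(len(snippets["DATE"]))   # at most 20 snippets for that type
```

Under these assumptions, a type that appears only after the 1,000th entity contributes neither matches nor snippets.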

UDLP Engine

How It Works

The UDLP Engine uses a Java-based regular expression library with a HyperScan pre-filter. HyperScan first scans text to identify potential match candidates; only those candidates are then evaluated by the Java RegEx engine. This two-step approach allows for advanced RegEx features such as lookahead assertions, which HyperScan alone does not support.
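
The two-step approach can be sketched as follows. This is an illustration only: Python's re module stands in for both stages (the real engine uses HyperScan and Java RegEx), and the patterns and the function name two_stage_scan are hypothetical. The pre-filter pattern avoids lookahead, while the precise pattern uses it to reject implausible SSN-like values.

```python
import re

# Stage 1 stand-in: a cheap, lookahead-free pattern that over-matches.
PREFILTER = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Stage 2 stand-in: a stricter pattern that relies on negative lookahead
# to reject values such as an all-zero area, group, or serial number.
PRECISE = re.compile(r"(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}")


def two_stage_scan(text):
    """Return (offset, value) pairs that pass both stages."""
    hits = []
    for candidate in PREFILTER.finditer(text):
        # Only pre-filter candidates are evaluated by the precise pattern.
        if PRECISE.fullmatch(candidate.group()):
            hits.append((candidate.start(), candidate.group()))
    return hits


sample = "ok 123-45-6789, bad 000-12-3456, bad 123-00-4567, bad 912-34-5678"
print(two_stage_scan(sample))
```

The design point is that the expensive pattern runs only on the small set of candidate spans, not on the full text.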

Limits

Text data: Up to 16 MB scanned per document (enforced by the UDLP text extractor)
Matched detectors: Up to 1,000 detectors can be reported per scan
Matches per SmartID (detector): Capped at 30 per detector
Dictionary matches: Capped at 255 per dictionary
Find limit: 10x the per-detector match cap (10 x 30 = 300 find attempts per detector), which improves the chance of capturing 30 confirmed matches when some candidates are not confirmed

Behavior

The UDLP Engine scans the first 16 MB of document content. It independently caps matches at 30 per detector type, regardless of how many total instances exist in the document.

Example: If a document contains 900 SSNs and 500 email addresses, the engine reports 30 SSNs and 30 emails. If a document contains 10,000 SSNs followed by just 5 emails, it reports 30 SSNs and 5 emails.
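
The per-detector cap and the find limit can be sketched together. This is a rough model under stated assumptions: the name capped_matches, the validate callback, and the example patterns are hypothetical, and "confirmed" is assumed here to mean that a candidate passes some extra validation step after the regular expression matches.

```python
import re

MATCH_CAP = 30                # confirmed matches reported per detector
FIND_LIMIT = 10 * MATCH_CAP   # 300 find attempts per detector


def capped_matches(pattern, text, validate=lambda value: True):
    """Collect up to MATCH_CAP confirmed matches within FIND_LIMIT attempts.

    `validate` is a hypothetical hook standing in for whatever
    confirmation the real detector performs.
    """
    confirmed = []
    for attempts, match in enumerate(re.finditer(pattern, text), start=1):
        if attempts > FIND_LIMIT:
            break  # stop searching after 300 candidates
        if validate(match.group()):
            confirmed.append(match.group())
            if len(confirmed) == MATCH_CAP:
                break  # cap reached, no need to keep scanning
    return confirmed


SSN = r"\b\d{3}-\d{2}-\d{4}\b"
EMAIL = r"[\w.+-]+@[\w-]+\.[\w.]+"

doc = " ".join(f"123-45-{i:04d}" for i in range(900))
doc += " " + " ".join(f"user{i}@example.com" for i in range(5))

print(len(capped_matches(SSN, doc)))    # capped at 30
print(len(capped_matches(EMAIL, doc)))  # all 5 emails are reported
```

Because each detector is capped independently, a very common type cannot crowd out a rare one.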

Snippet Distribution

Snippets (the text excerpts surfaced alongside match results) are distributed evenly across all detectors that produce matches in a given scan.

Detectors with Matches	Snippets per Detector
1,000 detectors	1 snippet each
100 detectors	10 snippets each
10 detectors	Up to 100 snippets each (subject to overall cap)
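
The rows above are consistent with a fixed budget split evenly across matching detectors. The sketch below assumes a total budget of 1,000 snippets, which is inferred from the table and is not a documented value; the function name and the use of floor division for uneven splits are also assumptions.

```python
SNIPPET_BUDGET = 1000  # assumption: inferred from the table rows, not documented


def snippets_per_detector(detectors_with_matches):
    """Evenly split the assumed snippet budget across matching detectors."""
    if detectors_with_matches <= 0:
        return 0
    # Floor division is an assumption for counts that do not divide evenly.
    return SNIPPET_BUDGET // detectors_with_matches


for n in (1000, 100, 10):
    print(n, snippets_per_detector(n))
```

For 10 detectors this yields 100, which the table describes as an upper bound that is still subject to the overall cap.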

Pre-Filter Mode & SmartID 2.0

Current Architecture

In the current implementation, HyperScan acts as a pre-filter: it performs a fast initial scan to identify candidate text regions. Those regions are then passed to the Java RegEx engine for precise matching. This allows the use of advanced RegEx constructs (such as lookahead) that HyperScan does not natively support.

SmartID 2.0

With the release of SmartID 2.0, all regular expressions will be rewritten to be natively compatible with HyperScan. This removes the need for the two-stage approach, in which HyperScan pre-filters candidates and the Java RegEx engine confirms them. Key implications:

No pre-filter stage: HyperScan will handle all matching directly.
Improved performance: Removing the Java RegEx layer reduces processing overhead.
Full HyperScan compatibility required: All detectors must conform to HyperScan’s RegEx syntax.

Key Takeaways

Only the first 1 MB (NZ) or 16 MB (UDLP) of a document is scanned — content beyond these limits is not evaluated.
NZ Engine results are first-come, first-served across all types; a single prevalent type can exhaust the 1,000-entity limit.
UDLP Engine enforces per-type caps of 30 matches, ensuring multiple data types are represented in results even when one type is extremely common.
Snippet availability depends on how many detector types match — more matching types means fewer snippets per type.
SmartID 2.0 will simplify the UDLP architecture by removing the Java RegEx stage, so that HyperScan performs all matching directly instead of acting only as a pre-filter.