DSPM Scanning Limits and Constraints
Understanding how DSPM scanning engines process and report sensitive data
Overview
This document describes the processing limits and constraints applied by the DLP scanning engines. Understanding these limits helps set accurate expectations for how sensitive data is detected, reported, and surfaced within your environment.
Two scanning engines are used depending on the deployment context: the Normalyze (NZ) Engine (HyperScan-based) and the UDLP Engine (Java RegEx-based). Each has distinct limits for data volume, match counts, and result reporting.
Quick Reference
| Limit | NZ Engine | UDLP Engine |
|---|---|---|
| Maximum data scanned | 1 MB per scan | 16 MB per scan |
| Maximum total matches | 1,000 entities | 1,000 matched detectors |
| Maximum matches per type | Not independently capped | 30 per detector (SmartID) |
| Dictionary match limit | N/A | 255 per dictionary |
| Maximum snippets captured | 20 per entity type | Distributed across detectors |
| Scan technology | HyperScan (C++) | Java RegEx with HyperScan pre-filter |
NZ Engine
How It Works
The NZ Engine uses HyperScan, a high-performance regular expression library, to scan text data. It processes content in chunks and applies limits to both the volume of data scanned and the number of results returned.
Limits
- Text chunk size: 1 MB — only the first 1 MB of content is scanned
- Maximum results: 1,000 entities across all sensitive data types combined
- Snippet capture: Up to 20 snippets per entity type, drawn from the 1,000-entity result pool
Behavior
The NZ Engine scans the first 1 MB of text data and returns the first 1,000 matching entities found, regardless of type. Results are reported in the order they appear in the content.
Example: If the first 1,000 entities found are all dates, only dates will be reported — even if other sensitive data types (such as SSNs or emails) appear later in the document.
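The first-come, first-served behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's implementation: Python's `re` module stands in for HyperScan, the two patterns and the `nz_style_scan` name are hypothetical, and characters are treated as bytes when applying the 1 MB window.

```python
import re
from collections import defaultdict

MAX_SCAN_CHARS = 1 * 1024 * 1024   # 1 MB scan window (characters used as a stand-in for bytes)
MAX_ENTITIES = 1000                # total entities across all types
MAX_SNIPPETS_PER_TYPE = 20         # snippets kept per entity type

# Hypothetical detectors for illustration only; not the product's real patterns.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def nz_style_scan(text):
    """Return (counts, snippets) for the first 1,000 entities in document order."""
    window = text[:MAX_SCAN_CHARS]           # content past the window is never examined
    hits = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(window):
            hits.append((m.start(), name, m.group()))
    hits.sort()                              # order of appearance in the content
    kept = hits[:MAX_ENTITIES]               # shared pool, regardless of type

    counts = defaultdict(int)
    snippets = defaultdict(list)
    for _, name, value in kept:
        counts[name] += 1
        if len(snippets[name]) < MAX_SNIPPETS_PER_TYPE:
            snippets[name].append(value)
    return dict(counts), dict(snippets)

# 1,200 dates followed by 5 SSNs: the dates use up the whole entity pool.
doc = "2024-01-15 " * 1200 + "123-45-6789 " * 5
counts, snippets = nz_style_scan(doc)
print(counts)                    # {'date': 1000}
print(len(snippets["date"]))     # 20
```

The SSNs at the end of the sample document sit inside the scan window but are never reported, because the shared 1,000-entity pool is already full.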
UDLP Engine
How It Works
The UDLP Engine uses a Java-based regular expression library with a HyperScan pre-filter. HyperScan first scans text to identify potential match candidates; only those candidates are then evaluated by the Java RegEx engine. This two-step approach allows for advanced RegEx features such as lookahead assertions, which HyperScan alone does not support.
Limits
- Text data: Up to 16 MB scanned per document (enforced by the UDLP text extractor)
- Matched detectors: Up to 1,000 detectors can be reported per scan
- Matches per SmartID (detector): Capped at 30 per detector
- Dictionary matches: Capped at 255 per dictionary
- Find limit: 10x the per-detector match cap (300 find attempts per detector), which leaves headroom for rejected candidates so the engine can still collect up to 30 confirmed matches
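The interaction between the match cap and the find limit can be sketched as follows. This is a rough model, assuming that each candidate evaluated counts as one find attempt; how the engine actually counts attempts is not stated here. The `confirmed_matches` function, the candidate pattern, and the `plausible_ssn` check are all hypothetical.

```python
import re

MATCH_CAP = 30                  # confirmed matches kept per detector
FIND_LIMIT = 10 * MATCH_CAP     # 300 find attempts per detector

# Hypothetical candidate pattern and validation rule, for illustration only.
SSN_CANDIDATE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def plausible_ssn(value):
    """Illustrative check: reject candidates whose first group is 000."""
    return not value.startswith("000")

def confirmed_matches(text, candidate_re, is_valid):
    """Collect up to MATCH_CAP confirmed matches within FIND_LIMIT attempts."""
    confirmed = []
    attempts = 0
    for m in candidate_re.finditer(text):
        if attempts >= FIND_LIMIT or len(confirmed) >= MATCH_CAP:
            break
        attempts += 1
        if is_valid(m.group()):
            confirmed.append(m.group())
    return confirmed, attempts

# Alternating rejected and valid candidates: the cap is reached after 60 attempts.
mixed = "000-12-3456 123-45-6789 " * 100
found, tried = confirmed_matches(mixed, SSN_CANDIDATE, plausible_ssn)
print(len(found), tried)        # 30 60

# 400 rejected candidates first: the find limit is exhausted before any valid one.
skewed = "000-12-3456 " * 400 + "123-45-6789 " * 50
found2, tried2 = confirmed_matches(skewed, SSN_CANDIDATE, plausible_ssn)
print(len(found2), tried2)      # 0 300
```

The second case shows why the limit only maximizes the chance of reaching 30 matches: a long run of rejected candidates can use up all 300 attempts.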
Behavior
The UDLP Engine scans the first 16 MB of document content. It independently caps matches at 30 per detector type, regardless of how many total instances exist in the document.
Example: If a document contains 900 SSNs and 500 email addresses, the engine reports 30 SSNs and 30 emails. If a document contains 10,000 SSNs followed by just 5 emails, it reports 30 SSNs and 5 emails.
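The arithmetic in this example amounts to taking the smaller of the number found and the cap, independently for each detector. A minimal sketch, with `reported_counts` as a hypothetical helper name:

```python
MATCH_CAP_PER_DETECTOR = 30

def reported_counts(found):
    """Cap each detector's count independently of every other detector."""
    return {name: min(n, MATCH_CAP_PER_DETECTOR) for name, n in found.items()}

print(reported_counts({"ssn": 900, "email": 500}))     # {'ssn': 30, 'email': 30}
print(reported_counts({"ssn": 10_000, "email": 5}))    # {'ssn': 30, 'email': 5}
```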
Snippet Distribution
Snippets (the text excerpts surfaced alongside match results) are distributed evenly across all detectors that produce matches in a given scan.
| Detectors with Matches | Snippets per Detector |
|---|---|
| 1,000 detectors | 1 snippet each |
| 100 detectors | 10 snippets each |
| 10 detectors | Up to 100 snippets each (subject to overall cap) |
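The rows above are consistent with an even split of a fixed snippet budget. The sketch below assumes a budget of 1,000 snippets per scan, a figure inferred from the table rather than stated as a limit, and uses integer division. The `snippets_per_detector` name is hypothetical, and the result is an upper bound because the other caps still apply.

```python
SNIPPET_BUDGET = 1000   # assumption: inferred from the table rows, not a documented constant

def snippets_per_detector(detectors_with_matches, budget=SNIPPET_BUDGET):
    """Upper bound on snippets per detector under an even split."""
    if detectors_with_matches <= 0:
        return 0
    return budget // detectors_with_matches

for n in (1000, 100, 10):
    print(n, snippets_per_detector(n))
# 1000 1
# 100 10
# 10 100
```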
Pre-Filter Mode & SmartID 2.0
Current Architecture
In the current implementation, HyperScan acts as a pre-filter: it performs a fast initial scan to identify candidate text regions. Those regions are then passed to the Java RegEx engine for precise matching. This allows the use of advanced RegEx constructs (such as lookahead) that HyperScan does not natively support.
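The two-stage flow can be illustrated with a small sketch. Python's `re` module stands in for both stages here: the first pattern avoids lookaround, as a pre-filter pattern would, and the second uses a negative lookahead to play the role of the Java RegEx confirmation step. Both patterns and the `two_stage_scan` name are hypothetical and are not the product's detectors.

```python
import re

# Stage 1 stand-in: a lookaround-free pattern that cheaply flags candidate regions.
PREFILTER = re.compile(r"\d{3}-\d{2}-\d{4}")

# Stage 2 stand-in: the precise pattern, using a negative lookahead to reject
# candidates whose first group is 000 or 666 (an illustrative rule only).
CONFIRM = re.compile(r"(?!000|666)\d{3}-\d{2}-\d{4}")

def two_stage_scan(text):
    """Return (offset, value) for candidates that pass the confirmation stage."""
    results = []
    for candidate in PREFILTER.finditer(text):
        # Only candidate regions reach the more expensive second stage.
        if CONFIRM.fullmatch(candidate.group()):
            results.append((candidate.start(), candidate.group()))
    return results

sample = "a 123-45-6789 b 000-12-3456 c 666-12-3456"
print(two_stage_scan(sample))   # [(2, '123-45-6789')]
```

The pre-filter flags all three candidates in the sample, and the confirmation stage keeps only the one that passes the lookahead.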
SmartID 2.0
With the release of SmartID 2.0, all regular expressions will be rewritten to be natively compatible with HyperScan. This eliminates the need for the two-stage design, in which HyperScan pre-filters candidates and Java RegEx confirms them. Key implications:
- No pre-filter stage: HyperScan will handle all matching directly.
- Improved performance: Removing the Java RegEx layer reduces processing overhead.
- Full HyperScan compatibility required: All detectors must conform to HyperScan’s RegEx syntax.
Key Takeaways
- Only the first 1 MB (NZ) or 16 MB (UDLP) of a document is scanned — content beyond these limits is not evaluated.
- NZ Engine results are first-come, first-served across all types; a single prevalent type can exhaust the 1,000-entity limit.
- UDLP Engine enforces per-type caps of 30 matches, ensuring multiple data types are represented in results even when one type is extremely common.
- Snippet availability depends on how many detector types match — more matching types means fewer snippets per type.
- SmartID 2.0 will simplify the UDLP architecture by removing the Java RegEx stage that currently follows the HyperScan pre-filter.