Data Scan

Find the risks in your data

After a cloud account is onboarded and its data stores are discovered, the most critical step follows: gaining visibility into your data. The risks associated with data are at the core of DSPM, and the Data Scan operation is what identifies them.

The Data Scan crawls the data stores associated with the cloud accounts that are onboarded on the platform and scans their data to find the potential risks associated with each store. After both the crawl and the scan are completed, each data store is tagged with the appropriate values under the Entities, Profiles, and Data Classification columns.

This operation can be run in two ways: UI based or command line based. The command line option is used when the data stores either run on compute instances such as EC2 or Azure VM, or do not expose public hosts, such as AWS Redis and AWS Memcache.

Note: In AWS, the Normalyze Data Scanner Lambda requires 8 GB of memory (this is the default deployment configuration). New AWS accounts have reduced concurrency and memory quotas, which may cause the Lambda to deploy with less memory than this amount and, in turn, cause Data Scans to fail. See AWS’s documentation on limits for more details: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html

Please ensure your AWS account does not have quota limitations on Lambda functions. If an increase is required, see AWS’s documentation on increasing quotas:

https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html
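The memory requirement in the note above can be verified after deployment. The following is a minimal sketch, not part of the Normalyze tooling: it assumes that "8 GB" corresponds to a Lambda MemorySize of 8192 MB, and the function name passed to check_deployed_function is a placeholder you would replace with the name of the scanner Lambda in your account. The boto3 call used is get_function_configuration, which reports MemorySize in MB.

```python
# Assumption: "8 GB" in the note is taken to mean 8192 MB of Lambda memory.
REQUIRED_MEMORY_MB = 8 * 1024


def has_required_memory(function_config: dict,
                        required_mb: int = REQUIRED_MEMORY_MB) -> bool:
    """Return True if a Lambda configuration reports at least `required_mb` MB."""
    # A missing MemorySize key is treated as 0, i.e. not enough memory.
    return int(function_config.get("MemorySize", 0)) >= required_mb


def check_deployed_function(function_name: str, region: str) -> bool:
    """Fetch the live Lambda configuration and check its memory setting.

    `function_name` is hypothetical here; use the name of the deployed
    scanner Lambda in your own account.
    """
    import boto3  # imported lazily so has_required_memory works without boto3

    client = boto3.client("lambda", region_name=region)
    config = client.get_function_configuration(FunctionName=function_name)
    return has_required_memory(config)
```

If check_deployed_function returns False, the Lambda was likely deployed under a reduced quota, and a quota increase is the next step.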

Details for each of these scan types (UI based and command line based) are explained in the following sections.

Incremental Data Scan:

In the default Data Scan operation, each time a scan is run, all the objects in its list are scanned, even if some of them were already scanned moments ago. To optimize the scan and eliminate this redundancy, the “Incremental Scan” feature is available: the Data Scan logic automatically identifies the objects that were added or modified recently and scans only those.
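The difference between a full scan and an incremental scan can be illustrated with a short sketch. This is only an illustration of the idea, not the Normalyze implementation: the StoredObject type, the select_for_scan function, and the use of a last-modified timestamp as the change signal are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical record for one object in a data store."""
    key: str
    last_modified: datetime


def select_for_scan(objects: Iterable[StoredObject],
                    last_scan: Optional[datetime]) -> List[StoredObject]:
    """Pick the objects a scan run should process.

    With no previous scan (last_scan is None) every object is selected,
    which matches a full scan. Otherwise only objects added or modified
    after the previous scan are selected.
    """
    if last_scan is None:
        return list(objects)
    return [obj for obj in objects if obj.last_modified > last_scan]


# Example: only "new.csv" changed after the previous scan.
previous_scan = datetime(2024, 1, 10, tzinfo=timezone.utc)
inventory = [
    StoredObject("old.csv", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    StoredObject("new.csv", datetime(2024, 1, 12, tzinfo=timezone.utc)),
]
changed = select_for_scan(inventory, previous_scan)
```

Here changed contains only the entry for new.csv, while passing None as last_scan would return both objects.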

The feature is currently available for AWS S3 data stores and for Microsoft SharePoint and OneDrive data stores. To use it, update the relevant Scan Profiles and enable the option to scan incrementally.

For AWS S3 buckets, the precondition for this feature is a CloudTrail trail with Data Events enabled for the S3 buckets, and the trail must log Write events. The Normalyze scan resource looks up CloudTrail, identifies the files that have been added or modified, and runs the scan operation only on those items the next time the scan task is run.

More details on how to set up CloudTrail for Data Events are here.
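The S3 precondition can be checked against a trail's event selectors. The sketch below is an assumption-laden helper, not a Normalyze utility: logs_s3_write_data_events is a hypothetical name, it only inspects basic event selectors (trails configured with advanced event selectors need a separate check), and it does not verify that a specific bucket is covered. The input has the shape of the EventSelectors list returned by the CloudTrail get_event_selectors API.

```python
from typing import Iterable, Mapping

# ReadWriteType values under which write events are logged.
WRITE_TYPES = {"WriteOnly", "All"}


def logs_s3_write_data_events(event_selectors: Iterable[Mapping]) -> bool:
    """Return True if any basic event selector logs S3 object write events."""
    for selector in event_selectors:
        # CloudTrail treats an unspecified ReadWriteType as "All".
        if selector.get("ReadWriteType", "All") not in WRITE_TYPES:
            continue
        for resource in selector.get("DataResources", []):
            # S3 data events are declared with the AWS::S3::Object type
            # and a non-empty list of bucket or prefix ARNs.
            if resource.get("Type") == "AWS::S3::Object" and resource.get("Values"):
                return True
    return False


# Example selector in the shape returned by get_event_selectors;
# the bucket name is a placeholder.
example_selectors = [
    {
        "ReadWriteType": "WriteOnly",
        "IncludeManagementEvents": False,
        "DataResources": [
            {"Type": "AWS::S3::Object",
             "Values": ["arn:aws:s3:::example-bucket/"]},
        ],
    }
]
```

With boto3, the live selectors for a trail can be fetched with boto3.client("cloudtrail").get_event_selectors(TrailName=...) and the EventSelectors entry of the response passed to this function.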

For Microsoft SharePoint and OneDrive data stores, the Incremental Scan option should be enabled only after a full data scan has completed at least once. Once it is enabled, the Normalyze resource invokes the scan operation for a batch of the files that were added or modified in either of these data stores.

The steps to set up Incremental Data Scan for S3 buckets and Microsoft SharePoint / OneDrive are detailed in the Scan Scheduler section.