Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. The service can extract people, places, sentiments, and topics in unstructured data. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in application logs, customer emails, support tickets, and more. No ML experience required. Redacting PII entities helps you protect privacy and comply with local laws and regulations.
Use case: Applications printing PII data in log output
Some applications print PII data in their log output inadvertently. In some cases, this may be due to developers forgetting to remove debug statements before deploying the application in production, and in other cases it may be due to legacy applications that are handed down and are difficult to update. PII can also get printed in stack traces. It’s generally a mistake to have PII present in such logs. Correlation IDs and primary keys are better identifiers than PII when debugging applications.
PII in application logs can quickly propagate to downstream systems, compounding security concerns. For example, it may be sent to search and analytics systems, where it’s searchable and viewable by everyone with access. It may also be stored in object storage such as Amazon Simple Storage Service (Amazon S3) for analytics purposes. With the PII detection API of Amazon Comprehend, you can remove PII from application log output before the log statement is even printed.
In this post, I take the use case of a Java application that is generating log output with PII. The initial log output goes through filter-like processing that redacts PII before the log statement is output by the application. You can take a similar approach for other programming languages.
The application can be repackaged by changing its log configuration file, such as log4j.xml, and adding one Java class from this sample project, or adding this Java class as a dependency in the form of a .jar file.
The sample application is available in the following GitHub repo.
PII entity types
The following table lists some of the entity types Amazon Comprehend detects.
|PII Entity Types||Description|
|EMAIL||An email address.|
|NAME||An individual’s name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the “John Doe Organization” as an organization, and it recognizes “Jane Doe Street” as an address.|
|PHONE||A phone number. This entity type also includes fax and pager numbers.|
|SSN||A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last 4 digits are present.|
For the full list, see Detect Personally Identifiable Information (PII).
The API response from Amazon Comprehend includes the entity type, its begin offset, end offset, and a confidence score. For this post, we use all of them.
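As a minimal sketch of how those fields can drive redaction, the following standalone Java replaces each detected span with its type name. The `PiiSpan` record, the `OffsetRedactor` class, and the bracketed placeholder format are hypothetical stand-ins for illustration, not the SDK's response types or the sample project's code. The end offset is treated as exclusive, and spans are assumed not to overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one detected entity: type, character offsets, confidence score.
record PiiSpan(String type, int beginOffset, int endOffset, double score) {}

class OffsetRedactor {
    // Replaces each span with "[TYPE]" (an assumed placeholder format).
    static String replaceWithTypeNames(String text, List<PiiSpan> spans) {
        List<PiiSpan> ordered = new ArrayList<>(spans);
        // Work right to left so replacing one span does not shift the offsets of the others.
        ordered.sort(Comparator.comparingInt(PiiSpan::beginOffset).reversed());
        StringBuilder out = new StringBuilder(text);
        for (PiiSpan span : ordered) {
            out.replace(span.beginOffset(), span.endOffset(), "[" + span.type() + "]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "User Jane, SSN 123-45-6789"; // made-up sample data
        List<PiiSpan> spans = List.of(
                new PiiSpan("NAME", 5, 9, 0.99),
                new PiiSpan("SSN", 15, 26, 0.98));
        System.out.println(replaceWithTypeNames(line, spans));
        // prints "User [NAME], SSN [SSN]"
    }
}
```

Applying the spans from the end of the string backward matters because a placeholder is rarely the same length as the text it replaces.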
Our example application is a very simple application that simulates opening a bank account for a user. In its current form, the log output looks like the following code. We can see this by making requests to the endpoint
The output prints
Log4j 2 is a common Java library used for logging.
Appenders in Log4j are responsible for delivering log events to their destinations, which can be console, file, and more. Log4j also has a RewriteAppender that lets you rewrite the log message before it is output. RewriteAppender works in conjunction with a RewritePolicy that provides the implementation for changing the log output.
The sample application uses the following log4j.xml file for log configuration:
The RewritePolicy we created for this project is named SensitiveDataPolicy. It uses four parameters:
- maskMode – This parameter has two modes:
  - REPLACE – The policy replaces discovered entities with their type names. For example, a social security number is replaced with the type name SSN.
  - MASK – The policy replaces the discovered entity with a string consisting of the character provided as the mask parameter.
- mask – The character used to replace the discovered entity. Only relevant if maskMode is MASK.
- minScore – The minimum confidence score acceptable to us.
- entitiesToReplace – A comma-separated list of entity type names that we want to replace. For example, we’re choosing to replace social security number and email, so the string value we provide is SSN,EMAIL. Amazon Comprehend also detects NAME in our application, but it’s printed as is.
Choosing redaction vs. masking is a matter of preference. Redaction is usually preferred when the context needs to be preserved, such as in natural text, whereas masking is best for maintaining text length as well as structured data such as formatted files or key-value pairs.
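The way minScore and entitiesToReplace narrow what gets masked can be sketched in plain Java. This is an illustration under assumptions, not the SensitiveDataPolicy source: the `Detected` record and `MaskingSketch` class are hypothetical names, the end offset is treated as exclusive, and the sample line is made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a detected entity; not the SDK's own type.
record Detected(String type, int begin, int end, double score) {}

class MaskingSketch {
    // Masks every entity that is listed in entitiesToReplace and meets minScore.
    // Each character is overwritten in place, so the text length is unchanged.
    static String mask(String text, List<Detected> found,
                       String entitiesToReplace, double minScore, char mask) {
        Set<String> wanted = new HashSet<>(Arrays.asList(entitiesToReplace.split(",")));
        StringBuilder out = new StringBuilder(text);
        for (Detected d : found) {
            if (d.score() < minScore || !wanted.contains(d.type())) {
                continue; // low confidence, or a type we chose to leave visible
            }
            for (int i = d.begin(); i < d.end(); i++) {
                out.setCharAt(i, mask);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "SSN 123-45-6789 for Jane";
        List<Detected> found = List.of(
                new Detected("SSN", 4, 15, 0.99),
                new Detected("NAME", 20, 24, 0.98));
        System.out.println(mask(line, found, "SSN,EMAIL", 0.5, '*'));
        // prints "SSN *********** for Jane"
    }
}
```

Because NAME is not in the list, it passes through untouched, and because masking keeps every character position, downstream parsers that rely on fixed-width or key-value layouts still see the same structure.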
Detecting PII is as simple as making an API call to Amazon Comprehend using the AWS SDK and providing the text to analyze:
Because our policy makes synchronous calls to Amazon Comprehend for PII detection, we want this processing to happen asynchronously, outside the customer request loop, to avoid adding latency to requests. For instructions, see Asynchronous Loggers for Low-Latency Logging. We add the Disruptor library to our classpath by adding it to
We also need to set a system property. After we package our application with mvn package, we can run it as in the following code:
Updated log output
The log output from this application now looks like the following. We can see that
We learned how to use Amazon Comprehend to redact sensitive data natively within next-generation applications. For information about applying it as a postprocessing technique for logs in storage, see Detecting and redacting PII using Amazon Comprehend. The API lets you have complete control over the entities that are important for your use case and lets you either mask or redact the information.
For more information about Amazon Comprehend availability and quotas, see Amazon Comprehend endpoints and quotas.
About the Author
Pradeep Singh is a Solutions Architect at Amazon Web Services. He helps AWS customers take advantage of AWS services to design scalable and secure applications. His expertise spans Application Architecture, Containers, Analytics and Machine Learning.