
PII Detection and Masking in RAG Pipelines


Introduction

In today’s data-driven world, safeguarding Personally Identifiable Information (PII) is paramount. PII covers data that can identify an individual, such as names, addresses, phone numbers, and financial records. As AI systems process ever larger volumes of data, they must protect PII while still using it to personalize experiences. Retrieval Augmented Generation (RAG) combines information retrieval with language generation models: a RAG system searches large data repositories for relevant information and uses it to make generated outputs more precise and better grounded in context.

However, using user data carries a risk of unintentional PII exposure. PII detection tools mitigate this risk by automatically identifying and masking sensitive data, so a RAG pipeline can use user data to offer tailored services while upholding privacy standards. This integration reflects the ongoing effort to balance personalization with user privacy as AI technology advances.

Learning Objectives

  • The article walks through building a PII detection tool with Llama Index and Presidio, Microsoft’s anonymization library.
  • Presidio detects and anonymizes sensitive personal data, offering customizable PII detection that combines techniques such as NER, regular expressions, and checksum algorithms.
  • Presidio’s flexible framework lets users customize the anonymization process.
  • Llama Index integrates Presidio’s functionality as a node post-processor, making it easy to use in a pipeline.
  • The article compares Presidio with Llama Index’s NER-based PII post-processor and shows where Presidio performs better.

This article was published as a part of the Data Science Blogathon.

Hands-on PII detection using Llama Index Post-processing tools

Let’s start our exploration with the NERPIINodePostprocessor tool from Llama Index. For that, we will need to install a few necessary packages.

The required packages, with the versions used here, are listed below:

llama-index==0.10.22
llama-index-agent-openai==0.1.7
llama-index-cli==0.1.11
llama-index-core==0.10.23
llama-index-indices-managed-llama-cloud==0.1.4
llama-index-legacy==0.9.48
llama-index-multi-modal-llms-openai==0.1.4
llama-index-postprocessor-presidio==0.1.1
llama-parse==0.3.9
llamaindex-py-client==0.1.13
presidio-analyzer==2.2.353
presidio-anonymizer==2.2.353
pydantic==2.5.3
pydantic_core==2.14.6
spacy==3.7.4
torch==2.2.1+cpu
transformers==4.39.1

To test the tool, we need dummy data containing PII. For this experiment, we use a handwritten text with fabricated names, a credit card number, a bank account number, an IP address, a URL, and an email address. Any text of your choice works for testing, and GPT can also be used to generate one. We will use the following text:

text = """
Hi there! You can call me Max Turner. Reach out at [email protected],
and you'll find me strolling the streets of Vienna. My plastic friend, the 
Mastercard, reads 5300-1234-5678-9000. Ever vibed at a gig by Zsofia Kovacs? 
I'm curious. As for my card, it has a limit I'd rather not disclose here; 
however, my bank details are as follows: AT611904300235473201. Turner is the 
family name. Tracing my roots, I've got ancestors named Leopold Turner and
Elisabeth Baumgartner. Also, a quick FYI: I tried to visit your website, but 
my IP (203.0.113.5) seems to be barred. I did, however, manage to post a 
visual at this link: http://MegaMovieMoments.fi.
"""

Step 1: Initializing the Tool and Importing Dependencies

With the packages installed and sample text prepared, we proceed to utilize the NERPIINodePostprocessor tool. Importing NERPIINodePostprocessor from Llama Index is necessary, along with importing the TextNode schema from Llama Index to create a text node. This step is crucial as NERPIINodePostprocessor operates on TextNode objects rather than raw strings.

Below is the code snippet for imports:

from pprint import pprint  # used later to print the masked text and metadata

from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.core.schema import NodeWithScore, TextNode

Step 2: Creating TextNode Objects

Following the imports, we proceed to create a TextNode object using our sample text.

text_node = TextNode(text=text)

Step 3: Post-processing Sensitive Entities

Subsequently, we create a NERPIINodePostprocessor object and apply it to our TextNode object to post-process and mask the sensitive entities.

processor = NERPIINodePostprocessor()

new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)

Step 4: Reviewing Post-Processed Text and PII Entity Mapping

After completing the post-processing of our text, we can now examine the post-processed text alongside the PII entity mapping.

pprint(new_nodes[0].node.get_content())

# OUTPUT
# 'Hi there! You can call me [PER_26]. Reach out at [email protected], '
# "and you'll find me strolling the streets of [LOC_122]. My plastic friend, "
# 'the [ORG_153], reads 5300-1234-5678-9000. Ever vibed at a gig by [PER_215]? '
# "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
# 'however, my bank details are as follows: AT611904300235473201. [PER_367] is '
# "the family name. Tracing my roots, I've got ancestors named Leopold "
# '[PER_367] and [PER_456]. Also, a quick FYI: I tried to visit your website, '
# 'but my IP (203.0.113.5) seems to be barred. I did, however, manage to post a '
# 'visual at this link: [ORG_627].fi.')

pprint(new_nodes[0].node.metadata)

# OUTPUT
# {'__pii_node_info__': {'[LOC_122]': 'Vienna',
#                        '[ORG_153]': 'Mastercard',
#                        '[ORG_627]': 'MegaMovieMoments',
#                        '[PER_215]': 'Zsofia Kovacs',
#                        '[PER_26]': 'Max Turner',
#                        '[PER_367]': 'Turner',
#                        '[PER_437]': 'Leopold Turner',
#                        '[PER_456]': 'Elisabeth Baumgartner'}}
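
Notice that the mapping contains '[PER_437]' for 'Leopold Turner', while the masked text reads 'Leopold [PER_367]'. One way such a mismatch can arise is sketched below, assuming masking is done by replacing every occurrence of each detected string in detection order. This is an illustration in plain Python, not the library's code; the sample sentence, the detections list, and the function name are hypothetical.

```python
text = "Call me Max Turner. Turner is the family name, as in Leopold Turner."

# Hypothetical detections in reading order: (entity_type, surface_text, start_offset).
detections = [
    ("PER", "Max Turner", text.index("Max Turner")),
    ("PER", "Turner", text.index("Turner is")),
    ("PER", "Leopold Turner", text.index("Leopold Turner")),
]


def mask_by_global_replace(text, detections):
    """Mask entities with [TYPE_offset] tags using plain string replacement."""
    masked, mapping = text, {}
    for entity_type, word, start in detections:
        tag = f"[{entity_type}_{start}]"
        # str.replace swaps EVERY remaining occurrence, including ones that
        # are part of a longer entity detected later in the text.
        masked = masked.replace(word, tag)
        mapping[tag] = word
    return masked, mapping


masked, mapping = mask_by_global_replace(text, detections)
print(masked)
print(mapping)
```

Here the standalone "Turner" is replaced inside "Leopold Turner" before the longer name is processed, so the longer name's tag ends up in the mapping but never in the text. Replacing by character offsets instead of by string value avoids this.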

Assessing the Limitations of NERPIINodePostprocessor and Introduction to Presidio

Upon reviewing the results, it’s evident that the postprocessor fails to mask highly sensitive entities such as the credit card number, the bank account number, the IP address, and the email address. This falls short of our goal of masking all sensitive entities, including names, locations, credit card numbers, and email addresses.

While the NERPIINodePostprocessor masks named entities such as person and organization names, tagging each with its entity type and a numeric suffix, it is inadequate for text containing structured sensitive values such as card numbers. Now that we understand what the NERPIINodePostprocessor does and where it falls short, let’s assess Presidio on the same text. We’ll explore Presidio’s own functionality first and then move on to Llama Index’s Presidio implementation.
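
Structured values such as card numbers and IP addresses follow fixed patterns, which is why pattern-based detection complements an NER model. The sketch below shows the idea with a few deliberately simple regular expressions. The patterns, labels, sample string, and example.com address are all illustrative assumptions and are far looser than production recognizers.

```python
import re

# Illustrative patterns only; real recognizers are stricter and add validation.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "IBAN_LIKE": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def find_structured_pii(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


sample = (
    "Mail max.turner@example.com, card 5300-1234-5678-9000, "
    "IBAN AT611904300235473201, IP 203.0.113.5."
)
hits = find_structured_pii(sample)
for hit in hits:
    print(hit)
```

A pattern match alone says nothing about context, so tools typically combine patterns with checksums, context words, and NER to keep false positives down.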


Importing Essential Packages for Presidio Integration

To begin, import the required packages: the AnalyzerEngine and AnonymizerEngine from Presidio, and the PresidioPIINodePostprocessor, which is Llama Index’s integration of Presidio.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from llama_index.postprocessor.presidio import PresidioPIINodePostprocessor

Initializing and Analyzing Text with the Analyzer Engine

Next, initialize the Analyzer Engine with the list of supported languages, here a list containing ‘en’ for English. This tells Presidio which languages the analyzer is set up to handle; the language of a given text is passed explicitly when calling analyze. Then use the analyzer instance to analyze the text.

analyzer = AnalyzerEngine(supported_languages=["en"])

results = analyzer.analyze(text=text, language="en")

Each result reports the detected PII entity type, its start and end index in the string, and a confidence score.
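
To make these fields concrete, here is a minimal sketch of how an entity type, a start and end index, and a score can drive masking. The Detection class is a hypothetical stand-in that only mirrors the fields described above, not Presidio's own result class, and the sample text, scores, and threshold are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def replace_spans(text, detections, min_score=0.5):
    """Replace each sufficiently confident span with <ENTITY_TYPE>."""
    kept = [d for d in detections if d.score >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    # Overlaps between kept spans are not handled in this sketch.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


sample = "Max Turner lives in Vienna; IP 203.0.113.5."


def span(entity_type, needle, score):
    """Build a Detection by locating `needle` in the sample text."""
    start = sample.index(needle)
    return Detection(entity_type, start, start + len(needle), score)


detections = [
    span("PERSON", "Max Turner", 0.85),
    span("LOCATION", "Vienna", 0.85),
    span("IP_ADDRESS", "203.0.113.5", 0.95),
    span("DATE_TIME", "203", 0.2),  # made-up low-score false positive
]

for d in detections:
    print(d.entity_type, d.start, d.end, d.score, repr(sample[d.start:d.end]))

masked_sample = replace_spans(sample, detections)
print(masked_sample)
```

Slicing the text with start and end recovers the matched value, and the score gives a simple way to drop weak detections before masking.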

Initializing the Anonymizer Engine

After initializing the Analyzer Engine, proceed to initialize the Anonymizer Engine. This component will anonymize the original text based on the results obtained from the Analyzer Engine.

engine = AnonymizerEngine()

new_text = engine.anonymize(text=text, analyzer_results=results)

Below is the output from the anonymizer engine, showcasing the original text with masked PII entities.

pprint(new_text.text)

# OUTPUT
#  "Hi there! You can call me <PERSON>. Reach out at <EMAIL_ADDRESS>, and you'll "
#  'find me strolling the streets of <LOCATION>. My plastic friend, the '
#  "<IN_PAN>, reads <IN_PAN>5678-9000. Ever vibed at a gig by <PERSON>? I'm "
#  "curious. As for my card, it has a limit I'd rather not disclose here; "
#  'however, my bank details are as follows: AT611904300235473201. <PERSON> is '
#  "the family name. Tracing my roots, I've got ancestors named <PERSON> and "
#  '<PERSON>. Also, a quick FYI: I tried to visit your website, but my IP '
#  '(<IP_ADDRESS>) seems to be barred. I did, however, manage to post a visual '
#  'at this link: <URL>.'
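
In the output above, the card number is only partly masked and is not labeled as a credit card. One plausible reason, assuming the credit card recognizer validates candidates with the Luhn checksum (checksums are among the techniques Presidio uses), is that our fabricated number does not pass that check. The sketch below is a plain-Python Luhn implementation for illustration, not Presidio's code. The second number is a widely used test card number, not taken from the article.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Starting from the rightmost digit, double every second digit and
    # subtract 9 whenever the doubled value exceeds 9.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("5300-1234-5678-9000"))  # the fabricated number from our text
print(luhn_valid("4111 1111 1111 1111"))  # a common test card number
```

The first call prints False and the second prints True, so a checksum-validating recognizer would be expected to reject our sample number. Using a Luhn-valid dummy number in test data is a simple way to exercise that recognizer.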


Analyzing PII Masking with Presidio

Presidio masks the PII entities it detects by replacing each with its entity type enclosed in ‘<’ and ‘>’. Detection is not perfect in this run: the card number is only partially masked, under the IN_PAN type, and the bank account number is left untouched. The masking also lacks unique identifiers for individual entities. This is where the Llama Index integration helps. Its Presidio implementation not only returns the masked text with numbered entity types but also provides a deanonymizer map for deanonymization. Let’s explore how to use these features.

First, create a TextNode object using the input text.

text_node = TextNode(text=text)

Next, create an instance of PresidioPIINodePostprocessor and run the postprocessor on the TextNode.

processor = PresidioPIINodePostprocessor()

new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)

Finally, we get the masked text from the anonymizer along with the deanonymizer map.

pprint(new_nodes[0].node.get_content())

# OUTPUT
#  'Hi there! You can call me <PERSON_5>. Reach out at <EMAIL_ADDRESS_1>, and '
#  "you'll find me strolling the streets of <LOCATION_1>. My plastic friend, the "
#  '<IN_PAN_2>, reads <IN_PAN_1>5678-9000. Ever vibed at a gig by <PERSON_4>? '
#  "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
#  'however, my bank details are as follows: AT611904300235473201. <PERSON_3> is '
#  "the family name. Tracing my roots, I've got ancestors named <PERSON_2> and "
#  '<PERSON_1>. Also, a quick FYI: I tried to visit your website, but my IP '
#  '(<IP_ADDRESS_1>) seems to be barred. I did, however, manage to post a visual '
#  'at this link: <URL_1>.'


pprint(new_nodes[0].node.metadata)

# OUTPUT
# {'__pii_node_info__': {'<EMAIL_ADDRESS_1>': '[email protected]',
#                        '<IN_PAN_1>': '5300-1234-',
#                        '<IN_PAN_2>': 'Mastercard',
#                        '<IP_ADDRESS_1>': '203.0.113.5',
#                        '<LOCATION_1>': 'Vienna',
#                        '<PERSON_1>': 'Elisabeth Baumgartner',
#                        '<PERSON_2>': 'Leopold Turner',
#                        '<PERSON_3>': 'Turner',
#                        '<PERSON_4>': 'Zsofia Kovacs',
#                        '<PERSON_5>': 'Max Turner',
#                        '<URL_1>': 'MegaMovieMoments.fi'}}

The masked text generated by PresidioPIINodePostprocessor replaces each detected PII entity with a tag showing its entity type and a count. It also provides a deanonymizer map, which makes it possible to deanonymize the masked text later.
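
To show how such a map can be used, here is a minimal sketch of deanonymization in plain Python. It assumes placeholders of the form <TYPE_n>, as in the output above. The deanonymize function and the placeholder pattern are our own illustration rather than part of Llama Index or Presidio, and the map is a small subset of the one shown.

```python
import re

# A subset of the deanonymizer map shown above.
deanonymizer_map = {
    "<PERSON_5>": "Max Turner",
    "<LOCATION_1>": "Vienna",
    "<IP_ADDRESS_1>": "203.0.113.5",
}

# Matches tags such as <PERSON_5> or <IP_ADDRESS_1>.
PLACEHOLDER = re.compile(r"<[A-Z_]+_\d+>")


def deanonymize(masked_text, mapping):
    """Swap known placeholders back to their values; leave unknown ones as-is."""
    return PLACEHOLDER.sub(
        lambda match: mapping.get(match.group(), match.group()), masked_text
    )


masked = "You can call me <PERSON_5> from <LOCATION_1>; my IP is <IP_ADDRESS_1>. See <URL_1>."
restored = deanonymize(masked, deanonymizer_map)
print(restored)
```

Placeholders missing from the map, such as <URL_1> here, stay masked, which is a safe default when only part of the map is available to the caller.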

Applications and Limitations

By leveraging the PresidioPIINodePostprocessor tool, we can seamlessly anonymize information within our RAG pipeline, prioritizing user data privacy. Within the RAG pipeline, it can serve as a data anonymizer during data ingestion, effectively masking sensitive information. Similarly, in the query pipeline, it can function as a deanonymizer, allowing authenticated users to access sensitive information while maintaining privacy. The deanonymizer map can be securely stored in a protected location, ensuring the confidentiality of sensitive data throughout the process.


The PII anonymizer tool is useful in RAG pipelines that deal with financial documents or sensitive user or organization information that must be protected from unauthorized access. Because only anonymized document contents are stored in the vector store, a breach of the store does not expose the original values. It is also valuable in RAG pipelines built on organizational or personal emails, where sensitive data like addresses, password change URLs, and OTPs are common and should be ingested in an anonymized state.

Limitations

While the PII detection tool can be useful in RAG pipelines, there are some limitations to implementing it in a RAG pipeline.

  • Adding PII detection and masking can introduce additional processing time to the RAG pipeline, which may impact the overall performance and latency of the system, especially with large datasets or when real-time processing is required.
  • No PII detection tool is perfect; there can be instances of false positives, where non-PII data is mistakenly masked, or false negatives, where actual PII is not detected. Both scenarios can have implications for user experience and data protection efficacy.
  • Presidio may have limitations in understanding context and nuances across different languages, potentially reducing its effectiveness in accurately identifying PII in multilingual datasets.
  • While the PII anonymization tool can mask sensitive information accurately, the initial ingestion of data still requires careful handling. If a breach occurs before the data is anonymized, sensitive information could be exposed.
  • In cases where anonymization needs to be reversible, maintaining secure and controlled access to deanonymization keys or maps is critical, and failure to do so could compromise the integrity of the anonymization process.
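
False positives and false negatives can be quantified before a detector is trusted in a pipeline. The sketch below computes precision and recall by exact match on (entity type, text) pairs. The predicted set is taken from the Presidio deanonymizer map shown earlier, restricted to a few entities, while the expected set is our own hand-made labeling, assumed here for illustration only.

```python
def detection_metrics(predicted, expected):
    """Precision and recall over exact (entity_type, text) matches."""
    predicted, expected = set(predicted), set(expected)
    true_pos = len(predicted & expected)
    false_pos = len(predicted - expected)   # masked, but not what we labeled
    false_neg = len(expected - predicted)   # labeled PII the detector missed
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hand-made labels for part of the sample text (an assumption for illustration).
expected = [
    ("PERSON", "Max Turner"),
    ("LOCATION", "Vienna"),
    ("CREDIT_CARD", "5300-1234-5678-9000"),
    ("IBAN_CODE", "AT611904300235473201"),
    ("IP_ADDRESS", "203.0.113.5"),
]

# Matching entries from the deanonymizer map shown earlier.
predicted = [
    ("PERSON", "Max Turner"),
    ("LOCATION", "Vienna"),
    ("IN_PAN", "Mastercard"),
    ("IN_PAN", "5300-1234-"),
    ("IP_ADDRESS", "203.0.113.5"),
]

metrics = detection_metrics(predicted, expected)
print(metrics)
```

Exact matching is strict: a partially masked card number counts as both a false positive and a false negative. A more forgiving evaluation could score overlapping spans instead.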

Conclusion

In conclusion, the incorporation of PII detection and masking tools like Presidio into RAG pipelines marks a notable stride in AI’s capacity to handle sensitive data while upholding individual privacy. Through the utilization of advanced techniques and customizable features, Presidio elevates the security and adaptability of text generation, meeting the escalating need for data privacy in the digital era. Despite potential challenges such as latency and accuracy, the advantages of safeguarding user data with sophisticated anonymization tools are undeniable, positioning it as a crucial element for responsible AI development and deployment.

Key Takeaways

  • With the increasing use of AI and big data, the need to protect Personally Identifiable Information (PII) in any system that processes user data is critical.
  • Retrieval Augmented Generation (RAG) systems, which combine information retrieval with language generation, can potentially expose PII. Therefore, incorporating PII detection and masking mechanisms is essential to maintain privacy standards.
  • Microsoft’s Presidio offers robust PII detection and anonymization capabilities, making it a suitable choice for integrating into RAG pipelines. It provides predefined and customizable PII detectors, leveraging NER, Regular Expressions, and checksum.
  • Presidio is preferred over basic NER PII post-processing tools due to its sophisticated anonymization features, flexibility, and higher accuracy in detecting a wide range of PII entities.
  • The PII anonymization tool is particularly useful in RAG pipelines dealing with financial documents, sensitive organizational data, and emails, ensuring that private information is not exposed to unauthorized users.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.