Scenario: PII Is Appearing in Your Logs

The Scenario

Your quarterly security audit returns a critical finding: patient names, date-of-birth values, and medication dosages are appearing in your application's structured logs in Azure Monitor. A log entry looks like this:

JSON
{
  "timestamp": "2026-05-15T09:23:14Z",
  "level": "INFO",
  "service": "rag-api",
  "request_id": "a7f3b29c",
  "user_query": "Can John Smith, born 03/04/1985, take 150mg of sertraline with metformin?",
  "llm_response": "John Smith should consult their doctor before combining sertraline and metformin...",
  "user_id": "user-7821"
}

This is a HIPAA violation waiting to happen. Patient names and medication details are Protected Health Information (PHI). They have no business being in application logs that are retained for 90 days and accessible to your entire engineering team.

Why This Happens

The root cause is almost always the same pattern: a developer added request/response logging to help with debugging during development, and nobody reviewed it before the code reached production.

Python
# The dangerous anti-pattern:
@app.post("/query")
async def query_endpoint(body: QueryRequest):
    logger.info(f"Processing query: {body.query}")  # BUG: logs PII directly
    response = await process_query(body.query)
    logger.info(f"Generated response: {response}")  # BUG: logs PII in response
    return {"answer": response}

This seems innocent during development when all queries are "what is the capital of France?" It becomes catastrophic when real users start asking health questions.
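
One way to catch this class of mistake before it ships is a guard at the logging layer itself. The sketch below assumes the service logs dict-shaped messages (as the later examples in this lesson do) and uses a hypothetical deny-list of field names. `ContentFieldFilter`, `FORBIDDEN_FIELDS`, and `ListHandler` are illustrative names, not part of any library.

```python
import logging

# Hypothetical deny-list: field names that must never reach the log stream.
FORBIDDEN_FIELDS = {"user_query", "llm_response", "query", "response"}

class ContentFieldFilter(logging.Filter):
    """Replace the value of any forbidden field in a dict-shaped log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.msg, dict):
            record.msg = {
                key: "[REDACTED]" if key in FORBIDDEN_FIELDS else value
                for key, value in record.msg.items()
            }
        return True  # keep the record, minus the content fields

class ListHandler(logging.Handler):
    """Collects records in memory so the effect of the filter can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)

demo_logger = logging.getLogger("rag.filter_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_handler = ListHandler()
demo_handler.addFilter(ContentFieldFilter())
demo_logger.addHandler(demo_handler)

demo_logger.info({"request_id": "a7f3b29c", "user_query": "Can John Smith take sertraline?"})
captured = demo_handler.records[0].msg
print(captured)
```

A filter like this only sees structured fields. It does nothing for content interpolated into a message string, as in the f-string anti-pattern above, so it complements code review rather than replacing it.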

Step 1: Audit Your Existing Logs

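Before fixing the pipeline, understand the scope of the problem. Query your logs for known PII patterns. These regexes are deliberately coarse (the name pattern, for instance, matches any two adjacent capitalized words), so treat the counts as a rough estimate to verify by hand rather than an exact measure: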

Python
import re
from dataclasses import dataclass
from typing import List

PII_AUDIT_PATTERNS = {
    "full_name_pattern": r"\b[A-Z][a-z]{2,}\s+[A-Z][a-z]{2,}\b",
    "date_of_birth": r"\b(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "email": r"\b[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b",
    "phone": r"\b(\+1\s?)?\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4}\b",
    "medication_dosage": r"\b\d+\s*(mg|ml|mcg|IU)\b",
    "credit_card": r"\b(?:\d{4}[\s\-]?){3}\d{4}\b",
}

@dataclass
class PIIAuditResult:
    log_id: str
    field: str
    pii_type: str
    match_preview: str  # first 20 chars of match for verification

def audit_log_entry(log_entry: dict) -> List[PIIAuditResult]:
    """Scan a single log entry for PII patterns."""
    results = []
    for field, value in log_entry.items():
        if not isinstance(value, str):
            continue
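        for pii_type, pattern in PII_AUDIT_PATTERNS.items():
            # finditer + group(0) yields the full match; findall would return
            # only the capture groups for patterns that contain them.
            for m in re.finditer(pattern, value):
                matched = m.group(0)
                results.append(PIIAuditResult(
                    log_id=log_entry.get("request_id", "unknown"),
                    field=field,
                    pii_type=pii_type,
                    match_preview=matched[:20] + ("..." if len(matched) > 20 else ""),
                ))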
    return results

def audit_log_sample(log_entries: List[dict]) -> dict:
    all_findings = []
    for entry in log_entries:
        all_findings.extend(audit_log_entry(entry))

    by_type = {}
    for finding in all_findings:
        by_type[finding.pii_type] = by_type.get(finding.pii_type, 0) + 1

    return {
        "total_entries_scanned": len(log_entries),
        "entries_with_pii": len(set(f.log_id for f in all_findings)),
        "total_pii_instances": len(all_findings),
        "by_pii_type": by_type,
    }
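
To see what such an audit reports, the sketch below runs a reduced version of the scan over the sample log entry from the scenario. It is self-contained for illustration: it repeats only three of the patterns, and `scan_entry` is an illustrative name rather than part of the code above.

```python
import re

# Subset of the audit patterns, repeated here so the sketch runs on its own.
PATTERNS = {
    "full_name_pattern": r"\b[A-Z][a-z]{2,}\s+[A-Z][a-z]{2,}\b",
    "date_of_birth": r"\b(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\d{4}\b",
    "medication_dosage": r"\b\d+\s*(mg|ml|mcg|IU)\b",
}

sample_entry = {
    "request_id": "a7f3b29c",
    "user_query": "Can John Smith, born 03/04/1985, take 150mg of sertraline with metformin?",
    "user_id": "user-7821",
}

def scan_entry(entry: dict) -> list[tuple[str, str, str]]:
    """Return (field, pii_type, matched_text) for every pattern hit in string fields."""
    hits = []
    for field, value in entry.items():
        if not isinstance(value, str):
            continue
        for pii_type, pattern in PATTERNS.items():
            # group(0) is the whole match, regardless of capture groups in the pattern.
            for m in re.finditer(pattern, value):
                hits.append((field, pii_type, m.group(0)))
    return hits

findings = scan_entry(sample_entry)
for field, pii_type, text in findings:
    print(f"{field}: {pii_type} -> {text!r}")
```

On this entry the date and dosage are found as expected, but the name pattern reports "Can John" rather than "John Smith", because it simply matches the first pair of capitalized words. That is a useful reminder that the regex audit measures scope approximately and that hits need manual verification.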

Step 2: Add Microsoft Presidio for PII Detection and Anonymization

Presidio is Microsoft's open-source PII detection library. It combines NLP-based named entity recognition (spaCy) with pattern and checksum recognizers to detect entities such as PERSON, DATE_TIME, and US_SSN, which is generally more reliable than hand-written regex alone.

Bash
pip install presidio-analyzer presidio-anonymizer spacy
python -m spacy download en_core_web_lg

Python
from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from typing import List

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Entities we care about in healthcare context
SENSITIVE_ENTITIES = [
    "PERSON",
    "DATE_TIME",
    "US_SSN",
    "PHONE_NUMBER",
    "EMAIL_ADDRESS",
    "CREDIT_CARD",
    "IP_ADDRESS",
    "MEDICAL_LICENSE",
    "US_PASSPORT",
    "US_DRIVER_LICENSE",
]

def anonymize_text(text: str, language: str = "en") -> dict:
    """
    Detect and anonymize PII in a string.
    Returns the anonymized text and a list of what was found.
    """
    if not text or len(text.strip()) == 0:
        return {"anonymized": text, "entities_found": []}

    # Detect PII
    results: List[RecognizerResult] = analyzer.analyze(
        text=text,
        entities=SENSITIVE_ENTITIES,
        language=language,
    )

    if not results:
        return {"anonymized": text, "entities_found": []}

    # Anonymize: replace with placeholder type labels
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
            "DATE_TIME": OperatorConfig("replace", {"new_value": "<DATE>"}),
            "US_SSN": OperatorConfig("replace", {"new_value": "<SSN>"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "<CREDIT_CARD>"}),
            "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<IP>"}),
            "DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"}),
        },
    )

    entities_found = [
        {
            "entity_type": r.entity_type,
            "score": round(r.score, 3),
            "start": r.start,
            "end": r.end,
        }
        for r in results
    ]

    return {
        "anonymized": anonymized.text,
        "entities_found": entities_found,
    }

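# Example:
original = "John Smith (DOB: 03/04/1985) is asking about 150mg sertraline."
result = anonymize_text(original)
print(result["anonymized"])
# Expected output (exact detections depend on the spaCy model and Presidio version):
# "<PERSON> (DOB: <DATE>) is asking about 150mg sertraline."
print(f"Found: {[e['entity_type'] for e in result['entities_found']]}")
# Expected output: Found: ['PERSON', 'DATE_TIME']
# Note: the dosage and drug name are left intact; the entity list above has no recognizer for them.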
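
The dosage in the example survives anonymization because none of the entities requested from Presidio covers it. One possible complement, sketched here with the standard library only, is a regex pass that reuses the dosage shape from the Step 1 audit. `redact_dosages` and the `<DOSAGE>` placeholder are hypothetical names, and drug names would need a separate dictionary-based approach that is not shown.

```python
import re

# Same dosage shape as the Step 1 audit pattern, with optional decimals
# and case-insensitive units.
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|ml|mcg|iu)\b", re.IGNORECASE)

def redact_dosages(text: str) -> str:
    """Replace dosage expressions such as '150mg' or '2.5 ml' with a placeholder."""
    return DOSAGE_RE.sub("<DOSAGE>", text)

already_anonymized = "<PERSON> (DOB: <DATE>) is asking about 150mg sertraline."
print(redact_dosages(already_anonymized))
```

This is only relevant where anonymized text is stored or displayed at all. The log schema in the next step avoids the problem by not logging content in the first place.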

Step 3: Redesign the Log Schema

The right approach is to log metadata, not content. Never log raw user queries or LLM responses. Log only what you need for debugging and monitoring:

Python
import hashlib
import logging
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

logger = logging.getLogger("rag.audit")

@dataclass
class SafeRequestLog:
    # Identifiers (safe: opaque IDs, not user data)
    request_id: str
    user_id: str
    session_id: str
    timestamp: str

    # Query metadata (no content)
    query_length_chars: int
    query_token_count: int
    query_hash: str            # SHA-256 digest for correlation; unsalted, so short or guessable queries can be brute-forced

    # Processing metadata
    cache_hit: bool
    model_used: str
    chunks_retrieved: int
    context_tokens: int
    completion_tokens: int
    latency_ms: float

    # Safety signals (flags, not content)
    input_guard_triggered: bool
    output_guard_triggered: bool
    pii_entities_detected: list[str]  # types only, e.g. ["PERSON", "DATE_TIME"]
    pii_count: int

    # Error state
    error: Optional[str] = None

def build_safe_log(
    request_id: str,
    user_id: str,
    session_id: str,
    raw_query: str,
    model_used: str,
    chunks_retrieved: int,
    context_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    cache_hit: bool,
    pii_result: dict,
    input_blocked: bool = False,
    output_blocked: bool = False,
) -> SafeRequestLog:
    query_hash = hashlib.sha256(raw_query.encode()).hexdigest()

    return SafeRequestLog(
        request_id=request_id,
        user_id=user_id,
        session_id=session_id,
        timestamp=datetime.utcnow().isoformat(),
        query_length_chars=len(raw_query),
        query_token_count=count_tokens(raw_query),  # count_tokens: tokenizer helper assumed to be defined elsewhere
        query_hash=query_hash,
        cache_hit=cache_hit,
        model_used=model_used,
        chunks_retrieved=chunks_retrieved,
        context_tokens=context_tokens,
        completion_tokens=completion_tokens,
        latency_ms=round(latency_ms, 2),
        input_guard_triggered=input_blocked,
        output_guard_triggered=output_blocked,
        pii_entities_detected=[e["entity_type"] for e in pii_result.get("entities_found", [])],
        pii_count=len(pii_result.get("entities_found", [])),
    )

def log_request(log: SafeRequestLog):
    """Log only safe metadata. NEVER log query content or response content."""
    logger.info(asdict(log))
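
A plain SHA-256 of the query is deterministic and unsalted, so anyone with log access can hash candidate queries and compare the digests. If that matters for your threat model, a keyed hash is one option. The following is a minimal sketch, assuming the key is loaded from a secret store; `QUERY_HASH_KEY` and `keyed_query_hash` are hypothetical names.

```python
import hashlib
import hmac

# Assumption: in a real service this key comes from a secret store, not source code.
QUERY_HASH_KEY = b"replace-with-a-secret-from-your-vault"

def keyed_query_hash(raw_query: str, key: bytes = QUERY_HASH_KEY) -> str:
    """HMAC-SHA256 of the query: stable for correlation, not guessable without the key."""
    return hmac.new(key, raw_query.encode("utf-8"), hashlib.sha256).hexdigest()

h1 = keyed_query_hash("what is the capital of France?")
h2 = keyed_query_hash("what is the capital of France?")
h3 = keyed_query_hash("what is the capital of France?", key=b"a-different-key")

print(h1 == h2)  # same query and key give the same digest, so correlation still works
print(h1 == h3)  # a different key gives an unrelated digest
```

The trade-off is that rotating the key breaks correlation across the rotation boundary, which may or may not be acceptable for cache-hit analysis.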

Step 4: PII-Safe Query Pipeline

The full pipeline runs PII detection on the incoming query, passes the original query to retrieval and the LLM, and logs only safe metadata:

Python
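import secrets
from datetime import datetime
from typing import Optional

# anonymize_text, build_safe_log, log_request, and logger come from Steps 2 and 3.
# build_context_string and count_tokens are helpers assumed to exist elsewhere.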

class PIISafeQueryPipeline:
    def __init__(self, vector_store, llm_client):
        self.vector_store = vector_store
        self.llm_client = llm_client

    async def process_query(
        self,
        raw_query: str,
        user_id: str,
        session_id: str,
    ) -> dict:
        request_id = secrets.token_hex(16)
        start_time = datetime.utcnow()

        # Step 1: Detect PII in query (but pass original query to LLM)
        # We detect for logging purposes, not for redaction to the LLM.
        # The LLM needs the actual query to answer correctly.
        pii_result = anonymize_text(raw_query)
        has_pii = len(pii_result["entities_found"]) > 0

        if has_pii:
            logger.info({
                "event": "pii_detected_in_query",
                "request_id": request_id,
                "pii_types": [e["entity_type"] for e in pii_result["entities_found"]],
                # Never log: pii_result["anonymized"] or raw_query
            })

        # Step 2: RAG pipeline with original query
        try:
            chunks = await self.vector_store.search(raw_query, k=5)
            context = build_context_string(chunks)

            response = await self.llm_client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": f"Context:\n{context}"},
                    {"role": "user", "content": raw_query},
                ],
                max_tokens=600,
            )
            answer = response.choices[0].message.content

        except Exception as e:
            # Log error WITHOUT the query content
            log_error = build_safe_log(
                request_id=request_id,
                user_id=user_id,
                session_id=session_id,
                raw_query=raw_query,
                model_used="gpt-4o",
                chunks_retrieved=0,
                context_tokens=0,
                completion_tokens=0,
                latency_ms=0,
                cache_hit=False,
                pii_result=pii_result,
            )
            log_error.error = type(e).__name__  # error type, not message (may contain PII)
            log_request(log_error)
            raise

        # Step 3: Log safe metadata only
        latency_ms = (datetime.utcnow() - start_time).total_seconds() * 1000
        safe_log = build_safe_log(
            request_id=request_id,
            user_id=user_id,
            session_id=session_id,
            raw_query=raw_query,
            model_used="gpt-4o",
            chunks_retrieved=len(chunks),
            context_tokens=count_tokens(context),
            completion_tokens=response.usage.completion_tokens,
            latency_ms=latency_ms,
            cache_hit=False,
            pii_result=pii_result,
        )
        log_request(safe_log)

        return {"answer": answer, "request_id": request_id}
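
A cheap safeguard for a pipeline like this is a regression test that captures everything the logger emits and asserts that the raw query never appears. The sketch below uses a stand-in `handle_request` function instead of the full pipeline (an assumption, to keep it runnable without a vector store or LLM client). The same capture pattern should apply to `PIISafeQueryPipeline.process_query`.

```python
import hashlib
import io
import logging

def handle_request(raw_query: str, log: logging.Logger) -> None:
    """Stand-in for the real pipeline: logs metadata about the query, never its text."""
    log.info({
        "event": "query_processed",
        "query_length_chars": len(raw_query),
        "query_hash": hashlib.sha256(raw_query.encode()).hexdigest(),
    })

def capture_logs(raw_query: str) -> str:
    """Run one request and return everything written to the log stream."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    log = logging.getLogger("rag.leak_test")
    log.setLevel(logging.INFO)
    log.propagate = False
    log.addHandler(handler)
    try:
        handle_request(raw_query, log)
    finally:
        log.removeHandler(handler)
    return stream.getvalue()

query = "Can John Smith, born 03/04/1985, take 150mg of sertraline?"
output = capture_logs(query)

# Fail loudly if any sensitive fragment of the query reached the log stream.
for fragment in ("John Smith", "03/04/1985", "150mg", "sertraline"):
    assert fragment not in output, f"leaked: {fragment}"
```

Running a check like this in CI turns "nobody reviewed the logging" from a process failure into a failing build.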

Step 5: Secure Debug Logging for Incident Investigation

Sometimes you genuinely need to see query content for debugging. The solution is time-limited, access-controlled storage with full audit logging — not application logs:

Python
import json
from datetime import datetime

from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient
from cryptography.fernet import Fernet

class SecureDebugStore:
    """
    Stores raw query/response pairs in blob storage, encrypted with a key from Key Vault.

    This snippet covers the write path only. The 48-hour expiry, engineer access via
    time-limited SAS URLs, and the audit log of every read are assumed to be enforced
    outside this class (for example, a storage lifecycle rule or cleanup job, and
    storage access logging).
    """
    def __init__(self, blob_client: BlobServiceClient, key_vault_client: SecretClient):
        self.blob = blob_client
        self.kv = key_vault_client
        self.container = "debug-logs-encrypted"

    def store_for_debug(self, request_id: str, query: str, response: str):
        """
        Store only when explicitly requested (e.g., for a specific failing request).
        NEVER called in the normal request path.
        """

        # Encrypt before storing
        key = self.kv.get_secret("debug-log-encryption-key").value.encode()
        f = Fernet(key)
        payload = json.dumps({"query": query, "response": response}).encode()
        encrypted = f.encrypt(payload)

        blob_name = f"{request_id}.enc"
        self.blob.get_blob_client(self.container, blob_name).upload_blob(
            encrypted,
            overwrite=True,
            metadata={"stored_at": datetime.utcnow().isoformat(), "ttl_hours": "48"},
        )

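        # Audit record of the write itself: who stored what, never the content.
        # get_current_engineer_id() is assumed to return the caller's identity.
        logger.info({
            "event": "debug_log_stored",
            "request_id": request_id,
            "stored_by": get_current_engineer_id(),
            "expires_in_hours": 48,
        })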
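
The `ttl_hours` metadata is only a label; nothing in the snippet deletes the blob. One way to enforce the limit is a scheduled cleanup job that compares `stored_at` with the current time (a storage lifecycle rule is another). The sketch below shows just the expiry check, with `is_expired` as a hypothetical helper operating on the metadata dict written by `store_for_debug`.

```python
from datetime import datetime, timedelta, timezone

def is_expired(metadata: dict, now: datetime) -> bool:
    """True when a debug blob is older than its ttl_hours, or its metadata is unusable."""
    try:
        stored_at = datetime.fromisoformat(metadata["stored_at"])
        ttl = timedelta(hours=int(metadata["ttl_hours"]))
    except (KeyError, TypeError, ValueError):
        return True  # fail closed: blobs without valid metadata are treated as expired
    if stored_at.tzinfo is None:
        # store_for_debug writes naive UTC timestamps via datetime.utcnow()
        stored_at = stored_at.replace(tzinfo=timezone.utc)
    return now - stored_at >= ttl

now = datetime(2026, 5, 17, 10, 0, tzinfo=timezone.utc)
fresh = {"stored_at": "2026-05-16T09:23:14", "ttl_hours": "48"}
stale = {"stored_at": "2026-05-15T09:23:14", "ttl_hours": "48"}
print(is_expired(fresh, now), is_expired(stale, now))
```

Treating missing or malformed metadata as expired is a deliberate choice here: for a store that holds raw PHI, deleting too early is a smaller risk than keeping data indefinitely.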

Compliance Summary

| Requirement | Implementation |
|---|---|
| No PII in application logs | Log metadata schema (query hash, not content) |
| PII detection before any logging | Presidio analyzer on every request |
| PII type visibility for monitoring | Log entity types only, not values |
| Debug access with audit trail | Encrypted blob store with access log |
| Data retention limits | Azure Monitor workspace retention = 30 days |
| Right to erasure (GDPR) | No raw PII in application logs; debug blobs expire after 48 hours |

The guiding principle: log signals, not content. You can debug latency, error rates, model performance, and cache efficiency entirely through metadata. Raw query content is almost never needed in routine operations, and when it is, it should live in a secure, time-limited, audited store, not in your application log stream.