AI Safety & Guardrails · Lesson 13 of 15

Content Moderation APIs: OpenAI, Azure Content Safety

Why Use Moderation APIs?

Building a custom safety classifier requires data, training, and maintenance. For most use cases, a moderation API gives you a production-ready classifier in one API call.

Use moderation APIs for:

  • Screening user inputs before they reach your LLM
  • Screening LLM outputs before they reach users
  • Fast iteration without a machine learning team
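
The first two uses above can be combined into a single wrapper around the model call. The following is a minimal sketch, not a definitive implementation: `guarded_reply`, `SafetyCheck`, `Generate`, `REFUSAL`, and the two `demo_*` functions are hypothetical names introduced here, and the demo check is a keyword stand-in for a real moderation API call.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type aliases: an async check that returns True when text is safe,
# and an async function that turns a prompt into a model reply.
SafetyCheck = Callable[[str], Awaitable[bool]]
Generate = Callable[[str], Awaitable[str]]

REFUSAL = "Sorry, I can't help with that request."


async def guarded_reply(prompt: str, is_safe: SafetyCheck, generate: Generate) -> str:
    """Screen the prompt, call the model, then screen the reply."""
    if not await is_safe(prompt):
        return REFUSAL  # Blocked before the LLM is called
    reply = await generate(prompt)
    if not await is_safe(reply):
        return REFUSAL  # Blocked before the user sees the output
    return reply


# Stand-ins for demonstration only; a real check would call a moderation API.
async def demo_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


async def demo_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


if __name__ == "__main__":
    print(asyncio.run(guarded_reply("hello", demo_is_safe, demo_generate)))
```

Passing the check and the generator in as parameters keeps the wrapper independent of any one provider, so the same function can sit in front of whichever moderation API you choose.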

OpenAI Moderation API

The OpenAI Moderation API is free, fast, and catches most common categories of harmful content.

Categories detected:

  • harassment and harassment/threatening
  • hate and hate/threatening
  • self-harm and self-harm/intent and self-harm/instructions
  • sexual and sexual/minors
  • violence and violence/graphic
  • illicit and illicit/violent

Python
# moderation/openai_mod.py
# Moderation is an OpenAI platform endpoint, so use the OpenAI client here
from openai import AsyncOpenAI
from dataclasses import dataclass

@dataclass
class ModerationResult:
    flagged: bool
    categories: list[str]
    scores: dict[str, float]

async def moderate_with_openai(
    text: str,
    client: AsyncOpenAI,
    threshold: float = 0.5,
) -> ModerationResult:
    """Check text against OpenAI content policy categories."""
    response = await client.moderations.create(
        input=text,
        model="omni-moderation-latest",
    )

    result = response.results[0]

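    # by_alias=True keeps the API's category names (e.g. "self-harm/intent")
    # rather than the Python attribute names (e.g. "self_harm_intent")
    scores = result.category_scores.model_dump(by_alias=True)
    # Keep the categories whose score meets the threshold
    flagged_categories = [
        category
        for category, score in scores.items()
        if score is not None and score >= threshold
    ]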

    return ModerationResult(
        flagged=result.flagged or len(flagged_categories) > 0,
        categories=flagged_categories,
        scores=scores,
    )

# Example usage:
async def check_user_input(text: str, client) -> bool:
    result = await moderate_with_openai(text, client)
    if result.flagged:
        return False  # Block this input
    return True

# Latency: 100-200ms
# Cost: Free (as of 2026)
# Best for: general content policy violations
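
A single global threshold treats every category the same, which may be too lenient for some categories and too strict for others. The sketch below shows one way to apply per-category thresholds to a scores dictionary. It assumes the keys are the category names listed above (slash form); `CATEGORY_THRESHOLDS`, `DEFAULT_THRESHOLD`, and `categories_over_threshold` are hypothetical names, and the threshold values are illustrative, not recommendations.

```python
# Hypothetical per-category thresholds; tune these on your own traffic.
DEFAULT_THRESHOLD = 0.5
CATEGORY_THRESHOLDS = {
    "sexual/minors": 0.1,     # stricter than the default
    "self-harm/intent": 0.2,  # stricter than the default
    "violence": 0.7,          # more lenient than the default
}


def categories_over_threshold(
    scores: dict,
    thresholds: dict = CATEGORY_THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> list:
    """Return the sorted category names whose score meets that category's threshold."""
    return sorted(
        category
        for category, score in scores.items()
        if score is not None and score >= thresholds.get(category, default)
    )


example_scores = {
    "sexual/minors": 0.15,
    "violence": 0.6,
    "hate": 0.55,
    "harassment": 0.1,
}

if __name__ == "__main__":
    print(categories_over_threshold(example_scores))
```

With these example scores, "violence" at 0.6 passes because its threshold is 0.7, while "sexual/minors" at 0.15 is flagged because its threshold is 0.1.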

Azure Content Safety

Azure Content Safety provides category scores with severity levels (0, 2, 4, 6) across four categories. It's designed for enterprise use and supports custom blocklists.

Categories:

  • Hate — hate speech
  • Violence — violent content
  • Sexual — sexual content
  • SelfHarm — self-harm content

Each category returns a severity score from 0 (safe) to 6 (high severity).

Python
# moderation/azure_content_safety.py
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions, TextCategory
from azure.core.credentials import AzureKeyCredential
from dataclasses import dataclass

@dataclass
class AzureSafetyResult:
    is_safe: bool
    hate_severity: int
    violence_severity: int
    sexual_severity: int
    self_harm_severity: int

class AzureContentSafetyModerator:
    def __init__(self, endpoint: str, api_key: str):
        # Kept so add_blocklist_item can build a BlocklistClient on demand
        self._endpoint = endpoint
        self._credential = AzureKeyCredential(api_key)
        self.client = ContentSafetyClient(
            endpoint=endpoint,
            credential=self._credential,
        )

    def analyze(
        self,
        text: str,
        max_severity: int = 2,  # Block severity 4+ (medium and above)
    ) -> AzureSafetyResult:
        """Analyze text synchronously. Use a thread pool in async contexts."""
        response = self.client.analyze_text(
            AnalyzeTextOptions(
                text=text,
                categories=[
                    TextCategory.HATE,
                    TextCategory.VIOLENCE,
                    TextCategory.SEXUAL,
                    TextCategory.SELF_HARM,
                ],
            )
        )

        # A missing severity is treated as 0 (safe)
        severities = {
            item.category: item.severity or 0
            for item in response.categories_analysis
        }

        hate = severities.get(TextCategory.HATE, 0)
        violence = severities.get(TextCategory.VIOLENCE, 0)
        sexual = severities.get(TextCategory.SEXUAL, 0)
        self_harm = severities.get(TextCategory.SELF_HARM, 0)

        is_safe = max(hate, violence, sexual, self_harm) <= max_severity

        return AzureSafetyResult(
            is_safe=is_safe,
            hate_severity=hate,
            violence_severity=violence,
            sexual_severity=sexual,
            self_harm_severity=self_harm,
        )

    def add_blocklist_item(self, blocklist_name: str, text: str):
        """Add a custom term to a blocklist (e.g., competitor drug names)."""
        # Blocklist management lives on BlocklistClient, not ContentSafetyClient
        from azure.ai.contentsafety import BlocklistClient
        from azure.ai.contentsafety.models import (
            AddOrUpdateTextBlocklistItemsOptions,
            TextBlocklistItem,
        )

        blocklist_client = BlocklistClient(self._endpoint, self._credential)
        blocklist_client.add_or_update_blocklist_items(
            blocklist_name=blocklist_name,
            options=AddOrUpdateTextBlocklistItemsOptions(
                blocklist_items=[TextBlocklistItem(text=text)]
            ),
        )

Custom blocklists are a key Azure Content Safety feature. For a pharmaceutical app, add:

  • Drug names that should never be mentioned (controlled substances without medical context)
  • Competitor brand names (if restricted by brand guidelines)
  • Internal company terms that shouldn't appear in public responses
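
Adding terms to a blocklist does not by itself change analysis results; the blocklist has to be named in the analyze request. The sketch below shows one way to do that, assuming the `azure-ai-contentsafety` 1.x SDK and a blocklist that already exists. `analyze_with_blocklist` and `blocklist_hits` are hypothetical helper names, and the demo response is a hand-built stand-in, not real service output.

```python
from types import SimpleNamespace


def analyze_with_blocklist(client, text: str, blocklist_name: str):
    """Analyze text and also match it against one named blocklist.

    `client` is assumed to be an azure.ai.contentsafety.ContentSafetyClient.
    """
    # Imported lazily so this module loads even without the SDK installed.
    from azure.ai.contentsafety.models import AnalyzeTextOptions

    return client.analyze_text(
        AnalyzeTextOptions(
            text=text,
            blocklist_names=[blocklist_name],
            halt_on_blocklist_hit=False,  # still run the category analysis on a hit
        )
    )


def blocklist_hits(response) -> list[str]:
    """Return the blocklisted terms that matched, or [] if none did."""
    matches = getattr(response, "blocklists_match", None) or []
    return [match.blocklist_item_text for match in matches]


# Hand-built stand-in for a response, for demonstration only.
demo_response = SimpleNamespace(
    blocklists_match=[SimpleNamespace(blocklist_item_text="brandx")]
)

if __name__ == "__main__":
    print(blocklist_hits(demo_response))
```

Treating any blocklist hit as a block, independently of the severity scores, keeps the brand and compliance rules separate from the general content policy.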

AWS Comprehend

AWS Comprehend is primarily an NLP service (sentiment, entities, key phrases) but also provides content classification and PII detection:

Python
# moderation/aws_comprehend.py
import boto3
from dataclasses import dataclass

@dataclass
class ComprehendPIIResult:
    contains_pii: bool
    pii_entities: list[dict]

class AWSComprehendModerator:
    def __init__(self, region: str = "us-east-1"):
        self.client = boto3.client("comprehend", region_name=region)

    def detect_pii(self, text: str, language: str = "en") -> ComprehendPIIResult:
        """Detect PII entities in text."""
        response = self.client.detect_pii_entities(
            Text=text,
            LanguageCode=language,
        )

        entities = response.get("Entities", [])
        pii_found = [
            e for e in entities
            if e.get("Score", 0) >= 0.9  # High confidence only
        ]

        return ComprehendPIIResult(
            contains_pii=len(pii_found) > 0,
            pii_entities=[
                {
                    "type": e["Type"],
                    "start": e["BeginOffset"],
                    "end": e["EndOffset"],
                    "score": e["Score"],
                }
                for e in pii_found
            ],
        )

    def mask_pii(self, text: str) -> str:
        """Replace each detected PII span with its entity type, e.g. [EMAIL]."""
        result = self.detect_pii(text)
        if not result.contains_pii:
            return text

        # Replace PII from end to start (to preserve offsets)
        masked = text
        for entity in sorted(result.pii_entities, key=lambda e: e["end"], reverse=True):
            masked = masked[:entity["start"]] + f"[{entity['type']}]" + masked[entity["end"]:]

        return masked
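
To see why the replacement runs from the end of the string toward the start, here is a small standalone sketch of the same masking step with hand-written offsets, so it runs without AWS credentials. `mask_spans` is a hypothetical helper, and the sample entities are made up for illustration rather than taken from Comprehend output.

```python
def mask_spans(text: str, entities: list[dict]) -> str:
    """Replace each span with its type label, working from the end of the string."""
    masked = text
    # Later spans are replaced first, so the offsets of earlier spans stay valid
    # even though each replacement changes the length of the string.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[: entity["start"]] + f"[{entity['type']}]" + masked[entity["end"] :]
    return masked


sample_text = "Email jane@example.com or call 555-0100."
sample_entities = [
    {"type": "EMAIL", "start": 6, "end": 22},
    {"type": "PHONE", "start": 31, "end": 39},
]

if __name__ == "__main__":
    print(mask_spans(sample_text, sample_entities))
```

If the email were replaced first, the string would shrink by nine characters and the phone number's offsets (31 to 39) would no longer point at the phone number.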

Comparison: Which API to Use

| API | Cost | Latency | Best For |
|---|---|---|---|
| OpenAI Moderation | Free | 100-200ms | General content policy (hate, violence, sexual) |
| Azure Content Safety | ~$1/1k calls | 100-300ms | Enterprise, custom blocklists, severity levels |
| AWS Comprehend | ~$0.50-$1/1k | 200-500ms | PII detection, NLP features, AWS ecosystem |

Recommendation for most AI applications:

  1. Use OpenAI Moderation API for user input screening (free, fast)
  2. Use Azure Content Safety for output screening with custom blocklists (enterprise control)
  3. Use AWS Comprehend only if you need PII masking specifically

Combining APIs in Production

Python
# moderation/combined.py
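import asyncio

# Module paths follow the file names used in the earlier examples
from moderation.azure_content_safety import AzureContentSafetyModerator
from moderation.openai_mod import moderate_with_openai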

async def full_moderation_pipeline(
    text: str,
    openai_client,
    azure_moderator: AzureContentSafetyModerator,
) -> tuple[bool, list[str]]:
    """Run OpenAI + Azure in parallel and collect the reasons for any block."""

    # Run both in parallel
    openai_result, azure_result = await asyncio.gather(
        moderate_with_openai(text, openai_client),
        asyncio.to_thread(azure_moderator.analyze, text),
    )

    reasons = []

    if openai_result.flagged:
        reasons.extend(openai_result.categories)

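    if not azure_result.is_safe:
        # Report every category above the default max_severity of 2, so an
        # Azure block is never returned with an empty reason list.
        if azure_result.hate_severity > 2:
            reasons.append("hate")
        if azure_result.violence_severity > 2:
            reasons.append("violence")
        if azure_result.sexual_severity > 2:
            reasons.append("sexual")
        if azure_result.self_harm_severity > 2:
            reasons.append("self-harm")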
    is_safe = len(reasons) == 0
    return is_safe, reasons

Running both APIs in parallel costs roughly the latency of the slower call: the total time is about max(openai_latency, azure_latency), not the sum.
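
The latency claim can be checked without calling either service. The sketch below replaces the two API calls with `asyncio.sleep` stand-ins; `fake_check`, `timed_parallel`, and the 0.15 s and 0.25 s delays are assumptions chosen for illustration, not measured latencies.

```python
import asyncio
import time


async def fake_check(name: str, latency_s: float) -> str:
    """Stand-in for a moderation call that takes latency_s seconds."""
    await asyncio.sleep(latency_s)
    return name


async def timed_parallel() -> tuple[list[str], float]:
    """Run two simulated checks concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_check("openai", 0.15),
        fake_check("azure", 0.25),
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(timed_parallel())
    print(results, f"{elapsed:.2f}s")
```

The elapsed time should land near 0.25 s rather than 0.40 s. In the real pipeline the synchronous Azure call runs in a worker thread via `asyncio.to_thread`, which adds a small scheduling overhead on top of the network time.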


When Moderation APIs Are Not Enough

Moderation APIs catch policy violations but not domain-specific safety issues:

  • OpenAI Moderation won't catch dangerous drug combination advice (not a policy violation per se)
  • Azure Content Safety won't catch medically incorrect dosage information

For domain-specific safety (medical, legal, financial), you need custom classifiers or LLM-as-judge evaluation as described in the output classifiers lesson.

Use moderation APIs as Layer 1 (fast, general), then add domain-specific checks as Layer 2.
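
The layering can be sketched as an ordered list of checks that stops at the first layer to report a problem. This is an illustrative sketch only: `Verdict`, `run_layers`, and both stand-in checks are hypothetical, and the keyword rules are placeholders for a real moderation API call and a real domain classifier.

```python
from dataclasses import dataclass, field
from typing import Callable

# A check returns a list of reasons; an empty list means the text passed.
Check = Callable[[str], list]


@dataclass
class Verdict:
    allowed: bool
    layer: str = ""
    reasons: list = field(default_factory=list)


def run_layers(text: str, layers: list) -> Verdict:
    """Run (name, check) pairs in order and stop at the first layer that objects."""
    for name, check in layers:
        reasons = check(text)
        if reasons:
            return Verdict(allowed=False, layer=name, reasons=reasons)
    return Verdict(allowed=True)


# Stand-in for Layer 1: a general moderation API call.
def general_moderation(text: str) -> list:
    return ["violence"] if "attack" in text.lower() else []


# Stand-in for Layer 2: a domain-specific classifier or LLM-as-judge.
def dosage_check(text: str) -> list:
    return ["unverified-dosage"] if "double the dose" in text.lower() else []


LAYERS = [("general", general_moderation), ("domain", dosage_check)]

if __name__ == "__main__":
    print(run_layers("You can double the dose if it is not working.", LAYERS))
```

Putting the cheap general check first means the slower domain-specific layer only runs on text that has already passed Layer 1.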