Learnixo

Prompt Engineering Mastery · Lesson 16 of 24

Interview: Design a Structured Output Prompt

Q: How do you reliably get JSON from an LLM?

Layered approach:

  1. Prompt-level: specify the schema in TypeScript-style types, say "respond ONLY with valid JSON," and, where the API and model support it (e.g., Anthropic), prefill the assistant turn with { so the model continues a JSON object. Remember to prepend the { again before parsing.
  2. API-level: use JSON mode (OpenAI) for syntactic validity, or Structured Outputs / strict tool use for schema enforcement, where the schema constrains sampling rather than merely being described.
  3. Parsing: wrap json.loads() in error handling; strip markdown code blocks before parsing; try regex extraction as fallback.
  4. Validation: validate parsed JSON against a Pydantic model or JSON Schema.
  5. Retry: on failure, append the specific error to the conversation and re-prompt; cap at 2-3 retries.
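
As a rough illustration of layer 3, the sketch below strips a markdown fence, tries a strict parse, and falls back to extracting the outermost {...} span. The name parse_json_lenient and the sample replies are hypothetical; a real pipeline would pass the result on to Pydantic or JSON Schema validation (layer 4).

```python
import json
import re

# Matches a fenced block, optionally tagged "json"; `{3} avoids a literal fence here.
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)

def parse_json_lenient(text: str):
    """Parse model output as JSON, tolerating fences and surrounding chatter."""
    fenced = FENCE_RE.search(text)
    candidate = (fenced.group(1) if fenced else text).strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Fallback: take the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])

fence = "`" * 3
fenced_reply = "Here you go:\n" + fence + 'json\n{"urgency": "high"}\n' + fence
chatty_reply = 'Sure! {"urgency": "low"} Let me know if you need more.'
print(parse_json_lenient(fenced_reply))
print(parse_json_lenient(chatty_reply))
```

The fallback is deliberately crude: it can still fail on replies containing several objects, which is one reason the retry layer exists.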

Q: What is the difference between JSON mode and Structured Outputs?

JSON mode (OpenAI): guarantees the output is syntactically valid JSON. Does not guarantee schema compliance — the model may return different field names or types. The schema is only described to the model; not enforced at sampling.

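Structured Outputs (OpenAI, via response_format with a JSON Schema; the Python SDK also accepts a Pydantic model): enforces the schema through constrained decoding, so tokens that would violate the schema are masked during sampling. A completed response therefore conforms to the schema. The exceptions are refusals and outputs cut off by the token limit, which you still have to check for. With the SDK's parse helper, the result comes back as a validated Pydantic instance, so no manual json.loads() is needed.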

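Tool use / function calling (OpenAI + Anthropic): the model generates arguments for a tool whose input schema you declare. How strictly that schema is enforced depends on the provider and settings: OpenAI guarantees schema-conformant arguments only when strict mode is enabled on the function. Where no strict mode is in effect, arguments usually follow the schema but are not guaranteed to, so validate them as you would any other output.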
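
To make the syntax-versus-schema distinction concrete, here is a small stdlib-only sketch. The reply string is an invented example of what JSON mode could return, and schema_errors is a hypothetical stand-in for a real validator such as Pydantic: the reply parses cleanly yet fails the schema check.

```python
import json

# Expected shape: field name -> (required type, allowed values or None)
EXPECTED = {
    "urgency": (str, {"low", "medium", "high"}),
    "summary": (str, None),
}

def schema_errors(data: dict) -> list:
    """Return a list of schema violations; an empty list means the data conforms."""
    errors = []
    for field, (ftype, allowed) in EXPECTED.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], ftype):
            errors.append(f"wrong type for {field}")
        elif allowed is not None and data[field] not in allowed:
            errors.append(f"invalid value for {field}: {data[field]!r}")
    for field in data:
        if field not in EXPECTED:
            errors.append(f"unexpected field: {field}")
    return errors

reply = '{"urgency_level": "HIGH", "summary": "Chest pain, onset 2h ago"}'
parsed = json.loads(reply)      # succeeds: the reply is syntactically valid JSON
print(schema_errors(parsed))    # non-empty: the field was renamed, so the schema check fails
```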


Q: How do you handle truncated JSON output?

Prevention is better than cure: estimate the output size and set max_tokens with generous headroom above it. For streaming responses, monitor token count against budget.

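For recovery: detect truncation first from the API's stop signal (stop_reason == "max_tokens" on Anthropic, finish_reason == "length" on OpenAI), which is more reliable than inspecting the parser error. A json.JSONDecodeError with "Unterminated" in the message only catches output cut off inside a string; other cut-offs surface as "Expecting ..." errors. Recovery options: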

  • Prefill the assistant turn with the partial JSON and prompt continuation
  • Re-prompt from scratch with max_tokens increased
  • Use streaming and cut off cleanly at the last complete top-level value
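
A minimal sketch of detection and of the last recovery option, assuming the output is a JSON array of objects. The helper names are hypothetical; the stop-signal values are the ones the Anthropic and OpenAI APIs report when the token limit is hit.

```python
import json

def has_unclosed_structure(text: str) -> bool:
    """True if the text ends inside a string or with unclosed brackets."""
    depth, in_string, escaped = 0, False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
    return in_string or depth > 0

def looks_truncated(text: str, stop_reason=None) -> bool:
    """Prefer the API stop signal; fall back to a structural check."""
    return stop_reason in ("max_tokens", "length") or has_unclosed_structure(text)

def salvage_array_items(text: str) -> list:
    """Keep the complete leading elements of a truncated JSON array."""
    decoder = json.JSONDecoder()
    pos = text.find("[") + 1
    items = []
    if pos == 0:  # no opening bracket found
        return items
    while True:
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the element at this position is incomplete
        items.append(item)
    return items

partial = '[{"drug": "warfarin"}, {"drug": "aspi'
print(looks_truncated(partial), salvage_array_items(partial))
```

One caveat: a bare number at the very end of the text may itself have been cut short, so this approach is safest when the elements are objects or strings.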

Q: How would you design a clinical data extraction pipeline?

Python
# High-level design:
# 1. Parse the clinical note (handle HL7, FHIR, plain text)
# 2. Chunk if necessary (long notes exceed context)
# 3. Call the LLM with structured extraction prompt
# 4. Parse and validate JSON output (Pydantic schema)
# 5. Apply domain rules (Warfarin -> INR required)
# 6. Retry on validation failures (max 2 retries)
# 7. Log all failures for review
# 8. Write to structured store (SQL, FHIR resource)

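import json
import re

from pydantic import BaseModel, ValidationError

# Placeholder: the real system prompt would carry the schema and null conventions.
EXTRACTION_SYSTEM_PROMPT = "Extract the requested fields. Respond ONLY with valid JSON."

def strip_markdown(text: str) -> str:
    """Return the body of a fenced code block if present, else the trimmed text."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    return (match.group(1) if match else text).strip()

class ClinicalExtractor:
    def __init__(self, client, schema: type[BaseModel], max_attempts: int = 3):
        self.client = client          # an Anthropic client instance
        self.schema = schema
        self.max_attempts = max_attempts

    def extract(self, note: str) -> BaseModel:
        messages = [{"role": "user", "content": f"<note>{note}</note>"}]
        for attempt in range(self.max_attempts):
            response = self.client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=2048,
                system=EXTRACTION_SYSTEM_PROMPT,
                messages=messages,
            )
            text = response.content[0].text
            try:
                raw = json.loads(strip_markdown(text))
                return self.schema.model_validate(raw)  # Pydantic v2 API
            except (json.JSONDecodeError, ValidationError) as e:
                if attempt == self.max_attempts - 1:
                    raise
                # Feed the specific error back so the next attempt can correct it
                messages.append({"role": "assistant", "content": text})
                messages.append({"role": "user", "content": f"Invalid output: {e}. Return corrected JSON only."})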
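
Step 5 (domain rules) is not shown in the class above. A minimal sketch of the Warfarin/INR rule could look like the following; the record layout (medications and labs as lists of dicts with a name key) is an assumption made for illustration, not part of the lesson's schema.

```python
def check_domain_rules(record: dict) -> list:
    """Return human-readable rule violations for an extracted record."""
    violations = []
    # Normalise names so "Warfarin" and "warfarin" are treated alike.
    meds = {(m.get("name") or "").lower() for m in record.get("medications", [])}
    labs = {(lab.get("name") or "").upper() for lab in record.get("labs", [])}
    if "warfarin" in meds and "INR" not in labs:
        violations.append("warfarin is listed but no INR value was extracted")
    return violations

record = {"medications": [{"name": "Warfarin", "dose": "5 mg"}], "labs": []}
print(check_domain_rules(record))
```

Violations like these are usually better logged for review than retried blindly, because the INR may genuinely be absent from the note.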

Q: How do you handle enums reliably?

Enumerate all valid values explicitly in the prompt. Show a concrete example of each value. Specify the default if the value is ambiguous. Validate the enum value after parsing — reject and retry if the model invented a new value.

Python
# In the prompt:
# 'urgency' MUST be exactly one of:
# - "low" (can wait for routine follow-up)
# - "medium" (needs attention within 24 hours)
# - "high" (needs immediate action)
# Default to "medium" if unclear.

# In validation:
from typing import Literal

from pydantic import BaseModel

class ClinicalSummary(BaseModel):
    urgency: Literal["low", "medium", "high"]  # any other value raises ValidationError
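
One way to keep the prompt text and the validator from drifting apart is to generate both from a single definition. The sketch below is stdlib-only, and its names (URGENCY_LEVELS, urgency_prompt_section, is_valid_urgency) are hypothetical; with Pydantic, the Literal type above would do the validation instead.

```python
# Single source of truth: allowed value -> meaning shown to the model.
URGENCY_LEVELS = {
    "low": "can wait for routine follow-up",
    "medium": "needs attention within 24 hours",
    "high": "needs immediate action",
}
DEFAULT_URGENCY = "medium"

def urgency_prompt_section() -> str:
    """Render the enum instructions that go into the prompt."""
    lines = ["'urgency' MUST be exactly one of:"]
    lines += [f'- "{value}" ({meaning})' for value, meaning in URGENCY_LEVELS.items()]
    lines.append(f'Default to "{DEFAULT_URGENCY}" if unclear.')
    return "\n".join(lines)

def is_valid_urgency(value) -> bool:
    """Exact, case-sensitive match; anything else should trigger a retry."""
    return isinstance(value, str) and value in URGENCY_LEVELS

print(urgency_prompt_section())
print(is_valid_urgency("high"), is_valid_urgency("critical"))
```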

Q: When would you use tool use instead of a JSON prompt?

Tool use / function calling is preferred when:

  • Schema compliance is critical (medical, legal) — sampling-level enforcement
  • You have multiple possible output structures and the model should "decide" which tool to call
  • The output may be very long and schema enforcement keeps it from drifting off-structure (length itself is still bounded only by max_tokens)
  • You need typed arguments without a parsing step

JSON prompt is acceptable when:

  • The model is small and doesn't support tool use
  • Schema is simple and rarely changes
  • You need the output embedded in a larger text response rather than isolated
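
For comparison with a JSON prompt, a tool definition in the shape the Anthropic Messages API expects (name, description, input_schema) might look like this. The tool name and fields are hypothetical, and the client call is shown only as a comment because it is an assumed usage that is not executed here.

```python
import json

RECORD_SUMMARY_TOOL = {
    "name": "record_clinical_summary",  # hypothetical tool name
    "description": "Record the structured summary extracted from a clinical note.",
    "input_schema": {
        "type": "object",
        "properties": {
            "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
            "summary": {"type": "string"},
        },
        "required": ["urgency", "summary"],
    },
}

# Forcing this tool means the reply is a tool call rather than free text.
TOOL_CHOICE = {"type": "tool", "name": RECORD_SUMMARY_TOOL["name"]}

# Assumed usage with the Anthropic Python SDK (not executed here):
# response = client.messages.create(..., tools=[RECORD_SUMMARY_TOOL], tool_choice=TOOL_CHOICE)
# args = next(b.input for b in response.content if b.type == "tool_use")

print(json.dumps(RECORD_SUMMARY_TOOL["input_schema"], indent=2))
```

The returned arguments arrive as a dict, so the markdown-stripping and regex fallback steps disappear, though validating the dict is still worthwhile.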

Q: How do you handle partial extractions?

When a field is missing from the source (e.g., no dose is mentioned for a medication):

  1. In the prompt: define the null convention explicitly (null vs "" vs "Not documented")
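  2. In the schema: use Optional[str] or str | None so Pydantic accepts null; add a default of None if the key itself may be omitted, since in Pydantic v2 an Optional field without a default is still required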
  3. In business logic: handle null explicitly; don't assume all fields are present
  4. Don't retry just because a field is null — that's correct behaviour for missing data
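
Points 1 and 3 can be sketched as follows. The sentinel list and function names are hypothetical; the normalisation step is a defensive extra for cases where the model ignores the null convention, not a replacement for stating that convention in the prompt.

```python
from typing import Optional

# Strings the model might emit instead of null, despite the prompt's convention.
MISSING_SENTINELS = {"", "n/a", "not documented", "unknown"}

def normalize_missing(value: Optional[str]) -> Optional[str]:
    """Map sentinel strings to None so business logic has one case to handle."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned.lower() in MISSING_SENTINELS else cleaned

def describe_medication(med: dict) -> str:
    """Handle a missing dose explicitly instead of assuming it is present."""
    name = med["name"]
    dose = normalize_missing(med.get("dose"))
    return f"{name} {dose}" if dose is not None else f"{name} (dose not documented)"

print(describe_medication({"name": "warfarin", "dose": None}))
print(describe_medication({"name": "aspirin", "dose": "75 mg"}))
```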

Interview Answer Template

"Reliable structured extraction from LLMs uses a layered approach: TypeScript-style schema in the prompt, API-level enforcement via Structured Outputs or tool use, Pydantic validation after parsing, and a 2-3 attempt retry loop that feeds the specific error back to the model. JSON mode guarantees syntax but not schema compliance — prefer Structured Outputs for strict schema requirements. Define null conventions explicitly in the prompt, use Literal types for enums, and set max_tokens generously to prevent truncation. Log all validation failures: they reveal either prompt gaps or genuinely ambiguous inputs that need special handling."