Interview: Design a Structured Output Prompt — Prompt Engineering Mastery | Learnixo

Q: How do you reliably get JSON from an LLM?

Layered approach:

Prompt-level: specify the schema in TypeScript-style types, say "respond ONLY with valid JSON," and, where the API supports it (e.g. Anthropic's Messages API), prefill the assistant turn with { to steer the model straight into JSON.
API-level: use JSON mode (OpenAI) for syntactic validity, or Structured Outputs / strict tool use for schema enforcement, where the schema constrains sampling rather than merely being described.
Parsing: wrap json.loads() in error handling; strip markdown code blocks before parsing; try regex extraction as fallback.
Validation: validate parsed JSON against a Pydantic model or JSON Schema.
Retry: on failure, append the specific error to the conversation and re-prompt; cap at 2-3 retries.
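
The parsing and validation layers above can be sketched as follows. This is a minimal illustration, assuming Pydantic v2; the Medication schema and the helper names extract_json_text and parse_and_validate are hypothetical, not taken from any library.

```python
import json
import re
from typing import Optional

from pydantic import BaseModel, ValidationError


class Medication(BaseModel):  # hypothetical schema, for illustration only
    name: str
    dose_mg: float


def extract_json_text(text: str) -> str:
    """Prefer the contents of a Markdown fence; else fall back to the outermost {...}."""
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    braces = re.search(r"\{.*\}", text, re.DOTALL)  # greedy: first "{" to last "}"
    return braces.group(0) if braces else text.strip()


def parse_and_validate(text: str) -> tuple[Optional[Medication], Optional[str]]:
    """Return (model, None) on success, or (None, error text) to feed into a retry."""
    try:
        return Medication.model_validate(json.loads(extract_json_text(text))), None
    except (json.JSONDecodeError, ValidationError) as exc:
        return None, str(exc)


# A reply wrapped in chatty prose still parses via the regex fallback.
med, error = parse_and_validate('Sure! {"name": "warfarin", "dose_mg": 5} Hope that helps.')

# Valid JSON that misses a required field yields an error message for the retry turn.
missing, retry_hint = parse_and_validate('{"name": "warfarin"}')
```

Returning the error text instead of raising keeps the retry layer simple: the caller appends retry_hint to the conversation and re-prompts until the attempt cap is reached.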

Q: What is the difference between JSON mode and Structured Outputs?

JSON mode (OpenAI): guarantees the output is syntactically valid JSON, provided generation is not cut off by the token limit. It does not guarantee schema compliance: the model may return different field names or types, because the schema is only described in the prompt and is not enforced at sampling.

Structured Outputs (OpenAI, via response_format with a JSON Schema, which the Python SDK can derive from a Pydantic model): enforces the schema through constrained decoding, masking tokens that would violate it during sampling. A completed response therefore conforms to the schema; the exceptions are refusals and responses truncated by max_tokens. The SDK's parse helper returns the populated Pydantic model, so you do not call json.loads() yourself.

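Tool use / function calling (OpenAI + Anthropic): the model generates arguments for a tool whose input is described by a JSON Schema. Where the provider offers a strict mode (for example strict: true on OpenAI function definitions), the schema is enforced at sampling, as with Structured Outputs. Without strict mode the model is trained to follow the schema but conformance is not guaranteed, so validate the arguments anyway.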
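
To make the distinction concrete, the sketch below shows an output that would pass JSON mode (it parses) yet fails schema validation. It assumes Pydantic v2; the LabResult schema and the sample string are invented for illustration and are not real model output.

```python
import json

from pydantic import BaseModel, ValidationError


class LabResult(BaseModel):  # hypothetical target schema
    test_name: str
    value: float


# Plausible JSON-mode output: valid syntax, but a different key and a non-numeric value.
json_mode_output = '{"test": "INR", "value": "elevated"}'

parsed = json.loads(json_mode_output)  # succeeds: the syntax is valid

try:
    LabResult.model_validate(parsed)
    schema_ok = True
    failed_fields = []
except ValidationError as exc:
    schema_ok = False
    failed_fields = sorted(str(err["loc"][0]) for err in exc.errors())

# This JSON Schema is the kind of artifact a schema-enforcing API is given,
# so that the mismatch above cannot be sampled in the first place.
schema = LabResult.model_json_schema()
```

Here failed_fields names both the missing test_name and the unparseable value, which is exactly the class of error that sampling-level enforcement rules out and that JSON mode leaves to your validation layer.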

Q: How do you handle truncated JSON output?

Prevention is better than cure: estimate the output size and set max_tokens with generous headroom above it. For streaming responses, monitor the running token count against the budget.

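For recovery: detect truncation first from the API's stop reason (finish_reason == "length" on OpenAI, stop_reason == "max_tokens" on Anthropic). A json.JSONDecodeError is a secondary signal: its message often begins with "Unterminated string" or "Expecting" when the output was cut off, but the exact wording depends on where the cut fell. Recovery options: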

Prefill the assistant turn with the partial JSON and prompt continuation
Re-prompt from scratch with max_tokens increased
Use streaming and cut off cleanly at the last complete top-level value
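
As a rough sketch of detection plus the "cut off cleanly" option, the code below checks the stop reason and salvages the complete leading elements of a truncated JSON array. The function names are hypothetical. It uses only the standard library's json.JSONDecoder.raw_decode, and it is intended for arrays of objects or strings: a bare number cut mid-digits would be accepted as a shorter number.

```python
import json


def hit_token_limit(stop_reason: str) -> bool:
    """True for Anthropic's "max_tokens" stop_reason or OpenAI's "length" finish_reason."""
    return stop_reason in {"max_tokens", "length"}


def salvage_array_items(text: str) -> list:
    """Return the complete leading elements of a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    pos = text.find("[")
    if pos == -1:
        return []
    pos += 1
    items = []
    while True:
        # Skip whitespace and the comma that separates elements.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            return items
        try:
            item, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            return items  # the remaining element was cut off mid-value
        items.append(item)


truncated = '[{"drug": "warfarin", "dose": "5 mg"}, {"drug": "aspirin", "dose": "81 mg"}, {"drug": "metf'
recovered = salvage_array_items(truncated)
```

Salvaged items should still go through schema validation, and the caller needs to record that the list is incomplete rather than treating it as the full extraction.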

Q: How would you design a clinical data extraction pipeline?

Python

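# High-level design:
# 1. Parse the clinical note (handle HL7, FHIR, plain text)
# 2. Chunk if necessary (long notes exceed context)
# 3. Call the LLM with structured extraction prompt
# 4. Parse and validate JSON output (Pydantic schema)
# 5. Apply domain rules (Warfarin → INR required)
# 6. Retry on validation failures (max 2 retries)
# 7. Log all failures for review
# 8. Write to structured store (SQL, FHIR resource)

import json
import re

from pydantic import BaseModel, ValidationError

# Placeholder; the real prompt would carry the schema and null conventions.
EXTRACTION_SYSTEM_PROMPT = "Extract the requested fields. Respond ONLY with valid JSON."


def strip_markdown(text: str) -> str:
    """Return the contents of a Markdown code fence if present, else the text."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    return (match.group(1) if match else text).strip()


class ClinicalExtractor:
    MAX_ATTEMPTS = 3  # one initial call plus at most 2 retries

    def __init__(self, client, schema: type[BaseModel]):
        self.client = client  # assumed: an anthropic.Anthropic client
        self.schema = schema

    def extract(self, note: str) -> BaseModel:
        messages = [{"role": "user", "content": f"<note>{note}</note>"}]
        for attempt in range(self.MAX_ATTEMPTS):
            response = self.client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=2048,
                system=EXTRACTION_SYSTEM_PROMPT,
                messages=messages,
            )
            text = response.content[0].text
            try:
                raw = json.loads(strip_markdown(text))
                return self.schema.model_validate(raw)  # Pydantic v2
            except (json.JSONDecodeError, ValidationError) as e:
                if attempt == self.MAX_ATTEMPTS - 1:
                    raise
                # Feed the specific error back so the next attempt can fix it.
                messages += [
                    {"role": "assistant", "content": text},
                    {"role": "user", "content": f"Invalid output: {e}\nReturn corrected JSON only."},
                ]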
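
Step 5 of the design (domain rules) is not covered by the class above. One possible shape for it, assuming Pydantic v2, is sketched here; the MedicationRecord schema and the check_domain_rules function are hypothetical names chosen for this example.

```python
from typing import Optional

from pydantic import BaseModel


class MedicationRecord(BaseModel):  # hypothetical schema for one extracted note
    medications: list[str]
    inr: Optional[float] = None  # null when the note reports no INR


def check_domain_rules(record: MedicationRecord) -> list[str]:
    """Return human-readable rule violations; an empty list means the record passes."""
    issues = []
    on_warfarin = any(m.strip().lower() == "warfarin" for m in record.medications)
    if on_warfarin and record.inr is None:
        issues.append("warfarin listed but no INR extracted")
    return issues


flagged = check_domain_rules(MedicationRecord(medications=["Warfarin", "aspirin"]))
clean = check_domain_rules(MedicationRecord(medications=["Warfarin"], inr=2.5))
```

Returning a list of issues, rather than raising a validation error, is a deliberate choice in this sketch: a missing INR may simply be absent from the note, so the record can be logged and routed to review (step 7) instead of triggering a retry.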

Q: How do you handle enums reliably?

Enumerate all valid values explicitly in the prompt. Show a concrete example of each value. Specify the default if the value is ambiguous. Validate the enum value after parsing — reject and retry if the model invented a new value.

Python

# In the prompt:
# 'urgency' MUST be exactly one of:
# - "low" (can wait for routine follow-up)
# - "medium" (needs attention within 24 hours)
# - "high" (needs immediate action)
# Default to "medium" if unclear.

# In validation:
from typing import Literal
from pydantic import BaseModel

class ClinicalSummary(BaseModel):
    urgency: Literal["low", "medium", "high"]  # any other value raises ValidationError
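
The "reject and retry" step might look like the sketch below, which turns an invented enum value into a specific message for the retry turn. It assumes Pydantic v2; urgency_retry_message is a hypothetical helper, and the allowed values are read from the Literal type so the message cannot drift from the schema.

```python
from typing import Literal, Optional, get_args

from pydantic import BaseModel, ValidationError

Urgency = Literal["low", "medium", "high"]


class ClinicalSummary(BaseModel):
    urgency: Urgency


def urgency_retry_message(payload: dict) -> Optional[str]:
    """Return None if the payload validates, else a message to append to the retry prompt."""
    try:
        ClinicalSummary.model_validate(payload)
        return None
    except ValidationError:
        allowed = ", ".join(get_args(Urgency))
        return f"'urgency' was {payload.get('urgency')!r}; it must be exactly one of: {allowed}"


ok = urgency_retry_message({"urgency": "high"})
invented = urgency_retry_message({"urgency": "critical"})
```

Note that Literal matching is case-sensitive, so "High" is rejected as well; if that is too strict for your data, normalise case before validation rather than widening the enum.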

Q: When would you use tool use instead of a JSON prompt?

Tool use / function calling is preferred when:

Schema compliance is critical (medical, legal) and the provider offers strict, sampling-level enforcement
You have multiple possible output structures and the model should "decide" which tool to call
The output is large or deeply nested, where a declared schema keeps the structure consistent (max_tokens still bounds the length)
You want arguments returned as a structured object rather than extracting JSON from free text

JSON prompt is acceptable when:

The model is small and doesn't support tool use
Schema is simple and rarely changes
You need the output embedded in a larger text response rather than isolated
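
For the tool-use route, a tool definition can be derived from the same Pydantic model used for validation. The sketch below builds an Anthropic-style tool dictionary (name, description, input_schema); it assumes Pydantic v2, and build_tool and record_summary are hypothetical names. The API call is left as a comment because it needs a live client.

```python
from typing import Literal

from pydantic import BaseModel


class ClinicalSummary(BaseModel):
    urgency: Literal["low", "medium", "high"]
    summary: str


def build_tool(schema: type[BaseModel], name: str, description: str) -> dict:
    """Build a tool definition whose input schema is generated from a Pydantic model."""
    return {
        "name": name,
        "description": description,
        "input_schema": schema.model_json_schema(),
    }


tool = build_tool(ClinicalSummary, "record_summary", "Record the extracted clinical summary.")

# Assumed usage with an anthropic.Anthropic client (not executed here):
# response = client.messages.create(
#     model="claude-sonnet-4-6",
#     max_tokens=1024,
#     tools=[tool],
#     tool_choice={"type": "tool", "name": "record_summary"},  # force this tool
#     messages=[{"role": "user", "content": "<note>...</note>"}],
# )
# block = next(b for b in response.content if b.type == "tool_use")
# summary = ClinicalSummary.model_validate(block.input)  # still validate the arguments
```

Generating the schema from the model keeps the tool definition and the validation step in sync, so a field added to ClinicalSummary shows up in both places.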

Q: How do you handle partial extractions?

When a field is missing from the source (e.g., no dose is mentioned for a medication):

In the prompt: define the null convention explicitly (null vs "" vs "Not documented")
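In the schema: use Optional[str] or str | None so Pydantic accepts null; add a default (= None) if the model may omit the key entirely, since in Pydantic v2 an Optional field without a default is still required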
In business logic: handle null explicitly; don't assume all fields are present
Don't retry just because a field is null — that's correct behaviour for missing data
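
A small sketch of the null convention in practice, assuming Pydantic v2; MedicationEntry and describe are hypothetical names. The dose field accepts an explicit null and, because it has a default, an omitted key as well.

```python
from typing import Optional

from pydantic import BaseModel


class MedicationEntry(BaseModel):  # hypothetical schema
    name: str
    dose: Optional[str] = None  # null, or key omitted, when the note gives no dose


def describe(entry: MedicationEntry) -> str:
    """Business logic handles the missing dose explicitly instead of assuming it exists."""
    dose = entry.dose if entry.dose is not None else "dose not documented"
    return f"{entry.name}: {dose}"


explicit_null = MedicationEntry.model_validate({"name": "aspirin", "dose": None})
omitted = MedicationEntry.model_validate({"name": "aspirin"})
with_dose = MedicationEntry.model_validate({"name": "warfarin", "dose": "5 mg"})
```

Both explicit_null and omitted validate without error, so neither would trigger the retry loop; the display string "dose not documented" is produced downstream rather than asked of the model.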

Interview Answer Template

"Reliable structured extraction from LLMs uses a layered approach: TypeScript-style schema in the prompt, API-level enforcement via Structured Outputs or tool use, Pydantic validation after parsing, and a 2-3 attempt retry loop that feeds the specific error back to the model. JSON mode guarantees syntax but not schema compliance — prefer Structured Outputs for strict schema requirements. Define null conventions explicitly in the prompt, use Literal types for enums, and set max_tokens generously to prevent truncation. Log all validation failures: they reveal either prompt gaps or genuinely ambiguous inputs that need special handling."