Prompt Engineering Mastery · Lesson 23 of 24

Meta-Prompting: Using LLMs to Improve Prompts

What is Meta-Prompting?

Meta-prompting uses an LLM to generate or improve prompts for other LLM calls. Instead of hand-crafting prompts, you describe what you want the prompt to accomplish and let the model generate it.

Applications:

  • Prompt generation: Describe a task, get a prompt that does it well
  • Prompt optimization: Give an existing prompt and examples of failures, get an improved version
  • Test case generation: Generate diverse inputs to test prompt robustness
  • Evaluation criteria: Generate rubrics for judging output quality

Generating Prompts for Specific Tasks

Python
from openai import OpenAI

client = OpenAI()

def generate_prompt_for_task(
    task_description: str,
    audience: str,
    constraints: list[str] = None,
    examples: list[dict] = None,
) -> str:
    """Generate an effective system prompt for a described task."""

    meta_prompt = f"""You are a prompt engineering expert. Generate an effective system prompt for the following task.

TASK DESCRIPTION:
{task_description}

TARGET AUDIENCE:
{audience}

CONSTRAINTS:
{chr(10).join(f'- {c}' for c in (constraints or ['None specified']))}

{"EXAMPLE INPUTS AND DESIRED OUTPUTS:" if examples else ""}
{chr(10).join(f"Input: {ex['input']}{chr(10)}Desired output: {ex['desired_output']}" for ex in (examples or []))}

Generate a system prompt that:
1. Clearly defines the assistant's role and expertise
2. Specifies the output format and length
3. Includes any necessary constraints or guardrails
4. Is specific enough to produce consistent results
5. Includes any domain-specific context needed

Return ONLY the system prompt text, no explanations or meta-commentary."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content

# Example: Generate a prompt for a clinical decision support tool
generated_prompt = generate_prompt_for_task(
    task_description="Help pharmacists identify dangerous drug interactions in patient medication lists and recommend management",
    audience="Licensed clinical pharmacists at a hospital",
    constraints=[
        "Do not make specific dosing recommendations for individual patients",
        "Always specify interaction severity (major/moderate/minor)",
        "Responses should be under 300 words",
        "Use standard pharmacology terminology",
    ],
    examples=[
        {
            "input": "Patient on warfarin 5mg daily, starting clarithromycin 500mg BID",
            "desired_output": "MAJOR interaction. Clarithromycin inhibits CYP2C9 and CYP3A4, increasing warfarin levels. Action: Monitor INR within 3-5 days; anticipate INR increase of 30-50%; consider empirical warfarin dose reduction of 25-50%.",
        }
    ]
)

print("Generated System Prompt:")
print(generated_prompt)
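
The meta-prompt above is assembled inside one large f-string, which leaves blank lines behind when no examples are passed. A minimal sketch of an alternative, assuming a hypothetical helper named build_meta_prompt (not part of the lesson's code) that builds the text section by section and skips empty sections. It makes no API call, so the assembled text can be inspected before any tokens are spent.

```python
def build_meta_prompt(task_description, audience, constraints=None, examples=None):
    """Assemble the meta-prompt text; empty sections are omitted entirely.

    Hypothetical helper for illustration. Examples are expected to be dicts
    with "input" and "desired_output" keys, matching the lesson's example call.
    """
    sections = [
        "You are a prompt engineering expert. "
        "Generate an effective system prompt for the following task.",
        f"TASK DESCRIPTION:\n{task_description}",
        f"TARGET AUDIENCE:\n{audience}",
    ]
    if constraints:
        bullet_list = "\n".join(f"- {c}" for c in constraints)
        sections.append(f"CONSTRAINTS:\n{bullet_list}")
    if examples:
        rendered = "\n\n".join(
            f"Input: {ex['input']}\nDesired output: {ex['desired_output']}"
            for ex in examples
        )
        sections.append(f"EXAMPLE INPUTS AND DESIRED OUTPUTS:\n{rendered}")
    sections.append(
        "Return ONLY the system prompt text, no explanations or meta-commentary."
    )
    # Sections are separated by one blank line each.
    return "\n\n".join(sections)
```

The returned string could be passed as the user message in place of the inline f-string.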

Automatic Prompt Optimization

Given examples of failures, automatically improve a prompt:

Python
def optimize_prompt(
    current_prompt: str,
    failure_cases: list[dict],  # {"input": str, "bad_output": str, "desired_output": str}
    n_iterations: int = 3,
) -> str:
    """Iteratively improve a prompt using failure cases."""

    prompt_to_optimize = current_prompt

    for iteration in range(n_iterations):
        # Test current prompt on failure cases
        current_failures = []
        for case in failure_cases:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": prompt_to_optimize},
                    {"role": "user", "content": case["input"]},
                ],
                temperature=0,
            )
            output = response.choices[0].message.content

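            # Crude placeholder check: only near-empty outputs are counted as failures.
            # Replace this with a real comparison against case["desired_output"] before relying on it.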
            if case.get("desired_output") and len(output) < 10:
                current_failures.append({
                    "input": case["input"],
                    "actual": output,
                    "desired": case["desired_output"],
                })

        if not current_failures:
            print(f"Prompt satisfactory after {iteration} optimizations")
            break

        # Ask the meta-model to improve the prompt
        failure_text = "\n\n".join(
            f"Input: {f['input']}\nActual output: {f['actual']}\nDesired output: {f['desired']}"
            for f in current_failures
        )

        optimization_prompt = f"""Current system prompt:
---
{prompt_to_optimize}
---

This prompt fails on these cases:
{failure_text}

Analyze why the prompt fails and generate an improved version.
The improved prompt should:
1. Handle the failure cases correctly
2. Not break any other cases not shown here
3. Be specific about how to handle the problematic patterns

Return ONLY the improved prompt text."""

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": optimization_prompt}],
            temperature=0.3,
        )
        prompt_to_optimize = response.choices[0].message.content
        print(f"Iteration {iteration + 1}: prompt updated")

    return prompt_to_optimize

# Example optimization
initial_prompt = "You are a helpful clinical pharmacist."

failures = [
    {
        "input": "What's the interaction between warfarin and ibuprofen?",
        "bad_output": "These drugs interact.",
        "desired_output": "MAJOR interaction: Ibuprofen inhibits platelet function (pharmacodynamic) and may displace warfarin from protein binding (pharmacokinetic), increasing bleeding risk. Monitor INR closely; consider acetaminophen as an alternative analgesic.",
    },
    {
        "input": "Is my patient on too many medications?",
        "bad_output": "I can tell you about polypharmacy risks.",
        "desired_output": "This requires patient-specific clinical assessment. I can evaluate specific drug combinations for interactions, redundant mechanisms, or inappropriate polypharmacy patterns if you share the medication list.",
    },
]

improved_prompt = optimize_prompt(initial_prompt, failures, n_iterations=2)
print("\nImproved Prompt:")
print(improved_prompt)
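
The optimization loop depends on a check that decides whether an output still counts as a failure. One inexpensive option, sketched here as an assumption rather than as the lesson's method, is word overlap with the desired output. The names token_overlap and is_failure and the 0.5 threshold are illustrative, and an LLM judge or a domain-specific check may suit clinical text better.

```python
import re


def token_overlap(output: str, desired: str) -> float:
    """Fraction of the desired output's distinct words that also appear in the output."""
    desired_tokens = set(re.findall(r"[a-z0-9]+", desired.lower()))
    if not desired_tokens:
        return 1.0  # nothing to match against, so treat as fully covered
    output_tokens = set(re.findall(r"[a-z0-9]+", output.lower()))
    return len(desired_tokens & output_tokens) / len(desired_tokens)


def is_failure(output: str, desired: str, threshold: float = 0.5) -> bool:
    """Flag an output as a failure when it covers too little of the desired wording."""
    return token_overlap(output, desired) < threshold
```

Inside optimize_prompt, a check like is_failure(output, case["desired_output"]) could stand in for the length test. Word overlap rewards shared vocabulary rather than correctness, so it is best treated as a coarse first filter.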

Generating Test Cases

Use meta-prompting to generate diverse test inputs:

Python
def generate_test_cases(
    task_description: str,
    n_cases: int = 20,
    coverage: list[str] = None,
) -> list[dict]:
    """Generate diverse test cases for a task."""
    default_coverage = [
        "simple straightforward cases",
        "edge cases and boundary conditions",
        "ambiguous or unclear inputs",
        "out-of-scope requests",
        "potentially harmful requests",
        "very long inputs",
        "inputs with typos or formatting issues",
        "multilingual inputs",
    ]

    coverage_areas = coverage or default_coverage

    meta_prompt = f"""Generate {n_cases} diverse test cases for this task:
{task_description}

Cover these areas:
{chr(10).join(f'- {c}' for c in coverage_areas)}

Return a JSON object with a single "test_cases" key:
{{
  "test_cases": [
    {{
      "input": "the test input",
      "category": "which coverage area this tests",
      "expected_behavior": "what the system should do (not the exact output, just the behavior)"
    }}
  ]
}}

Make the test cases realistic and varied. Include edge cases that would reveal prompt weaknesses."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
        response_format={"type": "json_object"},
        temperature=0.7,  # Higher temp for diversity
    )

    import json
    raw = json.loads(response.choices[0].message.content)
    # Handle both {"test_cases": [...]} and [...] formats
    return raw if isinstance(raw, list) else raw.get("test_cases", raw.get("cases", []))

# Generate tests for our clinical pharmacist
test_cases = generate_test_cases(
    task_description="Clinical pharmacist assistant that answers drug interaction and dosing questions for hospital pharmacists",
    n_cases=15,
)

print(f"Generated {len(test_cases)} test cases:")
for case in test_cases[:5]:
    print(f"\n[{case['category']}]")
    print(f"Input: {case['input']}")
    print(f"Expected: {case['expected_behavior']}")
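
Generated test cases are model output, so some entries may be malformed or lack a field. A small sketch of a validation pass, assuming the three keys requested in the meta-prompt above. The names validate_test_cases and coverage_counts are hypothetical helpers, not part of the lesson's code.

```python
from collections import Counter

REQUIRED_KEYS = ("input", "category", "expected_behavior")


def validate_test_cases(cases):
    """Keep only dicts where every required key holds a non-empty string."""
    valid = []
    for case in cases:
        if not isinstance(case, dict):
            continue
        if all(isinstance(case.get(k), str) and case[k].strip() for k in REQUIRED_KEYS):
            valid.append(case)
    return valid


def coverage_counts(cases):
    """Count validated cases per category to reveal thin coverage areas."""
    return Counter(case["category"] for case in cases)
```

Running validate_test_cases on the generated list before printing avoids a KeyError on incomplete entries, and coverage_counts shows whether any requested area received no cases.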

Generating Evaluation Rubrics

Python
def generate_evaluation_rubric(
    task_description: str,
    output_type: str = "clinical advice",
) -> dict:
    """Generate evaluation criteria for a specific task."""
    meta_prompt = f"""Create an evaluation rubric for assessing LLM outputs for this task:
Task: {task_description}
Output type: {output_type}

Generate 5-7 evaluation criteria. For each criterion:
1. A name for the criterion
2. A description of what it measures
3. A scoring scale (1-5) with descriptions for scores 1, 3, and 5
4. An example of a high-scoring output (score 5)
5. An example of a low-scoring output (score 1)

Return as JSON:
{{
  "rubric_name": "...",
  "criteria": [
    {{
      "name": "...",
      "description": "...",
      "scores": {{
        "1": "description of score 1",
        "3": "description of score 3",
        "5": "description of score 5"
      }},
      "high_score_example": "...",
      "low_score_example": "..."
    }}
  ]
}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
        response_format={"type": "json_object"},
        temperature=0.2,
    )

    import json
    return json.loads(response.choices[0].message.content)

rubric = generate_evaluation_rubric(
    task_description="Clinical pharmacist assistant answering drug interaction questions",
    output_type="clinical drug interaction advice"
)

print(f"Rubric: {rubric['rubric_name']}")
for criterion in rubric.get("criteria", []):
    print(f"\n- {criterion['name']}: {criterion['description']}")
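
Once a rubric exists, per-criterion scores from a human or LLM judge need to be combined into one number. A minimal sketch, assuming an unweighted mean over the 1-5 scale requested above. The function name aggregate_rubric_scores and the scores mapping are illustrative assumptions, and a weighted scheme may fit better when some criteria matter more than others.

```python
def aggregate_rubric_scores(rubric: dict, scores: dict) -> float:
    """Return the mean score across all rubric criteria.

    `scores` maps criterion name to an integer from 1 to 5. Raises ValueError
    if the rubric is empty, a criterion is unscored, or a score is out of range.
    """
    names = [criterion["name"] for criterion in rubric.get("criteria", [])]
    if not names:
        raise ValueError("rubric has no criteria")
    missing = [name for name in names if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for name in names:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score for {name!r} is outside the 1-5 scale")
    return sum(scores[name] for name in names) / len(names)
```

Failing loudly on a missing criterion keeps a partially scored output from looking better or worse than it is.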

APE: Automatic Prompt Engineer

A simple implementation of automated prompt search:

Python
def automatic_prompt_engineer(
    task: str,
    example_inputs: list[str],
    example_outputs: list[str],
    n_candidate_prompts: int = 10,
    evaluation_model: str = "gpt-4o-mini",
) -> str:
    """Generate candidate prompts and select the best one."""

    # Step 1: Generate candidate prompts
    generation_prompt = f"""Task description: {task}

Generate {n_candidate_prompts} different system prompt candidates for this task.
Each should have a different emphasis or approach.
Return a JSON object with one key, "prompts", whose value is a list of strings."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": generation_prompt}],
        response_format={"type": "json_object"},
        temperature=0.8,
    )

    import json
    candidates_raw = json.loads(response.choices[0].message.content)
    candidates = candidates_raw if isinstance(candidates_raw, list) else list(candidates_raw.values())[0]

    # Step 2: Evaluate each candidate on example inputs
    scores = []
    for candidate_prompt in candidates:
        candidate_score = 0

        for inp, expected_out in zip(example_inputs, example_outputs):
            actual = client.chat.completions.create(
                model=evaluation_model,
                messages=[
                    {"role": "system", "content": candidate_prompt},
                    {"role": "user", "content": inp},
                ],
                temperature=0,
            ).choices[0].message.content

            # Score using another LLM call (or string similarity for simple tasks)
            score_response = client.chat.completions.create(
                model=evaluation_model,
                messages=[{"role": "user", "content": f"""Rate how well this output matches the expected:
Expected: {expected_out}
Actual: {actual}
Score 1-10. Return only the integer."""}],
                temperature=0,
            )
            try:
                candidate_score += int(score_response.choices[0].message.content.strip())
            except ValueError:
                candidate_score += 5

        scores.append(candidate_score)

    # Step 3: Return best candidate
    best_idx = scores.index(max(scores))
    print(f"Best prompt score: {scores[best_idx]} (out of {len(example_inputs) * 10} max)")
    return candidates[best_idx]
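
The scoring step calls int() directly on the judge's reply, which falls back to the default whenever the model adds any extra text such as "Score: 7/10". A small sketch of a more tolerant parser, assuming the first integer in the reply is the intended score. The name parse_score and its defaults are hypothetical.

```python
import re


def parse_score(text, low=1, high=10, default=5):
    """Extract the first integer from a judge reply and clamp it to [low, high].

    Returns `default` when the reply is empty, None, or contains no integer.
    """
    match = re.search(r"-?\d+", text or "")
    if match is None:
        return default
    return max(low, min(high, int(match.group())))
```

Clamping also keeps a stray out-of-range reply from dominating a candidate's total.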

When to Use Meta-Prompting

| Scenario | Meta-prompting value |
|---|---|
| Building a new LLM feature | High: generate and iterate quickly |
| Optimizing a failing prompt | High: diagnose failures systematically |
| Testing prompt robustness | High: generate diverse edge cases |
| One-off queries | Low: just write the prompt manually |
| Expert with deep domain knowledge | Lower: expert may outperform the generator |
| Production prompt maintenance | Medium: automate regression testing |

Meta-prompting adds LLM cost but can save significant human time during development. Use it for systematic tasks (generating test cases, evaluating coverage) rather than replacing expert prompt authoring for high-stakes systems.
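
To make the cost side concrete, the number of API calls can be estimated before running anything. The sketch below counts calls for the two loops as written in this lesson. The function names are hypothetical, and the counts ignore retries and token-level pricing.

```python
def ape_call_count(n_candidates: int, n_examples: int) -> int:
    """Calls made by automatic_prompt_engineer as written above.

    One call generates the candidates; each candidate/example pair then costs
    two calls (one to run the candidate prompt, one to score the result).
    """
    return 1 + 2 * n_candidates * n_examples


def optimize_call_count_worst_case(n_iterations: int, n_failure_cases: int) -> int:
    """Upper bound on calls made by optimize_prompt as written above.

    Each iteration re-tests every failure case and, if any still fail,
    makes one extra call to rewrite the prompt.
    """
    return n_iterations * (n_failure_cases + 1)
```

With the default 10 candidates and 5 examples, ape_call_count(10, 5) gives 101 calls, which shows how quickly the search grows with the example set.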