What Is Prompt Engineering?

Prompt engineering is the discipline of designing, refining, and optimizing the text inputs you send to a large language model (LLM) in order to get outputs that are accurate, useful, safe, and consistent. It is part art, part science — you need intuition about how models behave, plus a systematic approach to measuring and improving results.

At its core, a prompt is nothing more than a string of text. But the choices you make inside that string — the words you choose, the format you impose, the examples you include, the constraints you specify — dramatically change the quality of what comes back.

Why Prompt Engineering Matters

When OpenAI released GPT-3 in 2020, researchers discovered that the same base model could perform radically different tasks simply by changing the text prepended to the input. A model trained on the internet could suddenly do arithmetic, translate languages, summarize legal documents, and write Python — all without retraining — just by framing the prompt correctly.

This insight turned prompting into a first-class engineering skill. Today, prompt engineering sits at the intersection of:

Linguistics — how phrasing affects meaning and tone
Cognitive science — how to activate the right "knowledge cluster" in the model
Software engineering — how to build reliable, testable, maintainable prompt systems
Domain expertise — knowing what a good answer looks like so you can measure it

Organizations that ship AI products invest heavily in prompt engineering because even a modest improvement in prompt quality can mean the difference between a product people love and one they abandon.

Prompting vs Fine-Tuning vs RAG

These three techniques are often confused. They serve different purposes and should be combined deliberately.

| Technique | What it does | Cost | When to use |
|-----------|-------------|------|-------------|
| Prompting | Changes the instruction at inference time | Low (API calls only) | Most tasks; rapid iteration |
| Fine-tuning | Changes the weights of the model | High (GPU training) | Consistent style, domain vocabulary, latency |
| RAG | Injects external knowledge at inference time | Medium (retrieval infra) | Factual accuracy, up-to-date info, large corpora |

Prompting is always your first stop. It is the fastest way to iterate and requires no infrastructure beyond an API key. If prompting gets you to 80% of your goal, you ship. If you need higher accuracy or a very specific style that prompting alone cannot achieve, you consider fine-tuning. If the model needs current or proprietary knowledge it was never trained on, you use RAG.

A mature AI system often uses all three: a fine-tuned model, grounded by RAG, controlled by a carefully engineered prompt.
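To make the RAG row of the table concrete, here is a minimal sketch of "injecting external knowledge at inference time". It assumes a toy word-overlap retriever in place of real retrieval infrastructure; `tokenize`, `retrieve`, and `build_rag_messages` are hypothetical helper names, and the sample documents are invented for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_rag_messages(query: str, documents: list[str]) -> list[dict]:
    """Place the retrieved passages in the prompt so the model can ground its answer."""
    context = "\n".join(f"- {passage}" for passage in retrieve(query, documents))
    return [
        {
            "role": "system",
            "content": "Answer using only the provided context. "
                       "If the context is insufficient, say so.",
        },
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]


# Invented example corpus.
docs = [
    "The refund window is 30 days from purchase.",
    "Standard shipping takes 5 business days.",
    "Support is available by email on weekdays.",
]
messages = build_rag_messages("What is the refund window?", docs)
print(messages[1]["content"])
```

The model weights never change here; only the text sent at inference time does, which is why RAG sits between prompting and fine-tuning in cost.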

The Anatomy of a Prompt

Modern LLM APIs structure conversation as a list of messages, each with a role. Understanding these roles is foundational.

System Message

The system message is set by the developer. It is the persistent instruction layer — it tells the model who it is, what it should and should not do, and how to format its responses. Users typically cannot see it.

TEXT

SYSTEM:
You are MedAssist, a clinical decision support tool for licensed physicians.
- Always cite the evidence level (Grade A/B/C) for any recommendation.
- If a question is outside your scope, say so clearly and suggest a specialist.
- Never provide dosing for off-label uses without explicitly flagging them as off-label.
- Respond in structured markdown with headers.

User Message

The user message is the turn-by-turn input from the end user or the orchestration layer. It contains the actual request.

TEXT

USER:
My patient is a 67-year-old male with HFrEF (EF 30%) and CKD stage 3b.
He is currently on carvedilol 25 mg BID and lisinopril 10 mg daily.
What is the guideline recommendation for adding an SGLT2 inhibitor?

Assistant Message

The assistant message is the model's response. In a multi-turn conversation, you include previous assistant messages to give the model context about what it has already said.

TEXT

ASSISTANT:
## SGLT2 Inhibitor in HFrEF with CKD

**Recommendation (Grade A — ACC/AHA 2022 HF Guidelines):**
Empagliflozin 10 mg daily or dapagliflozin 10 mg daily is recommended for patients
with HFrEF to reduce CV death and hospitalization, regardless of diabetes status.

**CKD Considerations:**
- Dapagliflozin: approved down to eGFR ≥25 mL/min/1.73m²
- Empagliflozin: approved down to eGFR ≥20 mL/min/1.73m²
- Monitor eGFR and potassium after initiation given concurrent ACEi use

The Full API Call in Python

Python

import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are MedAssist, a clinical decision support tool for licensed physicians. "
                "Always cite evidence grades. Never provide off-label dosing without flagging it."
            ),
        },
        {
            "role": "user",
            "content": (
                "67-year-old male, HFrEF EF 30%, CKD stage 3b. "
                "On carvedilol 25 mg BID and lisinopril 10 mg. "
                "Should we add an SGLT2 inhibitor?"
            ),
        },
    ],
    temperature=0.2,
    max_tokens=800,
)

print(response.choices[0].message.content)
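For a multi-turn conversation, the same `messages` list grows with each exchange. The sketch below shows one way to manage that history without calling the API; `add_turn` is a hypothetical helper, and the assistant text is a placeholder standing in for a real model reply.

```python
def add_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Return a new message list with one more turn appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unsupported role: {role!r}")
    return history + [{"role": role, "content": content}]


history = [
    {
        "role": "system",
        "content": "You are MedAssist, a clinical decision support tool for licensed physicians.",
    }
]
history = add_turn(history, "user", "Should we add an SGLT2 inhibitor?")
# In a real application this text would come from response.choices[0].message.content.
history = add_turn(history, "assistant", "(model reply goes here)")
history = add_turn(history, "user", "What should I monitor after starting it?")

# The full list is what you would pass as `messages` on the next request.
for message in history:
    print(message["role"])
```

Because the model is stateless between calls, anything left out of this list is invisible to it on the next turn.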

How Temperature Affects Outputs

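Temperature is the single most impactful sampling parameter you control. It divides the model's raw token scores (logits) by the temperature before they are turned into probabilities: values below 1 sharpen the distribution toward the likeliest tokens, and values above 1 flatten it.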

Temperature 0.0: The model picks the highest-probability token at every step (greedy decoding). Outputs are highly repeatable, although hosted APIs do not guarantee bit-identical results for the same input. Use this for factual lookups, structured data extraction, and any task where you want reproducibility.
Temperature 0.3–0.5: Slightly varied outputs, still focused and accurate. Good for code generation and technical writing.
Temperature 0.7–1.0: Balanced creativity and coherence. Good for marketing copy, brainstorming, and conversational agents.
Temperature 1.5–2.0: Highly random. The model explores unlikely tokens. Useful for creative fiction, but outputs often degrade in quality.

Python

import openai

client = openai.OpenAI()

prompt = "Write a one-sentence tagline for a cloud security platform."

for temp in [0.0, 0.5, 1.0, 1.5]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=60,
    )
    text = response.choices[0].message.content.strip()
    print(f"temp={temp}: {text}")

Sample outputs:

temp=0.0: Protect every layer of your cloud with enterprise-grade security.
temp=0.5: Your cloud, fortified — from edge to core, every second.
temp=1.0: Security that breathes with your cloud, adapts with your threats.
temp=1.5: Where trust dissolves and clouds remember everything you fear.

Notice that at temperature 0.0 you get the most generic, expected answer, while at 1.5 the output becomes poetic but less useful commercially.
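The effect of temperature on the token distribution can be sketched with a toy softmax. This is an illustration under stated assumptions, not the provider's implementation: the three logits are made up, and `softmax_with_temperature` is a hypothetical helper. Temperature 0 is excluded because it would divide by zero; in practice it is treated as always picking the top token.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then normalize them into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is handled as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

for temp in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temp)
    print(f"temp={temp}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At low temperature almost all of the probability mass lands on the top token; at high temperature the three candidates move closer together, which is why unlikely words start to appear.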

How Top-P (Nucleus Sampling) Works

Top-P is an alternative, or complement, to temperature. Instead of rescaling the full probability distribution, top-P restricts sampling to the smallest set of highest-probability tokens whose cumulative probability reaches P.

top_p=1.0 — Sample from the full vocabulary (default).
top_p=0.9 — Sample only from tokens that together account for 90% of the probability mass. Rare, nonsensical tokens are excluded.
top_p=0.1 — Extremely focused; nearly deterministic even at higher temperatures.

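OpenAI's API reference generally recommends adjusting temperature or top_p, not both at once, because the combined effect is harder to reason about. The example below leaves temperature at its default of 1.0 and changes only top_p.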

Python

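import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum entanglement in one paragraph."}],
    temperature=1.0,  # the default, so only top_p changes
    top_p=0.9,   # prune the tail of the distribution
    max_tokens=200,
)

print(response.choices[0].message.content)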
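The pruning step itself is simple enough to sketch in a few lines. The code below is a minimal illustration of nucleus filtering on a made-up four-token distribution; `top_p_filter` is a hypothetical helper, not part of any SDK.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Invented next-token distribution.
toy_probs = {"cloud": 0.5, "secure": 0.3, "shield": 0.15, "xylophone": 0.05}

print(sorted(top_p_filter(toy_probs, 0.9)))  # the rare tail token is dropped
print(sorted(top_p_filter(toy_probs, 0.1)))  # only the single likeliest token survives
```

Sampling then proceeds as usual, but only over the tokens that survived the filter.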

Practical Parameter Cheat Sheet

Python

# Deterministic extraction (JSON, data, facts)
EXTRACTION_PARAMS = {"temperature": 0.0, "top_p": 1.0}

# Balanced technical writing (documentation, code review)
TECHNICAL_PARAMS = {"temperature": 0.3, "top_p": 1.0}

# Conversational assistant (customer support, tutoring)
CONVERSATIONAL_PARAMS = {"temperature": 0.7, "top_p": 0.95}

# Creative writing (marketing, storytelling)
CREATIVE_PARAMS = {"temperature": 1.0, "top_p": 0.9}

A Mental Model: The Prompt as a Program

Think of a prompt as a program written in natural language. Just as a Python function has inputs, logic, and outputs, a well-engineered prompt has:

Context — What world does the model inhabit? (System prompt, background documents)
Instruction — What should the model do? (Verb-first: "Summarize", "Extract", "Classify")
Input — What data should it act on? (The user's actual content)
Constraints — What are the rules? (Format, length, scope, safety)
Output specification — What does success look like? (JSON schema, example output)

TEXT

[Context]   You are a healthcare billing specialist with expertise in ICD-10 coding.

[Instruction] Extract all billable diagnoses from the clinical note below.

[Constraint] Return only confirmed diagnoses, not rule-outs.
             Format as a JSON array with fields: code, description, confidence (high/medium/low).

[Input]     CLINICAL NOTE:
            Patient presents with chest pain and shortness of breath.
            EKG shows ST elevation in leads II, III, aVF consistent with inferior STEMI.
            Rule out pulmonary embolism. History of type 2 diabetes mellitus.

[Output spec] Example:
            [{"code": "I21.19", "description": "STEMI of inferior wall", "confidence": "high"}]
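If the prompt is a program, its five parts can be kept as separate fields and assembled in code. The sketch below is one possible arrangement, not a standard: `PromptSpec` and `to_messages` are hypothetical names, and putting the context in the system message while the rest goes in the user message is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """The five components of a prompt, kept separate so each can be edited and tested."""
    context: str
    instruction: str
    input_text: str
    constraints: list[str]
    output_spec: str

    def to_messages(self) -> list[dict]:
        """Assemble the components into a chat-style message list."""
        rules = "\n".join(f"- {rule}" for rule in self.constraints)
        user_content = (
            f"{self.instruction}\n\n"
            f"Constraints:\n{rules}\n\n"
            f"Input:\n{self.input_text}\n\n"
            f"Output format:\n{self.output_spec}"
        )
        return [
            {"role": "system", "content": self.context},
            {"role": "user", "content": user_content},
        ]


spec = PromptSpec(
    context="You are a healthcare billing specialist with expertise in ICD-10 coding.",
    instruction="Extract all billable diagnoses from the clinical note below.",
    input_text="EKG shows ST elevation in leads II, III, aVF consistent with inferior STEMI.",
    constraints=["Return only confirmed diagnoses, not rule-outs."],
    output_spec="A JSON array with fields: code, description, confidence (high/medium/low).",
)

for message in spec.to_messages():
    print(f"{message['role'].upper()}:\n{message['content']}\n")
```

Keeping the parts separate means you can change one constraint or swap the input without touching the rest of the prompt.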

Why Bad Prompts Fail

Understanding failure modes helps you write better prompts from the start.

| Failure mode | Example | Fix |
|---|---|---|
| Too vague | "Tell me about diabetes" | "Explain type 2 diabetes pathophysiology in 3 bullet points for a medical student" |
| No format specified | "List the capitals" | "Return a JSON object mapping country name to capital city" |
| Contradictory instructions | "Be brief but comprehensive" | Pick one: "Respond in under 100 words" or "Be comprehensive" |
| No role/context | Model uses generic register | Add system prompt with persona and domain |
| Ignores safety constraints | No safety instruction | Add explicit safety rails in system prompt |
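Once a prompt specifies its output format, you can check replies mechanically instead of by eye. The following sketch validates a reply against the JSON shape used in the billing example earlier; `validate_diagnoses` is a hypothetical helper, and the field names are taken from that example rather than from any official schema.

```python
import json

REQUIRED_FIELDS = {"code", "description", "confidence"}
ALLOWED_CONFIDENCE = ("high", "medium", "low")


def validate_diagnoses(raw: str) -> list[str]:
    """Return a list of problems found in a model reply; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return ["top-level value is not a JSON array"]

    problems = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {index} is not an object")
            continue
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append(f"item {index} is missing {sorted(missing)}")
        if "confidence" in item and item["confidence"] not in ALLOWED_CONFIDENCE:
            problems.append(f"item {index} has an invalid confidence value")
    return problems


reply = '[{"code": "I21.19", "description": "STEMI of inferior wall", "confidence": "high"}]'
print(validate_diagnoses(reply))
print(validate_diagnoses("Sure! Here are the diagnoses..."))
```

A check like this turns "no format specified" from a vague complaint into a test you can run on every prompt revision.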

Summary

Prompt engineering is the practice of crafting text inputs that reliably produce high-quality outputs from LLMs. Key takeaways:

Use prompting first; reach for fine-tuning and RAG only when prompting hits its ceiling.
A chat prompt is a list of messages with three roles: system, user, and assistant, with earlier assistant turns included as history.
Temperature controls creativity vs. determinism; top_p controls vocabulary breadth.
Think of your prompt as a program: context + instruction + input + constraints + output spec.

In the next lessons you will learn zero-shot prompting, few-shot prompting, and chain-of-thought — the foundational techniques that sit on top of this anatomy.
