Learnixo
AI Systems · Intermediate

What is a Feature and a Label?

Clear definitions of features and labels in machine learning: raw vs engineered features, target variables for regression and classification, and how they map to real AI use cases like drug prediction and clinical NLP.

Asma Hafeez Khan · May 16, 2026 · 4 min read
Tags: Machine Learning, Features, Labels, Feature Engineering, Interview

Features (Inputs)

A feature is a measurable piece of information used as input to the model. Features are the variables the model uses to make predictions.

Clinical data example:
┌────────────────────────────────────────────────────────┐
│ age │ weight │ creatinine │ INR │ drug_count │ ...     │
│  65 │   78   │    1.2     │ 2.4 │     5      │ ...     │
│  42 │   92   │    0.8     │ 1.1 │     2      │ ...     │
└────────────────────────────────────────────────────────┘
These columns are the features (also called X, inputs, predictors, or covariates). Each row is one example (here, one patient).
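
As a minimal sketch, the table above could be loaded into a pandas DataFrame, where each column becomes one feature. The values are the two illustrative rows from the table; the elided "..." columns are left out, and the variable names are only for illustration.

```python
import pandas as pd

# The two example patients from the table above (elided columns omitted)
patients = pd.DataFrame({
    "age":        [65, 42],
    "weight":     [78, 92],
    "creatinine": [1.2, 0.8],
    "INR":        [2.4, 1.1],
    "drug_count": [5, 2],
})

feature_names = list(patients.columns)  # one name per feature
X = patients.to_numpy()                 # rows = patients, columns = features

print(feature_names)
print(X.shape)  # (2, 5): 2 patients, 5 features
```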

Labels (Outputs)

A label is the correct answer you want the model to predict. Labels are what you have during training; at inference time you're trying to predict them for new, unseen inputs.

Label examples:
- Will this patient develop sepsis?          → binary: 0 or 1
- What drug class is this?                   → multi-class: 0, 1, 2, 3
- What will this patient's INR be in 7 days? → continuous number

| Other names | Same thing |
|---|---|
| Target, output, y, response variable | They all mean "the thing you're predicting" |
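
To make the three label examples concrete, here is a small sketch that uses scikit-learn's `type_of_target` helper to report how each label vector would be interpreted. The label values are made up for illustration.

```python
from sklearn.utils.multiclass import type_of_target

# Made-up label vectors matching the three examples above
sepsis = [0, 1, 1, 0]             # binary: develops sepsis or not
drug_class = [0, 1, 2, 3]         # multi-class: one of four classes
inr_in_7_days = [2.4, 1.1, 3.05]  # continuous: a real-valued target

label_kinds = {
    "sepsis": type_of_target(sepsis),
    "drug_class": type_of_target(drug_class),
    "inr_in_7_days": type_of_target(inr_in_7_days),
}

for name, kind in label_kinds.items():
    print(f"{name}: {kind}")
```

The reported type is what tells you whether the task is classification or regression, and so which family of models and metrics applies.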


A Complete Example

Python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Dataset: predict whether a drug is an anticoagulant
data = pd.DataFrame({
    "mol_weight":      [296.3, 180.2, 419.5, 334.4],
    "log_p":           [2.4,   -3.1,   3.8,   1.2],
    "hbd_count":       [1,      4,      2,     3],       # Hydrogen bond donors
    "is_anticoagulant":[1,      0,      1,     0],       # Label
})

# Separate features (X) and label (y)
X = data[["mol_weight", "log_p", "hbd_count"]]   # Feature matrix: (4, 3)
y = data["is_anticoagulant"]                     # Label vector: (4,)

# Train
model = LogisticRegression()
model.fit(X, y)

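# Predict for a new compound
# Features only (no label known yet); reuse the training column names
# so scikit-learn does not warn about missing feature names
new_compound = pd.DataFrame(
    [[310.0, 2.1, 1]], columns=["mol_weight", "log_p", "hbd_count"]
)
prediction = model.predict(new_compound)
print(prediction)   # e.g. [1] would mean the model predicts anticoagulant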

Raw Features vs Engineered Features

Raw features come directly from the data source. Engineered features are derived or transformed to help the model learn better.

Python
# Raw features from an EHR record:
raw = {
    "date_of_birth": "1958-03-14",
    "admission_date": "2024-06-10",
    "medications": "warfarin, metoprolol, omeprazole",
    "creatinine": 1.4,
}

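# Engineered features (all derived from the raw record above):
from datetime import date

dob = date.fromisoformat(raw["date_of_birth"])
admitted = date.fromisoformat(raw["admission_date"])
meds = [m.strip() for m in raw["medications"].split(",")]

engineered = {
    # Derived: age in whole years at admission
    # (subtract 1 if the birthday has not yet occurred that year)
    "age": admitted.year - dob.year
           - ((admitted.month, admitted.day) < (dob.month, dob.day)),

    # Extracted: medication count
    "med_count": len(meds),

    # Flagged: is on anticoagulant (exact name match, warfarin only here)
    "on_anticoagulant": int("warfarin" in meds),

    # Renal function flag (thresholded; 1.2 is an illustrative cut-off)
    "renal_impairment": int(raw["creatinine"] > 1.2),
}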
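
The renal flag above is a single threshold. If several categories are wanted, one possible sketch uses `pd.cut` to bin the value and `pd.get_dummies` to one-hot encode the result. The bin edges and category names below are illustrative assumptions, not clinical cut-offs, and the creatinine values are made up.

```python
import numpy as np
import pandas as pd

creatinine = pd.Series([0.8, 1.4, 2.6], name="creatinine")  # made-up values

# Illustrative bin edges only, NOT clinical cut-offs
bins = [0, 1.2, 2.0, np.inf]
labels = ["normal", "elevated", "high"]

# Each value falls into one right-closed interval: (0, 1.2], (1.2, 2.0], (2.0, inf]
renal_category = pd.cut(creatinine, bins=bins, labels=labels)

# One-hot encode the category so a model can consume it as numeric columns
one_hot = pd.get_dummies(renal_category, prefix="renal")

print(list(renal_category))
print(one_hot.columns.tolist())
```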

Features in NLP and LLM Systems

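For text-based ML (NLP), features don't start out as numeric columns. The text must first be converted into a numeric representation, and that representation is what the model sees as features.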

Python
# Traditional NLP: TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "warfarin is an oral anticoagulant",
    "aspirin reduces fever and inflammation",
    "metformin lowers blood glucose in diabetes",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)   # (3, n_unique_words) — each word is a feature

# Modern: embedding features from an embedding model
# Each text is represented as a dense, high-dimensional vector
# Embedding models: text-embedding-3-small, sentence-transformers, etc.
# X.shape would be (3, d), where d depends on the model
# (e.g. 1536 for text-embedding-3-small)

Label Types

| Label Type | Task | Example |
|---|---|---|
| Binary | Binary classification | Sepsis: yes / no |
| Multi-class | Multi-class classification | Drug class: A / B / C / D |
| Multi-label | Multi-label classification | Side effects: can be multiple |
| Continuous | Regression | INR value, dose in mg |
| Sequence | Sequence labeling (NER) | Token → drug-name / dose / route |
| Ranking | Learning to rank | Most relevant document first |
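
Multi-label targets are the least obvious row in the table, because each example can carry several labels at once. A minimal sketch, using scikit-learn's `MultiLabelBinarizer` and made-up side-effect lists, shows how they become a 0/1 matrix with one column per possible label:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up side effects per patient; a patient may have several or none
side_effects = [
    ["nausea", "bleeding"],
    ["bleeding"],
    [],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(side_effects)

print(list(mlb.classes_))  # column order of the label matrix
print(Y)                   # one row per patient, one column per side effect
```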


The Feature Matrix Shape

scikit-learn estimators expect features as a 2D array of shape (n_samples, n_features), and labels as a 1D array of shape (n_samples,) for single-output tasks:

Python
import numpy as np

# 1000 patients, each described by 20 features
X = np.random.randn(1000, 20)   # Shape: (1000, 20)
y = np.random.randint(0, 2, 1000)  # Shape: (1000,) — one label per patient

print(f"Samples:  {X.shape[0]}")   # 1000 — one row per patient
print(f"Features: {X.shape[1]}")   # 20 — one column per feature
print(f"Labels:   {y.shape}")      # (1000,)
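
A common stumbling block follows from this convention: a single feature stored as a 1D array has to be reshaped into one column before fitting. The sketch below uses made-up ages and INR values purely to show the shapes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

age = np.array([65, 42, 71, 58])      # shape (4,): 1D, not a feature matrix
inr = np.array([2.4, 1.1, 2.9, 1.8])  # shape (4,): one label per patient

# reshape(-1, 1) turns the 1D array into 4 rows x 1 feature column
X = age.reshape(-1, 1)

model = LinearRegression().fit(X, inr)

print(X.shape)            # (4, 1)
print(model.coef_.shape)  # (1,): one coefficient per feature
```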

Interview Answer Template

Q: What is the difference between a feature and a label?

Features are the input variables the model uses to make predictions — they describe each example (age, weight, lab values, text embeddings). Labels are the correct output values the model is trying to learn to predict — in training, we know them; at inference, we're trying to figure them out. For example, in a drug-class classifier, the molecular weight, log-P, and binding affinity would be features, while the drug class ("anticoagulant", "antidiabetic") would be the label. Feature engineering — creating better features from raw data — often has more impact on model performance than changing the model architecture.
