Machine Learning Foundations · Lesson 3 of 70

What is a Feature and a Label?

Features (Inputs)

A feature is a measurable piece of information used as input to the model. Features are the variables the model uses to make predictions.

Clinical data example:
┌────────────────────────────────────────────────────────┐
│ age │ weight │ creatinine │ INR │ drug_count │ ...     │
│  65 │   78   │    1.2     │ 2.4 │     5      │ ...     │
│  42 │   92   │    0.8     │ 1.1 │     2      │ ...     │
└────────────────────────────────────────────────────────┘
These columns are the features (also called X, inputs, predictors, or covariates)
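
As a minimal sketch, the two example rows above can be loaded into a pandas DataFrame so that each column becomes one feature. The variable names `patients`, `feature_names`, and `X` are chosen here for illustration, and the elided `...` columns are left out.

```python
import pandas as pd

# The two patient rows from the table above; the elided "..." columns are omitted
patients = pd.DataFrame({
    "age":        [65, 42],
    "weight":     [78, 92],
    "creatinine": [1.2, 0.8],
    "INR":        [2.4, 1.1],
    "drug_count": [5, 2],
})

feature_names = list(patients.columns)   # One name per feature (column)
X = patients.to_numpy()                  # 2D array: one row per patient

print(feature_names)
print(X.shape)   # (2, 5): 2 patients, 5 features
```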

Labels (Outputs)

A label is the correct answer you want the model to predict. Labels are known for the training data; at inference time, the model predicts them for new, unseen inputs.

Label examples:
- Will this patient develop sepsis?          → binary: 0 or 1
- What drug class is this?                   → multi-class: 0, 1, 2, 3
- What will this patient's INR be in 7 days? → continuous number

| Other names | Same thing |
|---|---|
| Target, output, y, response variable | They all mean "the thing you're predicting" |
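
The three kinds of label listed above might look like this in code. This is a sketch with made-up values: the arrays and the drug class names are hypothetical, and `LabelEncoder` is just one common way to turn class names into integers.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Binary label: 1 = developed sepsis, 0 = did not (hypothetical values)
sepsis = np.array([0, 1, 0, 0])

# Multi-class label: class names mapped to integer codes (hypothetical names)
drug_class_names = ["anticoagulant", "antidiabetic", "nsaid", "anticoagulant"]
encoder = LabelEncoder()
drug_class = encoder.fit_transform(drug_class_names)
print(encoder.classes_)   # Class names in sorted order
print(drug_class)         # Integer code for each example

# Continuous label: INR in 7 days (hypothetical values)
inr_in_7_days = np.array([2.4, 1.1, 3.0, 2.2])
```

`LabelEncoder` assigns codes in sorted order of the class names, so the integers carry no ranking between classes.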

A Complete Example

Python

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Dataset: predict whether a drug is an anticoagulant
data = pd.DataFrame({
    "mol_weight":      [296.3, 180.2, 419.5, 334.4],
    "log_p":           [2.4,   -3.1,   3.8,   1.2],
    "hbd_count":       [1,      4,      2,     3],       # Hydrogen bond donors
    "is_anticoagulant":[1,      0,      1,     0],       # Label
})

# Separate features (X) and label (y)
X = data[["mol_weight", "log_p", "hbd_count"]]   # Feature matrix: (4, 3)
y = data["is_anticoagulant"]                     # Label vector: (4,)

# Train
model = LogisticRegression()
model.fit(X, y)

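# Predict for a new compound
# Use the same column names as in training so scikit-learn can match features
new_compound = pd.DataFrame(
    [[310.0, 2.1, 1]], columns=X.columns
)   # Features only; no label known yet
prediction = model.predict(new_compound)
print(prediction)   # Array with one class label; 1 means "anticoagulant"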
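
Besides a hard 0/1 prediction, a fitted classifier can also report how confident it is. The sketch below repeats the same toy dataset and calls `predict_proba`; the `max_iter=1000` setting is an assumption added here to give the solver room to converge on unscaled features. With only four training rows, the probabilities are illustrative and not meaningful.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "mol_weight":       [296.3, 180.2, 419.5, 334.4],
    "log_p":            [2.4,   -3.1,   3.8,   1.2],
    "hbd_count":        [1,      4,      2,     3],
    "is_anticoagulant": [1,      0,      1,     0],
})
feature_cols = ["mol_weight", "log_p", "hbd_count"]

model = LogisticRegression(max_iter=1000)
model.fit(data[feature_cols], data["is_anticoagulant"])

new_compound = pd.DataFrame([[310.0, 2.1, 1]], columns=feature_cols)
proba = model.predict_proba(new_compound)

print(model.classes_)   # Column order of the probabilities: [0 1]
print(proba.shape)      # (1, 2): one row per compound, one column per class
print(proba)            # Each row sums to 1
```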

Raw Features vs Engineered Features

Raw features come directly from the data source. Engineered features are derived or transformed to help the model learn better.

Python

# Raw features from an EHR record:
raw = {
    "date_of_birth": "1958-03-14",
    "admission_date": "2024-06-10",
    "medications": "warfarin, metoprolol, omeprazole",
    "creatinine": 1.4,
}

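# Engineered features:
from datetime import date

# Parse the raw strings instead of hard-coding the dates a second time
dob = date.fromisoformat(raw["date_of_birth"])
admitted = date.fromisoformat(raw["admission_date"])
meds = [m.strip() for m in raw["medications"].split(",")]

engineered = {
    # Derived: age in whole years at admission
    # (subtract 1 if the birthday has not yet occurred in the admission year)
    "age": admitted.year - dob.year
           - ((admitted.month, admitted.day) < (dob.month, dob.day)),

    # Extracted: medication count
    "med_count": len(meds),

    # Flagged: is on anticoagulant (only warfarin is checked here)
    "on_anticoagulant": int("warfarin" in meds),

    # Binned: creatinine above a single cutoff of 1.2
    "renal_impairment": int(raw["creatinine"] > 1.2),
}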
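
In practice the same transformations are usually applied to a whole table of records at once. The sketch below does this with pandas; the second record is invented for illustration, and the column names simply mirror the `raw` dictionary above.

```python
import pandas as pd

# Two raw records; the second one is hypothetical
records = pd.DataFrame({
    "date_of_birth":  ["1958-03-14", "1990-11-02"],
    "admission_date": ["2024-06-10", "2024-06-10"],
    "medications":    ["warfarin, metoprolol, omeprazole", "metformin"],
    "creatinine":     [1.4, 0.9],
})

dob = pd.to_datetime(records["date_of_birth"])
admitted = pd.to_datetime(records["admission_date"])

# True when the birthday has not yet occurred in the admission year
before_birthday = (admitted.dt.month < dob.dt.month) | (
    (admitted.dt.month == dob.dt.month) & (admitted.dt.day < dob.dt.day)
)

features = pd.DataFrame({
    "age": admitted.dt.year - dob.dt.year - before_birthday.astype(int),
    "med_count": records["medications"].str.split(",").str.len(),
    "on_anticoagulant": records["medications"].str.contains("warfarin").astype(int),
    "renal_impairment": (records["creatinine"] > 1.2).astype(int),
})
print(features)
```

The result has one row per record and one column per engineered feature, which is the shape a model expects as input.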

Features in NLP and LLM Systems

For text-based ML (NLP), features don't arrive as ready-made numeric columns. The text must first be converted into a numeric representation, and that representation is the feature set.

Python

# Traditional NLP: TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "warfarin is an oral anticoagulant",
    "aspirin reduces fever and inflammation",
    "metformin lowers blood glucose in diabetes",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)   # (3, 16): one row per document, one column per unique term

# Modern: embedding features from an embedding model
# Each text is represented as a dense, high-dimensional vector
# Embedding models: text-embedding-3-small, sentence-transformers, etc.
# With text-embedding-3-small, X.shape would be (3, 1536) by default;
# other models produce vectors of other sizes

Label Types

| Label Type | Task | Example |
|---|---|---|
| Binary | Binary classification | Sepsis: yes / no |
| Multi-class | Multi-class classification | Drug class: A / B / C / D |
| Multi-label | Multi-label classification | Side effects: can be multiple |
| Continuous | Regression | INR value, dose in mg |
| Sequence | Sequence labeling (NER) | Token → drug-name / dose / route |
| Ranking | Learning to rank | Most relevant document first |
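
Multi-label targets differ from the others in shape: each example gets one 0/1 column per possible label instead of a single value. A minimal sketch using scikit-learn's `MultiLabelBinarizer`, with hypothetical side-effect data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each patient can have zero, one, or several side effects (hypothetical data)
side_effects = [
    ["nausea"],
    ["bleeding", "nausea"],
    [],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(side_effects)

print(mlb.classes_)   # Column order: ['bleeding' 'nausea']
print(Y)              # One row per patient, one 0/1 column per side effect
```

Here `Y` has shape (n_samples, n_labels), whereas binary, multi-class, and regression targets are 1D arrays of shape (n_samples,).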

The Feature Matrix Shape

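scikit-learn estimators expect the feature matrix as a 2D array of shape (n_samples, n_features) and the labels as a 1D array of shape (n_samples,). NumPy itself does not enforce this layout; it is the convention scikit-learn builds on: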

Python

import numpy as np

# 1000 patients, each described by 20 features
X = np.random.randn(1000, 20)   # Shape: (1000, 20)
y = np.random.randint(0, 2, 1000)  # Shape: (1000,) — one label per patient

print(f"Samples:  {X.shape[0]}")   # 1000 — one row per patient
print(f"Features: {X.shape[1]}")   # 20 — one column per feature
print(f"Labels:   {y.shape}")      # (1000,)
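
A common stumbling block follows from this convention: a single feature stored as a 1D array has to be reshaped into a column before fitting. The sketch below uses made-up ages and INR values with `LinearRegression` purely to show the shapes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One feature for four patients, stored as a 1D array (hypothetical values)
age = np.array([65, 42, 58, 71])        # Shape: (4,)
inr = np.array([2.4, 1.1, 1.9, 2.8])    # Shape: (4,), one label per patient

# Reshape to (n_samples, 1) so it is a 2D feature matrix with one column
X = age.reshape(-1, 1)
print(X.shape)   # (4, 1)

model = LinearRegression().fit(X, inr)

# A single new patient is still passed as a 2D array: 1 row, 1 feature
pred = model.predict(np.array([[60]]))
print(pred.shape)   # (1,)
```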

Interview Answer Template

Q: What is the difference between a feature and a label?

Features are the input variables the model uses to make predictions — they describe each example (age, weight, lab values, text embeddings). Labels are the correct output values the model is trying to learn to predict — in training, we know them; at inference, we're trying to figure them out. For example, in a drug-class classifier, the molecular weight, log-P, and binding affinity would be features, while the drug class ("anticoagulant", "antidiabetic") would be the label. Feature engineering — creating better features from raw data — often has more impact on model performance than changing the model architecture.
