Machine Learning Foundations · Lesson 3 of 70
What is a Feature and a Label?
Features (Inputs)
A feature is a measurable piece of information used as input to the model. Features are the variables the model uses to make predictions.
Clinical data example:
┌────────────────────────────────────────────────────────┐
│ age │ weight │ creatinine │ INR │ drug_count │ ... │
│ 65 │ 78 │ 1.2 │ 2.4 │ 5 │ ... │
│ 42 │ 92 │ 0.8 │ 1.1 │ 2 │ ... │
└────────────────────────────────────────────────────────┘
These columns are the features (also called X, inputs, predictors, or covariates).
Labels (Outputs)
A label is the correct answer you want the model to predict. Labels are what you have during training; at inference time you're trying to predict them for new, unseen inputs.
Label examples:
- Will this patient develop sepsis? → binary: 0 or 1
- What drug class is this? → multi-class: 0, 1, 2, 3
- What will this patient's INR be in 7 days? → continuous number

| Other names | Same thing |
|---|---|
| Target, output, y, response variable | They all mean "the thing you're predicting" |
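The three label examples above can be sketched as arrays. This is a minimal illustration, not a real dataset: all values and the class names (`anticoagulant`, `antidiabetic`, `nsaid`) are made up, and `LabelEncoder` is just one common way to turn class names into the integers shown in the list.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, invented for illustration only
sepsis = np.array([0, 1, 0, 0])                 # binary: 0 or 1
inr_in_7_days = np.array([2.4, 1.1, 3.0, 2.7])  # continuous number

# Multi-class labels often start as names and are encoded as integers
drug_class_names = ["anticoagulant", "antidiabetic", "nsaid", "anticoagulant"]
encoder = LabelEncoder()
drug_class = encoder.fit_transform(drug_class_names)

print(drug_class)        # [0 1 2 0]
print(encoder.classes_)  # ['anticoagulant' 'antidiabetic' 'nsaid']
```

`LabelEncoder` assigns integers in sorted order of the class names, so the integer values carry no ranking; they are identifiers only.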
A Complete Example
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Dataset: predict whether a drug is an anticoagulant
data = pd.DataFrame({
"mol_weight": [296.3, 180.2, 419.5, 334.4],
"log_p": [2.4, -3.1, 3.8, 1.2],
"hbd_count": [1, 4, 2, 3], # Hydrogen bond donors
"is_anticoagulant":[1, 0, 1, 0], # Label
})
# Separate features (X) and label (y)
X = data[["mol_weight", "log_p", "hbd_count"]] # Feature matrix: (4, 3)
y = data["is_anticoagulant"] # Label vector: (4,)
# Train
model = LogisticRegression()
model.fit(X, y)
# Predict for a new compound
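# Use the same column names as in training so scikit-learn does not warn about missing feature names
new_compound = pd.DataFrame([[310.0, 2.1, 1]], columns=X.columns) # Features only; no label known yet
prediction = model.predict(new_compound)
print(prediction) # Array with one predicted class; [1] would mean "anticoagulant"
Raw Features vs Engineered Features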
Raw features come directly from the data source. Engineered features are derived or transformed to help the model learn better.
# Raw features from an EHR record:
raw = {
"date_of_birth": "1958-03-14",
"admission_date": "2024-06-10",
"medications": "warfarin, metoprolol, omeprazole",
"creatinine": 1.4,
}
# Engineered features:
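from datetime import date
dob = date.fromisoformat(raw["date_of_birth"])
admitted = date.fromisoformat(raw["admission_date"])
meds = [m.strip() for m in raw["medications"].split(",")]
engineered = {
    # Derived: age in whole years at admission (one less if the birthday has not yet occurred that year)
    "age": admitted.year - dob.year - ((admitted.month, admitted.day) < (dob.month, dob.day)),
    # Extracted: medication count
    "med_count": len(meds),
    # Flagged: is on anticoagulant (only warfarin is checked here)
    "on_anticoagulant": int("warfarin" in meds),
    # Binned: renal function category (creatinine above 1.2)
    "renal_impairment": int(raw["creatinine"] > 1.2),
}
Features in NLP and LLM Systems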
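For text-based ML (NLP), features don't start out as numeric columns. The text is first converted into a numeric representation, and that representation is what the model receives as features.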
# Traditional NLP: TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"warfarin is an oral anticoagulant",
"aspirin reduces fever and inflammation",
"metformin lowers blood glucose in diabetes",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape) # (3, 16): one row per document, one column per unique word
# Modern: embedding features from an LLM
# Each text is represented as a high-dimensional vector
# Embedding models: text-embedding-3-small, sentence-transformers, etc.
# With text-embedding-3-small (1536 dimensions by default), X.shape would be (3, 1536)
# Other embedding models produce vectors of other sizes
Label Types
| Label Type | Task | Example |
|---|---|---|
| Binary | Binary classification | Sepsis: yes / no |
| Multi-class | Multi-class classification | Drug class: A / B / C / D |
| Multi-label | Multi-label classification | Side effects: can be multiple |
| Continuous | Regression | INR value, dose in mg |
| Sequence | Sequence labeling (NER) | Token → drug-name / dose / route |
| Ranking | Learning to rank | Most relevant document first |
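Two of the label types in the table, multi-label and sequence, are not a flat vector with one value per example. The sketch below shows one plausible encoding of each. The side-effect names, tokens, and tag names are hypothetical, and `MultiLabelBinarizer` is one scikit-learn option for multi-label targets, not the only one.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-label: each example can carry zero, one, or several labels.
# Hypothetical side effects per drug, invented for illustration.
side_effects = [
    ["bleeding"],
    ["nausea", "dizziness"],
    [],
    ["bleeding", "nausea"],
]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(side_effects)  # Shape: (n_samples, n_labels)

print(mlb.classes_)  # ['bleeding' 'dizziness' 'nausea']
print(Y)
# [[1 0 0]
#  [0 1 1]
#  [0 0 0]
#  [1 0 1]]

# Sequence labeling: one label per token rather than one per example.
tokens = ["warfarin", "5", "mg", "oral"]
tags = ["DRUG", "DOSE", "DOSE", "ROUTE"]  # Hypothetical tag set
for token, tag in zip(tokens, tags):
    print(f"{token} -> {tag}")
```

In the multi-label case the target becomes a 2D indicator matrix, so `y` has one column per possible label instead of a single column.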
The Feature Matrix Shape
scikit-learn estimators expect the feature matrix as a 2D array of shape (n_samples, n_features), typically a NumPy array or a pandas DataFrame:
import numpy as np
# 1000 patients, each described by 20 features
X = np.random.randn(1000, 20) # Shape: (1000, 20)
y = np.random.randint(0, 2, 1000) # Shape: (1000,) — one label per patient
print(f"Samples: {X.shape[0]}") # 1000 — one row per patient
print(f"Features: {X.shape[1]}") # 20 — one column per feature
print(f"Labels: {y.shape}") # (1000,)
Interview Answer Template
Q: What is the difference between a feature and a label?
Features are the input variables the model uses to make predictions — they describe each example (age, weight, lab values, text embeddings). Labels are the correct output values the model is trying to learn to predict — in training, we know them; at inference, we're trying to figure them out. For example, in a drug-class classifier, the molecular weight, log-P, and binding affinity would be features, while the drug class ("anticoagulant", "antidiabetic") would be the label. Feature engineering — creating better features from raw data — often has more impact on model performance than changing the model architecture.
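The drug-class example in the answer can be sketched as follows. All numbers are invented for illustration, the `binding_affinity` column and its scale are assumptions, and the scaler-plus-logistic-regression pipeline is just one reasonable model choice. The point is the split: the label column exists at training time and is absent at inference.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data (values are made up)
train = pd.DataFrame({
    "mol_weight": [308.3, 129.2, 460.0, 165.6],
    "log_p": [2.7, -1.4, 2.5, -0.9],
    "binding_affinity": [7.9, 5.2, 8.4, 5.6],
    "drug_class": ["anticoagulant", "antidiabetic",
                   "anticoagulant", "antidiabetic"],
})

feature_cols = ["mol_weight", "log_p", "binding_affinity"]
X_train = train[feature_cols]  # Features: known at training and inference
y_train = train["drug_class"]  # Label: known only at training time

# scikit-learn classifiers accept string labels directly
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Inference: features only, the label is what we want to find out
new_compound = pd.DataFrame([[300.0, 2.2, 7.5]], columns=feature_cols)
predicted_class = model.predict(new_compound)[0]
print(predicted_class)
```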