What is a Feature and a Label?
Clear definitions of features and labels in machine learning: raw vs engineered features, target variables for regression and classification, and how they map to real AI use cases like drug prediction and clinical NLP.
Features (Inputs)
A feature is a measurable piece of information used as input to the model. Features are the variables the model uses to make predictions.
Clinical data example:
| age | weight | creatinine | INR | drug_count | ... |
|---|---|---|---|---|---|
| 65 | 78 | 1.2 | 2.4 | 5 | ... |
| 42 | 92 | 0.8 | 1.1 | 2 | ... |
These columns are the features (also called X, inputs, predictors, or covariates).
Labels (Outputs)
A label is the correct answer you want the model to predict. Labels are what you have during training; at inference time you're trying to predict them for new, unseen inputs.
Label examples:
- Will this patient develop sepsis? → binary: 0 or 1
- What drug class is this? → multi-class: 0, 1, 2, 3
- What will this patient's INR be in 7 days? → continuous number

| Other names | Same thing |
|---|---|
| Target, output, y, response variable | They all mean "the thing you're predicting" |
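To make the three label examples above concrete, here is a minimal sketch of how each might look as a `y` array. The label values are made up for illustration. It uses scikit-learn's `type_of_target` helper, which inspects a target array and reports what kind of target it appears to be.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Made-up labels for five examples; the values are illustrative only.
labels = {
    "sepsis": np.array([0, 1, 0, 0, 1]),                   # binary
    "drug_class": np.array([2, 0, 3, 1, 0]),               # multi-class
    "inr_in_7_days": np.array([2.4, 1.1, 3.0, 2.7, 1.8]),  # continuous
}

# type_of_target looks at the values and names the kind of target it sees.
label_kinds = {name: type_of_target(y) for name, y in labels.items()}

for name, kind in label_kinds.items():
    print(f"{name}: {kind}")
```

The kind of label determines the task: the first two call for a classifier, the third for a regressor.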
A Complete Example
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Dataset: predict whether a drug is an anticoagulant
data = pd.DataFrame({
    "mol_weight": [296.3, 180.2, 419.5, 334.4],
    "log_p": [2.4, -3.1, 3.8, 1.2],
    "hbd_count": [1, 4, 2, 3],          # Hydrogen bond donors
    "is_anticoagulant": [1, 0, 1, 0],   # Label
})
# Separate features (X) and label (y)
feature_columns = ["mol_weight", "log_p", "hbd_count"]
X = data[feature_columns]        # Feature matrix: (4, 3)
y = data["is_anticoagulant"]     # Label vector: (4,)
# Train (four rows is far too few for a real model; this only shows the workflow)
model = LogisticRegression()
model.fit(X, y)
# Predict for a new compound: features only, no label known yet.
# Reusing the training column names avoids scikit-learn's feature-name warning.
new_compound = pd.DataFrame([[310.0, 2.1, 1]], columns=feature_columns)
prediction = model.predict(new_compound)
print(prediction)  # one predicted class; 1 would mean "anticoagulant"
Raw Features vs Engineered Features
Raw features come directly from the data source. Engineered features are derived or transformed to help the model learn better.
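from datetime import date
# Raw features from an EHR record:
raw = {
    "date_of_birth": "1958-03-14",
    "admission_date": "2024-06-10",
    "medications": "warfarin, metoprolol, omeprazole",
    "creatinine": 1.4,
}
# Engineered features:
born = date.fromisoformat(raw["date_of_birth"])
admitted = date.fromisoformat(raw["admission_date"])
medications = [m.strip().lower() for m in raw["medications"].split(",")]
engineered = {
    # Derived: age in whole years at admission (minus 1 if the birthday has not yet occurred that year)
    "age": admitted.year - born.year - ((admitted.month, admitted.day) < (born.month, born.day)),
    # Extracted: medication count
    "med_count": len(medications),
    # Flagged: is on anticoagulant (this example only checks for warfarin)
    "on_anticoagulant": int("warfarin" in medications),
    # Renal function category (binned at an illustrative cutoff of 1.2)
    "renal_impairment": int(raw["creatinine"] > 1.2),
}
Features in NLP and LLM Systems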
For text-based ML (NLP), features don't start out as numeric columns. They are numeric representations computed from the text.
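As a rough illustration of what "a representation of text" means, here is a minimal bag-of-words sketch in plain Python. The helper `bag_of_words` and the variable names are hypothetical and written only for this example. It shows the simplest possible idea: one feature per vocabulary word, with the count of that word as its value.

```python
from collections import Counter

corpus = [
    "warfarin is an oral anticoagulant",
    "aspirin reduces fever and inflammation",
]

# Vocabulary: every distinct lowercase token, in sorted order.
vocabulary = sorted({token for text in corpus for token in text.lower().split()})

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word for the given text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word.
X_counts = [bag_of_words(text, vocabulary) for text in corpus]

print(len(X_counts), len(vocabulary))
```

TF-IDF starts from the same kind of count matrix and reweights it so that words appearing in many documents contribute less.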
# Traditional NLP: TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"warfarin is an oral anticoagulant",
"aspirin reduces fever and inflammation",
"metformin lowers blood glucose in diabetes",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (3, 16): 3 documents, 16 unique words; each word is a feature
# Modern: embedding features from an LLM
# Each text is represented as a high-dimensional vector
# Embedding models: text-embedding-3-small, sentence-transformers, etc.
# With text-embedding-3-small, X.shape would be (3, 1536): one 1536-dimensional vector per text
Label Types
| Label Type | Task | Example |
|---|---|---|
| Binary | Binary classification | Sepsis: yes / no |
| Multi-class | Multi-class classification | Drug class: A / B / C / D |
| Multi-label | Multi-label classification | Side effects: can be multiple |
| Continuous | Regression | INR value, dose in mg |
| Sequence | Sequence labeling (NER) | Token → drug-name / dose / route |
| Ranking | Learning to rank | Most relevant document first |
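Multi-label and sequence labels are the two rows in this table that don't fit a single value per example, so a small sketch may help. The side-effect names, observations, and tokens below are invented for illustration and do not come from any real dataset.

```python
import numpy as np

# Multi-label: each example can carry several labels at once.
side_effect_names = ["bleeding", "nausea", "rash"]     # hypothetical label set
observed = [{"bleeding"}, {"nausea", "rash"}, set()]   # made-up observations

# Indicator matrix: one row per example, one 0/1 column per possible label.
Y_multilabel = np.array(
    [[int(name in effects) for name in side_effect_names] for effects in observed]
)

# Sequence labeling: one label per token, so the label list is as long as the input.
tokens = ["warfarin", "5", "mg", "oral"]
token_labels = ["drug-name", "dose", "dose", "route"]

print(Y_multilabel.shape)
print(list(zip(tokens, token_labels)))
```

In the multi-label case `y` becomes a 2D array of shape (n_samples, n_labels). In the sequence case each example carries a whole list of labels.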
The Feature Matrix Shape
scikit-learn expects features as a 2D array of shape (n_samples, n_features), with labels as a 1D array of shape (n_samples,).
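One practical consequence of this shape convention is that a single feature column or a single new sample has to be reshaped into 2D before it is passed to a model. The sketch below uses made-up values and NumPy's `reshape` to show both cases.

```python
import numpy as np

# One feature (age) for three patients: a 1D array of shape (3,)
ages = np.array([65, 42, 58])

# Single feature column: (n_samples,) -> (n_samples, 1)
X_one_feature = ages.reshape(-1, 1)

# Single new sample with three features: (n_features,) -> (1, n_features)
new_patient = np.array([65, 78, 1.2])
X_one_sample = new_patient.reshape(1, -1)

print(X_one_feature.shape)
print(X_one_sample.shape)
```

The `-1` tells NumPy to infer that dimension from the array's length.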
import numpy as np
# 1000 patients, each described by 20 features
X = np.random.randn(1000, 20) # Shape: (1000, 20)
y = np.random.randint(0, 2, 1000)  # Shape: (1000,), one label per patient
print(f"Samples: {X.shape[0]}")  # 1000, one row per patient
print(f"Features: {X.shape[1]}")  # 20, one column per feature
print(f"Labels: {y.shape}")  # (1000,)
Interview Answer Template
Q: What is the difference between a feature and a label?
Features are the input variables the model uses to make predictions; they describe each example (age, weight, lab values, text embeddings). Labels are the correct output values the model is trying to learn to predict: in training we know them, and at inference we're trying to figure them out. For example, in a drug-class classifier, the molecular weight, log-P, and binding affinity would be features, while the drug class ("anticoagulant", "antidiabetic") would be the label. Feature engineering (creating better features from raw data) often has more impact on model performance than changing the model architecture.