AI Systems · Intermediate
List Comprehensions
Write concise, readable list comprehensions in Python: basic syntax, filtering, nested comprehensions, when to use them vs loops, and patterns in AI/ML data processing.
Asma Hafeez Khan · May 16, 2026 · 6 min read
Python · List Comprehension · Functional Programming · Data Processing · Performance
Basic Syntax
A list comprehension creates a new list by applying an expression to each item in an iterable:
[expression for item in iterable]
Python
# Without comprehension
squares = []
for x in range(5):
    squares.append(x ** 2)
# [0, 1, 4, 9, 16]
# With comprehension: same result, one line
squares = [x ** 2 for x in range(5)]
# [0, 1, 4, 9, 16]
With Filtering
Add an if clause to filter items:
Python
# [expression for item in iterable if condition]
scores = [0.92, 0.45, 0.78, 0.61, 0.89, 0.33]
# Keep only passing scores (0.7 or above)
passing = [s for s in scores if s >= 0.7]
# [0.92, 0.78, 0.89]
# Transform AND filter: round passing scores to one decimal place
rounded_passing = [round(s, 1) for s in scores if s >= 0.7]
# [0.9, 0.8, 0.9]
Common Patterns in AI/ML Code
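One pattern that comes up often with real-world records is using the filter as a guard. The trailing `if` is evaluated before the expression, so the expression only runs for items that pass the check. A minimal sketch, assuming hypothetical records in which the `inr` field may be missing or `None`:

```python
# Hypothetical records: "inr" may be absent or None.
records = [
    {"id": "P001", "inr": 2.4},
    {"id": "P002"},
    {"id": "P003", "inr": None},
    {"id": "P004", "inr": 3.2},
]

# The condition runs before the expression, so r["inr"] is only
# evaluated for records that actually carry a value.
valid_inr = [r["inr"] for r in records if r.get("inr") is not None]

# The inverse filter collects the records that need follow-up.
missing_ids = [r["id"] for r in records if r.get("inr") is None]

print(valid_inr)    # [2.4, 3.2]
print(missing_ids)  # ['P002', 'P003']
```

Without the guard, `r["inr"]` would raise `KeyError` on the record that has no such key.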
Python
drug_names = [" Warfarin ", "ASPIRIN", " metformin ", "Lisinopril"]
# Normalize drug names: strip whitespace, lowercase
normalized = [name.strip().lower() for name in drug_names]
# ["warfarin", "aspirin", "metformin", "lisinopril"]
# Extract specific field from list of dicts
patients = [
    {"id": "P001", "inr": 2.4, "drug": "warfarin"},
    {"id": "P002", "inr": 1.8, "drug": "warfarin"},
    {"id": "P003", "inr": 3.2, "drug": "warfarin"},
]
inr_values = [p["inr"] for p in patients]  # [2.4, 1.8, 3.2]
patient_ids = [p["id"] for p in patients]  # ["P001", "P002", "P003"]
above_target = [p for p in patients if p["inr"] > 2.0]  # P001 and P003
# Extract + transform + filter
high_inr_ids = [p["id"] for p in patients if p["inr"] > 3.0]
# ["P003"]
# Process retrieved documents for RAG
from langchain_core.documents import Document
def format_retrieved_docs(docs: list[Document]) -> list[str]:
    return [
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
        if len(doc.page_content.strip()) > 50  # Skip near-empty chunks
    ]
Conditional Expression (Ternary) in Comprehensions
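The position of `if` changes its meaning. A trailing `if` (after the `for`) filters items out, while `a if cond else b` placed before the `for` is an ordinary conditional expression that produces a value for every item. A small sketch contrasting the two, using hypothetical sample scores:

```python
scores = [0.92, 0.45, 0.78]  # hypothetical sample data

# Trailing `if`: a filter. The result can be shorter than the input.
kept = [s for s in scores if s >= 0.7]

# Leading `if ... else`: a mapping. The result has one entry per input.
labeled = ["pass" if s >= 0.7 else "fail" for s in scores]

# Both can appear together: filter first, then choose a value per survivor.
graded = ["high" if s >= 0.9 else "ok" for s in scores if s >= 0.7]

print(kept)     # [0.92, 0.78]
print(labeled)  # ['pass', 'fail', 'pass']
print(graded)   # ['high', 'ok']
```

A trailing `if` cannot take an `else`; writing `[s for s in scores if s >= 0.7 else 0]` is a syntax error.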
Python
# [value_if_true if condition else value_if_false for item in iterable]
scores = [0.92, 0.45, 0.78, 0.61, 0.89]
# Label each score
labels = ["pass" if s >= 0.7 else "fail" for s in scores]
# ["pass", "fail", "pass", "fail", "pass"]
# Normalize: keep score if high, else replace with 0
filtered_scores = [s if s >= 0.7 else 0.0 for s in scores]
# [0.92, 0.0, 0.78, 0.0, 0.89]
Nested Comprehensions
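The `for` clauses of a nested comprehension run in the same order as the equivalent nested loops: leftmost clause outermost, with any trailing `if` innermost. A minimal sketch that writes both forms side by side on a small, made-up matrix:

```python
matrix = [[1, 2, 3], [4, 5, 6]]  # hypothetical sample data

# Explicit nested loops.
evens_loop = []
for row in matrix:            # first `for` clause: outer loop
    for x in row:             # second `for` clause: inner loop
        if x % 2 == 0:        # trailing `if`: innermost condition
            evens_loop.append(x)

# The same clauses, in the same left-to-right order.
evens_comp = [x for row in matrix for x in row if x % 2 == 0]

print(evens_comp)  # [2, 4, 6]
```

Reversing the two `for` clauses would raise `NameError`, because `row` would be used before it is bound.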
Python
# Flatten a 2D list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [x for row in matrix for x in row]
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Read as: "for each row in matrix, for each x in row, take x"
# Cartesian product
drugs = ["warfarin", "aspirin"]
routes = ["PO", "IV"]
combinations = [(drug, route) for drug in drugs for route in routes]
# [("warfarin", "PO"), ("warfarin", "IV"), ("aspirin", "PO"), ("aspirin", "IV")]
# 2D comprehension (list of lists)
grid = [[i * j for j in range(1, 4)] for i in range(1, 4)]
# [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
Generator Expressions (Memory-Efficient Alternative)
Replace [] with () to create a generator expression, which computes values lazily, one at a time:
Python
# List comprehension: creates the entire list in memory
total_chars = sum([len(text) for text in huge_text_list])
# Generator expression: computes each value as needed, so no full list is held in memory
total_chars = sum(len(text) for text in huge_text_list) # Same result, less RAM
# When to use generator vs list:
# - Use list when you need to access elements multiple times or by index
# - Use generator (no []) when you only need to iterate once or feed into sum/max/min/any/all
# Generators with any() and all() short-circuit
has_major = any(interaction["severity"] == "Major" for interaction in interactions)
# Stops at the first "Major" and doesn't process the rest
all_pass = all(score >= 0.7 for score in scores)
# Stops at the first failing score
Performance Comparison
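Speed is only one cost; memory and skipped work matter too. The sketch below is a rough illustration, not a benchmark. It compares the container size of a list and a generator with `sys.getsizeof`, then counts how many items `any()` inspects before it short-circuits. The data and the `is_major` helper are hypothetical.

```python
import sys

n = 100_000
as_list = [i * 2 for i in range(n)]   # materializes n integers up front
as_gen = (i * 2 for i in range(n))    # nothing is computed yet

# getsizeof reports the container only, not the objects it references,
# so the list figure is a rough lower bound.
list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)
print(f"list: {list_bytes} bytes | generator: {gen_bytes} bytes")

# Short-circuiting: record every item that any() actually inspects.
inspected = []

def is_major(severity: str) -> bool:
    inspected.append(severity)
    return severity == "Major"

severities = ["Minor", "Major", "Minor", "Minor"]  # hypothetical data
found = any(is_major(s) for s in severities)
print(found, len(inspected))  # True 2
```

Wrapping the same expression in square brackets, `any([...])`, would call `is_major` on all four items before `any()` sees the first one.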
Python
import timeit
data = list(range(10_000))
# Method 1: for loop
def with_loop():
    result = []
    for x in data:
        if x % 2 == 0:
            result.append(x ** 2)
    return result
# Method 2: list comprehension
def with_comprehension():
    return [x ** 2 for x in data if x % 2 == 0]
# Method 3: map + filter (functional)
def with_map_filter():
    return list(map(lambda x: x ** 2, filter(lambda x: x % 2 == 0, data)))
t1 = timeit.timeit(with_loop, number=1000)
t2 = timeit.timeit(with_comprehension, number=1000)
t3 = timeit.timeit(with_map_filter, number=1000)
print(f"Loop: {t1:.3f}s | Comprehension: {t2:.3f}s | Map+Filter: {t3:.3f}s")
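# The comprehension is typically faster than the equivalent for loop, mainly
# because it avoids looking up and calling result.append on every iteration.
# The exact margin depends on the Python version and workload, so measure on your own setup.
When NOT to Use Comprehensions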
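One case where a plain loop is the clearer tool is per-item error handling: a comprehension has no place for `try`/`except`. A minimal sketch, assuming hypothetical raw lab values that arrive as strings and may not parse:

```python
raw_values = ["2.4", "n/a", "3.2", ""]  # hypothetical raw input

parsed = []
rejected = []
for raw in raw_values:
    try:
        parsed.append(float(raw))
    except ValueError:
        # Keep the bad value so it can be logged or reviewed later.
        rejected.append(raw)

print(parsed)    # [2.4, 3.2]
print(rejected)  # ['n/a', '']
```

If a comprehension is still preferred, the `try`/`except` can move into a small named helper that returns `None` on failure, with the comprehension filtering out the `None` results.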
Python
# Don't: bury complex logic in a comprehension, where it is hard to read
result = [
    process_clinical_note(note)
    for note in notes
    if note.get("status") == "active"
    and note.get("category") in {"pharmacist", "physician"}
    and len(note.get("content", "")) > 100
]
# Better: name the filter condition
def is_valid_note(note: dict) -> bool:
    return (
        note.get("status") == "active"
        and note.get("category") in {"pharmacist", "physician"}
        and len(note.get("content", "")) > 100
    )
result = [process_clinical_note(note) for note in notes if is_valid_note(note)]
# Don't: use a comprehension for side effects; use a for loop
# Wrong:
_ = [print(drug) for drug in drugs]  # Side effect in comprehension
# Right:
for drug in drugs:
    print(drug)
# Don't: nest more than 2 levels deep
# Two levels is the limit for readability; beyond that, use a for loop with names
Data Processing Examples for AI
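Deeply nested inputs, such as batches of tokenized documents, are where a third `for` clause becomes tempting. The sketch below uses hypothetical data to show a three-level comprehension next to named loops that produce the same result.

```python
# Hypothetical data: batches -> documents -> tokens.
batches = [
    [["warfarin", "interacts"], ["with", "aspirin"]],
    [["metformin", "dose"]],
]

# Three levels in one expression: legal, but the reader has to unpack it.
tokens_comp = [tok for batch in batches for doc in batch for tok in doc]

# Named loops: each level is visible, and there is room for
# logging or validation at any depth.
tokens_loop = []
for batch in batches:
    for doc in batch:
        tokens_loop.extend(doc)

print(tokens_loop)
# ['warfarin', 'interacts', 'with', 'aspirin', 'metformin', 'dose']
```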
Python
# 1. Build embedding batch
def prepare_embedding_batch(documents: list[dict]) -> list[str]:
    return [
        f"Title: {doc['title']}\n{doc['content']}"
        for doc in documents
        if doc.get("content")
    ]
# 2. Extract unique sources from retrieved docs
def get_unique_sources(docs: list) -> list[str]:
    # Set comprehension removes duplicates; the resulting order is not guaranteed
    return list({doc.metadata.get("source", "") for doc in docs if doc.metadata.get("source")})
# 3. Format Q&A dataset from raw pairs
qa_pairs = [("What is warfarin?", "Warfarin is an anticoagulant..."), ("What is metformin?", "Metformin is a biguanide...")]
formatted_dataset = [
    {"messages": [{"role": "user", "content": q}, {"role": "assistant", "content": a}]}
    for q, a in qa_pairs
]
# 4. Filter and score retrieval results
def filter_by_score(results: list[tuple], min_score: float = 0.75) -> list:
    return [doc for doc, score in results if score >= min_score]
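Helpers like these are easy to sanity-check on small hand-made inputs. The sketch below redefines two of them so it runs on its own. `SimpleNamespace` stands in for a retrieved document with a `metadata` attribute, and the file names and scores are made up. This variant of `get_unique_sources` uses `sorted()` instead of `list()` so the output order is deterministic, which is a swap made for the example rather than something the original code does.

```python
from types import SimpleNamespace

def get_unique_sources(docs: list) -> list[str]:
    # sorted() gives a stable order; a bare set has no guaranteed order.
    return sorted({doc.metadata["source"] for doc in docs if doc.metadata.get("source")})

def filter_by_score(results: list[tuple], min_score: float = 0.75) -> list:
    return [doc for doc, score in results if score >= min_score]

# Hypothetical stand-ins for retrieved documents.
docs = [
    SimpleNamespace(metadata={"source": "guideline.pdf"}),
    SimpleNamespace(metadata={"source": "label.pdf"}),
    SimpleNamespace(metadata={"source": "guideline.pdf"}),
    SimpleNamespace(metadata={}),  # no source recorded
]
results = list(zip(docs, [0.91, 0.60, 0.80, 0.99]))  # (doc, score) pairs

unique_sources = get_unique_sources(docs)
kept = filter_by_score(results)

print(unique_sources)  # ['guideline.pdf', 'label.pdf']
print(len(kept))       # 3
```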