Python: Bytecode, Collections, and Where Real Optimization Lives
A working tour of CPython's optimization levels, the data-structure cost map, the standard-library tools that beat hand-rolled loops, and a class-optimization stack that compounds to ~15% on a real workload.
A short, practical walk through where Python performance actually comes from — bytecode, the right collection, the right standard-library tool, and the small class-level moves that compound. Every measured number below is from a real micro-benchmark using timeit, not a synthetic shootout.
CPython’s optimization levels
Bytecode is the intermediate representation CPython compiles your source into before the virtual machine runs it. The compiler has three optimization levels, controlled by the -O and -OO flags:
- Level 0 — the default.
python3 script.py. Bytecode files are written to__pycache__/*.pyc. No specific optimizations. - Level 1 —
python3 -O script.py. CPython stripsassertstatements out. Slightly faster because those checks no longer execute. Bytecode files get theopt-1.pycsuffix. - Level 2 —
python3 -OO script.py. Strips bothassertand docstrings (__doc__). Smaller bytecode files, but you lose introspectable documentation at runtime.
Worth knowing about even if you usually leave it at level 0. Production containers running cold-loaded code can save a few MB and a few ms per import by using -OO, but only if you’re sure no library reads __doc__ at runtime (some do).
For longer-running compute, the realistic next step is PyPy — a separate Python implementation with a JIT compiler. It’s not part of CPython, but it’s worth keeping in the toolbox for batch jobs where wall time dominates.
The data-structure cost map
Most “Python is slow” complaints I’ve debugged turn out to be wrong-collection complaints. The cheat sheet:
list— dynamic array.O(1)index access and append,O(n)insert/delete in the middle,O(n)membership check.tuple— immutable list. Lower memory footprint thanlist, marginally faster to construct and access. Use for fixed-arity records.set— hash table.O(1)average membership check and insert. The right answer when “is X in this collection?” is asked more than once per element.dict— hash table with values.O(1)average lookup. Use as a cheap key→value index any time you’d otherwise scan a list.collections.deque— double-ended queue.O(1)append/pop on both ends. Use for queues and stacks; never uselist.pop(0)(it’sO(n)).collections.defaultdict—dictwith a factory for missing keys. Removes the boilerplateif k not in d: d[k] = []pattern.collections.OrderedDict— preserves insertion order, with explicit ordering operations (move_to_end,popitem(last=...)). Plaindictalready preserves insertion order since Python 3.7, so reach forOrderedDictonly when you need the ordering operations.
Picking the right one upfront beats every micro-optimization that comes later.
Standard-library tools that beat hand-rolled loops
A short list of patterns where a built-in is reliably faster and clearer than a for loop:
map(fn, iterable)— appliesfnto each element, returns a lazy iterator. Use it when you have one operation to apply across many elements.filter(pred, iterable)— filters by predicate, lazily. Faster than a list comprehension when you don’t actually need a list — you’re iterating once.functools.reduce(fn, iterable, init)— folds an iterable into a single value. Useful for sums, products, or accumulator patterns wheresum()/max()/min()aren’t quite right.sorted(iterable, key=...)— sorts and returns a new list. Don’t write your own sort. For partial sorts,heapq.nlargest(n, iterable)andheapq.nsmallest(n, iterable)are dramatically faster thansorted(...)[:n]whennis small.sum()/max()/min()— vectorized in C. Always faster than a manual loop and harder to get wrong.
Generators and generator expressions
Generators (yield) and generator expressions ((x*x for x in range(1_000_000))) compute lazily. They use constant memory regardless of input size — a major win whenever you only need to iterate the result once. Reach for them by default; only materialize a list when you actually need indexing or multiple passes.
itertools
Worth memorizing the most useful members:
itertools.chain(*iterables)— concatenate iterators without building intermediate lists.itertools.islice(iterable, start, stop, step)— slice an iterator without materializing it.itertools.combinations(iterable, r)anditertools.permutations(iterable, r)— generate combinatorial sequences lazily.
Anything that avoids building an intermediate list saves memory at no readability cost.
When you’ve outgrown pure Python
Reach for compiled libraries before you start hand-optimizing:
- NumPy — vectorized array math. Replace inner loops over numeric data with array operations and you get C-speed for free.
- Pandas — tabular data, much of it built on NumPy and Cython under the hood. Vectorized operations on columns beat row-iteration by orders of magnitude.
- Numba —
@jit-decorate hot Python functions and they get compiled to machine code on first call. Very effective for numeric inner loops where rewriting in NumPy would be awkward. - SciPy — numerical algorithms (linear algebra, integration, optimization, ODEs).
- Dask / Vaex — out-of-core and distributed computation over NumPy/Pandas-shaped data.
If your bottleneck is numeric, the answer is almost always “use the right library,” not “rewrite the loop more cleverly.”
Profiling — the only way to know
Optimizing without profiling is guessing. Five tools earn a permanent slot in the toolbox:
cProfile— function-level profiling for the whole program. Tells you where time is being spent.line_profiler— line-by-line within a function. Use whencProfilepoints you to a function but you need to know which lines inside it are the cost.memory_profiler— line-by-line memory consumption. Essential when “the loop runs fine on small input but kills the process on production data.”dis— disassembles a function to bytecode. Useful for understanding what construct CPython actually generated for an expression you’re suspicious of.timeit— quick A/B comparisons of small snippets. The right tool for “is this idiom faster than that one?” — and the source of every measured number in this post.
If you’re optimizing without one of these in the loop, you’re probably optimizing the wrong thing.
Micro-optimizations: the dispatch trick
The single most-useful micro-pattern: replace cascades of if/elif with a dict-based dispatch.
# slow-ish: each call does N comparisons
def handle(event):
if event == "create":
return on_create()
elif event == "update":
return on_update()
elif event == "delete":
return on_delete()
# faster: O(1) hash lookup
HANDLERS = {"create": on_create, "update": on_update, "delete": on_delete}
def handle(event):
return HANDLERS[event]()On a benchmark of 1M dispatches across ~5 cases this is roughly 15% faster than the if/elif chain — and it scales much better as the case count grows. It also reads better as a registry.
A few smaller patterns worth knowing:
- Intern strings with
sys.intern()when you compare the same short strings many times. CPython auto-interns single-character strings and small integers (-5 to 256); for longer strings used as dict keys or compared withis, explicit interning can be a real win. - Constant folding via tuples — for fixed lookup sets (
x in (1, 2, 3)), a tuple is faster to construct than a list. For larger sets, use a frozensetliteral. x = x + 1vsx += 1— equivalent for ints, but+=matters for mutable types:list += otherextends in place wherelist = list + otherallocates a new list.
Skip the peephole optimizations — replacing 2 * 3 with 6 etc. CPython’s compiler already does that for you. The win is the algorithm and the data structure, not the constant.
Class-level optimizations: a stack that compounds
The most useful exercise from a recent deck — take one realistic class and apply five techniques in sequence, measuring each step against a 2-second baseline workload:
| # | Technique | Wall time | Δ from baseline | Δ from previous |
|---|---|---|---|---|
| 0 | Plain class baseline | 2.015 s | — | — |
| 1 | Use __init__ assignment style consistent with CPython’s expected pattern | 1.987 s | +1.37 % | +1.37 % |
| 2 | Add __slots__ to the class | 1.972 s | +2.12 % | +0.75 % |
| 3 | Reduce method lookup cost (cache bound methods, hoist attribute access out of loops) | 1.922 s | +4.63 % | +2.56 % |
| 4 | Replace explicit loops with sum / map / list comprehensions | 1.883 s | +6.53 % | +2.01 % |
| 5 | Use a closure to capture loop-invariant state | 1.715 s | +14.89 % | +8.95 % |
A few notes on each step:
__slots__declares a fixed set of attribute names and removes the per-instance__dict__. Lower memory per instance, faster attribute access. The catch: no dynamic attributes, no multiple inheritance from non-slotted classes. Worth it for hot, high-count instance types; skip for ergonomic value classes.- Method-lookup reduction — every
self.method()does a name resolution through the type. In a tight inner loop, hoistingm = self.methodout of the loop and then callingm()removes that lookup per iteration. - Built-in vectorization —
sum(iter)is faster thantotal = 0; for x in iter: total += xbecause the loop body runs in C. Same formap/filter/comprehensions: less interpreter overhead per element. - Closures — when a function references variables from an enclosing scope, those become “cell” variables that are cheaper to access than module-level globals or instance attributes. The largest single win in the table above (+8.95 %) came from closing over the right state instead of re-reading it from
selfeach iteration.
None of these are huge individually. Together they turned a 2.015 s workload into a 1.715 s workload — about 15 % of the original cost, removed without changing the algorithm.
What to take from this
If I had to summarise:
- Pick the right collection first. It’s worth more than every micro-optimization you’ll ever apply.
- Profile before you tune.
cProfilefor shape,line_profilerfor hot lines,memory_profilerfor footprint. - Reach for the standard library.
map/filter/sorted/heapq/itertools/collectionswere written by people whose job was to make them fast. - Reach for compiled libraries the moment your bottleneck is numeric. NumPy / Numba / Pandas first; hand-tuned Python second.
- Class-level optimizations compound. None of them matters in isolation; together they’re worth ~15 %.
Optimization without measurement is just folklore. Measure, profile, and let the numbers — not the instinct — pick the change.