Python: Bytecode, Collections, and Where Real Optimization Lives

· python · performance · profiling · internals

A working tour of CPython's optimization levels, the data-structure cost map, the standard-library tools that beat hand-rolled loops, and a class-optimization stack that compounds to ~15% on a real workload.

A short, practical walk through where Python performance actually comes from — bytecode, the right collection, the right standard-library tool, and the small class-level moves that compound. Every measured number below is from a real micro-benchmark using timeit, not a synthetic shootout.


CPython’s optimization levels

Bytecode is the intermediate representation CPython compiles your source into before the virtual machine runs it. The compiler has three optimization levels, controlled by the -O and -OO flags:

  • Level 0 — the default. python3 script.py. Bytecode files are written to __pycache__/*.pyc. No specific optimizations.
  • Level 1python3 -O script.py. CPython strips assert statements out. Slightly faster because those checks no longer execute. Bytecode files get the opt-1.pyc suffix.
  • Level 2python3 -OO script.py. Strips both assert and docstrings (__doc__). Smaller bytecode files, but you lose introspectable documentation at runtime.

Worth knowing about even if you usually leave it at level 0. Production containers running cold-loaded code can save a few MB and a few ms per import by using -OO, but only if you’re sure no library reads __doc__ at runtime (some do).

For longer-running compute, the realistic next step is PyPy — a separate Python implementation with a JIT compiler. It’s not part of CPython, but it’s worth keeping in the toolbox for batch jobs where wall time dominates.


The data-structure cost map

Most “Python is slow” complaints I’ve debugged turn out to be wrong-collection complaints. The cheat sheet:

  • list — dynamic array. O(1) index access and append, O(n) insert/delete in the middle, O(n) membership check.
  • tuple — immutable list. Lower memory footprint than list, marginally faster to construct and access. Use for fixed-arity records.
  • set — hash table. O(1) average membership check and insert. The right answer when “is X in this collection?” is asked more than once per element.
  • dict — hash table with values. O(1) average lookup. Use as a cheap key→value index any time you’d otherwise scan a list.
  • collections.deque — double-ended queue. O(1) append/pop on both ends. Use for queues and stacks; never use list.pop(0) (it’s O(n)).
  • collections.defaultdictdict with a factory for missing keys. Removes the boilerplate if k not in d: d[k] = [] pattern.
  • collections.OrderedDict — preserves insertion order, with explicit ordering operations (move_to_end, popitem(last=...)). Plain dict already preserves insertion order since Python 3.7, so reach for OrderedDict only when you need the ordering operations.

Picking the right one upfront beats every micro-optimization that comes later.


Standard-library tools that beat hand-rolled loops

A short list of patterns where a built-in is reliably faster and clearer than a for loop:

  • map(fn, iterable) — applies fn to each element, returns a lazy iterator. Use it when you have one operation to apply across many elements.
  • filter(pred, iterable) — filters by predicate, lazily. Faster than a list comprehension when you don’t actually need a list — you’re iterating once.
  • functools.reduce(fn, iterable, init) — folds an iterable into a single value. Useful for sums, products, or accumulator patterns where sum()/max()/min() aren’t quite right.
  • sorted(iterable, key=...) — sorts and returns a new list. Don’t write your own sort. For partial sorts, heapq.nlargest(n, iterable) and heapq.nsmallest(n, iterable) are dramatically faster than sorted(...)[:n] when n is small.
  • sum() / max() / min() — vectorized in C. Always faster than a manual loop and harder to get wrong.

Generators and generator expressions

Generators (yield) and generator expressions ((x*x for x in range(1_000_000))) compute lazily. They use constant memory regardless of input size — a major win whenever you only need to iterate the result once. Reach for them by default; only materialize a list when you actually need indexing or multiple passes.

itertools

Worth memorizing the most useful members:

  • itertools.chain(*iterables) — concatenate iterators without building intermediate lists.
  • itertools.islice(iterable, start, stop, step) — slice an iterator without materializing it.
  • itertools.combinations(iterable, r) and itertools.permutations(iterable, r) — generate combinatorial sequences lazily.

Anything that avoids building an intermediate list saves memory at no readability cost.


When you’ve outgrown pure Python

Reach for compiled libraries before you start hand-optimizing:

  • NumPy — vectorized array math. Replace inner loops over numeric data with array operations and you get C-speed for free.
  • Pandas — tabular data, much of it built on NumPy and Cython under the hood. Vectorized operations on columns beat row-iteration by orders of magnitude.
  • Numba@jit-decorate hot Python functions and they get compiled to machine code on first call. Very effective for numeric inner loops where rewriting in NumPy would be awkward.
  • SciPy — numerical algorithms (linear algebra, integration, optimization, ODEs).
  • Dask / Vaex — out-of-core and distributed computation over NumPy/Pandas-shaped data.

If your bottleneck is numeric, the answer is almost always “use the right library,” not “rewrite the loop more cleverly.”


Profiling — the only way to know

Optimizing without profiling is guessing. Five tools earn a permanent slot in the toolbox:

  • cProfile — function-level profiling for the whole program. Tells you where time is being spent.
  • line_profiler — line-by-line within a function. Use when cProfile points you to a function but you need to know which lines inside it are the cost.
  • memory_profiler — line-by-line memory consumption. Essential when “the loop runs fine on small input but kills the process on production data.”
  • dis — disassembles a function to bytecode. Useful for understanding what construct CPython actually generated for an expression you’re suspicious of.
  • timeit — quick A/B comparisons of small snippets. The right tool for “is this idiom faster than that one?” — and the source of every measured number in this post.

If you’re optimizing without one of these in the loop, you’re probably optimizing the wrong thing.


Micro-optimizations: the dispatch trick

The single most-useful micro-pattern: replace cascades of if/elif with a dict-based dispatch.

# slow-ish: each call does N comparisons
def handle(event):
    if event == "create":
        return on_create()
    elif event == "update":
        return on_update()
    elif event == "delete":
        return on_delete()
 
# faster: O(1) hash lookup
HANDLERS = {"create": on_create, "update": on_update, "delete": on_delete}
 
def handle(event):
    return HANDLERS[event]()

On a benchmark of 1M dispatches across ~5 cases this is roughly 15% faster than the if/elif chain — and it scales much better as the case count grows. It also reads better as a registry.

A few smaller patterns worth knowing:

  • Intern strings with sys.intern() when you compare the same short strings many times. CPython auto-interns single-character strings and small integers (-5 to 256); for longer strings used as dict keys or compared with is, explicit interning can be a real win.
  • Constant folding via tuples — for fixed lookup sets (x in (1, 2, 3)), a tuple is faster to construct than a list. For larger sets, use a frozen set literal.
  • x = x + 1 vs x += 1 — equivalent for ints, but += matters for mutable types: list += other extends in place where list = list + other allocates a new list.

Skip the peephole optimizations — replacing 2 * 3 with 6 etc. CPython’s compiler already does that for you. The win is the algorithm and the data structure, not the constant.


Class-level optimizations: a stack that compounds

The most useful exercise from a recent deck — take one realistic class and apply five techniques in sequence, measuring each step against a 2-second baseline workload:

#TechniqueWall timeΔ from baselineΔ from previous
0Plain class baseline2.015 s
1Use __init__ assignment style consistent with CPython’s expected pattern1.987 s+1.37 %+1.37 %
2Add __slots__ to the class1.972 s+2.12 %+0.75 %
3Reduce method lookup cost (cache bound methods, hoist attribute access out of loops)1.922 s+4.63 %+2.56 %
4Replace explicit loops with sum / map / list comprehensions1.883 s+6.53 %+2.01 %
5Use a closure to capture loop-invariant state1.715 s+14.89 %+8.95 %

A few notes on each step:

  • __slots__ declares a fixed set of attribute names and removes the per-instance __dict__. Lower memory per instance, faster attribute access. The catch: no dynamic attributes, no multiple inheritance from non-slotted classes. Worth it for hot, high-count instance types; skip for ergonomic value classes.
  • Method-lookup reduction — every self.method() does a name resolution through the type. In a tight inner loop, hoisting m = self.method out of the loop and then calling m() removes that lookup per iteration.
  • Built-in vectorizationsum(iter) is faster than total = 0; for x in iter: total += x because the loop body runs in C. Same for map/filter/comprehensions: less interpreter overhead per element.
  • Closures — when a function references variables from an enclosing scope, those become “cell” variables that are cheaper to access than module-level globals or instance attributes. The largest single win in the table above (+8.95 %) came from closing over the right state instead of re-reading it from self each iteration.

None of these are huge individually. Together they turned a 2.015 s workload into a 1.715 s workload — about 15 % of the original cost, removed without changing the algorithm.


What to take from this

If I had to summarise:

  1. Pick the right collection first. It’s worth more than every micro-optimization you’ll ever apply.
  2. Profile before you tune. cProfile for shape, line_profiler for hot lines, memory_profiler for footprint.
  3. Reach for the standard library. map/filter/sorted/heapq/itertools/collections were written by people whose job was to make them fast.
  4. Reach for compiled libraries the moment your bottleneck is numeric. NumPy / Numba / Pandas first; hand-tuned Python second.
  5. Class-level optimizations compound. None of them matters in isolation; together they’re worth ~15 %.

Optimization without measurement is just folklore. Measure, profile, and let the numbers — not the instinct — pick the change.