Chelombus: Clustering 9.6 Billion Molecules

First tool to cluster and browse billions of molecules on a single computer.

Before Chelombus, the largest molecule sets you could visualize interactively were in the low millions. I pushed that to 9.6 billion. The idea is simple: compress each molecule into a short numerical fingerprint, group the fingerprints into 100,000 clusters, pick a representative molecule for each cluster, and build an interactive tree-map you can click through. Each cluster links to its own detailed map, so you can go from a bird's-eye view of billions down to a single molecule in two clicks. The whole thing runs on one workstation: no compute cluster, no cloud.
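
The compression step can be illustrated with a toy product-quantization encoder. This is a minimal sketch, not the project's actual code: the function names, the 4-dimensional fingerprint, and the tiny hand-written codebooks are all assumptions made for illustration. In practice the codebooks would be learned from data and the fingerprints would come from a cheminformatics toolkit.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Split `vector` into len(codebooks) equal sub-vectors and replace
    each sub-vector with the index of its nearest centroid."""
    m = len(codebooks)
    if m == 0 or len(vector) % m != 0:
        raise ValueError("vector length must be a multiple of the number of codebooks")
    sub_len = len(vector) // m
    code = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(sub, codebook[k]))
        code.append(nearest)
    return code


# Hypothetical toy setup: 2 sub-vectors, 2 centroids per codebook.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],  # centroids for the first half
    [(0.0, 1.0), (1.0, 0.0)],  # centroids for the second half
]
fingerprint = (0.9, 0.8, 0.1, 0.7)  # made-up values, not a real molecule
code = pq_encode(fingerprint, codebooks)
print(code)  # [1, 0]
```

Each molecule is then stored as a handful of small integers instead of the full fingerprint, which is what makes clustering at this scale fit on one machine.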

Tech Stack

Python
RDKit
Product Quantization
PQk-means
TMAP
ClickHouse
Docker

Features

Scales interactive visualization from a few million molecules (the previous limit) to 9.6 billion.
100,000 clusters, each with a representative molecule and a nested interactive map.
Two-click navigation: global overview, cluster, individual molecule.
Runs on a single workstation, no cloud infrastructure needed.
Streaming pipeline processes molecules in batches so memory stays manageable.
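
The streaming idea in the last feature can be sketched with a plain generator. This is an illustrative sketch under assumptions: `batched` and `encode_stream` are hypothetical helper names, and `len` stands in for the real fingerprint-and-compress step, which the page does not spell out.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def encode_stream(smiles_stream: Iterable[str],
                  encode: Callable[[str], int],
                  batch_size: int) -> Iterator[List[int]]:
    """Yield one list of codes per batch; only one batch is held in memory."""
    for batch in batched(smiles_stream, batch_size):
        yield [encode(smiles) for smiles in batch]


# Toy usage: a generator stands in for a multi-billion-line input file,
# and len() stands in for the real encoding step (an assumption).
toy_stream = (s for s in ["CCO", "c1ccccc1", "CC(=O)O", "CN", "O"])
batch_sizes = [len(codes) for codes in encode_stream(toy_stream, len, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```

Because the input is consumed lazily, peak memory depends on the batch size rather than on the total number of molecules.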

Challenges

Fitting 9.6 billion molecules into memory on a single machine (solved with streaming + compression).
Keeping clusters chemically meaningful after aggressive compression.
Making the interactive maps responsive at this scale.
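
A quick back-of-the-envelope calculation shows why compression matters for the memory challenge. The per-molecule sizes below are assumptions chosen for illustration (a 2048-bit fingerprint and an 8-byte compressed code); the page does not state the actual sizes used.

```python
def storage_bytes(n_molecules: int, bytes_per_molecule: int) -> int:
    """Total bytes needed to hold one fixed-size record per molecule."""
    return n_molecules * bytes_per_molecule


N_MOLECULES = 9_600_000_000
RAW_BYTES = 256   # assumption: 2048-bit fingerprint = 256 bytes
CODE_BYTES = 8    # assumption: 8 one-byte product-quantization sub-codes

raw_total = storage_bytes(N_MOLECULES, RAW_BYTES)
code_total = storage_bytes(N_MOLECULES, CODE_BYTES)

print(f"raw: {raw_total / 1e12:.2f} TB")          # raw: 2.46 TB
print(f"compressed: {code_total / 1e9:.1f} GB")   # compressed: 76.8 GB
print(f"ratio: {RAW_BYTES // CODE_BYTES}x")       # ratio: 32x
```

Under these assumed sizes, the raw fingerprints would need terabytes, while the compressed codes land in a range a well-equipped workstation can hold.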

Learnings

Streaming and compression can push single-machine workflows surprisingly far.
Good UX matters just as much as the algorithm. Nobody uses a tool they can't navigate.
Nested maps let you keep the big picture without losing detail.

Links

Live
GitHub