Chelombus: Clustering 9.6 Billion Molecules
First tool to cluster and browse billions of molecules on a single computer.
Before Chelombus, the largest molecule sets you could visualize interactively were in the low millions. I pushed that to 9.6 billion. The idea is simple: compress each molecule into a short numerical fingerprint, cluster the fingerprints into 100,000 groups, pick a representative molecule for each group, and build an interactive tree-map you can click through. Each cluster links to its own detailed map, so you can go from a bird's-eye view of billions down to a single molecule in two clicks. The whole pipeline runs on one workstation, with no compute cluster and no cloud.
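To make the compression step concrete, here is a minimal sketch of how product quantization typically turns a vector into a short code: the vector is split into sub-vectors, and each sub-vector is replaced by the index of its nearest centroid. The function names (`pq_encode`, `squared_distance`) and the tiny hard-coded codebooks are illustrative assumptions, not Chelombus code; in practice the codebooks would be learned from data, usually with k-means.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Encode a vector as one centroid index per sub-space.

    `codebooks[i]` holds the centroids for the i-th sub-vector.
    (Hypothetical helper for illustration only.)
    """
    n_subspaces = len(codebooks)
    if len(vector) % n_subspaces != 0:
        raise ValueError("vector length must be divisible by the number of sub-spaces")
    sub_len = len(vector) // n_subspaces

    code = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        # Index of the centroid closest to this sub-vector.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(sub, codebook[k]),
        )
        code.append(nearest)
    return code


# Toy data (assumed): a 4-dimensional vector, 2 sub-spaces, 2 centroids each.
toy_codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]
toy_code = pq_encode((0.9, 0.8, 0.1, 0.9), toy_codebooks)
```

With 256 centroids per sub-space, each index fits in one byte, so the code length in bytes equals the number of sub-spaces. That is what makes the compressed representation small enough to cluster at scale.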
Tech Stack
Python, RDKit, Product Quantization, PQk-means, TMAP, ClickHouse, Docker
Features
- Scales interactive visualization from a few million molecules (the previous limit) to 9.6 billion.
- 100,000 clusters, each with a representative molecule and a nested interactive map.
- Two-click navigation: from the global overview to a cluster, then to an individual molecule.
- Runs on a single workstation; no cloud infrastructure needed.
- Streaming pipeline processes molecules in batches so memory stays manageable.
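The streaming idea can be sketched in a few lines: read molecules lazily and hand them to the next stage in fixed-size batches, so only one batch is held in memory at a time. This is a generic pattern, not the project's actual pipeline; the `batched` helper, the batch size, and the example SMILES strings are assumptions for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items from `items`.

    Only one batch is materialized at a time, so memory use depends on
    `batch_size`, not on the total length of the stream.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example input (assumed): a lazy stream of SMILES strings, e.g. lines of a file.
smiles_stream = iter(["CCO", "c1ccccc1", "CC(=O)O", "CN", "O"])
batch_sizes = [len(batch) for batch in batched(smiles_stream, 2)]
```

In a real run the stream would be a file handle or a database cursor, and each batch would be fingerprinted and compressed before the next one is read.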
Challenges
- Fitting 9.6 billion molecules into memory on a single machine (solved with streaming and compression).
- Keeping clusters chemically meaningful after aggressive compression.
- Making the interactive maps responsive at this scale.
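A back-of-the-envelope calculation shows why compression is needed to meet the memory challenge. The per-molecule sizes below are assumptions chosen for illustration (a 2048-bit fingerprint stored as 256 bytes, and a 6-byte quantized code); the source does not state the sizes Chelombus actually uses.

```python
def footprint_gib(n_items: int, bytes_per_item: float) -> float:
    """Total storage in GiB for `n_items` records of `bytes_per_item` bytes each."""
    return n_items * bytes_per_item / 2**30


N_MOLECULES = 9_600_000_000  # dataset size stated in the write-up

# Assumed sizes, for illustration only:
RAW_BYTES = 2048 // 8  # a 2048-bit fingerprint packed into 256 bytes
CODE_BYTES = 6         # a hypothetical 6-byte product-quantization code

raw_gib = footprint_gib(N_MOLECULES, RAW_BYTES)
compressed_gib = footprint_gib(N_MOLECULES, CODE_BYTES)
compression_ratio = raw_gib / compressed_gib

print(f"raw: {raw_gib:.0f} GiB, compressed: {compressed_gib:.0f} GiB, "
      f"ratio: {compression_ratio:.1f}x")
```

Under these assumptions the uncompressed fingerprints need terabytes, while the compressed codes need tens of gigabytes, which is within reach of a single workstation's RAM.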
Learnings
- Streaming and compression can push single-machine workflows surprisingly far.
- Good UX matters just as much as the algorithm. Nobody uses a tool they can't navigate.
- Nested maps let you keep the big picture without losing detail.