Chelombus: Clustering 9.6 Billion Molecules

First tool to cluster and browse billions of molecules on a single computer.

Before Chelombus, the largest molecule sets you could visualize interactively were in the low millions. I pushed that to 9.6 billion. The idea is simple: compress each molecule into a short numerical fingerprint, cluster the fingerprints into 100,000 groups, pick a representative molecule for each group, and build an interactive tree-map you can click through. Each cluster links to its own detailed map, so you can go from a bird's-eye view of billions down to a single molecule in two clicks. The whole thing runs on one workstation: no compute cluster, no cloud.
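To make the compression step concrete, here is a minimal sketch of product-quantization encoding in plain Python. It is an illustration, not the Chelombus implementation: the function names, the toy codebooks, and the 4-dimensional vector are all assumptions. In practice, codebooks are learned with k-means on a sample of the data.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[List[float]]  # the centroids for one subspace


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector: Vector, codebooks: List[Codebook]) -> List[int]:
    """Split the vector into len(codebooks) chunks and replace each chunk
    with the index of its nearest centroid in the matching codebook."""
    n_subspaces = len(codebooks)
    if len(vector) % n_subspaces != 0:
        raise ValueError("vector length must be divisible by the number of subspaces")
    sub_len = len(vector) // n_subspaces
    code = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(sub, codebook[k]))
        code.append(nearest)
    return code


def pq_decode(code: List[int], codebooks: List[Codebook]) -> List[float]:
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx: List[float] = []
    for index, codebook in zip(code, codebooks):
        approx.extend(codebook[index])
    return approx


# Hypothetical toy codebooks: 2 subspaces, 2 centroids each.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

code = pq_encode([0.9, 0.8, 0.1, 0.9], TOY_CODEBOOKS)
print(code)                            # [1, 0]
print(pq_decode(code, TOY_CODEBOOKS))  # [1.0, 1.0, 0.0, 1.0]
```

The stored code is just one small integer per subspace, which is what makes billions of entries tractable. Clustering can then work on these codes instead of the full vectors.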

Tech Stack

Python, RDKit, Product Quantization, PQk-means, TMAP, ClickHouse, Docker

Features

  • Scales interactive visualization from a few million molecules (the previous limit) to 9.6 billion.
  • 100,000 clusters, each with a representative molecule and a nested interactive map.
  • Two-click navigation: global overview, cluster, individual molecule.
  • Runs on a single workstation, no cloud infrastructure needed.
  • Streaming pipeline processes molecules in batches so memory stays manageable.
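The batch-wise streaming idea from the last feature can be sketched with a generator that never holds more than one batch at a time. This is an assumed shape, not the actual pipeline: `batched`, `stream_encode`, and `toy_encode` are hypothetical names, and `toy_encode` merely stands in for fingerprinting plus compression.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def stream_encode(smiles: Iterable[str], batch_size: int,
                  encode: Callable[[str], int]) -> Iterator[List[int]]:
    """Encode one batch at a time; only the current batch is held in memory."""
    for batch in batched(smiles, batch_size):
        yield [encode(s) for s in batch]


def toy_encode(smiles: str) -> int:
    # Hypothetical stand-in for fingerprinting plus compression.
    return len(smiles)


# A generator plays the role of a file or database cursor too large to load.
molecules = (s for s in ["CCO", "c1ccccc1", "CC(=O)O", "C", "CCN"])
encoded_batches = list(stream_encode(molecules, batch_size=2, encode=toy_encode))
print(encoded_batches)  # [[3, 8], [7, 1], [3]]
```

Peak memory depends on the batch size rather than on the size of the full dataset, so the same loop works for five molecules or for billions.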

Challenges

  • Fitting 9.6 billion molecules into memory on a single machine (solved with streaming and compression).
  • Keeping clusters chemically meaningful after aggressive compression.
  • Making the interactive maps responsive at this scale.
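A quick back-of-the-envelope calculation shows why compression matters for the memory challenge. Only the molecule count comes from the text. The per-molecule sizes below are assumptions chosen for illustration (a 2048-bit binary fingerprint and a 6-byte compressed code), not the values Chelombus uses.

```python
N_MOLECULES = 9_600_000_000  # from the text

# Hypothetical per-molecule sizes, for illustration only.
RAW_FINGERPRINT_BYTES = 256  # e.g. a 2048-bit binary fingerprint
COMPRESSED_CODE_BYTES = 6    # e.g. six one-byte sub-codes


def total_gigabytes(n_items: int, bytes_per_item: int) -> float:
    """Total storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_items * bytes_per_item / 1e9


raw_gb = total_gigabytes(N_MOLECULES, RAW_FINGERPRINT_BYTES)
compressed_gb = total_gigabytes(N_MOLECULES, COMPRESSED_CODE_BYTES)

print(f"raw fingerprints: {raw_gb:,.1f} GB")         # 2,457.6 GB
print(f"compressed codes: {compressed_gb:,.1f} GB")  # 57.6 GB
```

Under these assumed sizes, the raw fingerprints would need about 2.5 TB, while the compressed codes come to tens of gigabytes, which is within reach of a single workstation.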

Learnings

  • Streaming and compression can push single-machine workflows surprisingly far.
  • Good UX matters just as much as the algorithm. Nobody uses a tool they can't navigate.
  • Nested maps let you keep the big picture without losing detail.
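The nested-map idea boils down to a two-level lookup: a small global index of cluster summaries, plus per-cluster detail loaded only when requested. The sketch below uses in-memory dictionaries with made-up data as a stand-in for real storage. All names here (`ClusterSummary`, `GLOBAL_MAP`, `open_cluster`, `open_molecule`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClusterSummary:
    cluster_id: int
    representative: str  # SMILES of the representative molecule
    size: int            # number of molecules in the cluster


# Level 1: the global overview holds one summary per cluster (toy data).
GLOBAL_MAP: Dict[int, ClusterSummary] = {
    0: ClusterSummary(0, "CCO", 3),
    1: ClusterSummary(1, "c1ccccc1", 2),
}

# Level 2: full membership, fetched only for the cluster being viewed.
CLUSTER_MEMBERS: Dict[int, List[str]] = {
    0: ["CCO", "CCCO", "CCN"],
    1: ["c1ccccc1", "Cc1ccccc1"],
}


def open_cluster(cluster_id: int) -> List[str]:
    """First click: go from the global overview into one cluster's map."""
    if cluster_id not in GLOBAL_MAP:
        raise KeyError(f"unknown cluster {cluster_id}")
    return CLUSTER_MEMBERS[cluster_id]


def open_molecule(cluster_id: int, index: int) -> str:
    """Second click: select a single molecule inside the cluster."""
    return open_cluster(cluster_id)[index]


print(GLOBAL_MAP[1].representative)  # c1ccccc1
print(open_molecule(1, 1))           # Cc1ccccc1
```

The overview only ever touches the summaries, so its cost grows with the number of clusters rather than the number of molecules.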