
J. Chem. Inf. Model. 2026

Clustering 9.6 billion molecules on a single workstation.

Streaming Product Quantization + GPU PQk-means pipeline that clusters and visualises the 9.6B-molecule Enamine REAL set on one workstation. The end-to-end run takes ~4.5 hours, versus ~14 days for the reference C++ pipeline.

Chelombus
Primary TMAP of the 9.6B-molecule Enamine REAL dataset, organised by MQN similarity.
Primary TMAP: 92,464 cluster representatives over the Enamine REAL set.

What it does

Each molecule is encoded as a 42-dimensional MQN fingerprint, compressed to a 6-byte PQ code (~28× compression), then assigned to one of 100,000 clusters via GPU-accelerated PQk-means. A nested TMAP visualisation lets a user navigate from a primary view of cluster representatives down to individual molecules in two clicks.
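The encoding step can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that the 42 MQN dimensions are split into 6 subvectors of 7 dimensions, each quantized against 256 centroids so that every index fits in one byte, and that the raw fingerprint is stored at 4 bytes per value. Under those assumptions the ratio works out to 168 / 6 = 28. The codebooks here are random placeholders, where a real pipeline would learn them with k-means per subspace, and `pq_encode` is a hypothetical name, not the Chelombus API.

```python
import numpy as np

D = 42          # MQN dimensions (from the text)
M = 6           # assumption: 6 subspaces, one byte each -> 6-byte code
K = 256         # assumption: 256 centroids per subspace (fits in uint8)
SUB = D // M    # 7 dimensions per subvector

def pq_encode(x, codebooks):
    """Encode x of shape (n, D) into PQ codes of shape (n, M), dtype uint8.

    codebooks has shape (M, K, SUB): one set of K centroids per subspace.
    """
    n = x.shape[0]
    codes = np.empty((n, M), dtype=np.uint8)
    for m in range(M):
        sub = x[:, m * SUB:(m + 1) * SUB]                      # (n, SUB)
        # Squared Euclidean distance from each subvector to each centroid.
        d = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d.argmin(axis=1)                         # nearest centroid index
    return codes

rng = np.random.default_rng(0)
# Placeholder codebooks and fingerprints; real ones would come from training data.
codebooks = rng.random((M, K, SUB), dtype=np.float32)
mqn = rng.random((1000, D), dtype=np.float32)

codes = pq_encode(mqn, codebooks)
raw_bytes = D * 4                      # assumption: 4 bytes per MQN value
ratio = raw_bytes / codes.shape[1]     # bytes before / bytes after
```

Each molecule ends up as 6 small integers, which is what makes it feasible to hold billions of codes on disk and stream them through the clustering step.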

First-author paper accepted at J. Chem. Inf. Model. 2026 (DOI: 10.1021/acs.jcim.6c00420).

Why it's fast

PQk-means is reimplemented in Python with Numba, with custom Triton/CUDA kernels for the assignment step. The full 9.6B-molecule pipeline runs in ~4.5 hours on a single RTX 4070 Ti, compared to ~14 days on the reference C++ implementation. Because data is streamed in batches, peak VRAM is bounded by the user-configured batch size rather than the dataset size, so the same code can run on GPUs with less memory by lowering the batch size.
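The streaming assignment idea can be sketched on the CPU with NumPy as a stand-in for the GPU kernels. In this sketch, cluster centres are themselves PQ codes, distances between codes are looked up in precomputed per-subspace tables, and batches are processed one at a time so working memory scales with batch size times number of clusters. The function names, the small sizes, and the generator-based batching are all assumptions for illustration, not the project's actual implementation.

```python
import numpy as np

def build_distance_tables(codebooks):
    """Squared distances between all centroid pairs in each subspace.

    codebooks: (M, K, SUB) -> tables: (M, K, K).
    """
    diff = codebooks[:, :, None, :] - codebooks[:, None, :, :]
    return (diff ** 2).sum(axis=3)

def assign_batch(codes, centers, tables):
    """Assign each PQ code (n, M) to the nearest PQ-coded centre (C, M)."""
    n, M = codes.shape
    dist = np.zeros((n, centers.shape[0]), dtype=tables.dtype)
    for m in range(M):
        # Look up rows for each code's centroid, then columns for each centre's.
        dist += tables[m][codes[:, m]][:, centers[:, m]]
    return dist.argmin(axis=1)

def stream_assign(code_batches, centers, tables):
    """Yield labels batch by batch; only one batch is held in memory at a time."""
    for batch in code_batches:
        yield assign_batch(batch, centers, tables)

# Toy sizes chosen for the sketch (much smaller than a real run).
M, K, SUB, C = 6, 16, 7, 8
rng = np.random.default_rng(0)
codebooks = rng.random((M, K, SUB), dtype=np.float32)
tables = build_distance_tables(codebooks)

# Centre c uses centroid index c in every subspace, so all centres are distinct.
centers = np.stack([np.full(M, c, dtype=np.uint8) for c in range(C)])
codes = rng.integers(0, K, size=(100, M), dtype=np.uint8)

batch_size = 25  # hypothetical user-configured batch size
batches = (codes[i:i + batch_size] for i in range(0, len(codes), batch_size))
labels = np.concatenate(list(stream_assign(batches, centers, tables)))
```

Lowering `batch_size` shrinks the `(n, C)` distance buffer without changing the result, which is the property that lets memory use be tuned to the available hardware.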

Live platform

Results are accessible at chelombus.gdb.tools, a Next.js + Nginx site (Dockerized, systemd-supervised). It serves 180,000+ pre-generated TMAPs directly from disk with cache headers, and its UI is built for fast dataset and cluster navigation.