What it does
Each molecule is encoded as a 42-dimensional MQN fingerprint, compressed to a 6-byte PQ code (~28× compression), then assigned to one of 100,000 clusters via GPU-accelerated PQk-means. A nested TMAP visualisation lets a user navigate from a primary view of cluster representatives down to individual molecules in two clicks.
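The encoding step can be illustrated with a minimal product-quantization sketch. It assumes a layout the text does not state: 6 subspaces of 7 dimensions each, 256 centroids per subspace (one byte per sub-code), and fingerprints held as 32-bit floats. Under that assumption the ~28× figure is 168 bytes / 6 bytes. The codebooks below are random placeholders, not trained ones, and `pq_encode` is a hypothetical helper name.

```python
import numpy as np

# Assumed layout (not stated in the text): 6 subspaces x 7 dimensions,
# 256 centroids per subspace, so each sub-code fits in one uint8.
N_DIMS, N_SUB, N_CENTROIDS = 42, 6, 256
SUB_DIM = N_DIMS // N_SUB  # 7


def pq_encode(x, codebooks):
    """Encode (n, 42) vectors into (n, 6) uint8 PQ codes.

    codebooks has shape (N_SUB, N_CENTROIDS, SUB_DIM).
    """
    x = np.asarray(x, dtype=np.float32)
    codes = np.empty((x.shape[0], N_SUB), dtype=np.uint8)
    for m in range(N_SUB):
        sub = x[:, m * SUB_DIM:(m + 1) * SUB_DIM]  # (n, 7) slice of subspace m
        # Squared Euclidean distance to every centroid of subspace m.
        d = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d.argmin(axis=1)
    return codes


rng = np.random.default_rng(0)
# Placeholder codebooks; a real pipeline would train them (e.g. with k-means).
codebooks = rng.random((N_SUB, N_CENTROIDS, SUB_DIM), dtype=np.float32)
# Stand-in data in place of real MQN fingerprints.
fingerprints = rng.random((1000, N_DIMS), dtype=np.float32)

codes = pq_encode(fingerprints, codebooks)

raw_bytes = fingerprints[0].nbytes          # 42 float32 values
code_bytes = codes[0].nbytes                # 6 uint8 values
compression_ratio = raw_bytes / code_bytes
```

Each molecule then occupies 6 bytes, so later stages can work on the compact codes alone without revisiting the original vectors.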
First-author paper accepted at J. Chem. Inf. Model. 2026 (DOI: 10.1021/acs.jcim.6c00420).
Why it's fast
PQk-means is reimplemented in Python with Numba, with custom Triton/CUDA kernels for the assignment step. The full 9.6B-molecule pipeline runs in ~4.5 hours on a single RTX 4070 Ti, compared to ~14 days with the reference C++ implementation. Because batches are streamed, peak VRAM is bounded by the user-configured batch size rather than by dataset size, so the same code runs on any modern GPU.
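The streaming idea can be sketched on the CPU with NumPy. This is not the project's Triton/CUDA kernel, only an illustration of why memory scales with batch size. It assumes cluster centres are themselves PQ codes and that distances come from precomputed per-subspace lookup tables; the function names, the 6 × 256 × 7 codebook shape, and the batch size are all hypothetical.

```python
import numpy as np


def build_distance_tables(codebooks):
    """Precompute squared distances between all centroid pairs per subspace.

    codebooks: (M, K_sub, d_sub) -> tables: (M, K_sub, K_sub)
    """
    diff = codebooks[:, :, None, :] - codebooks[:, None, :, :]
    return (diff ** 2).sum(axis=3)


def assign_batch(codes, centers, tables):
    """Assign each PQ code to its nearest cluster centre (also a PQ code)."""
    dist = np.zeros((codes.shape[0], centers.shape[0]), dtype=np.float32)
    for m in range(codes.shape[1]):
        # One table lookup per subspace replaces arithmetic on full vectors.
        dist += tables[m][codes[:, m][:, None], centers[:, m][None, :]]
    return dist.argmin(axis=1)


def assign_streaming(code_batches, centers, tables):
    """Yield labels per batch; only one (batch, n_clusters) matrix exists at a time."""
    for batch in code_batches:
        yield assign_batch(batch, centers, tables)


rng = np.random.default_rng(0)
codebooks = rng.random((6, 256, 7), dtype=np.float32)  # placeholder codebooks
tables = build_distance_tables(codebooks)
centers = rng.integers(0, 256, size=(50, 6), dtype=np.uint8)    # toy cluster centres
codes = rng.integers(0, 256, size=(10_000, 6), dtype=np.uint8)  # toy dataset

BATCH = 2048  # illustrative; in practice chosen to fit available memory
batches = (codes[i:i + BATCH] for i in range(0, len(codes), BATCH))
labels = np.concatenate(list(assign_streaming(batches, centers, tables)))
```

In this sketch the largest temporary is the `(BATCH, n_clusters)` distance matrix, so peak memory depends on the batch size and the number of clusters, not on how many codes are streamed through.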
Live platform
Results are accessible at chelombus.gdb.tools, a Dockerized, systemd-supervised Next.js + Nginx site that serves 180,000+ pre-generated TMAPs directly from disk with cache headers. The UI is built for fast dataset and cluster navigation.
