Manuscript in preparation, 2026

Tree-shaped similarity maps for high-dimensional data.

I rebuilt TMAP from the ground up: Python and Numba instead of the old C++ monolith, a new API that's easier to live with, and a pluggable index that takes recall@20 from 49% to around 99%. So far I've used it on 2.7M AlphaFold structures, the full ChEMBL set, and image embeddings.

TMAP 2.0
Live TMAP: approved drugs (interactive demo)

What I rebuilt

The old TMAP was a C++ monolith. It worked, but it was hard to read, hard to extend, and the LSH neighbor search was showing its age. TMAP 2.0 is a clean Python + Numba codebase with a scikit-learn style API and a pluggable index layer. You can use USearch HNSW for cosine, Euclidean, or binary Jaccard, fall back to a Numba MinHash + LSH-Forest if you need to match the old behaviour, or feed in your own kNN graph from MMseqs2, Foldseek, or BLAST.
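
To make the "pluggable index" idea concrete, here is a minimal sketch of what such a layer can look like. This is not TMAP 2.0's actual API: `NeighborIndex`, `BruteForceCosine`, `PrecomputedGraph`, and `knn_edges` are names invented for this illustration, and the exact NumPy search simply stands in for an approximate backend such as HNSW.

```python
from __future__ import annotations

from typing import Protocol

import numpy as np


class NeighborIndex(Protocol):
    """Hypothetical interface: anything that can return k nearest neighbours."""

    def query(self, k: int) -> tuple[np.ndarray, np.ndarray]:
        """Return (indices, distances), each of shape (n_points, k)."""
        ...


class BruteForceCosine:
    """Exact cosine kNN in NumPy; a stand-in for an approximate backend."""

    def __init__(self, X: np.ndarray) -> None:
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        self._X = X / np.clip(norms, 1e-12, None)  # unit-length rows

    def query(self, k: int) -> tuple[np.ndarray, np.ndarray]:
        dist = 1.0 - self._X @ self._X.T  # cosine distance matrix
        np.fill_diagonal(dist, np.inf)    # exclude self-matches
        idx = np.argsort(dist, axis=1)[:, :k]
        return idx, np.take_along_axis(dist, idx, axis=1)


class PrecomputedGraph:
    """Wraps a kNN graph computed elsewhere (assumed already parsed to arrays)."""

    def __init__(self, indices, distances) -> None:
        self._idx = np.asarray(indices)
        self._dist = np.asarray(distances)

    def query(self, k: int) -> tuple[np.ndarray, np.ndarray]:
        return self._idx[:, :k], self._dist[:, :k]


def knn_edges(index: NeighborIndex, k: int) -> list[tuple[int, int, float]]:
    """Flatten any backend's output into (source, target, distance) edges."""
    idx, dist = index.query(k)
    edges = []
    for i in range(idx.shape[0]):
        for j, d in zip(idx[i], dist[i]):
            edges.append((i, int(j), float(d)))
    return edges
```

Because the downstream code only sees `query`, swapping the search backend or supplying an externally computed graph does not change anything after the neighbor step.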

What's actually new

Recall@20 went from 49% with the old LSH path to about 99% with USearch on a 1M-point benchmark at d=128. The Numba MinHash route is still there for parity and runs 2 to 3 times faster than the original C++. Memory use is lower across the board.
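
For reference, recall@k is the fraction of each point's true k nearest neighbors that the approximate index actually returns, averaged over all points. The sketch below shows one way to compute it with NumPy. It is not the benchmark code behind the numbers above; `exact_knn` and `recall_at_k` are illustrative names, and the brute-force ground truth assumes Euclidean distance and a dataset small enough for an n × n distance matrix.

```python
import numpy as np


def exact_knn(X: np.ndarray, k: int) -> np.ndarray:
    """Brute-force Euclidean kNN indices, shape (n, k), excluding self."""
    sq = np.sum(X**2, axis=1)
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    d = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]


def recall_at_k(approx: np.ndarray, exact: np.ndarray) -> float:
    """Mean fraction of true neighbours recovered by the approximate result."""
    n, k = exact.shape
    hits = 0
    for a_row, e_row in zip(approx.tolist(), exact.tolist()):
        hits += len(set(a_row) & set(e_row))
    return hits / (n * k)
```

With `exact = exact_knn(X, 20)` as ground truth and `approx` as the (n, 20) index array returned by whichever backend is under test, `recall_at_k(approx, exact)` gives the recall@20 figure.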

On the user-facing side, there are now filtering tools to select subsets of the data (you can try this on the demo at the top of the page; the controls are at the bottom). You can also insert new points into an existing map after the fact, which the old TMAP couldn't do. Jupyter integration is much cleaner as well.
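
To give a feel for what inserting points into an existing map involves, here is a deliberately simple sketch. It is an assumption about one possible approach, not a description of how TMAP 2.0 does it: each new point is attached as a leaf to its nearest existing point and placed next to it in the layout. The function name `attach_new_points` and its parameters are hypothetical.

```python
import numpy as np


def attach_new_points(X_old, coords_old, X_new, jitter=0.01, seed=0):
    """Attach each new point as a leaf of its nearest existing point.

    X_old:      (n_old, d) feature vectors already in the map
    coords_old: (n_old, 2) layout coordinates of those points
    X_new:      (n_new, d) feature vectors to insert
    Returns (parents, coords_new): the index of the existing point each
    new point hangs off, and the new points' layout coordinates.
    """
    X_old = np.asarray(X_old, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    coords_old = np.asarray(coords_old, dtype=float)
    rng = np.random.default_rng(seed)

    # Squared Euclidean distances, shape (n_new, n_old); brute force for clarity.
    d = ((X_new[:, None, :] - X_old[None, :, :]) ** 2).sum(axis=2)
    parents = d.argmin(axis=1)

    # Place each new point at its parent's position plus a small random
    # offset so the two do not overlap exactly.
    offsets = rng.normal(scale=jitter, size=(len(X_new), coords_old.shape[1]))
    return parents, coords_old[parents] + offsets
```

The existing layout is left untouched, which is the point of inserting after the fact: nothing that is already on the map has to move.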

Where it's been used

Most of my own testing happened on chemistry, but TMAP 2.0 is happy with anything you can put in a vector. I've used it on 2.7M AlphaFold-predicted structures (with a structure-aware viewer embedded in the map), on a single-cell Arabidopsis atlas, and on image-embedding collections. It's also being integrated into internal discovery pipelines at AbbVie and Roche.