TMAP 2.0
A from-scratch rewrite of the TMAP visualization library: 3x faster than the original, readable Python, and support for Jaccard, cosine, and Euclidean metrics.
TMAP is a library that takes high-dimensional data and lays it out as an interactive tree you can explore in the browser. The original version was written in C++ with Python bindings; it was hard to read, hard to extend, and supported only one similarity metric (Jaccard). I rewrote it from scratch in Python, with Numba for the hot paths and a small C++ core for the layout engine. The result is 3x faster than the original, adds cosine and Euclidean metrics (so it works with embeddings, images, and text, not just molecular fingerprints), has an sklearn-style API, and comes with 411 tests, 7 tutorials, and interactive visualization out of the box.
Tech Stack
- Python, with Numba JIT for the hot paths.
- C++ for the small layout engine core.
Features
- 3x faster than the original despite being mostly Python (Numba JIT for the heavy parts).
- Works with many kinds of data: molecular fingerprints (Jaccard), embeddings (cosine), tabular data (Euclidean).
- Sklearn-style API: just call fit_transform() and get coordinates.
- Interactive visualization: pan, zoom, hover, color by any property, export to HTML.
- Tree exploration: find paths between points, measure distances, extract subtrees.
- 411 tests and 7 Jupyter tutorials.
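To make the tree exploration feature concrete, here is a minimal sketch of finding the path between two points in a tree and measuring its length. It is not TMAP's own code: the function names `tree_path` and `path_length` and the `(u, v, weight)` edge format are assumptions made for illustration.

```python
from collections import deque


def tree_path(edges, start, goal):
    """Return the unique path between two nodes of an undirected tree.

    `edges` is a list of (u, v, weight) tuples (hypothetical format).
    """
    adjacency = {}
    for u, v, _weight in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    # Breadth-first search, remembering how each node was reached.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)

    if goal not in parent:
        raise ValueError(f"no path from {start!r} to {goal!r}")

    # Walk back from the goal to the start, then reverse.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path


def path_length(edges, path):
    """Sum the edge weights along a path returned by tree_path."""
    weights = {frozenset((u, v)): w for u, v, w in edges}
    return sum(weights[frozenset(pair)] for pair in zip(path, path[1:]))


edges = [(0, 1, 0.2), (1, 2, 0.5), (1, 3, 0.1), (3, 4, 0.4)]
path = tree_path(edges, 2, 4)
print(path)  # [2, 1, 3, 4]
print(path_length(edges, path))
```

Because a tree has exactly one path between any two nodes, a plain breadth-first search is enough here; no shortest-path algorithm is needed.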
Challenges
- Rewriting a C++ library in Python while making it faster, not slower.
- Supporting multiple distance metrics without complicating the user-facing API.
- Building a visualization layer that behaves the same in a Jupyter notebook, an HTML file, and a static plot.
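One common way to support several metrics behind a single entry point is a small registry keyed by metric name. The sketch below illustrates that pattern in pure Python (3.8+ for `math.dist`); the names `METRICS` and `distance` are hypothetical and are not taken from TMAP's API.

```python
import math


def jaccard(a, b):
    """Jaccard distance between two collections of active feature indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def cosine(a, b):
    """Cosine distance between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.dist(a, b)


# Hypothetical registry: adding a metric means adding one entry here.
METRICS = {"jaccard": jaccard, "cosine": cosine, "euclidean": euclidean}


def distance(a, b, metric="jaccard"):
    """Single user-facing entry point; the metric is chosen by name."""
    try:
        fn = METRICS[metric]
    except KeyError:
        raise ValueError(
            f"unknown metric {metric!r}; expected one of {sorted(METRICS)}"
        ) from None
    return fn(a, b)


print(distance({1, 2, 3}, {2, 3, 4}))                 # 0.5
print(distance([0, 0], [3, 4], metric="euclidean"))   # 5.0
```

The caller only ever sees one function and one string parameter, so new metrics do not widen the public API.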
Learnings
- Numba lets you write readable Python that runs at close to C speed for numerical code.
- Auto-selecting default parameters based on dataset size means most users don't have to tune anything.
- A big test suite is what makes bold refactors possible.
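As an illustration of the Numba point, here is a minimal sketch of a JIT-compiled Jaccard distance written as a plain loop. It is an example of the general technique, not code from TMAP; the function name `jaccard_distance` is an assumption, and the `try`/`except` fallback only exists so the sketch still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without Numba (interpreted, much slower).
    def njit(func):
        return func


@njit
def jaccard_distance(a, b):
    """Jaccard distance between two binary fingerprints (1-D 0/1 arrays)."""
    intersection = 0
    union = 0
    for i in range(a.shape[0]):
        if a[i] != 0 or b[i] != 0:
            union += 1
            if a[i] != 0 and b[i] != 0:
                intersection += 1
    if union == 0:
        return 0.0  # two all-zero fingerprints are treated as identical
    return 1.0 - intersection / union


a = np.array([1, 1, 0, 0], dtype=np.uint8)
b = np.array([1, 0, 1, 0], dtype=np.uint8)
print(jaccard_distance(a, b))  # one shared bit out of three set bits
```

The explicit loop would be slow in interpreted Python, but it is the style Numba compiles well, and it stays readable compared with an equivalent C++ binding.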