TMAP 2.0

Rewrote the TMAP visualization library from scratch. 3x faster, readable Python, and works with any kind of data.

TMAP is a library that takes high-dimensional data and lays it out as an interactive tree you can explore in the browser. The original version was written in C++ with Python bindings; it was hard to read, hard to extend, and supported only one similarity metric (Jaccard). I rewrote it from scratch in Python, with Numba for the hot paths and a small C++ core for the layout engine. The result is 3x faster and supports cosine and Euclidean metrics, so it works with embeddings, images, and text, not just molecular fingerprints. It has an sklearn-style API and comes with 411 tests, 7 tutorials, and interactive visualization out of the box.

Tech Stack

Python, C++ (OGDF + pybind11), Numba, NumPy, SciPy, FAISS, datasketch, pytest

Features

  • 3x faster than the original despite being mostly Python (Numba JIT for the heavy parts).
  • Works with many kinds of data: molecular fingerprints (Jaccard), embeddings (cosine), tabular data (Euclidean).
  • Sklearn-style API: just call fit_transform() and get coordinates.
  • Interactive visualization: pan, zoom, hover, color by any property, export to HTML.
  • Tree exploration: find paths between points, measure distances, extract subtrees.
  • 411 tests and 7 Jupyter tutorials.
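
To make the tree-exploration feature concrete, here is a minimal sketch of finding the path between two points in a tree and measuring its length. It is not TMAP's actual implementation: the function name tree_path and the (u, v, weight) edge-list format are assumptions made for illustration only.

```python
from collections import deque


def tree_path(edges, source, target):
    """Return (path, total_weight) for the unique path between two tree nodes.

    `edges` is an iterable of (u, v, weight) tuples describing an undirected
    tree. The name and the input format are hypothetical, not TMAP's API.
    """
    # Build an adjacency list from the edge list.
    adjacency = {}
    for u, v, w in edges:
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    # Breadth-first search, remembering each node's parent and edge weight.
    parent = {source: (None, 0.0)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor, weight in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = (node, weight)
                queue.append(neighbor)

    if target not in parent:
        raise ValueError("target is not reachable from source")

    # Walk back from target to source, summing edge weights along the way.
    path, total = [], 0.0
    node = target
    while node is not None:
        path.append(node)
        node, weight = parent[node]
        total += weight
    path.reverse()
    return path, total


# Example: a small tree with five nodes.
example_edges = [(0, 1, 1.0), (1, 2, 2.0), (1, 3, 0.5), (3, 4, 1.5)]
example_path, example_length = tree_path(example_edges, 2, 4)
```

Because a tree has exactly one path between any two nodes, a plain breadth-first search is enough; no shortest-path algorithm is needed.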

Challenges

  • Rewriting a C++ library in Python while making it faster, not slower.
  • Supporting multiple distance metrics without complicating the user-facing API.
  • Building visualization that works the same in a Jupyter notebook, an HTML file, and a static plot.
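
One common way to support several metrics behind a single parameter is a lookup table that maps metric names to functions. The sketch below shows that pattern with NumPy. The function names and the metric keyword are assumptions for illustration and may not match how TMAP 2.0 is organized internally.

```python
import numpy as np


def _jaccard(a, b):
    # Jaccard distance on binary vectors: 1 - |A and B| / |A or B|.
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention: two empty sets are identical
    return float(1.0 - np.logical_and(a, b).sum() / union)


def _cosine(a, b):
    # Cosine distance: 1 - cos(angle between a and b).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0  # convention chosen for this sketch when a vector is all zeros
    return float(1.0 - np.dot(a, b) / denom)


def _euclidean(a, b):
    # Straight-line distance between the two points.
    return float(np.linalg.norm(a - b))


# Hypothetical registry: the user only ever passes a metric name.
_METRICS = {"jaccard": _jaccard, "cosine": _cosine, "euclidean": _euclidean}


def distance(a, b, metric="jaccard"):
    """Compute the distance between two vectors using the named metric."""
    try:
        fn = _METRICS[metric]
    except KeyError:
        raise ValueError(
            f"unknown metric {metric!r}; expected one of {sorted(_METRICS)}"
        ) from None
    return fn(np.asarray(a), np.asarray(b))
```

With this layout, adding a metric means adding one function and one dictionary entry, while the user-facing call stays the same.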

Learnings

  • Numba lets you write readable Python that runs at close to C speed for numerical code.
  • Auto-selecting parameters based on dataset size means users don't have to tune anything.
  • A big test suite is what makes bold refactors possible.
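
As an illustration of the Numba point, the sketch below compiles a plain nested-loop Jaccard distance with numba.njit. It is an example of the style, not code from TMAP itself; the function name is hypothetical, and the fallback decorator is there only so the sketch still runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba, just uncompiled.
    def njit(func):
        return func


@njit
def jaccard_distance_matrix(fingerprints):
    """Pairwise Jaccard distances between the rows of a binary 2D array."""
    n, d = fingerprints.shape
    out = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(i + 1, n):
            intersection = 0
            union = 0
            for k in range(d):
                a = fingerprints[i, k] != 0
                b = fingerprints[j, k] != 0
                if a and b:
                    intersection += 1
                if a or b:
                    union += 1
            dist = 0.0  # two all-zero rows are treated as identical
            if union > 0:
                dist = 1.0 - intersection / union
            # The matrix is symmetric, so fill both halves at once.
            out[i, j] = dist
            out[j, i] = dist
    return out


example_fingerprints = np.array(
    [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 0]],
    dtype=np.uint8,
)
example_distances = jaccard_distance_matrix(example_fingerprints)
```

The explicit loops would be slow in ordinary Python, but they are the kind of code Numba compiles well, and they read the same as the mathematical definition.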