
Incremental DBSCAN clustering

A Python implementation

Clustering illustration

I authored incdbscan, the open-source Python package implementing IncrementalDBSCAN clustering, after running into a problem at work. At LogMeIn, we wanted to cluster chat messages and wondered whether the clustering could be updated easily as new messages were added and old ones removed. DBSCAN had been working well for our clustering needs, so we started looking for an incremental version. Although the authors of DBSCAN described an algorithm called IncrementalDBSCAN in a 1998 paper, no public implementation seemed to exist.
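To make the motivation concrete, here is a minimal pure-Python sketch of the naive baseline: plain batch DBSCAN that is simply re-run from scratch whenever a point is added. This is not the incdbscan API and not the IncrementalDBSCAN algorithm. The function name `dbscan`, the toy points, and the parameter values are all illustrative assumptions. It only shows that a single inserted point can merge two clusters, which is the kind of change an incremental version aims to handle without re-clustering everything.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Plain batch DBSCAN (illustrative sketch); returns one label per point."""
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # j is core, keep expanding
        cluster_id += 1
    return labels


# Two groups of points on a line, separated by a gap at x = 3.
before = [(0, 0), (1, 0), (2, 0), (4, 0), (5, 0), (6, 0)]
labels_before = dbscan(before, eps=1.0, min_pts=3)

# Adding one point in the gap connects the groups, but only a full
# re-run of the batch algorithm reveals that here.
after = before + [(3, 0)]
labels_after = dbscan(after, eps=1.0, min_pts=3)

print(labels_before)  # [0, 0, 0, 1, 1, 1]
print(labels_after)   # [0, 0, 0, 0, 0, 0, 0]
```

Removing the point at `(3, 0)` again would split the single cluster back into two. With this baseline, every such update costs a full pass over all points.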

incdbscan became my COVID project. Implementing the algorithm took longer and proved more challenging than I had expected. The paper had a few gaps that forced me to rethink parts of the algorithm, handle tricky edge cases, and test the implementation thoroughly with unit tests. It was also a great chance to dive into tools and processes like line profiling, publishing to PyPI, and setting up pipelines with GitHub Actions.

As of December 2024, incdbscan has grown in popularity, with about 1,500 monthly downloads according to PyPI Stats.

Download stats

You can find the code and usage examples here: https://github.com/DataOmbudsman/incdbscan

This work was also presented at the Budapest ML Forum and at a PyData Budapest meetup.

This post is licensed under CC BY 4.0 by the author.