---
title: "[tif] semhash for semantic deduplication"
date: 2025-01-14
tags: tif deduplication minish-lab
---

I often find myself in a position where I have to deduplicate.
And most of the time, it's a lot of data that needs to be deduplicated.
So I keep my eye out for libraries that might help with this task.
Today I saw one from [MinishLab](https://github.com/MinishLab)
called [semhash](https://github.com/MinishLab/semhash).

> SemHash is a lightweight and flexible tool for deduplicating datasets using semantic similarity. It combines fast embedding generation from [Model2Vec](https://github.com/MinishLab/model2vec) with efficient ANN-based similarity search through [Vicinity](https://github.com/MinishLab/vicinity).
> SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.

The two links in the quote above point to other projects from [MinishLab](https://github.com/MinishLab).
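
Based on the examples in the semhash README, getting started looks like just a few lines. Here's a lightly adapted sketch; the method names follow the README at the time of writing, and the toy lists are my own stand-ins, so treat it as an illustration rather than a verified snippet:

```python
from semhash import SemHash

# Toy stand-in data; in practice this would be a large text collection.
train = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",  # near-duplicate of the line above
    "Stock markets rallied on Tuesday.",
]
test = [
    "Markets rallied strongly on Tuesday.",  # overlaps semantically with train
    "It rained in Amsterdam all week.",
]

# Single-dataset deduplication: embed the records, then drop records that
# are semantically too similar to an earlier one.
semhash = SemHash.from_records(records=train)
train_deduplicated = semhash.self_deduplicate().deduplicated

# Multi-dataset deduplication: drop test records that overlap with the
# train set, e.g. to avoid train/test leakage.
test_deduplicated = semhash.deduplicate(records=test).deduplicated

print(train_deduplicated)
print(test_deduplicated)
```

The people behind these libraries describe themselves like this: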

> We're a two-person (@pringled and @stephantul) open-source company, with a focus on Natural Language Processing.
> We believe that if you make models fast enough, you unlock new possibilities.

Looks like the team is:

* [Stephan Tulkens](https://github.com/stephantul)
* [Thomas van Dongen](https://github.com/Pringled)

I love the library names, the retro gaming art in the GitHub repo, and the focus on unlocking new use cases by just making things really fast. Some of the things they suggest you can do with their software (a quick embedding sketch follows the list):

> * Ingest the entire English Wikipedia in 5 minutes
> * Classify tens of thousands of documents per second on CPU
> * Approximately deduplicate extremely large datasets in minutes
> * Build the fastest RAG application in the world
> * Easily evaluate which ANN algorithm works best for your data
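
Most of that speed seems to come from Model2Vec's static embeddings. As a rough sketch of that piece (assuming the `StaticModel` interface and the `minishlab/potion-base-8M` model name from the Model2Vec repo):

```python
from model2vec import StaticModel

# Load a small pre-distilled static embedding model. The model name is
# taken from the Model2Vec repo; swap in another MinishLab model if needed.
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Static models replace the transformer forward pass with an embedding
# lookup plus pooling, which is what makes high CPU throughput plausible.
docs = [
    "SemHash deduplicates datasets using semantic similarity.",
    "Model2Vec distills sentence transformers into tiny static models.",
]
embeddings = model.encode(docs)
print(embeddings.shape)  # (2, embedding_dim)
```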

Looking forward to giving some of these a try!

---

source [semhash](https://github.com/MinishLab/semhash)
via [philipp schmid](https://www.linkedin.com/in/philipp-schmid-a6a2bb196/)