
Add CountVectorizer #315

Merged: 5 commits merged into elixir-nx:main on Jan 17, 2025

Conversation

@ksew1 (Contributor) commented Jan 11, 2025

Hi, I added CountVectorizer, as it would be useful in a pipeline with the Naive Bayes algorithms we already implemented (for example, in spam detection).

@@ -0,0 +1,147 @@
defmodule Scholar.FeatureExtraction.CountVectorizer do
  @moduledoc """
    A `CountVectorizer` converts a collection of text documents to a matrix of token counts.
Contributor

Don't indent the docs by two spaces. They should be aligned with the closing """, as that's the one that controls indentation. :)

Contributor Author

Fixed 😄

Comment on lines 37 to 39
## Options
#{NimbleOptions.docs(@binarize_schema)}
## Examples
Contributor

Suggested change
-    ## Options
-    #{NimbleOptions.docs(@binarize_schema)}
-    ## Examples
+  ## Options
+  #{NimbleOptions.docs(@binarize_schema)}
+  ## Examples

Contributor Author

Fixed 😄

{tensor, vocabulary}
end

max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()
Contributor

Under jit mode, you cannot convert a tensor to a number. You can try adding this test:

fun = &Scholar.FeatureExtraction.CountVectorizer.fit_transform(&1, indexed_tensor: true)
Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))

The correct solution is to compute the default value for max_index inside defn.

Contributor Author

The problem is that it needs to be a number in order to create a tensor with this shape. Inside defn, I'm afraid it is not possible to dynamically obtain a number. We can make this a required option, so we do not need to use Nx.to_number().
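A minimal sketch of the constraint being described (illustrative only, not this PR's code): under jit the argument is a symbolic expression, so Nx.to_number/1 has no concrete value to return, while the shape tuple passed to Nx.broadcast/2 needs plain integers.

fun = fn tensor ->
  # Under Nx.Defn.jit/1, `tensor` is an expression, so this raises:
  # there is no concrete value yet for Nx.to_number/1 to extract.
  max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()

  # The shape tuple needs plain integers, which is why max_index has to be
  # known up front (hence the idea of a required option).
  Nx.broadcast(0, {Nx.axis_size(tensor, 0), max_index})
end

# Works eagerly, but raises when jitted:
Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))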

Contributor

I see. So we should not use deftransform, because this certainly cannot be invoked inside a defn, as we don't yet support dynamic shapes. For now, it should be a regular def and maybe it should be called something else.

I believe this topic came up in the past and we may have discussed a possible contract for keeping code that doesn't work inside defn, but I don't recall it right now. :) Maybe @msluszniak does?

Contributor

Usually, if the shape was based on computations, we either dropped the options that force it (like the percentage of variance preserved in PCA) or developed a heuristic that roughly assesses an upper bound for the problematic shape, and then ran the computations on a bigger tensor with some "padding".

Contributor

Or make an option that is required via NimbleOptions

Contributor

Maybe an option is the way to go then and we could even have a helper function like CountVectorizer.size(n) that they could use to compute it.

Contributor Author

I made this a required option and added a helper function 😄
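A rough sketch of what the resulting API could look like (the option and helper names below are illustrative, not necessarily the ones that were merged):

corpus = Nx.tensor([[0, 1, 2], [1, 3, 4]])

# The helper runs outside defn and returns a plain integer...
max_token_id = Scholar.FeatureExtraction.CountVectorizer.max_token_id(corpus)

# ...which is then passed as a required option, so the output shape is
# known before the jitted computation starts.
counts =
  Scholar.FeatureExtraction.CountVectorizer.fit_transform(corpus,
    max_token_id: max_token_id
  )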

@josevalim (Contributor) left a comment

The fit_transform function currently has an issue in that it accepts either tensors (which can be jitted) or a list of strings (which cannot). Generally speaking, our fit_transform functions can be jitted, so this would break with that tradition.

It seems that passing a list of strings amounts to implementing a rather simple tokenization algorithm? I think we should either move it elsewhere OR simply remove the functionality from here. My suggestion would be to remove it :)

@ksew1 (Contributor Author) commented Jan 11, 2025

The fit_transform function currently has an issue in that it accepts either tensors (which can be jitted) or a list of strings (which cannot). Generally speaking, our fit_transform functions can be jitted, so this would break with that tradition.

It seems that passing a list of strings amounts to implementing a rather simple tokenization algorithm? I think we should either move it elsewhere OR simply remove the functionality from here. My suggestion would be to remove it :)

Removed this feature; it now operates only on tensors.
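With the string handling removed, a caller tokenizes and indexes documents themselves and only passes the resulting tensor in, so the whole call stays jittable. A rough sketch (plain-Elixir preprocessing; names are illustrative):

docs = ["hello world", "hello there"]

# Build a word -> index vocabulary outside of Nx.
vocabulary =
  docs
  |> Enum.flat_map(&String.split/1)
  |> Enum.uniq()
  |> Enum.with_index()
  |> Map.new()

# Encode each document as a row of token indices (documents here happen to
# have the same length; real input would need padding).
indexed =
  docs
  |> Enum.map(fn doc -> doc |> String.split() |> Enum.map(&vocabulary[&1]) end)
  |> Nx.tensor()

# `indexed` is the kind of tensor CountVectorizer.fit_transform/2 now expects.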

@josevalim merged commit b815c59 into elixir-nx:main on Jan 17, 2025
2 checks passed
@josevalim (Contributor)

💚 💙 💜 💛 ❤️
