
Add CountVectorizer #315

Merged: 5 commits merged into elixir-nx:main on Jan 17, 2025

Conversation

@ksew1 (Contributor) commented Jan 11, 2025

Hi, I added CountVectorizer, as it would be useful in a pipeline with the Naive Bayes algorithms we already implemented (for example, in spam detection).

@@ -0,0 +1,147 @@
defmodule Scholar.FeatureExtraction.CountVectorizer do
  @moduledoc """
    A `CountVectorizer` converts a collection of text documents to a matrix of token counts.
Contributor

Don't indent the docs by two spaces. They should be aligned with the closing """, as that's the one that controls indentation. :)

Contributor Author

Fixed 😄

Comment on lines 37 to 39
## Options
#{NimbleOptions.docs(@binarize_schema)}
## Examples
Contributor

Suggested change
-    ## Options
-    #{NimbleOptions.docs(@binarize_schema)}
-    ## Examples
+  ## Options
+  #{NimbleOptions.docs(@binarize_schema)}
+  ## Examples

Contributor Author

Fixed 😄

{tensor, vocabulary}
end

max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()
Contributor

Under jit mode, you cannot convert a tensor to a number. You can try adding this test:

fun = &Scholar.FeatureExtraction.CountVectorizer.fit_transform(&1, indexed_tensor: true)
Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))

The correct solution is to compute the default value for max_index inside defn.

Contributor Author

The problem is that it needs to be a number in order to create a tensor with this shape. Inside defn, I'm afraid it is not possible to dynamically obtain a number. We can make this a required option, so we do not need to use Nx.to_number().
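A minimal sketch of the constraint being described (illustrative only, not this PR's code): under jit the argument is a symbolic expression, so Nx.to_number/1 has no concrete value to return, while the shape tuple passed to Nx.broadcast/2 needs plain integers.

fun = fn tensor ->
  # Under Nx.Defn.jit/1, `tensor` is an expression, so this raises:
  # there is no concrete value yet for Nx.to_number/1 to extract.
  max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()

  # The shape tuple needs plain integers, which is why max_index has to be
  # known up front (hence the idea of a required option).
  Nx.broadcast(0, {Nx.axis_size(tensor, 0), max_index})
end

# Works eagerly, but raises when jitted:
Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))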

Contributor

I see. So we should not use deftransform, because this certainly cannot be invoked inside a defn, as we don't yet support dynamic shapes. For now, it should be a regular def and maybe it should be called something else.

I believe this topic came up in the past and we may have discussed a possible contract for keeping code that doesn't work inside defn, but I don't recall it right now. :) Maybe @msluszniak does?

Contributor

Usually, if the shape was based on computations, we either dropped the options that force it (like the percentage of variance preserved in PCA) or developed a heuristic that roughly assesses an upper bound for the problematic shape, and then ran the computations on a bigger tensor with some "padding".

Contributor

Or make an option that is required via NimbleOptions

Contributor

Maybe an option is the way to go then and we could even have a helper function like CountVectorizer.size(n) that they could use to compute it.

Contributor Author

I made this a required option and added a helper function 😄
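A rough sketch of what the resulting API could look like (the option and helper names below are illustrative, not necessarily the ones that were merged):

corpus = Nx.tensor([[0, 1, 2], [1, 3, 4]])

# The helper runs outside defn and returns a plain integer...
max_token_id = Scholar.FeatureExtraction.CountVectorizer.max_token_id(corpus)

# ...which is then passed as a required option, so the output shape is
# known before the jitted computation starts.
counts =
  Scholar.FeatureExtraction.CountVectorizer.fit_transform(corpus,
    max_token_id: max_token_id
  )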

@josevalim (Contributor) left a comment

The fit_transform function currently has an issue in that it accepts either tensors (which can be jitted) or a list of strings (which cannot). Generally speaking, our fit_transform functions can be jitted, so this would break with that tradition.

It seems that passing a list of strings amounts to implementing a rather simple tokenization algorithm? I think we should either move it elsewhere OR simply remove the functionality from here. My suggestion would be to remove it :)

@ksew1 (Contributor Author) commented Jan 11, 2025

The fit_transform function currently has an issue in that it accepts either tensors (which can be jitted) or a list of strings (which cannot). Generally speaking, our fit_transform functions can be jitted, so this would break with that tradition.

It seems that passing a list of strings amounts to implementing a rather simple tokenization algorithm? I think we should either move it elsewhere OR simply remove the functionality from here. My suggestion would be to remove it :)

Removed this feature; it now operates only on tensors.
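With the string handling removed, a caller tokenizes and indexes documents themselves and only passes the resulting tensor in, so the whole call stays jittable. A rough sketch (plain-Elixir preprocessing; names are illustrative):

docs = ["hello world", "hello there"]

# Build a word -> index vocabulary outside of Nx.
vocabulary =
  docs
  |> Enum.flat_map(&String.split/1)
  |> Enum.uniq()
  |> Enum.with_index()
  |> Map.new()

# Encode each document as a row of token indices (documents here happen to
# have the same length; real input would need padding).
indexed =
  docs
  |> Enum.map(fn doc -> doc |> String.split() |> Enum.map(&vocabulary[&1]) end)
  |> Nx.tensor()

# `indexed` is the kind of tensor CountVectorizer.fit_transform/2 now expects.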

@josevalim merged commit b815c59 into elixir-nx:main on Jan 17, 2025
2 checks passed
@josevalim (Contributor)

💚 💙 💜 💛 ❤️
