-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CountVectorizer #315
Add CountVectorizer #315
Conversation
@@ -0,0 +1,147 @@ | |||
defmodule Scholar.FeatureExtraction.CountVectorizer do | |||
@moduledoc """ | |||
A `CountVectorizer` converts a collection of text documents to a matrix of token counts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't indent the docs by two spaces. They should be aligned with the closing """
, as that's the one that controls indentation. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 😄
## Options | ||
#{NimbleOptions.docs(@binarize_schema)} | ||
## Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## Options | |
#{NimbleOptions.docs(@binarize_schema)} | |
## Examples | |
## Options | |
#{NimbleOptions.docs(@binarize_schema)} | |
## Examples | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 😄
{tensor, vocabulary} | ||
end | ||
|
||
max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under jit
mode, you cannot convert a tensor to a number. You can try adding this test:
fun = &Scholar.FeatureExtraction.CountVectorizer.fit_transform(&1, indexed_tensor: true)
Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]])
The correct solution is to compute the default value for max_index
inside defn
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that it needs to be a number in order to create a tensor with this shape. Inside defn
, I'm afraid it is not possible to dynamically obtain a number. We can make this a required option, so we do not need to use Nx.to_number()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So we should not use deftransform
, because this certainly cannot be invoked inside a defn
, as we don't yet support dynamic shapes. For now, it should be a regular def
and maybe it should be called something else.
I believe this topic appeared in the past and we maybe discussed a possible contract for keeping code that doesn't work inside defn
but I don't recall it right now. :) Maybe @msluszniak does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, if the shape was based on computations we either droped the options that force it (like percent of variation preserved in PCA) or develop some heuristics that roughly assesses the upper bound of the problematic shape, and then make computations on bigger tensor with some "padding".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or make an option that is required via NimbleOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an option is the way to go then and we could even have a helper function like CountVectorizer.size(n)
that they could use to compute it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this required option and added helper function 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fit_transform
function currently has an issue in that it accepts either tensors (which can be JITed) or a list of strings, which cannot. Generally speaking, our fit_transform
functions can be jitted, so this would be breaking the tradition.
It seems passing a list of strings is implementing rather a simple tokenization algorithm? I think we should either move it elsewhere OR simply remove the functionality from here. My suggestion would be to remove it :)
Removed this feature, now it operates only on tensor |
💚 💙 💜 💛 ❤️ |
Hi I added
CountVectorizer
as it would be useful in pipeline with Naive Bayes algorithms that we already implemented (for example in spam detection)