List boundary discards one token in the context window #10

Open
jonnybluesman opened this issue Sep 9, 2021 · 4 comments


@jonnybluesman

enumerate(word_ids[max(i - boundary, 0):i + boundary]) if u != v]

I think i + boundary should be i + boundary + 1 to make the slice inclusive on the right; otherwise the right context contains one token fewer than the left context in the resulting skipgrams.
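To illustrate the off-by-one (a minimal sketch; the values of i and boundary are made up, not taken from the repo):

word_ids = [0, 1, 2, 3, 4, 5, 6]
i, boundary = 3, 2

# Slicing is exclusive on the right, so the right context only
# reaches i + boundary - 1: one token short of the left side.
word_ids[max(i - boundary, 0):i + boundary]      # [1, 2, 3, 4]

# Proposed fix: add 1 so both sides span `boundary` tokens.
word_ids[max(i - boundary, 0):i + boundary + 1]  # [1, 2, 3, 4, 5]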

@francesco-mollica

Why create a boundary like this:
boundary = np.random.randint(1, self.window_size)

and not simply use the window_size value instead of boundary?

@jonnybluesman
Author

> Why create a boundary like this: boundary = np.random.randint(1, self.window_size)
>
> and not simply use the window_size value instead of boundary?

Because with the random function you implicitly give "more importance" to the closest words in the neighbourhood, by generating more training pairs from those "close" tokens.
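A quick simulation of that effect (a sketch with made-up numbers, assuming boundary is resampled per center word as in the line quoted above):

import numpy as np

rng = np.random.default_rng(0)
window_size = 5
counts = np.zeros(window_size + 1)  # counts[d] = pairs generated at distance d

for _ in range(100_000):
    boundary = rng.integers(1, window_size)  # upper bound exclusive, like np.random.randint
    counts[1:boundary + 1] += 1              # each distance 1..boundary yields one pair

print(counts[1:] / counts[1])  # roughly [1.0, 0.75, 0.5, 0.25, 0.0]

So a fixed window weights every distance equally, while the sampled boundary produces linearly more pairs for nearer neighbours; this is the same dynamic-window trick used in the original word2vec C implementation.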

@francesco-mollica

francesco-mollica commented Nov 16, 2021

Do you mean explicitly? Then why not reduce the window size and fix it, instead of using a boundary? Is this use of a boundary found in other implementations?
Just to be clear: this implementation uses a random window size drawn from the range (1, window_size), and the boundary changes with each new sentence, correct?
Thanks for the quick response!

@francesco-mollica

francesco-mollica commented Nov 20, 2021

Can the concept of boundary be applied to CBOW-style training? I implemented it and I'm stuck: since the size of the context varies from phrase to phrase as the boundary changes, putting everything into a single tensor creates big problems for me!
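For what it's worth, a common workaround (a sketch, not from this repo; pad_contexts and all sizes are made up) is to pad every context to a fixed width and take a masked mean, so padding tokens contribute nothing:

import torch

PAD = 0  # assumes index 0 is reserved for padding

def pad_contexts(contexts, max_len):
    # Pad variable-length context lists to a fixed width.
    return torch.tensor([ctx + [PAD] * (max_len - len(ctx)) for ctx in contexts])

embedding = torch.nn.Embedding(num_embeddings=1000, embedding_dim=8, padding_idx=PAD)

contexts = [[5, 7], [3, 4, 9, 2]]                   # contexts of different sizes
batch = pad_contexts(contexts, max_len=4)           # shape: (2, 4)

emb = embedding(batch)                              # (2, 4, 8); PAD rows are zero
lengths = (batch != PAD).sum(dim=1, keepdim=True)   # real tokens per row
cbow_input = emb.sum(dim=1) / lengths               # masked mean over the context

Because padding_idx keeps the PAD embedding at zero, summing and dividing by the true length gives the usual CBOW average regardless of how the boundary varied per sentence.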
