Newsela Split #15

Open
chrhad opened this issue Aug 3, 2018 · 1 comment
chrhad commented Aug 3, 2018

Hi,

I would like to clarify two points about the Newsela data setup:

  1. Am I right that the data originally released in 2015 (1,130 articles) was used in the paper? That is, the text file in “newsela_data_share-20150302” in the Newsela release.
  2. I followed the description in Section 5: the first 1,070 articles for training, the next 30 for development, and the next 30 for testing, then filtered out sentence pairs corresponding to alignment levels 0-1, 1-2, and 2-3. This gave me 94,944 training, 2,531 development, and 2,462 test sentence pairs, which differ from the numbers in the paper. How can I arrive at the 94,208 training, 1,129 development, and 1,076 test sentence pairs reported in the paper?
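
For reference, the procedure in item 2 can be sketched roughly as below. This is only a minimal illustration under assumptions, not the paper's actual preprocessing: the `Pair` record, the `split_and_filter` function, and the idea that each sentence pair carries a 0-based article index in release order are all hypothetical, and the real Newsela file layout may differ.

```python
from typing import Dict, Iterable, List, NamedTuple


class Pair(NamedTuple):
    """One aligned sentence pair (hypothetical record layout)."""
    article: int    # assumed: 0-based position of the article in release order
    src_level: int  # reading level of the source sentence (0 = original)
    tgt_level: int  # reading level of the target sentence
    src: str
    tgt: str


# Adjacent-level pairs that Section 5 is described as filtering out.
EXCLUDED_LEVELS = {(0, 1), (1, 2), (2, 3)}


def split_and_filter(
    pairs: Iterable[Pair],
    n_train: int = 1070,
    n_dev: int = 30,
    n_test: int = 30,
) -> Dict[str, List[Pair]]:
    """Assign pairs to train/dev/test by article index, dropping excluded levels."""
    splits: Dict[str, List[Pair]] = {"train": [], "dev": [], "test": []}
    for p in pairs:
        if (p.src_level, p.tgt_level) in EXCLUDED_LEVELS:
            continue  # drop 0-1, 1-2, and 2-3 alignments
        if p.article < n_train:
            splits["train"].append(p)
        elif p.article < n_train + n_dev:
            splits["dev"].append(p)
        elif p.article < n_train + n_dev + n_test:
            splits["test"].append(p)
        # pairs from articles beyond the first 1,130 are ignored
    return splits
```

Whether the filter runs before or after the split does not change the counts here, since both steps look at each pair independently; differences from the paper's numbers would have to come from something else, such as article ordering or additional filtering.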

Thank you.

Regards,
Christian

saraswat commented Aug 4, 2018

Where can I find the Newsela release?
