-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support destructive splitting #10
Comments
I think I understood what you meant, but could you please provide an example to confirm? There is a plan to provide an advanced API other than This means, the default API can be opinionated, but the new API can defer such decisions to a caller. The exact API is not finalized. Still collecting requirements for that API. |
Yes, sure. Here's what I currently get with a file based on the README example in sentencex-js and using the content from this comment as a test: import segment from 'sentencex'
const text = `This should? Produce a violation.
So should! This example.
Abbreviations (e.g. like) these (i.e. should) not.
"Sentences in." Quotes should, too.
Pausing... for... thought... should not?
This rule does weird things if a sentence is
already wrapped. It should maybe unwrap in
cases like this?`;
console.log(segment("en", text)) I saved this as [
'This should?',
'Produce a violation.',
'\n\n',
'So should!',
'This example.',
'\n\n',
'Abbreviations (e.g. like) these (i.e. should) not.',
'\n\n',
'"Sentences in." Quotes should, too.',
'\n\n',
'Pausing...',
'for...',
'thought...',
'should not?',
'\n\n',
'This rule does weird things if a sentence is\nalready wrapped.',
'It should maybe unwrap in\ncases like this?'
] I would have expected the whitespace-only entries to not be present: [
'This should?',
'Produce a violation.',
'So should!',
'This example.',
'Abbreviations (e.g. like) these (i.e. should) not.',
'"Sentences in." Quotes should, too.',
'Pausing...',
'for...',
'thought...',
'should not?',
'This rule does weird things if a sentence is\nalready wrapped.',
'It should maybe unwrap in\ncases like this?'
] ...but I understand that this is by design, hence the suggestion of an option. That said, in further tests I noticed that, contrary to my assumption above, whitespace around sentences is not preserved (internal whitespace is preserved). Shouldn't that also be kept for full "reconstructability" of the original text? p.s. - I realize I'm using sentencex-js above even though I am commenting in the Python-based sentencex repo, but I assumed the two projects to produce identical results, and this one to be the canonical one for decisions about the API. |
It's understood that the library performs non-destructive splitting by default, but would it be possible to add an option to allow "destructive" splitting? In other words, trimming of whitespace around each sentence, and removal of whitespace-only splitted parts.
The text was updated successfully, but these errors were encountered: