wikitext_split breaks up urls #35

Open
legoktm opened this issue Jul 6, 2015 · 4 comments

@legoktm

legoktm commented Jul 6, 2015

Hi!

I was trying to use the persistence code to identify when a specific URL was added to an article, but ran into a problem with the wikitext_split function, which breaks URLs into separate tokens:

>>> from mw.lib.persistence.tokenization import wikitext_split
>>> wikitext_split('Something blah blah http://foobar.com')
['Something', ' ', 'blah', ' ', 'blah', ' ', 'http', ':', '/', '/', 'foobar', '.', 'com']

It would be nice if URLs were special-cased and kept together as single tokens.

@halfak
Member

halfak commented Jul 6, 2015

It seems like this would be possible. We have some good options for a URL regex. See https://mathiasbynens.be/demo/url-regex
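
For illustration, here is a minimal sketch of how a URL pattern could be special-cased in a regex-based splitter. It is not the actual wikitext_split implementation: the lexicon, the simplified URL pattern, and the name `split_keep_urls` are all assumptions made up for this example. The idea is only that the URL alternative is tried before the word and punctuation alternatives, so a URL is consumed as one token.

```python
import re

# Hypothetical, simplified lexicon (not the one used by mw or deltas).
# Order matters: the URL pattern is tried before the generic patterns.
LEXICON = [
    ("url", r"(?:https?|ftp)://[^\s\[\]<>\"]+"),  # assumed, deliberately loose
    ("word", r"\w+"),
    ("whitespace", r"\s+"),
    ("other", r"."),  # any remaining single character
]

TOKEN_RE = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in LEXICON),
    re.DOTALL,
)


def split_keep_urls(text):
    """Split text into tokens, keeping each URL together as one token."""
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


print(split_keep_urls('Something blah blah http://foobar.com'))
# ['Something', ' ', 'blah', ' ', 'blah', ' ', 'http://foobar.com']
```

A stricter pattern from the page linked above could be swapped in for the `url` entry without changing the rest of the sketch.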

@halfak
Member

halfak commented Jul 7, 2015

@legoktm do you know if there is a MediaWiki URL regex we can use?

@legoktm
Author

legoktm commented Jul 7, 2015

Looking through Parser::replaceExternalLinks(), it appears to use:

> var_dump($wgParser->mExtLinkBracketedRegex);
string(342) "/\[(((?i)bitcoin\:|ftp\:\/\/|ftps\:\/\/|geo\:|git\:\/\/|gopher\:\/\/|http\:\/\/|https\:\/\/|irc\:\/\/|ircs\:\/\/|magnet\:|mailto\:|mms\:\/\/|news\:|nntp\:\/\/|redis\:\/\/|sftp\:\/\/|sip\:|sips\:|sms\:|ssh\:\/\/|svn\:\/\/|tel\:|telnet\:\/\/|urn\:|worldwind\:\/\/|xmpp\:|\/\/)[^][<>"\x00-\x20\x7F\p{Zs}]+)\p{Zs}*([^\]\x00-\x08\x0a-\x1F]*?)\]/Su"
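
A rough Python adaptation of the URL part of that pattern might look like the sketch below. The protocol list is copied from the dump above; everything else is an assumption. In particular, Python's built-in `re` module does not support `\p{Zs}`, so `\s` stands in for it here, and the names `PROTOCOLS` and `URL_RE` are made up for this example. It matches bare URLs only and ignores the bracket and label handling.

```python
import re

# Protocol prefixes taken from the mExtLinkBracketedRegex dump above.
PROTOCOLS = [
    "bitcoin:", "ftp://", "ftps://", "geo:", "git://", "gopher://",
    "http://", "https://", "irc://", "ircs://", "magnet:", "mailto:",
    "mms://", "news:", "nntp://", "redis://", "sftp://", "sip:", "sips:",
    "sms:", "ssh://", "svn://", "tel:", "telnet://", "urn:",
    "worldwind://", "xmpp:", "//",
]

# Assumption: \s approximates \p{Zs}, which the stdlib re module lacks.
URL_RE = re.compile(
    r"(?:%s)[^\[\]<>\"\x00-\x20\x7F\s]+"
    % "|".join(re.escape(prefix) for prefix in PROTOCOLS),
    re.IGNORECASE,  # mirrors the (?i) flag on the protocol group
)

print(URL_RE.findall('Something blah blah http://foobar.com'))
# ['http://foobar.com']
```

Without a word-boundary check, a prefix such as `news:` can also match in the middle of a longer word, so this is best read as a starting point rather than a drop-in pattern.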

@halfak
Member

halfak commented Jul 24, 2015

I've added the URL symbol to the wikitext split lexicon in deltas. See halfak/deltas@40d984d

I'll need to do a follow-up change here to pull in wikitext_split from deltas.
