-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure normalized IDNA domains return ASCII strings #90
Conversation
We could also create a new list instead of changing the default one: CUSTOM_SUFFIX_LIST = PublicSuffix::List.
parse(File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false).
tap do |list|
list.indexes.keys.
map { |name| Addressable::IDNA.to_ascii(name) }.
select { |name| name =~ /xn\-\-/ }.
each { |name| list << PublicSuffix::Rule.factory(name) }
end
# ...
public_suffix_domain = PublicSuffix.parse(host, default_rule: nil, list: CUSTOM_SUFFIX_LIST) |
File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false) | ||
PublicSuffix::List.default.indexes.keys. | ||
map { |name| Addressable::IDNA.to_ascii(name) }. | ||
select { |name| name =~ /xn\-\-/ }. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add ^
, I think it's always in the start of the host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the same expressions as https://github.com/weppos/publicsuffix-ruby/blob/v2.0.2/test/psl_test.rb#L37-L38, but that expression might be constructed to handle more variants (https://github.com/weppos/publicsuffix-ruby/blob/v2.0.2/test/tests.txt)
we can check if we get the same count using ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think /\Axn\-\-/i
is the correct regex.
The ToASCII and ToUnicode operations MUST recognize the ACE prefix in a case-insensitive manner.
The ACE prefix for IDNA is "xn--" or any capitalization thereof.
This doesn't look right: [1] pry(main)> Twingly::URL.parse("https://www.foo.ایران.ir/bar").tld
=> "ایران.ir"
[2] pry(main)> Twingly::URL.parse("https://www.foo.ایران.ir/bar").normalized.tld
=> "ir" |
Made a few changes that fixes things: diff --git a/lib/twingly/url.rb b/lib/twingly/url.rb
index 7e7fef6..46ac6ca 100644
--- a/lib/twingly/url.rb
+++ b/lib/twingly/url.rb
@@ -8,10 +8,12 @@ require_relative "version"
PublicSuffix::List.default = PublicSuffix::List.parse(
File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false)
-PublicSuffix::List.default.indexes.keys.
- map { |name| Addressable::IDNA.to_ascii(name) }.
+PublicSuffix::List.default.each.
+ map { |rule| Addressable::IDNA.to_ascii(rule.value) }.
select { |name| name =~ /xn\-\-/ }.
- each { |name| PublicSuffix::List.default << PublicSuffix::Rule.factory(name) }
+ each { |name| PublicSuffix::List.default.add(PublicSuffix::Rule.factory(name), reindex: false) }
+
+PublicSuffix::List.default.reindex!
module Twingly
class URL Had to add Edit: I don't know why I added the |
But I like the suggestions above about our own list, looks nice to be able to pass |
Aha, I could have kept the |
I misread that, we still pass it when doing |
Oops, no, we need to use |
"bundle exec rspec spec/lib/twingly/url_spec.rb" went from files took 1.72 seconds to load to files took 0.27164 seconds to load
Good catch, I used your fix |
Hmm https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode
I wonder if we do it correctly here... |
My shell is a bit fancy, and with the old versions I got this: $ cat `bundle show public_suffix`/data/list.txt | wc -l preexec: command not found: bundle 11618
|
||
# Add ASCII form of all internationalized domain names to the public suffix list | ||
PublicSuffix::List.default. | ||
map { |rule| Addressable::IDNA.to_ascii(rule.value) }. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could try with both "pure" and "libidn" implementations just so we know we won't run into something exciting once we drop libidn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried manually with
diff --git a/lib/twingly/url.rb b/lib/twingly/url.rb
index 4193abd..9dd4581 100644
--- a/lib/twingly/url.rb
+++ b/lib/twingly/url.rb
@@ -1,5 +1,5 @@
require "addressable/uri"
-require "addressable/idna/native"
+# require "addressable/idna/native"
require "public_suffix"
require_relative "url/null_url"
@@ -29,7 +29,7 @@ module Twingly
ERRORS_TO_EXTEND = [
Addressable::URI::InvalidURIError,
PublicSuffix::DomainInvalid,
- IDN::Idna::IdnaError,
+ # IDN::Idna::IdnaError,
]
private_constant :ACCEPTED_SCHEMES, :ENDS_WITH_SLASH, :ERRORS_TO_EXTEND
diff --git a/twingly-url.gemspec b/twingly-url.gemspec
index f06d0d9..89f14cd 100644
--- a/twingly-url.gemspec
+++ b/twingly-url.gemspec
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
s.add_dependency "addressable", "~> 2"
s.add_dependency "public_suffix", "~> 2"
- s.add_dependency "idn-ruby", "~> 0.1"
+ # s.add_dependency "idn-ruby", "~> 0.1"
s.add_development_dependency "rake", "~> 10"
s.add_development_dependency "rspec", "~> 3"
~/twingly/ruby/twingly-url public_suffix-2*
$ bundle show | grep idn-ruby
and all specs pass, but I'm not sure it says much, as we just have a few IDNs in our tests cases. But I think it's okay. :)
Catch valid URLs being normalized to something that's parsed as NullURL.
Currently, this URL isn't valid, but if we drop libidn (which we talked about), it would be valid, due to the way Addressable 2.4.0 works. See comment at: #66 (comment) Adding this URL to the tests so we will notice the change in behaviour, if/when we drop libidn.
Continuing the hidden conversation at #90 (comment) here Interesting results, the counts are different [1] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /xn\-\-/ }.count
=> 814
[2] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /\Axn\-\-/i }.count
=> 812
[3] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /^xn\-\-/i }.count
=> 812 |
[6] pry(main)> a_814 - a_812
=> ["sande.xn--mre-og-romsdal-qqb.no", "sande.xn--mre-og-romsdal-qqb.no"] |
Hmm [7] pry(main)> (a_814 - a_812).uniq
=> ["sande.xn--mre-og-romsdal-qqb.no"]
[8] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /xn\-\-/ }.uniq.count
=> 407
[9] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /\Axn\-\-/i }.uniq.count
=> 406 |
Oh, my bad, I'm in |
Shouldn't we check each part? (#90 (comment)) > PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name.split(".").any? { |part| part =~ /\Axn\-\-/i } }.count
814 |
Yes |
It could be done with a regex, like |
list | ||
end | ||
|
||
private_class_method def self.punycoded_names(list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative could be:
private_class_method \
def self.punycoded_names(list)
# ..
end
Not sure which I prefer though (probably one with newline).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jage WDYT? Merge? |
I think we need to add a comment (in the PR) as to what changes here. Previously we had som consistency as to what Previous version: [1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fd876a72dd8 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fd8767d277c https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)>
[4] pry(main)> u1.to_s
=> "https://www.foo.ایران.ir/bar"
[5] pry(main)> u2.to_s
=> "https://www.foo.xn--mgba3a4f16a.ir/bar"
[6] pry(main)>
[7] pry(main)> u1.domain
=> "foo.ایران.ir"
[8] pry(main)> u2.domain
=> "foo.ایران.ir"
[9] pry(main)>
[10] pry(main)> u1.tld
=> "ایران.ir"
[11] pry(main)> u2.tld
=> "ایران.ir" This version: [1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fc621fd5454 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fc621fb855c https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)>
[4] pry(main)> u1.to_s
=> "https://www.foo.ایران.ir/bar"
[5] pry(main)> u2.to_s
=> "https://www.foo.xn--mgba3a4f16a.ir/bar"
[6] pry(main)>
[7] pry(main)> u1.domain
=> "foo.ایران.ir"
[8] pry(main)> u2.domain
=> "foo.xn--mgba3a4f16a.ir"
[9] pry(main)>
[10] pry(main)> u1.tld
=> "ایران.ir"
[11] pry(main)> u2.tld
=> "xn--mgba3a4f16a.ir" |
Isn't that what we have normalization for? |
Yes and no, normalization is useful to compare URLs (to avoid duplicates). But if I want to print the URL in my web UI, shouldn't we print the same string regardless of the input was in unicode or ascii? |
That just happened to be the case, it wasn't documented.
That is true, we can add tests for that. |
That doesn't change the fact. |
Yes, that makes sense, but fortunately, we don't have the need to do that right now, so I think we are okay. I don't think that should be the default behaviour though, I think it should be a function that the user explicitly uses. Much like #71, "get URL (or parts of) in Unicode". |
One could argue we are more consistent now, what the methods return are based on the input. E.g. Previous version: $ bundle console
[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3ff0792d88b4 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3ff079d181f8 https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)> u1.host
=> "www.foo.ایران.ir"
[4] pry(main)> u2.host
=> "www.foo.xn--mgba3a4f16a.ir" This version: $ bundle console
Resolving dependencies...
[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fe381504f48 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fe381186484 https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)> u1.host
=> "www.foo.ایران.ir"
[4] pry(main)> u2.host
=> "www.foo.xn--mgba3a4f16a.ir" As said, I can add tests for this. |
Just a note about the Nameprep algorithm, from https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode
|
Could have been part of the commit message in my latest commit (15414ed), but I can write it here.
|
This will print userinfo, user and password for all URLs, which it didn't do before.
* DRY up to urls example * Print the same example methods for all URLs in the README
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion to a new PR-title: "Ensure normalized IDNA domains return ASCII strings" since I think PublicSuffix is just a small technical detail to make it happen.
url.host # => "räksmörgås.макдональдс.рф" | ||
url.origin # => "http://xn--rksmrgs-5wao1o.xn--80aalb1aicli8a5i.xn--p1ai" | ||
url.path # => "/foo" | ||
url.without_scheme # => "//räksmörgås.макдональдс.рф/foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we print the normalized version as well (for all parts).
url.without_scheme
url.normalized.without_scheme
Since they return different strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I was thinking of showing normalization examples of these URLs as another section below, but maybe inline is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's the same URL we compare non-normalized and normalized, I think inline (alternating) is easier to read, my assumption is that the reader wants to understand the potential difference between a normalized and a non-normalized URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to not have to repeat all the parts for the normalized version, so I'm thinking a new section is easier, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, with the current structure it might become too many rows. We can add a new section now.
I can open an issue about improving the output, I'm thinking some of the behavior needs a table:
Method | Non-normalized | Normalized |
---|---|---|
.host |
räksmörgås.макдональдс.рф |
xn--rksmrgs-5wao1o.xn--80aalb1aicli8a5i.xn--p1ai |
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Table seems like a nice solution!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can add them as you suggested. I think it will make the most clear examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, GitHub failed to show me your comments, saw them now when I did a full page reload...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our specs could also benefit from the "table-way" of doing things.
LGTM!
Yes, agree that this is a major version bump. |
This PR ensure correct behaviour for normalized URLs (fixes #89), but it also changes the behaviour for non- normalized URL objects, see these snippets (original comment):
Previous version (4.2.0):
This version:
Original description
So, yes, this increases the load time quite a bit... :)Not true anymore (see 2bd6447) Haven't run the profiler yet, but we could do that.I'm open to suggestions about how we should deal with the extending of the default list, I just did something to get the discussion started.
Related to #71. Close #85 and #89.