Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NormalizedURL#to_s returns punycode, other instance methods does not #89

Closed
roback opened this issue Sep 6, 2016 · 13 comments · Fixed by #90
Closed

NormalizedURL#to_s returns punycode, other instance methods does not #89

roback opened this issue Sep 6, 2016 · 13 comments · Fixed by #90
Labels

Comments

@roback
Copy link
Member

roback commented Sep 6, 2016

Calling #.to_s or #host on a normalized URL returns a punycoded string.

Calling #domain, #tld, #sld etc. on the normalized url returns the non-punycoded version of the URL. What those methods have in common is that they all uses public_suffix.

[3] pry(main)> url = Twingly::URL.parse("http://test.भारत")
=> #<Twingly::URL:0x3fe40956dc7c http://test.भारत>
[4] pry(main)> normalized_url = url.normalized
=> #<Twingly::URL:0x3fe40acdbfa8 http://www.test.xn--h2brj9c/>
[5] pry(main)> url.to_s
=> "http://test.भारत"
[6] pry(main)> normalized_url.to_s
=> "http://www.test.xn--h2brj9c/"
[7] pry(main)> normalized_url.tld
=> "भारत"
[8] pry(main)> normalized_url.domain
=> "test.भारत"
@roback roback added the bug label Sep 6, 2016
@roback
Copy link
Member Author

roback commented Sep 6, 2016

I think we need to do #68 before this can be solved, so we can check whether the strings that public_suffix_domain returns (in these methods) should be converted to ascii.

@dentarg
Copy link
Collaborator

dentarg commented Sep 6, 2016

Hmm, why?

@roback
Copy link
Member Author

roback commented Sep 7, 2016

Hmm, why?

    def trd
      url_trd = public_suffix_domain.trd.to_s
      url_trd = convert_to_ascii(url_trd) if normalized?
      url_trd
    end

@dentarg
Copy link
Collaborator

dentarg commented Sep 7, 2016

I see, but that's not how we are going to do it :) I think we should parse/convert only one time (when #normalized is called), save the values for later, and return them when they are asked for.

    def trd
      @normalized_trd || @trd
    end

I think we should expand #71 to include not only host, but all parts that can exist in the two formats

@roback
Copy link
Member Author

roback commented Sep 7, 2016

I see, but that's not have we are going to do it :) I think we should parse/convert only one time (when #normalized is called), save the values for later, and return them when they are asked for.

Ah, yes, thats a much better idea. Just ignore my comments then 😄

@dentarg
Copy link
Collaborator

dentarg commented Sep 7, 2016

Sorry, talked a little bit with @walro, he thinks we should make the #normalized method work, not keeping both states of both non-normalized and normalized

    def normalized
      normalized_url = addressable_uri.dup

      normalized_url.scheme = normalized_scheme
      normalized_url.host   = normalized_host
      normalized_url.path   = normalized_path

      self.class.parse(normalized_url)
    end

Above we have only thought of addressable, I think addressable when it comes to TLDs doesn't do what we expect it to do... you can experiment with this if you want, or we can take a look together later. Gotta go now :)

@roback
Copy link
Member Author

roback commented Sep 7, 2016

Just realized that this doesn't just have to do with normalized URLs. How do we want to do in this case:

[1] pry(main)> u = Twingly::URL.parse("http://teståäö.xn--3e0b707e")
=> #<Twingly::URL:0x3fdd006c6b10 http://teståäö.xn--3e0b707e>
[2] pry(main)> u.to_s
=> "http://teståäö.xn--3e0b707e"
[3] pry(main)> u.tld
=> "한국"

I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).

@roback
Copy link
Member Author

roback commented Sep 8, 2016

I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).

How will we know what the original tld looks like if we can only extract the tld by using public_suffix, which only works with non-punycoded domains (because the suffix list doesn't contain punycoded domains)?

I think I give up on this for today 😄

@dentarg
Copy link
Collaborator

dentarg commented Sep 8, 2016

Maybe we need to create the reverse public suffix list

On 8 sep. 2016, at 16:36, Mattias Roback [email protected] wrote:

I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).

How will we know what the original tld looks like if we can only extract the tld by using public_suffix, which only works with non-punycoded domains (because the suffix list doesn't contain punycoded domains)?

I think I give up on this for today 😄


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

@dentarg
Copy link
Collaborator

dentarg commented Sep 8, 2016

On the plane ride home I played around with twingly-url and public suffix 2, and I have the reverse list now :)

@roback
Copy link
Member Author

roback commented Sep 9, 2016

Maybe we need to create the reverse public suffix list

public_suffix has an add method that you can use to insert new rules into the list. Maby that can be used somehow. I'll have to try and see if I can make something work :)

@dentarg
Copy link
Collaborator

dentarg commented Sep 9, 2016

Yes, that's what I used

dentarg added a commit that referenced this issue Sep 9, 2016
Related to #89 and #71.

Close #85.
@dentarg
Copy link
Collaborator

dentarg commented Sep 9, 2016

I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).

This is the case with the changes made in #90 – we no longer pass Addressable #display_uri.host (which always converted punycoded names (xn--`)) to public suffix.

If we merge #90, we can close this issue by adding more comphrensive tests for our normalize methods, e.g. normalizing some different types of URLs and expecting correct output for each part and the whole URL.

dentarg added a commit that referenced this issue Sep 14, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants