-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Public Suffix API #676
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I will reach out to the PSL maintainers I have been in contact with to ask them to take a look. I'll also share this internally to get an overall opinion from Chrome.
960f99d
to
0f09b4a
Compare
638833b
to
3301526
Compare
proposals/public-suffix.md
Outdated
by default to be the last domain label of the domain name, or alternatively | ||
the domain name could be considered invalid. | ||
|
||
**Note:** it may be more performant to allow unknown suffixes and assume a single-label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too concerned about the performance aspect of this; any IPC to handle the API call is likely more expensive than a look up in the PSL.
The proposal does not describe why unknown suffixes should be supported. Could you elaborate when one would want and when one does not want public suffixes? And offer a recommendation on default behavior that is not dependent on (likely not relevant) performance considerations?
It's probably worth pointing out explicitly that there may be web pages displayed in the browser that are not on a registrable domain, e.g. when a local intranet has custom non-public host names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update this section regarding unknown suffixes as you suggest.
On the performance point: I am concerned about performance of this API, because if it turns out to be slower than extensions' own public suffix implementations, then they may not use this API. In particular, Use Cases 1 (Filter Requests by Organization) and 3 (Detect Third-Party Requests) in this proposal are performance-sensitive, because they obtain the registrable domain for every request (not just top-level pages) as the user browses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's faster for the extension to use an internal list, wouldn't that imply that we should really improve our implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extensions that use their own internal list benefit from not having to do a cross process call to get the public suffix, because everything stays within the extension itself. This could theoretically make the extensions' own hand-rolled implementations faster. The downside is, they are duplicating work already done in the parent browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut feeling says this doesn't matter but I have no data on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: mentioned intranet hostnames and changed treatment of unknown eTLDs
proposals/public-suffix.md
Outdated
| PSL Feature | Requirement | Discussion | | ||
|------------------------|-------------|------------| | ||
| Allow Private Suffixes | Yes | including all suffixes in PSL means more information about third-party boundaries | | ||
| Allow Unknown Suffixes | Yes | provides better performance | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are Unknown suffixes required in this case? Theoretically better performance does not automatically translate to a functional requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming there is no performance benefit from allowing unknown suffixes, I will revisit this requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: changed treatment of unknown eTLDs
proposals/public-suffix.md
Outdated
base?: string, | ||
// The Private-suffixed registrable domain. | ||
// Null if an error occurred, or if the domain has no matching Private suffix. | ||
private?: string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need all three options (domain, base, private)?
- (input)
domain
is redundant, because the extension can keep track of the parameters that it had sent to the API. - Requiring
base
andprivate
may require two lookups in the PSL, even if the extension does not need it. An extension interested in both can call the API twice withexcludePrivateSuffixes
set to true and false.
Given this, what do you think of reducing the number of options to just one registrableDomain
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling the API twice with excludePrivateSuffixes
set to true and false has the following potential issues:
Duplication of work
- Must parse/canonicalize the same input domain parameter on each API call.
- The PSL lookup involves removing each label in turn from the input domain and testing the remaining suffix until a match is found. Therefore on the second API call, the same unmatching candidate suffixes that have already been tested in the first API call will again be retested. The difference will be that the algorithm keeps going further in terms of removing labels with the
excludePrivateSuffixes=true
call.
Duplication of returned arrays
- If the registrable domains array obtained with
excludePrivateSuffixes=false
happens to only contain ICANN domains (because no matching private suffixes exist), then the second API call withexcludePrivateSuffixes=true
will return exactly the same registrable domains array again.
Note that my proposal does not require "two lookups". One possible implementation would simply continue removing labels after finding a private suffix until the ICANN suffix is found, i.e. not a full lookup, but a continuation of the current lookup. Even better, a more optimal implementation would have the corresponding ICANN suffix pre-calculated and stored with every private suffix, such that no further work would be needed to determine the ICANN suffix upon matching a private suffix.
I am happy to remove domain
in the result, because I agree this can be inferred from the array position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simon-friedberger @Rob--W I am in the process of updating the proposal to incorporate Simon's feedback about the somewhat arbitrary distinction between the Private and ICANN sections of the PSL. Would we prefer to avoid using the term "private" in this proposal's API given those issues?
In its current form, this API method getRegistrableDomains()
returns an object containing registrable domain values as follows:
Field --> | base |
private? |
---|---|---|
ICANN section / unknown | Private section | |
foo.bar.wixsite.com | wixsite.com | bar.wixsite.com |
xyz.example.com | example.com |
If the Private-section vs ICANN-section distinction is arbitrary, should we reconsider these fields? An alternative terminology could be:
Field --> | default |
short? |
---|---|---|
Private section / ICANN section / unknown | ICANN section | |
foo.bar.wixsite.com | bar.wixsite.com | wixsite.com |
xyz.example.com | example.com |
The short
registrable domain would only be populated if the default
registrable domain had an eTLD from the Private section (all Private section eTLDs are longer).
Similarly, we could rename the excludePrivateSuffixes
key of the options parameter for the non-batching API method getRegistrableDomain()
to simply short
.
Option --> | short=false |
short=true |
---|---|---|
Private section / ICANN section / unknown | ICANN section | |
foo.bar.wixsite.com | bar.wixsite.com | wixsite.com |
xyz.example.com | example.com |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should not offer the "short" version at all. I don't see a good argument in the use-cases for having this. Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: API no longer provides a way of getting the "short" registrable domain or distinguishes between ICANN vs Private-section eTLDs
proposals/public-suffix.md
Outdated
|
||
If no matching suffix is found in the PSL for a `domain` parameter, then unless it is determined | ||
to be specifically [invalid](#6-invalid-domain-parameter), it should be assumed the domain has a | ||
single-label suffix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the source for the required assumption of "single-label" suffix?
This proposal does not include an example where known vs unknown matters, but in Firefox specifically, there is at least one example where it matters (https://bugzilla.mozilla.org/show_bug.cgi?id=1621168): In determining whether to issue a search query or whether to try a navigation, a PSL lookup is made:
- If valid, attempt to navigate.
- If invalid, use search engine.
Unknown entries in the PSL should also trigger a search query, but unconditionally making it return a single label would rule out that use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to remove this assumption, and instead add an option to the API allowing users to opt-in to this behaviour of assuming a single-label suffix. I was guided in my analysis by the following:
- Google Chrome's Public Suffix implementation makes this assumption, I believe. See GetDomainAndRegistry: "If no matching rule is found in the effective-TLD data (or in the default data, if the resource failed to load), the last subcomponent of the host is assumed to be the registry."
- tltds also makes this assumption in its implementation. (tltds is a javascript library for obtaining Public Suffixes that is used by popular extensions such as bitwarden password manager and violentmonkey.) However, there is also an open tltds issue caused by this assumption.
- Use Cases 1 (Filter Requests by Organization) and 3 (Detect Third-Party Requests) in this proposal are performance-sensitive, because they obtain the registrable domain for every request (not just top-level pages) as the user browses. I wanted to reduce the chance of making users' browsers feel permanently slower after installing these sorts of extensions. This assumption of a single-label suffix by default may allow a faster implementation due to avoiding an explicit PSL lookup for all single-label TLDs.
- Use Case 2 (Group Domains in UI) in this proposal may benefit from this assumption, because without it all unknown-suffixed domains would be grouped together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I'm the maintainer of tldts
and wanted to clarify that the issue linked is about tldts-experimental
, a faster but less accurate implementation of tldts
. The main tldts
package (which most projects are using) is fully accurate, and behaves in a slightly different way by returning a single label in case no rule was found matching the last label (or anything more specific), but will also return two flags: isIcann
and isPrivate
, which allow to know where the matching rule came from (i.e. the ICANN or PRIVATE section of the public suffix list). If both of these flags are false
, then we can assume that the fallback behavior of using the last label as public suffix was triggered. For example:
$ npx tldts https://example.thissuffixdoesnotexist
{
"domain": "example.thissuffixdoesnotexist",
"domainWithoutSuffix": "example",
"hostname": "example.thissuffixdoesnotexist",
"isIcann": false,
"isIp": false,
"isPrivate": false,
"publicSuffix": "thissuffixdoesnotexist",
"subdomain": ""
}
I lack context on this spec but this might allow to combine the single-label suffix fallback behavior with the use-cases outlined by @Rob--W above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main use-cases for the PSL are isValidUrl(...)
which does not want this behavior (because you want to know if the TLD is known) and isSameSite(...)
for which it almost doesn't matter.
It only helps if you have made up TLDs. printer.homenet
and backup.homenet
would both have an empty eTLD and eTLD+1 would be "homenet". If you assume the last label must be a TLD then you end up with the correct behavior.
However, since most browsers currently do not update their PSL, the "made up TLD" situation can also arise when a new TLD gets introduced.
I think the correct solution here is to make the API isValidURL(...)
check the last label and isSameSite(...)
to assume that it is a valid TLD. But maybe there are use-cases that I'm not seeing which need a third option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: changed treatment of unknown eTLDs, added hasSameRegistrableDomain()
, mentioned Firefox search vs navigate use case
proposals/public-suffix.md
Outdated
or Punycode. | ||
|
||
When settling the promise returned by `getRegistrableDomain()`, the resulting | ||
domain name should be converted to Unicode from Punycode by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should the API convert the domain to Unicode by default? When the input is the host name from a URL, it would be reasonable to expect a valid hostname. I'd argue that punycode is the more sane default.
Extensions can bundle the custom library/logic to convert punycode to unicode if desired, because the algorithm is well known and fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change this to returning punycode by default, with an API option to convert to unicode. Unicode registrable domains are required by Use Case 2 (Group Domains in UI) in this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156
proposals/public-suffix.md
Outdated
|
||
#### 2. Multiple Suffixes per Domain | ||
|
||
If a domain name ends with a suffix listed in the Private section of the PSL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe stress that in almost all cases you want to use the full PSL. If you're trying to attribute abuse at some point you want to go from "a.foo.com" and "b.foo.com" and "c.foo.com" are all distributing malware to "foo.com" is distributing malware. That's the only reason I can think of, though and maybe that should instead be a generic "attribute to all higher labels".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the API in this proposal defaults to using the full PSL (i.e. including Private-section suffixes). Use Case 1 & 2 in this proposal note that there may be cases where it can be useful to exclude Private-section suffixes, therefore the option parameter excludePrivateSuffixes
is provided. I will see if I can incorporate your suggested use-case in the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: changed to always use the full PSL, noted the use case in the "Future Work" section
proposals/public-suffix.md
Outdated
|
||
##### Example | ||
|
||
| Domain name | Private suffix | ICANN suffix | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's called a public suffix so calling it a private suffix here might be a bit confusing. Private section suffix
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will update this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156
proposals/public-suffix.md
Outdated
"owner" of a suffix is a private organization offering a service for clients | ||
to take ownership of subdomains underneath its own ICANN-suffixed domain. | ||
|
||
#### 2. Multiple Suffixes per Domain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be multiple suffixes per domain but the distinction between private section and ICANN section is a bit arbitrary. There could be multiple entities acting as "registrars" in a domain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right. There is one distinction between the two sections: every Private-section suffix always has a shorter ICANN-section suffix at the end of it (according to my analysis of the PSL). So Private-section suffixes are always "multiple-suffix"-type domains. I will see if I can update this section to reflect your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: reworded the section to reflect the arbitrary ICANN vs Private distinction, and content-owners vs registrars
proposals/public-suffix.md
Outdated
|
||
#### 1. ICANN vs Private | ||
|
||
The PSL it is divided into two sections: ICANN suffixes and Private (i.e. non-ICANN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: delete "it"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156
proposals/public-suffix.md
Outdated
on the use case, when returning the *registrable domain / eTLD+1* for a domain name, | ||
either Unicode or Punycode may be preferred. | ||
|
||
**Note:** the [PSL algorithm](https://github.com/publicsuffix/list/wiki/Format#formal-algorithm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be a made up requirement. (That wiki page really needs some work.) The algorithm is pretty trivial and just needs to compare labels for equality so any encoding should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will note this in the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156
proposals/public-suffix.md
Outdated
* automatically calculate the set of organization names and *registrable domain / eTLD+1*s | ||
from a user-specified set of domain names, and propose these as filtering rules for | ||
the user to choose | ||
* prevent users from mistakenly specifying a *public suffix / eTLD* as a filtering rule, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would probably be a mistake. There are unresolved, lengthy, ancient discussions on this elsewhere but basically, at the moment, public suffixes sometimes have websites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted - I will reword this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156
proposals/public-suffix.md
Outdated
|
||
| PSL Feature | Requirement | Discussion | | ||
|------------------------|-------------|------------| | ||
| Allow Private Suffixes | Yes & No | some filters may require private suffixes, others ICANN-only | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PSL exists because filtering by organization requires the information on the PSL so only using the ICANN section seems just wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason why I dislike this is, that the distinction between an ICANN/IANA TLD and an eTLD is - in practice - poorly defined.
The "official" ICANN/IANA TLD is only the last label. There is an IANA list and another list for gTLDs.
However, the PSL is trying to gather information from the official registrars and include that. To maintain the PSL we try to get in touch with the registrars to get them to update their "official" subdomains or to ask questions but we often fail.
Look at the.au
section of the list, they cannot agree what their TLDs are but they are in the ICANN section. .bd
even has a wildcard *.bd
entry in the ICANN section. .co.uk
is in the ICANN section but the actual IANA TLD is still only .uk
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the proposal to note the issues you point out regarding the PSL sections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156
proposals/public-suffix.md
Outdated
|------------------------|-------------|------------| | ||
| Allow Private Suffixes | Yes & No | some filters may require private suffixes, others ICANN-only | | ||
| | | preventing user mistake may require both types of suffix | | ||
| Allow Unknown Suffixes | Yes | provides better performance | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaiu this would only make a difference when given a single label. But a single label is either not a valid URL or it is a TLD. If you're trying to look up isValidURL(...)
you don't want the optimization because it breaks the use-case. If you're looking up isSameSite(...)
then the optimization for a single label is irrelevant because you have to compare them anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PSL algorithm starts with a candidate domain name and removes labels in turn until a matching suffix is found. For any unknown-suffixed candiate domain (regardless of how many labels it has), the algorithm will always reach the final label without finding any matches in the PSL. At this point, if we are allowed to assume that all unknown single-label suffixes are valid, then the algorithm can:
- avoid doing the final check against the PSL using the final label. E.g.
foo.bar.baz
(check) -->bar.baz
(check) -->baz
(no check needed). This may save a few CPU cycles for every candidate domain lookup. - avoid storing any single-label suffixes. This allows a possible reduction in browser startup time since we are loading fewer PSL suffixes into memory from disk, and thereafter lower browser memory usage due to holding fewer PSL suffixes in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: noted the possible optimization but that it may be not significant
// { domain: "a..b", error: "Invalid domain name", }, | ||
// ] | ||
// | ||
export function getRegistrableDomains( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are callers gaining anything from calling this with an array instead of calling the other function multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with calling getRegistrableDomain
multiple times is there is overhead associated with an extension calling an async function on the parent browser. If you have a list of 50 domains, you would be making 50 async calls to the parent browser. However, with the getRegistrableDomains
batching approach, you would be making a single async call to the parent browser passing all 50 domains at once.
I hacked a quick mockup of the two approaches using a simplified implementation of the Public Suffix API in a modified Firefox, and the batched approach seems to be about 2-3 times faster for 50 domains.
In addition, getRegistrableDomains
returns both the ICANN section suffix and the Private section suffix for each input domain, whereas getRegistrableDomain
returns only one or the other (according to excludePrivateSuffixes
in the options
input parameter).
This function getRegistrableDomains
is for Use Case "2. Group Domains in UI" in this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: noted the performance justification for batching
|
||
### Schema | ||
|
||
A new API `publicSuffix` is added as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've dug into PSL use-cases a couple of time now. I think it might be better to make the API
isValidURL(...)
and isSameSite(...)
. That should be enough for use-cases (1.) and (3.) above. For use case (2.) you're restricted to comparison based sorting algorithms but that's probably okay in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@remusao, if you have feedback on this I would love to hear it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @Rob--W just pointed out in chat, there are some user-interface use-cases which would still need the getRegistrableDomain
function. For example, highlighting in the URL bar or giving a name to a group of same-site URLs.
I still think it would be good to have an explicit API for the two cases mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can certainly add isSameSite(domain1, domain2)
to the proposal. Another name for the function could be hasSameRegistrableDomain(domain1, domain2)
. It would be equivalent to:
await publicSuffix.getRegistrableDomain(domain1) === await publicSuffix.getRegistrableDomain(domain2)
Regarding isValidURL(string)
, another name for this function could be hasKnownPublicSuffix(domain)
. As you note in another comment, this may not always be reliable due to delays in updating the PSL:
most browsers currently do not update their PSL, the "made up TLD" situation can also arise when a new TLD gets introduced.
However, as @Rob--W mentions in another comment, Firefox relies on the hasKnownPublicSuffix(domain)
check when deciding how to handle user input in the navigation bar (i.e. search vs navigate). So the possible unreliability of this function may be acceptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your names are better. Mine are based on the use-case and that's probably good for thinking about it but they are wrong in terms of what they actually do.
We should probably align on what we want to call things across the document. I like "eTLD+1" and "eTLD" because they are short and consistent but I agree that "registrable domain" might be clearer. So my vote would be
eTLD = Public Suffix
Registrable Domain = eTLD+1
And if we find other names for it we should probably state the mapping clearly. I know Firefox source also uses BaseDomain and RootDomain for eTLD+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'll update the proposal to use these terms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: added hasSameRegistrableDomain()
, changed getRegistrableDomain()
to return null
for unknown eTLDs thereby providing an equivalent way of determining hasKnownPublicSuffix()
, adopted eTLD and Registrable Domain as standard terms.
#### 3. PSL Special Rules | ||
|
||
The lookup performed by `getRegistrableDomain()` should adhere to the | ||
[PSL algorithm](https://github.com/publicsuffix/list/wiki/Format#formal-algorithm). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support doing it this way but we will have to talk about it because that is not what browsers are currently doing.
That situation has been unresolved for a long time and people have been working around it or ignoring it. I suggest we simply defer to what the browsers are already doing here because that is what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will incorporate your comment in the proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: noted the situation re. browsers
| punycode = false (default) | example.مليسيا | | ||
| punycode = true | example.xn--mgbx4cd0ab | | ||
|
||
#### 6. Invalid domain parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important to keep in mind.isSameSite("100.200.30.2", "100.200.31.2")
should probably return false so even if the last label is assumed to be a TLD (and thus not checked) there should be an exception that checks for IPv4 addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted, I will include this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: included the example showing why IP addresses should be rejected
|
||
| Permission Added | Suggested Warning | | ||
| ---------------- | ----------------- | | ||
| publicSuffix | N/A | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine but it should explain why the browser is not more fingerprintable, for example, by arguing that the browser version is already available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the document as you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: noted the fingerprintability
proposals/public-suffix.md
Outdated
|
||
### Abuse Mitigations | ||
|
||
This does not expose any new non-public data so there are no new abuse vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated above, it does expose which version of the PSL is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will note this in the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: noted the fingerprintability
|
||
### Open Web API | ||
|
||
The purpose of this API is to eliminate the potential for inconsistency between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit at odds with the desire to retrieve version information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted, however the browser's PSL version may be of use to extensions.
proposals/public-suffix.md
Outdated
// | ||
// Gets the PSL dataset version if available | ||
// | ||
export function getVersion(): string?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simon-friedberger let me know that the PSL has been updated with a VERSION
metadata item like this: publicsuffix/list#1808 (comment)
Should we mention that here and agree on using that as the version we return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine with that.
I am also fine with leaving the version method off altogether if it were to be a blocker to approving the proposal.
After all, there are many existing APIs that are dependent on the result of the public suffix list without exposing versioning information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree that this shouldn't be a blocker. However, I hope it will be non-controversial. I expect most (likely all) browser vendors don't have easy access to this information today but it should be fairly trivial to get access to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c217156: included the VERSION metadat reference
proposals/public-suffix.md
Outdated
meets any of the following criteria: | ||
* Contains a character that is invalid in an Internationalized Domain Name (IDN) - e.g. symbols, whitespace | ||
* Is an IP address - IPv4 or IPv6 | ||
* Is a public suffix itself - including the case of it being a single-label suffix not explicitly matched in the PSL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to think about the callers a bit more to determine what they want here.
If we assume the caller is something like isKnownPublicSuffix(...)
then they would want a "true" for IPs and things that are already a public suffix but a false for empty labels or the root label. If we assume the caller is something like getHighlightParts(...)
then an IP should probably look like a registrable domain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the API in c217156: getRegistrableDomain()
is now a proxy for hasKnownPublicSuffix()
, because it returns null
in cases where a domain is valid but has an unknown eTLD (i.e. it no longer defaults to assuming a single-label eTLD). So if the promise returned by getRegistrableDomain()
fufills with a nonnull value, then it can be inferred that the domain has an eTLD in the PSL.
The API does not currently have a getHighlightParts()
method, but we could possibly add it in the Future Work section if we can identify an appropriate use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I was sufficiently clear here.
My point is there should be an API such that if I want to highlight the different URL parts in the address bar or find out what the current "site" is for applying some setting there needs to be an API that takes.
docs.google.com/...
-> google.com
- ICANN
foo.github.io
-> foo.github.io
- PRIVATE
10.0.0.1/login
-> 10.0.0.1
- even though this is an IP
avocado.banana
-> avocado.banana
- even though this is not a valid TLD, because it can be made to resolve locally
On the other hand, if the extension is trying to determine if something is a domain or a search term we probably want
docs.google.com/...
-> google.com
- ICANN
foo.github.io
-> foo.github.io
- PRIVATE
10.0.0.1/login
-> 10.0.0.1
- even though this is an IP
avocado.banana
-> ERROR/searchTerm
I think having different functions to call depending on what people want will probably be a nicer API but just returning a tuple and adding additional information about the result like type:RegistrableDomain|IPAddress|UnknownTLD
would also be an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I can see how that could be useful. Is it something we could address in a future version of this API?
ee17198
to
c217156
Compare
This formalizes #231 into a concrete proposal.