top whorl values #117

Open
wants to merge 2 commits into
base: master

arthuredelstein

No description provided.

@Hainish
Member

Hainish commented Jan 14, 2025

Hi Arthur, for some values that we expect to be long, we store a hash of the value in the totals table instead of the value itself. I had to account for this in da12994. I'll need to run some performance tests on the queries before merging, which I hope to be able to do tomorrow.
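A rough sketch of what accounting for hashed values can involve in a top-values query. This is not the code in da12994. The `hashed_values` lookup table, the `digest` helper, the choice of SHA-256, and the `fonts` and `timezone` sample rows are all assumptions made for illustration; only the `totals` columns `variable`, `value`, and `epoch_total` appear in this thread. An in-memory SQLite database stands in for MySQL.

```python
import hashlib
import sqlite3


def digest(value: str) -> str:
    # Assumption: SHA-256 stands in for whatever hash the project really uses.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def top_values(conn: sqlite3.Connection, variable: str, limit: int):
    # Resolve digests back to originals where a mapping exists; otherwise
    # return the stored value unchanged.
    return conn.execute(
        """
        SELECT COALESCE(h.original, t.value) AS value, t.epoch_total
        FROM totals AS t
        LEFT JOIN hashed_values AS h ON h.digest = t.value
        WHERE t.variable = ?
        ORDER BY t.epoch_total DESC
        LIMIT ?
        """,
        (variable, limit),
    ).fetchall()


def demo_connection() -> sqlite3.Connection:
    # Hypothetical schema and rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE totals (variable TEXT, value TEXT, epoch_total INTEGER)")
    conn.execute("CREATE TABLE hashed_values (digest TEXT PRIMARY KEY, original TEXT)")
    long_value = "Arial, Helvetica, Times New Roman"
    conn.execute(
        "INSERT INTO hashed_values VALUES (?, ?)", (digest(long_value), long_value)
    )
    conn.executemany(
        "INSERT INTO totals VALUES (?, ?, ?)",
        [
            ("fonts", digest(long_value), 7),  # long value, stored as a digest
            ("timezone", "-480", 12),          # short value, stored as-is
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(top_values(conn, "fonts", 25))
    print(top_values(conn, "timezone", 25))
```

The `LEFT JOIN` keeps unhashed variables working with the same query, at the cost of an extra lookup per row; whether that cost is acceptable on the real dataset is exactly what the performance tests would need to show.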

@arthuredelstein
Author

arthuredelstein commented Jan 14, 2025 via email

@Hainish
Member

Hainish commented Jan 14, 2025

Unfortunately, my benchmarks show very long query times for hashed values. I don't think making these queries via an API endpoint is a viable solution given the server load it will incur. Even the simple query on our database backup server takes a while:

time docker exec -it mysql mysql -u root -p'********'  panopticlick -e "SELECT value, epoch_total FROM totals WHERE variable='user_agent' ORDER BY epoch_total DESC LIMIT 25;"
# <----- SNIP ----->
real	6m4.099s
user	0m0.104s
sys	0m0.115s

This is due to the massive size of the CYT dataset.

One way forward is running these queries on a set of metrics daily and providing those results asynchronously. This will require some infrastructural work, but I'm already working on that so this shouldn't add too much complication.
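A minimal sketch of that idea, assuming a scheduled job (cron or similar) runs the slow query once a day and writes a JSON snapshot that an endpoint can serve without touching the database. The function names, the snapshot format, and the use of SQLite in place of MySQL are assumptions for illustration, not the infrastructure being built.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def compute_top_values(conn, variables, limit=25):
    # Same shape of query as the benchmark above, run once per variable.
    results = {}
    for variable in variables:
        rows = conn.execute(
            "SELECT value, epoch_total FROM totals "
            "WHERE variable = ? ORDER BY epoch_total DESC LIMIT ?",
            (variable, limit),
        ).fetchall()
        results[variable] = [{"value": v, "epoch_total": n} for v, n in rows]
    return results


def write_snapshot(results, path: Path) -> None:
    # Write to a temporary file, then rename, so readers never see a
    # half-written snapshot.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(results))
    tmp.replace(path)


def read_snapshot(path: Path):
    # What a request handler would do: read the file, run no query.
    return json.loads(path.read_text())


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE totals (variable TEXT, value TEXT, epoch_total INTEGER)")
    conn.executemany(
        "INSERT INTO totals VALUES (?, ?, ?)",
        [("user_agent", "UA one", 5), ("user_agent", "UA two", 9)],
    )
    snapshot_path = Path(tempfile.mkdtemp()) / "top_values.json"
    write_snapshot(compute_top_values(conn, ["user_agent"]), snapshot_path)
    print(read_snapshot(snapshot_path))
```

With this split, a six-minute query costs the server once per day rather than once per request, and the results are at most a day stale.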

I'm wondering what use the top X results for a specific metric are. On the backup server, I've been working on providing the top X results for entire anonymity pools. I'm wary of giving the impression that knowing the top-frequency values for individual metrics is enough: a user might modify their browser in idiosyncratic ways to make it fit one of those values, while inadvertently making their overall fingerprint completely unique. For instance, one might imagine that the latest version of Safari for iOS is the plurality of browsers seen, with perhaps the latest version of Chrome for Windows being the runner-up. If someone were to change their Chrome UA to an iOS UA, that specific metric would become more common, but they would be making themselves completely unique: they would be the only browser out there with a content-accept string corresponding to Chrome but a UA string corresponding to iOS Safari. So at some level I'm wary that providing these statistics would lead someone down a path of making themselves more vulnerable to fingerprinting.
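The concern can be illustrated with made-up numbers. The population below is invented for the example and is not drawn from the CYT dataset; the labels and the `sharing` helper are hypothetical.

```python
from collections import Counter

# Invented population of (user agent, accept string) pairs.
population = (
    [("iOS Safari UA", "Safari accept")] * 50
    + [("Windows Chrome UA", "Chrome accept")] * 40
    + [("Linux Firefox UA", "Firefox accept")] * 10
)

ua_counts = Counter(ua for ua, _ in population)
combo_counts = Counter(population)


def sharing(ua: str, accept: str):
    # Returns (browsers sharing the UA alone, browsers sharing the full pair).
    return ua_counts[ua], combo_counts[(ua, accept)]


if __name__ == "__main__":
    # An unmodified Chrome browser blends in on both counts.
    print(sharing("Windows Chrome UA", "Chrome accept"))
    # Spoofing the most common UA improves the per-metric count, but no other
    # browser in the population presents this combination.
    print(sharing("iOS Safari UA", "Chrome accept"))
```

In this toy population the spoofed browser moves from sharing its UA with 40 others to sharing it with 50, yet the number of browsers sharing its full fingerprint drops from 40 to zero.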

There may also be some path forward if you are working on this for research purposes and would like to use the CYT dataset under contractual agreement.

@arthuredelstein
Author

Wow, that is slow! Obviously that's not going to work. Thank you for giving this a try in any case. I'll follow up by email.


2 participants