top whorl values #117

arthuredelstein · 2025-01-11T06:43:26Z

No description provided.

Hainish · 2025-01-14T02:00:42Z

Hi Arthur, for some values which we expect to be long, we store a hash of the value in the totals table instead of the actual value. I accounted for this in da12994. I'll have to run some performance tests on the queries before merging, which I hope to be able to do tomorrow.
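For readers unfamiliar with the pattern, storing a hash in place of a long value can be sketched roughly as follows. This is a minimal illustration, not the project's code: the set of hashed variables, the choice of SHA-256, and the function name `stored_value` are all assumptions made for the example.

```python
import hashlib

# Hypothetical: which variables are treated as "long" is an assumption,
# not taken from the actual schema or from da12994.
HASHED_VARIABLES = {"user_agent", "plugins", "fonts"}


def stored_value(variable: str, value: str) -> str:
    """Return what a totals row would hold for this variable.

    Long-valued variables are replaced by a fixed-length hex digest
    (SHA-256 here, chosen only for illustration); short ones are kept as-is.
    """
    if variable in HASHED_VARIABLES:
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return value
```

A consequence of this layout is that a "top values" query on a hashed variable returns digests, so the original strings have to be recovered from wherever they are kept alongside their hash.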

arthuredelstein · 2025-01-14T02:37:13Z

Hi Bill, Thanks for catching that and fixing up my code! Much appreciated. Arthur

Hainish · 2025-01-14T20:20:35Z

Unfortunately, my benchmarks show very long query times for hashed values. I don't think making these queries via an API endpoint is a viable solution given the server load it will incur. Even the simple query on our database backup server takes a while:

time docker exec -it mysql mysql -u root -p'********'  panopticlick -e "SELECT value, epoch_total FROM totals WHERE variable='user_agent' ORDER BY epoch_total DESC LIMIT 25;"
# <----- SNIP ----->
real	6m4.099s
user	0m0.104s
sys	0m0.115s

This is due to the massive size of the CYT dataset.

One way forward is running these queries on a set of metrics daily and providing those results asynchronously. This will require some infrastructural work, but I'm already working on that so this shouldn't add too much complication.
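A rough sketch of that idea, assuming a small cache table that a daily job refreshes and that an API endpoint would read instead of querying `totals` directly. The table `top_values_cache`, its columns, and both function names are hypothetical, and SQLite stands in for MySQL so the example is self-contained; only `totals`, `variable`, `value`, and `epoch_total` come from the query above.

```python
import sqlite3


def refresh_top_values(conn, variables, limit=25):
    """Run the slow ranking query once per variable and cache the result.

    Intended to be called from a scheduled job (for example, daily).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS top_values_cache ("
        " variable TEXT, pos INTEGER, value TEXT, epoch_total INTEGER,"
        " PRIMARY KEY (variable, pos))"
    )
    for variable in variables:
        rows = conn.execute(
            "SELECT value, epoch_total FROM totals"
            " WHERE variable = ? ORDER BY epoch_total DESC LIMIT ?",
            (variable, limit),
        ).fetchall()
        # Swap the cached rows for this variable in a single transaction.
        with conn:
            conn.execute(
                "DELETE FROM top_values_cache WHERE variable = ?", (variable,)
            )
            conn.executemany(
                "INSERT INTO top_values_cache VALUES (?, ?, ?, ?)",
                [
                    (variable, pos, value, total)
                    for pos, (value, total) in enumerate(rows, start=1)
                ],
            )


def read_top_values(conn, variable):
    """Cheap read path: only touches the small cache table."""
    return conn.execute(
        "SELECT value, epoch_total FROM top_values_cache"
        " WHERE variable = ? ORDER BY pos",
        (variable,),
    ).fetchall()
```

With this split, the expensive sort over `totals` happens off the request path, and readers see results that are at most one refresh interval old.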

I'm wondering what use the top X results for a specific metric are. On the backup server, I've been working on providing the top X results for entire anonymity pools. I'm wary that a user who knows the most frequent values for specific metrics will modify their browser in idiosyncratic ways to make it fit those metrics, while inadvertently making their overall fingerprint completely unique. For instance, one might imagine that the latest version of Safari for iOS is the plurality of browsers seen, with perhaps the latest version of Chrome for Windows being the runner-up. If one were to change their Chrome UA to an iOS UA, that specific metric would be more common, but they would be making themselves completely unique: they would be the only browser out there with a content-accept string corresponding to Chrome but a UA string corresponding to iOS Safari. So at some level I'm wary that providing these statistics would lead someone down a path of making themselves more vulnerable to fingerprinting.
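The effect described above can be shown with a toy calculation. The population below is invented purely for illustration (it is not CYT data), and the function name `frequencies` is hypothetical; the point is only that a value can become more common on one metric while the combination of metrics becomes unique.

```python
from collections import Counter

# Made-up population of (user_agent, accept) pairs, for illustration only.
population = (
    [("ios_safari", "safari_accept")] * 5
    + [("windows_chrome", "chrome_accept")] * 4
)


def frequencies(population, browser):
    """Count how many browsers share this one's UA, and how many share
    its full (UA, accept) combination, with the browser itself included."""
    everyone = population + [browser]
    ua_counts = Counter(ua for ua, _ in everyone)
    joint_counts = Counter(everyone)
    return ua_counts[browser[0]], joint_counts[browser]


honest = frequencies(population, ("windows_chrome", "chrome_accept"))
spoofed = frequencies(population, ("ios_safari", "chrome_accept"))
print(honest)   # UA shared by 5, full combination shared by 5
print(spoofed)  # UA shared by 6, full combination shared by 1
```

In this toy setup the spoofed UA looks more common in isolation, yet the browser ends up alone in its combined anonymity pool.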

There may also be a path forward if you are working on this for research purposes and would like to use the CYT dataset under a contractual agreement.

arthuredelstein · 2025-01-14T20:43:55Z

Wow, that is slow! Obviously that's not going to work. Thank you for giving this a try in any case. I'll follow up by email.

arthuredelstein added 2 commits January 10, 2025 22:36

upgrade docker-compose.yml to mysql lts (8.4.x)

d7fd2c8

/api/v1/top-whorls

14ded57
