top whorl values #117
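Hi Arthur, for some values which we expect to be longer in length, we store a hash of the value in the totals table instead of the actual value. This required some accounting for, in da12994. I'll have to run some performance tests on the queries before merging, which I hope to be able to do tomorrow.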
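As a rough illustration of the hashing scheme described in this comment, the sketch below looks up a total for a value that may be stored as a hash. Only the `totals` table and its `variable`, `value`, and `epoch_total` columns come from this thread; the choice of SHA-256, the variable name `long_metric`, and the use of SQLite in place of MySQL are assumptions made purely for illustration.

```python
import hashlib
import sqlite3

# Hypothetical: the set of variables whose values are stored hashed.
HASHED_VARIABLES = {"long_metric"}


def stored_value(variable, value):
    """Return the form of `value` assumed to be kept in the totals table."""
    if variable in HASHED_VARIABLES:
        # Assumption: SHA-256 hex digest; the real hash function is not stated.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return value


def lookup_total(conn, variable, value):
    """Look up epoch_total, hashing the value first when the variable requires it."""
    row = conn.execute(
        "SELECT epoch_total FROM totals WHERE variable = ? AND value = ?",
        (variable, stored_value(variable, value)),
    ).fetchone()
    return row[0] if row else 0


def make_demo_db():
    """Build a tiny in-memory stand-in for the totals table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE totals (variable TEXT, value TEXT, epoch_total INTEGER)"
    )
    long_value = "a very long metric value " * 20
    conn.executemany(
        "INSERT INTO totals VALUES (?, ?, ?)",
        [
            ("short_metric", "plain value", 40),
            ("long_metric", stored_value("long_metric", long_value), 7),
        ],
    )
    return conn, long_value
```

Under these assumptions, a "top values" query over a hashed variable would return digests rather than readable strings, so any code that ranks values has to take the hashing into account.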
Hi Bill,
Thanks for catching that and fixing up my code! Much appreciated.
Arthur
On Mon, Jan 13, 2025 at 6:01 PM William Budington wrote:
Hi Arthur, for some values which we expect to be longer in length, we
store a hash of the value in the totals table instead of the actual
value. This required some accounting for, in da12994.
I'll have to run some performance tests on the queries before merging,
which I hope to be able to do tomorrow.
Unfortunately, my benchmarks show very long query times for hashed values. I don't think making these queries via an API endpoint is a viable solution, given the server load it would incur. Even the simple query takes a while on our database backup server:

    time docker exec -it mysql mysql -u root -p'********' panopticlick -e "SELECT value, epoch_total FROM totals WHERE variable='user_agent' ORDER BY epoch_total DESC LIMIT 25;"
    # <----- SNIP ----->
    real 6m4.099s
    user 0m0.104s
    sys 0m0.115s

This is due to the massive size of the CYT dataset. One way forward is to run these queries for a set of metrics daily and provide the results asynchronously. This will require some infrastructural work, but I'm already working on that, so it shouldn't add too much complication.

I'm wondering what use the top X results for a specific metric are. On the backup server, I've been working on providing the top X results for entire anonymity pools. I'm wary of giving the impression that knowing the top-frequency values for specific metrics is helpful: a user might modify their browser in idiosyncratic ways to make it fit that specific metric, while inadvertently making their overall fingerprint completely unique.

For instance, one might imagine that the latest version of Safari for iOS is the plurality of browsers seen, with perhaps the latest version of Chrome for Windows being the runner-up. If one were to change their Chrome UA to an iOS UA, that specific metric would be more common, but they would be making themselves completely unique: they would be the only browser out there with a content-accept string corresponding to Chrome but a UA string corresponding to iOS Safari. So at some level I'm wary that providing these statistics would lead someone down a path of making themselves more vulnerable to fingerprinting.

There may also be some path forward if you are working on this for research purposes and would like to use the CYT dataset under contractual agreement.
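The daily, asynchronous approach mentioned in this comment could look roughly like the sketch below: a scheduled job runs the expensive ranking query once and writes the result to a small table that readers query instead. This is only an illustration, not the project's implementation. The `totals` table and its columns come from the query above; the `top_values_cache` table, its `pos` column, both function names, and the use of SQLite in place of MySQL are hypothetical.

```python
import sqlite3


def refresh_top_values(conn, variable, limit=25):
    """Run the expensive ranking query once and store its result for readers."""
    rows = conn.execute(
        "SELECT value, epoch_total FROM totals "
        "WHERE variable = ? ORDER BY epoch_total DESC LIMIT ?",
        (variable, limit),
    ).fetchall()
    # Replace the old list in one transaction so readers never see a partial one.
    with conn:
        conn.execute(
            "DELETE FROM top_values_cache WHERE variable = ?", (variable,)
        )
        conn.executemany(
            "INSERT INTO top_values_cache (variable, pos, value, epoch_total) "
            "VALUES (?, ?, ?, ?)",
            [(variable, i, v, t) for i, (v, t) in enumerate(rows, start=1)],
        )


def read_top_values(conn, variable):
    """Cheap read path: only touches the small precomputed table."""
    return conn.execute(
        "SELECT value, epoch_total FROM top_values_cache "
        "WHERE variable = ? ORDER BY pos",
        (variable,),
    ).fetchall()


def make_demo_db():
    """Build a tiny in-memory stand-in for the real database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE totals (variable TEXT, value TEXT, epoch_total INTEGER)"
    )
    conn.execute(
        "CREATE TABLE top_values_cache "
        "(variable TEXT, pos INTEGER, value TEXT, epoch_total INTEGER)"
    )
    conn.executemany(
        "INSERT INTO totals VALUES (?, ?, ?)",
        [
            ("user_agent", "UA-A", 50),
            ("user_agent", "UA-B", 90),
            ("user_agent", "UA-C", 10),
            ("language", "en-US", 70),
        ],
    )
    return conn
```

In such a setup, `refresh_top_values` would be called from a daily scheduled job for each metric of interest, and any endpoint would call only `read_top_values`, so request latency would no longer depend on the size of the full dataset.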
Wow, that is slow! Obviously that's not going to work. Thank you for giving this a try in any case. I'll follow up by email.