
Identifying ad elements on page via selectors + selenium #12

Open
DGaffney opened this issue Dec 6, 2021 · 6 comments

Comments

@DGaffney

DGaffney commented Dec 6, 2021

I know that this codebase appears to block ads based on network traffic, using regexes to filter out URLs associated with ad serving. But is there, in this codebase or any other, a reference for how one would identify the elements on a page, once rendered, that would need to be removed in order to strip all ads from the page's rendered DOM?

@Bass-03

Bass-03 commented Dec 6, 2021

That is very interesting.
Adblockers that run as browser extensions do exactly that: they hide elements based on CSS selectors.

Doing it at the proxy level might not be very effective, because many ads are injected by scripts after the page loads.
However, some websites do have the placeholders already in the HTML. You would need to either test every selector from lists like EasyList against each page, or build and curate your own custom list.
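To make the idea concrete, here's a minimal stdlib-only sketch of pulling the generic element-hiding selectors out of an EasyList-style filter list (the `##` syntax is real EasyList notation; the sample rules below are made up for illustration):

```python
# Sketch (stdlib only): extract generic element-hiding selectors from
# an EasyList-style filter list. Lines like "##.ad-banner" hide matching
# elements on every site; "example.com##.promo" is domain-specific.
def hiding_selectors(filter_lines):
    selectors = []
    for line in filter_lines:
        line = line.strip()
        if line.startswith("!") or "##" not in line:
            continue  # skip comments and network-blocking rules
        domains, _, selector = line.partition("##")
        if not domains:  # keep only generic (domain-less) hiding rules
            selectors.append(selector)
    return selectors

rules = ["! a comment", "||ads.example.com^", "##.ad-banner", "example.com##.promo"]
print(hiding_selectors(rules))  # ['.ad-banner']
```

A real implementation would also need to handle domain-specific rules and `#@#` exception rules, but this shows the shape of the data you'd be testing against each page.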

Is this helpful?
What would you like to do?

@DGaffney
Author

DGaffney commented Dec 6, 2021 via email

@Bass-03

Bass-03 commented Dec 6, 2021

I am so into this!
You can learn about the syntax here
https://adblockplus.org/filter-cheatsheet

And there is a python package to parse that, https://github.com/adblockplus/python-abp
Let me know if you want any more help. My email is listed on my profile; I have some knowledge about this adblocking stuff, so we can chat about it.

@epitron
Owner

epitron commented Dec 7, 2021

This would be great to add!

You're right, at the proxy level, you wouldn't be able to match the JS-created elements, but it's probably worth the effort for rules that do match (of which I assume there still are some). The biggest problem is sites which use Javascript to render the entire page, and emit essentially no HTML. They're relatively common these days.

Is beautiful soup fast enough to do real-time HTML transforms? (I'm not really a Python person.) A streaming XML processor probably isn't necessary since most pages' HTML is pretty tiny (10k-200k?). I guess, even if it's a bit slow, it'll still be faster than loading all the ads!
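For what it's worth, the transform itself is small with BeautifulSoup: parse, select, decompose. A rough sketch (the selectors here are hypothetical stand-ins; in practice they'd come from a filter list's element-hiding rules):

```python
from bs4 import BeautifulSoup

# Hypothetical selectors for illustration; a real proxy would load
# these from EasyList-style element-hiding rules.
AD_SELECTORS = [".ad-banner", "#sponsored", "div.promo"]

def strip_ads(html):
    """Remove every element matching an ad selector, return the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in AD_SELECTORS:
        for element in soup.select(selector):
            element.decompose()  # drop the element and its subtree
    return str(soup)

html = '<div class="ad-banner">buy stuff</div><p>content</p>'
print(strip_ads(html))  # '<p>content</p>'
```

Whether that's fast enough per-request at the proxy is exactly the open question; `html.parser` is pure Python, and swapping in the `lxml` backend would likely help if it matters.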

Unrelated (kinda): this codebase really needs a rewrite. It's an afternoon hack from 8 years ago, and it would be a lot nicer as a real unix-style command-line tool (with a --help screen and useful options and whatnot). It should also be using the Brave adblock engine's Python module, which is probably much faster than re2 (and written in Rust!)

I dunno, do you two use this thing much? Would these changes be helpful? Would you be interested in helping? I'm sure we're all busy. Just throwing these ideas out there in case you're interested!

@DGaffney
Author

DGaffney commented Dec 7, 2021

Thanks to both of you! To be perfectly honest, I'm using this library only as a means to an end for a fairly separate issue. I'm using Selenium to visit URLs, then comparing the network-transfer URLs loaded downstream of the root request against the rule set, using my own custom rules.should_block calls to mark which transfers originate from ad servers.

I'm also going further than that and looking at the elements on the page that may or may not be ads. For that, I'm currently using parse_filterlist from abp.filters in https://github.com/adblockplus/python-abp. Right now, for my proof of concept, I'm not too concerned about speed, but long term it will be an issue: there are ~29k CSS matching rules to consider for each HTML document, and for each rule I have to run a Selenium driver.find_elements_by_css_selector(rule) lookup, which takes ≈12-13 minutes per site right now. I'm sure there are cleverer ways, but brute force is sufficient to at least show the idea works in principle. If you have thoughts about speeding up that portion, I'm all ears, but I should show my hand here and say my use cases for the repo you've built are tangential. That said, I'm happy to help where it may be useful!
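One possible speedup, offered as a sketch rather than a tested fix: instead of one driver round-trip per rule, join the rules into comma-separated selector groups and match each group with a single `querySelectorAll` in the browser. Two caveats: one invalid selector invalidates its entire group, and a grouped match no longer tells you which individual rule fired (you'd re-test a matching group rule-by-rule to find out).

```python
# Sketch: batch ~29k CSS rules into comma-joined selector groups so one
# in-browser querySelectorAll call covers many rules at once, instead of
# one Selenium round-trip per rule.
def batch_selectors(rules, group_size=500):
    for i in range(0, len(rules), group_size):
        yield ", ".join(rules[i:i + group_size])

# Hypothetical Selenium usage (one round-trip per group, not per rule):
# for group in batch_selectors(rules):
#     n = driver.execute_script(
#         "return document.querySelectorAll(arguments[0]).length", group)

groups = list(batch_selectors([".a", ".b", ".c"], group_size=2))
print(groups)  # ['.a, .b', '.c']
```

With group_size=500, 29k rules becomes ~60 round-trips instead of 29,000, which should cut the per-site time dramatically even before any smarter filtering.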

@Bass-03

Bass-03 commented Dec 7, 2021

hey @epitron
I was looking into creating something like this a while back; then I found this project and sort of stopped.

I have some insights on adblocking, I might be able to help.
