Identifying ad elements on page via selectors + selenium #12
That is very interesting. Doing it at the proxy level might not be very effective because ads are injected by scripts. Is this helpful? |
Super helpful - after doing more digging, I think EasyList does contain the information I’m searching for. Entries like `bing.com##.productAd` are the syntax for these. If I’m correct, this rule means I need to look for elements on bing.com that match the class selector `.productAd`, right? Are you aware of any parser that converts the EasyList syntax into something else, or of more robust directions for parsing it?
… On Dec 6, 2021, at 10:52 AM, Edmundo Sanchez ***@***.***> wrote:
That is very interesting.
Adblockers that are browser extensions do that, they hide elements based on selectors.
Doing it at the proxy level might not be very effective because ads are injected by scripts.
However, some websites do have the placeholders already in the HTML, so you might need to either test for every selector on lists like EasyList (https://easylist.to/), or build your own custom list and curate it.
Is this helpful?
What would you like to do?
|
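To make the `bing.com##.productAd` reading above concrete, here is a minimal sketch of splitting an element-hiding rule into its domain list and CSS selector. It uses only the standard library rather than python-abp, and the names `HidingRule`, `parse_hiding_rule`, and `applies_to` are made up for illustration. It deliberately ignores exception rules (`#@#`), extended selectors (`#?#`), and network filters, so treat it as a starting point rather than a full EasyList parser.

```python
from typing import NamedTuple, Optional, Tuple


class HidingRule(NamedTuple):
    """Hypothetical container for one element-hiding rule."""
    domains: Tuple[str, ...]           # domains the rule is limited to (empty = any site)
    excluded_domains: Tuple[str, ...]  # domains written as ~example.com
    selector: str                      # CSS selector of the elements to hide


def parse_hiding_rule(line: str) -> Optional[HidingRule]:
    """Parse 'domains##selector'; return None for anything else."""
    line = line.strip()
    # Comments start with '!', the list header looks like '[Adblock Plus 2.0]'.
    if not line or line.startswith("!") or line.startswith("["):
        return None
    domain_part, sep, selector = line.partition("##")
    if not sep or not selector:
        return None  # network filter, exception rule, or extended selector
    included, excluded = [], []
    for domain in filter(None, (d.strip() for d in domain_part.split(","))):
        if domain.startswith("~"):
            excluded.append(domain[1:])
        else:
            included.append(domain)
    return HidingRule(tuple(included), tuple(excluded), selector)


def _host_matches(host: str, domain: str) -> bool:
    # A rule for bing.com also covers subdomains such as www.bing.com.
    return host == domain or host.endswith("." + domain)


def applies_to(rule: HidingRule, host: str) -> bool:
    """Decide whether a rule should be tried on a page served from `host`."""
    if any(_host_matches(host, d) for d in rule.excluded_domains):
        return False
    if not rule.domains:
        return True  # generic rule, e.g. '##.ad-banner'
    return any(_host_matches(host, d) for d in rule.domains)
```

With this, `parse_hiding_rule("bing.com##.productAd")` yields a rule whose selector is `.productAd` and which `applies_to` reports as relevant for `www.bing.com` but not for other sites.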
I am so into this! There is also a Python package that parses that syntax: https://github.com/adblockplus/python-abp |
This would be great to add! You're right, at the proxy level, you wouldn't be able to match the JS-created elements, but it's probably worth the effort for rules that do match (of which I assume there still are some). The biggest problem is sites that use JavaScript to render the entire page and emit essentially no HTML. They're relatively common these days. Is Beautiful Soup fast enough to do real-time HTML transforms? (I'm not really a Python person.) A streaming XML processor probably isn't necessary since most pages' HTML is pretty tiny (10k-200k?). I guess, even if it's a bit slow, it'll still be faster than loading all the ads! Unrelated (kinda): this codebase really needs a rewrite. It's an afternoon hack from 8 years ago, and it would be a lot nicer as a real Unix-style command-line tool (with a I dunno, do you two use this thing much? Would these changes be helpful? Would you be interested in helping? I'm sure we're all busy. Just throwing these ideas out there in case you're interested! |
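As a rough illustration of what a proxy-level HTML transform could look like, here is a sketch that drops every element carrying a blocked class. It swaps in the standard library's `html.parser` instead of Beautiful Soup so that it has no dependencies, and it only understands plain class selectors such as `.productAd`. The names `ClassStripper` and `strip_by_class` are made up for this example, and the depth tracking assumes reasonably well-formed HTML (an unclosed `<p>` inside a removed element would confuse it).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML while skipping elements that carry a blocked class."""

    def __init__(self, blocked_classes):
        super().__init__(convert_charrefs=False)
        self.blocked = set(blocked_classes)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def _is_blocked(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(self.blocked.intersection(classes))

    def handle_starttag(self, tag, attrs):
        is_void = tag in VOID_TAGS
        if self.skip_depth:
            if not is_void:
                self.skip_depth += 1  # nested element inside a removed one
            return
        if self._is_blocked(attrs):
            if not is_void:
                self.skip_depth = 1   # start skipping until the matching end tag
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img .../> never change the depth.
        if not self.skip_depth and not self._is_blocked(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_by_class(html, blocked_classes):
    """Return `html` with all elements carrying a blocked class removed."""
    parser = ClassStripper(blocked_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `strip_by_class(page_html, {"productAd"})` would remove any element whose `class` attribute includes `productAd`, along with its children. Full CSS selectors would need a real selector engine, which is where Beautiful Soup or a similar library comes in.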
Thanks to both of you! To be perfectly honest, I'm using this library only as a means to an end for a fairly separate issue - I am using selenium to visit URLs, then comparing the network transfer URLs loaded downstream of the root request against the rule set to mark network transfers as originating on ad servers with my own custom |
Hey @epitron, I have some insights on adblocking and might be able to help. |
I know that this codebase appears to block based on network traffic, using regexes to filter out URLs that are associated with ad serving. Is there, in this codebase or any other, a reference for how one would identify the elements on a rendered page that would need to be removed in order to drop all ads from the page's DOM?
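One way to approach this with Selenium, sketched under a few assumptions: you already have a list of CSS selectors (for example, extracted from element-hiding rules) and a WebDriver that has finished loading the page. The helper name `find_ad_elements` is hypothetical. It calls `driver.find_elements` with the locator strategy string `"css selector"`, which is the value of Selenium's `By.CSS_SELECTOR`, so the sketch itself does not need to import Selenium.

```python
# Same string as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def find_ad_elements(driver, selectors):
    """Map each selector to the elements it matches on the current page.

    `driver` is assumed to be a Selenium WebDriver (or anything exposing a
    compatible find_elements(by, value) method). Selectors without matches
    are left out of the result.
    """
    matches = {}
    for selector in selectors:
        try:
            elements = driver.find_elements(CSS_SELECTOR, selector)
        except Exception:
            # Filter lists can contain selectors the browser rejects;
            # skip those instead of aborting the whole scan.
            continue
        if elements:
            matches[selector] = elements
    return matches
```

With a real driver this would be called after `driver.get(url)`, for example `find_ad_elements(driver, [".productAd"])`. Elements injected by scripts only show up once those scripts have run, so some kind of wait before scanning is probably needed.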