
Identifying ad elements on page via selectors + selenium #12

Open
DGaffney opened this issue Dec 6, 2021 · 6 comments

Comments

@DGaffney

DGaffney commented Dec 6, 2021

I know that this codebase appears to block ads based on network traffic, using regexes to filter out URLs associated with ad serving. But is there, in this codebase or any other, a reference for how one would identify the elements on a page, once rendered, that would need to be removed in order to strip all ads from the page's rendered DOM?

@Bass-03

Bass-03 commented Dec 6, 2021

That is very interesting.
Adblockers that run as browser extensions do exactly that: they hide elements based on CSS selectors.

Doing it at the proxy level might not be very effective, because many ads are injected by scripts after the page loads.
However, some websites do have the placeholders already in the HTML. You would need to either test every selector from lists like EasyList against each page, or build and curate your own custom list.
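To make the idea concrete, here's a minimal stdlib-only sketch of pulling the generic element-hiding selectors out of an EasyList-style filter list (the `##` syntax is real EasyList notation; the sample rules below are made up for illustration):

```python
# Sketch (stdlib only): extract generic element-hiding selectors from
# an EasyList-style filter list. Lines like "##.ad-banner" hide matching
# elements on every site; "example.com##.promo" is domain-specific.
def hiding_selectors(filter_lines):
    selectors = []
    for line in filter_lines:
        line = line.strip()
        if line.startswith("!") or "##" not in line:
            continue  # skip comments and network-blocking rules
        domains, _, selector = line.partition("##")
        if not domains:  # keep only generic (domain-less) hiding rules
            selectors.append(selector)
    return selectors

rules = ["! a comment", "||ads.example.com^", "##.ad-banner", "example.com##.promo"]
print(hiding_selectors(rules))  # ['.ad-banner']
```

A real implementation would also need to handle domain-specific rules and `#@#` exception rules, but this shows the shape of the data you'd be testing against each page.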

Is this helpful?
What would you like to do?

@DGaffney
Author

DGaffney commented Dec 6, 2021 via email

@Bass-03

Bass-03 commented Dec 6, 2021

I am so into this!
You can learn about the syntax here
https://adblockplus.org/filter-cheatsheet

And there is a python package to parse that, https://github.com/adblockplus/python-abp
Let me know if you want any more help. My email is listed on my profile; I have some knowledge about this adblocking stuff, so we can chat about it.

@epitron
Owner

epitron commented Dec 7, 2021

This would be great to add!

You're right, at the proxy level, you wouldn't be able to match the JS-created elements, but it's probably worth the effort for rules that do match (of which I assume there still are some). The biggest problem is sites which use Javascript to render the entire page, and emit essentially no HTML. They're relatively common these days.

Is beautiful soup fast enough to do real-time HTML transforms? (I'm not really a Python person.) A streaming XML processor probably isn't necessary since most pages' HTML is pretty tiny (10k-200k?). I guess, even if it's a bit slow, it'll still be faster than loading all the ads!
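For what it's worth, the transform itself is small with BeautifulSoup: parse, select, decompose. A rough sketch (the selectors here are hypothetical stand-ins; in practice they'd come from a filter list's element-hiding rules):

```python
from bs4 import BeautifulSoup

# Hypothetical selectors for illustration; a real proxy would load
# these from EasyList-style element-hiding rules.
AD_SELECTORS = [".ad-banner", "#sponsored", "div.promo"]

def strip_ads(html):
    """Remove every element matching an ad selector, return the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in AD_SELECTORS:
        for element in soup.select(selector):
            element.decompose()  # drop the element and its subtree
    return str(soup)

html = '<div class="ad-banner">buy stuff</div><p>content</p>'
print(strip_ads(html))  # '<p>content</p>'
```

Whether that's fast enough per-request at the proxy is exactly the open question; `html.parser` is pure Python, and swapping in the `lxml` backend would likely help if it matters.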

Unrelated (kinda): this codebase really needs a rewrite. It's an afternoon hack from 8 years ago, and it would be a lot nicer as a real unix-style command-line tool (with a --help screen and useful options and whatnot). It should also be using the Brave adblock engine's Python module, which is probably much faster than re2 (and written in Rust!)

I dunno, do you two use this thing much? Would these changes be helpful? Would you be interested in helping? I'm sure we're all busy. Just throwing these ideas out there in case you're interested!

@DGaffney
Author

DGaffney commented Dec 7, 2021

Thanks to both of you! To be perfectly honest, I'm using this library only as a means to an end for a fairly separate issue. I'm using Selenium to visit URLs, then comparing the network-transfer URLs loaded downstream of the root request against the rule set, using my own custom rules.should_block calls to mark which transfers originate from ad servers.

I'm also going further than that and looking at the elements on the page that may or may not be ads. For that, I'm currently using parse_filterlist from abp.filters in https://github.com/adblockplus/python-abp. Right now, for my proof of concept, I'm not too concerned about speed, but long term it will be an issue: there are ~29k CSS matching rules to consider for each HTML document, and for each rule I have to run a Selenium driver.find_elements_by_css_selector(rule) lookup, which takes ≈12-13 minutes per site right now. I'm sure there are cleverer ways, but brute force is sufficient to at least show the idea works in principle. If you have thoughts about speeding up that portion, I'm all ears, but I should show my hand here and say my use cases for the repo you've built are tangential. That said, I'm happy to help where it may be useful!
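One possible speedup, offered as a sketch rather than a tested fix: instead of one driver round-trip per rule, join the rules into comma-separated selector groups and match each group with a single `querySelectorAll` in the browser. Two caveats: one invalid selector invalidates its entire group, and a grouped match no longer tells you which individual rule fired (you'd re-test a matching group rule-by-rule to find out).

```python
# Sketch: batch ~29k CSS rules into comma-joined selector groups so one
# in-browser querySelectorAll call covers many rules at once, instead of
# one Selenium round-trip per rule.
def batch_selectors(rules, group_size=500):
    for i in range(0, len(rules), group_size):
        yield ", ".join(rules[i:i + group_size])

# Hypothetical Selenium usage (one round-trip per group, not per rule):
# for group in batch_selectors(rules):
#     n = driver.execute_script(
#         "return document.querySelectorAll(arguments[0]).length", group)

groups = list(batch_selectors([".a", ".b", ".c"], group_size=2))
print(groups)  # ['.a, .b', '.c']
```

With group_size=500, 29k rules becomes ~60 round-trips instead of 29,000, which should cut the per-site time dramatically even before any smarter filtering.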

@Bass-03

Bass-03 commented Dec 7, 2021

hey @epitron
I was looking into creating something like this a while back; then I found this project and sort of stopped.

I have some insights on adblocking, I might be able to help.
