Filtering ads in RSS feeds?
Jan. 14, 2009, 04:13 AM
Post: #1
Filtering ads in RSS feeds?
First of all a BIG warm hello to you all! It's nice to see this forum pick up the pace (yes, I was lurking around for some time now). I even thought I already had an account here; what a surprise I had this morning trying to log in! D'oh!

Now to the real "issue":

Lately I've been getting more and more ads from Google in my RSS feeds (feedads and feedads[1-7].googleadservices.com). I guess there are more than 7!
Although there have been some hits from AdDomains, I think it was only some cookie thing, as the ad was still displayed.
There's one odd thing: sometimes the ad doesn't display, but Proxo has nothing to do with it (some Google problem, I guess).
Example:
Code:
+++GET 1930+++
GET /~at/6CzHTDwmP2AiIJaIZlJ_doivUbE/i HTTP/1.1
Accept: */*
User-Agent: FeedDemon/2.5 (http://www.newsgator.com/; Microsoft Windows XP)
Accept-Encoding: gzip, x-gzip, deflate
Host: feedads.googleadservices.com
Connection: keep-alive
RESP 1930 : Cache-Control replaced: no-store, no-cache, must-revalidate, private
BlockList 1930: in AdDomains, line 347
RESP 1930 : Set-Cookie Ad killed: AdD googleadservices : testcookie=1

+++RESP 1930+++
HTTP/1.1 302 Moved Temporarily
Transfer-Encoding: chunked
Date: Tue, 13 Jan 2009 12:18:37 GMT
Content-Type: text/html; charset=UTF-8
P3P: policyref="http://www.googleadservices.com/pagead/p3p.xml", CP="NOI DEV PSA PSD IVA PVD OTP OUR OTR IND OTC"
X-Content-Type-Options: nosniff
Server: GFE/1.3
Location: http://feedads.googleadservices.com:80/~at/6CzHTDwmP2AiIJaIZlJ_doivUbE/i?tr=1
+++CLOSE 1930+++

+++GET 1931+++
GET /~at/mKdKNbVZdf--pPEAGy_ng2XFcDw/i HTTP/1.1
Accept: */*
User-Agent: FeedDemon/2.5 (http://www.newsgator.com/; Microsoft Windows XP)
Accept-Encoding: gzip, x-gzip, deflate
Host: feedads.googleadservices.com
Connection: keep-alive
RESP 1931 : Cache-Control replaced: no-store, no-cache, must-revalidate, private
BlockList 1931: in AdDomains, line 347
RESP 1931 : Set-Cookie Ad killed: AdD googleadservices : testcookie=1

+++RESP 1931+++
HTTP/1.1 302 Moved Temporarily
Transfer-Encoding: chunked
Date: Tue, 13 Jan 2009 12:19:59 GMT
Content-Type: text/html; charset=UTF-8
P3P: policyref="http://www.googleadservices.com/pagead/p3p.xml", CP="NOI DEV PSA PSD IVA PVD OTP OUR OTR IND OTC"
X-Content-Type-Options: nosniff
Server: GFE/1.3
Location: http://feedads.googleadservices.com:80/~at/mKdKNbVZdf--pPEAGy_ng2XFcDw/i?tr=1
+++CLOSE 1931+++
The first request (http://feedads.googleadservices.com/~at/..._doivUbE/i) doesn't display anything, the second one DOES (http://feedads.googleadservices.com:80/~...2XFcDw/i).



1) Sidki's set doesn't filter XML "out of the box" right?

So I did my homework and found this:
http://prxbx.com/forums/showthread.php?t...filter+xml
http://prxbx.com/forums/showthread.php?t...filter+xml
http://prxbx.com/forums/showthread.php?tid=1084

And imported z12's version:
Code:
[HTTP headers]
In = TRUE
Out = FALSE
Key = "Content-Type: 5. Filter XML (in)"
Match = "((text/xml|application/xml)*)\0"
Replace = "\0$FILTER(true)"

2) After enabling this filter, why don't some of the general ad removers kick in?

With my caveman abilities I managed to rewrite feedads.googleadservices.com -> f.googleadservices.com (whatever!)
Currently it's filtering EVERYTHING, so I need to narrow the scope. Do I need to test for Content-Type in the pattern section so it doesn't trigger on every page I load and hurt performance?
EXAMPLES VERY WELCOME!

3) AND, everything between <div class="feedflare"> * </div> isn't needed. Can't get this to work properly.


Example RSS feeds with googleadservices:
http://feeds.urbandictionary.com/UrbanWordOfTheDay
http://arstechnica.com/index.rssx

I'm using Sidki's latest set, Opera 9.63, and FeedDemon for the feed reading.
Jan. 14, 2009, 11:08 AM
Post: #2
RE: Filtering ads in RSS feeds?
(Jan. 14, 2009 04:13 AM)eclipse Wrote:  1) Sidki's set doesn't filter XML "out of the box" right?

Right.


Quote:2) After enabling this filter, why don't some of the general ad removers kick in?

The general ad removers are restricted to $TYPE(htm) and $TYPE(js), one of the major reasons why this config isn't slow.

To have them kick in you'd need to change XML content-types to text/html. Besides other major disadvantages this would rarely help because XML uses different tags than HTML (except for application/xhtml+xml which is handled by the config).

To successfully filter feeds, you'd need to write specific filters that kick in exclusively for XML content-types. There are a bunch of those types, so use an appropriate $IHDR(Content-Type:..) expression.
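
As a rough illustration of the Content-Type gating described above, here is a minimal Python sketch (not Proxomitron filter syntax). The set of media types and the function name are assumptions chosen for the example; application/xhtml+xml is deliberately left out because the config already handles it.

```python
# Hypothetical list of XML feed media types; extend it as real feeds turn up.
XML_FEED_TYPES = {
    "text/xml",
    "application/xml",
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
}


def is_feed_content_type(header_value: str) -> bool:
    """Return True when a Content-Type header value names an XML feed type."""
    # Drop parameters such as "; charset=UTF-8" and normalise the case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in XML_FEED_TYPES
```

Gating on the media type first means feed-specific filters never run on ordinary HTML pages, which is the same performance argument as for restricting the general ad removers to $TYPE(htm) and $TYPE(js).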


Quote:3) AND, everything between <div class="feedflare"> * </div> isn't needed. Can't get this to work properly.


Example RSS feeds with googleadservices:
http://feeds.urbandictionary.com/UrbanWordOfTheDay
http://arstechnica.com/index.rssx

I loaded both feeds with Firefox. I didn't see any ads, and the source didn't contain any HTML tags.

The last time i checked whether XML filtering was worthwhile was ~18 months ago. Ads were rare, and all the ads i saw were pushed as images.

If feed ads are an issue now, i'd suggest that you and others use this very thread to collect example links for ad'ish feeds. Implementing feed filtering reliably would require a lot of examples! Dozens! Smile!
Jan. 14, 2009, 11:17 AM (This post was last modified: Jan. 14, 2009 11:40 AM by lnminente.)
Post: #3
RE: Filtering ads in RSS feeds?
Here is my filter to allow filtering of XML and other content types. I made it match only when needed, and when it matches it writes an entry to the log window.
Code:
[HTTP headers]
In = TRUE
Out = FALSE
Key = "Content-Type: Enable filtering by Content-Type {ln}090114b (in)"
URL = "(^local.ptron/*)"
Match = "((text/(^css|html|javascript|plain)*)|\w/(xml|javascript) *)\1"
Replace = "$FILTER(true)\1$LOG(w$DTM(c): Enable filtering by Content-Type (\1) in \u)"

I also recommend that you write specific filters, as Sidki said.
For the moment you could block access to http://feedads.googleadservices.com

Maybe adding that host to one of sidki's lists would be sufficient
Jan. 14, 2009, 02:41 PM
Post: #4
RE: Filtering ads in RSS feeds?
Quote:I loaded both feeds with Firefox. I didn't see any ads

Beware of the preview mode, as every feed I tried in firefox and opera DIDN'T display any ads (although they WERE there). You could try also in IE, and FeedBurner will kick in and display a preview of their own (something to do with making the subscription more user friendly I guess?). This code apparently is embedded in the source feed.

Please take a look at the attached pics. Different interpretations of the standard? Why would the unfiltered ads display in "subscribed mode" but not in "preview mode"?

.png  ARS FEED OPERA PREVIEW.png (Size: 59.82 KB / Downloads: 721)
.png  ARS FEED OPERA SUBSCRIBED.png (Size: 99.09 KB / Downloads: 825)


Quote:and the source didn't contain any HTML tags.
What about this...
This is part of the source from the preview in opera:
Code:
&lt;p&gt;&lt;a href="http://feedads.googleadservices.com/~at/3l_uhy3mKKW_q5qbxEm2bE8WEP8/a"&gt;&lt;img src="http://feedads.googleadservices.com/~at/3l_uhy3mKKW_q5qbxEm2bE8WEP8/i" border="0" ismap="true"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;

Once you are subscribed you can see the source of the feed as an attachment (mime encoded): (exact same extract)
Code:
<p><a href="http://feedads.googleadservices.com/~at/3l_uhy3mKKW_q5qbxEm2bE8WEP8/a"><img src="http://feedads.googleadservices.com/~at/3l_uhy3mKKW_q5qbxEm2bE8WEP8/i" border="0" ismap="true"></img></a></p>

Quote:Implementing feed filtering reliably would require a lot of examples! Dozens!
Up until now, I've only encountered ads in feeds that are in some way related to FeedBurner (and always Google ads).
An easy way to tell if you're going to have ads is to enter the feed URL in IE. If you get a FeedBurner "friendly" page, you'll probably have ads from Google (later in a reader, not in the preview).
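
Another way to check a feed, sketched in Python under the assumption that the ads always reference a feedads or feedadsN host on googleadservices.com, is to count those host names in the raw feed text. The function name is made up for the example. The host name itself is never entity-escaped, so the count works on both the escaped and the unescaped form of the markup.

```python
import re

# Matches feedads.googleadservices.com and numbered variants such as feedads3.
AD_HOST_RE = re.compile(r"feedads\d*\.googleadservices\.com", re.IGNORECASE)


def count_feed_ad_references(feed_text: str) -> int:
    """Count references to the feed-ad hosts in raw feed text."""
    return len(AD_HOST_RE.findall(feed_text))
```

Each ad in the samples above contributes two references, one for the link and one for the image.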
Jan. 14, 2009, 03:55 PM
Post: #5
RE: Filtering ads in RSS feeds?
(Jan. 14, 2009 02:41 PM)eclipse Wrote:  Beware of the preview mode, as every feed I tried in firefox and opera DIDN'T display any ads (although they WERE there).

Ahh, i didn't know that Firefox is skipping ads that get displayed in other readers. I don't think that Firefox also has a "preview" and "subscribed" mode. It's simply interpreting RSS docs that way.

I don't use another feed reader besides Firefox, and i'm not intending to do so, unless sidki-config support would force me. Wink


Quote:
Quote:and the source didn't contain any HTML tags.
What about this...
This is part of the source from the preview in opera:
Code:
&lt;p&gt;&lt;a href="http://feedads.googleadservices.com/~at/3l_uhy3mKKW_q5qbxEm2bE8WEP8/a"&gt;&lt;img src="http://feedads.googleadservices.com/~at/3l_uhy3mKKW_q5qbxEm2bE8WEP8/i" border="0" ismap="true"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;

Obviously these aren't HTML tags. Smile!

But seriously, i remember now having seen this before, escaped HTML within <description> feed tags, that is.

Anti-ad HTML filters don't catch these. Your XML web filters would need to look for such escaped ads, possibly only within description blocks. One problem is that - at least theoretically - there are many ways to escape HTML code (entities, Hex, Unicode, ...).
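
To make that concrete, here is a minimal Python sketch (not a Proxomitron filter) of one way to handle escaped ads inside description blocks. It assumes the ad is an <a> element whose href points at a feedads host, as in the sample above, and that description elements carry no attributes and no CDATA sections; the function and pattern names are hypothetical. Decoding with html.unescape covers named, decimal and hex entities in one step.

```python
import html
import re

# Assumed ad shape: an <a> element linking to a feedads*.googleadservices.com host.
AD_LINK_RE = re.compile(
    r'<a\s[^>]*href="https?://feedads\d*\.googleadservices\.com[^"]*"[^>]*>'
    r".*?</a>",
    re.IGNORECASE | re.DOTALL,
)
DESCRIPTION_RE = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)


def _clean_description(match: re.Match) -> str:
    open_tag, body, close_tag = match.groups()
    decoded = html.unescape(body)          # undo entity, decimal and hex escaping
    cleaned = AD_LINK_RE.sub("", decoded)  # drop the ad link and whatever it wraps
    if cleaned == decoded:
        return match.group(0)              # no ad found: leave the block untouched
    # Re-escape so the result is still valid text content for the XML feed.
    return open_tag + html.escape(cleaned, quote=False) + close_tag


def strip_feed_ads(feed_xml: str) -> str:
    """Remove escaped feed-ad links from every <description> block."""
    return DESCRIPTION_RE.sub(_clean_description, feed_xml)
```

Blocks without an ad are returned byte for byte, so the decode and re-encode round trip only touches descriptions that actually contained one.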
Jan. 14, 2009, 07:50 PM
Post: #6
RE: Filtering ads in RSS feeds?
Do the escaped/unescaped characters have something to do with the ads being displayed/not displayed?

1) Firefox and Opera, escaped: ads don't display
2) Opera, unescaped: ads sometimes display
3) Opera, unescaped: ads sometimes don't display

In the instances in which they don't appear (3), it is because the file moved! If you try to follow the link you'll get a "410 Gone" error. I hope they keep it this way (malfunctioning!)

Example with FFox!:

.png  ARS FEED FFOX ad present.png (Size: 50.29 KB / Downloads: 715)
====================================================

I want to match only between <description> tags as Sidki kindly observed
Code:
&lt;a href="http://feedads.googleadservices.com/~at/L2p4IunI2o6JCJrIO8NknJGCQiY/a"&gt;&lt;img
src="http://feedads.googleadservices.com/~at/L2p4IunI2o6JCJrIO8NknJGCQiY/i" border="0"
ismap="true"&gt;&lt;/img&gt;&lt;/a&gt;

Any suggestion (I mean example Wink ) to get me going? TIA

EDIT - wrong pic!
Jan. 14, 2009, 11:02 PM
Post: #7
RE: Filtering ads in RSS feeds?
(Jan. 14, 2009 07:50 PM)eclipse Wrote:  Do the escaped/unescaped characters have something to do with the ads being displayed/not displayed?

Well, apparently readers that feature a threaded mode unescape that code. Nonetheless, it always arrives escaped.


Quote:In the instances in which they don't appear (3), it is because the file moved! If you try to follow the link you'll get a "410 Gone" error. I hope they keep it this way (malfunctioning!)

If that's fine for you, and as long as Google, who acquired FeedBurner, keeps sticking to ad banners, why not skip the entire XML filtering business altogether, and do what lnminente suggested (thanks :-):
(Jan. 14, 2009 11:17 AM)lnminente Wrote:  For the moment you could block access to http://feedads.googleadservices.com

Maybe adding that host to one of sidki's lists would be sufficient

IncludeExclude-U.ptxt:
Code:
## various redirects
##
## example:
## this.site.com/foo/bad_pic.gif    $RDIR(http://local.ptron/killed.gif)
## ----------------------------------------------------------------------------
feedads.googleadservices.com        $RDIR(http://local.ptron/killed.gif)

Final slash is omitted b/c Google has a tendency to switch to HTTPS (:443/).
Heck, you could even replace killed.gif with a tiny, minimalist, yet beautiful "feed ad" image.
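
The point about leaving off the final slash can be sketched in Python (this is not how Proxomitron matches list entries internally, just the same idea): compare only the host name, so an explicit port such as :80 or :443 does not defeat the match. BLOCKED_HOST and is_blocked_host are names made up for the example.

```python
from urllib.parse import urlsplit

BLOCKED_HOST = "feedads.googleadservices.com"


def is_blocked_host(url: str) -> bool:
    """Return True when the URL's host is the feed-ad host, whatever the port."""
    # .hostname is lower-cased and has the port stripped; it is None without a host.
    return urlsplit(url).hostname == BLOCKED_HOST
```

The redirect target seen in the log earlier in the thread carries an explicit :80, which is the kind of URL a slash-terminated host entry could miss.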
Jan. 15, 2009, 01:43 AM
Post: #8
RE: Filtering ads in RSS feeds?
Quote:If that's fine for you, and as long as Google, who acquired FeedBurner, keeps sticking to ad banners, why not skip the entire XML filtering business altogether, and do what lnminente suggested (thanks :-):

I wonder why we didn't pursue that solution, the simplest of all. I considered a HOSTS entry, but I prefer to do all my filtering with Proxo.

When I used jd5000, and then grypen set (ages ago), there was this beautiful blocklist with entries for domains.
I didn't know how to accomplish this with your filter set, sidki.
This might come in handy if I encounter any other rogue ad.

lnminente, thanks for your suggestion. I kind of got lost with so many new things to learn, and forgot your approach altogether (very classy!)
Jan. 16, 2009, 04:23 PM
Post: #9
RE: Filtering ads in RSS feeds?
(Jan. 14, 2009 11:02 PM)sidki3003 Wrote:  IncludeExclude-U.ptxt:
Code:
## various redirects
##
## example:
## this.site.com/foo/bad_pic.gif    $RDIR(http://local.ptron/killed.gif)
## ----------------------------------------------------------------------------
feedads.googleadservices.com        $RDIR(http://local.ptron/killed.gif)

Final slash is omitted b/c Google has a tendency to switch to HTTPS (:443/).
Heck, you could even replace killed.gif with a tiny, minimalist, yet beautiful "feed ad" image.

So, did that entry work for you?
If so, it may be worth adding to the default list.
Do you have graphic skills? Wink
Jan. 16, 2009, 06:37 PM
Post: #10
RE: Filtering ads in RSS feeds?
Quote:So, did that entry work for you?
Yes, it did!! Wink (maybe I wasn't too clear before D'oh!)

Quote:If so, it may be worth adding to the default list.
That might be a great idea!
So far, so good, any anomaly will be reported...

Quote:Do you have graphic skills?
I have a cute 1x1 transparent .gif sitting on my desktop, if that's what you mean, LOL! Smile!
Jan. 16, 2009, 07:00 PM
Post: #11
RE: Filtering ads in RSS feeds?
Glad to help Wink

I agree with adding it to the default list, Sidki. I read somewhere that Google provides nearly fifty percent of the ads on the internet :o So maybe a speed improvement could be had by checking whether it's from Google before looking at the full list...
Maybe it would be better to add only the root host googleadservices.com. I searched for it at http://adblockplus.mozdev.org/easylist/a...ick752.txt and they use it too

One tip/question, Sidki: wouldn't $JUMP be better than $RDIR?
- $RDIR will be safer for sure, but the same image would be cached many times.
- with $JUMP the many images would be cached only once. OK, it's not a big optimization, but if it doesn't cause problems it could be a small improvement


Attached File(s) Image(s)
   
Jan. 16, 2009, 08:22 PM
Post: #12
RE: Filtering ads in RSS feeds?
Attached pic: I've re-included that one already in beta 2 for object data bugs, e.g.: http://www.dublincityjobs.ie/ (very bottom, halfway left). Smile!
Nice one for a rare bug, but not for - sometimes useful - ad banners, IMO.


(Jan. 16, 2009 07:00 PM)lnminente Wrote:  I agree with adding it to the default list, Sidki. I read somewhere that Google provides nearly fifty percent of the ads on the internet :o So maybe a speed improvement could be had by checking whether it's from Google before looking at the full list...

As for header-blocking Google servers, see below. As for list speed, should be okay, IncludeExclude gets hashed pretty well (pre: 10, url 10).

Quote:Maybe it would be better to add only the root host googleadservices.com. I searched for it at http://adblockplus.mozdev.org/easylist/a...ick752.txt and they use it too

Personally, i dislike the sledgehammer Hosts file approach (blocking all documents from listed hosts), which the discussed entry actually resembles. People who want Proxomitron to behave like that can activate the "!-|||||||||||| URL: Block Ad URLs" header filter.
I need to load docs from ad hosts quite often (e.g. via the "Show x Scripts" Prox menu entry).

In fact, thinking about it, i'm not even comfortable with the "feedads.googleadservices.com" sledgehammer entry, and i want to narrow it down. Hence my request below.


Quote:One tip/question, Sidki: wouldn't $JUMP be better than $RDIR?
- $RDIR will be safer for sure, but the same image would be cached many times.
- with $JUMP the many images would be cached only once. OK, it's not a big optimization, but if it doesn't cause problems it could be a small improvement

Sure, worth a shot, since the images are off-site anyway, and i doubt that a page script does a location check.


eclipse, could you replace your list entry with the one below?
Code:
feedads.googleadservices.com
  $OHDR(Referer: \3)$ADDLST(Log-Rare,HDR_Out IncEx\tRef: \3 URL: \u)
  $JUMP(http://local.ptron/killed.gif)

If so, and if it doesn't work as well as before, just replace "$JUMP" with "$RDIR" (and tell us).
After a short while you should get a bunch of new entries in your Log-Rare.log. Multiple entries per feed. Looking like:

Code:
HDR_Out IncEx    Ref: http://feeds.urbandictionary.com/UrbanWordOfTheDay URL: http://feedads.googleadservices.com/~a/u7MHwqiRCboZQFJjSXcc_OBYkqc/i
HDR_Out IncEx    Ref: http://feeds.arstechnica.com/arstechnica/BAaf URL: http://feedads.googleadservices.com/~at/Oll3zgunCfm9ibKE7zV-ZcwTzIM/i

Then please post these entries here, as attachment.
Jan. 16, 2009, 09:00 PM
Post: #13
RE: Filtering ads in RSS feeds?
I took the image of the bug from (Paul|you), i don't remember. But what is for sure is i cut it to 16x16, the same size as the favicons, that's the only difference :P

(Jan. 16, 2009 08:22 PM)sidki3003 Wrote:  Personally, i dislike the sledgehammer Hosts file approach (blocking all documents from listed hosts), which the discussed entry actually resembles.
My mistake, Sidki; I was presupposing it worked like my other filter. I usually block hosts for images together with Content-Type:image/*. Some images go unfiltered because they are sent as html/*, but it's worth it for me.

(Jan. 16, 2009 08:22 PM)sidki3003 Wrote:  Sure, worth a shot, since the images are off-site anyway, and i doubt that a page script does a location check.
I'm not good with JavaScript yet... If there is even a small possibility of a JavaScript check, just forget the idea; it is good as it is Wink
Jan. 16, 2009, 09:28 PM
Post: #14
RE: Filtering ads in RSS feeds?
(Jan. 16, 2009 09:00 PM)lnminente Wrote:  But what is for sure is i cut it to 16x16, the same size as the favicons, that's the only difference :P

Ahh! I'm replacing (Paul Rupe's, i think) 18x18 with your 16x16. Thanks Smile!
Jan. 17, 2009, 12:38 AM
Post: #15
RE: Filtering ads in RSS feeds?
sidki3003 Wrote:eclipse, could you replace your list entry with the one below?
Code:
feedads.googleadservices.com
  $OHDR(Referer: \3)$ADDLST(Log-Rare,HDR_Out IncEx\tRef: \3 URL: \u)
  $JUMP(http://local.ptron/killed.gif)

At first I thought it simply wasn't working, but I'm seeing some strange behavior. That entry isn't working at all (doesn't filter ads), BUT THIS ONE DOES:
Code:
feedads.googleadservices.com  $JUMP(http://local.ptron/killed.gif)

I moved around the different header actions, so
$JUMP $OHDR $ADDLST
works, but destroys the URL field (becomes killed.gif)
Ordering $OHDR $JUMP $ADDLST doesn't work either (doesn't filter ads / URL field overwritten).

Am I doing something wrong? I'm very confused.

Also, the only way to get entries into the log is via the Opera preview, and it really brings my system to its knees.
EDIT - ONLY WHEN $JUMP IS LOCATED FIRST, it enters an infinite loop:
Code:
GET 5682 : Cache-Control killed: no-cache
BlockList 5682: in User-Agents, line 51
JumpTo: http://local.ptron/killed.gif
BlockList 5683: in IncludeExclude-U, line 890
GET 5683 : User Keywords: .
GET 5683 : Cache-Control killed: no-cache
BlockList 5683: in User-Agents, line 51
JumpTo: http://local.ptron/killed.gif
BlockList 5684: in IncludeExclude-U, line 890
GET 5684 : User Keywords: .
GET 5684 : Cache-Control killed: no-cache
BlockList 5684: in User-Agents, line 51
JumpTo: http://local.ptron/killed.gif
BlockList 5685: in IncludeExclude-U, line 890
GET 5685 : User Keywords: .
GET 5685 : Cache-Control killed: no-cache
BlockList 5685: in User-Agents, line 51
JumpTo: http://local.ptron/killed.gif
BlockList 5686: in IncludeExclude-U, line 890
GET 5686 : User Keywords: .
GET 5686 : Cache-Control killed: no-cache
BlockList 5686: in User-Agents, line 51
JumpTo: http://local.ptron/killed.gif


Anyway, the few I caught are of the form:
_http://feedads.googleadservices.com/~at/ *some random string* /i
Maybe that's enough to narrow it down (if you need the log, I'll wait to get some more hits).
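
A narrower match on that URL shape could look like the following Python sketch (a regular expression, not Proxomitron matching syntax). The pattern is an assumption built only from the URLs posted in this thread: "~at" or "~a", a token of letters, digits, "_" and "-", a trailing /i (image) or /a (link), an optional port and an optional query such as ?tr=1. Other shapes may exist.

```python
import re

# Assumed URL shape, derived from the examples posted in this thread.
FEED_AD_URL_RE = re.compile(
    r"^https?://feedads\d*\.googleadservices\.com(?::\d+)?"
    r"/~at?/[A-Za-z0-9_-]+/[ai](?:\?.*)?$"
)


def is_feed_ad_url(url: str) -> bool:
    """Return True when the URL looks like a feed-ad image or link."""
    return FEED_AD_URL_RE.match(url) is not None
```

Matching on the path as well as the host leaves other documents on the same host reachable, which is the narrowing asked for earlier in the thread.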

I don't have hits in "subscribed mode" which is kind of odd. Only preview by opera and only a few of the feeds (the ones that display a little box instead of the image)

I hope something of what I said makes sense to you Wink