The Un-Official Proxomitron Forum

Full Version: Cut: Chained Ad Path URLs
One of the new filters in the 2009 configs is "<script>: Cut: Chained Ad Path URLs", required to deal with concatenated scripts, which are becoming more and more popular:
<script src=",requiredscript.js,trackingscript.js"></script>


So far so good. However, lately i also see concatenated offsite scripts:
<script src=",requiredscript.js,http%3A//"></script>


The filter below tests each chained component against the complete ad-list combo (AdHosts-J, AdDomains, etc.). I'm not sure whether the recursive expressions are correct and sufficiently robust, hence "WIP".
Name = "<script>: Cut: Chained Ad Path URLs     9.03.20 (multi) [sd] (d.1)"
Active = TRUE
Multi = TRUE
URL = "$TYPE(htm)"
Bounds = "$NEST(<script\s,*src=$AV(*\?*,*)*,>)"
Limit = 1024
Match = "(*src=)\1$AVQ("
        ",+((https+%3a)+//($LST(AdHosts-J))\8$SET(a=$GET(a) AdHj \8)|((^(^http))|(^http))("
        "$LST(AdList)$SET(a=$GET(a) \9)|(^$TST(keyword=*.a_track_s.*))"
        "((^http|[/.])|((https+%3a)+//[^/?]+)+*[/=_-])($LST(AdPaths-J)(^[a-z0-9]))\8$SET(a=$GET(a) AdPj \8)"
        "%3Cspan class=%22ProxFly-Span%22>$GET(mHead) Chain URL:%3C/span>"
        "$ESC($GET(a))%3Cbr class=%22ProxFly-Br%22 />"
        "($TST(volat=*.log:2*)$ADDLST(Log-Main,[$DTM(d T)]\tWEB JS_Chain_URL\t$GET(a) \t\u)|)"
        "($TST(volat=*.log:[12]*)$ADDLST(Log-Rare,WEB JS_Chain_URL\t$GET(a) \t\u)|)"
Replace = "\1\2\@\3$SET(a=)"
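
For readers who don't speak Proxomitron, here is a minimal Python sketch of the idea only, not a translation of the filter above: split the chained src value, test each component on its own, and keep whatever is not flagged. AD_HOSTS, AD_PATH_KEYWORDS and the host ads.example.net are hypothetical stand-ins for the real AdHosts-J / AdPaths-J lists, and the comma separator is assumed from the examples at the top.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical stand-ins for the AdHosts-J / AdPaths-J block lists.
AD_HOSTS = {"ads.example.net"}
AD_PATH_KEYWORDS = ("tracking", "/ads/")


def is_ad_component(component: str) -> bool:
    """Return True if one chained component looks like an ad/tracking script."""
    decoded = unquote(component)  # turns http%3A// into http://
    if decoded.startswith(("http://", "https://")):
        host = urlsplit(decoded).hostname or ""
        if host in AD_HOSTS:
            return True
    return any(key in decoded.lower() for key in AD_PATH_KEYWORDS)


def cut_chained_ads(src: str, sep: str = ",") -> str:
    """Drop flagged components from a chained script URL, keep the rest."""
    kept = [part for part in src.split(sep) if not is_ad_component(part)]
    return sep.join(kept)


print(cut_chained_ads(",requiredscript.js,trackingscript.js"))
print(cut_chained_ads(",requiredscript.js,http%3A//ads.example.net/x.js"))
```

Both calls leave only the required component in place, which is the point of cutting per component instead of blocking the whole URL.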

The benefit of extending the filter as described becomes especially obvious if you look at the second filter hit (as well as the resulting script) on the latter example page, after adding the entry below (found via Ghostery) to AdHosts-J:
# Ads - Lotame
[^/]$SET(7=var LOTCC={add:function(){},addAction:function()
  &&(($TST(volat=*.log:[12]*)\8&$ADDLST(Log-Rare,ALST AdHj \8 \t\u))|*)

edit: "WIP" flag removed.
the above post is showing this line:
but if i view that line's source code, i see a &# 8203; (without the space) at the very end of that line...

is that 8203 supposed to be there?
i can't seem to find it in any HTML Code Table...
As long as the forum's code tag handles things correctly, all is fine. The real source code doesn't matter.
"&#8203;" is a zero-width space; it usually just marks a possible word break.
(Mar. 21, 2009 12:50 PM) sidki3003 wrote: As long as the forum's code tag handles things correctly, all is fine. The real source code doesn't matter.

that seems to depend upon your OS, or more specifically, your text editor...

i cut-and-paste the above via Notepad and cannot save the file, because the paste puts a "square character" in place of that 8203, and i get a "This file contains characters in Unicode format which will be LOST if you save this file as an ANSI encoded text file" warning when i try to save...

so i cancel the save and track down "why" - it's that 8203...
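
If the pasted text has to survive an ANSI save, the stray character can be removed before saving. A minimal Python sketch, assuming that Notepad's "ANSI" means Windows-1252 (true on Western-language Windows, not everywhere); the function names are made up for this example:

```python
ZERO_WIDTH_SPACE = "\u200b"  # the character behind the &#8203; entity


def strip_zero_width(text: str) -> str:
    """Remove every zero-width space from pasted text."""
    return text.replace(ZERO_WIDTH_SPACE, "")


def is_ansi_safe(text: str, encoding: str = "cp1252") -> bool:
    """Check whether text can be saved in the given single-byte encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


pasted = "a pasted list entry" + ZERO_WIDTH_SPACE
print(is_ansi_safe(pasted))
print(is_ansi_safe(strip_zero_width(pasted)))
```

The first check fails because of the trailing zero-width space; after stripping, the text encodes cleanly.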
Ahh okay, i didn't know that, thanks.

You should end up with a list entry that looks exactly as posted.
The line indents are especially important.
Hi Sidki, i found this in my logf (log file) of large urls:
Oh - thanks. :)

I haven't seen this script concatenation with separate query params thus far, only with a comma, once or twice with a semicolon, always within the same query param.

If this method is also used for adscript/required-script mixtures, it would be interesting to see how it's embedded in the page ("&" or "&amp;", etc.). It's important that chained ad paths are intercepted in the page code (vs. headers), because otherwise other anti-adscript filters could be triggered by a chained ad path, which would also remove required components.
Hi Sidki, i have a more general idea: i'm thinking we could create a filter which logs the names of the functions inside a script coming from an ad source. Later we process that log file and create a list of functions to block.

That way it wouldn't matter how the script is served to us. Also, scripts programmed to break pages if they are not loaded could be fixed by us instead of blocking the full script file.

Let me know what you think of this idea...
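
Roughly what the logging step could look like, as a Python sketch outside of Proxomitron. The regex only catches plain "function name(" declarations; anonymous functions and methods assigned to properties are ignored, and the sample script and its function names are invented for illustration.

```python
import re
from collections import Counter

# Matches named declarations such as "function showAd(" and captures the name.
FUNC_NAME = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\(")


def log_function_names(script_source: str, counts: Counter) -> list:
    """Collect declared function names from one script and tally them."""
    names = FUNC_NAME.findall(script_source)
    counts.update(names)
    return names


seen = Counter()
sample = "function showAd(a){} var x = function(){}; function trackHit (b) {}"
print(log_function_names(sample, seen))
print(seen.most_common())
```

Names that keep turning up across scripts from ad sources would be the candidates for a block list.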
Well, as far as sidki-configs are concerned, the approach is generally multi-layered where possible.
Regarding concatenated scripts:
1 - First try to cut ad/tracking paths.
2 - Then see if the individual modules have introductory comments which match a list of known ad/tracking comments.
3 - Then see if the contained function (or argument) names match an AdKeys-J entry.
4 - Then see if the function body contains ad strings.

I don't see a way around point 1. The original reason why i wrote this filter was to prevent subsequent "block scripts by URL" filters from matching and blocking the whole enchilada, required modules included. Besides, i like that filter. :)

I assume that you have something like point 3 in mind. That's fine, but, personally, i doubt that it's sufficient.
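
The layered order described above (points 2 to 4, applied to a module that survived the URL cut of point 1) can be sketched in Python like this. The three lists are hypothetical stand-ins for the real comment list, AdKeys-J and ad-string lists, and argument names from point 3 are left out to keep it short.

```python
import re

# Hypothetical stand-ins for the real lists.
AD_COMMENTS = ("ad tracker",)
AD_KEYS = ("adserve", "trackhit")
AD_STRINGS = ("ads.example.net",)


def classify_module(source: str):
    """Return the first layer that flags a script module, or None."""
    # Point 2: introductory comment at the top of the module.
    intro = re.match(r"/\*(.*?)\*/", source.lstrip(), re.S)
    if intro and any(c in intro.group(1).lower() for c in AD_COMMENTS):
        return "comment"
    # Point 3: declared function names.
    for name in re.findall(r"\bfunction\s+([A-Za-z_$][\w$]*)", source):
        if any(key in name.lower() for key in AD_KEYS):
            return "function name"
    # Point 4: ad strings anywhere in the body.
    if any(s in source.lower() for s in AD_STRINGS):
        return "body string"
    return None


print(classify_module("/* Ad Tracker v2 */ function init(){}"))
print(classify_module("function load(){ get('http://ads.example.net/x'); }"))
```

Each later layer only gets a chance when the earlier ones found nothing, which is why a name list alone leaves gaps.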
Which reminds me... there's an updated version of this filter, too:
Name = "Remove: Ad Functions I  - Names/Params     9.03.04 [jd sd] (d.2)"
Active = TRUE
URL = "($TYPE(htm)|$TYPE(js)|$TYPE(vbs))(^$TST(keyword=*.(a_ads|a_js|a_adjs|a_adfn1).*)|$TST(flag=*.adkey_j:[#*:0].*)|$TST(volat=*.clength:([#3:970]e|[#3:2400]).*))"
Limit = 32766
Match = "function$TST(script=([1s])\3*)"
        "\s(([^( ]++_|)$LST(AdKeys-J)([0-9_.:-][a-z0-9_.:-]+|))\8( $NEST(\(,\)))\4 {(^ })"
        "$SET(1=function \8\4 { return prxVoidV; /* PROX: Ad Function Blocked (Name) */ )"
        "$SET(2=Func Name)"
        "((\s[^( ]+ |)\( )\5"
        "(([^(),]++_|)$LST(AdKeys-J)([0-9_.:-][a-z0-9_.:-]+|(^[a-z])))\8($INEST(\(,\)))\4\) $NEST({, ?*,})"
        "$SET(1=function\5\8\4\) { return prxVoidV; /* PROX: Ad Function Removed (Argument) */ })"
        "$SET(2=Func Arg )"
        "|if \($TST(script=([1s])\3*)"
        " (([^()"']++[._]|)!+$LST(AdKeys-J)([0-9_.:-][a-z0-9_.:-]+|"|(^[a-z])))\8($INEST(\(,\)))\4\)( {+)\5"
        "$SET(1=if (0 /* PROX: Ad Routine Blocked (\8) */)\5)$SET(2=If Block )"
        "&($TST(volat=*.log:2*)$ADDLST(Log-Main,[$DTM(d T)]\tWEB JS_AdFunction I \2 \t\8\4 \t\u)|)"
        "%3Cspan class=%22ProxFly-Span%22>$GET(mHead) \2:%3C/span>"
        " $ESC(\8\4)%3Cbr class=%22ProxFly-Br%22 />"
Replace = "\1"
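
As a rough Python illustration of the first branch only (block by function name): find a named function whose name matches a key, walk to the matching closing brace, and swap the body for "return prxVoidV;". AD_KEYS is a hypothetical stand-in for AdKeys-J, braces inside strings or comments are not handled, and the argument and "if" branches of the filter are not covered.

```python
import re

AD_KEYS = ("adserve",)  # hypothetical stand-in for AdKeys-J

# Named function header up to and including the opening brace.
HEADER = re.compile(r"function\s+([A-Za-z_$][\w$]*)\s*(\([^)]*\))\s*\{")


def find_body_end(source: str, open_idx: int) -> int:
    """Return the index of the brace closing the one at open_idx, or -1."""
    depth = 0
    for i in range(open_idx, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1


def block_ad_functions(source: str) -> str:
    """Replace the bodies of functions whose names match an ad key."""
    out, pos = [], 0
    for m in HEADER.finditer(source):
        if m.start() < pos:
            continue  # nested inside a body that was already replaced
        name, params = m.group(1), m.group(2)
        if not any(key in name.lower() for key in AD_KEYS):
            continue
        end = find_body_end(source, m.end() - 1)
        if end < 0:
            continue  # unbalanced braces: leave the function alone
        out.append(source[pos:m.start()])
        out.append(f"function {name}{params} {{ return prxVoidV; }}")
        pos = end + 1
    out.append(source[pos:])
    return "".join(out)


js = "function adServe(x) { if (x) { show(x); } } function keep() { return 1; }"
print(block_ad_functions(js))
```

Only adServe is rewritten; keep() passes through untouched, so callers of the blocked function still find something to call.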
Nice!! That was exactly the idea, a list of forbidden functions. Veeeery well
Removing "WIP" flag from discussed filter...