Path Blocking with Metacharacters
Aug. 10, 2015, 10:19 PM (This post was last modified: Oct. 10, 2016 04:07 PM by Faxopita.)
Post: #1
You can insert the following content either in the user.action file or in a separate file. In the former case, make sure it comes before the rest of the file's content; if you use a separate .action file, add its file name to Privoxy's config (or config.txt) file, and make sure the added line comes before the line that loads user.action. Note: exceptions should always come after generic rules, because Privoxy reads “from top to bottom”.

Code:
{{alias}}
  +enhanced-block = +block{Restrained Access} +limit-connect{80}

{ +enhanced-block }
#
# Domains provided by "Artisan Numérique"
# Source: http://artisan.karma-lab.net/premunir-spywebs-privoxy
#
   www.google-analytics.com/*
   *.xiti.com/*
   *.hit-parade.com/*
   *.toutlemondeenblogue.com/*
   visit.geocities.com/*
   *.yimg.com/*
   *.cybermonitor.com/*
   *.overture.com
   *.mybloglog.com
   *.webtrendslive.com
   adnext.fr
   *.quantserve.com/*
   stats.wordpress.com/*
   *.ixnp.com/*
   *.statcounter.com/*
   *.extreme-dm.com/*
   *.googlesyndication.com/*
   www.typepad.com/t/stats*
   *.sitemeter.com/*
   myustats.com/*
   *.reinvigorate.net/*
   *.clicktale.net/*
   *.hittail.com/*
   */xiti.js
   cetrk.com/*

# Ad Agencies/Ad Platforms.
#
   .netavenir.com
   .turn.com
   .bluestreak.com
   .criteo.com
   .blogbang.com/demo/js/blogbang_ad.php\?id=
   /.*\/microsoft_adcenterconversion\.js
   *.*.marketingsolutions.yahoo.com/*
   www.googleadservices.com/*
   .fmpub.net
   pubsrv.allopass.com/*
   .comclick.com
   .regieci.com
   .allo-audience.fr
   .audientia.net
   .clickintext.com
   .clickintext.net
   .intellitxt.com
   payperpost.com

# Recorders/Replayers
#
   .clicktale.*
   cetrk.com/*
   *.robotreplay.com/*
   /.*/clickheat.js

# Audience Measurement/Tracking
#
   .estat.com
   .sitemeter.com
   .w3counter.com
   .reinvigorate.net
   /.*\/webanalytics
   .opentracker.net
   .weborama.*
   .quantserve.com
   .performancing.com
  
   cnbc.com.ToutLeMondeEnBlogue.com
   stats.wordpress.com
   *.technorati.com/*
   embed.technorati.com/linkcount
   /.*xiti.js
   *.getclicky.com/*
   *.iminr.com/*
   .netprofitblueprint.com/*
   .converdge.com
   .cybermonitor.com
   my.blogitexpress.com/.*\.js
   www.atoomic.com/js/*
   .clustrmaps.com/counter/*
   .trackalyzer.com
   log.tf1.fr

# Page Ranking
#
   www.free-pagerank.com/fcgi-bin/alive_js.fcgi.*    
   external.wikio.fr/blogs/top/getrank
   www.pagerank.fr/pagerank-actuel.gif

# Loggers that track a little too much.
#
   .mybloglog.com

# Feed Tracking
#
   feedjit.com/*

# Special Google
#
   /.*utm.js
   /.*stat.*\.js
   /.*\/urchin.js
   /.*s_code.js
   /.*google-analyticator.*

# Nuisances
#
   .snap.com/*
   .ixnp.com/*
   .twitter.com/*
   .webreseau.com
   *.devfr.net/*
   badge.facebook.com/badge/*
   .blogbar.org

It's a good start. More path blocking coming soon…

If you like my contribution, please offer me a cup of tea via this Bitcoin address…

Code:
1HxxviDA5MybpewcyAmJ4JhfmYF9AE53xv

May your fighting spirits combined put the tracking industry and the super greedy ad tech down.
Aug. 11, 2015, 02:15 AM (This post was last modified: Aug. 11, 2015 02:15 AM by whenever.)
Post: #2
RE: Path Blocking Using Wildcard Characters
(Aug. 10, 2015 10:19 PM)Faxopita Wrote:  
Code:
www.google-analytics.com/*
   *.xiti.com/*
   *.hit-parade.com/*
   *.toutlemondeenblogue.com/*
   visit.geocities.com/*

Privoxy uses "Regular Expressions" for matching the path portion. I don't think the single "*" after the slash is valid RE.

If you meant ".*", it can simply be omitted.
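As a rough check of this point outside Privoxy, here is a minimal Python sketch. It assumes that a path pattern is matched from the first slash and is not anchored at the end, and it uses Python's `re` module as a stand-in for Privoxy's regex engine, so edge cases may differ. The helper name `path_matches` is made up for illustration.

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    """Return True if `pattern` matches a prefix of `path`.

    re.match anchors at the start only, which mimics (by assumption)
    a path pattern that is left-anchored but open-ended.
    """
    return re.match(pattern, path) is not None

samples = ["/", "/ga.js", "/collect?v=1&tid=UA-1"]

for path in samples:
    # In Python, "/*" compiles fine, but it means "zero or more slashes",
    # not "a slash followed by anything".
    print(path,
          path_matches("/*", path),
          path_matches("/.*", path),
          path_matches("/", path))
```

In this model all three patterns accept every sample path, which suggests that a trailing `/*` or `/.*` adds nothing to a bare host pattern.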
Aug. 11, 2015, 04:09 AM (This post was last modified: Aug. 11, 2015 04:10 AM by cattleyavns.)
Post: #3
RE: Path Blocking Using Wildcard Characters
Just:

Code:
www.google-analytics.com
   *.xiti.com
   *.hit-parade.com
   *.toutlemondeenblogue.com
   visit.geocities.com

is enough, unless we want:
Code:
www.google-analytics.com/.*?ga\.js

Plus:
We can match web-bug-like URLs with this rule:
Code:
/.{300}

Example: scorecardresearch.com/.....
Any URL whose path has 300 or more characters after the leading slash will get blocked.
This rule might cause false positives. Web bugs are the most dangerous tracking method, and I don't think we can block them completely.
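To see where the length threshold sits, here is a small Python sketch. It assumes the rule is matched from the leading slash of the path (query string included) and uses Python's `re` in place of Privoxy's matcher; `is_long_path` is a hypothetical helper name.

```python
import re

LONG_PATH_RULE = re.compile(r"/.{300}")

def is_long_path(path: str) -> bool:
    # re.match anchors at the start of the path only (assumed Privoxy-like).
    return LONG_PATH_RULE.match(path) is not None

short_path = "/" + "a" * 299  # 299 characters after the leading slash
long_path = "/" + "a" * 300   # 300 characters after the leading slash

print(is_long_path(short_path))
print(is_long_path(long_path))
```

Under these assumptions the rule fires as soon as 300 characters follow the slash, whatever those characters are, so long but harmless search or form URLs would be caught as well.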
Aug. 11, 2015, 10:20 AM (This post was last modified: Aug. 11, 2015 04:42 PM by Faxopita.)
Post: #4
RE: Path Blocking Using Wildcard Characters
Thanks to both of you. Indeed, there were some crude syntax errors that I hadn't revised in two years. Below is the revised version, based on your suggestions.

Code:
{ +block }
#
# Provided by "Artisan Numérique"
# Source: http://artisan.karma-lab.net/premunir-spywebs-privoxy
#
   .google-analytics.com
   .xiti.com
   .hit-parade.com
   .toutlemondeenblogue.com
    visit.geocities.com
   .yimg.com
   .cybermonitor.com
   .overture.com
   .mybloglog.com
   .webtrendslive.com
   .adnext.fr
   .quantserve.com
   .stats.wordpress.com
   .ixnp.com
   .statcounter.com
   .extreme-dm.com
   .googlesyndication.com
   .typepad.com/t/stats.*
   .sitemeter.com
   .myustats.com
   .reinvigorate.net
   .clicktale.net
   .hittail.com
   /xiti.js
    cetrk.com

# Ad Agencies/Networks
#
   .netavenir.com
   .turn.com
   .bluestreak.com
   .criteo.com
   .blogbang.com/demo/js/blogbang_ad.php\?id=
   /(.*/)?microsoft_adcenterconversion\.js
   .marketingsolutions.yahoo.com
   .googleadservices.com
   .fmpub.net
   .pubsrv.allopass.com
   .comclick.com
   .regieci.com
   .allo-audience.fr
   .audientia.net
   .clickintext.com
   .clickintext.net
   .intellitxt.com
   .payperpost.com

# Recorders
#
   .clicktale.*
   .cetrk.com
   .robotreplay.com
   /(.*/)?clickheat.js

# Audience Measurement/Tracking
#
   .estat.com
   .sitemeter.com
   .w3counter.com
   .reinvigorate.net
   /(.*/)?webanalytics
   .opentracker.net
   .weborama.*
   .quantserve.com
   .performancing.com
  
    stats.wordpress.com
   .technorati.com
   /.*xiti.js
   .getclicky.com
   .iminr.com
   .netprofitblueprint.com
   .converdge.com
   .cybermonitor.com
   .blogitexpress.com/.*\.js
   .atoomic.com/js
   .clustrmaps.com/counter
   .trackalyzer.com
   .log.tf1.fr

# Page Ranking
#
   .free-pagerank.com/fcgi-bin/alive_js.fcgi.*    
    external.wikio.fr/blogs/top/getrank
   .pagerank.fr/pagerank-actuel.gif

# Tracker Logger
#
   .mybloglog.com

# Feed Tracking
#
    feedjit.com

# Special Google
#
   /.*utm.js
   /.*stat.*\.js
   /.*/urchin.js
   /.*s_code.js
   /.*google-analyticator.*

# Nuisances
#
   .snap.com
   .ixnp.com
   .twitter.com
   .webreseau.com
   .devfr.net
   .facebook.com/badge
   .blogbar.org

# Webbugs
#
   /.{300}

Note: sometimes I feel more comfortable writing /(.*/)? instead of /.*/
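The two spellings are not quite interchangeable, which a short Python sketch can illustrate. It assumes left-anchored, open-ended matching and uses Python's `re` as a stand-in for Privoxy's engine; `path_matches` is a hypothetical helper.

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    # Prefix match from the leading slash (assumed Privoxy-like behaviour).
    return re.match(pattern, path) is not None

optional_dir = r"/(.*/)?clickheat.js"  # directory part may be absent
required_dir = r"/.*/clickheat.js"     # needs at least one directory level

for path in ["/clickheat.js", "/js/clickheat.js", "/a/b/clickheat.js"]:
    print(path,
          path_matches(optional_dir, path),
          path_matches(required_dir, path))
```

In this model only the `/(.*/)?` form also catches a script that sits directly in the site root.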
Aug. 11, 2015, 10:56 AM (This post was last modified: Aug. 11, 2015 10:58 AM by cattleyavns.)
Post: #5
RE: Path Blocking Using Wildcard Characters
My latest experiment with web bug blocking: this filter checks whether a URL contains something like:

Code:
http://12.123/2?=tttttttttttttkkkk&1?=0C92C3423CA7811A61745F7ED2F6A013
Demo Regex101: https://regex101.com/r/vX3cS9/1

Some websites use JavaScript to generate an MD5 (32 chars) or SHA-1 (40 chars) hash from our information (user agent, plugins, date and time, time zone...), then send it to their server to log us. This rule is a very simple way to block that tracking method, which is a variant of web bugs.

Code:
/.*?=(?:.{32}|.{40})(?:$|&)

Like my /.{300} above, use this filter carefully.
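Why care is needed can be sketched in Python. The rule counts characters, not hex digits, so any parameter value that happens to be 32 or 40 characters long is hit. The sample paths are invented, `HEX_RULE` is a hypothetical stricter variant that nobody in this thread proposed, and Python's `re` only approximates Privoxy's matcher.

```python
import hashlib
import re

HASH_RULE = re.compile(r"/.*?=(?:.{32}|.{40})(?:$|&)")
# Hypothetical stricter variant (an assumption, not from the thread):
# accept hexadecimal digits only.
HEX_RULE = re.compile(r"/.*?=(?:[0-9a-fA-F]{40}|[0-9a-fA-F]{32})(?:$|&)")

def hits(rule, path: str) -> bool:
    return rule.match(path) is not None

seed = b"user-agent|plugins|timezone"
md5_value = hashlib.md5(seed).hexdigest()    # 32 hex characters
sha1_value = hashlib.sha1(seed).hexdigest()  # 40 hex characters

paths = {
    "md5 beacon": "/collect?fp=" + md5_value,
    "sha1 beacon": "/collect?fp=" + sha1_value + "&t=1",
    # An ordinary search term that happens to be 32 characters long.
    "plain search": "/search?q=how-to-configure-privoxy-actions",
}

for label, path in paths.items():
    print(label, hits(HASH_RULE, path), hits(HEX_RULE, path))
```

In this model the original rule blocks all three paths, while the hex-only variant lets the plain search through. Even the stricter form cannot tell a fingerprint from any other 32-digit hexadecimal value.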
Aug. 11, 2015, 12:11 PM (This post was last modified: Aug. 11, 2015 12:15 PM by Faxopita.)
Post: #6
RE: Path Blocking Using Wildcard Characters
(Aug. 11, 2015 10:56 AM)cattleyavns Wrote:  
Code:
/.*?=(?:.{32}|.{40})(?:$|&)

Like my /.{300} above, use this filter carefully.

For example, I had to protect wikipedia.org through…
Code:
{ +block{Web Beacon} }
   /.{300}
   /.*?=(?:.{32}|.{40})(?:$|&)

{ -block{Web Beacon} }
   .wikipedia.org

Otherwise, the results page returned after using Wikipedia's search field is blocked.
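The reason the exception has to come after the generic rules can be modelled in a few lines of Python. This is a deliberately simplified sketch, assuming that when several sections match a request the later one overrides the earlier ones; the rule tuples and helper names are hypothetical and do not reflect how Privoxy is implemented.

```python
import re

# Each rule: (blocks?, host suffix or None, path regex or None), in file order.
RULES = [
    (True, None, r"/.{300}"),
    (True, None, r"/.*?=(?:.{32}|.{40})(?:$|&)"),
    (False, "wikipedia.org", None),
]

def host_matches(suffix, host):
    # Rough stand-in for a ".wikipedia.org" host pattern.
    return suffix is None or host == suffix or host.endswith("." + suffix)

def path_matches(pattern, path):
    return pattern is None or re.match(pattern, path) is not None

def is_blocked(host, path, rules):
    blocked = False
    for blocks, suffix, pattern in rules:
        if host_matches(suffix, host) and path_matches(pattern, path):
            blocked = blocks  # a later matching section overrides earlier ones
    return blocked

search_path = "/w/index.php?search=" + "a" * 300
print(is_blocked("en.wikipedia.org", search_path, RULES))
print(is_blocked("en.wikipedia.org", search_path, RULES[::-1]))
```

With the exception listed last, the long Wikipedia search URL gets through; with the order reversed, the generic rules have the final word and it is blocked again.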
Aug. 11, 2015, 04:14 PM (This post was last modified: Aug. 11, 2015 04:43 PM by Faxopita.)
Post: #7
RE: Path Blocking Using Wildcard Characters
(Aug. 11, 2015 10:56 AM)cattleyavns Wrote:  Some websites use JavaScript to generate an MD5 (32 chars) or SHA-1 (40 chars) hash from our information (user agent, plugins, date and time, time zone...), then send it to their server to log us. This rule is a very simple way to block that tracking method, which is a variant of web bugs.

Good Lord! They even use hashing to spy on us!
Aug. 12, 2015, 10:46 AM
Post: #8
RE: Path Blocking Using Wildcard Characters
(Aug. 11, 2015 10:56 AM)cattleyavns Wrote:  
Code:
/.*?=(?:.{32}|.{40})(?:$|&)

Like my /.{300} above, use this filter carefully.

Dear Cattleyavns,

this request has been blocked according to the above rule:
Code:
http://ixquick.com/js/retina_mainpage.js?v=b6be3321f0250cbebf37ebb98b546e3c

Is it that kind of hash you were talking about that may act as a fingerprint?

Good day to all readers!
Aug. 12, 2015, 12:23 PM
Post: #9
RE: Path Blocking Using Wildcard Characters
(Aug. 12, 2015 10:46 AM)Faxopita Wrote:  
(Aug. 11, 2015 10:56 AM)cattleyavns Wrote:  
Code:
/.*?=(?:.{32}|.{40})(?:$|&)

Like my /.{300} above, use this filter carefully.

Dear Cattleyavns,

this request has been blocked according to the above rule:
Code:
http://ixquick.com/js/retina_mainpage.js?v=b6be3321f0250cbebf37ebb98b546e3c

Is it that kind of hash you were talking about that may act as a fingerprint?

Good day to all readers!

I don't think so. As far as I know, ixquick isn't an evil site, and this is a JS file, so I think it is safe.
Aug. 12, 2015, 12:51 PM (This post was last modified: Aug. 13, 2015 01:24 PM by Faxopita.)
Post: #10
RE: Path Blocking Using Wildcard Characters
On the other hand, this one…
Code:
http://plus.lefigaro.fr/fpservice/user_graph?appid=81325031242245596367369127435013&remote_id=261707&jsonp_callback=window.fpAuth.linksCheckIfUserExistsCallback
looks very suspicious… It was blocked as well, but that did not prevent me from reading the related article, and the web page is not broken.
Aug. 12, 2015, 03:26 PM
Post: #11
RE: Path Blocking Using Wildcard Characters
(Aug. 12, 2015 12:51 PM)Faxopita Wrote:  On the other hand, this one…
Code:
http://plus.lefigaro.fr/fpservice/user_graph?appid=81325031242245596367369127435013&remote_id=261707&jsonp_callback=window.fpAuth.linksCheckIfUserExistsCallback
looks very suspicious… It was blocked as well, but that did not prevent me from reading the related article, and the web page is not broken.

That one is okay too. I think we should drop my second rule and use only /.{300}.
The second rule is not really helpful, to be honest.
Aug. 13, 2015, 10:07 AM
Post: #12
RE: Path Blocking Using Wildcard Characters
(Aug. 12, 2015 10:46 AM)Faxopita Wrote:  
Code:
http://ixquick.com/js/retina_mainpage.js?v=b6be3321f0250cbebf37ebb98b546e3c

Is it that kind of hash you were talking about that may act as a fingerprint?

That's to prevent your browser from using an outdated cached version of the JS file. It's not for tracking, and it's safe to let it through.
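For illustration only, here is how a site might build such a cache-busting URL. This sketch assumes the `v` parameter is an MD5 of the file's content, which is a common convention but not something known about ixquick; `versioned_path` and the sample file contents are made up.

```python
import hashlib
import re

HASH_RULE = re.compile(r"/.*?=(?:.{32}|.{40})(?:$|&)")

def versioned_path(path: str, content: bytes) -> str:
    """Append an MD5 of the file content as a `v` query parameter."""
    return f"{path}?v={hashlib.md5(content).hexdigest()}"

# Two releases of the same script get two different URLs, so a browser
# cannot reuse a cached copy of the old one.
old = versioned_path("/js/retina_mainpage.js", b"console.log('v1');")
new = versioned_path("/js/retina_mainpage.js", b"console.log('v2');")

print(old != new)
print(HASH_RULE.match(new) is not None)
```

The value depends only on the file, not on the visitor, yet it has exactly the shape the hash rule looks for, which would explain the false positive.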
Aug. 13, 2015, 02:10 PM (This post was last modified: Aug. 16, 2015 02:50 PM by Faxopita.)
Post: #13
RE: Path Blocking Using Wildcard Characters
Hello Privoxy users,

I created this path blocking file. So far it has been very successful, for me at least, at blocking suspicious paths that were neither caught by the converted hosts file nor filtered properly by my .filter files. I have often been glad to have these path patterns loaded to block some nasty trackers. Anyone is warmly welcome to make this path blocking file far better than it is today. For your information, I rarely have to touch this file when I run into something that shouldn't be blocked; when I have a problem, it is mainly a .filter file issue, which I solve by creating an exception. Of course, if you visit a news article about, for example, a social network, it will be blocked because its path matches one of the patterns, but you can force Privoxy to let you through to the website!

Code:
{ +block{Restrained Access: Declined Paths} }
#
# Paths
#
  /(.*/)?bons?-?plans?

  /(.*/)?core/metrics?/
  /(.*/)?core(/ux/|-)

  /(.*/)?.*(campaign|comm?ercial|marketing|parte?n(er|aire?)|promo|social).*
  /(.*/)?.*(anti-?spam|bug.?snag|detect[^/]*browser|market.?place|zoneid=).*
  /(.*/)?.*(ads?.?loader|browser[^/]*detect|deal(_|-|s)|le.?guide|metrics).*
  /(.*/)?.*(ip=(\.?[0-9]+){4}|retarget|((sm|u)id|referr?er|server.?time)=).*
  /(.*/)?.*(iframe|use?r.?(agent|g?u?id)|lat.?lo?ng|(pub.?id|time.?zone)=).*
  /(.*/)?.*((ever|flash|super|mbie).?cookie|(language|resolution|screen)=).*
  /(.*/)?.*(ad.?module|aff?ill?iate|(browser|country)=|polls?|reff?err?al).*
  /(.*/)?.*(analytics?|(c|p)id=|finger.?print(s|e(d|r)|ing)?|interstitial).*
  /(.*/)?.*((brand|charset|cid|isp|MAC(.?add?r(esse?)?)|model|signature)=).*
  /(.*/)?.*(logge(r|d)|mailchimp|pixel|product.?ads?|track(er|ing)?).*
  /(.*/)?.*(-ads-?|ad.?manager|live.?chat|splash.?page|subscribe).*
  /(.*/)?.*((caid|vpid)(-|_|=|\.)).*

  /(.*/)?.*(chartbeat|cross.?sell|facebook|forester|mobiquo|sessioncam|yahoo).*
  /(.*/)?.*(brightcove|googleads|obelusmedia|tag(commander|man)|xiti|zendesk).*
  /(.*/)?.*(acymailing|bazaarvoice|boomr|cooladata|olark|omniture|trustpilot).*
  /(.*/)?.*(blueconic|bluekai|breadcrumb|freshdesk|dmptag|usabilla|nugg\.?ad).*
  /(.*/)?.*(adchemix|cedexis|segmentify|optincrusher|smartad|visual.?revenue).*
  /(.*/)?.*(adrum|gigya|hapyak|konverto|krux|linkedin|openx|parsely|proximic).*
  /(.*/)?.*(clickfunnel|disqus|google?.?plus|marocrank|optimizely|socket\.io).*
  /(.*/)?.*(captify|geo.?(ip|loc(at(e|ion|or))?|(profile?|service)s?|=)).*
  /(.*/)?.*(runcpa).*

  /(.*/)?sponsor(e?(d|s))?/
  /(.*/)?widgets?/social.*counts?/

{ +block{Restrained Access: Declined Javascript} +handle-as-empty-document }
#
# .JS Files
#
  /(.*/)?(java)?scripts?/xtcore.*\.js

  /(.*/)?(counts?|rokmedia(quer(y|ies))?|silverlight|tapestry.messages?|xtcore)\.js

  /(.*/)?.*(audience|boomerang|conversion|nagad|recomm?end(ation)?|rtb|zepto).*\.js
  /(.*/)?.*(ad.?bloc?k?|advert).*\.js

  /(.*/)?.*(analy(s|z)er?|chat.?box|counter|mouse|profile?|survey|sso|tag)[^/]*\.js
  /(.*/)?.*(click?|compteur|crm|monitor|radar)[^/]*\.js
  /(.*/)?.*(streamsense)[^/]*\.js

  /(.*/)?.*([^a-z]*ads|hitometer|injection|plusone|pub|social.*(pop-?up|tag)s?)\.js

  /(.*/)?([a-zA-Z0-9]+(-|_|\.))?i?stats?[^/]*\.(js|php)

  /(.*/)?bug\.(gif|jpe?g|png)
  /(.*/)?.*cookie.*\.js

{ -block }
  .thetrainline.com/Scripts/src/stationlist.js

The patterns above have all been matched during actual browsing; they were not invented for the sake of playing with regex. However, I must admit I haven't seen any string matching this pattern: MAC(.?add?r(esse?)?); it's there just in case…
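Anyone who wants to try these patterns before loading them can do so with a short Python sketch. It copies three of the patterns above, assumes left-anchored prefix matching, ignores any difference in case handling between Python's `re` and Privoxy, and uses invented sample paths; `first_hit` is a hypothetical helper.

```python
import re

PATH_RULES = [
    r"/(.*/)?.*(campaign|comm?ercial|marketing|parte?n(er|aire?)|promo|social).*",
    r"/(.*/)?.*(logge(r|d)|mailchimp|pixel|product.?ads?|track(er|ing)?).*",
    r"/(.*/)?([a-zA-Z0-9]+(-|_|\.))?i?stats?[^/]*\.(js|php)",
]

def first_hit(path: str):
    """Return the first pattern that matches `path`, or None."""
    for rule in PATH_RULES:
        if re.match(rule, path):
            return rule
    return None

samples = [
    "/assets/tracking/pixel.gif",
    "/js/site-stats.min.js",
    "/news/2015/social-network-privacy.html",
    "/css/main.css",
]

for path in samples:
    print(path, "->", first_hit(path))
```

The third sample shows the trade-off: an ordinary news article is caught simply because its path contains the word "social", which is the kind of case where an exception or a manual override is needed.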

Cattleyavns & Whenever, the baby is yours; tweak it the way you think it should be.

New additions and updates to come soon!
Aug. 17, 2015, 03:08 AM (This post was last modified: Aug. 17, 2015 03:08 AM by whenever.)
Post: #14
RE: Path Blocking Using Wildcard Characters
I think (.*/)?.* is just equivalent to .*, and the ending .* could be omitted.
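A quick, non-exhaustive way to check this claim is to compare the two forms over a handful of paths in Python. The keyword group is shortened for readability, the sample paths are invented, and the sketch assumes that patterns are not anchored at the end.

```python
import re

KEYWORDS = r"(chartbeat|pixel|xiti)"
verbose = re.compile(r"/(.*/)?.*" + KEYWORDS + r".*")
compact = re.compile(r"/.*" + KEYWORDS)

samples = [
    "/pixel", "/img/pixel.gif", "/a/b/xiti.js?x=1", "/static/chartbeat/",
    "/index.html", "/", "/px/el",
]

for path in samples:
    # Both columns should agree on every line if the two forms are equivalent.
    print(path, bool(verbose.match(path)), bool(compact.match(path)))
```

The two forms agree on every sample here, which is consistent with the claim. This is not a proof, and the shorter form also gives the regex engine less backtracking to do.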
Aug. 17, 2015, 03:28 AM
Post: #15
RE: Path Blocking Using Wildcard Characters
As far as I know, this is the Privoxy authors' standard idiom.
/(.*/)?ads/ is equal to the combination of:

/ads/
and
/.*/ads/
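One difference worth knowing: `/(.*/)?ads/` only matches "ads" as a whole directory name, whereas a lazy `/.*?ads/` also matches directory names that merely end in "ads". A minimal Python sketch, assuming left-anchored prefix matching and using invented sample paths (`path_matches` is a hypothetical helper):

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    # Prefix match from the leading slash (assumed Privoxy-like behaviour).
    return re.match(pattern, path) is not None

segment = r"/(.*/)?ads/"  # "ads" must be a complete directory name
lazy = r"/.*?ads/"        # "ads/" may appear anywhere, even inside a word

for path in ["/ads/banner.gif", "/static/ads/banner.gif", "/uploads/photo.jpg"]:
    print(path, path_matches(segment, path), path_matches(lazy, path))
```

In this model the lazy form would also block `/uploads/photo.jpg`, which may be one reason to prefer the `/(.*/)?` spelling.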