Another Filtering Proxy
Nov. 22, 2014, 09:35 AM
Post: #1
Another Filtering Proxy
I have been working on "Another Filtering Proxy" for several days, as I need one on a 24/7 Linux gateway and I don't like Privoxy's filtering syntax.

This might not be the right place to post, but it is based on ProxHTTPSProxy. It doesn't even have a formal name yet. Any suggestions?

I want it to filter URLs, HTTP headers, and web pages.

So far only basic URL filtering is achieved:

- URL blocking, via Block.txt
- URL redirecting, via Redirect.txt
- filtering bypass, via Bypass.txt; for now it only sets a flag on the request

These files are like Proxomitron blockfiles, but support only regex syntax for now.

The above functions are implemented in URLFilter.py, where you can add your own filters by writing Python classes. For example, you could write a class that parses Adblock (ADB) blocking rules to block URLs.
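To make that idea concrete, here is a rough sketch of what such a class could look like. Everything here is hypothetical: the class name, constructor, and `blocked()` method are illustrations, not the actual URLFilter.py interface, and the rule translation covers only the basic ADB wildcards.

```python
import re

# Hypothetical sketch: class name, constructor, and blocked() are
# illustrations, not the actual URLFilter.py interface.
class ADBBlockFilter:
    """Block URLs matching simplified Adblock-style rules."""

    def __init__(self, rules):
        # Translate the basic ADB wildcards: '*' matches anything,
        # '^' matches a separator character
        self.patterns = []
        for rule in rules:
            regex = re.escape(rule).replace(r'\*', '.*').replace(r'\^', '[/?&=:]')
            self.patterns.append(re.compile(regex))

    def blocked(self, url):
        return any(p.search(url) for p in self.patterns)

f = ADBBlockFilter(['/adserver/*', 'doubleclick.net^'])
print(f.blocked('http://example.com/adserver/banner.js'))  # True
print(f.blocked('http://example.com/index.html'))          # False
```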

Run Proxy.py to start the program.

Happy filtering!


Attached File(s)
.zip  Proxy 0.1.zip (Size: 180.26 KB / Downloads: 796)
Nov. 29, 2014, 11:24 AM (This post was last modified: Dec. 28, 2014 10:48 AM by whenever.)
Post: #2
RE: Another Filtering Proxy
Here comes version 0.2.

+ It now has a name: AFProxy
+ Privoxy-style URL patterns for block, bypass, and filter URL matching
+ Basic header filtering: HeaderFilter.py
+ Basic web page filtering: PageFilter.py
+ Config auto-reload

You need a little Python knowledge to write your own filters, but Python code is very readable.

Header filter:

Code:
# assumes the module-level "import re" and a configured "logger"
class PrintReferer(HeaderFilter):
    "Print the Referer if it differs from the request domain"
    name = "Print Referer"
    In = False
    Out = True

    @classmethod
    def action(cls, req):
        # Extract the registered domain part of the Host header
        domain = re.match(r"(?:.*?\.)?([^.]+\.[^.]+)$", req.headers['Host']).group(1)
        referer = req.headers['referer']
        if referer:
            referer_host = referer.split('//')[1].split('/', maxsplit=1)[0]
            if not referer_host.endswith(domain):
                logger.info('Referer: %s', referer)

Web page filter:

Code:
class NoIframeJS(PageFilter):
    name = "Remove All JS and Iframe"
    active = True
    urls = (r'.badsite.org',
            r'andyou.net/.*\.jpe?g',
            r'gotyou.com/viewer\.php.*\.jpe?g')
    regex_subs = ((br'(?si)<(script|iframe)\s+.*?</\1>',
                   br'<!-- \1 removed by FILTER: no-js-iframe -->'),
                  )

regex_subs is for regex find-and-replace; string_subs is for literal string find-and-replace.
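To show what the two substitution styles do, here is a standalone sketch that applies tuples of the same shape by hand, using plain `re.sub` and `bytes.replace`; the actual application loop inside PageFilter may differ.

```python
import re

# Tuples shaped like the filter attributes above; this loop is only an
# illustration of the semantics, not AFProxy's own code.
regex_subs = ((br'(?si)<script\s+.*?</script>', br'<!-- script removed -->'),)
string_subs = ((b'http://', b'https://'),)

page = b'<p>hi</p><script src="a.js"></script><a href="http://x.com">x</a>'
for pattern, repl in regex_subs:      # regex find/replace
    page = re.sub(pattern, repl, page)
for old, new in string_subs:          # literal string find/replace
    page = page.replace(old, new)

print(page)
# b'<p>hi</p><!-- script removed --><a href="https://x.com">x</a>'
```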

Python version attached. EXE version download link: http://proxfilter.net/afproxy/AFProxy%200.2.zip


Attached File(s)
.zip  AFProxy 0.2.zip (Size: 181.92 KB / Downloads: 720)
Nov. 29, 2014, 02:15 PM
Post: #3
RE: Another Filtering Proxy
Thank you, I will try it and report any bugs I find.
Off-topic: I personally think Privoxy's syntax is really good; it makes filters fast to write, maintain, and share.
Nov. 30, 2014, 12:35 PM
Post: #4
RE: Another Filtering Proxy
Privoxy is pretty good. I just don't like that you have to define filters in one filter file and then apply them in a separate action file. I'd like them to be together.
Dec. 03, 2014, 09:07 PM
Post: #5
RE: Another Filtering Proxy
After a look at it: yeah, pretty good, but I'm only able to run the EXE version, not the Python version.
I personally prefer Python.

Do I need to install additional components for Python? I only have the components that ProxHTTPSProxy needs. Here is what I get when I try to run AFProxy.py:
Code:
Traceback (most recent call last):
  File "D:\gg\AFProxy 0.2\AFProxy.py", line 350, in <module>
    config = LoadConfig(CONFIG)
  File "D:\gg\AFProxy 0.2\AFProxy.py", line 41, in __init__
    self.PORT = int(self.config['GENERAL'].get('Port'))
TypeError: int() argument must be a string or a number, not 'NoneType'
Dec. 04, 2014, 01:09 AM
Post: #6
RE: Another Filtering Proxy
Please use the provided config file.
Dec. 04, 2014, 02:33 AM
Post: #7
RE: Another Filtering Proxy
Okay, I found my problem: it was the working directory. The .bat file I use to launch Python + AFProxy is not in the same folder, so the working directory was the .bat file's folder and AFProxy couldn't find config.ini. I forced it to use the AFProxy folder as the working directory and it's okay now.
Dec. 04, 2014, 03:27 PM
Post: #8
RE: Another Filtering Proxy
Hi whenever, do you think we can block images based on width and height? Is there a way to know an image's width/height without fully downloading it, or do we have to download the whole image and then use a Windows API to detect the size and block it?
I know we can write a filter to match width/height attributes in the HTML source, but that approach is really limited: a lot of ad images don't have width and height attributes in the HTML, so it is not a universal solution.

Plus, there is a known list of image resolutions that are commonly used for ads.
Dec. 05, 2014, 08:36 AM (This post was last modified: Dec. 05, 2014 08:42 AM by whenever.)
Post: #9
RE: Another Filtering Proxy
(Dec. 04, 2014 03:27 PM)GunGunGun Wrote:  Is there a way to know an image's width/height without fully downloading it?

I don't think such a way exists. That's why major ad-blocking software like ABP or Privoxy has to maintain a blacklist for blocking.

BTW, Version 0.3

+ URLFilter.py now supports multiple list files for each filter
+ Parses Privoxy action files (default.action, user.action) for URL blocking
* List files moved to the <Lists> directory


Attached File(s)
.zip  AFProxy 0.3.zip (Size: 217.03 KB / Downloads: 711)
Dec. 05, 2014, 09:05 AM (This post was last modified: Dec. 05, 2014 02:56 PM by GunGunGun.)
Post: #10
RE: Another Filtering Proxy
(Dec. 05, 2014 08:36 AM)whenever Wrote:  I don't think such a way exists. That's why major ad-blocking software like ABP or Privoxy has to maintain a blacklist for blocking.

After some tests I initially concluded that an image cannot be blocked unless it is fully downloaded; a partial download shows the image as 0x0 resolution. I tried downloading this image with my browser: http://upload.wikimedia.org/wikipedia/co...ck_big.jpg
I pressed ESC to stop the download, and View Image Info showed the size as 0x0, so I thought: yes, no way.


Edit: Oh no, my test above seems wrong. I downloaded that image with my browser, paused the download, and opened the partial file with the Windows image viewer: it shows the image with the correct width and height even though it is not fully downloaded yet.

http://i.imgur.com/0TONiBe.png

It reports 10109x4542, but I only downloaded 154KB of the image: http://i.imgur.com/aPGfz41.png

Maybe it is possible? If so, I think we can use an algorithm something like this: download part of the image first, then use an OS API to detect its width/height; if it matches a blocked size, block it; if not, download the full image and send it to the browser.
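For what it's worth, the "read only the first bytes" part is easy to do in pure Python, because PNG and GIF store their dimensions at fixed offsets near the start of the file (JPEG needs real parsing, so it is omitted here). A minimal sketch:

```python
import struct

def image_size_from_header(data):
    """Return (width, height) from the first bytes of a PNG or GIF,
    or None if the format is not recognized."""
    if data[:8] == b'\x89PNG\r\n\x1a\n' and len(data) >= 24:
        # PNG: IHDR width/height are big-endian 32-bit ints at offsets 16 and 20
        return struct.unpack('>II', data[16:24])
    if data[:6] in (b'GIF87a', b'GIF89a') and len(data) >= 10:
        # GIF: logical screen width/height are little-endian 16-bit ints at offset 6
        return struct.unpack('<HH', data[6:10])
    return None

# A handmade PNG signature + IHDR for a 10109x4542 image (the size in the test above)
png = b'\x89PNG\r\n\x1a\n' + b'\x00\x00\x00\rIHDR' + struct.pack('>II', 10109, 4542)
print(image_size_from_header(png))  # (10109, 4542)
```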


Update: the GDI library can get the image size. I tested it and it worked, great! Here is my test, please try it. So much hype right now! This may open a new future for ad-blocking software, because we could block ads much more effectively!

https://app.box.com/s/gtat69ntsdqksrcoy9sg

I included the source code in that archive. I think I will ask Privoxy's author to add this feature; I think it is possible. And Python can clearly get the image size too, because there are lots of ways to call OS APIs: http://www.google.com/search?q=get+image...e=0&nord=1

Update 2: Holy cow, it is so simple and definitely POSSIBLE; this has been doable for a long time: http://php.net/manual/en/function.getima....php#88793

Quote:As noted below, getimagesize will download the entire image before it checks for the requested information. This is extremely slow on large images that are accessed remotely. Since the width/height is in the first few bytes of the file, there is no need to download the entire file. I wrote a function to get the size of a JPEG by streaming bytes until the proper data is found to report the width and height:
And a demo using PHP cURL:
http://stackoverflow.com/a/7476094/3763937
Quote:I managed to answer my own question and I've included the PHP code snippet.

The only downside (for me at least) is that this writes the partial image download to the file-system prior to reading in the dimensions with getImageSize.

For me 10240 bytes is the safe limit to check for jpg images that were 200 to 400K in size.

Seems very exciting. What would we gain if we could block images based on width/height?
- A reliable banner-blocking method.
- Killing all web bugs that are 0x0, 1x1, or 2x2; web developers use web bugs to track us, and there is almost no way to block them all otherwise.
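Once the dimensions are known, the blocking decision itself is just a set lookup. The sizes below are only illustrative; a real deployment would maintain the list like any other blocklist:

```python
# Illustrative set of common banner and web-bug sizes; a real list would
# be longer and maintained like any other blocklist.
AD_SIZES = {(728, 90), (468, 60), (300, 250), (0, 0), (1, 1), (2, 2)}

def should_block(width, height):
    return (width, height) in AD_SIZES

print(should_block(728, 90))   # True  (standard leaderboard banner)
print(should_block(640, 480))  # False (an ordinary photo size)
```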

UPDATE 3: Works on PNG too; I will report on GIF later. Test: http://groups.csail.mit.edu/graphics/cla...MapBig.png

Update 4: WORKS ON GIF TOO! Test: http://www.physics.usyd.edu.au/~gekko/wr...8_2620.gif

What I did: Googled "big image", "big image "png"", "big image "gif"".


Fully downloading an image and then detecting its width/height is not efficient, because it wastes bandwidth, but I think it is acceptable; it is still better than displaying a GIF banner that eats browser resources. I think an OS API will let us detect width and height once the image is fully downloaded.

Quote: + Parse Privoxy actions files (default.action, user.action) for URL blocking
Thank you very much, very nice!
Dec. 08, 2014, 03:30 AM
Post: #11
RE: Another Filtering Proxy
I think this could be done as a URL or header filter, but I'm sorry, for now I don't have time to work on it.

There are still things on my to-do list to improve the AFProxy framework, which I think is more important given my limited spare time.
Dec. 08, 2014, 09:09 AM
Post: #12
RE: Another Filtering Proxy
I found a piece of software named WebCleaner, also written in Python. I hope you can mine it for useful features and that it saves you some time: http://webcleaner.sourceforge.net/
Dec. 08, 2014, 12:11 PM
Post: #13
RE: Another Filtering Proxy
I know it. In fact, I looked at all the Python filtering proxies I could find before reinventing my own wheel.

I know AFProxy falls short in many aspects, and I'm pretty sure it is beyond my ability to make AFProxy a fully functional program that meets everybody's requirements. It's more like a personal toy. :-)
Dec. 28, 2014, 10:50 AM
Post: #14
RE: Another Filtering Proxy
Version 0.4 (20141221)
--------------

* List files are no longer bundled and initialized in URLFilter.py; they are now globally available to other filters
+ Unfiltered content is streamed to the client instead of being buffered before sending
* Fixed config auto-reload
* Fixed Privoxy parsing (replaced '.*' in the host regex with '[^/]*' so it won't match the path string)
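To illustrate the last fix: when a host pattern is matched against the combined host+path string, a wildcard translated to `.*` can run across the `/` into the path, while `[^/]*` cannot. The pattern and URL below are made-up examples, not bundled rules:

```python
import re

# 'ad*.com' as a host pattern, translated two ways and matched against
# the host+path string (pattern and URL are made up for illustration)
target = 'adserver.net/redirect?to=site.com'

bad  = re.compile(r'ad.*\.com')     # naive translation: wildcard crosses '/'
good = re.compile(r'ad[^/]*\.com')  # the 0.4 fix: wildcard stays in the host part

print(bool(bad.search(target)))   # True  (false positive from the path)
print(bool(good.search(target)))  # False
print(bool(good.search('ads.example.com/banner')))  # True (real host match)
```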

Python version attached. EXE version download link: http://proxfilter.net/afproxy/AFProxy%200.4.zip


Attached File(s)
.zip  AFProxy_py 0.4.zip (Size: 218.23 KB / Downloads: 691)
May. 26, 2015, 06:22 AM (This post was last modified: May. 26, 2015 12:46 PM by cattleyavns.)
Post: #15
RE: Another Filtering Proxy
Hi!
Can you release a new version for urllib3 1.10.4? I get an error when trying to load a page through the proxy:
My Python version: 3.4.2
My urllib3 version: 1.10.4
Log: http://pastebin.com/g2mK1CW4
Site: http://www.ghacks.net
Problem: maybe this: https://github.com/shazow/urllib3/pull/544, possibly related: https://github.com/ml31415/urllib3/commi...2b19edad4e
Config: new lines:
Code:
[PROXY http://127.0.0.1:7777]
www.ghacks.net

The problem can be worked around by replacing headers with self.headers (that is, swapping urllib3's header dict for the BaseHTTPServer/http.server headers?), i.e. changing:

Code:
r = self.pool.urlopen(self.command, self.url, body=self.postdata, headers=headers, retries=1, redirect=False, preload_content=False, decode_content=False)

to:

Code:
r = self.pool.urlopen(self.command, self.url, body=self.postdata, headers=self.headers, retries=1, redirect=False, preload_content=False, decode_content=False)

I can load that page through the proxy now, but then I cannot load any page that does not go through an upstream proxy server, which is even worse.

Code:
[13:15:24] [P] "GET http://www.ghacks.net/" 200 11765

But for all other pages:
Code:
  File "D:\Downloads\Compressed\AFProxy_py 0.4\ProxyTool.py", line 128, in handle_one_request
    BaseHTTPRequestHandler.handle_one_request(self)
  File "C:\Python34\lib\http\server.py", line 386, in handle_one_request
    method()
  File "D:\Downloads\Compressed\AFProxy_py 0.4\AFProxy.py", line 242, in do_METHOD
    retries=1, redirect=False, preload_content=False, decode_content=False)
  File "C:\Python34\lib\site-packages\urllib3-1.10.4-py3.4.egg\urllib3\poolmanager.py", line 161, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "C:\Python34\lib\site-packages\urllib3-1.10.4-py3.4.egg\urllib3\connectionpool.py", line 523, in urlopen
    headers = headers.copy()
AttributeError: 'HTTPMessage' object has no attribute 'copy'
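The AttributeError above happens because urllib3 1.10.x calls `headers.copy()` on whatever it is given, and `http.client.HTTPMessage` (the type of `self.headers`, an `email.message.Message` subclass) has no `copy()` method. Converting the message to a plain dict before calling `urlopen` is one way around it; a minimal sketch (note that a plain dict collapses duplicate header names):

```python
from email.message import Message  # http.client.HTTPMessage subclasses this

msg = Message()
msg['Host'] = 'www.ghacks.net'
msg['Accept'] = '*/*'

# Message has no .copy(), which is exactly what urllib3 trips over
print(hasattr(msg, 'copy'))  # False

# A plain dict supports .copy(); duplicate header names would be collapsed
headers = dict(msg.items())
print(headers)  # {'Host': 'www.ghacks.net', 'Accept': '*/*'}
```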

I cannot fully fix this problem. I tried, but my fix is only partial, which is not really good for long-term use. I skipped configuring the upstream proxy through config.ini and instead added these lines right above the `r = ...` line:

Code:
if "ghacks" in self.host:
    self.pool = urllib3.proxy_from_url('http://127.0.0.1:7777/')
    headers = self.headers

It works, but it is not perfect: for the proxied host I use http.server's headers, but for everything else I use urllib3's header dict, and that will cause more trouble in the future. I hope you can fix it, because I am not the author of the software and cannot really understand the codebase; or just point me at what I should do. Thanks!

And with AFProxy as the proxy server I cannot load this page: https://github.com/shazow/urllib3/pull/544
Error:
Code:
This page is taking way too long to load.

Sorry about that. Please try refreshing and contact us if the problem persists.
Contact Support — GitHub Status — @githubstatus

Without AFProxy, I can.