Oct. 31, 2008, 06:47 AM
A new proxy may be coming, but not for several months at least. Hopefully you won't see this post as spamming - The proxy is not currently available, there's no web site I can send you to, and there isn't even a formal name for it yet. The full version probably won't be freeware, though I hope to be able to produce a powerful subset that anybody can have.
I tried to attach a couple of images showing interesting statistics produced by the proxy, but perhaps only certain forum members can do that.
The main objective of the new proxy is probably security / privacy. I think most people may want to use it for ad blocking, but if doing so then I truly hope they'll 'accidentally' wind up being a bit more secure too.
There are at least two things that Proxo does that I will be shying away from:
1. I'd really rather not get into decrypting SSL content. We don't want this proxy to be able to see or manipulate things like people's bank accounts. However, we also can't ignore the fact that bad things can happen from crud that hides behind an SSL mask. So we did put in some special consideration for SSL into some of the proxy's blocking methods. For now that's about as far as I'm willing to go for SSL.
2. While it can do several kinds of forwarding to other proxies and servers, there won't be an option to specify multiple proxies to try. Different URL patterns can be sent to different proxies and/or gateways, but once a determination is made then the chosen routing is final. It won't try more than one proxy to see if they're operational, and it won't choose from a list of 2 or more equal choices.
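To illustrate the routing rule in item 2, here is a minimal Python sketch (not the proxy's own code or specification format). The route table, the upstream names and the `choose_route` helper are all made up for the example. The first matching pattern wins, and there is no fallback list and no health check afterwards.

```python
import re
from typing import Optional

# Hypothetical routing table: (URL pattern, upstream) pairs checked in order.
# The upstream host names are invented for this example.
ROUTES = [
    (re.compile(r"^https?://([^/]+\.)?example\.org/"), "proxy-a.internal:8080"),
    (re.compile(r"^https?://[^/]+/downloads/"), "gateway-b.internal:3128"),
]


def choose_route(url: str) -> Optional[str]:
    """Return the upstream for the first matching pattern, or None for direct.

    The decision is final: no second proxy is tried and nothing is picked
    from a list of equal choices.
    """
    for pattern, upstream in ROUTES:
        if pattern.search(url):
            return upstream
    return None
```

A URL that matches no pattern simply goes out directly (`None` here).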
Another difference is how users interact with the proxy. In Windows (the only version being built at this time) there's a tray icon that enables viewing a scrolling log window and menus that enable / disable various options. However, compared to Proxo, most user interaction is via the web interface of the proxy itself. It uses HTML and forms where Proxo has windows and dialog boxes.
We're now trying to throw together some documentation for it and that's much more difficult than anticipated. The proxy has too many options, many of which either require in-depth knowledge of HTTP or are unlikely to ever be truly useful to anybody. Early on it was simple to just throw in some quirky option, but in order to release it we're forced to choose between documenting them or removing them.
This is definitely not a Proxo clone by any stretch of the imagination. Honestly I've never been able to quite grasp the language of the Proxo specifications. I get lost trying to study some of the large Proxo configurations; sometimes I can't even figure out exactly what they were trying to accomplish. Chances are that people may hate our specification formats, but they make sense to me (and maybe only to me). It's likely that Proxo does some things much more easily and/or better than ours.
As a general rule, I prefer blocking requests based on domain and URL patterns rather than mucking around in server response content. When I spot something undesirable at some site, I'm much more prone to add that site to a block list rather than neuter its particular HTML / script issues. I don't ever want to scan through every HTML looking for links to doubleclick (whom I dislike for tracking). I think it's easier and faster to leave a site's page content alone as much as possible - and then block crud when the browsers ask to fetch from places I don't want.
Currently the proxy reflects my personal preference for blocking vs. manipulation. It has quite a few more methods to identify requests to prohibit than it has for content management.
I'm not totally against content filtering though. When Google turned on their 'Suggest' keylogger kind of crap, I quickly went for the kill without much contemplation. Out of anger, I then took out some of Google's 'Sponsored Links' as well. If Google turns OFF their 'Suggest' or restores it to opt-in, then I'll be happy to see all their 'Sponsored Links' again. And "No", I'm not going to permanently keep their cookies nor forge some for the purpose of muting their 'Suggest'.
Here are a few other tidbits and opinions that may be popular for discussion:
In its default configuration, the proxy validates the data content of images for most popular types (GIF, JPG, PNG, etc). We found that it doesn't require much resource to do that, and it adds a beneficial layer to help defend the browsers. It's surprising how often something like a 'gif' file is actually not 'gif' content at all, even when it comes from 'respectable' places. The proxy also does validation of image MIME, occasionally discarding an image based on the Content-Type header presented by the server, or fixing a bogus MIME to reflect the actual image content. I wish browsers would pay more attention to what they should get vs. trying to interpret crap content.
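The kind of check described above can be sketched in a few lines of Python. This is only an illustration, not the proxy's actual code: it looks at the leading "magic" bytes of three formats, and the names `sniff_image` and `check_image` are invented for the example. A real validator would presumably look deeper than the first few bytes.

```python
from typing import Optional

# Leading signature bytes for a few common image formats.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_image(body: bytes) -> Optional[str]:
    """Return the MIME type implied by the content, or None if unrecognized."""
    for magic, mime in SIGNATURES:
        if body.startswith(magic):
            return mime
    return None


def check_image(declared_mime: str, body: bytes) -> Optional[str]:
    """Return the Content-Type to forward, or None to discard the image."""
    actual = sniff_image(body)
    if actual is None:
        return None  # labeled as an image, but the content is something else
    if declared_mime.split(";")[0].strip().lower() == actual:
        return declared_mime  # header and content agree
    return actual  # bogus header: report what the content really is
```

So a PNG served as `image/gif` goes out as `image/png`, and an HTML page served as `image/gif` is dropped.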
Regular expressions can be used in quite a few places, including URL patterns and data content modification. RegEx knowledge isn't always required because there's also plenty of Non-RegEx methods. Currently it's using a DLL of PCRE ( http://www.pcre.org/ ). I've used other RegEx libraries with the proxy, but PCRE seems efficient enough and is well maintained.
You can configure the proxy to read HOSTS files directly. Too bad the Windows operating systems (and others) don't handle large HOSTS efficiently, but this proxy certainly does. It hashes the entries and builds fast-scan buckets designed just for the purpose. Though compatible with any HOSTS file, the domain blocking goes beyond that. For example, a specification to block 'example.com' could also block 'www.example.com', 'other.stuff.example.com', etc. Some repetitive HOSTS file zone entries can be absorbed into other parent specifications having fewer zones.
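A rough Python sketch of the idea, for discussion only. It uses an ordinary hash set rather than the proxy's purpose-built buckets (whose design isn't described here), and the function names `load_hosts`, `is_blocked` and `prune` are made up. Skipping `localhost` is also an assumption of the sketch, since most HOSTS files carry that entry for a legitimate reason.

```python
def load_hosts(text: str) -> set:
    """Collect the host names from HOSTS-format text into a hash set."""
    blocked = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        # fields[0] is the IP address; everything after it is a host name
        for name in fields[1:]:
            name = name.lower().rstrip(".")
            if name != "localhost":
                blocked.add(name)
    return blocked


def is_blocked(host: str, blocked: set) -> bool:
    """True if the host or any of its parent domains is in the set."""
    labels = host.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


def prune(blocked: set) -> set:
    """Drop entries already covered by a blocked parent domain."""
    def has_blocked_parent(name: str) -> bool:
        labels = name.split(".")
        return any(".".join(labels[i:]) in blocked for i in range(1, len(labels)))
    return {name for name in blocked if not has_blocked_parent(name)}
```

Each lookup costs at most one set probe per label in the host name, no matter how large the HOSTS file is.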
Another optional feature caches IP addresses. I've been surprised how many sites have a low DNS TTL, so the cache feature helps boost the speed of browsing activities. When some goofy network engineer casts a low TTL for purposes of load balancing or fail-over, then we're better off choosing not to participate in the (usually) useless DNS re-queries. But if the proxy can't make a connection using an IP that it cached, then it will re-query DNS and retry the connection if appropriate. An in-process cache also seems to help the Windows threading by avoiding the OS call to (re)get the host server IP. It makes sense for common browsing where there is an HTML fetch followed by a burst of many images, CSS, scripts, etc. competing for resources.
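Here is a minimal Python sketch of that cache-then-retry behavior, under a few assumptions: the class name `IPCache` is invented, the cache deliberately ignores TTL and never expires on its own, and the resolver and connector are passed in as parameters so the logic can be exercised without a network. It is not how the proxy itself is written.

```python
import socket


class IPCache:
    """In-process host-to-IP cache that ignores DNS TTL."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve
        self._cache = {}

    def lookup(self, host):
        """Return the cached IP, asking the resolver only on a miss."""
        if host not in self._cache:
            self._cache[host] = self._resolve(host)
        return self._cache[host]

    def connect(self, host, port, connector=socket.create_connection):
        """Connect via the cached IP; on failure re-query DNS and retry once."""
        was_cached = host in self._cache
        ip = self.lookup(host)
        try:
            return connector((ip, port))
        except OSError:
            if not was_cached:
                raise  # the address was fresh, so re-querying won't help
            del self._cache[host]  # stale entry: forget it and ask again
            return connector((self.lookup(host), port))
```

The burst of image, CSS and script requests that follows an HTML fetch then hits the dictionary instead of the OS resolver.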
Logging needs to be plentiful. There's the scrolling Log Window showing URL usage and things like blocking status, a flexible in-memory tracking feature that shows usage grouped by host servers, and there's a feature to store hourly or daily activity into files. You really need to know exactly where you've been so that you can decide whether you ever want to go there again. Sorting the in-memory tracking by most-used host servers is helpful to spot usage that may need attention - they're usually the ones not accessed frequently (like trackers and malware sites) because their domains or IPs are not an integral part of the sites you were viewing.
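The in-memory tracking grouped by host could look something like this Python sketch. `HostTracker` and its methods are hypothetical names, and the real feature is described as more flexible than a plain hit counter.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostTracker:
    """Count requests per host server, in memory only."""

    def __init__(self):
        self.hits = Counter()

    def record(self, url: str) -> None:
        host = urlsplit(url).hostname
        if host:
            self.hits[host] += 1

    def most_used(self, n: int = 10):
        """Hosts with the most requests first."""
        return self.hits.most_common(n)

    def least_used(self, n: int = 10):
        """Hosts with the fewest requests first (ties sorted by name)."""
        return sorted(self.hits.items(), key=lambda kv: (kv[1], kv[0]))[:n]
```

Reading the sorted list from the low end is where the one-hit tracker domains tend to show up.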
Once the proxy blocks something, the next step (which is often overlooked) is to determine how to respond to the browser. With ours you can usually specify exactly what you want the browser to get, including things like substituting a file of content or redirecting to another URL. By default the proxy tries to give back content determined by what was asked for: A request for a blocked image returns a valid image, one for a script will return a valid script (consisting of one blank byte), SSL and a few others yield special HTTP response codes, etc. I think a proxy should try to act as a beneficial front-end to assist the client browsers; it definitely shouldn't throw any garbage back at them.
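As a sketch of those defaults in Python: several details below are assumptions rather than the proxy's real behavior. Guessing the request type from the URL's file extension, using 403 for SSL (CONNECT) and for everything else, and the `blocked_response` name itself are all invented for the example. The post only says that SSL and a few others get "special HTTP response codes".

```python
from urllib.parse import urlsplit

# A tiny 1x1 GIF to hand back in place of a blocked image.
PIXEL_GIF = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
    b"!\xf9\x04\x01\x00\x00\x00\x00"
    b",\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;"
)


def blocked_response(method: str, url: str):
    """Return (status, content_type, body) for a blocked request."""
    if method.upper() == "CONNECT":
        # SSL tunnel: nothing to substitute, so refuse with a status code.
        return 403, "text/plain", b""
    path = urlsplit(url).path.lower()
    if path.endswith((".gif", ".jpg", ".jpeg", ".png")):
        return 200, "image/gif", PIXEL_GIF  # a valid image
    if path.endswith(".js"):
        return 200, "application/javascript", b" "  # one blank byte
    return 403, "text/html", b"<html><body>Blocked</body></html>"
```

The point is that the browser always gets something well-formed for the kind of thing it asked for.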
Pipelining is really tough on a proxy, especially one that wants to interdict and filter on the user's behalf. It would have been SO much easier if the proxy could ask "may I have some more please" vs. having to detect and deal with request floods jamming in. Different browsers seem to have different rules about how they pipeline, but so far we've been able to keep dancing without having to detect who's doing it. In general, Firefox seems to have gotten better at it, and Opera can get downright aggressive at times. We could probably improve the proxy's methods for pipelining support, but in this arena I'm currently more concerned about maintaining future compatibility than I am about being tuned to today's environment.
Thanks for reading. I'm interested in hearing other opinions about proxy functionality in general.

Ask ProxRocks about eliminating images while gathering important financial data from a site that simply can't be s-canned. (Oh wait, be prepared for more than an earful!)


The essence of making a point is to clarify your intentions for the reader. If I have to stop what I'm doing (reading your prose) and go to the bottom of the post, click a link, wait for something to happen, then peruse that while it's "conveniently" covering part or all of my reading material, then futz around arranging the two (or more) windows, then the drive of your original posting is lost on me. Others may have the attention span necessary for this kind of shenanigan, but I, unfortunately, no longer have such - it's a common malady amongst us older farts, we seem to want things to be ever-more-easier.
), which most folks would account as a good thing. 
![[Image: rolley.gif]](http://i173.photobucket.com/albums/w51/fnulnu/nutz/smilies/rolley.gif)