A simple reminder?
May. 28, 2008, 05:12 PM
Post: #1
A simple reminder?
Gang;

Hopefully, I've only forgotten how to do this, and you can all laugh at my advancing Alzheimer's. However, this has been bugging me for several months now, and I can't seem to figure it out. Here goes:

Many sites on the web, if not most of them, are now serving their pages with strange characters, or just as bad, with normal characters where they don't belong. The best example is a question mark being used to replace an apostrophe, or even just stuck wherever the page's author felt like putting one. (But in checking the source code, I see that sometimes a "page generator" is used, so I can't always blame the author.) This site itself is a prime example, as we'll see in a moment...

I have a filter that more or less converts those questionable question marks into apostrophes, which is what I'd usually expect to find in that position. Sadly, as often as not, the bogus question mark occurs in the middle of a link. D'oh! As you can guess, this screws up the link, big time. When I click it, I get what amounts to a 404 error. I check the address bar, and sure enough, bold as brass, there are a bunch of apostrophes, and no question marks.

Here's the promised example, straight from this very page's source code:

<link rel="stylesheet" type="text/css" href="http://prxbx.com/forums/css.php?theme=5" />

What I end up with, after my filter is applied, is:

<link rel="stylesheet" type="text/css" href="http://prxbx.com/forums/css.php'theme=5" />

And if you haven't guessed it, or you don't want to experiment for yourselves, that kills the stylesheet that applies the Forum's theme. What I get is a bastardized version of Wired News, and that's NOT a good thing.

So the question I put to you all is, how do I tell Proxo to not look at links of any sort when searching for potential replacements? I'm not posting my current filter, since I don't want to subconsciously influence how you might do the job.

TIA




Oddysey

I'm no longer in the rat race - the rats won't have me!
May. 28, 2008, 11:03 PM
Post: #2
RE: A simple reminder?
Don't worry, I'm pretty rusty as well! I've gotten more into the $TST() and $SET() functions these last few weeks, so I hope the following filter helps you.

Code:
[Patterns]
Name = "Questionable Question Marks Quencher"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 256
Match = "<([A-z])\0$SET(tag=yes)$SET(1=<\0)"
        "|>$SET(tag=)$SET(1=>)"
        "|\?(^$TST(tag=yes))$SET(1=')"
Replace = "\1"

This seems to work well:

Test:

Quote:
<link rel="stylesheet" type="text/css" href="http://prxbx.com/forums/css.php?theme=5" />

Hello?

<a href="http://www.google.com/?fakeparam"> <-- This is cool? </a>

Results:

Quote:
<link rel="stylesheet" type="text/css" href="http://prxbx.com/forums/css.php?theme=5" />

Hello'

<a href="http://www.google.com/?fakeparam"> <-- This is cool' </a>
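For anyone who wants to try the idea outside Proxo, here is a minimal Python sketch of the same approach: leave anything inside a tag alone and only swap question marks found in the text between tags. It is not Proxomitron syntax, the function name `quench` is made up for illustration, and it swallows each tag in one go rather than tracking a "tag" flag the way the filter above does.

```python
import re

# Either a whole tag (kept as-is) or a lone question mark (swapped).
TOKEN = re.compile(r"<[A-Za-z/!][^>]*>|\?")

def quench(html):
    """Turn '?' into an apostrophe, but only outside of tags."""
    def swap(match):
        text = match.group(0)
        return "'" if text == "?" else text
    return TOKEN.sub(swap, html)

sample = (
    '<link rel="stylesheet" href="http://prxbx.com/forums/css.php?theme=5" />\n'
    'Hello?\n'
    '<a href="http://www.google.com/?fakeparam">This is cool?</a>'
)
print(quench(sample))
```

Both URLs keep their question marks, while "Hello?" and "This is cool?" pick up apostrophes, which matches the Results shown above.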
May. 30, 2008, 12:54 PM
Post: #3
RE: A simple reminder?
i've noticed that eerie question mark killing some Yahoo sites (CSS won't load, site looks like a dung heap)...

please keep us posted with your end-result filter, i wouldn't mind seein' if it helps my Yahoo-snafu... make that, in a howl-like Yahoo intonation, "snnaaaffoooo"...
Jun. 09, 2008, 01:07 PM
Post: #4
RE: A simple reminder?
Oddysey Wrote:Many sites on the web, if not most of them, are now serving their pages with strange characters, or just as bad, with normal characters where they don't belong.
....snip....
This site itself is a prime example, as we'll see in a moment.........

hmmm...
I haven't really noticed this. Every now and then, maybe, but not to a large degree.
Not enough to bug me anyway.
Got some links?

Btw, This site looks fine to me.

ProxRocks Wrote:i've noticed that eerie question mark killing some Yahoo sites (CSS won't load, site looks like a dung heap)...

An eerie question mark in a url killing some Yahoo sites?
I don't yahoo much, but on occasion, I visit their news site.
Haven't noticed any problems there.

Got some links?

As for it looking like a dung heap, it is yahoo after all.

ProxRocks Wrote:"snnaaaffoooo"...
Nice.

z12
Jun. 10, 2008, 10:00 PM
Post: #5
RE: A simple reminder?
Mike,

Nice to see your smiling fingers typing here again, after so long!

For an example of what's going on, this very page you are now reading is prime, in my book. (Granted, not everyone gets to come over to my place and read over my shoulder, but you get the idea.)

As shown in my first post above, what I get, as a result of my crude filter, is that a link which carries a query string (the question mark separates the script's address from the parameters passed to it) is broken by virtue of the question mark being changed to an apostrophe. BTW, I didn't mention this before, but I get this exact same result on not just one, but both of my machines. One runs W98SE, the other runs XPSP2. Both have the same filter configuration, running under Proxo 4.5j.

If I bypass Proxo entirely, I get the exact same results, a question mark where there should be an apostrophe (or some other such character)..... again, on either machine. I interpret this to mean that nothing in my Proxo config is actually forcing the question marks to appear, they're coming down the pike from the innerweb. The dbug.. function proves this, pretty much beyond a doubt.

Now. I got curious one day, and decided to look at the Character Encoding. Lo and Behold, on every page where my filter fires, I see this in the <head> section:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

(On the Proxo Forums pages, it ends with a slash, " />", due to the XHTML syntax rules.)

What's this - UTF-8? D'oh! WTF?? So I write a filter to replace that with ISO-8859-1, which usually works properly. The operative word being "usually". Sigh. Only about half the time does this work.

So now my original question has mutated. Instead of getting rid of the question marks, which I now believe are indicative of the browser choosing an incorrect character set, I want to force all pages to be rendered in ISO-8859-1, or some other set that also displays correctly. (Possibly Windows 1252?) I've proven that using web filters alone is only partially effective, so I'm wondering if there's any way to do this with a header filter. Any takers?



Oddysey

Jun. 13, 2008, 12:06 PM
Post: #6
RE: A simple reminder?
Oddysey Wrote:...decided to look at the Character Encoding

...UTF-8
...ISO-8859
...Windows 1252

afraid i have to confess ignorance on "encoding"


z12 Wrote:Got some links? ie, where question marks are killing .css loads
i've only ever seen it on sports pages linked to from sports.yahoo...

here's one - http://www.dailynews.com/ci_9571406?source=rss
Jun. 13, 2008, 10:16 PM
Post: #7
RE: A simple reminder?
ProxRocks;

It's a lost art, here in the US, but character-encoding (and subsequently, character-sets) is the mechanism that allows non-English users to render text in their native language(s). There are only so many spaces allotted for characters: 128 in the original ASCII set, and 256 in the extended 8-bit sets such as IBM's code page 437 and Microsoft's Windows-1252. With the advent of Sagans of users worldwide, it became obvious that they wouldn't be paying very much money, if the only computers they could use were locked into the English language. Thus was born Unicode, and with it the Unicode Transformation Format, or UTF-8. Here's a Wikipedia page on the topic:

http://en.wikipedia.org/wiki/UTF-8#UTF-8

and we'll leave it at that. (For now...)

How that translates into problems for web users is that a browser must select a character set in order to properly render and display the textual information it's receiving. If that selection process is somehow bungled, then the user ends up with gibberish, or nearly so..... as in my case, where if it isn't an exact alphanumerical character, then it's rendered as a question mark. (Well, periods and commas come through OK, but most other punctuation marks get hammered.)
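The effect is easy to reproduce away from any browser. The short Python sketch below (sample text made up for illustration) takes one string, encodes it as UTF-8, and then reads the bytes back with a right and a wrong charset guess. It also shows the other failure described above, where a charset with no slot for a character falls back to a question mark.

```python
text = "It\u2019s caf\u00e9 time"   # curly apostrophe and e-acute

utf8_bytes = text.encode("utf-8")

# Right guess: decoded with the same charset it was sent in.
right = utf8_bytes.decode("utf-8")

# Wrong guess: the same bytes read as Windows-1252 turn into junk.
wrong = utf8_bytes.decode("cp1252")

# A charset with no slot for the character: it degrades to '?'.
lossy = text.encode("ascii", errors="replace").decode("ascii")

print(right)
print(wrong)
print(lossy)   # It?s caf? time
```

Note that the letters, periods and commas survive every one of these conversions; only the "fancy" punctuation and accented letters get hammered.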

~!~!~!~!~!~!~
Mike (and still ProxRocks);

If you are viewing this Forums page with Proxo bypassed, let me suggest a short experiment: Enable filtering, re-load this page, and look at the source code. Note the following line in the header section:

<link rel="stylesheet" type="text/css" href="http://prxbx.com/forums/css.php?theme=5" />

And just for the sake of completeness, note this line too:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Now, one at a time, institute these two filters:

Code:
Name = "rewrite bullshit question marks into apostrophes"
Active = TRUE
URL = "^$LST(QuestionMarkList)"
Limit = 16
Match = "([a-z0])\1\?([rst])\2"
Replace = "\1'\2"

Re-load this page, with the above filter active. (Remember, Proxo is set to not bypass these forums.) When I do this, I get a very ugly screen:

[Image: forums-uglygf.gif]

This was due to the fact that the stylesheet wasn't loaded. The line that calls this now looks like so:

<link rel="stylesheet" type="text/css" href="http://prxbx.com/forums/css.php'theme=5" />

It was the t of "theme" following the question mark that triggered the filter. Sadly, this happens a lot more than I'd like; these forums aren't the only place that suffers like this.
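To see why the stylesheet link gets caught, here is a rough Python equivalent of that Match/Replace pair. It is an approximation for illustration only, not Proxomitron syntax, and the name `apostrophize` is invented. The pattern wants a letter (or 0), then a question mark, then r, s or t, and "php?theme" fits that just as well as "That?s" does.

```python
import re

# A letter or 0, a literal '?', then r, s or t.
PATTERN = re.compile(r"([a-z0])\?([rst])", re.IGNORECASE)

def apostrophize(html):
    """Swap the '?' for an apostrophe wherever the pattern matches."""
    return PATTERN.sub(r"\1'\2", html)

link = ('<link rel="stylesheet" type="text/css" '
        'href="http://prxbx.com/forums/css.php?theme=5" />')

print(apostrophize("That?s odd"))   # That's odd
print(apostrophize(link))           # the href now contains css.php'theme=5
```

The pattern has no idea whether it is looking at page text or at a URL inside a tag, which is the whole problem.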


So all of that was what led me to look at the encoding. Since UTF-8 is backwards compatible only with plain 7-bit ASCII, and not with 8-bit sets like ISO-8859-1 or Windows-1252, the trend towards dragging us Luddites kicking and screaming into the alleged "future" is disheartening, to say the least. Hopefully I/we can get Proxo to force the incoming datastream to stop boogering up my surfing experience!

Here's what I've got so far, in the Web Filters section:

Code:
Name = "force all pages to display in standard character set"
Active = TRUE
Multi = TRUE
Limit = 16
Match = "UTF-8"
Replace = "ISO-8859-1"

In the context of where this would be found on a webpage, it should be found only in the header, appearing somewhat like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

(Again, the ending slash is from this page's XHTML coding; most pages don't have it.)

When the filter fires, the line is changed as you'd expect, but by then it's too late, the damage is done - question marks appear where they shouldn't. But this Forums page isn't the best example for showing that, the PHP templates are still simple enough to let text render properly. Instead, let's look at this eBay page:

[Image: ebay-ugly.gif]

Of course, Proxo must be allowed to filter eBay.

I don't know what you're getting, but I see so many question marks that my eyes get bleary, mosh-kosh! This is yet another page where the UTF-8 charset is invoked, but even after Matching and Replacing that statement, the question marks still show up.

Yes, I'm willing to believe that it's something I've done on/to my machines and their respective software setups. But that would mean that nearly at the same time, I had to change something in W98SE and/or IE6 on my desktop, and in XPSP2 and/or IE7 on my laptop. That doesn't seem too likely, does it?

Can anyone help a brother out here?



Oddysey

Jun. 13, 2008, 11:07 PM
Post: #8
RE: A simple reminder?
Hi Oddysey.
Thanks for the "Welcome Back".

I was thinking this is a character set issue also.

I wonder what would happen if you just delete the meta character set tag?
After all, the character set is (and should be) specified in the HTTP header.
I would think that the browser should default to your character set if none were specified.

If that doesn't help, maybe just fixing the character set via a header filter would work.
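As a rough illustration of what such a header rewrite would have to do, here is a Python sketch that takes a Content-Type header value and forces its charset parameter. It is only a sketch of the string handling, assuming the header looks like the usual "text/html; charset=..." form; the function name `force_charset` is made up, and how this would be wired into a Proxomitron header filter is not shown.

```python
import re

def force_charset(content_type, charset="ISO-8859-1"):
    """Return a Content-Type value whose charset parameter is `charset`."""
    # Drop any existing charset parameter, then append the one we want.
    base = re.sub(r";\s*charset\s*=\s*[^;]*", "", content_type,
                  flags=re.IGNORECASE)
    return f"{base.strip()}; charset={charset}"

print(force_charset("text/html; charset=UTF-8"))  # text/html; charset=ISO-8859-1
print(force_charset("text/html"))                 # text/html; charset=ISO-8859-1
```

Keep in mind that relabelling the header does not convert the bytes themselves. A page that really contains multi-byte UTF-8 characters will show them as junk once the browser is told to read it as ISO-8859-1.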

I know IE can do strange things to (Injected) scripts if a meta character set occurs after the script is injected.
I've had to take some precautions with the meta character set tag to avoid issues.

I'm contemplating just deleting all meta tags as they seem useless anyway.
Except some meta refresh tags (redirects) that I convert to a link anyway.

I'm also seeing ad servers and trackers using meta tags now.
Probably for use by scripts.

At any rate, I see no issues with Firefox on this site.
The link ProxRocks provided looks ok too.
(Although it is UTF-8 according to the header.)

I wonder if this is an IE (version) issue?

z12
Jun. 14, 2008, 01:24 AM
Post: #9
RE: A simple reminder?
Mike;

The reason I don't think it's IE related at all, let alone which version, is because my W98SE has been cooking along for about 9 years now, and only in the last 8 or 9 months has the problem risen from the "why'd they do that" stage, up to "WTF is wrong with these people, anyway" stage.

The laptop was also good for the wife, before she handed it off to me, and that was probably 2 years of good service, then I had it for about a year before it started showing the same symptoms...... [headscratch]

Deleting meta tags shouldn't do it, because I'm already doing effectively the same thing in terms of Matching and Replacing the charset string. I'm now wondering if there's a Header variable that can affect the charset selection. I'm pretty certain, although I don't know it for a fact, that IE has a built in default that is, more or less, Windows 1252 (or 1252-1). It stands to reason that if no other changes have been made, then if the default stops working, there must be some kind of instruction coming in that's over-riding that default.

Do you, or does anyone, have a list of Header variables that might/do affect the charset selection? Any links to where I might find such a list? My google-fu seems to be weak on this one.


Oddysey

Jun. 14, 2008, 03:08 AM
Post: #10
RE: A simple reminder?
Oddysey Wrote:only in the last 8 or 9 months has the problem risen from the "why'd they do that" stage, up to "WTF is wrong with these people, anyway" stage.
LOL

Oddysey Wrote:Do you, or does anyone, have a list of Header variables that might/do affect the charset selection? Any links to where I might find such a list?

Off the top of my head, here's a few that might be relevant:

http://www.w3.org/Protocols/rfc2616/rfc2...ml#sec14.2

http://www.w3.org/Protocols/rfc2616/rfc2...ml#sec14.4

http://www.w3.org/Protocols/rfc2616/rfc2...l#sec14.12

http://www.w3.org/Protocols/rfc2616/rfc2...l#sec14.17

There might be more there.

After seeing this page above, I do seem to recall seeing some unusual "?" placement in some pages.

I unblocked these headers:
Code:
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Since doing so, I've had few issues.
Jun. 14, 2008, 06:36 PM
Post: #11
RE: A simple reminder?
Mike,

Well, looky here! I seem to have stumbled onto something.......

I'd like to say that you led me down the correct path, but I think I can safely say that I found this quite by accident - I'm pretty sure you wouldn't have told me to do this, yet again, but for some reason, I did it. It was a D'oh! moment, believe me!

I turned on the Log window, to see what encoding I was accepting. Turns out that it was always en-us, no matter which filters were enabled or not; even with Proxo entirely bypassed, this remained the same. But......

I found a particular filter was firing like about a thousand times for the particular eBay Search Results page. A closer look told me it was one I had written a long time ago, had tested and found it working to my satisfaction, and I filed it away as "done". Seeing it so often triggered, I decided, on a mere whim, mind you, to disable it, and see what happens. Wha'd'ya think, eh? Tha's right, the question marks disappeared!!!!!

Now I don't know about you, but when I tell Proxo to Replace something with a null value, I expect, and usually get, a blank spot on the webpage - no harm, no foul, that sort of thing. Not here. So, without further ado, let's look at the filter in question:

Code:
Name = "kill funny A character"
Active = FALSE
Multi = TRUE
Limit = 4
Match = "[Â]"

That code is inserting a question mark into my pages whenever IE's default character encoding set is over-ridden. If there's no <meta.... charset=.... instruction to the contrary, I see no problems.

It's obvious that the better solution here is to simply leave the filter disabled until I come across those furshlugginer [Â] characters again, in sufficient quantity to irritate me into writing yet another filter. And I can go back to not messing with character encoding sets, accept-language filters, etc. I do believe that simpler is better, at least most of the time.

Thanks again all, and Mike, don't make us come hunt you down again, OK?


Oddysey

Jun. 14, 2008, 06:48 PM
Post: #12
RE: A simple reminder?
Further to the above......

No, additional testing shows that I was premature in thinking I was done. Sigh.

I do need to force the pages to display in ISO-8859-1. By doing so, the Â disappears. (BTW, it's in brackets in my filter, or else Proxo can't "see" it, and never fires.) If I use the dbug.. feature, I see those characters in the HTML coding, in spades. If I let the UTF-8 charset meta command go through unmolested, then I see the funny Â's all over the bleepin' place. If I replace them with a null string (per my previously shown filter), then I get question marks. But if I instead just force the page to display in ISO-8859-1, I get neither bogus characters, nor question marks.
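One plausible reading of these symptoms, offered as a guess rather than something confirmed in this thread: the Â is the first byte (0xC2) of a UTF-8 non-breaking space, which is sent as the two bytes C2 A0, and the filter works on raw bytes rather than decoded characters. The Python sketch below replays the three cases under that assumption; the sample text is made up.

```python
page = "price:\u00a0$5".encode("utf-8")   # the NBSP goes out as bytes C2 A0

# Read as ISO-8859-1, the lead byte C2 shows up as a visible 'Â'.
as_latin1 = page.decode("latin-1")

# Strip the C2 byte, as a byte-level "kill Â" filter would.
stripped = page.replace(b"\xc2", b"")

# Still labelled UTF-8: the orphaned A0 is invalid, so the decoder
# substitutes a replacement mark (shown by many browsers as '?').
as_utf8 = stripped.decode("utf-8", errors="replace")

# Labelled ISO-8859-1 instead: the lone A0 is just a non-breaking space.
as_forced = stripped.decode("latin-1")

print(repr(as_latin1), repr(as_utf8), repr(as_forced))
```

Under that assumption, killing the Â and forcing ISO-8859-1 only work as a pair: either one alone leaves a stray Â or a stray replacement mark behind.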

At least one thing worked, that of modifying the charset value in the <meta.... tag. I was a bit bothered by that not working as I thought it should. Turned out to be working too well. D'oh!

I do believe that it's Miller Time!


Oddysey

Jun. 15, 2008, 02:32 PM
Post: #13
RE: A simple reminder?
Good to hear your problem is (kinda, sort of) solved.

Just curious, in IE under View-->Encoding do you see Unicode(UTF-8)?

z12
Jun. 16, 2008, 02:31 AM
Post: #14
RE: A simple reminder?
Odd, how are you "forcing" ISO-8859-1?
Jun. 17, 2008, 02:37 AM
Post: #15
RE: A simple reminder?
ProxRocks Wrote:Odd, how are you "forcing" ISO-8859-1?

Simple, like so:

Code:
Name = "force all pages to display in standard character set"
Active = TRUE
Multi = TRUE
Limit = 16
Match = "UTF-8"
Replace = "ISO-8859-1"

The only problem comes when I view discussions on this topic, then everything gets converted, and it's real difficult to follow the train of thought....... I could, and did at first, use Match="charset=UTF-8", and build my Replace string the same way, but I thought that was overkill. If I was really into it, I'd use Bounds and the whole <meta*> string, and then I'd be sure that text on the page would be unmolested. But truly I don't really see too much need, in my neck of the interwoods anyway.
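For comparison, the "bounded" version described above can be sketched in Python as follows. This is an illustration of the idea only, not Proxomitron syntax; the names `META`, `CHARSET` and `force_latin1` are invented. The outer pattern plays the part of Bounds by picking out whole <meta ...> tags, and the charset swap is applied only inside each one.

```python
import re

META = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
CHARSET = re.compile(r"(charset\s*=\s*[\"']?)utf-8", re.IGNORECASE)

def force_latin1(html):
    """Rewrite charset=UTF-8 to ISO-8859-1, but only inside <meta> tags."""
    return META.sub(
        lambda m: CHARSET.sub(r"\g<1>ISO-8859-1", m.group(0)),
        html,
    )

page = ('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n'
        '<p>We were discussing UTF-8 today.</p>')
print(force_latin1(page))
```

The meta tag comes out with charset=ISO-8859-1, while the sentence in the paragraph still says UTF-8, so a discussion like this one stays readable.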

However, upon thinking about it a second time, as I wrote this, I now realize that my name for this filter is wrong - it doesn't force every page, only those that specifically call for UTF-8. If the web gets much more pedantic with all its "but we gotta reach out to....." preachiness, then I will go back and inject this into every page, regardless of whichever charset was selected by the page's author.

HTH



Oddysey
