The Un-Official Proxomitron Forum

Full Version: Converting non latin characters to UTF-8 as Proxomitron can't read Unicode UTF-16
Pages: 1 2
There's something really weird about the specific folder (link removed as it's proper UTF-8 now - does someone have another example?) - all of the non-Latin characters (in this case Hebrew, Arabic and Russian) show up as gibberish in Proxomitron. But there's no gibberish when bypassing Proxomitron.

The SSH command "file" reports this about the HTML file:
Quote: Little-endian UTF-16 Unicode character data, with very long lines, with CRLF, CR line terminators
While it reports this about normal Unicode files:
Quote: UTF-8 Unicode HTML document text, with CR, LF line terminators

Why does Proxomitron break this folder? I can't even convince them it's a real problem because they don't have Proxomitron (and don't know anyone with it except me)... can you at least tell me what you think the admins did to create such weird files?

Anyway, I've tried running these filters but they don't match anything:
  1. Top Remove: Unicode BOM: HTML 6.12.09 (multi) [sj sd] (d.1)
  2. Top Remove: Unicode BOM: Other 7.10.28 (multi) [sj sd] (d.1)
  3. UTF-16 to UTF-8 Page Converter 7.01.06 (multi) [sj sd mona] (d.1)

Neither did (plus ASCII-Table.ptxt):
  1. <iframe>: Unicode to ASCII 7.01.06 (multi) [sd] (d.1 l.2)
  2. <iframe>: BASE16 to ASCII 8.11.21 (multi) [gz sd] (d.2 l.2)
  3. <a>: Unicode to ASCII 7.11.14 (multi) [sd] (d.1 l.3)
It looks fine to me. Whether I bypass Proxomitron or filter through it, I see the same thing.
So maybe the problem is one of your filters: http://prxbx.com/forums/showthread.php?tid=1173
You're a genius. It is caused by
  1. Kill pop-up windows v2
  2. Stop status bar scrollers
because they put their lines at the top of the page. Why is this folder so weird? How can I fix it on my side, and how can it be fixed on theirs?

Thanks!
I'm only curious, but thanks! :)

Who wrote these filters?
They're from the official config of the program. But the real point is to find out what's wrong with the encoding of this folder, and how to fix it from both sides.
Be warned: I don't view international sites, and I'm not even a Proxo user. But I am pretty good at HTTP and at guessing about things. :)

See the first attached TXT file, '_Raw_Bytes', which shows the raw byte dump of the page's headers and data. Ignore the "===" lines; they were inserted by my own proxy. You were right earlier about it being UTF-16 encoded, little-endian. It even starts with the BOM (Byte Order Mark) for that encoding: the first two bytes, FF FE, are the 16-bit little-endian BOM.
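That BOM check can be sketched in a few lines of Python (the byte values are the standard Unicode BOMs; the function name is just for illustration):

```python
def sniff_bom(data: bytes) -> str:
    """Guess the Unicode encoding of a byte stream from its BOM."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # FF FE: little-endian, as in the page discussed here
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # FE FF: big-endian
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"       # EF BB BF: the UTF-8 BOM
    return "unknown"         # no BOM; fall back to headers/meta tags

print(sniff_bom(b"\xff\xfe<\x00h\x00t\x00m\x00l\x00>\x00"))  # utf-16-le
```

This is essentially what the `file` command did above, just restricted to the BOM case.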

I'm guessing that if you did a View Source on the Proxo result for your page, the HTML portions would appear correct but the text shown to the user would be corrupted.

As I understand it, Proxo has some rudimentary filter capability for UTF detection and conversion. However, I don't think the UTF capability is "real"; it may be just a filter mechanism. I suspect that once it detects UTF it assumes Latin or ASCII, in that it may look at every other byte for a binary zero and wipe it out.

That semi-UTF conversion (if that is in fact what it's doing) works fine on a lot of content that is UTF-16 but probably didn't need to be. When you get into 'real' UTF-16 text, things break down quickly. In the attached illustration that starts to happen at offset hex 30, where the paired bytes no longer contain 00. That begins a region of data (not HTML tags) that seems to be a form of Unicode within the transport's UTF-16 encoding.
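A minimal Python sketch of that suspected "wipe every other zero byte" behavior (this is a guess at what Proxo does internally, not its actual code):

```python
def naive_utf16le_to_ascii(data: bytes) -> bytes:
    """Keep only the low byte of each UTF-16LE code unit.

    Mirrors the suspected Proxo behavior: assume the high byte is
    always 00 and simply discard it.
    """
    return bytes(data[0::2])

# Latin/ASCII content survives intact:
print(naive_utf16le_to_ascii("<b>hi</b>".encode("utf-16-le")))  # b'<b>hi</b>'

# Hebrew does not: the nonzero high bytes (0x05 here) are silently lost,
# leaving bytes that no longer mean anything in any single-byte charset.
print(naive_utf16le_to_ascii("\u05d0\u05d1".encode("utf-16-le")))  # b'\xd0\xd1'
```

The second call shows exactly the failure mode described above: the HTML tags come through fine, while the non-Latin text turns to gibberish.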

Perhaps more robust UTF decoding would look like the second attached TXT file. That one represents the decoding of the raw transport encoding, and within that you can see the Unicode data portions. I'm not sure how browsers are able to differentiate the character sets of the HTML vs. the Unicode data without a DOCTYPE or other indicator.

If my assumptions are true, then hopefully someone with more Proxo knowledge will know of a resolution for you. My limited-knowledge suggestion would be to bypass that page in Proxo, skip filtering and other transport modification, leave the stream as-is if possible, and let the browser sort it out per each viewer's international settings.
OK, I took a look. The page is broken because three scripts are inserted at the very beginning of the page instead of in the head. The culprits are:

1. Kill pop-up windows
2. Suppress all JavaScript errors
3. Stop browser window resizing

That's an old config; I suggest you move to the Sidki config and start it in a low-level mode.

Anyway, if you want to play with it, here is a possible solution. Replace their match code
Code:
(<!DOCTYPE*> |)\1
with
Code:
(<head*> )\1

- you may also need to enable multiple matches for them
We've already established that it happens with any header filter.

But I can't match on <head because Proxomitron can't read the source code there properly. That's the bottom line and the subject of this topic. The first question is why; the second is how it can be fixed on both sides.

Just to stress the point: while header filters crash such pages, non-header filters simply don't work because they don't match anything. It means that (assuming header filters are bypassed so they don't crash such pages) Proxomitron does not apply to such pages at all. It's as if you clicked Bypass for them. If such pages have ads, Proxomitron will display the ads. If such pages script a virus, you'd need to format your drive.
Yes, I knew the proposed changes would not match on that page, but the upside is that they won't break any page. So the problem of breaking pages is solved.

Now the other question is how to filter pages in UTF-16, and I think we simply can't; maybe in a future update of Proximodo, if we find a programmer...
Actually, you can filter UTF-16 with Proxomitron. I'm doing so.

However, you need a webfilter to convert the page (I don't have a config-independent version to post), *and* you lose any double-byte information that goes beyond the single-byte range. For the little-endian case that means the second byte is supposed to be x00. If it isn't, the double byte will be replaced by a dummy char.

Luckily, most little-endian and all big-endian pages I've seen do indeed use just one byte per character. But not your example. I once wrote a UTF-16 example page (little-endian) to test with Proxomitron.
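The lossy conversion described above might look roughly like this Python sketch (not the actual webfilter; the choice of '?' as the dummy char is an assumption):

```python
def utf16le_to_single_byte_lossy(data: bytes, dummy: int = ord("?")) -> bytes:
    """Convert UTF-16LE to single-byte text, substituting a dummy char
    for any code unit whose high byte is nonzero."""
    out = bytearray()
    for lo, hi in zip(data[0::2], data[1::2]):
        out.append(lo if hi == 0 else dummy)
    return bytes(out)

# HTML markup and Latin text pass through; the Hebrew alef becomes '?':
print(utf16le_to_single_byte_lossy("<p>abc \u05d0</p>".encode("utf-16-le")))
# b'<p>abc ?</p>'
```

Unlike the naive byte-stripping, this at least keeps the filterable HTML intact and makes the information loss visible instead of silent.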
Wow, hard filters Sidki, interesting work! :)
Guess what? It seems Google has the same limitation as Proxomitron. The only difference is that Google found a way to convert the non-Latin characters to UTF-8. So if you enter the page through Google's cache, you'll see it converted to UTF-8, but there's gibberish in the header. That gibberish happens even without Proxomitron, so it's not related to Proxomitron this time...
For the record, especially after the thread originator renamed the topic:

(Jun. 14, 2009 12:39 PM)bugmenot Wrote: [ -> ]Just to stress the point: while header filters crash such pages, non-header filters simply don't work because they don't match anything. It means that (assuming header filters are bypassed so they don't crash such pages) Proxomitron does not apply to such pages at all. It's as if you clicked Bypass for them. If such pages have ads, Proxomitron will display the ads. If such pages script a virus, you'd need to format your drive.

This is not only damaging to Proxomitron's reputation, it isn't correct either.
Without header filters, UTF-16 pages with non-Latin characters bypass Proxomitron. How can you deny that?
Without filters, Proxomitron is nothing but an empty sheet of paper.
With very old filters, your risk isn't dependent on the actual character encoding anyway.

With recent filters (and there are more good configs around than those discussed here), UTF-16 *will* be filtered as UTF-8, and you'll get a warning if something isn't kosher on the displayed page, so you can choose not to bypass Proxomitron if double-byte chars are missing.

If I got that correctly, these languages do need the second (or first, if big-endian) byte, so the limitation isn't coupled to Latin chars:
Hebrew, Hindi, Bangla, Khmer