HTTPRequest Unicode Bug

armsys
Automation Wizard
Posts: 1108
Joined: Wed Dec 04, 2002 10:28 am
Location: Hong Kong

Post by armsys » Fri May 24, 2013 1:52 pm

When reading a Chinese webpage into a variable, the content is corrupted or misinterpreted.
HTTPRequest>http://www.mingpao.com/,,GET,,strHTML,,,,
Please help. Thanks.

Post by armsys » Fri May 24, 2013 2:21 pm

Marcus,
I have now found and confirmed another bug: HTTPRequest doesn't automatically write a byte order mark (BOM) to the text file.
As a result, other MS commands cannot detect the UTF-8 encoding.
Please help. Thanks.

Post by armsys » Fri May 24, 2013 2:24 pm

Marcus,
With respect to HTTPRequest>, would you please allow the user to enable or disable BOM insertion? It would be nice if the encoding could be set to UTF-8 automatically by default.
Thanks.

Marcus Tettmar
Site Admin
Posts: 7395
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK

Post by Marcus Tettmar » Fri May 24, 2013 4:09 pm

I will look into HTTPRequest and check Unicode compatibility.

If you want to write to a Unicode text file you can do so with:

Let>WLN_ENCODING=UNICODE
WriteLn>%SCRIPT_DIR%\temp.txt,wr,DATA

But the issue may be with HTTPRequest when it receives the page, not necessarily when writing to text file. If it's not already Unicode then the above won't help.

Don't worry, it's in our tracker, we'll look at it.
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar


Post by armsys » Fri May 24, 2013 9:44 pm

Marcus,
Thanks for your fast reply.
Thank you sincerely for your assistance.
Marcus Tettmar wrote:
Let>WLN_ENCODING=UNICODE
WriteLn>%SCRIPT_DIR%\temp.txt,wr,DATA

My issue, which should be a common routine, has nothing to do with WriteLn>.

Code Objective:
1. Grab a web page;
2. Remove all HTML tags with regex;
3. Export data of interest (DOI) to Excel for further numerical analysis.

I'm stuck at step 1.
Neither HTTPRequest nor ReadFile honors the WLN_ENCODING setting.
The following two-line script reproduces the bug:

HTTPRequest>http://www.mingpao.com/,,GET,,strHTML,,,,
MDL>strHTML

According to your post of Mon May 17, 2010 4:52 pm:
ReadFile/ReadLn looks at the header to see if the file is unicode or not and adjusts how it reads accordingly.

Evidently, HTTPRequest didn't insert the BOM automatically.
The text extracted by HTTPRequest is correct and readable if I insert a BOM manually with EditPad Pro.

Marcus, my report/request is simple:
Have HTTPRequest insert a BOM automatically by default.

armsys
Automation Wizard
Posts: 1108
Joined: Wed Dec 04, 2002 10:28 am
Location: Hong Kong

Post by armsys » Fri May 24, 2013 9:58 pm

Marcus,
Alternatively, would you be kind enough to show us how to insert a BOM into a text string with Macro Scheduler code?
Thanks.
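
For readers following along outside Macro Scheduler, here is a minimal Python sketch of what "inserting a BOM" amounts to at the byte level. It is not Macro Scheduler code, and the helper name add_utf8_bom is hypothetical. It also assumes the downloaded bytes really are UTF-8; a BOM prepended to data in another encoding would only mislabel it.

```python
import codecs

def add_utf8_bom(raw: bytes) -> bytes:
    """Return raw with a UTF-8 BOM prepended, unless one is already present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw
    return codecs.BOM_UTF8 + raw

# Stand-in for bytes fetched from a UTF-8 page (assumption for this sketch).
page = "明報".encode("utf-8")
with_bom = add_utf8_bom(page)

print(with_bom[:3] == codecs.BOM_UTF8)    # True: the file now starts with EF BB BF
print(add_utf8_bom(with_bom) == with_bom) # True: a second call adds nothing
```

The check for an existing BOM keeps the helper safe to call on data that already carries one.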

Post by armsys » Fri May 24, 2013 11:17 pm

The situation has become more complicated.
For the HKEJ.com webpage, WLN_ENCODING=UNICODE actually corrupts the original text.
Interestingly, the text file created by HTTPRequest is readable in Notepad even though it has no BOM.
But ReadFile corrupts the content.
Even worse, WriteLn in WLN_ENCODING=UNICODE mode corrupts the original text, namely Scn1.txt.

Let>HTTP_CHARSET=UNICODE
Let>WLN_ENCODING=UNICODE
DeleteFile>c:\temp\scn?.txt
HTTPRequest>http://www.hkej.com/template/landing11/jsp/main.jsp,C:\Temp\Scn1.txt,GET,,strHTML,,,,
ReadFile>C:\Temp\Scn1.txt,ClpText
MDL>ClpText
WriteLn>C:\Temp\Scn2.txt,result,ClpText
ReadFile>C:\Temp\Scn2.txt,ClpText
MDL>ClpText

Marcus,
I also discovered:
1. Not all Chinese web pages have the same problem.
2. Not all Chinese web pages have similar headers.

For example, for this particular webpage:
HTTPRequest>http://www.mingpao.com,C:\Temp\Scn1.txt,GET,,strHTML,,,,
my script above works perfectly. That is, WriteLn performs the Unicode conversion correctly.
Does the header play a role in corrupting the Unicode text?

The headers for http://www.mingpao.com and http://www.hkej.com/template/landing11/jsp/main.jsp are very different.

Post by armsys » Fri May 24, 2013 11:35 pm

In effect, WriteLn in WLN_ENCODING=UNICODE mode doubles the file size. It makes no sense. It merely inserts a space between each character. That's how it corrupts the whole text file.

Post by Marcus Tettmar » Sat May 25, 2013 6:24 am

I've already said we are investigating HTTPRequest. It is in the bug tracker. No need to repeat yourself.

I can't answer your other questions yet, but there are different encoding systems, which could explain why some pages are retrieved correctly and others are not. We will check Unicode support in HTTPRequest and make any changes necessary and within our power to ensure Unicode pages are saved with a BOM.

As for file size: of course a Unicode file will be twice the size of an ANSI file. Unicode takes twice the bytes.
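
The size doubling can be illustrated outside Macro Scheduler with a short Python sketch. It assumes that "Unicode" here means UTF-16, which this thread does not state explicitly. For plain ASCII text, every character becomes two bytes, and the second byte of each pair is zero, which may be what showed up as a "space between each character" when the file was viewed as single-byte text.

```python
text = "Hello"

ansi = text.encode("cp1252")       # one byte per character for this text
utf16 = text.encode("utf-16-le")   # two bytes per character for this text

print(len(ansi), len(utf16))       # 5 10
print(utf16)                       # b'H\x00e\x00l\x00l\x00o\x00'

# Viewing the UTF-16 bytes through a single-byte code page exposes the
# zero bytes as extra characters between the letters.
misread = utf16.decode("latin-1")
print(len(misread))                # 10
```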

Post by armsys » Sat May 25, 2013 6:52 am

Marcus,
Thanks.

Post by Marcus Tettmar » Thu Jun 06, 2013 4:44 pm

Update. I have been unable to determine any problem with HTTPRequest.

HTTPRequest correctly returns the raw data as provided to it.

As an example here is a Unicode test page I have created which has a BOM:

http://www.mjtnet.com/demos/unicode1.html

This document has a BOM (is saved as UTF-8) and the HTML sets the charset to UTF-8 and it contains some Chinese characters (from the CJK Compatibility block).

This test script requests the document three times: first to a file, then to a string without setting the charset, and finally to a string with the charset set to match the header's charset, so that the data comes back in a representable format.

Let>URL=http://www.mjtnet.com/demos/unicode1.html

//get the raw file, and display it in IE
HTTPRequest>URL,%SCRIPT_DIR%\unicode.html,GET,,data

IECreate>IE0
IENavigate>IE0,%SCRIPT_DIR%\unicode.html,res

//raw data to string
HTTPRequest>URL,,GET,,data
MessageModal>data

//this time interpret based on charset matching header charset
Let>HTTP_CHARSET=utf-8
HTTPRequest>URL,,GET,,data
MessageModal>data

Assuming you have a modern version of IE, you will see the page appear in IE with the Unicode characters intact. And if you look at the file you will see the BOM. Notepad correctly identifies it as UTF-8.

You asked if files saved by HTTPRequest include a BOM. The answer is: it depends whether one was there in the first place. If the HTML document has a BOM then the file saved by HTTPRequest will have a BOM (we simply download and store the raw data as any web browser does). If the document does not already have a BOM then WE DO NOT ADD ONE.

If you retrieve a page which does not have a BOM then the resultant file will not be Unicode. It may however have a charset, which a web browser uses to present the data correctly. Modern web browsers, as I understand it, will either use the BOM or the header charset or both.

Important point: Macro Scheduler is not a web browser, so it simply retrieves the raw data. A web browser reads and parses the HTML tags in order to present the information (the job of a web browser is to PRESENT it). Macro Scheduler does not, so you need to set the charset to match the header charset. [Note: I cannot vouch that we support ALL possible charsets, but we should certainly manage the main ones like UTF-8.]
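
To make the BOM versus charset distinction concrete, here is a rough Python sketch of the kind of decision a browser makes and a raw downloader does not. It is not how Macro Scheduler or any particular browser is implemented; the function name decode_page, the 2048-byte scan window, and the latin-1 fallback are all assumptions chosen for illustration.

```python
import codecs
import re

def decode_page(raw: bytes, default: str = "latin-1") -> str:
    """Decode raw page bytes using a UTF-8 BOM if present, else a declared charset."""
    # 1. A UTF-8 BOM, if present, settles the question.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # 2. Otherwise look for charset=... near the top of the markup.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    charset = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back rather than fail.
        return raw.decode(default, errors="replace")

# A page with no BOM that declares Big5 in its markup.
html = '<meta charset="big5"><p>明報</p>'.encode("big5")
print(decode_page(html))
```

Without step 2, the Big5 bytes would be read with the fallback encoding and come out garbled, which matches the symptom of a page that downloads correctly but displays as corrupted text.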

[Apologies: I think I misinformed you in a recent post where I said that setting the charset variable won't help. Actually it will, because as you can see in the above script, Macro Scheduler can use it to interpret the data. It doesn't parse the meta tags automatically to find the charset, but you can set it yourself, and if you set it to match the meta header tag it should convert the data correctly, assuming the charset is supported.]

Post by armsys » Thu Jun 06, 2013 10:34 pm

Marcus,
Thanks for taking so much time to troubleshoot the HTTPRequest issue I reported.
After reading your post above, I experimented with many permutations of scripts on the webpages I have trouble with.
My conclusion, in short: it's a mess, it's complicated, and it's not worth our effort.
My observations:
[1] The webpages in question use nonstandard localized charsets.
[2] The actual solution, if viable, requires correct settings for both charset and font.
[3] For example, your example is actually in Japanese, not Chinese (that isn't an issue here). To display the captured text correctly, it also requires a font such as Kozuka Gothic Pr6N-Japanese.

My solutions:
[1] HTTPRequest works flawlessly for standard UTF-8 or English webpages.
[2] For the nonstandard webpages, it saves us tremendous time and pain to capture the text with mouse selection and Ctrl+C in a Macro Scheduler script. The solution isn't elegant, but it is workable. Life isn't perfect. A solution is still a solution.
Marcus, thanks again.
