Saturday, April 20, 2013

Character set of an HTML document

To continue the previous post on character encoding, my actual topic of interest is how the browser detects which character set is used before rendering the page when none is specified in the HTML header.

Below is my "test code", which I saved in 4 different file formats (the additional one is Unicode).


<html>
<head>
</head>
<body>
</body>
</html>


Note that no character set or doctype is specified in the HTML, so the page is rendered in quirks mode.

I tested in 2 different browsers, IE 9 and Firefox 20, and surprisingly got 2 different results. I used document.charset in IE and document.characterSet in FF to check the character encoding of the document.
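The check described above can be wrapped in a small helper so the same line works in both browsers. This is only a sketch: the function name getDocumentCharset is made up for illustration, and it assumes nothing beyond the two properties mentioned in this post.

```javascript
// Hypothetical helper (not part of any browser API): returns the encoding
// name a document reports. It tries document.characterSet first (the
// property used for Firefox in this post), then falls back to
// document.charset (the property used for IE).
function getDocumentCharset(doc) {
  if (typeof doc.characterSet === "string") {
    return doc.characterSet;
  }
  if (typeof doc.charset === "string") {
    return doc.charset;
  }
  // Neither property is available on this object.
  return null;
}
```

In a browser console, it would be called as getDocumentCharset(document).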

File format          | IE9         | FF20
ANSI                 | big5        | windows-1252
Unicode              | unicode     | UTF-16
Unicode (big endian) | unicodeFEFF | UTF-16
UTF-8                | utf-8       | UTF-8

For the ANSI file, FF20 shows mojibake; the text only displays correctly when the browser's encoding is changed to Chinese Traditional (Big5). The other files are rendered fine (readable) by default.

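So I added a meta tag that specifies the charset, along with the HTML 4.01 Strict doctype. In the code, <charset> is a placeholder for the value listed in the charset column of the table that follows.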


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=<charset>">
</head>
<body>

</body>
</html>

File format          | charset | IE9         | FF20
ANSI                 | big5    | big5        | big5
Unicode              | UTF-8   | unicode     | UTF-16
Unicode (big endian) | UTF-8   | unicodeFEFF | UTF-16
UTF-8                | UTF-8   | utf-8       | UTF-8

Except for the ANSI file in FF20, which changed to big5, the reported encodings stay the same. However, for the Unicode (big endian) file in IE9, the following warning message appears:


HTML1114: Codepage unicodeFEFF from (UNICODE byte order mark) overrides conflicting codepage utf-8 from (META tag)

FF has no option to change the page encoding to UTF-16, and IE has no unicodeFEFF option. I have no idea (yet?) why the document character set returns those results.
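One possible reading of the HTML1114 warning is that a byte order mark (BOM) at the start of the file takes precedence over the META tag, which would also fit the UTF-16 rows in the table staying unchanged. The sketch below only illustrates that precedence; the function names and returned labels are made up for this example and are not how either browser actually implements it. The BOM byte values themselves (EF BB BF for UTF-8, FE FF for UTF-16 big endian, FF FE for UTF-16 little endian) are standard.

```javascript
// Illustrative BOM sniffing. sniffBom and effectiveEncoding are
// hypothetical names for this sketch, not browser APIs.
function sniffBom(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "UTF-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "UTF-16BE"; // big endian
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "UTF-16LE"; // little endian
  }
  return null; // no BOM found
}

// Assumed precedence, based only on the warning text: a BOM, if present,
// wins over whatever the META tag declares.
function effectiveEncoding(bytes, metaCharset) {
  return sniffBom(bytes) || metaCharset || null;
}
```

For example, effectiveEncoding([0xFE, 0xFF, 0x00, 0x3C], "utf-8") yields "UTF-16BE" under this assumption, mirroring the case where the META tag said UTF-8 but the big endian file kept its own encoding.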

From the above results, the recommended format for saving an HTML document is UTF-8, at least when it uses characters outside the US-ASCII character set.

