Character set of a html document

To continue with the previous post on character encoding, my actual topic of interest is how the browser detect what is the character set is used before rendering the page when it is not specified in html header.

Below is my "test code", and I saved them into 4 different file formats (the additional one is Unicode).

<html>
<head>
</head>
<body>
一
</body>
</html>

Note, there's no character set or doctype being specified in the html header. It is rendered in quirks mode.

I tested in 2 different browsers, IE 9 and FireFox 20, surprisingly, I got 2 different results. I am using document.charset and document.characterSet for IE and FF respectively to check for the character encoding of the document.

File format	IE9	FF20
ANSI	big5	windows-1252
Unicode	unicode	UTF-16
Unicode (big endian)	unicodeFEFF	UTF-16
UTF-8	utf-8	UTF-8

FF20 is giving Mojibake characters, it only show correctly when the encoding of the browser is changed to Chinese Traditional (Big5). The rest are rendered fine (readable) by default.

So, I added the charset attribute in meta tag section. I also added the 4.01 strict doctype.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=<charset>"></head>
<body>
一
</body>
</html>

File format	charset	IE9	FF20
ANSI	big5	big5	big5
Unicode	UTF-8	unicode	UTF-16
Unicode (big endian)	UTF-8	unicodeFEFF	UTF-16
UTF-8	UTF-8	utf-8	UTF-8

Except ANSI file in FF20 is changed to big5, the rest remain same encoding. However, for Unicode (big endian) file in IE9, the following warning message is observed :

HTML1114: Codepage unicodeFEFF from (UNICODE byte order mark) overrides conflicting codepage utf-8 from (META tag)

There is no selection to change the page encoding to UTF-16 in FF, and there is no selection of unicodeFEFF in IE. I have no idea (yet?) why the document character set is returning those results.

From the above result, the recommended file format to have the html document to be saved is in UTF-8 format, that if we are using characters which is out of US-ASCII character set.

3-minute fever 三分鐘熱度

Search This Blog

Character set of a html document

Labels

Comments

Post a Comment