Friday, April 19, 2013

Character Encoding

My object of study is the Chinese character "一".

I am using Notepad on Windows 7 to save the character in different formats, namely ANSI, Unicode and UTF-8. My system locale is Chinese (Traditional, Taiwan), and my input method is the Traditional Chinese Google IME.

In simple words, encoding represents "something" in some "notation", and decoding recovers the "something" from that "notation". The notation I am referring to here is the binary representation.
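A minimal sketch of that round trip in Python (Python is only my choice for illustration, and UTF-8 is picked here simply as one example notation):

```python
# Encoding turns text into bytes; decoding turns the bytes back into text.
text = "一"

encoded = text.encode("utf-8")                 # "something" -> "notation"
bits = " ".join(f"{b:08b}" for b in encoded)   # the binary representation
decoded = encoded.decode("utf-8")              # "notation" -> "something"

print(bits)
print(decoded == text)
```

The same character gives a different bit pattern under each encoding, which is what the rest of this post looks at.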

I downloaded Hex Edit to observe how the bytes differ between the "encodings", or formats, offered by Notepad.

ANSI

ANSI usually refers to the ASCII character set plus an extended character set. See http://ascii-table.com/ansi-table.php. However, such a character set contains only 256 characters, so it is unable to cover Chinese characters. Sometimes, when saving Chinese text in Notepad, it prompts me to save in another format (this happens when the text contains a character that the system code page cannot represent), but at other times I am able to save it in ANSI format. As I searched for the mystery encoding behind this, I found out that it is actually saved in Big5 (Windows code page 950), which is what "ANSI" means on a system whose locale is Chinese (Traditional, Taiwan). "一" in this case is represented by A4 40, which is the first character of the level 1 (frequently used) hanzi. Big5 is closely related to the national standard CNS 11643; see http://www.cns11643.gov.tw/ and http://libai.math.ncu.edu.tw/~shann/Chinese/big5cns-eng.html. (Too complicated to use these web sites)
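The A4 40 bytes can be checked without Notepad. This is a small Python sketch; "cp950" and "cp1252" are Python's codec names for the Traditional Chinese and Western European Windows code pages, and the `hex_dump` helper is my own, for display only:

```python
def hex_dump(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'A4 40'."""
    return " ".join(f"{b:02X}" for b in data)

text = "一"

# Code page 950 (Big5) has a two-byte slot for the character.
ansi_bytes = text.encode("cp950")
print(hex_dump(ansi_bytes))

# A 256-character Western code page has no slot for it at all.
try:
    text.encode("cp1252")
    fits_in_cp1252 = True
except UnicodeEncodeError:
    fits_in_cp1252 = False
print(fits_in_cp1252)
```

The second half is the same situation in which Notepad asks for another format: the character does not exist in the code page being used.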

Unicode (Big Endian)

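A quick reference: http://www.unicode.org/charts/. For my test case, "一" is the CJK unified ideograph U+4E00, so I expected Notepad to save it as 4E 00. However, as I observed in the hex editor, it is actually saved as FE FF 4E 00. FE FF is the code point U+FEFF, "zero width no-break space", used here as a byte order mark (BOM): it tells the reader that the bytes that follow are in big-endian order. It occurs at the start of every file saved as Unicode (big endian). What Notepad calls "Unicode" is really the UTF-16 encoding; the plain "Unicode" option is little-endian, so the file starts with FF FE and the character becomes 00 4E.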
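To see both byte orders side by side, here is a short Python sketch using the BOM constants from the standard `codecs` module (the variable names are mine):

```python
import codecs

text = "一"

# BOM followed by the character, in each byte order.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big_endian.hex())
print(little_endian.hex())

# The generic "utf-16" decoder reads the BOM, picks the byte order,
# and drops the BOM from the decoded text.
print(big_endian.decode("utf-16") == little_endian.decode("utf-16") == text)
```

Without the BOM, a reader would have to guess whether 4E 00 means U+4E00 or U+004E.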

UTF-8

Well, this seems like the "most popular" one; at least in one of our research projects on Japanese-related encoding, it is preferred to have the data saved as UTF-8, for easy translation to other encodings. It is a variable-width encoding that uses 1 to 4 bytes to represent a single character (the original design allowed up to 6 bytes, but the current standard stops at 4). As with Unicode, the file starts with the "zero width no-break space", here encoded as EF BB BF, followed by E4 B8 80, which is "一". See http://www.utf8-chartable.de/unicode-utf8-table.pl.
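How U+4E00 turns into E4 B8 80 is just bit packing. The function below is my own illustrative sketch of the UTF-8 layout, not how Notepad implements it, and it skips validation (surrogates and values above U+10FFFF are not rejected):

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack one Unicode code point into 1 to 4 UTF-8 bytes (no validation)."""
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

char_bytes = utf8_encode(0x4E00)               # "一"
bom_bytes = utf8_encode(0xFEFF)                # zero width no-break space
file_bytes = bom_bytes + char_bytes

print(file_bytes.hex())
# Python's built-in "utf-8-sig" codec writes the same BOM-prefixed bytes.
print(file_bytes == "一".encode("utf-8-sig"))
```

U+4E00 falls in the three-byte range, so its 16 bits are split 4 + 6 + 6 across the pattern 1110xxxx 10xxxxxx 10xxxxxx.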

I read somewhere online that Unicode is a character set, while UTF-8 is one of its character encodings. So, Unicode assigns the character a number (the code point, U+4E00 for "一"), and UTF-8 encodes that number into bytes (E4 B8 80)? I guess that is the case.
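The two layers can be kept apart in a few lines of Python. This is only a sketch to check my understanding; `unicodedata` is the standard library's view of the Unicode character database:

```python
import unicodedata

ch = "一"

# Layer 1, the character set: character -> code point (a number).
code_point = ord(ch)
label = f"U+{code_point:04X}"
name = unicodedata.name(ch)

# Layer 2, the encodings: the same code point -> different byte sequences.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")

# Decoding either byte sequence leads back to the same code point.
same = (ord(as_utf8.decode("utf-8"))
        == ord(as_utf16.decode("utf-16-be"))
        == code_point)

print(label, name)
print(as_utf8.hex(), as_utf16.hex(), same)
```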
