Sunday, April 21, 2013

Sending picture file in an email

Attaching a picture to an email is easy, from the user's perspective of course. But what is actually sent over the Internet to the recipient's server?

There are 2 ways a picture can be sent in an email: as an inline picture, or as an attachment. Before it is transferred, it goes through a binary-to-text encoding called Base64.

Say the image data starts with the bytes FF D8 FF. The translation process is as below:

Original bytes              FF D8 FF
Regrouped into 6-bit units  111111 111101 100011 111111
Decimal value per unit      63 61 35 63
Base64 index table          / 9 j /
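
The regrouping above can be reproduced in a few lines of Python. This is only a minimal sketch: the helper name base64_triplet is made up for illustration, and it handles just one complete 3-byte group (no padding).

```python
import base64

# Standard Base64 index table: A-Z, a-z, 0-9, '+', '/'
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_triplet(data: bytes) -> str:
    """Encode exactly 3 bytes into 4 Base64 characters (hypothetical helper)."""
    if len(data) != 3:
        raise ValueError("this sketch handles exactly 3 bytes")
    bits = "".join(f"{byte:08b}" for byte in data)      # 24 bits as text
    groups = [bits[i:i + 6] for i in range(0, 24, 6)]   # four 6-bit units
    indexes = [int(group, 2) for group in groups]       # each in 0..63
    return "".join(ALPHABET[i] for i in indexes)


jpeg_start = bytes([0xFF, 0xD8, 0xFF])
print(base64_triplet(jpeg_start))               # /9j/
print(base64.b64encode(jpeg_start).decode())    # /9j/ (standard library, for comparison)
```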


Inline image

To send the picture inline, the email message needs the following attributes (MIME headers).


Content-Type: image/gif; name="file name"
Content-Transfer-Encoding: base64
X-Attachment-Id: image-id
Content-ID: <image-id>

These are followed by the image data in Base64. The attributes and the image data are placed within the content type boundary.

In the email content, the image source refers to the image as


<img src="cid:image-id">
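
As a rough illustration, a message with these attributes can be built with Python's standard email package. This is a sketch under assumptions: the 6 placeholder bytes are only the GIF signature (not a displayable picture), the subject is made up, and the X-Attachment-Id header from the listing is not added here.

```python
from email.message import EmailMessage

# Placeholder data: just the GIF signature, not a real image.
gif_bytes = b"GIF89a"

msg = EmailMessage()
msg["Subject"] = "Inline picture"
# The HTML body refers to the image by its Content-ID.
msg.set_content('<img src="cid:image-id">', subtype="html")
# add_related() converts the message to multipart/related and appends the
# image part, Base64-encoded, with the given Content-ID.
msg.add_related(gif_bytes, maintype="image", subtype="gif", cid="<image-id>")

image_part = list(msg.iter_parts())[1]
print(image_part["Content-Type"])               # image/gif
print(image_part["Content-Transfer-Encoding"])  # base64
print(image_part["Content-ID"])                 # <image-id>
print(image_part.get_payload().strip())         # R0lGODlh
```

Printing msg.as_string() shows the complete message, including the boundary lines that enclose each part.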

Attachment image

For an image sent as an attachment (for download), the email message attributes differ from those of the inline image.


Content-Type: image/gif; name="file name"
Content-Disposition: attachment; filename="file name"
Content-Transfer-Encoding: base64
X-Attachment-Id: image-id

Similar to the inline image, the attributes are followed by the image data in Base64 format, enclosed within the content type boundary.


The attributes that differentiate the two are Content-ID, which only the inline image has, and Content-Disposition: attachment, which only the attachment image has.
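
For comparison, here is a minimal sketch of the attachment case using Python's standard email package. The placeholder bytes, subject and body text are assumptions for illustration, and the X-Attachment-Id header is not added by this code.

```python
from email.message import EmailMessage

# Placeholder data: just the GIF signature, not a real image.
gif_bytes = b"GIF89a"

msg = EmailMessage()
msg["Subject"] = "Picture attached"
msg.set_content("The picture is attached.")
# add_attachment() converts the message to multipart/mixed and appends the
# image part with Content-Disposition: attachment and the given file name.
msg.add_attachment(gif_bytes, maintype="image", subtype="gif",
                   filename="file name")

attachment = next(msg.iter_attachments())
print(attachment.get_content_type())           # image/gif
print(attachment.get_content_disposition())    # attachment
print(attachment.get_filename())               # file name
print(attachment["Content-ID"])                # None, no Content-ID here
```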


Saturday, April 20, 2013

Character set of a html document

To continue the previous post on character encoding, my actual topic of interest is how the browser detects which character set to use before rendering the page when none is specified in the html header.

Below is my "test code", which I saved in 4 different file formats (the additional one is Unicode).


<html>
<head>
</head>
<body>
</body>
</html>


Note that no character set or doctype is specified in the html header, so the page is rendered in quirks mode.

I tested in 2 different browsers, IE 9 and Firefox 20, and surprisingly got 2 different results. I used document.charset in IE and document.characterSet in FF to check the character encoding of the document.

File format           IE9          FF20
ANSI                  big5         windows-1252
Unicode               unicode      UTF-16
Unicode (big endian)  unicodeFEFF  UTF-16
UTF-8                 utf-8        UTF-8

For the ANSI file, FF20 shows mojibake characters; the page only displays correctly when the encoding of the browser is changed to Chinese Traditional (Big5). The rest are rendered fine (readable) by default.
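
The mojibake can be reproduced outside the browser. The sketch below assumes the page text is the character "一" (the body of the test page is not shown above) and that the ANSI file holds Big5 bytes; it simply decodes the same bytes with the two encodings from the table.

```python
text = "一"                      # assumed page content, for illustration
ansi_bytes = text.encode("big5")

print(ansi_bytes.hex(" "))                   # a4 40
print(ansi_bytes.decode("windows-1252"))     # ¤@  (what a windows-1252 reading gives)
print(ansi_bytes.decode("big5"))             # 一  (correct once Big5 is selected)
```

The bytes never change; only the table used to interpret them does.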

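So I added the charset attribute in a meta tag, and I also added the HTML 4.01 strict doctype. In the code below, <charset> stands for the value listed in the charset column of the table that follows.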


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=<charset>">
</head>
<body>

</body>
</html>

File format           charset  IE9          FF20
ANSI                  big5     big5         big5
Unicode               UTF-8    unicode      UTF-16
Unicode (big endian)  UTF-8    unicodeFEFF  UTF-16
UTF-8                 UTF-8    utf-8        UTF-8

Except that the ANSI file in FF20 changed to big5, the rest kept the same encoding. However, for the Unicode (big endian) file in IE9, the following warning message is observed:


HTML1114: Codepage unicodeFEFF from (UNICODE byte order mark) overrides conflicting codepage utf-8 from (META tag)

FF has no option to change the page encoding to UTF-16, and IE has no option for unicodeFEFF. I have no idea (yet?) why the document character set returns those results.
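
The precedence reported in the IE9 warning (byte order mark first, META tag second) can be sketched as below. This is a simplified guess at the ordering only, not the actual algorithm of either browser; the function name choose_encoding and the windows-1252 fallback are assumptions for illustration.

```python
import codecs
from typing import Optional

# Byte order marks and the encoding each one signals.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def choose_encoding(data: bytes, meta_charset: Optional[str] = None,
                    fallback: str = "windows-1252") -> str:
    """Pick an encoding: a BOM wins over the META charset (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return meta_charset or fallback


big_endian_file = codecs.BOM_UTF16_BE + "一".encode("utf-16-be")
ansi_file = "一".encode("big5")

print(choose_encoding(big_endian_file, "utf-8"))  # utf-16-be, the BOM overrides META
print(choose_encoding(ansi_file, "big5"))         # big5, no BOM so META is used
print(choose_encoding(ansi_file))                 # windows-1252, nothing declared
```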

From the above results, the recommended format for saving an html document is UTF-8, that is, if we are using characters outside the US-ASCII character set.


Friday, April 19, 2013

Character Encoding

My study object is the Chinese character "一".

I am using Notepad in Windows 7 to save in different formats, namely ANSI, Unicode and UTF-8. My system locale is Chinese (Traditional, Taiwan). I am using Traditional Chinese Google IME as input method.

In simple words, encoding represents "something" in some "notation", and decoding returns the "something" from that "notation". The notation I am referring to here is the binary representation.

I downloaded Hex Edit to observe the differences between the "encodings" or formats (as used in Notepad).

ANSI

ANSI usually refers to the ASCII character set plus an extended character set. See http://ascii-table.com/ansi-table.php. However, such a character set contains only 256 characters, so it is unable to cover Chinese characters. Sometimes, when saving Chinese text, Notepad prompts to save in another format, but at other times I am able to save it in ANSI format. As I searched for the mystery encoding behind this, I found that the file is actually saved in Big5 (Windows code page 950, the ANSI code page for my system locale), whose hanzi correspond to those of CNS 11643-1986. "一" in this case is represented by A4 40, which is the first character of hanzi level 1. See http://www.cns11643.gov.tw/ and http://libai.math.ncu.edu.tw/~shann/Chinese/big5cns-eng.html. (Too complicated to use these web sites)
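
The bytes A4 40 can be checked with Python's cp950 codec (Windows code page 950, Big5). The sketch below also tests which code pages can hold a given character; the helper name can_encode is made up, and linking a failed conversion to Notepad's prompt is only my guess.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the codec (hypothetical helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print("一".encode("cp950").hex(" ").upper())   # A4 40
print(can_encode("一", "cp950"))    # True, two bytes in code page 950
print(can_encode("一", "cp1252"))   # False, not in a 256-character Western set
print(can_encode("😀", "cp950"))    # False, such text cannot be saved as ANSI
```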

Unicode (Big Endian)

A quick reference: http://www.unicode.org/charts/. For my test case, Notepad saves "一" as the CJK unified ideograph 4E 00. However, as I observed in the hex editor, the file actually contains FE FF 4E 00. FE FF represents "zero width no-break space" (U+FEFF), used here as the byte order mark (BOM). A byte order mark appears at the start of every file saved as Unicode (FF FE for the little-endian Unicode format).
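
The same bytes can be produced in Python, as a quick check of what the hex editor shows. This is a minimal sketch; the little-endian line is added only for comparison.

```python
import codecs

text = "一"
print(f"U+{ord(text):04X}")             # U+4E00, the code point from the chart

big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(big_endian.hex(" ").upper())      # FE FF 4E 00
print(little_endian.hex(" ").upper())   # FF FE 00 4E

# The generic utf-16 decoder reads the byte order mark to pick the byte
# order and drops it from the result.
print(big_endian.decode("utf-16") == little_endian.decode("utf-16") == text)  # True
```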

UTF-8

Well, this seems like the "most popular" one; at least in one of our research projects on Japanese-related encoding, it is preferred to have the data saved as UTF-8, for easy translation to other encodings. It is a variable-width encoding that uses up to 4 bytes to represent a single character (the original design allowed up to 6). As with Unicode, the file starts with a byte order mark, here EF BB BF, which is the "zero width no-break space", followed by E4 B8 80, which is "一". See http://www.utf8-chartable.de/unicode-utf8-table.pl.
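
To see where E4 B8 80 and EF BB BF come from, here is a minimal sketch of the 3-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx). The helper name utf8_three_bytes is made up for illustration and covers only code points that need exactly 3 bytes.

```python
def utf8_three_bytes(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF with the 3-byte pattern (hypothetical helper)."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("this sketch handles 3-byte code points only")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


print(utf8_three_bytes(0x4E00).hex(" ").upper())   # E4 B8 80, "一"
print(utf8_three_bytes(0xFEFF).hex(" ").upper())   # EF BB BF, the byte order mark

# utf-8-sig is Python's name for UTF-8 with the leading byte order mark.
print("一".encode("utf-8-sig").hex(" ").upper())    # EF BB BF E4 B8 80
```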

I read somewhere online that Unicode is a character set, while UTF-8 is one of its character encodings. So Unicode assigns the character a code point, and UTF-8 encodes that code point into bytes? I guess that is the case.