C# Text Encoding and Transcoding in a few steps

While developing a tool that lets users enter a message in any language and print it, I discovered that I didn’t know enough about character encoding.

A character encoding maps each character of a set to a unique representation: the characters can be the 26 letters of the English alphabet, and the representations can even be the signals of Morse code. A byte can hold 256 distinct values, and computers originally used ASCII, which assigns only the first 128 of them. After a while that limit became too restrictive: computers had spread all over the world, and the need to support different languages led to the remaining values (128 to 255) being mapped to the characters of other languages. Loads of such encoding schemes (also called character maps or code pages) have been released over the years.

Still, one byte is not enough to hold the characters of all languages at once (Chinese, Russian and so on), so the Unicode model was created. Unicode was originally designed around 2 bytes per character, which gives 65,536 combinations; it has since grown beyond that, so UTF-16 uses one or two 2-byte code units per character and UTF-8 uses one to four bytes. The .NET Framework stores strings as UTF-16, which is what the UnicodeEncoding class implements. Here’s an encoding/decoding example using Cyrillic text.

[TestMethod]
public void Encoding_Test()
{
   string cyrillicText = "Мне очень понравилась ваша фотография и письмо";

   System.Text.ASCIIEncoding encodingASCII = new System.Text.ASCIIEncoding();
   System.Text.UTF8Encoding encodingUTF8 = new System.Text.UTF8Encoding();
   //UnicodeEncoding is UTF-16 (little-endian by default)
   System.Text.UnicodeEncoding encodingUNICODE = new System.Text.UnicodeEncoding();

   //encode the same string with each encoding
   byte[] textBytesASCII = encodingASCII.GetBytes(cyrillicText);
   byte[] textBytesUTF8 = encodingUTF8.GetBytes(cyrillicText);
   byte[] textBytesUnicode = encodingUNICODE.GetBytes(cyrillicText);

   //decode each byte array with the encoding that produced it;
   //ASCII cannot represent Cyrillic, so its line prints '?' in place of each letter
   Console.WriteLine("{0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(textBytesASCII));
   Console.WriteLine("{0}: {1}", encodingUTF8.ToString(), encodingUTF8.GetString(textBytesUTF8));
   Console.WriteLine("{0}: {1}", encodingUNICODE.ToString(), encodingUNICODE.GetString(textBytesUnicode));
}
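
To make the byte sizes concrete, here is a minimal sketch that is not part of the original test (the EncodingSizes class and its method names are made up for illustration). It assumes only the standard System.Text encodings and counts the bytes a short Cyrillic word and one character outside the first 65,536 code points need under each encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are not from the original post.
public static class EncodingSizes
{
    // Number of bytes the given encoding needs to store the text.
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetBytes(text).Length;
    }

    // Encode the text and decode it again with the same encoding.
    public static string RoundTrip(Encoding encoding, string text)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string word = "Привет";        // six Cyrillic letters
        string emoji = "\U0001F600";   // a single code point above U+FFFF

        // ASCII has no Cyrillic letters, so each one is replaced by '?'.
        Console.WriteLine("ASCII:  {0} bytes, round trip: {1}",
            ByteCount(Encoding.ASCII, word), RoundTrip(Encoding.ASCII, word));
        // UTF-8 and UTF-16 both need two bytes per Cyrillic letter.
        Console.WriteLine("UTF-8:  {0} bytes", ByteCount(Encoding.UTF8, word));
        Console.WriteLine("UTF-16: {0} bytes", ByteCount(Encoding.Unicode, word));

        // Above U+FFFF a character takes two UTF-16 code units (a surrogate pair),
        // so string.Length counts code units, not characters.
        Console.WriteLine("emoji Length: {0}, UTF-16 bytes: {1}",
            emoji.Length, ByteCount(Encoding.Unicode, emoji));
    }
}
```

The last line is the reason "Unicode uses 2 bytes" is only true for part of the character set: a string's Length and the number of characters the user sees can differ.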


The framework also exposes a Convert method to switch from one encoding to another; this operation is usually called transcoding:

[TestMethod]
public void Transcoding_Test()
{
   //\u0066 is the letter 'f', which ASCII can represent
   string sampleText = "Unicode character \u0066";
   System.Text.ASCIIEncoding encodingASCII = new System.Text.ASCIIEncoding();
   System.Text.UnicodeEncoding encodingUNICODE = new System.Text.UnicodeEncoding();
   byte[] sampleTextEncoded = encodingUNICODE.GetBytes(sampleText);
   //print out the string with UNICODE (UTF-16) encoding
   Console.WriteLine("{0}: {1}", encodingUNICODE.ToString(), encodingUNICODE.GetString(sampleTextEncoded));
   //this is the output we get if we try to decode with ASCII without converting:
   //each UTF-16 character is 2 bytes, so every letter is followed by a NUL (0x00) character
   Console.WriteLine("Not converted - {0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(sampleTextEncoded));
   //convert the bytes from Unicode (UTF-16) to ASCII
   sampleTextEncoded = System.Text.Encoding.Convert(encodingUNICODE, encodingASCII, sampleTextEncoded);
   Console.WriteLine("Converted - {0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(sampleTextEncoded));
}
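
The sample text above only contains characters that ASCII can represent, so the conversion loses nothing. The sketch below is an addition for illustration, not part of the original test (the TranscodingSketch class and its method names are hypothetical). It shows what happens when the target encoding cannot hold a character: by default Convert silently substitutes '?', while an encoding created with exception fallbacks lets you detect the loss instead.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are not from the original post.
public static class TranscodingSketch
{
    // Transcode UTF-16 bytes to ASCII with the default replacement fallback:
    // characters ASCII cannot represent come back as '?'.
    public static string ToAsciiLossy(string text)
    {
        byte[] utf16 = Encoding.Unicode.GetBytes(text);
        byte[] ascii = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, utf16);
        return Encoding.ASCII.GetString(ascii);
    }

    // Transcode with an ASCII encoding that throws instead of substituting,
    // so the caller can tell whether the conversion was lossless.
    public static bool TryToAscii(string text, out byte[] ascii)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            ascii = Encoding.Convert(
                Encoding.Unicode, strictAscii, Encoding.Unicode.GetBytes(text));
            return true;
        }
        catch (EncoderFallbackException)
        {
            ascii = new byte[0];
            return false;
        }
    }

    public static void Main()
    {
        string text = "Caf\u00E9";   // ends with 'é', which is not in ASCII

        Console.WriteLine("Lossy: {0}", ToAsciiLossy(text));

        byte[] bytes;
        Console.WriteLine("Lossless conversion possible: {0}", TryToAscii(text, out bytes));
    }
}
```

Which behaviour is preferable depends on the tool: silent substitution keeps the program running, while the exception fallback makes data loss visible before the text is printed or stored.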


For more info: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
