Windows Programming/Unicode

For a reference on the Unicode standard, see Unicode.

Introduction to Unicode

Unicode is an industry standard whose goal is to provide a means by which text in all forms and languages can be encoded for use by computers. Originally, text characters were represented in computers using byte-wide data: each printable character (and many non-printing, or "control", characters) was stored in a single byte, which allowed for at most 256 characters in total. However, globalization has created a need for computers to accommodate many different alphabets from around the world.

The old codes were known as ASCII or EBCDIC, but it became apparent that neither code was capable of handling all the different characters and alphabets from around the world. Unicode was created to solve this problem. Windows NT implements many of its core functions with a "wide" 16-bit character set that is close to the Unicode standard, although it also provides a series of functions that are compatible with standard ASCII characters.

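Unicode characters are frequently called "Wide Characters" in Windows programming. The terms "Generic Characters" and "T Characters" strictly refer to characters whose width is chosen at compile time (described later in this chapter), but they are often used loosely for the same thing. This book may use any of these terms interchangeably.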

Variable-Width Characters

Before Unicode, there was an internationalization attempt that introduced character strings with variable-width characters. Some characters, such as the standard ASCII characters, were 1 byte long. Other characters, such as those in extended character sets, were two bytes long. These character formats fell out of favor with the advent of Unicode because code that handles them is harder to write and much harder to read. Windows still maintains some functionality for dealing with variable-width strings, but we won't discuss it here.

Unfortunately, the main advantage of wide characters, a fixed width, was lost because the number of characters needed quickly exceeded the 65,536 possible 16-bit values. Windows actually uses an encoding called UTF-16 to store characters, in which a large number of characters take two 16-bit words; these are called "surrogate pairs". This development came after much of the Windows API documentation was written, and much of that documentation is now obsolete. You should never treat string data as an "array of characters"; instead, always treat it as a null-terminated block. For instance, always send the entire string to a function that draws it on the screen, and do not attempt to draw each character separately. Code that puts a square bracket after an LPSTR or LPWSTR and treats the result as a whole character is wrong.
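
To make the surrogate-pair point concrete, here is a minimal sketch in portable C. It uses uint16_t as a stand-in for the 16-bit units that Windows stores in a wide string, and the function names utf16_code_points and utf16_combine are hypothetical helpers written for this illustration, not part of the Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the Unicode code points in a null-terminated UTF-16 string.
   A high surrogate (0xD800-0xDBFF) followed by a low surrogate
   (0xDC00-0xDFFF) occupies two units but is one code point. */
size_t utf16_code_points(const uint16_t *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != 0) {
        if (*s >= 0xD800 && *s <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
            s += 2;   /* surrogate pair: two units, one code point */
        } else {
            s += 1;   /* single-unit character (or an unpaired surrogate) */
        }
        count++;
    }
    return count;
}

/* Combine a high and a low surrogate into the code point they encode. */
uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10)
                    + ((uint32_t)low - 0xDC00u);
}
```

For the three-unit string 0x0041, 0xD83D, 0xDE00 the count is two code points, which is why indexing a wide string by unit does not reliably land on a character boundary.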

At the same time, variable-width strings made a big comeback in the multi-platform standard called UTF-8, which is much the same idea as UTF-16 except that it uses 8-bit units. Its primary advantage is that there is no need for two APIs: the 'A' and 'W' APIs could have been the same if UTF-8 had been used, and since UTF-16 is also variable-width, UTF-8 gives up nothing in that respect. Although most Windows programmers are unfamiliar with it, you may see increased references to using the non-UNICODE API.
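
As an illustration of how UTF-8 spreads one code point over one to four 8-bit units, the following is a small portable sketch. The function name utf8_encode is a hypothetical helper for this example; in a real Windows program the conversion from a wide string is normally done with WideCharToMultiByte and the CP_UTF8 code page rather than by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes needed).
   Returns the number of bytes written, or 0 if cp cannot be encoded
   (a surrogate value, or anything above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {     /* surrogates are not characters */
        return 0;
    }
    if (cp <= 0xFFFF) {                     /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that ASCII text passes through unchanged, which is a large part of why UTF-8 can be carried by APIs that were designed for 8-bit strings.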

Windows Implementation

The Win32 API provides each function that takes text input in two versions. One version has an "A" suffix (for ANSI, the 8-bit character version that covers ASCII), and the other has a "W" suffix (for Wide characters, or Unicode). The unsuffixed name is mapped to one of the two versions using the macro "UNICODE":

#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif

Because of this mapping, a compiler error that involves such a function will name "MessageBoxW" (or "MessageBoxA") instead of simply "MessageBox". In these cases, the compiler is not broken. It is simply following the macro definitions.
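
The same selection pattern can be reproduced outside the Windows headers. The sketch below is an illustration only: TextLengthA, TextLengthW and the TextLength macro are hypothetical names invented for this example, following the naming convention described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical function pair that follows the A/W naming convention. */
size_t TextLengthA(const char *s)
{
    assert(s != NULL);
    return strlen(s);      /* length in 8-bit units */
}

size_t TextLengthW(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);      /* length in wide units */
}

/* The unsuffixed name is only a macro; the compiler never sees it. */
#ifdef UNICODE
#define TextLength TextLengthW
#else
#define TextLength TextLengthA
#endif
```

Because the preprocessor rewrites TextLength before compilation, any diagnostic about a bad argument mentions TextLengthA or TextLengthW, which mirrors what happens with MessageBox.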

Unicode Environment

All Windows functions that require character strings are defined in this manner. If you want to use Unicode in your program, you need to explicitly define the UNICODE macro before you include the windows.h file:

#define UNICODE
#include <windows.h>

Also, some functions in other libraries require you to define the macro _UNICODE. The generic-text versions of the standard library functions are made available by including the <tchar.h> file, and they follow _UNICODE. So, to use Unicode in your project, you need to make the following declarations:

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <tchar.h>

Some header files include a mechanism like the following, so that when one of the two UNICODE macros is defined, the other is automatically defined as well:

#ifdef UNICODE
  #ifndef _UNICODE
    #define _UNICODE
  #endif
#endif

#ifdef _UNICODE
  #ifndef UNICODE
    #define UNICODE
  #endif
#endif

If you are writing a library that utilizes UNICODE, it might be worthwhile for you to include this mechanism in your header files as well, so that other programmers don't need to worry about including both macros.

TEXT macro

In C, to make a string of wide characters, you need to prefix the string with the letter "L". Here is an example:

const char *asciimessage = "This is an ASCII string.";
const wchar_t *unicodemessage = L"This is a Wide Unicode string.";

The data type "TCHAR" is defined as a char type if UNICODE is not defined, and as a wide character type if UNICODE is defined (tchar.h provides the same switch keyed on _UNICODE). To make strings portable between Unicode and non-Unicode builds, we can use the TEXT() macro to automatically define a string literal as wide or narrow:

const TCHAR *automessage = TEXT("This message can be either ASCII or UNICODE!");

Using the TCHAR data type and the TEXT macro are important steps in making your code portable between different environments.
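
To show how the character type, the literal macro and a string function switch together, here is a sketch that builds without the Windows headers. The names demo_tchar, DEMO_TEXT, demo_tcslen and title_length are hypothetical stand-ins for TCHAR, TEXT and a generic-text length function; they are assumptions of this example, not definitions taken from windows.h or tchar.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR, TEXT() and a generic strlen. */
#ifdef UNICODE
typedef wchar_t demo_tchar;
#define DEMO_TEXT(s) L##s
#define demo_tcslen  wcslen
#else
typedef char demo_tchar;
#define DEMO_TEXT(s) s
#define demo_tcslen  strlen
#endif

/* Generic code: this function compiles unchanged for either width. */
size_t title_length(void)
{
    const demo_tchar *title = DEMO_TEXT("Generic title");
    assert(title != NULL);
    return demo_tcslen(title);   /* counts units, not bytes */
}
```

Compiling the same file with and without -DUNICODE (or the equivalent compiler option) changes the type of title but not the source, which is the portability the TEXT macro is meant to provide.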

Also, the TEXT macro can be written as:

TEXT("This is a generic string");
_T("This is also a generic string");
T("This is also a generic string");

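All three of these statements are equivalent when UNICODE and _UNICODE are set consistently. Note that TEXT comes from the Windows headers and follows UNICODE, while _T comes from <tchar.h> and follows _UNICODE. A bare T macro is not provided by the standard Windows headers, so it is only available where a project defines it itself.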

The TEXT macro is typically defined like this:

#ifdef UNICODE
#define TEXT(t) L##t
#define _T(t) L##t
#define T(t) L##t
#else
#define TEXT(t) t
#define _T(t) t
#define T(t) t
#endif
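
One detail the simplified definition above leaves out is what happens when the argument is itself a macro. With a single-level L##t, the paste happens before the argument is expanded, so a macro name would turn into an undefined identifier. A common remedy, and to the best of my knowledge the one the SDK headers use through a helper macro, is to add a second level. The sketch below uses hypothetical names (APP_NAME, WIDEN_INNER, WIDEN, app_name_wide) chosen for this example.

```c
#include <assert.h>
#include <wchar.h>

#define APP_NAME "Demo"            /* hypothetical narrow string macro */

/* Two-level wide-string macro: WIDEN expands its argument first,
   then WIDEN_INNER pastes the L prefix onto the resulting literal. */
#define WIDEN_INNER(s) L##s
#define WIDEN(s)       WIDEN_INNER(s)

/* WIDEN(APP_NAME) becomes WIDEN_INNER("Demo"), then L"Demo".
   Calling WIDEN_INNER(APP_NAME) directly would produce LAPP_NAME. */
const wchar_t *app_name_wide(void)
{
    const wchar_t *name = WIDEN(APP_NAME);
    assert(name != NULL);
    return name;
}
```

If you write your own wrapper around string literals, the two-level form is the safer choice for the same reason.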

Unicode Reference

see Unicode

Control Characters

Unicode characters 0 to 31 (U+0000 to U+001F) are part of the C0 Controls and Basic Latin block. They are all control characters. These characters correspond to the first 32 characters of the ASCII set.

Code point	Decimal equivalent	Name
U+0000	0	null character
U+0001	1	start of heading
U+0002	2	start of text
U+0003	3	end of text
U+0004	4	end of transmission
U+0005	5	inquiry
U+0006	6	acknowledgment
U+0007	7	bell
U+0008	8	backspace
U+0009	9	horizontal tab
U+000A	10	line feed
U+000B	11	vertical tab
U+000C	12	form feed
U+000D	13	carriage return
U+000E	14	shift out
U+000F	15	shift in
U+0010	16	data link escape
U+0011	17	device control 1
U+0012	18	device control 2
U+0013	19	device control 3
U+0014	20	device control 4
U+0015	21	negative acknowledgment
U+0016	22	synchronous idle
U+0017	23	end of transmission block
U+0018	24	cancel
U+0019	25	end of medium
U+001A	26	substitute
U+001B	27	escape
U+001C	28	file separator
U+001D	29	group separator
U+001E	30	record separator
U+001F	31	unit separator
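
As a small worked example of using this range, the sketch below tests whether a code point falls in U+0000 to U+001F and counts such characters in a wide string. The function names is_c0_control and count_c0_controls are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Nonzero if cp is one of the 32 control characters in the table above. */
int is_c0_control(uint32_t cp)
{
    return cp <= 0x1Fu;
}

/* Count the C0 control characters in a null-terminated wide string.
   The terminating null (U+0000) ends the scan and is not counted. */
size_t count_c0_controls(const wchar_t *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != L'\0'; s++) {
        if (is_c0_control((uint32_t)*s)) {
            count++;
        }
    }
    return count;
}
```

A check like this can be useful before displaying text from an untrusted source, since control characters such as tab, carriage return and line feed change layout rather than drawing a glyph.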

Next Chapter

Dynamic Link Libraries (DLL)