| ... | ... | @@ -30,15 +30,15 @@ It might be helpful to outline some of the background to Unicode itself, because |
|
|
|
- UTF-16 encodes characters in 16-bit units. Unfortunately a single unit does not cover the entire code space, so there are some
|
|
|
|
reserved 'surrogate' values that are used in pairs to reach the rest of the code book. So although most characters
|
|
|
|
end up fitting in a single 16-bit field, some must be coded as two successive fields.
|
|
|
|
- UTF-32 uses a full 32-bit word per code point.
|
|
|
|
- To make things more exciting, the UTF-16 and UTF-32 encodings have two variations, depending on the endianness of
|
|
|
|
- UCS-4 uses a full 32-bit word per code point.
|
|
|
|
- To make things more exciting, the UTF-16 and UCS-4 encodings have two variations, depending on the endianness of
|
|
|
|
the machine they were originally written on. So if you read a raw byte-stream and want to convert it to 16-bit chunks,
|
|
|
|
you first need to work out the byte-ordering. This is often done by reading a few bytes and then looking up a heuristic table,
|
|
|
|
although there is also a 'byte-order mark' (U+FEFF), a non-printing character that may or may not be present at the start of the stream.
|
|
|
|
|
|
|
|
- Since unix-like systems traditionally deal with byte-streams, UTF-8 is the most common encoding on those platforms.
|
|
|
|
- The NTFS file system (Windows) stores filenames in UTF-16; file contents are plain bytes in whatever encoding the writing application chose.
|
|
|
|
- Almost no system stores UTF-32 natively, but there is a C library type 'wchar_t' (wide character) which has 32 bits on most unix-like platforms (16 bits on Windows).
|
|
|
|
- Almost no system stores UCS-4 natively, but there is a C library type 'wchar_t' (wide character) which has 32 bits on most unix-like platforms (16 bits on Windows).
|
|
|
|
- But of course any system must be able to read/write files that originated on any other platform.
|
|
|
|
- As an example of the complex heuristics needed to guess the encoding of any particular file, see the
|
|
|
|
[XML standard](http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing). |
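
To make the byte-order mark and the "two successive fields" points above concrete, here is a minimal Python sketch. The names `sniff_bom` and `utf16_units` are hypothetical helpers made up for this illustration; only the `codecs.BOM_*` constants come from the Python standard library. It only checks for a leading byte-order mark and is not a full encoding detector, so a file without a mark still needs the kind of heuristics described above.

```python
import codecs

# Byte-order marks to test for. The UTF-32 marks come first because the
# UTF-32 little-endian mark starts with the same two bytes as the UTF-16
# little-endian mark. (Assumption: a UTF-16 LE stream whose first real
# character is U+0000 would be misread as UTF-32 LE; this sketch accepts
# that ambiguity.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no mark is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def utf16_units(text: str) -> int:
    """Count the 16-bit fields needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2
```

A caller would typically strip the mark before decoding, for example `data[bom_length:].decode(encoding)`, and fall back to a default or a heuristic when `sniff_bom` returns `None`. `utf16_units` shows the surrogate-pair effect: a character inside the 16-bit range needs one field, while a character outside it needs two.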