Chapter Thirteen
Every time we tap on a tablet or poke on our phones or sit down at a laptop or desktop computer, we’re dealing with text. We’re either reading text, typing text, or cutting and pasting text from one place to another—from webpages to word processors, from email to social networks, from quips we see online to friends that we message.
None of this would be possible without a standardized way to represent text characters in computer bits and bytes. Character encoding is easily the most vital computer standard. This standard is crucial for the ability of modern communication to transcend the differences between computer systems and applications, between hardware and software manufacturers, and even between national boundaries.
Yet the representation of text on computers can still sometimes fail. In early 2021 as I began revising this chapter, I received an email from a web-hosting provider with the subject line
Weâ€™ve received your payment, thanks.
You’ve undoubtedly seen such oddities yourself, and they seem bizarre, but by the end of this chapter, you’ll know exactly how such a thing can happen.
This book began with a discussion of two systems for representing text with binary codes. Morse code might not seem like a pure binary code at first because it involves short dots and longer dashes with various lengths of pauses between the dots and dashes. But recall that everything in Morse code is a multiple of the length of a dot: A dash is three times the length of a dot, pauses between letters are the length of a dash, and pauses between words are the length of two dashes. If a dot is a single 1 bit, then a dash is three 1 bits in a row, while pauses are strings of 0 bits. Here are the words “HI THERE” in Morse code with the equivalent binary digits:
Morse code is categorized as a variable bit-length code because different characters require a different number of bits.
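Here’s a quick sketch in Python of that translation, with a table holding just the Morse codes for the letters in “HI THERE”:

```python
# A sketch of the dot/dash-to-bits rule described above: a dot is 1, a dash
# is 111, gaps within a letter are 0, between letters 000, between words 000000.
MORSE = {'H': '....', 'I': '..', 'T': '-', 'E': '.', 'R': '.-.'}   # just what "HI THERE" needs

def morse_to_bits(text):
    words = []
    for word in text.split():
        letters = ['0'.join('1' if mark == '.' else '111' for mark in MORSE[letter])
                   for letter in word]
        words.append('000'.join(letters))
    return '000000'.join(words)

print(morse_to_bits('HI THERE'))   # 1010101000101000000111000...
```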
Braille is much simpler in this regard. Each character is represented by an array of six dots, and each dot can be either raised or not raised. Braille is unmistakably a 6-bit code, which means that each character can be represented by a 6-bit value. One little catch is that additional Braille characters are necessary to represent numbers and uppercase letters. You might recall that numbers in Braille require a shift code—a Braille character that changes the meaning of subsequent characters.
Shift codes also show up in another early binary code, invented in connection with a printing telegraph in the 1870s. This was the work of Émile Baudot, an officer in the French Telegraph Service, and the code is still known by his name. Baudot code was used into the 1960s—for example, by Western Union for sending and receiving text messages called telegrams. You might even today hear a computer old-timer refer to transmission speeds of binary data as baud rates.
The Baudot code was often used in the teletypewriter, a device that has a keyboard that looks something like a typewriter, except that it has only 30 keys and a spacebar. The keys are switches that cause a binary code to be generated and sent down the teletypewriter’s output cable, one bit after the other. Teletypewriters also contain a printing mechanism. Codes coming through the teletypewriter’s input cable trigger electromagnets that print characters on paper.
Baudot is a 5-bit code, so there are only 32 possible codes, in hexadecimal ranging from 00h through 1Fh. Here’s how these 32 available codes correspond to the letters of the alphabet:
Code 00h isn’t assigned to anything. Of the remaining 31 codes, 26 are assigned to letters of the alphabet, and the other five are indicated by italicized words or phrases in the table.
Code 04h is the Space code, which is used for the space separating words. Codes 02h and 08h are labeled Carriage Return and Line Feed. This terminology comes from typewriters: When you’re typing on a typewriter and you reach the end of a line, you push a lever or button that does two things. First, it causes the carriage with the paper to be moved to the right (or the printing mechanism to be moved to the left) so that the next line begins at the left side of the paper. That’s a carriage return. Second, the typewriter rolls the carriage so that the next line is underneath the line you just finished. That’s the line feed. In Baudot, separate codes represent these two actions, and a Baudot teletypewriter printer responds to them when printing.
Where are the numbers and punctuation marks in the Baudot system? That’s the purpose of code 1Bh, identified in the table as Figure Shift. After the Figure Shift code, all subsequent codes are interpreted as numbers or punctuation marks until the Letter Shift code (1Fh) causes them to revert to the letters. Here are the codes for the numbers and punctuation:
That table shows how these codes were used in the United States. Outside the US, codes 05h, 0Bh, and 16h were often used for the accented letters of some European languages. The Bell code is supposed to ring an audible bell on the teletypewriter. The “Who Are You?” code activates a mechanism for a teletypewriter to identify itself.
Like Morse code, Baudot doesn’t differentiate between uppercase and lowercase. The sentence
I SPENT $25 TODAY.
is represented by the following stream of hexadecimal data:
I S P E N T $ 2 5 T O D A Y .
0C 04 14 0D 10 06 01 04 1B 16 19 01 1F 04 01 03 12 18 15 1B 07 02 08
Notice the three shift codes: 1Bh right before the dollar sign, 1Fh after the number, and 1Bh again before the final period. The line concludes with codes for the carriage return and line feed.
Unfortunately, if you sent this stream of data to a teletypewriter printer twice in a row, it would come out like this:
I SPENT $25 TODAY.
8 '03,5 $25 TODAY.
What happened? The last shift code the printer received before the second line was a Figure Shift code, so the codes at the beginning of the second line are interpreted as numbers until the next Letter Shift code.
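If you’d like to see the problem in action, here’s a sketch in Python of a stateful Baudot decoder. It knows only the handful of code values that appear in this chapter’s examples (the real tables have 32 entries each), but that’s enough to reproduce both lines above:

```python
# A partial Baudot decoder; the tables contain only the codes used in this
# chapter's example, with the figure-mode entries taken from the garbled line.
LETTERS = {0x0C: 'I', 0x14: 'S', 0x0D: 'P', 0x10: 'E', 0x06: 'N', 0x01: 'T',
           0x03: 'O', 0x12: 'D', 0x18: 'A', 0x15: 'Y'}
FIGURES = {0x0C: '8', 0x14: "'", 0x0D: '0', 0x10: '3', 0x06: ',', 0x01: '5',
           0x16: '$', 0x19: '2', 0x07: '.'}
SPACE, CR, LF, FIGURE_SHIFT, LETTER_SHIFT = 0x04, 0x02, 0x08, 0x1B, 0x1F

def decode_baudot(codes):
    table = LETTERS              # the printer starts out in letter mode
    text = ''
    for code in codes:
        if code == FIGURE_SHIFT:
            table = FIGURES      # subsequent codes are figures...
        elif code == LETTER_SHIFT:
            table = LETTERS      # ...until a Letter Shift switches back
        elif code == SPACE:
            text += ' '
        elif code == LF:
            text += '\n'
        elif code != CR:         # a carriage return adds nothing to a string
            text += table.get(code, '?')
    return text

stream = [0x0C, 0x04, 0x14, 0x0D, 0x10, 0x06, 0x01, 0x04, 0x1B, 0x16, 0x19, 0x01,
          0x1F, 0x04, 0x01, 0x03, 0x12, 0x18, 0x15, 0x1B, 0x07, 0x02, 0x08]
print(decode_baudot(stream + stream))
# I SPENT $25 TODAY.
# 8 '03,5 $25 TODAY.    <- the shift state carries over from the first line
```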
Problems like this are the typically nasty results of using shift codes. When the time came to replace Baudot with something more modern and versatile, it was considered preferable to avoid shift codes and to define separate codes for lowercase and uppercase letters.
How many bits do you need for such a code? If you focus just on English and begin adding up the characters, you’ll need 52 codes just for the uppercase and lowercase letters in the Latin alphabet, and ten codes for the digits 0 through 9. You’re up to 62 already. Throw in a few punctuation marks, and that’s more than 64, which is the limit for 6 bits. But there’s now some leeway before exceeding 128 characters, which would then require 8 bits.
So the answer is: 7. You need 7 bits to represent all the characters that normally occur in English text without shift codes.
What replaced Baudot was a 7-bit code called the American Standard Code for Information Interchange, abbreviated ASCII, and referred to with the unlikely pronunciation of AS-kee. It was formalized in 1967 and remains the single most important standard in the entire computer industry. With one big exception (which I’ll describe soon), whenever you encounter text on a computer, you can be sure that ASCII is involved in some way.
As a 7-bit code, ASCII uses binary codes 0000000 through 1111111, which are hexadecimal codes 00h through 7Fh. You’re going to see all 128 ASCII codes shortly, but I want to divide the codes into four groups of 32 each and then skip the first 32 codes initially because these codes are conceptually a bit more difficult than the others. The second group of 32 codes includes punctuation and the ten numeric digits. This table shows the hexadecimal codes from 20h to 3Fh, and the characters that correspond to those codes:
Notice that 20h is the space character that divides words and sentences.
The next 32 codes include the uppercase letters and some additional punctuation. Aside from the @ sign and the underscore, these punctuation symbols aren’t normally found on typewriters, but they’ve come to be standard on computer keyboards.
The next 32 characters include all the lowercase letters and some additional punctuation, again not often found on typewriters but standard on computer keyboards:
Notice that this table is missing the last character corresponding to code 7Fh. You’ll see it shortly.
The text string
Hello, you!
can be represented in ASCII using the hexadecimal codes
48 65 6C 6C 6F 2C 20 79 6F 75 21
Notice the comma (code 2Ch), the space (code 20h), and the exclamation point (code 21h), as well as the codes for the letters. Here’s another short sentence:
I am 12 years old.
And its ASCII representation:
49 20 61 6D 20 31 32 20 79 65 61 72 73 20 6F 6C 64 2E
Notice that the number 12 in this sentence is represented by the hexadecimal numbers 31h and 32h, which are the ASCII codes for the digits 1 and 2. When the number 12 is part of a text stream, it should not be represented by the hexadecimal codes 01h and 02h, or the hexadecimal code 0Ch. These codes all mean something else in ASCII.
A particular uppercase letter in ASCII differs from its lowercase counterpart by 20h. This fact makes it quite easy for computer programs to convert between uppercase and lowercase letters: Just add 20h to the code for an uppercase letter to convert to lowercase, and subtract 20h to convert lowercase to uppercase. (But you don’t even need to add. Only a single bit needs to be changed to convert between uppercase and lowercase. You’ll see techniques to do jobs like that later in this book.)
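Here’s a peek ahead at that technique, sketched in Python: the 20h bit is turned on to get lowercase and turned off to get uppercase.

```python
# A sketch of case conversion by flipping the 20h bit of an ASCII code.
def to_lower(c):
    return chr(ord(c) | 0x20) if 'A' <= c <= 'Z' else c

def to_upper(c):
    return chr(ord(c) & ~0x20) if 'a' <= c <= 'z' else c

print(to_lower('A'), hex(ord('A')), hex(ord('a')))   # a 0x41 0x61
print(to_upper('q'))                                 # Q
```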
The 95 ASCII codes you’ve just seen are said to refer to graphic characters because they have a visual representation. ASCII also includes 33 control characters that have no visual representation but instead perform certain functions. For the sake of completeness, here are the 33 ASCII control characters, but don’t worry if they seem mostly incomprehensible. At the time that ASCII was developed, it was intended mostly for teletypewriters, and many of these codes are currently quite obscure.
The idea here is that control characters can be intermixed with graphic characters to do some rudimentary formatting of the text. This is easiest to understand if you think of a device—such as a teletypewriter or a simple printer—that types characters on a page in response to a stream of ASCII codes. The device’s printing head normally responds to character codes by printing a character and moving one space to the right. The most important control characters alter this behavior.
For example, consider the hexadecimal character string
41 09 42 09 43
The 09 character is a Horizontal Tabulation code, or Tab for short. If you think of all the horizontal character positions on the printer page as being numbered starting with 0, the Tab code usually means to print the next character at the next horizontal position that’s a multiple of 8, like this:
A       B       C
This is a handy way to keep text lined up in columns.
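In code, the Tab rule is just a little arithmetic. Here’s a sketch in Python that expands the Tab codes in a stream of ASCII bytes:

```python
# A sketch of Tab expansion: the character after a Tab (09h) goes to the
# next column that's a multiple of 8.
def expand_tabs(codes):
    line, column = '', 0
    for code in codes:
        if code == 0x09:                     # Horizontal Tabulation
            stop = (column // 8 + 1) * 8     # next multiple of 8
            line += ' ' * (stop - column)
            column = stop
        else:
            line += chr(code)
            column += 1
    return line

print(expand_tabs([0x41, 0x09, 0x42, 0x09, 0x43]))   # A, B, and C at columns 0, 8, and 16
```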
Even today, some computer printers respond to a Form Feed code (0Ch) by ejecting the current page and starting a new page.
The Backspace code can be used for printing composite characters on some old printers. For example, suppose the computer controlling the teletypewriter wanted to display a lowercase e with a grave accent mark, like so: è. This could be achieved by using the hexadecimal codes 65 08 60.
By far the most important control codes are Carriage Return and Line Feed, which have the same meaning as the similar Baudot codes. On some older computer printers, the Carriage Return code moved the printing head to the left side of the page on the same line, and the Line Feed code moved the printing head one line down. Both codes were generally required to go to a new line. A Carriage Return could be used by itself to print over an existing line, and a Line Feed could be used by itself to skip to the next line without moving to the left margin.
Text, pictures, music, and video can all be stored on the computer in the form of files, which are collections of bytes identified by a name. These filenames often consist of a descriptive name indicating the contents of the file, and an extension, usually three or four letters indicating the type of the file. Files consisting of ASCII characters often have the filename extension txt for “text.” ASCII doesn’t include codes for italicized text, or boldface, or various fonts and font sizes. All that fancy stuff is characteristic of what’s called formatted text or rich text. ASCII is for plain text. On a Windows desktop computer, the Notepad program can create plain-text files; under macOS, the TextEdit program does the same (although that is not its default behavior). Both these programs allow you to choose a font and font size, but that’s only for viewing the text. That information is not stored with the text itself.
Both Notepad and TextEdit respond to the Enter or Return key by ending the current line and moving to the beginning of the next line. But these programs also perform word wrapping: As you type and come to the rightmost edge of the window, the program will automatically continue your typing on the next line, and the continued text really becomes part of a paragraph rather than individual lines. You press the Enter or Return key to mark the end of that paragraph and begin a new paragraph.
When you press the Enter or Return key, the Windows Notepad inserts hexadecimal code 0Dh and 0Ah into the file—the Carriage Return and Line Feed characters. The macOS TextEdit inserts just a 0Ah, the Line Feed. What’s now called the Classic Mac OS (which existed from 1984 to 2001) inserted just 0Dh, the Carriage Return. This inconsistency continues to cause problems when a file created on one system is read on another system. In recent years, programmers have worked to reduce those problems, but it’s still quite shocking—shameful, even—that there is still no computer industry standard for denoting the end of lines or paragraphs in a plain-text file.
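Until such a standard arrives, programs simply have to cope. Here’s one common approach, sketched in Python: treat a 0Dh 0Ah pair, a lone 0Ah, or a lone 0Dh as a line break, and convert them all to whatever convention you prefer:

```python
# A sketch of line-ending normalization: Windows uses CR+LF (0Dh 0Ah),
# Unix and today's macOS use LF (0Ah), and the Classic Mac OS used CR (0Dh).
def normalize_line_endings(text, newline='\n'):
    # Collapse CR+LF first so the lone-CR rule doesn't turn it into two breaks.
    unified = text.replace('\r\n', '\n').replace('\r', '\n')
    return newline.join(unified.split('\n'))

print(repr(normalize_line_endings('one\r\ntwo\rthree\n')))   # 'one\ntwo\nthree\n'
```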
Soon after its introduction, ASCII became the dominant standard for text in the computing world, but not within IBM. In connection with the System/360, IBM developed its own character code, known as the Extended BCD Interchange Code, or EBCDIC, which was an 8-bit extension of an earlier 6-bit code known as BCDIC, which was derived from codes used on IBM punch cards. This style of punch card—capable of storing 80 characters of text—was introduced by IBM in 1928 and used for over 50 years.
The black rectangles are holes punched in the card. Punch cards have a practical problem that affects how they are used to represent characters: If too many holes are punched in the card, it can lose its structural integrity, tear apart, and jam up a machine.
A character is encoded on a punch card by a combination of one or more rectangular holes punched in a single column. The character itself is often printed near the top of the card. The lower ten rows are called digit rows and identified by number: the 0-row, the 1-row, and so on through the 9-row. These are remnants of computer systems that worked directly with decimal numbers. The two unnumbered rows near the top are zone rows and are called the 11-row and 12-row, which is the one at the very top. There is no 10-row.
EBCDIC character codes are combinations of the zone punches and digit punches. The EBCDIC codes for the ten digits are F0h through F9h. The EBCDIC codes for the uppercase letters are in three groups, from C1h to C9h, from D1h to D9h, and from E2h to E9h. EBCDIC codes for lowercase letters are also in three groups, from 81h to 89h, from 91h to 99h, and from A2h to A9h.
In ASCII, all the uppercase and lowercase letters are in continuous sequences. This makes it convenient to alphabetically sort ASCII data. EBCDIC, however, has gaps in the sequences of the letters, making sorting more complex. Fortunately, at this time EBCDIC is mostly a historical curiosity rather than something you’ll likely encounter in your personal or professional life.
At the time that ASCII was being developed, memory was very expensive. Some people felt that to conserve memory, ASCII should be a 6-bit code using a shift character to differentiate between lowercase and uppercase letters. Once that idea was rejected, others believed that ASCII should be an 8-bit code because it was considered more likely that computers would have 8-bit architectures than they would 7-bit architectures. Of course, 8-bit bytes are now the standard, and although ASCII is technically a 7-bit code, it’s almost universally stored as 8-bit values.
The equivalence of bytes and ASCII characters is certainly convenient because we can get a rough sense of how much computer memory a particular text document requires simply by counting the characters. For example, Herman Melville’s Moby-Dick; or, The Whale is about 1.25 million characters and therefore occupies 1.25 million bytes of computer storage. From this information, an approximate word count can also be derived: The average word is considered to be five characters in length, and counting the space that appears between words, Moby-Dick is therefore about 200 thousand words in length.
A plain-text file of Moby-Dick can be downloaded from the Project Gutenberg website (gutenberg.org) along with many other works of classic literature in the public domain. Although Project Gutenberg pioneered the availability of books in plain text, it also makes available these same books in a couple of e-book formats as well as in HTML (Hypertext Markup Language).
As the format used for webpages throughout the internet, HTML is definitely the most popular rich-text format. HTML adds fancy formatting to plain text by using snippets of markup or tags. But what’s interesting is that HTML uses normal ASCII characters for markup, so an HTML file is also a normal plain-text file. When viewed as plain text, HTML looks like this:
This is some <b>bold</b> text, and this is some <i>italic</i> text.
The angle brackets are just ASCII codes 3Ch and 3Eh. But when interpreted as HTML, a web browser can display that text like this:
This is some bold text, and this is some italic text.
It’s the same text but just rendered in different ways.
ASCII is certainly the most important standard in the computer industry, but even from the beginning the deficiencies were obvious. The big problem is that the American Standard Code for Information Interchange is just too darn American! Indeed, ASCII is hardly suitable even for other nations whose principal language is English. ASCII includes a dollar sign, but where is the British pound sign? Where it fails badly is in dealing with the accented letters used in many Western European languages, to say nothing of the non-Latin alphabets used in Europe, including Greek, Arabic, Hebrew, and Cyrillic, or the Brahmi scripts of India and Southeast Asia, including Devanagari, Bengali, Thai, and Tibetan. And how can a 7-bit code possibly handle the tens of thousands of ideographs of Chinese, Japanese, and Korean and the ten thousand–odd Hangul syllables of Korean?
Including all the world’s languages in ASCII would have been much too ambitious a goal in the 1960s, but the needs of some other nations were kept in mind, although only with rudimentary solutions. According to the published ASCII standard, ten ASCII codes (40h, 5Bh, 5Ch, 5Dh, 5Eh, 60h, 7Bh, 7Ch, 7Dh, and 7Eh) are available to be redefined for national uses. In addition, the number sign (#) can be replaced by the British pound sign (£), and the dollar sign ($) can be replaced by a generalized currency sign (¤). Obviously, replacing symbols makes sense only when everyone involved in using a particular text document containing these redefined codes knows about the change.
Because many computer systems store characters as 8-bit values, it’s possible to devise something called an extended ASCII character set that contains 256 characters rather than just 128. In such a character set, the first 128 codes, with hexadecimal values 00h through 7Fh, are defined just as they are in ASCII, but the next 128 codes (80h through FFh) can be whatever you want. This technique was used to define additional character codes to accommodate accented letters and non-Latin alphabets. Unfortunately, ASCII was extended many times in many different ways.
When Microsoft Windows was first released, it supported an extension of ASCII that Microsoft called the ANSI character set, although it had not actually been approved by the American National Standards Institute. The additional characters for codes A0h through FFh are mostly useful symbols and accented letters commonly found in European languages. In this table, the high-order nibble of the hexadecimal character code is shown in the top row; the low-order nibble is shown in the left column:
The character for code A0h is defined as a no-break space. Usually when a computer program formats text into lines and paragraphs, it breaks each line at a space character, which is ASCII code 20h. Code A0h is supposed to be displayed as a space but can’t be used for breaking a line. A no-break space might be used in a date such as February 2 so that February doesn’t appear on one line and 2 on the next line.
Code ADh is defined as a soft hyphen. This is a hyphen used to separate syllables in the middle of words. It appears on the printed page only when it’s necessary to break a word between two lines.
The ANSI character set became popular because it was part of Windows, but it was just one of many different extensions of ASCII defined over the decades. To keep them straight, they accumulated numbers and other identifiers. The Windows ANSI character set became a standard of the International Organization for Standardization known as ISO-8859-1, or Latin Alphabet No. 1. When this character set was itself extended to include characters for codes 80h through 9Fh, it became known as Windows-1252:
The number 1252 is called a code page identifier, a term that originated at IBM to differentiate different versions of EBCDIC. Various code pages were associated with countries requiring their own accented characters and even entire alphabets, such as Greek, Cyrillic, and Arabic. To properly render character data, it was necessary to know what code page was involved. This became most crucial on the internet, where information at the top of an HTML file (known as the header) indicates the code page used to create the webpage.
ASCII was also extended in more radical ways to encode the ideographs of Chinese, Japanese, and Korean. In one popular encoding—called Shift-JIS (Japanese Industrial Standard)—codes 81h through 9Fh actually represent the initial byte of a 2-byte character code. In this way, Shift-JIS allows for the encoding of about 6000 additional characters. Unfortunately, Shift-JIS isn’t the only system that uses this technique. Three other standard double-byte character sets (DBCS) became popular in Asia.
The existence of multiple incompatible double-byte character sets is only one of their problems. Another problem is that some characters—specifically, the normal ASCII characters—are represented by 1-byte codes, while the thousands of ideographs are represented by 2-byte codes. This makes it difficult to work with such character sets.
If you think this sounds like a mess, you’re not alone, so can somebody please come up with a solution?
Under the assumption that it’s preferable to have just one unambiguous character encoding system that’s suitable for all the world’s languages, several major computer companies got together in 1988 and began developing an alternative to ASCII known as Unicode. Whereas ASCII is a 7-bit code, Unicode is a 16-bit code. (Or at least that was the original idea.) In its original conception, each and every character in Unicode would require 2 bytes, with character codes ranging from 0000h through FFFFh to represent 65,536 different characters. That was considered sufficient for all the world’s languages that are likely to be used in computer communication, with room for expansion.
Unicode didn’t start from scratch. The first 128 characters of Unicode—codes 0000h through 007Fh—are the same as the ASCII characters. Also, Unicode codes 00A0h through 00FFh are the same as the Latin Alphabet No. 1 extension of ASCII that I described earlier. Other worldwide standards are also incorporated into Unicode.
Although a Unicode code is just a hexadecimal value, the standard way of indicating it is by prefacing the value with a capital U and a plus sign. Here are a few representative Unicode characters:
Many more can be found on the website run by the Unicode Consortium, unicode.org, which offers a fascinating tour of the richness of the world’s written languages and symbology. Scroll down to the bottom of the home page and click Code Charts for a portal into images of more characters than you ever believed possible.
But moving from an 8-bit character code to a 16-bit code raises problems of its own: Different computers read 16-bit values in different ways. For example, consider these two bytes:
20h ACh
Some computers would read that sequence as the 16-bit value 20ACh, the Unicode code for the Euro sign. These computers are referred to as big-endian machines, meaning that the most significant byte (the big end) is first. Other computers are little-endian machines. (The terminology comes from Gulliver’s Travels, wherein Jonathan Swift describes a conflict about which side of a soft-boiled egg to break.) Little-endian machines read that value as AC20h, which in Unicode is the character 갠 in the Korean Hangul alphabet.
To get around this problem, Unicode defines a special character called the byte order mark, or BOM, which is U+FEFF. This is supposed to be placed at the beginning of a file of 16-bit Unicode values. If the first two bytes in the file are FEh and FFh, the file is in big-endian order. If they’re FFh and FEh, the file is in little-endian order.
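Here’s a little sketch in Python of that rule; the function assumes the data really is a sequence of 16-bit values, with or without a byte order mark in front:

```python
# A sketch of byte-order detection for a stream of 16-bit Unicode values.
def read_utf16_values(data: bytes):
    byteorder = 'big'                          # assume big-endian if there's no BOM
    if data[:2] == b'\xFE\xFF':
        byteorder, data = 'big', data[2:]
    elif data[:2] == b'\xFF\xFE':
        byteorder, data = 'little', data[2:]
    return [int.from_bytes(data[i:i + 2], byteorder) for i in range(0, len(data), 2)]

# The same two bytes read both ways:
print(hex(int.from_bytes(b'\x20\xAC', 'big')))      # 0x20ac, the euro sign
print(hex(int.from_bytes(b'\x20\xAC', 'little')))   # 0xac20, the Hangul syllable 갠
```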
By the mid-1990s, just as Unicode was starting to catch on, it became necessary to go beyond 16 bits to include scripts that have become extinct but are still necessary to represent for historic reasons and to include numerous new symbols. Some of these new symbols were those popular and delightful characters known as emojis.
At the time of this writing (the year 2021), Unicode has been expanded to become a 21-bit code with values ranging through U+10FFFF, potentially supporting over 1 million different characters. Here are just a few of the characters that couldn’t be accommodated with a 16-bit code:
Including emojis in Unicode might seem frivolous, but only if you believe that it’s acceptable for an emoji entered in a text message to show up as something completely different on the recipient’s phone. Misunderstandings could result, and relationships could suffer!
Of course, people’s needs regarding Unicode are different. Particularly when rendering ideographs of Asian languages, it’s necessary to make extensive use of Unicode. Other documents and webpages have more modest needs. Many can do just fine with plain old ASCII. For that reason, several different methods have been defined for storing and transmitting Unicode text. These are called Unicode transformation formats, or UTFs.
The most straightforward of the Unicode transformation formats is UTF-32. All Unicode characters are defined as 32-bit values. The 4 bytes required for each character can be specified in either little-endian order or big-endian order.
The drawback of UTF-32 is that it uses lots of space. A plain-text file containing the text of Moby-Dick would increase in size from 1.25 million bytes in ASCII to 5 million bytes in Unicode. And considering that Unicode uses only 21 of the 32 bits, 11 bits are wasted for each character.
One compromise is UTF-16. With this format, most Unicode characters are defined with 2 bytes, but characters with codes above U+FFFF are defined with 4 bytes. An area in the original Unicode specification from U+D800 through U+DFFF was left unassigned for this purpose.
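Here’s roughly how that works, sketched in Python: a code above U+FFFF has 10000h subtracted from it, and the remaining 20 bits are split between two 16-bit values, one from the U+D800 block and one from the U+DC00 block.

```python
# A sketch of how UTF-16 represents a character beyond U+FFFF as two
# 16-bit values drawn from the otherwise unassigned U+D800-U+DFFF area.
def utf16_surrogate_pair(codepoint):
    assert codepoint > 0xFFFF
    offset = codepoint - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits    -> U+D800 through U+DBFF
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> U+DC00 through U+DFFF
    return high, low

print([hex(v) for v in utf16_surrogate_pair(0x1F639)])   # ['0xd83d', '0xde39']
```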
The most important Unicode transformation format is UTF-8, which is now used extensively throughout the internet. A recent statistic indicates that 97% of all webpages now use UTF-8. That’s about as much of a universal standard as you can want. Project Gutenberg’s plain-text files are all UTF-8. Windows Notepad and macOS TextEdit save files in UTF-8 by default.
UTF-8 is a compromise between flexibility and concision. The biggest advantage of UTF-8 is that it’s backward compatible with ASCII. This means that a file consisting solely of 7-bit ASCII codes stored as bytes is automatically a UTF-8 file.
To make this compatibility possible, all other Unicode characters are stored with 2, 3, or 4 bytes, depending upon their value. The following table summarizes how UTF-8 works:

Range of codes               Number of bits   Encoded byte sequence
U+0000 through U+007F              7          0xxxxxxx
U+0080 through U+07FF             11          110xxxxx 10xxxxxx
U+0800 through U+FFFF             16          1110xxxx 10xxxxxx 10xxxxxx
U+10000 through U+10FFFF          21          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
For the ranges of codes shown in the first column, each character is uniquely identified with the number of bits shown in the second column. These bits are then prefaced with 1s and 0s, as shown in the third column, to form a sequence of bytes. The number of x’s in the third column is the same as the bit count in the second column.
The first row of the table indicates that if the character is from the original collection of 7-bit ASCII codes, the UTF-8 encoding of that character is a 0 bit followed by those 7 bits, which is the same as the ASCII code itself.
Characters with Unicode values of U+0080 and greater require 2 or more bytes. For example, the British pound sign (£) is Unicode U+00A3. Because this value is between U+0080 and U+07FF, the second row of the table indicates that it’s encoded in UTF-8 with 2 bytes. For values in this range, only the least significant 11 bits need to be used to derive the 2-byte encoding, as shown here:
The Unicode value of 00A3 is shown at the top of this diagram. Each of the four hex digits corresponds to the 4-bit value shown directly under the digit. We know that the value is 07FFh or less, which means that the most significant 5 bits will be 0 and they can be ignored. The next 5 bits are prefaced with 110 (as shown at the bottom of the illustration) to form the byte C2h. The least significant 6 bits are prefaced with 10 to form the byte A3h.
Thus, in UTF-8 the two bytes C2h and A3h represent the British £ sign. It seems a shame to require 2 bytes to encode what is essentially just 1 byte of information, but it’s necessary for the rest of UTF-8 to work.
Here’s another example. The Hebrew letter א (alef) is U+05D0 in Unicode. Again, that value is between U+0080 and U+07FF, so the second row of the table is used. It’s the same process as the £ character:
The first 5 bits of the value 05D0h can be ignored; the next 5 bits are prefaced with 110, and the least significant 6 bits are prefaced with 10 to form the UTF-8 bytes D7h and 90h.
Regardless of how unlikely the image might be, the Cat Face with Tears of Joy emoji is represented by Unicode U+1F639, which means that UTF-8 represents it as a sequence of 4 bytes. This diagram shows how those 4 bytes are assembled from the 21 bits of the original code:
By representing characters using a varying number of bytes, UTF-8 spoils some of the purity and beauty of Unicode. In the past, schemes like this used in connection with ASCII caused problems and confusion. UTF-8 isn’t entirely immune from problems, but it has been defined very intelligently. When a UTF-8 file is decoded, every byte can be identified quite precisely:
· If the byte begins with a zero, that’s simply a 7-bit ASCII character code.
· If the byte begins with 10, it’s part of a sequence of bytes representing a multibyte character code, but it’s not the first byte in that sequence.
· Otherwise, the byte begins with at least two 1 bits, and it’s the first byte of a multibyte character code. The total number of bytes for this character code is indicated by the number of 1 bits that this first byte begins with before the first 0 bit. This can be two, three, or four.
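Here’s a sketch in Python of those rules applied to each byte of a stream; the sample bytes are the ASCII code for the letter H followed by the UTF-8 encodings of £ and א from the examples above:

```python
# A sketch of the UTF-8 byte classification rules listed above.
def classify_utf8_byte(b):
    if b < 0b10000000:      # starts with 0:     a plain 7-bit ASCII code
        return 'ASCII'
    if b < 0b11000000:      # starts with 10:    a continuation byte
        return 'continuation'
    if b < 0b11100000:      # starts with 110:   first byte of a 2-byte character
        return 'start of 2'
    if b < 0b11110000:      # starts with 1110:  first byte of a 3-byte character
        return 'start of 3'
    return 'start of 4'     # starts with 11110: first byte of a 4-byte character

print([classify_utf8_byte(b) for b in bytes([0x48, 0xC2, 0xA3, 0xD7, 0x90])])
# ['ASCII', 'start of 2', 'continuation', 'start of 2', 'continuation']
```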
Let’s try one more UTF-8 conversion: The Right Single Quotation Mark character is U+2019. This requires consulting the third row of the table because the value is between U+0800 and U+FFFF. The UTF-8 representation is 3 bytes:
All the bits of the original Unicode number are necessary to form the 3 bytes. The first 4 bits are prefaced with 1110, the next 6 bits with 10, and the least significant 6 bits also with 10. The result is the 3-byte sequence of E2h, 80h, and 99h.
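The whole scheme fits in a handful of lines of code. Here’s a sketch in Python of the four cases from the table, checked against the characters worked out above. (Python has UTF-8 support built in, of course; this is only to show the bit shuffling.)

```python
# A sketch of the UTF-8 encoding rules summarized in the table above.
def utf8_encode(codepoint):
    if codepoint <= 0x7F:                       # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    return bytes([0xF0 | (codepoint >> 18),     # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

for codepoint in (0x00A3, 0x05D0, 0x2019, 0x1F639):
    print(f'U+{codepoint:04X}', utf8_encode(codepoint).hex(' ').upper())
# U+00A3  C2 A3         the British pound sign
# U+05D0  D7 90         the Hebrew letter alef
# U+2019  E2 80 99      the Right Single Quotation Mark
# U+1F639 F0 9F 98 B9   the Cat Face with Tears of Joy emoji
```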
Now it’s possible to see the problem with the email that I mentioned at the beginning of this chapter with the subject line
Weâ€™ve received your payment, thanks.
The first word is obviously “We’ve” but the contraction uses not the old-fashioned ASCII apostrophe (ASCII 27h or Unicode U+0027) but the fancier Unicode Right Single Quotation Mark, which as we’ve just seen is encoded in UTF-8 with the three bytes E2h, 80h, and 99h.
So far, no problem. But the HTML file in this email indicated that it was using the character set “windows-1252.” It should have said “utf-8” because that’s how the text was encoded. But because this HTML file indicated “windows-1252,” my email program used the Windows-1252 character set to interpret these three bytes. Check back in the table of the Windows-1252 codes on page 162 to confirm for yourself that the three bytes E2h, 80h, and 99h do indeed map to the characters â, €, and ™, precisely the characters in the email.
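You can reproduce the mix-up with a couple of lines of Python, using its built-in UTF-8 and Windows-1252 (cp1252) codecs:

```python
# Encode the Right Single Quotation Mark as UTF-8, then misread those
# three bytes as if they were Windows-1252 characters.
utf8_bytes = '\u2019'.encode('utf-8')
print(utf8_bytes.hex(' ').upper())      # E2 80 99
print(utf8_bytes.decode('cp1252'))      # â€™ -- exactly the garbage in the subject line
```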
Mystery solved.
By extending computing to become a universal and multicultural experience, Unicode has been an enormously important standard. But like anything else, it doesn’t work unless it’s used correctly.