Because the first plane, also known as the "Basic Multilingual Plane" or BMP, contains almost everything you will ever use, many have wrongly assumed that Unicode is a 16-bit character set.〔【出典】Gentoo Linux ◆【License】CC-BY-SA-2.5 ◆【編集】独立行政法人情報通信研究機構 〕
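One way to see that 16 bits are not enough is to look at a single code point outside the BMP. The sketch below uses U+1F600 purely as an illustrative choice; it is not taken from the original text.

```python
# U+1F600 is an arbitrary example of a character outside the BMP.
ch = chr(0x1F600)
code_point = ord(ch)

# BMP code points fit in 16 bits (0x0000 to 0xFFFF); this one does not.
in_bmp = code_point <= 0xFFFF

# UTF-16 has to spend two 16-bit units (a surrogate pair) on it.
utf16_bytes = ch.encode("utf-16-be")

print(hex(code_point), in_bmp, len(utf16_bytes))
```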
There is a Unicode encoding that uses four bytes per character. It's called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
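The constant-time lookup can be sketched in a few lines. This is only an illustration: the helper name `nth_char` and the sample string are made up here, and the little-endian variant is chosen so that no byte order mark is prepended.

```python
def nth_char(buf: bytes, n: int) -> str:
    """Return the Nth character (0-based) of UTF-32-LE encoded bytes."""
    # Every character occupies exactly 4 bytes, so the offset is 4 * n.
    chunk = buf[4 * n : 4 * n + 4]
    return chr(int.from_bytes(chunk, "little"))


text = "héllo"                    # sample string, chosen for illustration
data = text.encode("utf-32-le")   # explicit byte order, so no BOM is added

print(len(data))                  # 5 characters * 4 bytes each
print(nth_char(data, 1))
```

The same string in UTF-8 would need only 6 bytes, which is the disadvantage the paragraph complains about.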
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. "Is this string UTF-8?" is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
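As a small illustration of the distinction, the sketch below turns one string into two different byte sequences and back again. The sample word is an arbitrary choice.

```python
s = "café"                      # a str: four characters, no encoding attached

as_utf8 = s.encode("utf-8")     # bytes: b'caf\xc3\xa9'
as_cp1252 = s.encode("cp1252")  # bytes: b'caf\xe9'

# Same characters, different bytes; decoding with the matching
# encoding recovers the original string either way.
round_trip_utf8 = as_utf8.decode("utf-8")
round_trip_cp1252 = as_cp1252.decode("cp1252")

print(type(s).__name__, len(s))
print(type(as_utf8).__name__, len(as_utf8), len(as_cp1252))
```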
One of Python's built-in exceptions is ImportError, which is raised when you try to import a module and fail. This can happen for a variety of reasons, but the simplest case is when the module doesn't exist in your import search path. You can use this to include optional features in your program. For example, the chardet library provides character encoding auto-detection. Perhaps your program wants to use this library if it exists, but continue gracefully if the user hasn't installed it. You can do this with a try..except block.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
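A minimal sketch of that pattern might look like the following. The helper `guess_encoding` and its `default` parameter are hypothetical names introduced here; only `chardet.detect()` belongs to the library itself.

```python
try:
    import chardet  # optional third-party library; may not be installed
except ImportError:
    chardet = None  # remember that the feature is unavailable


def guess_encoding(data: bytes, default: str = "utf-8") -> str:
    """Use chardet's guess when it is installed, else fall back to a default."""
    if chardet is None:
        return default
    result = chardet.detect(data)        # dict with an 'encoding' key
    return result["encoding"] or default  # 'encoding' can be None


print(guess_encoding(b"hello world"))
```

Either way the program keeps running; the only difference is how good the guess is.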
The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don't deal in abstractions. If you request a resource, you get bytes. If you want it as a string, you'll need to determine the character encoding and explicitly convert it to a string.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
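The round trip from bytes to string can be sketched as below. To stay runnable without a network connection, the example assumes a `data:` URL as a stand-in for a real HTTP resource, and the UTF-8 fallback is likewise an assumption rather than a rule.

```python
from urllib.request import urlopen

# A data: URL stands in for a real HTTP resource (assumption for this sketch).
url = "data:text/plain;charset=utf-8,caf%C3%A9"

with urlopen(url) as response:
    raw = response.read()                             # bytes, not str
    charset = response.headers.get_content_charset()  # declared encoding, or None

# Explicit conversion; the fallback encoding is an assumption.
text = raw.decode(charset or "utf-8")

print(type(raw).__name__, raw)
print(type(text).__name__, text)
```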
UTF-8 is a variable-length character encoding, which in this instance means that it uses 1 to 4 bytes per symbol.〔【出典】Gentoo Linux ◆【License】CC-BY-SA-2.5 ◆【編集】独立行政法人情報通信研究機構 〕
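The 1-to-4-byte range is easy to observe directly. The four sample characters below are arbitrary picks, one for each length.

```python
# One sample character per UTF-8 length (illustrative choices).
samples = ["A", "é", "中", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```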
When you talk about "text," you're probably thinking of "characters and symbols on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
What's the #1 cause of gibberish text on the web, in your inbox, and across every computer system ever written? It's character encoding. In the Strings chapter, I talked about the history of character encoding and the creation of Unicode, the "one encoding to rule them all." I'd love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
That was mostly OK in a non-networked world, where "text" was something you typed yourself and occasionally printed. There wasn't much "plain text". Source code was ASCII, and everyone else used word processors, which defined their own (non-text) formats that tracked character encoding information along with rich styling, &c. People read these documents with the same word processing program as the original author, so everything worked, more or less.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more "mode switching" to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character? That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it's wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0-255) to represent that language's characters. For instance, you're probably familiar with the ASCII encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital "A", 97 is lowercase "a", &c.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that's 7 out of the 8 bits in a byte.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
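The ASCII numbers mentioned above can be checked from a Python prompt. This is just a quick illustration; the sample strings are arbitrary.

```python
# The two examples from the text.
capital_a = ord("A")
lowercase_a = ord("a")

# Every ASCII byte is below 128, so the top bit of each byte is unused.
encoded = "Hello, world".encode("ascii")
fits_in_7_bits = all(b < 128 for b in encoded)

# Anything outside the 0-127 range simply cannot be expressed in ASCII.
try:
    "café".encode("ascii")
    ascii_can_store_e_acute = True
except UnicodeEncodeError:
    ascii_can_store_e_acute = False

print(capital_a, lowercase_a, fits_in_7_bits, ascii_can_store_e_acute)
```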
Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each "character" is represented by a two-byte number from 0-65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It's just that the range of numbers is broader, because there are many more characters to represent.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
Now think about the rise of global networks like email and the web. Lots of "plain text" flying around the globe, being authored on one computer, transmitted through a second computer, and received and displayed by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of "plain text." Remember, it's the decryption key that maps computer-readable numbers to human-readable characters. A missing decryption key means garbled text, gibberish, or worse.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
Now think about trying to store multiple pieces of text in the same place, like in the same database table that holds all the email you've ever received. You still need to store the character encoding alongside each piece of text so you can display it properly. Think that's hard? Try searching your email database, which means converting between multiple encodings on the fly. Doesn't that sound fun?〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
Advantages of UTF-8: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you'll have to trust me on this, because I'm not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
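These size claims can be checked with a rough comparison, sketched below. The three sample strings are illustrative picks, and the explicit little-endian codecs are used so that a byte order mark does not inflate the counts.

```python
samples = {
    "ASCII": "hello",
    "extended Latin": "héllo",
    "Chinese": "你好",
}


def sizes(text: str) -> tuple:
    """Return the byte counts for UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


table = {label: sizes(text) for label, text in samples.items()}
for label, (u8, u16, u32) in table.items():
    print(f"{label:15} utf-8={u8:2} utf-16={u16:2} utf-32={u32:2}")

# Byte order matters for UTF-16; UTF-8 has only one possible byte stream.
little = "é".encode("utf-16-le")
big = "é".encode("utf-16-be")
print(little, big, "é".encode("utf-8"))
```

Note that for the Chinese sample UTF-16 is still smaller than UTF-8; the paragraph only claims an improvement over UTF-32.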
As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. Mozilla Firefox contains an encoding auto-detection library which is open source. I ported the library to Python 2 and dubbed it the chardet module. This chapter will take you step-by-step through the process of porting the chardet module from Python 2 to Python 3.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
Surely you've seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn't declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it's merely annoying; in other languages, the result can be completely unreadable.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes -- a file, a web page, whatever -- and claims it's "text," you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you're left with the unenviable task of cracking the code yourself. Chances are you'll get it wrong, and the result will be gibberish.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
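The "decryption key" idea can be seen by decoding the same bytes with different encodings. The byte values and the encodings below are arbitrary picks for illustration.

```python
# One byte, two keys, two different characters.
mystery = b"\xe9"
as_latin1 = mystery.decode("latin-1")   # Western European reading
as_cp1251 = mystery.decode("cp1251")    # Cyrillic reading

# The right bytes with the wrong key: UTF-8 text read as Latin-1.
stored = "café".encode("utf-8")
garbled = stored.decode("latin-1")
correct = stored.decode("utf-8")

print(as_latin1, as_cp1251)
print(garbled, correct)
```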
It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It's like cracking a code when you don't have the decryption key.〔【出典】"Dive Into Python 3" by Mark Pilgrim ◆【和訳】Fukada & Fujimoto ◆【License】CC-BY-SA-3.0 〕
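To make the idea concrete, here is a deliberately naive sketch: try a short list of candidate encodings in order and keep the first one that decodes without error. This is not how chardet works; the function name `naive_guess` and the candidate list are assumptions made up for this example.

```python
# Hypothetical candidate list, ordered from strictest to most permissive.
CANDIDATES = ("ascii", "utf-8", "latin-1")


def naive_guess(data: bytes) -> str:
    """Return the first candidate encoding that decodes data without error."""
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # wrong key, try the next one
        return name
    raise ValueError("no candidate encoding fits")


print(naive_guess(b"hello"))
print(naive_guess("café".encode("utf-8")))
print(naive_guess(b"caf\xe9"))
```

Because Latin-1 accepts every possible byte, this approach never fails, but it also happily returns wrong answers, so it only shows why "a guess" is the right word.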