Representing Characters: Giving Letters a Binary Voice (Syllabus 3.5.4)
Welcome to the fascinating world of character representation! In this chapter, we explore how computers, which fundamentally only understand binary (sequences of 1s and 0s), manage to store, process, and display all the letters, numbers, and symbols we use every day—from the English alphabet to complex emojis and international characters.
Understanding this process is vital because it explains how text data is handled and why certain systems (like older computers) sometimes struggle to display characters from different languages.
1. The Core Concept: Character Sets and Codes
To store a character (like 'A', '7', or '$'), the computer doesn't store the shape of the character itself. Instead, it stores a numerical code which is then converted into a bit pattern (binary).
- Character Set: Think of this as a huge, standardized dictionary or lookup table. It's a defined list of characters that a computer can recognize and interpret.
- Unique Code: Each character in the set is assigned a unique numerical value (its code). When you type 'A', the computer stores the number corresponding to 'A'. When it displays 'A', it looks up that number in the character set to know which shape to draw on the screen.
Key Takeaway: Text data is stored as numbers, which are themselves stored as bit patterns.
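This lookup in both directions can be seen directly in most programming languages. As a sketch in Python, `ord()` returns the numeric code for a character and `chr()` performs the reverse lookup:

```python
# The character set acts as a lookup table:
# ord() gives the numeric code, chr() gives the character back.
code = ord('A')             # look up the code for 'A'
print(code)                 # 65
print(chr(code))            # A  (the reverse lookup)
print(format(code, '08b'))  # 01000001 - the bit pattern actually stored
```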
2. ASCII: The Original Standard
The earliest and most fundamental standard is ASCII (American Standard Code for Information Interchange).
What is ASCII?
ASCII was introduced in the 1960s and became the standard way to encode English characters.
- Bit Length: ASCII is a 7-bit encoding system.
- Capacity: With 7 bits, we can represent \(2^7 = 128\) unique characters.
These 128 codes cover:
- Uppercase letters (A-Z)
- Lowercase letters (a-z)
- Numeric digits (0-9)
- Common punctuation and symbols (like !, ?, &)
- Control characters (like carriage return or tab)
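A quick sketch in Python confirms the capacity and shows where each of these groups sits inside the 128-code range:

```python
# 7 bits give 2**7 = 128 possible codes
print(2 ** 7)              # 128

# Each printable group falls inside the 0-127 range:
print(ord('A'), ord('Z'))  # 65 90  - uppercase letters
print(ord('a'), ord('z'))  # 97 122 - lowercase letters
print(ord('0'), ord('9'))  # 48 57  - numeric digits
print(ord('\t'), ord('\r'))  # 9 13 - control characters (tab, carriage return)
```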
Did you know? In many older systems, ASCII was extended to 8 bits (often called "Extended ASCII") to represent another 128 characters for European languages, but these extensions were not globally standardised, which caused major compatibility issues!
3. The Need for Unicode and the Rise of UTF-8
The main problem with ASCII was its limitation: 128 characters is nowhere near enough to represent all the characters used globally (think of Chinese characters, Arabic scripts, or even just European letters with accents like 'é').
Unicode was introduced to solve this international limitation.
Understanding Unicode
Unicode is a character set that assigns a unique number to every character in virtually every language and script used across the globe. It currently defines over 140,000 characters.
- Multiple Encoding Systems: Unlike ASCII, Unicode is vast and uses more than one way (encoding system) to store these numbers.
- The most common and important encoding system you need to know is UTF-8.
Focus on UTF-8
UTF-8 (Unicode Transformation Format – 8-bit) is the most widely used Unicode encoding system today, especially on the internet. It is popular because of its efficiency and compatibility.
Variable Length Encoding:
UTF-8 is a variable-length encoding system. This means it uses different numbers of bits depending on the character:
- For common characters (like English text), it uses 8 bits (1 byte).
- For less common characters, it may use 16, 24, or 32 bits (2, 3, or 4 bytes).
Backwards Compatibility:
The brilliance of UTF-8 is its backwards compatibility with ASCII. The first 128 characters (codes 0 to 127) in Unicode/UTF-8 are exactly the same as their counterparts in ASCII. This allows older systems that understand ASCII to easily process basic UTF-8 text without error.
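Both properties, variable length and ASCII compatibility, are easy to demonstrate. A minimal sketch in Python, using the standard `str.encode` method:

```python
# UTF-8 is variable length: ASCII characters take 1 byte,
# other characters take 2, 3, or 4 bytes.
print(len('A'.encode('utf-8')))    # 1 byte  - same as ASCII
print(len('é'.encode('utf-8')))    # 2 bytes - accented Latin letter
print(len('€'.encode('utf-8')))    # 3 bytes - euro sign
print(len('😀'.encode('utf-8')))   # 4 bytes - emoji

# Backwards compatibility: the single UTF-8 byte for 'A'
# is identical to its 7-bit ASCII code.
print('A'.encode('utf-8')[0] == ord('A'))  # True
```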
Quick Review: ASCII vs. UTF-8
ASCII: Simple, 7 bits, 128 characters, English only.
UTF-8: Complex, Variable length (8 to 32 bits), Global coverage, Backward compatible with ASCII.
4. Understanding Character Code Groupings
When solving exam questions, you won't be expected to memorize every character code, but you must know the starting points for the main "blocks" of characters in both ASCII and Unicode (since they match in this range).
- Numeric Digits (0-9): Start at code 48.
  For example: '0' = 48, '1' = 49, '9' = 57.
- English Uppercase Letters (A-Z): Start at code 65.
  Example: 'A' = 65, 'B' = 66. If you are told 'A' is 65, you can easily calculate 'G' = 65 + 6 = 71.
- English Lowercase Letters (a-z): Start at code 97.
  Example: 'a' = 97, 'b' = 98.
Memory Aid: Remember the key numbers: 48, 65, 97.
5. The Crucial Distinction: Character Code vs. Pure Binary
This is a common source of confusion! You must understand the difference between the character code representation of a decimal digit (like the character '6') and the pure binary representation of the numerical value (like the number 6).
Analogy: The Label vs. The Quantity
Imagine you have 6 apples.
1. The quantity of apples is 6. In pure binary, this is 110 (using 3 bits).
2. The label you write on the box is the character '6'.
Step-by-Step Example (The Digit 6):
We want to store the digit 6.
Case 1: Storing the value "6" (Pure Binary)
If you are storing the number 6 as an integer data type, the computer uses pure binary for calculation.
- Decimal value: 6
- Pure Binary (8 bits): 00000110
Case 2: Storing the character '6' (Character Code)
If you are storing the symbol '6' as part of a string data type (like in a phone number or address), the computer uses its character code.
- We know numeric digits start at code 48, so '0' = 48.
- The character '6' is 6 positions after '0'.
- ASCII Code: \(48 + 6 = 54\).
- ASCII Binary (7 bits): 0110110
- UTF-8 Binary (8 bits, padded): 00110110
Notice the difference:
Pure Binary (6): 00000110
Character Code ('6'): 00110110
If a computer tried to calculate using the character code '6', the math would be completely wrong! This is why conversion operations (like string to integer, as discussed in other sections of the syllabus) are essential in programming.
Common Mistake Alert!
Don't confuse the code for a digit with the value of that digit. If you read the text "123", the computer sees three separate character codes (for '1', '2', and '3'), not the single mathematical value 123.
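The whole distinction can be checked in a short Python sketch: compare the bit patterns of the value 6 and the character '6', then see how the string "123" is really three codes until it is converted.

```python
# The value 6 and the character '6' have different bit patterns.
value = 6
char_code = ord('6')             # 48 + 6 = 54
print(format(value, '08b'))      # 00000110 - pure binary
print(format(char_code, '08b'))  # 00110110 - character code

# "123" is three separate character codes, not the number 123.
print([ord(c) for c in "123"])   # [49, 50, 51]

# Arithmetic only works after a string-to-integer conversion.
print(int("123") + 1)            # 124
```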
Chapter Summary: Representing Characters
We represent characters by assigning them a unique numerical code within a character set.
- ASCII (7-bit) was the original standard, limiting representation to 128 characters.
- Unicode was created to support all global characters and uses multiple encoding systems.
- UTF-8 is the dominant variable-length Unicode encoding, specifically designed to be backward compatible with ASCII (codes 0-127 are identical).
- Remember the key starting codes: Digits (48), Uppercase (65), Lowercase (97).
- The character code for a digit is different from its pure binary value. For instance, the character '1' is stored using code 49, while the numerical value 1 is stored as 00000001 (in 8 bits).