💡 Welcome to Data Storage and Compression!
Hello IGCSE Computer Scientists! This chapter is all about understanding how much space your data takes up and how we can make digital files smaller without losing quality (or sometimes accepting a little quality loss to save massive amounts of space).
Understanding data storage is fundamental. Whether you are dealing with huge images, long sound files, or planning a network, you need to know how to measure and manage digital sizes.
1. Measuring Data Storage (The Digital Hierarchy)
Before we calculate file sizes, we need a common language to measure data. Think of it like measuring liquids (millilitres, litres) or distances (metres, kilometres) – digital data has its own units!
1.1 The Basic Units
Bit (b)
The smallest unit of data. A single binary digit: a 0 or a 1.
Nibble
A collection of 4 bits. (Did you know? It’s half a byte, so they named it a nibble!)
Byte (B)
A collection of 8 bits. This is the fundamental unit for measuring most characters (like 'A' or '?' or '5').
1.2 Large Storage Units (The 1024 Rule)
In Computer Science, each of these larger storage units is 1024 times the size of the one before it, not 1000 times. Why 1024? Because 1024 is \(2^{10}\), a power of two, which fits naturally with how computers handle binary.
Warning: Common Mistake Alert!
The syllabus requires you to use 1024 for all calculations involving large units, not 1000.
| Unit | Abbreviation | Size (in terms of the next smaller unit) |
|---|---|---|
| Byte | B | 8 bits |
| Kibibyte | KiB | 1024 Bytes |
| Mebibyte | MiB | 1024 KiB |
| Gibibyte | GiB | 1024 MiB |
| Tebibyte | TiB | 1024 GiB |
| Pebibyte | PiB | 1024 TiB |
| Exbibyte | EiB | 1024 PiB |
Remember: In exams, they use the binary prefixes (KiB, MiB, GiB), which are based on 1024.
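The table above can be turned into a small conversion helper. This is only a sketch: the names `UNITS` and `convert_bytes` are made up for illustration and are not part of the syllabus. It divides by 1024 once for every step up the table.

```python
# Illustrative helper (hypothetical names, not from the syllabus):
# convert a number of Bytes into a larger binary unit.
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]

def convert_bytes(num_bytes, target_unit):
    """Return num_bytes expressed in target_unit, e.g. "MiB"."""
    steps = UNITS.index(target_unit)   # how many times to divide by 1024
    return num_bytes / (1024 ** steps)

print(convert_bytes(2048, "KiB"))             # 2.0
print(convert_bytes(3 * 1024 * 1024, "MiB"))  # 3.0
```

Going the other way (from a large unit down to Bytes) would mean multiplying by 1024 for each step instead.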
Key Takeaway (Section 1)
Data storage is measured starting with the bit. Larger units increase by a factor of 1024 (a power of 2). You must use 1024 in calculations!
2. Calculating File Sizes
The size of a file depends entirely on how much binary data is needed to represent it. We will look at how to calculate the size of uncompressed images and sound files.
2.1 Image File Size Calculation
To determine the size of an uncompressed image, we need two main factors: Resolution and Colour Depth.
Resolution
This is the total number of pixels in the image (Width x Height).
Colour Depth
This is the number of bits used to represent the colour of each individual pixel.
- If an image has a colour depth of 1 bit, it can only store two colours (\(2^1\)), usually black and white.
- If an image has a colour depth of 8 bits, it can store 256 colours (\(2^8\)).
- If an image has a colour depth of 24 bits, it can store over 16 million colours (\(2^{24}\)).
Analogy: Think of colour depth as the number of different crayons you have. The more bits (crayons), the more detail and variety you can draw, but every pixel then needs more storage space, so the file gets larger.
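The link between colour depth and the number of available colours can be checked with a short sketch. The function name `number_of_colours` is just an illustrative choice.

```python
def number_of_colours(colour_depth_bits):
    """Each extra bit doubles the number of colours: 2 to the power of the depth."""
    return 2 ** colour_depth_bits

for depth in (1, 8, 24):
    print(depth, "bits ->", number_of_colours(depth), "colours")
# 1 bits -> 2 colours
# 8 bits -> 256 colours
# 24 bits -> 16777216 colours
```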
The Formula (Image File Size)
$$ \text{File Size (in bits)} = \text{Image Width (pixels)} \times \text{Image Height (pixels)} \times \text{Colour Depth (bits)} $$
Step-by-Step Calculation Example:
An image has a resolution of 100 pixels by 50 pixels and a colour depth of 8 bits. Calculate the file size in Bytes.
- Calculate total pixels: \(100 \times 50 = 5000\) pixels
- Calculate total bits: \(5000 \times 8 \text{ bits/pixel} = 40,000\) bits
- Convert to Bytes (divide by 8): \(40,000 / 8 = 5000\) Bytes
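The same steps can be written as a short sketch, assuming an uncompressed image with no extra file header. The function name `image_size_bytes` is hypothetical.

```python
def image_size_bytes(width, height, colour_depth):
    """Uncompressed image size in Bytes (ignores any file header)."""
    total_pixels = width * height
    total_bits = total_pixels * colour_depth
    return total_bits / 8              # 8 bits in a Byte

print(image_size_bytes(100, 50, 8))    # 5000.0
```

To give the answer in KiB instead, you would divide the result by 1024.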
2.2 Sound File Size Calculation
Sound is recorded by taking 'snapshots' of the sound wave, a process called sampling. The quality of the sound depends on the rate and resolution of this sampling.
Sample Rate
The number of samples taken per second, measured in Hertz (Hz).
Effect: A higher sample rate (more snapshots per second) means the recorded sound wave is closer to the original. This increases the accuracy of the recording, but it also increases the file size.
Sample Resolution (Bit Depth)
The number of bits used to store the amplitude (loudness) of each sample.
Effect: A higher sample resolution means a wider range of amplitudes can be stored. This increases the accuracy of the recording, but it also increases the file size.
Note: You must also account for the Length of Track (in seconds) and whether the sound is Mono (1 track) or Stereo (2 tracks). For stereo, the data is stored twice (once per track), so multiply the result by 2. Assume mono unless told otherwise.
The Formula (Sound File Size)
$$ \text{File Size (in bits)} = \text{Sample Rate (Hz)} \times \text{Sample Resolution (bits)} \times \text{Time (seconds)} $$
Step-by-Step Calculation Example:
A 30-second mono sound track is recorded at 44,100 Hz (Sample Rate) with a Sample Resolution of 16 bits. Calculate the file size in Mebibytes (MiB).
- Calculate total bits: \(44,100 \times 16 \times 30 = 21,168,000\) bits
- Convert to Bytes (divide by 8): \(21,168,000 / 8 = 2,646,000\) Bytes
- Convert to KiB (divide by 1024): \(2,646,000 / 1024 = 2584\) KiB (approximately)
- Convert to MiB (divide by 1024): \(2584 / 1024 = 2.52\) MiB (approximately)
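The worked example can be reproduced with a short sketch. The function name `sound_size_bits` and the `channels` parameter are assumptions added for illustration: `channels=1` means mono and `channels=2` means stereo.

```python
def sound_size_bits(sample_rate, sample_resolution, seconds, channels=1):
    """Uncompressed sound size in bits. channels: 1 = mono, 2 = stereo."""
    return sample_rate * sample_resolution * seconds * channels

bits = sound_size_bits(44100, 16, 30)  # the mono track from the example
size_bytes = bits / 8                  # bits -> Bytes
size_kib = size_bytes / 1024           # Bytes -> KiB
size_mib = size_kib / 1024             # KiB -> MiB

print(bits)                            # 21168000
print(round(size_mib, 2))              # 2.52
```

Calling `sound_size_bits(44100, 16, 30, channels=2)` would give twice as many bits, because a stereo recording stores two tracks.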
Key Takeaway (Section 2)
Image and sound quality (and file size) increase with higher resolution/sample rate and higher colour/sample depth. Always remember to convert bits to Bytes (divide by 8) and use 1024 for larger unit conversions.
3. Data Compression
Digital files often contain a lot of repeated or unnecessary information. Compression is used to reduce the size of the file, making it more efficient to store and transmit.
3.1 Purpose and Need for Compression (1.3.3)
Why do we compress data?
- Less storage space required: Large files take up less room on hard drives or in the cloud.
- Less bandwidth required: The data stream is smaller, which is essential for streaming videos or downloading files over the internet.
- Shorter transmission time: Smaller files transfer faster, improving performance and speed when sending emails or uploading.
3.2 Lossy Compression (1.3.4)
Definition
Lossy compression reduces file size by permanently removing data that is deemed less important to human perception (such as high-frequency sounds or small colour variations).
The file size reduction is usually high, but the original data cannot be perfectly restored.
Examples of Data Removal:
- Images (JPEG): Reducing the colour depth or resolution.
- Sound (MP3): Reducing the sample rate or sample resolution.
- Sound: Removing very high or very low frequencies that the human ear barely notices.
Analogy: Lossy compression is like summarising a novel. You keep the main plot (the important data), but you lose the exact wording and details (the less important data). You can't get the original novel back just from the summary.
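To see why lossy compression cannot be undone, here is a very simplified sketch of reducing the sample resolution from 16 bits to 8 bits. The sample values are invented for illustration, and real formats such as MP3 use far more sophisticated methods than this.

```python
# Hypothetical 16-bit sample values, chosen only for illustration.
samples_16bit = [12345, 20000, 513, 65535]

# "Compress": keep only the top 8 bits of each sample (half the storage).
samples_8bit = [s >> 8 for s in samples_16bit]

# Try to restore: scale the 8-bit values back up to the 16-bit range.
restored = [s << 8 for s in samples_8bit]

print(samples_8bit)               # [48, 78, 2, 255]
print(restored)                   # [12288, 19968, 512, 65280]
print(restored == samples_16bit)  # False
```

The restored values are close to the originals but not identical: the discarded bits are gone for good.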
3.3 Lossless Compression (1.3.4)
Definition
Lossless compression reduces file size by finding patterns and encoding the data more efficiently, but without permanently removing any data.
The original file can be reconstructed exactly as it was before compression. The file size reduction is usually smaller than lossy compression.
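The "reconstructed exactly" property can be demonstrated with Python's built-in `zlib` module. Note that `zlib` uses the DEFLATE method rather than anything named in these notes; it is used here only as a convenient lossless compressor, and the test data is made up.

```python
import zlib

# Made-up, highly repetitive data (600 bytes).
original = b"AAAAAAAAAABBBBBBBBBBCCCCCCCCCC" * 20

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # the compressed size is smaller
print(restored == original)            # True
```

Every single byte comes back unchanged, which is exactly what lossless means.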
Key Method: Run Length Encoding (RLE)
RLE is a simple form of lossless compression often used for files containing lots of repetition (like simple bitmap images, where long lines of the same colour appear).
How RLE Works (Step-by-Step):
- Instead of storing a long sequence of identical data values, RLE stores the data value and the number of times it repeats (the "run length").
- The original sequence of data is replaced by a shorter pair of values: (Count, Value).
Example of RLE:
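Imagine a line of 15 pixels made up of three runs, where each run is a group of neighbouring pixels that all share the same colour.
Uncompressed, this is 15 pieces of data.
Using RLE, we encode each run as a single (Count, Value) pair.
This is only 6 pieces of data (three pairs), achieving significant compression!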
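A minimal sketch of RLE is shown here. The pixel line is a hypothetical one (5 white, 4 black, 6 red, written as W, B and R) chosen to give 15 values in three runs, and the function names `rle_encode` and `rle_decode` are illustrative.

```python
def rle_encode(values):
    """Replace each run of identical values with a (count, value) pair."""
    pairs = []
    for value in values:
        if pairs and pairs[-1][1] == value:
            # Same value as the current run: increase its count.
            pairs[-1] = (pairs[-1][0] + 1, value)
        else:
            # A new value starts a new run.
            pairs.append((1, value))
    return pairs

def rle_decode(pairs):
    """Rebuild the original sequence from (count, value) pairs."""
    values = []
    for count, value in pairs:
        values.extend([value] * count)
    return values

pixels = list("WWWWWBBBBRRRRRR")      # hypothetical line of 15 pixels
encoded = rle_encode(pixels)

print(encoded)                        # [(5, 'W'), (4, 'B'), (6, 'R')]
print(rle_decode(encoded) == pixels)  # True
```

Decoding gives back the original 15 pixels exactly, which is why RLE counts as lossless. RLE only helps when there are long runs: data with few repeats could even end up larger after encoding.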
Lossless Examples:
- Text files and archives (ZIP, RAR): Text and program files need to be perfectly restored, so lossless methods are required.
- Images (PNG, GIF): Used where image quality integrity is essential.
Quick Review: Lossy vs. Lossless
Lossy (Loss-y): You lose data. Used for images (JPEG) and sound (MP3). Best for maximum size reduction.
Lossless (Loss-less): You lose nothing (no data is permanently removed). Used for archives (ZIP) and perfect-quality images (PNG). Uses methods like RLE.