Welcome to Data Storage and Compression!

Hello future computer scientists! This chapter is incredibly important because it explains how all the information we use every day—photos, videos, programs, and documents—actually fits inside our devices and travels across the internet.

Think of data storage as finding the right sized box for your stuff, and data compression as learning how to fold your clothes perfectly so they all fit in the box! We will explore the units used to measure data, the different types of storage devices, and the clever methods computers use to shrink file sizes.

Don't worry if some of the terms seem new; we will break everything down step-by-step. Let's dive in!


Section 1: Measuring Data Storage Capacity

1.1 The Basics: Bits and Bytes

Everything a computer processes is just electrical signals that are either ON or OFF. We represent these states using 1s and 0s.

  • Bit (b): The smallest unit of data. It is a single binary digit (1 or 0).
  • Nibble: 4 bits (half a byte).
  • Byte (B): 8 bits. This is the fundamental unit used to store a single character (like the letter 'A' or the number '5').

Memory Aid: Think of the word 'bite': a byte is the basic 'bite' of data that a computer takes when it stores information.
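
To see bits, nibbles and bytes in action, here is a minimal Python sketch. It assumes the character is stored using ASCII (one byte per character); the variable names are chosen only for illustration.

```python
# Assumption: the character is encoded as ASCII, so it fits in one byte.
char = "A"
code = ord(char)              # the numeric code for the character
bits = format(code, "08b")    # the same number written as 8 binary digits

# Two nibbles (4 bits each) make up one byte.
high_nibble, low_nibble = bits[:4], bits[4:]

print(char, "->", code, "->", bits)                # A -> 65 -> 01000001
print("Nibbles:", high_nibble, low_nibble)         # Nibbles: 0100 0001
print("Bytes needed:", len(char.encode("ascii")))  # Bytes needed: 1
```

Changing char to "5" gives the bit pattern 00110101, which shows that the character '5' is stored as a code, not as the number five.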

1.2 Larger Units of Measurement

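In this chapter, storage units are based on powers of 2: each unit is 1024 times the one before it, not 1000 times (the multiplier used in standard metric measurements like kilometres). Strictly speaking, the 1024-based units have their own names (kibibyte, mebibyte and so on), and drive manufacturers often quote capacities using 1000, but we use the 1024 convention throughout.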

Why 1024? Because \(2^{10} = 1024\), and computers deal with binary powers!

Unit        Abbreviation    Size Equivalent
Kilobyte    KB              1024 Bytes
Megabyte    MB              1024 KB
Gigabyte    GB              1024 MB
Terabyte    TB              1024 GB
Petabyte    PB              1024 TB

A modern smartphone might have 128 GB of storage, while a large server farm might manage multiple Petabytes!
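
The conversions between units can be sketched in a few lines of Python. This is only an illustration using the 1024 multiplier from this section; the function names to_bytes and human_readable are made up for this example.

```python
# Units in size order; each one is 1024 times the one before it.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]

def to_bytes(value, unit):
    """Convert a value in the given unit into bytes."""
    return value * 1024 ** UNITS.index(unit)

def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the number >= 1."""
    size = float(num_bytes)
    index = 0
    while size >= 1024 and index < len(UNITS) - 1:
        size /= 1024
        index += 1
    return f"{size:g} {UNITS[index]}"

print(to_bytes(128, "GB"))           # 137438953472
print(human_readable(1536))          # 1.5 KB
print(human_readable(137438953472))  # 128 GB
```

Notice that a 128 GB phone holds over 137 billion bytes, because each step up the table multiplies by 1024.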

Quick Review: Capacity Order

The units go in size order: B, KB, MB, GB, TB, PB.

Key Takeaway: All computer data storage capacity is measured using these units, where the key multiplier is 1024, not 1000.


Section 2: Data Storage Devices

Not all storage is created equal. We classify storage devices based on how they work, how fast they are, and whether they lose data when powered off.

2.1 Primary vs. Secondary Storage

  • Primary Storage: This is memory directly accessible by the CPU (like RAM). It is very fast but usually volatile (it loses its data when the power is turned off).
  • Secondary Storage: This is used for long-term storage of files and programs (HDD, SSD, etc.). It is non-volatile (data is kept even when the power is off) but slower than primary storage.

2.2 Common Secondary Storage Devices

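We need to understand the characteristics and suitability of the three main types of secondary storage: Magnetic, Solid State, and Optical. Magnetic tape, a form of magnetic storage used mainly for archiving, is covered separately at the end.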

A. Magnetic Storage (e.g., Hard Disk Drive - HDD)

HDDs store data using magnetised spots on rapidly spinning disks (platters). A moving read/write head accesses the data.

  • Characteristics: High capacity (up to 20 TB or more), relatively cheap per GB.
  • Suitability: Used for desktop computers, servers, and systems needing large storage volume at low cost.
  • Drawbacks: Contains moving parts, which make it slower, more prone to damage, and a source of noise and heat.

B. Solid State Storage (e.g., Solid State Drive - SSD, USB sticks)

SSDs use electrical circuits (flash memory chips) to store data. There are no moving parts.

  • Characteristics: Extremely fast read/write speeds, high portability (USB sticks).
  • Suitability: Used in modern laptops, smartphones, and devices where speed and durability are critical.
  • Drawbacks: More expensive per GB than HDD, has a limited number of write cycles (though this limit is now very high).

Analogy: Comparing an HDD to an SSD is like comparing an old record player (slow spinning parts) to a modern digital playlist (instant access chips).

C. Optical Storage (e.g., CDs, DVDs, Blu-ray discs)

Optical storage uses a laser to read tiny pits and lands (the flat areas between pits) on a reflective surface.

  • Characteristics: Durable (if not scratched), relatively low capacity compared to HDDs/SSDs.
  • Suitability: Distributing software, movies, music, and archiving data for long periods.
  • Access: Much slower than HDDs and SSDs.

D. Magnetic Tape (Offline/Archival Storage)

Magnetic tape stores data sequentially on large reels.

  • Suitability: Used primarily for archiving and large-scale corporate backups (backing up huge amounts of data that won't need to be accessed quickly).
  • Access: The drive must wind through the tape in order (sequential access) until it reaches the desired file, making access extremely slow.
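
The difference between sequential access (tape) and direct access (disk or flash) can be modelled with a toy Python sketch. The file names and the function items_passed_on_tape are hypothetical, and a real tape drive is far more complex; the point is only that reaching a late item means passing every earlier one.

```python
# Hypothetical backup files, written to the tape in this order.
tape = ["mon.bak", "tue.bak", "wed.bak", "thu.bak", "fri.bak"]

def items_passed_on_tape(tape, wanted):
    """Count how many items a tape drive winds past to reach `wanted`."""
    for position, name in enumerate(tape, start=1):
        if name == wanted:
            return position
    raise KeyError(wanted)

# A direct-access device can jump straight to a known location instead
# (modelled here as a lookup table from file name to position).
index = {name: position for position, name in enumerate(tape)}

print(items_passed_on_tape(tape, "fri.bak"))  # 5: every earlier file is passed first
print(tape[index["fri.bak"]])                 # fri.bak, found in one jump
```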

Quick Comparison Table (Focus on Speed and Volatility):

RAM (Primary): Very Fast, Volatile (Data lost when power off).
SSD (Secondary): Very Fast, Non-Volatile (Data kept when power off).
HDD (Secondary): Slow/Medium, Non-Volatile.
Tape (Archival): Very Slow, Non-Volatile.

Key Takeaway: Choose the right storage based on required speed, capacity, portability, and cost. SSDs are fast and durable; HDDs are cheap and large; Optical is good for distribution; Tape is best for deep, long-term archives.


Section 3: Data Compression

Data compression is the process of reducing the size of a file so it takes up less storage space and transfers faster across networks.

3.1 Why Do We Compress Data?

  • Saves Space: We can fit more files onto a hard drive or USB stick.
  • Faster Transmission: Smaller files take less time to download, upload, or email.
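
The saving in transmission time can be estimated with simple arithmetic: time = file size in bits ÷ connection speed in bits per second. The sketch below uses made-up figures (a 5 MB file compressed to 1 MB, sent over a 10 megabit-per-second link) and assumes the link runs at full speed the whole time, which real networks rarely do.

```python
def transfer_seconds(size_bytes, bits_per_second):
    """Estimate the transfer time, ignoring network overheads."""
    return size_bytes * 8 / bits_per_second   # 8 bits in every byte

MB = 1024 * 1024          # bytes in a megabyte (1024 convention)
LINK_SPEED = 10_000_000   # assumed link speed: 10 megabits per second

original_size = 5 * MB    # hypothetical uncompressed file
compressed_size = 1 * MB  # hypothetical size after compression

print(f"{transfer_seconds(original_size, LINK_SPEED):.2f} s")    # 4.19 s
print(f"{transfer_seconds(compressed_size, LINK_SPEED):.2f} s")  # 0.84 s
```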

3.2 Lossy Compression

Lossy compression permanently removes some data from the file. Once the data is removed, you cannot get it back.

  • How it works: It removes details that the human eye or ear is unlikely to notice.
  • Result: Significant reduction in file size, but a slight reduction in quality.
  • Suitability: Generally used for multimedia where a little lost quality is acceptable (e.g., pictures, audio).
  • Examples: JPEG (images), MP3 (audio), MPEG (video).

Analogy: Lossy compression is like summarizing a long novel. You keep the main plot points (most important data) but throw away some descriptive details (less important data). You can't reconstruct the original novel exactly.
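
Here is a toy Python sketch of the idea behind lossy compression. It is not how JPEG or MP3 actually work (they use much more sophisticated techniques); it simply rounds sample values down to coarser levels so each one needs fewer bits. The sample values and the function names quantise and restore are invented for this example.

```python
def quantise(samples, step):
    """Keep only which 'step-sized band' each sample falls into."""
    return [sample // step for sample in samples]

def restore(levels, step):
    """Rebuild approximate samples; the discarded detail cannot be recovered."""
    return [level * step for level in levels]

# Hypothetical 8-bit samples (values from 0 to 255).
samples = [12, 13, 15, 200, 203, 198, 64, 66]

levels = quantise(samples, 16)    # values now fit in 4 bits (0 to 15)
approx = restore(levels, 16)

print(levels)  # [0, 0, 0, 12, 12, 12, 4, 4]
print(approx)  # [0, 0, 0, 192, 192, 192, 64, 64]
```

The restored values are close to the originals but not identical, and nothing in the compressed data says whether the first sample was 12, 13 or 15. That is the "permanently removed" part.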

3.3 Lossless Compression

Lossless compression reduces file size by identifying and removing redundant (repeated) data without losing any information. The original file can be perfectly reconstructed from the compressed file.

  • How it works: It uses algorithms to encode repetitive patterns or common sequences with shorter codes.
  • Result: Reduced file size with zero loss of quality.
  • Suitability: Used for text files, program code, and images where accuracy is crucial.
  • Examples: ZIP (archiving folders), PNG (images), GIF.

Analogy: Lossless compression is like creating a perfect, tidy instruction manual for building flatpack furniture. Everything is still there, but arranged much more efficiently.
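
A quick way to check the "perfect reconstruction" claim is to compress some data and then decompress it again. This sketch uses Python's built-in zlib module, which implements DEFLATE, the method used inside most ZIP files. The repeated test text is an assumption chosen because repetitive data compresses well.

```python
import zlib

# Hypothetical, highly repetitive data.
original = b"Computer Science " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print("Original size:  ", len(original), "bytes")
print("Compressed size:", len(compressed), "bytes")
print("Identical after decompression:", restored == original)
```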

Common Mistake to Avoid:

Students often confuse the two. Remember: Lossy means quality Lost. Lossless means No Loss of data.

3.4 Compression Methods

How do computers actually achieve compression? Two common techniques are Run Length Encoding and Dictionary Encoding.

A. Run Length Encoding (RLE)

RLE is a simple, lossless technique that works best on files with long sequences (runs) of the same data, like certain types of images (e.g., simple black and white graphics).

Step-by-Step Example:

  1. Look for consecutive identical characters or data units.
  2. Replace the sequence with the number of times it appears, followed by the unit itself.

Original Data: B B B B W W W W W R R R R R R
Compressed (RLE): 4B 5W 6R

The original string had 15 characters, while the compressed string has only 6 (counting each number and each letter), so the data has been compressed.
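
The two steps above can be sketched in Python as follows. This is a minimal illustration, not a standard file format: it assumes the data is a string of letters written without spaces, and the function names rle_encode and rle_decode are invented for this example.

```python
from itertools import groupby

def rle_encode(data):
    """Replace each run of identical symbols with its count and the symbol."""
    return " ".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(data))

def rle_decode(encoded):
    """Rebuild the original data from 'count + symbol' tokens."""
    # In each token the last character is the symbol; the rest is the count.
    return "".join(token[-1] * int(token[:-1]) for token in encoded.split())

encoded = rle_encode("BBBBWWWWWRRRRRR")
print(encoded)              # 4B 5W 6R
print(rle_decode(encoded))  # BBBBWWWWWRRRRRR
```

RLE only helps when there are runs. Encoding "ABC" gives "1A 1B 1C", which is longer than the original.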

B. Dictionary Encoding (or Lempel-Ziv variants)

This lossless method replaces common, repeated patterns or words with a short code or pointer that is stored in a 'dictionary'.

  • How it works: The algorithm scans the data, finds phrases or sequences that occur often, and adds them to a reference list (the dictionary).
  • Every time that sequence is found again, it is replaced by the much shorter dictionary index/code.

Example: If the phrase "Computer Science" appears 100 times in a document, the dictionary might assign it the code #15. Instead of storing 16 characters repeatedly, the file only stores the 3-character code #15, saving space.
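
One well-known Lempel-Ziv variant is LZW, which builds its dictionary automatically while it scans the data. The Python sketch below is a simplified teaching version, not the exact scheme used by any particular file format. It assumes the text contains only characters with codes below 256, and the function names are invented for this example.

```python
def lzw_compress(text):
    """Turn text into a list of dictionary codes (simplified LZW)."""
    # Start with one dictionary entry for every single character.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate              # keep growing the match
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # learn a new sequence
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes):
    """Rebuild the text by regrowing the same dictionary from the codes."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            entry = previous + previous[0]   # sequence defined by this very step
        else:
            raise ValueError("invalid code")
        pieces.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)

text = "Computer Science " * 100   # hypothetical repetitive document
codes = lzw_compress(text)
print(len(text), "characters ->", len(codes), "codes")
print("Perfectly restored:", lzw_decompress(codes) == text)
```

The decompressor never needs to be sent the dictionary: it rebuilds the same entries in the same order as it reads the codes.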

Key Takeaway: Lossy compression sacrifices quality for maximum size reduction (MP3/JPEG), while Lossless compression achieves perfect reconstruction by eliminating redundancy (ZIP/PNG) using methods like RLE or Dictionary Encoding.


Chapter Summary Review

Key Concepts to Remember:
  • Storage capacity units are based on 1024 (Byte up to Petabyte).
  • Primary Storage is fast and volatile; Secondary Storage is slower and non-volatile.
  • SSDs are faster and more durable; HDDs are cheaper per GB and offer larger capacities.
  • Lossy compression loses data permanently (JPEG), used mainly for media.
  • Lossless compression retains all original data (ZIP), used for text and programs.
  • RLE is a method of lossless compression that counts repeating sequences.