Unit 3: Manipulating Data - Study Notes

Hello future IT professional! Welcome to the "Manipulating Data" chapter. Don't worry if this sounds complicated—it's essentially the process of taking raw information and making it useful, reliable, and easy to find. Think of it like organizing your messy room so you can actually find your favorite jacket!

In this unit, we will explore the fundamental steps needed to ensure data is high quality and learn the essential techniques computers use to quickly organize and retrieve information.


Foundations: Data Types

Before we manipulate data, the computer needs to know what kind of data it is dealing with. Assigning a data type tells the system how much memory to allocate and what operations are allowed on that data.

Imagine you have three different boxes: one for apples, one for oranges, and one for money. You wouldn't put money in the apple box! Data types are the labels on those boxes.

  • Integer (INT): Used for whole numbers (numbers without a fractional part).
    Examples: 10, -5, 4000. Used for counting items or ages.
  • Real / Floating Point (FLOAT): Used for numbers that include a decimal or fractional part.
    Examples: 10.5, 3.14159, -0.01. Used for measurements or currency calculations.
  • Boolean (BOOL): Used for data that can only have two possible states: True or False (often represented as 1 or 0).
    Examples: IsLoggedIn (True), IsActive (False). Used for logical checks.
  • Character / String (CHAR/STRING): Used for text, including letters, symbols, spaces, and numbers treated as text.
    Examples: "Hello World", "123 Main Street", "A". A telephone number is often stored as a string because you don't perform math on it.

Quick Review Box: Data types ensure memory is used efficiently and prevent logical errors (like trying to add a name to a number).
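As a quick illustration, here is how the four types above look in Python (a language that infers the type of each value automatically; the phone number is a made-up example):

```python
# Python works out the type of each value from how it is written.
quantity = 10             # int: a whole number (counting items)
price = 3.99              # float: a number with a fractional part
is_logged_in = True       # bool: only True or False
phone = "0117 496 0000"   # str: digits stored as text, so no math on them

print(type(quantity).__name__)      # int
print(type(price).__name__)         # float
print(type(is_logged_in).__name__)  # bool
print(type(phone).__name__)         # str
```

Notice that the phone number goes in quotes: stored as a string, it keeps its leading zero and spaces, which an integer would lose.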


Section 1: Data Quality – Validation vs. Verification

The quality of your output depends entirely on the quality of your input. This is often summarized as: Garbage In, Garbage Out (GIGO).

We use two critical processes to ensure data quality: Validation and Verification. Students often confuse these, so let’s look closely!

Data Validation: Is the data REASONABLE?

Data Validation checks if the input data meets predefined rules and is reasonable or acceptable. It does *not* check if the data is correct, only if it is in the right format or range.

Analogy: If you ask for someone's age, validation checks if the number is between 1 and 150. It won't check if the person is lying about being 50!

Types of Validation Checks
  • Presence Check: Ensures that a required field has been filled in (is not left blank).
    Example: A username field on a signup form must have data entered.
  • Range Check: Ensures that data falls between an upper and lower limit.
    Example: Hours worked per week must be between 1 and 60.
  • Format Check: Ensures the data is entered in a specific pattern or structure.
    Example: A postcode must match a pattern such as LLNN NLL (L = letter, N = number), e.g., SW12 9AA.
  • Length Check: Ensures the data contains a minimum or maximum number of characters.
    Example: A password must be at least 8 characters long.
  • Lookup Table/Value List: Checks the data against a list of acceptable values stored elsewhere.
    Example: When selecting a country, the input must match one of the options in the pre-approved list of countries.
  • Check Digit: A special digit added to the end of a long number (like an ISBN or barcode) which is calculated from the other digits. The system recalculates the check digit upon entry to ensure the number was typed correctly.
    Example: Used widely in banking (account numbers) and product codes.
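As a rough sketch, several of these checks can be written as small Python functions. The approved-country list and sample values are made up for illustration; the check-digit function uses the standard ISBN-10 rule (weights 10 down to 1, total must divide by 11, with "X" standing for 10):

```python
def presence_check(value: str) -> bool:
    """Presence check: the field must not be left blank."""
    return value.strip() != ""

def range_check(hours: int) -> bool:
    """Range check: hours worked per week must be between 1 and 60."""
    return 1 <= hours <= 60

def length_check(password: str) -> bool:
    """Length check: a password must be at least 8 characters long."""
    return len(password) >= 8

def lookup_check(country: str) -> bool:
    """Lookup check: the input must match a pre-approved list."""
    approved = {"France", "Germany", "Japan"}   # made-up sample list
    return country in approved

def isbn10_check_digit_valid(isbn: str) -> bool:
    """Check digit: valid when the weighted sum of the 10 characters
    (weights 10 down to 1, 'X' meaning 10) is divisible by 11.
    Assumes the other characters are plain digits."""
    if len(isbn) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), isbn):
        digit = 10 if ch == "X" else int(ch)
        total += weight * digit
    return total % 11 == 0

print(presence_check(""))                      # False: blank field
print(range_check(40))                         # True: within 1-60
print(length_check("pass"))                    # False: too short
print(lookup_check("Japan"))                   # True: on the list
print(isbn10_check_digit_valid("0306406152"))  # True: check digit OK
```

Remember: all of these checks only confirm the input is reasonable. A valid-looking ISBN could still be the wrong book!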

Memory Aid for Validation: Validation ensures the data follows the Rules (Range, Format, Length, etc.).


Data Verification: Is the data ACCURATE?

Data Verification checks if the data entered matches the original source data. It ensures accuracy.

Analogy: Verification is checking the number on your shopping receipt against the price tag you saw on the shelf.

Methods of Verification
  • Double Entry: The data is input twice, often by two different people or by the same person prompted to re-enter. The computer compares the two entries. If they match, it assumes the data is accurate.
    Common Mistake: If the same typo is made twice, double entry won't catch it.
  • Visual Check: A person manually compares the data displayed on screen against the original source document (e.g., checking a scanned form against the original paper).
    Trade-off: Simple and effective for smaller datasets, but slow and prone to human error for large volumes.
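Double entry boils down to a simple comparison of the two typed versions. A minimal sketch, including the typo-typed-twice limitation noted above (names are made up):

```python
def double_entry_matches(first: str, second: str) -> bool:
    """Double entry verification: accept only if both entries match."""
    return first == second

print(double_entry_matches("S. Smith", "S. Smith"))  # True: accepted
print(double_entry_matches("S. Smith", "S. Smyth"))  # False: re-enter

# Limitation: the same typo made twice still slips through.
print(double_entry_matches("S. Smyth", "S. Smyth"))  # True, but still wrong!
```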

Key Takeaway: Validation ensures data is sensible; Verification ensures data is correct.


Section 2: Manipulating Data – Organization and Retrieval

Once data is input and validated, we often need to organize it to make it useful. The two most crucial manipulation operations are sorting and searching.

Sorting Data

Sorting is the process of arranging data records into a specific order, either ascending (A-Z, 1-10) or descending (Z-A, 10-1).

Why Sort? It makes it much easier for humans to read reports, but more importantly, it makes computerized searching significantly faster.

Did You Know? Databases usually allow you to sort by multiple fields (e.g., Sort by Last Name, and then by First Name within the same Last Name group).
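The multi-field sort described above can be sketched in Python by sorting on a tuple of fields: records are ordered by the first field, and ties are broken by the second (the names are made up):

```python
people = [
    ("Smith", "Zoe"),
    ("Jones", "Amy"),
    ("Smith", "Alan"),
]

# Sort by last name first; within equal last names, sort by first name.
people.sort(key=lambda record: (record[0], record[1]))

print(people)  # [('Jones', 'Amy'), ('Smith', 'Alan'), ('Smith', 'Zoe')]
```

Note how the two Smiths stay together but end up ordered by first name, exactly the "sort within the same Last Name group" behaviour described above.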


Searching Data

Searching is the process of finding a specific piece of data within a large collection of records.

Linear Search (Sequential Search)

This is the simplest search technique. It checks every single item in the dataset, starting from the beginning, until the target item is found or the end of the list is reached.

  • Process:
    1. Start at the first item.
    2. Is this the item we are looking for?
    3. If yes, stop.
    4. If no, move to the next item and repeat step 2.
  • Advantage: Works on unsorted data.
  • Disadvantage: Very slow for large datasets, especially if the target item is near the end or doesn't exist.
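The four steps above map directly onto a short Python function (the sample data is made up, and is deliberately unsorted to show that linear search does not care):

```python
def linear_search(items, target):
    """Check each item in turn; return its position, or -1 if absent."""
    for index, item in enumerate(items):  # step 1: start at the first item
        if item == target:                # step 2: is this the item?
            return index                  # step 3: if yes, stop
    return -1                             # reached the end without a match

data = [34, 7, 19, 2, 55]        # works even though the list is unsorted
print(linear_search(data, 19))   # 2: found at the third position
print(linear_search(data, 99))   # -1: had to check every item first
```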

Binary Search

The Binary Search is a highly efficient searching algorithm, but it has one vital prerequisite:

Prerequisite: The data MUST be sorted first.

Analogy: Binary search is like finding a word in a dictionary. You don't look at every page; you jump straight to the middle!

The Binary Search works by repeatedly halving the portion of the list that could contain the item.

Step-by-Step Binary Search (The Divide and Conquer Method):

  1. Find the middle item of the list.
  2. Compare the middle item with the target item:
    • If they match, the search is complete.
    • If the target item is smaller than the middle item, ignore the entire upper half of the list (and the middle item).
    • If the target item is larger than the middle item, ignore the entire lower half of the list (and the middle item).
  3. Repeat the process (starting from step 1) on the remaining half of the list until the item is found or the list is exhausted.
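The divide-and-conquer steps above can be sketched as the following function (sample data made up; note the list is sorted in ascending order, as the prerequisite demands):

```python
def binary_search(sorted_items, target):
    """Repeatedly halve the search range of an ascending sorted list."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:                    # step 3: repeat until exhausted
        mid = (low + high) // 2           # step 1: find the middle item
        if sorted_items[mid] == target:   # step 2: match found
            return mid
        elif target < sorted_items[mid]:
            high = mid - 1                # ignore the upper half
        else:
            low = mid + 1                 # ignore the lower half
    return -1                             # list exhausted: not found

data = [2, 7, 19, 34, 55]        # prerequisite: the data MUST be sorted
print(binary_search(data, 34))   # 3: found in just two comparisons
print(binary_search(data, 5))    # -1: not in the list
```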

Key Advantage: Extremely fast, especially for large lists, because it eliminates half the possible items in every comparison.

Key Disadvantage: Requires the dataset to be sorted initially, which takes time.


Indexing

Sometimes, we need to search quickly but don't want to physically re-sort the entire massive database every time a new record is added. This is where Indexing comes in.

An Index is a separate, structured list of primary key values and pointers (memory addresses) that tell the system where the full record is stored.

Analogy: An index in a textbook lists keywords (the primary key) and the page numbers (the pointer). You use the index to find information quickly without having to read every page of the book.

  • Process: Instead of searching the huge main data file, the system searches the much smaller and already sorted index file.
  • Benefit: Faster retrieval of records because the index is smaller and optimized for searching (often using the Binary Search technique). The main data file remains untouched and unsorted, allowing for faster insertions and updates of new records.
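As a rough sketch, an index can be modelled as a small dictionary that maps each primary key to the position of the full record in the main file (all record values here are made up):

```python
# The main "file": records stay in insertion order, unsorted.
records = [
    {"id": 204, "name": "Asha"},
    {"id": 101, "name": "Ben"},
    {"id": 350, "name": "Cara"},
]

# The index: primary key -> position of the full record (the "pointer").
index = {record["id"]: position for position, record in enumerate(records)}

# Retrieval searches only the small index, not the whole file.
print(records[index[101]]["name"])  # Ben

# Adding a record doesn't re-sort the main file; only the index updates.
records.append({"id": 275, "name": "Dev"})
index[275] = len(records) - 1
print(records[index[275]]["name"])  # Dev
```

Here the "pointer" is just a list position, but the idea is the same as a real database index: the main data stays put, and only the compact index is kept search-ready.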

Key Takeaway: Sorting makes searching (Binary Search) possible and fast. Indexing achieves fast retrieval without having to physically reorganize the entire dataset.