Welcome to Option A: Databases!
Hi future Computer Scientists! This option is crucial because databases are the silent engine of the modern world. Every time you log into a social media site, check your grades, or buy something online, you are interacting with a complex database system.
In this chapter, we will move beyond simple spreadsheets and learn how to structure, manage, and query massive amounts of interconnected data efficiently and securely. Don't worry if terms like normalization sound intimidating—we’ll break them down using clear, everyday analogies!
1. Fundamentals of Data Management
1.1 Data vs. Information
First, let's clarify the difference between two terms often used interchangeably:
- Data: Raw, unprocessed facts and figures (e.g., 1987, Smith, $500.00).
- Information: Data that has been processed, organized, and structured, making it meaningful (e.g., Mr. Smith, born in 1987, earned $500.00 this month.).
1.2 The Database Approach vs. Flat Files
Before dedicated database systems, people used flat files (like simple text documents or basic spreadsheets) to store data.
Flat File Limitations (Why we need Databases):
- Data Redundancy: The same information (like a customer's address) is stored multiple times, wasting space.
- Data Inconsistency: If the address is stored 10 times, and 3 are updated incorrectly, the data becomes unreliable.
- Data Dependence: The data structure is tightly linked to the application accessing it, making updates difficult.
- Limited Security: Hard to control who sees specific fields or records.
A modern Database Management System (DBMS) solves these issues by centralizing data, controlling access, and defining clear relationships between different pieces of data.
Quick Review: Key Takeaway 1
A DBMS allows controlled, consistent, and structured storage, moving past the problems of redundant flat files.
2. The Relational Database Structure (RDBMS)
The most common type of database you will study is the Relational Database Management System (RDBMS). This system organizes data into tables (called relations) which are linked together by shared keys.
2.1 Core Database Terminology
Think of a single spreadsheet (a Table) where the rows and columns have specific names:
- Table (Relation): A collection of related data organized into rows and columns. (Example: The STUDENTS table)
- Field (Attribute): A column in the table, defining a specific piece of data. (Example: Student_ID, Name, Date_of_Birth)
- Record (Tuple): A row in the table, representing a single complete set of related data. (Example: All the data for one specific student)
2.2 Understanding Keys
Keys are essential for enforcing structure and linking tables. They are attributes (fields) that uniquely identify records or create relationships.
a) Primary Key (PK)
A field (or set of fields) that uniquely identifies every record in a table.
- Rule: The PK cannot be empty (Entity Integrity) and must be unique.
- Analogy: Your unique student ID number.
b) Foreign Key (FK)
A field in one table that refers to the Primary Key in another table. This is how relationships are established.
- Analogy: A list of students (in the STUDENTS table) includes a "Course_ID". This Course_ID is the Foreign Key, linking back to the unique Primary Key (Course_ID) in the COURSES table.
- Did you know? A Foreign Key doesn't have to be unique in its own table, as many students can take the same course.
c) Composite Key
A Primary Key that is made up of two or more fields combined. This is needed when a single field isn't unique enough.
- Example: In a table tracking registrations, the combination of (Student_ID + Course_ID) might be needed to uniquely identify a single registration event.
3. Data Integrity and Consistency
3.1 Data Integrity
Data Integrity refers to the accuracy, completeness, and consistency of the data in the database. A healthy database requires high integrity.
- Entity Integrity: Ensures every record has a unique identity (Primary Keys cannot be NULL).
- Domain Integrity: Ensures that data entered into a field falls within acceptable boundaries (e.g., age must be > 0).
3.2 Referential Integrity (RI)
This is one of the most critical concepts in RDBMS. Referential Integrity ensures that the relationships between tables remain consistent.
- The Rule: Any Foreign Key value must either match a Primary Key value in the referenced table, or it must be NULL.
- Common Mistake to Avoid: Deleting a record in the "primary" table when there are still records in the "foreign" table referring to it. This creates a "dangling reference" or orphan record.
Example: If you delete Course 'CS101' from the COURSES table, but there are 50 students in the STUDENTS table still listing 'CS101' as their course, you have violated referential integrity.
4. Database Design: Normalization
Normalization is a systematic process used to organize tables and fields to minimize redundancy and dependency. It makes the database design efficient, easier to maintain, and more reliable.
Don't worry if this seems tricky at first. Think of normalization as cleaning up a messy, overcrowded filing cabinet so that everything is stored in its proper, specialized place.
4.1 Normal Forms (NF)
4.1.1 First Normal Form (1NF)
Requirement: Eliminate repeating groups (lists within cells) and ensure data is atomic.
- Atomic Data: Each column must contain only one value, and that value cannot be subdivided further (e.g., separate first name and last name fields).
- Goal: Every intersection of a row and column contains a single, indivisible value.
If you have a cell that lists "Course A, Course B, Course C," you are not in 1NF. You must move those into separate records or a separate junction table.
4.1.2 Second Normal Form (2NF)
Requirement: Must be in 1NF, AND all non-key attributes must depend on the entire Primary Key.
- This is only relevant if you have a Composite Key (a PK made of two or more fields).
- Goal: Ensure you don't store data related to only *part* of the composite key within that table. If you do, move it to a separate table.
Step-by-step example:
1. Table: ORDERS (OrderID, ProductID, Product_Name, Product_Price)
2. PK is (OrderID + ProductID).
3. Notice: Product_Name and Product_Price depend only on ProductID, not the whole key.
4. Solution: Create a separate PRODUCTS table (ProductID, Product_Name, Product_Price) and link it back to the ORDERS table. Now the ORDERS table is in 2NF.
4.1.3 Third Normal Form (3NF)
Requirement: Must be in 2NF, AND eliminate transitive dependencies.
- Transitive Dependency: A non-key attribute determines another non-key attribute (A -> B -> C, but A is the PK). Data in the table does not depend solely on the PK, but on another non-key field.
- Goal: Ensure all fields depend only on the Primary Key, and nothing else.
Example:
1. Table: EMPLOYEES (Employee_ID, Employee_Name, Department_ID, Department_Manager)
2. Employee_ID is the PK.
3. The Department_Manager depends on the Department_ID, not the Employee_ID. This is transitive.
4. Solution: Create a separate DEPARTMENTS table (Department_ID, Department_Manager) and link it back. Now the EMPLOYEES table is in 3NF.
Memory Aid (3NF): The three rules for 3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
Quick Review: Normalization
We normalize to 3NF to prevent redundancy, minimize update/delete anomalies, and ensure data integrity.
5. Database Querying: Structured Query Language (SQL)
SQL (Structured Query Language) is the standard language used to interact with and manage relational databases. It is essential for retrieving, inserting, modifying, and deleting data.
5.1 Data Manipulation Language (DML)
DML commands handle the actual data stored within the tables.
a) SELECT (Retrieving Data)
Used to retrieve records from one or more tables. This is the most common command.
Syntax structure:
SELECT [columns] FROM [table] WHERE [condition] ORDER BY [column]
Example 1 (Basic Retrieval): Retrieve all student names.
SELECT Name FROM Students;
Example 2 (Conditional Retrieval): Retrieve the names of students older than 18.
SELECT Name FROM Students WHERE Age > 18;
b) INSERT (Adding Data)
Used to add new records (rows) into a table.
Syntax structure:
INSERT INTO [table] (column1, column2, ...) VALUES (value1, value2, ...);
Example:
INSERT INTO Courses (Course_ID, Title) VALUES ('CSHL', 'HL Computer Science');
c) UPDATE (Modifying Existing Data)
Used to change the values of existing records in a table. Be careful with the WHERE clause!
Syntax structure:
UPDATE [table] SET [column] = [new_value] WHERE [condition];
Example: Change the name of the 'CSHL' course to 'Advanced Comp Sci'.
UPDATE Courses SET Title = 'Advanced Comp Sci' WHERE Course_ID = 'CSHL';
d) DELETE (Removing Data)
Used to remove records (rows) from a table.
Syntax structure:
DELETE FROM [table] WHERE [condition];
Example: Remove the student with ID 999.
DELETE FROM Students WHERE Student_ID = 999;
Common Mistake: Forgetting the WHERE clause in an UPDATE or DELETE command. If you forget it, the operation will affect every single record in the table!
5.2 Joining Tables (The Power of Relational Data)
The primary reason for using an RDBMS is the ability to link (join) data across multiple tables using Foreign Keys.
An INNER JOIN combines records from two tables where there are matching values in the specified columns (usually the PK/FK pair).
Example: Retrieve student names along with the titles of the courses they are enrolled in.
SELECT S.Name, C.Title
FROM Students S
INNER JOIN Courses C ON S.Course_ID = C.Course_ID;
6. Database Security and Access Rights
Databases often hold sensitive information (personal data, financial records). Security is paramount and includes both physical and logical measures.
6.1 Logical Security Measures
- User Authentication: Requiring unique usernames and strong passwords to access the database system.
- Access Rights (Permissions): Defining exactly what actions a specific user or user group can perform (e.g., Jane can only
SELECTdata; Bob canINSERTandUPDATE). This is often controlled by SQL commands likeGRANTandREVOKE. - Views: A View is a virtual table based on the result-set of an SQL query. It allows you to restrict the columns and rows a user can see without giving them access to the underlying tables. (Example: A teacher's view only shows student names and scores, hiding addresses and medical history.)
- Encryption: Storing sensitive data (like passwords) in an encrypted format so that even if the database is breached, the data is unreadable.
6.2 Physical Measures and Recovery
- Backups: Regular copying of the entire database to a secure, off-site location to allow recovery from system failure or corruption.
- Transaction Logs: Recording every operation (transaction) performed on the database. If a crash occurs, the logs allow the database to roll back incomplete transactions or restore recent changes.
- Redundancy: Using duplicated hardware (like RAID systems) to ensure that a single component failure does not lead to data loss or system downtime.
You have now mastered the structure and core principles of relational databases, the foundation for nearly all large-scale modern applications! Remember that good database design (normalization) is the key to reliable, efficient systems. Keep practicing those SQL query structures!