📚 Chapter Notes: Big Data (3.15.2) 📊

Welcome to the fascinating world of Big Data! This topic moves us beyond traditional databases and introduces the enormous scale of information modern systems handle: think of the billions of posts, transactions, and sensor readings generated every day. Understanding Big Data is crucial because it is the defining data-handling challenge of our digital age. Don't worry if this seems massive; we will break down the "bigness" into simple, understandable components!

1. Defining Big Data: The Three Vs

The term Big Data is a catch-all phrase for datasets so large, complex, and rapidly changing that they cannot be stored, managed, or processed efficiently using traditional database methods (like a single, standard relational database server).

To understand Big Data, we usually define it using three key characteristics, often called the Three Vs:

Volume (Too Big to Fit)

Volume refers to the sheer scale of the data.

  • What it means: The amount of data generated is immense: we are talking petabytes (10¹⁵ bytes) and exabytes (10¹⁸ bytes).
  • The Challenge: The data volume is often too big to fit onto a single server.
  • Analogy: Imagine trying to fit the water from a huge lake into a single drinking glass. It requires a different strategy!

Velocity (Too Fast to Stop)

Velocity refers to the speed at which the data is generated and must be processed.

  • What it means: Data is often streaming data, meaning it is arriving continuously (like stock market tickers or social media feeds).
  • Required response time: Processing must happen rapidly, often requiring responses in milliseconds to seconds.
  • Example: Fraud detection systems must analyze transactions instantly as they happen.
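The "decide the instant it arrives" idea can be sketched with a Python generator standing in for a live transaction feed. This is a minimal illustration only; the `transaction_stream` function, the sample amounts, and the 10000 threshold are all invented for the example, not taken from any real fraud system.

```python
import time

def transaction_stream():
    """Simulate an endless stream of incoming transactions."""
    # In a real system these would arrive over a network, not from a list.
    for amount in [12.50, 9.99, 25000.00, 3.20]:
        yield {"amount": amount, "timestamp": time.time()}

def flag_suspicious(txn, threshold=10000):
    """Make a decision immediately, as each transaction arrives."""
    return txn["amount"] > threshold

# Each item is processed on arrival; we never wait for the whole dataset.
flags = [flag_suspicious(t) for t in transaction_stream()]
print(flags)  # [False, False, True, False]
```

The key point is that `flag_suspicious` only ever sees one transaction at a time, which is exactly what high-velocity streaming data demands.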

Variety (Too Diverse to Table)

Variety refers to the different forms and structures the data takes.

  • What it means: Big Data comes in many formats:
    • Structured: Data that fits neatly into rows and columns (like a traditional database).
    • Unstructured: Data with no defined format (e.g., raw text documents, emails, social media posts).
    • Multimedia: Videos, images, and audio files.
  • The primary difficulty: The syllabus notes that the lack of structure is the most difficult aspect of Big Data, making analysis significantly harder than analyzing neat, structured data.


Quick Review: The 3 Vs Mnemonic
Volume (Size)
Velocity (Speed)
Variety (Types/Structure)

2. Why Traditional Relational Databases Struggle

When dealing with Big Data, traditional relational databases (like those using SQL) are generally not appropriate.

  • They require the data to fit into a strict row-and-column format (structured data).
  • When data is highly varied or unstructured (as most Big Data is), forcing it into fixed tables is inefficient or impossible.
  • Relational systems were not designed to split storage and processing across hundreds or thousands of servers (distributed processing).

Key Takeaway: Big Data is defined by its scale, speed, and messy nature. Its lack of uniform structure breaks traditional relational database rules.

3. Handling Big Data: Distributed Processing

Since the volume of data is too large to process on a single server, the solution is to use distributed processing.

This means the processing tasks must be distributed across more than one machine working together simultaneously.
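A tiny sketch of the partitioning idea: split a dataset into slices, let each "server" total its own slice independently, then combine the partial answers. The `chunk` helper and the sensor-readings data are illustrative assumptions; here the servers are simulated by a plain loop, but each slice could run on a separate machine.

```python
def chunk(data, n_servers):
    """Split the dataset so each server gets one slice."""
    size = (len(data) + n_servers - 1) // n_servers  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

readings = list(range(1, 101))      # 100 sensor readings: 1..100
partitions = chunk(readings, 4)     # pretend we have 4 servers

# Each server totals its own slice independently (could run in parallel)...
partial_totals = [sum(part) for part in partitions]

# ...then the partial results are combined into the final answer.
total = sum(partial_totals)
print(total)  # 5050, the same as summing on one machine
```

Because each partial total depends only on its own slice, the four sums can happen simultaneously without the servers coordinating with each other.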

Functional Programming as a Solution

Writing correct and efficient code that runs across many separate servers simultaneously (distributed code) is difficult. Functional programming (FP) is often preferred for Big Data environments because its core characteristics simplify this challenge:

  • Immutable Data Structures: Data cannot be changed after it is created. This is vital in distributed systems because if two servers read the same piece of data, they know it won't be suddenly altered by a third server.
  • Statelessness: Functions do not rely on or change any external "state" (data outside the function itself). This makes the order in which servers complete their tasks irrelevant and prevents unexpected side effects.
  • Higher-Order Functions (like Map-Reduce): Functions that take other functions as arguments; these are essential for splitting work across servers and combining the results.
    • Map: Applies the same function to every item, so each server can process its own portion of the data independently.
    • Reduce: Combines the partial results from all the servers into a single, final output.
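The bullets above can be sketched with Python's built-in `functools.reduce`. This is a single-machine toy, not a real distributed job: the log lines are invented sample data, and each line stands in for data held on a different server.

```python
from functools import reduce

# Word counts from log lines held on different "servers".
lines = ["error warning", "error", "warning warning"]

# Map: each line is processed independently -> (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce: combine all the pairs into one final tally.
def tally(counts, pair):
    word, n = pair
    # Build a *new* dict rather than mutating shared state (immutability).
    return {**counts, word: counts.get(word, 0) + n}

result = reduce(tally, mapped, {})
print(result)  # {'error': 2, 'warning': 3}
```

Note how `tally` never modifies `counts` in place; returning a fresh dict each time mirrors the immutable, stateless style that makes map-reduce safe to distribute.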

Did you know? Many of the core technologies behind Big Data processing (like Hadoop and Spark) rely heavily on the principles of map-reduce, a concept rooted deeply in functional programming.

Key Takeaway: Big Data must be processed using distributed computing, and functional programming concepts (immutability, statelessness, map-reduce) make this complex task manageable.

4. Modelling Big Data Structure

Because the data doesn't fit into tables, we use different models to understand its structure.

Fact-Based Model

The fact-based model is a simple way to represent data in which every piece of information is captured as a single, atomic fact.

  • Principle: Each fact captures one tiny piece of information.
  • Example: Instead of a huge record, the system records three separate facts: "Student X has a DOB of 11/03/2012", "Student X is Male", "Student X is a member of Form 11R".
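The three facts from the example above can be represented directly as immutable Python tuples (the property names like `date_of_birth` are made up for illustration):

```python
# Each fact is one tiny piece of information: (subject, property, value).
# Tuples cannot be changed once created.
facts = [
    ("StudentX", "date_of_birth", "11/03/2012"),
    ("StudentX", "gender", "Male"),
    ("StudentX", "member_of", "Form 11R"),
]

# Rebuilding a full record just means collecting an entity's facts.
record = {prop: value for subj, prop, value in facts if subj == "StudentX"}
print(record)
```

Storing facts rather than one big record means new information can simply be appended to the list without rewriting anything that already exists.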

Graph Schema Model

The graph-based schema focuses on relationships and is excellent for modeling complex, interconnected data (like social networks or supply chains).

It captures the structure of the dataset using three components:

  1. Nodes (Entities): Represent the individual items or entities about which data is stored.
    • Diagram Symbol: Represented as an oval.
    • Example: A Student, a Course, a Room.

  2. Properties (Attributes): Detail the characteristics of the entity.
    • Diagram Symbol: Drawn as rectangles, attached to the entity's oval by a dashed line.
    • Example: The student's name, DOB, or gender.

  3. Edges (Relationships): Represent the connections or relationships between entities.
    • Diagram Symbol: Represented by solid line edges drawn between the nodes.
    • Labeling: These lines must be labelled with text to describe the nature of the relationship (e.g., "Attends," "Member of," "Taught by").

Important note for diagrams: Be sure to correctly use dashed lines for connecting properties to the entity node, and solid lines for connecting entities to each other!
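The three components (nodes, properties, labelled edges) can be sketched in code too. All the names here (`StudentX`, `11R`, the "Member of" label) are hypothetical examples, not a real schema:

```python
# Nodes (entities) with their properties (attributes).
nodes = {
    "StudentX": {"name": "Sam", "dob": "11/03/2012"},
    "11R": {"room": "B12"},
}

# Edges: labelled relationships between entities.
edges = [
    ("StudentX", "Member of", "11R"),
]

# Answer "which form is StudentX a member of?" by following an edge.
forms = [dst for src, label, dst in edges
         if src == "StudentX" and label == "Member of"]
print(forms)  # ['11R']
```

Notice that a query is answered by following labelled edges between nodes, which is exactly what makes the graph model suit highly interconnected data.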

Key Takeaway: When relational tables fail, Big Data often uses models like the Fact-Based model or Graph Schemas (using nodes, properties, and labelled edges) to manage structure.