Welcome to Web Science (Option C) Study Notes!

Hello future Computer Scientists! This chapter, Web Science, is one of your exciting optional topics. It’s not just about how to build a website; it’s about understanding the Web as a massive, living system—a socio-technical network that connects billions of people, machines, and ideas.

We will explore the underlying structure, how search engines actually think, the mathematics of social connections, and the vision of a "smarter" Web.

Don't worry if terms like "PageRank" or "Ontology" sound complex. We will break them down into simple, real-world examples!

C.1 The Structure of the Web

The Web as a Graph

At its heart, the World Wide Web is a massive directed graph: hyperlinks have a direction, pointing from one page to another.

  • Nodes (Vertices): These are the individual web pages, documents, images, or resources.
  • Edges (Links): These are the hyperlinks that connect one node to another.

This simple graph structure is vital for understanding everything from how search engines work (by following edges) to how information spreads (across nodes).
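To make this concrete, here is a minimal sketch of a tiny Web modelled as a directed graph in Python; the page names are invented for illustration:

```python
# A tiny model of the Web as a directed graph:
# keys are nodes (pages), values are lists of outgoing edges (hyperlinks).
web_graph = {
    "home.html": ["about.html", "blog.html"],
    "about.html": ["home.html"],
    "blog.html": ["home.html", "about.html"],
}

def out_degree(graph, page):
    """Number of hyperlinks leaving a page (outgoing edges)."""
    return len(graph.get(page, []))

def in_degree(graph, page):
    """Number of hyperlinks pointing at a page (incoming edges)."""
    return sum(page in links for links in graph.values())

print(out_degree(web_graph, "home.html"))  # 2
print(in_degree(web_graph, "home.html"))   # 2
```

Crawlers traverse the outgoing edges, while ranking algorithms (Section C.2) care mostly about the incoming ones.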

Key Addressing and Protocol Concepts

Uniform Resource Identifiers (URIs) and URLs

A URI (Uniform Resource Identifier) is a general term for any string of characters that identifies a resource.

A URL (Uniform Resource Locator) is the specific type of URI that also tells you *how* to locate that resource (i.e., its access mechanism). Think of a URI as a name and a URL as a name plus a physical address.

Protocols: HTTP and TCP/IP

The backbone of communication relies on protocols:

1. Hypertext Transfer Protocol (HTTP)

  • HTTP is the protocol used to transfer data between the client (your browser) and the server.
  • Key Feature: HTTP is Stateless. This is incredibly important. It means the server forgets everything about the previous requests from a client. Every request is treated as brand new.
    Analogy: Imagine a waiter who completely forgets you between visits to your table. You have to re-introduce yourself and repeat your order every time you speak.
  • To overcome this stateless nature (especially for logins or shopping carts), the Web uses cookies and session IDs to track users.
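The cookie/session idea can be sketched like this; the session store, function names, and the "alice" user are all hypothetical, and a real server would receive the session ID inside an HTTP Cookie header rather than as a function argument:

```python
import uuid

# Hypothetical server-side session store: the only place state lives.
# Every request must carry its session ID (normally in a cookie),
# because the stateless server remembers nothing between requests.
sessions = {}

def login(username):
    """Handle a login request: create a session and return its ID (the 'cookie')."""
    session_id = str(uuid.uuid4())
    sessions[session_id] = {"user": username, "cart": []}
    return session_id

def add_to_cart(session_id, item):
    """Each request re-identifies itself via the session ID."""
    sessions[session_id]["cart"].append(item)

sid = login("alice")
add_to_cart(sid, "book")
print(sessions[sid]["cart"])  # ['book']
```

Without the session ID, the second request would be indistinguishable from any stranger's request.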

2. TCP/IP

  • TCP (Transmission Control Protocol): Ensures data packets are delivered reliably and in the correct order.
  • IP (Internet Protocol): Handles addressing, ensuring packets are routed to the correct destination IP address.

The Domain Name System (DNS)

The DNS acts as the Internet's phonebook.

  • Humans use easy-to-remember domain names (like www.google.com).
  • Computers use numerical IP addresses (like 172.217.14.174).
  • Process: When you type a domain name, your computer sends a request to a DNS server, which looks up the corresponding IP address and returns it. Your browser then uses the IP address to connect to the server.
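A toy version of that lookup step might look like this; the local table merely stands in for the real distributed hierarchy of root, TLD, and authoritative DNS servers (in practice, Python's built-in `socket.gethostbyname` performs an actual network lookup):

```python
# A toy resolver: a local table standing in for the distributed DNS hierarchy.
dns_table = {
    "www.example.com": "93.184.216.34",
}

def resolve(domain):
    """Return the IP address for a domain name, or raise if it is unknown."""
    try:
        return dns_table[domain]
    except KeyError:
        # Real DNS servers answer NXDOMAIN for names that do not exist.
        raise LookupError(f"NXDOMAIN: {domain}")

print(resolve("www.example.com"))  # 93.184.216.34
```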

Quick Review C.1 Key Takeaway:
The Web is a graph defined by nodes (pages) and edges (links). HTTP is stateless and relies on DNS to translate human-readable names into machine-readable IP addresses.

C.2 Finding Information: Search Technologies

The Components of a Search Engine

Search engines are complex systems designed to index and retrieve the vast amount of information on the Web. They typically consist of three main parts:

  1. The Crawler (Spider/Bot): This program systematically browses the Web, following links and downloading pages, to discover new and updated content.
  2. The Indexer: This component processes the downloaded pages, extracting keywords, calculating link structures, and storing the information in a massive index (like a giant library catalog).
  3. The Query Processor (Search Interface): This takes the user's search terms, checks them against the index, and applies a ranking algorithm to present the results.

Page Ranking Algorithms

Simply finding a relevant page isn't enough; search engines need to present the most important pages first. This is the job of the ranking algorithm, most famously PageRank (developed by Larry Page and Sergey Brin, Google's founders).

The Core Principle of PageRank: A page is considered important if important pages link to it.

  • It's not just the number of links a page receives (in-degree); it's the quality of the links.
  • If Page A links to Page B, Page A is "voting" for Page B.
  • If Page A is highly important, its vote carries more weight than a vote from an unimportant page.
  • The PageRank calculation is recursive: A page’s importance depends on the importance of the pages linking to it, creating a complex iteration process.
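The iteration described above can be sketched as follows. This is the textbook version of PageRank with a standard damping factor of 0.85 (modelling a surfer who occasionally jumps to a random page), not Google's production algorithm:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank: each page redistributes its rank along its out-links.

    graph maps each page to the list of pages it links to.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, links in graph.items():
            if links:
                share = rank[p] / len(links)     # a page splits its vote
                for q in links:
                    new_rank[q] += damping * share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# C collects votes from both A and B, so it ends up with the highest rank.
```

Notice the recursion in action: C's rank feeds back into A, which feeds B and C, so the values only settle after repeated iterations.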

Limitations of Current Search Engines

Despite their power, search engines face significant challenges:

  • Web Spamming (SEO Manipulation): People try to trick algorithms using keyword stuffing, hidden text, or creating fake networks of links (link farms) to artificially boost ranking.
  • The Deep Web/Invisible Web: Vast amounts of data are not indexed because they are behind password walls, require form submissions (e.g., bank accounts, private databases), or are proprietary.
  • Linguistic and Cultural Barriers: Ranking relevance can be different across languages and cultures.
  • Filter Bubbles and Bias: Algorithms personalize results based on your history, leading to a situation where you are only shown information that confirms your existing views, potentially isolating you from diverse perspectives.

Did You Know? Google uses many proprietary factors beyond PageRank (often hundreds) to rank results, making the exact ranking process highly secretive and constantly evolving.

C.3 The Social Web and Network Analysis

The Social Web (Web 2.0) introduced platforms where users generate content and connections. Web science analyzes these connections using Social Network Analysis (SNA).

Key Concepts in Network Analysis

In a social network (like Facebook or Twitter):

  • Nodes: Individuals, groups, or entities.
  • Edges: Relationships or interactions (e.g., friendship, follow, mention).

Network Metrics

We use metrics to measure the importance and structure of nodes:

  • Degree Centrality: The number of direct connections a node has. (Example: A person with 500 friends has a high degree.)
  • Path Length: The shortest distance (minimum number of steps/edges) between two nodes. This is the basis of the famous "six degrees of separation" concept.
  • Clustering Coefficient: Measures how interconnected a node's immediate neighbors are. High clustering suggests the node belongs to a tight community.
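All three metrics can be computed directly on a small adjacency dict; the five people in this network are invented for illustration:

```python
from collections import deque

# Undirected friendship network: each person maps to their set of friends.
network = {
    "Ana": {"Ben", "Cai", "Dee"},
    "Ben": {"Ana", "Cai"},
    "Cai": {"Ana", "Ben"},
    "Dee": {"Ana", "Eli"},
    "Eli": {"Dee"},
}

def degree_centrality(g, node):
    """Number of direct connections a node has."""
    return len(g[node])

def shortest_path_length(g, start, end):
    """Breadth-first search gives the minimum number of edges between two nodes."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == end:
            return dist
        for nb in g[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # no path exists

def clustering_coefficient(g, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = list(g[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in g[nbrs[i]])
    return 2 * links / (k * (k - 1))

print(degree_centrality(network, "Ana"))            # 3
print(shortest_path_length(network, "Ben", "Eli"))  # 3 (Ben -> Ana -> Dee -> Eli)
print(clustering_coefficient(network, "Ana"))       # 1/3: only Ben-Cai are linked
```

Ana has the highest degree, but her low clustering coefficient shows her friends mostly don't know each other: she bridges two separate groups.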

Power Law Distributions and Hubs

Many real-world networks, including social networks and the Web itself, follow a Power Law degree distribution; such networks are called scale-free.

  • In a normal distribution, most items are near the average.
  • In a Power Law distribution, a few nodes (hubs) have an extremely high number of connections, while the vast majority of nodes have very few connections.
  • Analogy: On Instagram, a few celebrities have millions of followers (hubs), while the average user has hundreds.
  • Importance: Hubs are critical for network resilience (scale-free networks survive random failures well, but removing just a few hubs shatters connectivity) and for information spread (they are the main vectors for viral content).

Ethical and Legal Issues in the Social Web

The collection and use of user data raise serious ethical and legal concerns:

  • Privacy and Data Leakage: Massive data collection (often without full user awareness) makes users vulnerable to profiling, tracking, and potential data breaches.
  • Manipulation: Analysis of social network structure can be used to identify key influencers (hubs) for targeted advertising or political campaigning, potentially exploiting behavioral vulnerabilities.
  • Governance: Who regulates speech and content (e.g., hate speech, fake news) on these global platforms? National laws often struggle to keep up with borderless digital communication.

C.4 Moving Beyond Syntax: The Semantic Web

The Need for Meaning (Semantics)

The current Web (Web 2.0) is primarily designed for human consumption. While computers can read the *structure* (the syntax) of a webpage using HTML, they cannot easily understand the *meaning* (the semantics) of the content.

Example: A computer sees "15.00" but doesn't know if that is a time, a price, a temperature, or a count.

The goal of the Semantic Web (often called Web 3.0) is to create a Web of data where computers can understand the meaning of information, making automated processing possible.

Core Semantic Web Technologies

Resource Description Framework (RDF)

RDF is the foundational technology for expressing information in a way machines can understand. It uses simple declarative statements called triples.

A triple always consists of three parts:

  1. Subject: The resource being described.
  2. Predicate (Property): The relationship or attribute.
  3. Object: The value or another resource.

Example: If we want to state that "Tim Berners-Lee invented the World Wide Web," the triple is:
(Tim Berners-Lee, invented, World Wide Web).
This structure allows machines to build large, linked databases of meaningful facts.
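A triple store can be modelled as a simple list of tuples with a pattern-matching query. Real systems use RDF serializations such as Turtle or RDF/XML and the SPARQL query language, so this is only a sketch of the idea:

```python
# A toy triple store: each fact is a (subject, predicate, object) tuple.
triples = [
    ("Tim Berners-Lee", "invented", "World Wide Web"),
    ("Tim Berners-Lee", "worked_at", "CERN"),
    ("World Wide Web", "runs_on", "the Internet"),
]

def query(store, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None acts as a wildcard)."""
    return [
        (s, p, o) for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# "What did Tim Berners-Lee invent?"
print(query(triples, subject="Tim Berners-Lee", predicate="invented"))
# [('Tim Berners-Lee', 'invented', 'World Wide Web')]
```

Because every fact has the same three-part shape, machines can link triples from different sources into one large database of knowledge.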

Ontologies

While RDF provides the format for facts (triples), ontologies provide the rules and structure for knowledge.

  • An ontology is a formal, explicit specification of a shared conceptualization.
  • Think of an ontology as a shared dictionary or knowledge map that defines classes (types of things), properties (relationships), and constraints (rules) within a specific domain.
  • Example: An ontology for biology would formally define "Mammal," define its properties (like "has_fur"), and define constraints (like "is_a_type_of" Animal).

Ontologies allow different systems to agree on the precise meaning of terms, enabling effective data integration and automated reasoning.
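The "knowledge map" idea can be sketched as a small class hierarchy. Real ontologies are written in dedicated languages such as OWL; this toy dict only illustrates reasoning along is_a links, using the biology example above:

```python
# Toy ontology: classes, an is_a hierarchy, and the properties each class adds.
ontology = {
    "Animal": {"is_a": None, "properties": set()},
    "Mammal": {"is_a": "Animal", "properties": {"has_fur"}},
    "Dog": {"is_a": "Mammal", "properties": {"barks"}},
}

def is_a(ont, cls, ancestor):
    """Automated reasoning step: walk up the is_a chain."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ont[cls]["is_a"]
    return False

def all_properties(ont, cls):
    """A class inherits the properties of every ancestor."""
    props = set()
    while cls is not None:
        props |= ont[cls]["properties"]
        cls = ont[cls]["is_a"]
    return props

print(is_a(ontology, "Dog", "Animal"))          # True
print(sorted(all_properties(ontology, "Dog")))  # ['barks', 'has_fur']
```

The second query shows the payoff: nobody ever stated "a Dog has fur", yet the machine can derive it from the shared rules.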

Challenges of the Semantic Web

  • Adoption Rate: Getting millions of users and organizations to consistently use standardized, semantic markup is difficult and slow.
  • Complexity: Developing and maintaining complex, detailed ontologies is resource-intensive.
  • Uncertainty and Contradictions: Real-world data is often messy, contradictory, or fuzzy. Standard semantic languages struggle to handle ambiguities easily.

C.5 The Future of the Web

The Evolution of the Web

The Web has generally evolved through three phases:

  • Web 1.0 (The Read-Only Web): Static content, simple browsing, limited interaction (1990s).
  • Web 2.0 (The Social/Interactive Web): User-generated content, blogs, social networks, dynamic interfaces (early 2000s).
  • Web 3.0 (The Semantic/Spatial Web): Focuses on machine understanding (Semantic Web), decentralized systems (Blockchain), and integrating physical and digital reality (IoT, Metaverse).

The Internet of Things (IoT)

IoT is critical to the future Web. It refers to the network of physical objects (devices, vehicles, appliances, sensors) embedded with technology that enables them to collect and exchange data.

IoT generates massive quantities of data (Big Data) that need to be processed, requiring the structure and intelligence envisioned by the Semantic Web to make sense of the real-time inputs.

Ongoing Challenges for the Future Web

The continued expansion of the Web faces several major socio-technical challenges:

  • Security and Trust: As more devices (IoT) and personal data move online, the risks from cyberattacks, identity theft, and data manipulation increase exponentially.
  • Information Overload: The sheer volume of content makes quality filtering and effective search increasingly difficult.
  • Digital Divide: The gap between those who have access to high-speed internet and necessary technology, and those who do not, risks creating severe social and economic inequalities globally.
  • Web Governance: Determining who controls the infrastructure, standards, and content rules for a global, borderless system remains a complex political and technical challenge.

Quick Review: Web Science Core Concepts

1. Structure: The Web is a graph. HTTP is stateless.

2. Search: PageRank values links from important pages (not just many links).

3. Social: Networks often follow a Power Law (hubs).

4. Semantic: Uses RDF Triples and Ontologies to give meaning to data for machines.