🎉 Chapter 3.13.4: Backus-Naur Form (BNF) and Syntax Diagrams

Hello future Computer Scientists! This chapter introduces powerful tools used to define the structure, or syntax, of programming languages.
Ever wondered how a computer knows if your Python code is valid, or if you made a mistake writing a simple email address? The rules are laid out using formal grammar systems like Backus-Naur Form (BNF) and visual aids like Syntax Diagrams.

Understanding these concepts is key to the Theory of Computation because they formalise exactly what a "valid program" or a "valid piece of data" looks like. Let's dive into the rulebooks of computer languages!


1. Defining Syntax: The Grammar of Languages

Before we tackle BNF, remember that any language (like Python, English, or even a simple date format) has rules about how its elements can be combined.

What is Syntax?

Syntax refers to the rules that govern the structure of sentences or instructions in a language. If you follow the syntax rules, your sentence or instruction is correctly formed, even if it doesn't make logical sense.
Example: "The computer runs quickly" has correct English syntax. "Runs computer quickly the" has incorrect syntax.

In programming, we need a precise, unambiguous way to write down these rules—that's where BNF and Syntax Diagrams come in.

2. Backus-Naur Form (BNF)

Backus-Naur Form (BNF) is a standard notation system used to express the production rules (the grammar) of a programming language or other formal language. It allows us to define complex structures by breaking them down into simpler components.

Key Symbols in BNF (The Alphabet Soup)

Don't worry about the formal names—focus on what these symbols *mean* and *do*.

  • Non-Terminal Symbols: Represent categories or components that still need to be defined further. They are always enclosed in angle brackets.
    Example: <digit>, <variable name>
  • Terminal Symbols: These are the actual characters or keywords that appear literally in the language. They cannot be broken down further.
    Example: a, 5, =, IF, ;
  • Definition Symbol (::=): Means "is defined as" or "can be replaced by". It separates the non-terminal being defined from its definition.
  • Alternation Symbol (|): Represents a choice, meaning "OR".

💡 Memory Aid: Think of the angle brackets < > as wrapping a concept, while plain text is the final, concrete character.

Step-by-Step: Formulating Simple Production Rules

A production rule (or definition) specifies how a non-terminal symbol can be constructed.

Let’s define a simple <digit>:

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Translation: A <digit> is defined as (::=) the terminal symbol 0 OR (|) the terminal symbol 1 OR... and so on.

Now, let's define a <binary number>:

A binary number is either a single 0 or 1, or it is a 0 or 1 followed by another binary number (this is called recursion).

<binary digit> ::= 0 | 1
<binary number> ::= <binary digit> | <binary digit> <binary number>

Don't worry if the recursive rule looks tricky! It simply means you can repeat the structure. If we use this rule, we can generate:

  • 1 (using the first alternative)
  • 10 (using the second alternative: <binary digit> (1) followed by <binary number> (0))
  • 1011 (and so on...)
Checking Language Syntax with BNF

To check if a specific string follows the language, we try to trace the string back to the initial non-terminal symbol. This is called parsing.

Example: Checking if '5F' is a valid hexadecimal digit.

Assume the following rules:

<hex digit> ::= <digit> | <letter A-F>
<letter A-F> ::= A | B | C | D | E | F
<digit> ::= 0 | ... | 9

The string '5F' is not valid as a single <hex digit> because the rule only allows one digit OR one letter. It does not allow one digit AND one letter.
If we wanted to define a two-character hex code, we would need a new rule:

<hex code two char> ::= <hex digit> <hex digit>

Now, '5F' is valid under <hex code two char>.

🔑 Key Takeaway: BNF

BNF is a text-based formal grammar that uses Non-terminals (<>), Terminals (literal text), Definition (::=), and Alternation (|) to define the structure of a language.

3. Syntax Diagrams (The Visual Grammar)

While BNF is precise for computers, Syntax Diagrams (or railroad diagrams) offer a visual way to represent the same production rules. They are often easier for humans to read and understand at a glance.

How to Read a Syntax Diagram

Think of a syntax diagram like a roadmap:

  1. You start on the left of the diagram.
  2. You follow the arrows to the right.
  3. Any shape you pass through is part of the required syntax.
  4. You must finish on the far right of the diagram.

The Shapes:

  • Ovals or Circles: Used for Terminal Symbols (literal characters or keywords). You must use exactly the symbol shown.
  • Rectangles: Used for Non-Terminal Symbols (a concept defined elsewhere, usually in its own diagram).

Imagine a diagram defining a simple "Command":

A Command could be defined as the keyword PRINT followed by a <String> OR the keyword INPUT followed by a <Variable>.

If you see a path that loops back on itself, that indicates repetition (like iteration in programming). If there are alternative paths (like a fork in the road), that indicates alternation (the OR operator).

Checking Syntax Using Diagrams

To check syntax, you simply try to draw a continuous line through the diagram, matching the input string exactly to the symbols encountered. If you can traverse the diagram completely, the input is syntactically correct.

Quick Review: BNF vs. Syntax Diagrams
  • BNF: Text-based, precise, good for machines (compilers/parsers).
  • Diagrams: Visual, intuitive, good for humans (programmers/students).
  • They both define the exact same language grammar.

4. The Power of BNF (Context-Free Languages)

This is a high-level concept, but an important one for the syllabus, as it explains why we use BNF instead of simpler methods like Regular Expressions (REs) for defining complex programming languages.

In the Theory of Computation, languages are ranked by complexity. Regular Expressions define the simplest class of languages, known as regular languages (which can be recognised by Finite State Machines, or FSMs).

However, programming languages are generally Context-Free Languages. BNF is specifically designed to handle these more complex rules.

Why Regular Expressions Cannot Represent Everything

Regular expressions are limited because they have no "memory" or counting ability. They can only check sequence and simple repetition.

BNF, through its use of recursion (a non-terminal referring back to itself), allows us to define structures that require balancing or internal counting.

Languages that Require BNF (The Syllabus Must-Knows)

The syllabus explicitly requires you to know that BNF can represent certain languages that regular expressions cannot. These languages are usually dependent on context or require matching pairs.

1. Palindromes

  • A palindrome is a string that reads the same forwards and backwards (e.g., 'madam', 'racecar').
  • To check for a palindrome, you must ensure the first character matches the last, the second matches the second-to-last, and so on. This requires a form of comparison or memory that regular expressions lack.
  • BNF Rule (for simple 'a' and 'b' palindromes):
    <P> ::= a | b | a <P> a | b <P> b | ε (where ε means 'empty string')

2. Strings containing equal numbers of certain characters

  • A language where, for example, the number of 'A's must exactly equal the number of 'B's (e.g., 'AABB', 'ABAB', but not 'AAAB').
  • Regular expressions cannot count or ensure this balance, but BNF handles it through recursive structure.
Did You Know?

The term "Backus-Naur Form" is named after John Backus and Peter Naur, who used it to define the syntax of one of the earliest high-level programming languages, Algol 60, in the late 1950s. This formal definition was revolutionary for compiler design!

🌟 Why is BNF/Context-Free Important?

The syntax of almost all modern programming languages (like C++, Java, Python) is defined using BNF or an extension of it. This capability is essential because these languages require structures like nested parentheses (), balanced brackets [], and matching IF/END IF blocks—all of which require the "memory" provided by context-free grammar.

5. Summary: Core Concepts to Master

A. Checking Syntax
  • Be able to read a BNF definition and determine if a sample string is valid by tracing the rules.
  • Be able to read a Syntax Diagram (following the paths) and determine if a sample string is valid.
B. Formulating Simple BNF Rules
  • Understand and use <non-terminal>, ::=, |, and terminal symbols.
  • Be able to write a basic rule that involves alternation (OR) or simple repetition/recursion.
C. Limitations of Regular Expressions
  • Know that REs define regular languages (less complex).
  • Know that BNF can define context-free languages (more complex).
  • Be able to state the specific examples of languages BNF can represent that REs cannot: Palindromes and languages requiring equal numbers of specific characters.

Keep practising tracing those rules! You've successfully navigated the grammar police of Computer Science!