Theory of Computation: Regular Expressions and Regular Languages
Hello! Welcome to one of the most practical and fascinating areas of the Theory of Computation: Regular Expressions.
Don't worry if the name sounds intimidating. In simple terms, Regular Expressions (often shortened to "Regex" or "RE") are just incredibly powerful tools used to describe and match patterns in strings of text.
If you have ever used the 'Find and Replace' function in a word processor, or checked if an email address is valid when signing up for a website, you have already seen Regular Expressions in action!
In this chapter, we will learn how to write these patterns and understand their deep connection to the simplest type of abstract computing machine: the Finite State Machine (FSM).
1. Understanding Regular Languages
What is a Language in Computer Science?
In the context of computation theory, a language is simply a set of valid strings (sequences of characters) formed from a specific alphabet.
- Example: If the alphabet is \(\{0, 1\}\), a language might be the set of all strings that start with 0 (e.g., 0, 01, 000, 0101...).
A Regular Language is a language that can be precisely described using a Regular Expression, or recognized by a Finite State Machine (FSM). These are the simplest types of languages we study in this field.
What is a Regular Expression (RE)?
A Regular Expression is a formal sequence of characters that defines a search pattern. Think of it as a blueprint for a valid string.
Analogy: Imagine you are trying to find every car license plate in a database. You don't know the exact plate number, but you know the pattern: three letters, followed by two digits, followed by one letter. The Regular Expression is the way you define that exact pattern.
Key Takeaway: If a pattern can be described by a regular expression, it belongs to a Regular Language.
2. The Essential Metacharacters (Building Blocks)
To write an RE, we use standard characters (like 'a', '1', '@') combined with special characters called metacharacters. These metacharacters define the rules of repetition and choice.
We will use the alphabet \(\{a, b, c\}\) for these examples.
2.1 Repetition and Optionality
These metacharacters control how many times the preceding character or group is allowed to appear.
1. The Kleene Star: * (Zero or More Repetitions)
The * metacharacter means the preceding element can occur zero or more times. It is perhaps the most fundamental repetition operator.
-
RE:
a*
Matches: "", "a", "aa", "aaa", "aaaaa"... -
RE:
b(a)*
Matches: "b" (whenarepeats 0 times), "ba", "baa", "baaa"...
Memory Aid: The * looks like a snowflake—it can cover zero space (empty) or huge amounts of space (infinite repetition).
2. The Plus Sign: + (One or More Repetitions)
The + metacharacter means the preceding element must occur one or more times. It is similar to *, but it cannot match an empty string.
-
RE:
a+
Matches: "a", "aa", "aaa"... (but NOT the empty string "") -
RE:
a(bc)+
Matches: "abc", "abcbc", "abcbcbc"...
3. The Question Mark: ? (Optional/Zero or One Repetition)
The ? metacharacter means the preceding element is optional. It occurs zero or one time.
-
RE:
colou?r
Matches: "color" or "colour" (useful for handling different spellings!) -
RE:
a(b)?c
Matches: "ac" or "abc"
2.2 Alternation and Grouping
4. The Alternation/OR Operator: |
The | metacharacter is used to specify alternation, meaning "this or that." It provides a choice between two or more expressions.
-
RE:
cat|dog
Matches: Exactly "cat" or exactly "dog". -
RE:
a(b|c)d
Matches: "abd" or "acd".
5. Grouping: ()
Parentheses () are used to group regular expressions together so that repetition or alternation operators can be applied to the whole group.
-
If you write
ab+, the+applies only to the 'b' (a, abb, abbb...). -
If you write
(ab)+, the+applies to the whole group 'ab' (ab, abab, ababab...).
Example: Matching a simple filename with a specific extension.
RE: (data|file).txt
Matches: "data.txt" or "file.txt"
✎ Quick Review: Metacharacter Meanings
*: Zero or more+: One or more?: Zero or one (optional)|: OR (alternation)(): Grouping
3. Character Classes
Character classes allow you to specify a set of acceptable characters for a single position in the string. These are defined using square brackets [].
1. Specific Character Sets
Placing characters inside the brackets means "match any single character from this set."
-
RE:
[aeiou]
Matches: Any single vowel (e.g., 'a', 'e', 'i', 'o', or 'u'). -
RE:
[CH]at
Matches: "Chat" or "Hat". (The syllabus specifies this format.)
2. Character Ranges
To specify a large range of characters (like all numbers or all uppercase letters) without listing them individually, you use a hyphen -.
-
RE:
[0-9]
Matches: Any single digit from 0 to 9. -
RE:
[A-Z]
Matches: Any single uppercase letter from A to Z. -
RE:
[a-zA-Z0-9]+
Matches: Any combination of letters (upper or lower) and numbers, at least one character long.
Did you know? Regular expressions are essential for lexical analysis (the first stage of a compiler) where they are used to recognize tokens like keywords, identifiers, and numbers.
Example RE Construction
Let's construct an RE for a valid UK postal code structure that looks like this: A letter, followed by zero or more digits, followed by a mandatory space, followed by one or more letters.
The structure required is complex, so let's simplify based on the syllabus. Let's describe a simple identifier: a variable name that must start with a letter and can contain letters or numbers afterwards.
- Must start with a letter:
[a-zA-Z] - Followed by letters or numbers, zero or more times:
[a-zA-Z0-9]*
Combined RE: [a-zA-Z][a-zA-Z0-9]*
- Matches: "X", "Var1", "myVariable", "count3"
- Does NOT Match: "1X", "345"
Common Mistake to Avoid: Confusing a|b* with (a|b)*.
The first matches 'a' OR zero/more 'b's (e.g., 'a', 'b', 'bb', 'bbb').
The second matches any sequence of 'a's and 'b's, including the empty string (e.g., 'ab', 'aba', 'baab'). Grouping is crucial!
4. The Theoretical Connection: FSMs and Regular Languages
This section brings the formal theory back into our discussion. In the Theory of Computation, Finite State Machines (FSMs) are the simplest models of computation.
4.1 The Equivalence Principle
The most important concept here is the relationship between REs and FSMs:
Regular expressions and FSMs are equivalent ways of defining a regular language.
This means that for every Regular Expression you can write, you can design a corresponding FSM (a state transition diagram) that accepts exactly the same set of strings. And conversely, for any FSM, you can write a Regular Expression that describes the language it recognizes.
- If a language can be represented by a regular expression, it is a regular language.
- If a language can be recognised by an FSM, it is a regular language.
4.2 Writing an RE from an FSM
You must be able to write a regular expression to recognise the same language as a given FSM, and vice versa: design an FSM to recognise the same language as a given regular expression.
Example: Writing an RE from a simple FSM.
Imagine an FSM that accepts strings of the form: one 'a', followed by any number of 'b's, followed by one 'c'.
- The FSM starts at state S0.
- The only transition out of S0 on input 'a' goes to state S1.
- S1 loops back to itself on input 'b'.
- A transition from S1 on input 'c' goes to the final (accepting) state S2.
Step-by-Step Construction:
- The string must start with 'a'. (
a) - The 'b' can be repeated zero or more times (the loop on S1). (
b*) - The string must end with 'c'. (
c)
The resulting Regular Expression is: ab*c
Example 2: Using Alternation
Imagine an FSM that accepts strings starting with either '0' or '1', followed by any number of '0's.
RE: (0|1)0*
This matches "0", "1", "000", "10", "1000", etc.
Why is this important?
The equivalence between FSMs and REs is the foundation of many practical computing applications, especially in parsing, searching, and validating data structures. If you can define a pattern with an RE, you know that a simple, efficient FSM can handle the computation.
💯 Key Takeaways for Examination
You must be proficient in:
- Understanding and applying the five key metacharacters:
*,+,?,|, and(). - Using character classes (e.g.,
[A-Z]or[0-9]) to represent sets of characters. - Stating the formal relationship: Regular Expressions and FSMs describe the same set of languages (Regular Languages).
- Translating simple patterns between FSM diagrams/descriptions and their corresponding Regular Expressions.