How To Parse Mathematical Expressions
This aim of this tutorial is to introduce the concept of parsing instructions expressed as text into executable code. This serves as a basis for working with machine readable languages (e.g. compilers, JSON deserialization and mathematical expression evaluators). This page will use examples for evaluating basic mathematical expressions.
Input Notation
Infix Notation
As humans, we are used to using infix notation to write mathematic expressions:
Unfortunately, this notation is very hard for a computer to understand and parse.
Reverse Polish Notation (RPN)
Reverse polish notation (RPN) is an alternative notation to infix notation. It is also known as postfix notation. In RPN notation, the operands always come before the operator, which is always specified last. For example, 2+3
, which is in infix notation, would become 2 3 +
.
The same expression above (y = 5 * (x + 2)
), but in RPN is:
RPN does not require the use of parenthesis to preserve the correct order of operations. RPN can linearize a tree data-structure, which can have memory benefits.
You may have noticed the some older calculators don’t work as you’re “used to”. This could be because they use RPN! The Hewlett-Packard HP-48 graphic calculator presented RPN to the user (no infix notation was allowed!).
Sequence of Events
Lexer: Converts the grammar (which is the input, i.e. mathematical expression as a string) into a sequence of tokens. Less formally, it can also be called a tokenizer.
Parser: Reads the tokens outputted from the lexer and parses them into a structure such as RPN or AST. It reshapes the tokens into a structure according to rules such as operator preceedance.
Evaluator: Processes the output of the parser, and produces the result the user wants to see. In the case of an AST, the evaluator will typically perform a depth-first evaluation of all the nodes in the AST (which is easily performed with recursion).
Grammar
Context-free grammars are the most common.
Grammar is a way of formally describing the structure of a language. You can call grammar the language of languages. Grammar is usually written as a sequence of rules. Each rule has two parts, a name, and an expansion of the name. The expansion shows what the name can built from. The idea of rules is to describe the syntax as a hierarchy, starting at the broad structure of the language and then breaking it down into smaller and smaller components. The name can also be called a non-terminal symbol. This refers to the fact that it just an intermediary representation and will be expanded into smaller components. Things such as a number
are called terminal symbols.
We will use a simple calculator program as an example. The calculator allows you to type in expressions. Each expression is built from two terms, which are added together. We will
Of course, the basic syntax above is fairly limiting. While our calculator would accept 2*3 + 4*5
, it wouldn’t accept 2*3 + 4*5 + 6*7
as it only allows for exactly two terms (and exactly two factors in each term). We also couldn’t type 2*3 - 4*5
as it only allows for addition! We need a way of grouping multiple operators together +
and-
, and expressing that an expression can be composed of any number of terms added together. We will improve the functionality of our calculator when we look at Extended Backus-Naur Form (EBNF) below.
Extended Backus-Naur Form (EBNF)
EBNF is a standardized way of writing grammar. All names are enclosed in angled brackets, e.g. <name>
. Let’s write a basic calculator:
So far, our calculator only allows an expression in the form 2+3
. Let’s add more to this!
The |
in EBNF represents OR
, e.g. an expr
may consist of a term + term
OR just a term
.
Now we can write expressions such as: 3 + 2
, or just 4
(typing just 4
on a calculator seems a little pointless, but we would expect it to work!).
However, we still can’t type 1 + 2 + 3
, as the grammar does not allow for more than two numbers. You can use curly braces {
and }
to indicate that what’s inside can repeated zero or more times. The below example states that our calculator can accept an expression which equal to one term plus zero or more terms:
Now we can write expressions such as: 1 + 2 + 3
and 4
.
Note that we don’t have to specify | number
anymore, as the case in where the expression is just a single number is handled by zero repetitions of {"+" <term>}
.
We can use round braces (
and )
to group things together. The below example specifies that an expression is composed of any number of terms that are either added OR subtracted from one another:
Now we can write things such as 1 - 2
and 1 + 2 - 3
.
Lets now expand the calculators functionality so that we can also multiply!
Notice how we added a new rule for multiplication/division. Where everywhere previously we could use a number
in the addition/subtraction rule, we now have a <term>
. This further expands to say that <term>
is composed of any amount of numbers separated by *
or /
. Structuring the grammar in this way captures the correct precedence of mathematical operators, multiplication must occur before addition.
Now we can write things such as: 1*2 + 3*4
or 1 + 2*4
or 1*2
!
We will now add support for negation symbols in front of numbers. Everything discussed so far has been a binary operator, i.e. it has operated on two numbers. The negation symbol is a unary operator (operates on one number). It is also optional, so we can enclose it in square brackets ([
and ]
).
Now we can write expressions such as: -1*2
or 1*-2
or -1 + 2
!
Putting all of this together, the complete grammar for our calculator now becomes:
A Recursive Decent Parser (RPN)
A Recursive Descent Parser (RPN) is a type of parser which uses function recursion as a primary technique for parsing expressions. One of the main benefits of RPNs is that their design closely matches the structure of EBNF grammar syntax.
External Resources
For a detailed code-based walk-through of an AST expression parser in C#, see https://mariusbancila.ro/blog/2009/02/03/evaluating-expressions-part-1/.