Scanning is the process of breaking input up into discrete tokens (such as one or more letters forming a word) and parsing is the process of applying meaning to these tokens (such as multiple words strung together to form a sentence, following specific grammatical rules). Software developers come from a variety of backgrounds, and while some may remember using these tools to construct a compiler in their Computer Science undergrad, I have a feeling many are unfamiliar with (or forgot about) these tools. I'd wager that the most well known scanner and parser, in the general pool of software developers, are lex and yacc. The "lex" name is short for lexical analyzer, a tool that understands a syntactic unit of information, such as a word in an English sentence, and feeds these units to another program one at a time. The "lex" tool is a scanner generator since it creates code that scans input. The "yacc" name is short for "yet another compiler compiler" which is accurate since the common usage of a parser is to create a compiler, so you can say the parser compiles a compiler. In general, though, "yacc" is a parser generator.
Remember that a compiler is nothing more than a translator. When you write code in any .NET language, a compiler translates the high level code (such as C#) to a lower level code (Intermediate Language, or IL) which is what the Common Language Runtime (CLR) understands and executes. In a way, a decompiler is actually a compiler, translating the low level code, such as IL, to a higher level language, such as C#. We use the terms "compiler" and "decompiler" because they communicate the direction of the translation. Compilers must scan input and then parse it. They must know that "public" is a token and "class" is a token and what an identifier looks like, but it's the parser that understands what a method is and what a while loop looks like and how they translate to IL.
A scanner and parser can be used separately. I won't go into details of constructing a parser in this post, but I will discuss developing a simple scanner using a scanner configuration that is translated into C# code. Before delving into an example, why do we care about a tool that can create a scanner for us? What are the benefits? Much processing of input we do in the business world (at least in my experience) is easily constructed by hand. We don't usually need to construct a compiler, so isn't using a scanner overkill? Like any problem we're called to solve, it's important to know about as many tools as possible, so when we encounter a case where a certain tool would save us significant time, we can use it since we know it exists. There are also business problems, such as sophisticated data translation, where a scanner/parser might be the perfect set of tools. Here are a few benefits to using an auto-generated scanner:
- A scanner generator allows us to focus on the syntactic elements and not worry about any other details
- If syntactic elements change, it's easy to update the code file used as input to the scanner generator
- A scanner can feed anything - a parser can accept the tokens, your custom code can, etc., so the syntactic analysis of input is separated from applying meaning to the input
- Development of scanner is much faster than rolling one by hand, unless input is rather simple, and relying on a scanner generator reduces chances of introducing error into the scanner
- A generated scanner is typically faster than one you roll on your own
A scanner generator for C# that I've used is available at this link C#Lex site
The syntax of the input files to C#Lex follow the syntax detailed at this JLex page except where called out at the C# Lexer site.
Let's look at an example: analyzing simple English. Words in English can take on multiple forms:
- Words with an initial capital, such as at the beginning of a sentence or proper names
- Contractions - words with an apostrophe (e.g., can't, don't, won't)
- Abbreviations - words that are all capitals and optionally have a period after each letter (e.g., IL, CLR, e.g.)
- Quoted words - single or double quotes on both sides (e.g., "compiler")
We'll stop there in the interest of keeping this simple.
An input file to the C#Lex program is formatted in three sections, each section separated with a double percent (%%) on a line of its own
- User code. This section is copied directly to the output file without modification, so you can place implementation and 'using' statements here.
- Directives to control C#Lex. This is where you can control C#Lex, such as specifying scanning states, and also specify regular expressions to match input.
- Scanner rules. This is where you specify what to recognize, what to do with it, state transitions, etc. When a rule matches here, data can be returned to the code running the scanning loop via a special class called Yytoken (that you define).
The scanning states allow recognition of different syntactic units at different times, so if certain syntactic units can only follow other syntactic units (think about visibility keywords such as 'public' and 'private' that can't appear outside a namespace declaration) you can control this in the scanner generator code.
The program we'll write is dead simple for purposes of illustration: it'll accept tokens from the scanner and output each token, one per line.
I won't go into details of the regular expression language, instead keeping it simple and showing the regular expressions we need without much explanation.
A word is simple: [A-Z]?[a-z']+
This gets us an optional initial capital and a sequence of one or more lower case letters and apostrophes. This doesn't limit the number of apostrophes, so let's revise it.
[A-Z]?[a-z]+'?[a-z]*
We can continue refining to ensure the end of the contraction is one of just a few options (such as "t" or "s") but let's keep it simple.
A word can also have all capitals: [A-Z]+
optionally separated by periods: ([A-Z]\.?)+
but periods can also separate lowercased words: ([a-z]\.?)+
Combining these gives us: [A-Z]?[a-z]+'?[a-z]* | ([A-Z]\.)+ | ([a-z]\.)+
We continue like this until we end up with a set of regular expressions that fully describe the input. Since the dot matches any character except newlines, we'll add a rule to pass this input back as a catch all rule. You probably don't want this in a real application, but it illustrates how to match any input not matched by other rules. Any whitespace is skipped over, along with punctuation we're not interested in (exclamation point, question mark, commas, periods). These regular expressions are far from perfect or comprehensive but they illustrate the process of analyzing the nature of the input and constructing the required expressions to scan the input.
I'm including the final file at the end of this post.
A Yytoken class must be defined. This is the communication mechanism between the scanner and the code you write and can hold any information you want (such as where in the input the scanner is, state information, etc). An instance of this class is what is returned by yylex() in the main scanning loop located in our code:
Yytoken t;
while ((t = yy.yylex()) != null)
{
System.Console.WriteLine(t.m_text);
}
Now that we have an input file to the scanner generator, we first generate the C# code by executing C#Lex.exe on this file, then compile the generated C# file by invoking csc.
C#Lex english.lex
csc english.lex.cs
The input file (input.txt) has this line: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut "labore" et dolore magna aliqua.
Running english.lex.exe gives us the following output
Lorem
ipsum
dolor
sit
amet
consectetur
adipisicing
elit
sed
do
eiusmod
tempor
incididunt
ut
"labore"
et
dolore
magna
aliqua
Our final file looks like this:
using System;
using System.IO;
class WordExample {
public static void Main(string[] argv) {
Yylex yy = new Yylex(new StreamReader(new FileStream("test.txt", FileMode.Open)));
Yytoken t;
while ((t = yy.yylex()) != null)
{
Console.WriteLine(t.m_text);
}
Console.WriteLine();
}
}
class Yytoken {
public Yytoken(string token)
{
m_text = token;
}
public string m_text;
}
%%
ALPHA=[A-Za-z]
WORD=[A-Z]?[a-z]+'?[a-z]* | ([A-Z]\.?)+ | ([a-z]\.?)+
WHITE_SPACE_CHAR=[\n\ \t\b\012\r]
%%
<YYINITIAL> {WORD} { return(new Yytoken(yytext())); }
<YYINITIAL> \"{WORD}\" { return(new Yytoken(yytext())); }
<YYINITIAL> {WHITE_SPACE_CHAR}+ { }
<YYINITIAL> [\.,\?!] { }
<YYINITIAL> . { return(new Yytoken(yytext())); }