Regular Expression Based Lexical Analyzer Generator

This tool generates a lexical analyzer from a list of pattern descriptions given as regular expressions. Its input file is very similar in structure and syntax to the input of the standard lex tool. But unlike lex, which generates a full analyzer with pattern-matching handlers inlined, lexegen generates only the tables and a C-compliant pattern-matching function as files, which can be included into the rest of the analyzer's implementation.

This README briefly describes the tool and gives guidelines on how to use it; note that it may not be fully up-to-date. For more detailed information see the wiki pages.

A Simple Usage Example

Assume that we have already built this tool as the lexegen executable, and we want to create a simple lexical analyzer which extracts the following tokens from an input buffer: C-identifiers, unsigned integers, floating-point numbers, and C-strings. We also want to skip C-comments and whitespace.

Here is a source input file test.lex for this analyzer:

# This is a simple lexical analyzer.

# This section describes named regular expression definitions, which can be
# expanded in further expressions using `{name}`.
# Each line has the form: <definition-name> <regex>.
# Note that only end-of-line character terminates each regular expression, so
# a comment can't be placed on the same line with the expression.

# Start conditions can also be defined in this section using the form: %start <sc-name>.
# By default only one `initial` start condition is implicitly defined.

# a digit
dig       [0-9]

# a letter
letter    [a-zA-Z]

# unsigned integer
int       {dig}+

# a C-identifier
id        ({letter}|_)({letter}|{dig}|_)*   

# a floating-point number
real      (({dig}+(\.{dig}*)?)|(\.{dig}+))((e|E)(\+|\-)?{dig}+)?

# white-space character
ws        [ \t\r\n]       

# C-style comment
comment    \/\* ( [^*\\] | \\(.|\n) | \*+([^*/\\]|\\(.|\n)) )* \*+\/

# a C-string
string     \" ( [^"\\] | \\(.|\n) )* \"

%% # section separator

# This section describes patterns as regular expressions.
# Each line has the form: <pattern-name> <regex>,
# or the form: <pattern-name> <s1, s2, ...> <regex> with start condition list.
# If start condition list is not specified, the pattern participates in all
# defined start conditions.

comment     {comment}
string      {string}
id          {id}
int         {int}
real        {real}
ws          {ws}+
other       .

%% # mandatory end of section

Then, in order to generate our lexical analyzer, let's issue the following:

$ ./lexegen test.lex
Building lexer...
 - pattern count: 7
 - S-state count: 1
 - position count: 44
 - meta-symbol count: 13
 - state count: 24
 transition table size: 1248 bytes
Done.
Optimizing states...
 - state group count: 19
 - dead group count: 0
 - new state count: 19
 transition table size: 988 bytes
Done.
Compressing tables...
 total compressed transition table size: 656 bytes
Done.

As a result, two files with the default names lex_defs.h and lex_analyzer.inl are generated. To specify the names explicitly, issue the following:

./lexegen test.lex -h <new-defs-file-name> -o <new-analyzer-file-name>

The file lex_defs.h contains numerical identifiers for patterns and start conditions (i.e., starting analyzer states). Only one start condition, sc_initial, is defined for our example.
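
For our example the generated definitions might look roughly like the following (an illustrative sketch, not verbatim tool output; the actual names of the predefined identifiers are used later in this document, but the exact values and layout are assigned by the tool):

// from lex_defs.h (illustrative sketch)
// err_end_of_input is also defined (a negative value)
enum {
    predef_pat_default = 0,  // predefined fallback pattern (see below)
    pat_comment,
    pat_string,
    pat_id,
    pat_int,
    pat_real,
    pat_ws,
    pat_other
};
enum { sc_initial = 0 };  // the only start condition in this example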

The file lex_analyzer.inl contains the necessary tables and the lex() function implementation, defined as static. This function has the following prototype:

static int lex(const char* first, const char* last, int** p_sptr, unsigned* p_llen, int has_more);

where:

  • first - pointer to the first character of the input buffer
  • last - pointer to the position after the last character of the input buffer
  • p_sptr - pointer to the current user-provided DFA stack pointer
  • p_llen - pointer to the current matched lexeme length
  • has_more - should be 0 to treat the end of the input buffer as the end of the input sequence

returns: matched pattern identifier
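
Here is a minimal calling sketch, assuming the two files were generated as above, the whole input is already in memory (so has_more is 0), and no lexeme is longer than the stack; the match_one wrapper and kStackSize are hypothetical names:

#include <memory>

namespace lex_detail {
#include "lex_defs.h"
#include "lex_analyzer.inl"
}

// Match a single lexeme at the beginning of [first, last).
int match_one(const char* first, const char* last, unsigned& llen) {
    constexpr unsigned kStackSize = 256;  // must cover the longest expected lexeme
    auto state_stack = std::make_unique<int[]>(kStackSize);
    int* sptr = state_stack.get();
    *sptr++ = lex_detail::sc_initial;  // the starting state must be on top of the stack
    llen = 0;
    return lex_detail::lex(first, last, &sptr, &llen, 0);  // returns the pattern id
}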

How It Works

The analyzer always tries to match one of the patterns against the longest possible chunk of text at the current position. If more than one pattern fits this chunk, pattern priority is used: the first pattern in the list has the highest priority, and the last one has the lowest. For example, with the patterns above the text 123 matches both int and real with the same length, and int wins because it is listed first.

The starting state must be on the top of the user-provided DFA stack before calling lex(). The current stack pointer *p_sptr must point to the position after the last pushed state (the first free stack cell).

After the function returns with a matched pattern, the stack pointer *p_sptr is the same as it was before the call.

If has_more is 0, the stack pointer *p_sptr is always the same as before the call. If no user-provided pattern is matched, the function skips one character and returns predef_pat_default. So, the function always consumes at least one character from the input buffer. If the input buffer is empty, it returns err_end_of_input (a negative value).

If has_more is not 0 and the end of the input buffer is reached, the analyzer leaves the stack pointer *p_sptr as it is and returns err_end_of_input. This gives a chance to add more characters to the input sequence and call lex() again to continue the analysis. The already analyzed part of the input is no longer needed, because all the necessary information is in the state stack. In theory, the old input buffer can even be freed, but in practice it will likely still be needed to assemble the full lexeme.

The user-provided DFA stack must have at least as many free cells as the length of the longest possible lexeme, or last - first cells if we must deal with lexemes of arbitrary length. The other approach is to trim the input buffer whenever it is longer than the free range of the user-provided DFA stack and to make use of the has_more flag. The code will probably look like this:

unsigned llen = 0;
auto state_stack = std::make_unique<int[]>(kInitialStackSize);
int* slast = state_stack.get() + kInitialStackSize;
int* sptr = state_stack.get();
*sptr++ = lex_detail::sc_initial;
...
const char* trimmed_first = first;
while (true) {
    bool stack_limitation = false;
    const char* trimmed_last = last;
    if (slast - sptr < last - trimmed_first) {
        // not enough free stack cells for the rest of the input: trim the buffer
        trimmed_last = trimmed_first + (slast - sptr);
        stack_limitation = true;
    }
    int pat = lex_detail::lex(trimmed_first, trimmed_last, &sptr, &llen, stack_limitation);
    if (pat >= lex_detail::predef_pat_default) {
        break; // the full lexeme is obtained
    } else if (stack_limitation) {
        // enlarge state stack and continue analysis
        <... enlarge state stack ...>
        trimmed_first = trimmed_last;
    } else {
        <... end of sequence ...>
    }
}
...
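
One possible way to fill in the <... enlarge state stack ...> placeholder is to reallocate the stack, preserving the states pushed so far (a sketch; the doubling policy is just an assumption, and <algorithm> is needed for std::copy):

// double the stack, keeping the states pushed so far
std::size_t used = sptr - state_stack.get();
std::size_t new_size = 2 * static_cast<std::size_t>(slast - state_stack.get());
auto new_stack = std::make_unique<int[]>(new_size);
std::copy(state_stack.get(), sptr, new_stack.get());
state_stack = std::move(new_stack);
sptr = state_stack.get() + used;
slast = state_stack.get() + new_size;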

Note that it is convenient to use the state stack as a start condition stack as well.
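
For example, to switch the analyzer into another start condition, push the corresponding identifier onto the same stack before the next lex() call, and pop it to return to the previous condition (a sketch; sc_string is a hypothetical start condition that would have to be declared with %start string):

*sptr++ = lex_detail::sc_string;  // subsequent lex() calls start from `sc_string`
...
--sptr;  // return to the previous start condition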

User Code Example

The code for the above example will probably look something like this:

...
namespace lex_detail {
#include "lex_defs.h"
#include "lex_analyzer.inl"
}
...
int main() {
    ...
    const char* first = ....; // pointer to the first character
    const char* last = ....;  // pointer after the last character
    ...
    unsigned llen = 0;
    auto state_stack = std::make_unique<int[]>(kInitialStackSize);
    int* slast = state_stack.get() + kInitialStackSize;
    int* sptr = state_stack.get();
    *sptr++ = lex_detail::sc_initial;
    ...
    while (true) {
        first += llen;
        const char* trimmed_first = first;
        int pat = 0; // declared here so it is visible in the switch below
        while (true) {
            bool stack_limitation = false;
            const char* trimmed_last = last;
            if (slast - sptr < last - trimmed_first) {
                trimmed_last = trimmed_first + (slast - sptr);
                stack_limitation = true;
            }
            pat = lex_detail::lex(trimmed_first, trimmed_last, &sptr, &llen, stack_limitation);
            if (pat >= lex_detail::predef_pat_default) {
                break; // the full lexeme is obtained
            } else if (stack_limitation) {
                // enlarge state stack and continue analysis
                <... enlarge state stack ...>
                trimmed_first = trimmed_last;
            } else {
                return 0; // end of sequence
            }
        }
        switch (pat) {
        case lex_detail::pat_comment: .....; break;
        case lex_detail::pat_string: .....; break;
        case lex_detail::pat_id: .....; break;
        case lex_detail::pat_int: .....; break;
        case lex_detail::pat_real: .....; break;
        case lex_detail::pat_ws: .....; break;
        case lex_detail::pat_other: .....; break;
        }
    }
}

Regular Expression Syntax

These rules are used to compose regular expressions for definitions or patterns:

  • x matches the character 'x', provided 'x' is not a ' ', FF, CR, HT, or VT character
  • . any character (byte) except newline (NL)
  • [xyz] a "character class"; in this case, matches 'x', 'y', or 'z'
  • [abj-oZ] a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
  • [^A-Z] a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
  • [^A-Z\n] any character EXCEPT an uppercase letter or a newline
  • {name} the expansion of the "name" definition (see above)
  • "[xyz]\"foo" the literal string: [xyz]"foo
  • \X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*')
  • \123 the character with octal value 123
  • \x2a the character with hexadecimal value 2a
  • r* zero or more r's, where r is any regular expression
  • r+ one or more r's
  • r? zero or one r's (that is, "an optional r")
  • r{2,5} anywhere from two to five r's
  • r{2,} two or more r's
  • r{,2} no more than two r's
  • r{4} exactly 4 r's
  • (r) match an r; parentheses are used to override precedence
  • rs the regular expression r followed by the regular expression s; called "concatenation"
  • r|s either an r or an s
  • r/s an r but only if it is followed by an s. The text matched by s is included when determining whether this rule is the "longest match", but is then returned to the input, so the returned lexeme is only the text matched by r. This type of pattern is called "trailing context".

Note that the ' ', FF, CR, HT, and VT characters (bytes) are skipped while parsing regular expressions; use \x20, \f, \r, \t, or \v instead. Also, the zero character '\0' (byte) is always treated as non-matchable.
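
To illustrate some of these rules, here is a hypothetical fragment of a definitions section (not part of the example above):

# exactly three hexadecimal digits
hex3      [0-9a-fA-F]{3}

# the `for` keyword, but only if not followed by an identifier character;
# due to the trailing context only `for` itself is returned as the lexeme
kw_for    for/[^a-zA-Z0-9_]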

Command Line Options

$ ./lexegen --help
Usage: lexegen [options] file
Options:
    -o <file>                Place the output analyzer into <file>.
    -h <file>                Place the output definitions into <file>.
    --no-case                Build case insensitive analyzer.
    --compress0              Do not compress analyzer table, do not use `meta` table.
    --compress1              Do not compress analyzer table.
    --compress2              Default compression.
    --use-int8-if-possible   Use `int8_t` instead of `int` for states if state count is < 128.
    -O0                      Do not optimize analyzer states.
    -O1                      Default analyzer optimization.
    --help                   Display this information.
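
For example, to generate an uncompressed and unoptimized analyzer with explicit output file names (which can be easier to inspect while debugging), one might issue:

./lexegen test.lex -h lex_defs.h -o lex_analyzer.inl --compress0 -O0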

How to Build lexegen

Perform these steps to build the project (on Linux; for other platforms the steps are similar):

  1. Clone lexegen repository and enter it

    git clone https://github.com/gbuzykin/lexegen
    cd lexegen
  2. Initialize and update uxs submodule

    git submodule update --init
  3. Then, the build scripts should be created using the cmake tool. To use the default C++ compiler just issue (for a new enough version of cmake)

    cmake -S . -B build

    or, to create build scripts for the debug or optimized configuration, issue the following

    cmake -S . -B build -DCMAKE_BUILD_TYPE="Debug"

    or

    cmake -S . -B build -DCMAKE_BUILD_TYPE="Release"
  4. Enter the created folder build and run make

    cd build
    make

    to use several parallel processes (e.g. 8) for building, run make with the -j option

    make -j 8