Ada Lexer

Christoph Karl Walter Grein

Last Update: 5 July 2011

This is a lexical analyzer written in Ada. It transforms a stream of characters representing an Ada program into a stream of language-specific tokens, a token being an element of the language grammar such as an identifier or a reserved word.

Thus the Ada declaration

     type Colour is (Red, Green, Blue);

would be returned as the sequence of tokens

     Type_A, Identifier_A ("Colour"), Is_A, Left_Parenthesis_A,
     Identifier_A ("Red"), Comma_A, Identifier_A ("Green"),
     Comma_A, Identifier_A ("Blue"), Right_Parenthesis_A, Semicolon_A

A tricky case is Character'('a'), which, when correctly analyzed, results in

     Identifier_A ("Character"), Tick_A, Left_Parenthesis_A,
     Character_A ("'a'"), Right_Parenthesis_A

and not in

     Identifier_A ("Character"), Character_A ("'('"),
     Identifier_A ("a"), Tick_A, Right_Parenthesis_A
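
One common way to resolve this ambiguity is to remember what kind of token came last: directly after a name or a closing parenthesis, an apostrophe can only be a tick; elsewhere, an apostrophe followed two characters later by another apostrophe starts a character literal. The following is a minimal sketch of that idea only, not the code of the lexer itself. The procedure name Tick_Demo and the variable Tick_Expected are made up for illustration, and only the few tokens of this example are handled.

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;

procedure Tick_Demo is

   use Ada.Characters.Handling;

   Source : constant String := "Character'('a')";

   --  Hypothetical flag: True directly after an identifier or a right
   --  parenthesis, where an apostrophe can only be a tick.
   Tick_Expected : Boolean := False;

   I : Positive := Source'First;

begin
   while I <= Source'Last loop
      if Is_Letter (Source (I)) then
         --  Identifier: letters, digits and underscores (simplified).
         declare
            Start : constant Positive := I;
         begin
            while I <= Source'Last
              and then (Is_Alphanumeric (Source (I)) or else Source (I) = '_')
            loop
               I := I + 1;
            end loop;
            Ada.Text_IO.Put_Line
              ("Identifier_A (""" & Source (Start .. I - 1) & """)");
         end;
         Tick_Expected := True;

      elsif Source (I) = ''' then
         if not Tick_Expected
           and then I + 2 <= Source'Last
           and then Source (I + 2) = '''
         then
            --  Apostrophe, any character, apostrophe: a character literal.
            Ada.Text_IO.Put_Line
              ("Character_A (""" & Source (I .. I + 2) & """)");
            I := I + 3;
         else
            Ada.Text_IO.Put_Line ("Tick_A");
            I := I + 1;
         end if;
         Tick_Expected := False;

      elsif Source (I) = '(' then
         Ada.Text_IO.Put_Line ("Left_Parenthesis_A");
         I := I + 1;
         Tick_Expected := False;

      elsif Source (I) = ')' then
         Ada.Text_IO.Put_Line ("Right_Parenthesis_A");
         I := I + 1;
         Tick_Expected := True;

      else
         --  Everything else is skipped in this sketch.
         I := I + 1;
         Tick_Expected := False;
      end if;
   end loop;
end Tick_Demo;
```

Without the Tick_Expected test, the apostrophe after Character would be taken as the start of the character literal '(' and the wrong token sequence shown above would result.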

All token names end in _A (this allows a simple naming scheme for the reserved words, e.g. type becomes Type_A).

This version recognizes the new Ada 2012 reserved words. Selecting the Ada generation (83, 95, 2005, 2012) lets the lexer classify each word correctly as either a reserved word or an identifier.
However, Make_Body most probably cannot handle the syntax introduced with Ada 2005 and later. Support for it will be added eventually.
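
To illustrate why the generation matters: a word like interface is an ordinary identifier in Ada 83 and Ada 95 but reserved from Ada 2005 on, and some became reserved only with Ada 2012. The sketch below shows one possible way to express this; the names Generation, Is_Reserved and Classify are assumptions made for the example and not the lexer's actual interface, and the word list is only a small subset.

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;

procedure Generation_Demo is

   --  Hypothetical enumeration of the selectable language generations.
   type Generation is (Ada_83, Ada_95, Ada_2005, Ada_2012);

   --  True if Word is reserved in the given generation.
   --  Only a small illustrative subset of the reserved words is listed.
   function Is_Reserved (Word : String; Gen : Generation) return Boolean is
      Lower : constant String := Ada.Characters.Handling.To_Lower (Word);
   begin
      if Lower = "type" or else Lower = "is"
        or else Lower = "begin" or else Lower = "end"
      then
         return True;                --  reserved since Ada 83
      elsif Lower = "abstract" or else Lower = "aliased"
        or else Lower = "protected" or else Lower = "requeue"
        or else Lower = "tagged" or else Lower = "until"
      then
         return Gen >= Ada_95;       --  added by Ada 95
      elsif Lower = "interface" or else Lower = "overriding"
        or else Lower = "synchronized"
      then
         return Gen >= Ada_2005;     --  added by Ada 2005
      elsif Lower = "some" then
         return Gen >= Ada_2012;     --  added by Ada 2012
      else
         return False;
      end if;
   end Is_Reserved;

   --  Names the kind of token a word would become.
   function Classify (Word : String; Gen : Generation) return String is
   begin
      if Is_Reserved (Word, Gen) then
         return "reserved word";
      else
         return "identifier";
      end if;
   end Classify;

begin
   for Gen in Generation loop
      Ada.Text_IO.Put_Line
        (Generation'Image (Gen)
         & ": interface is " & Classify ("interface", Gen)
         & ", some is " & Classify ("some", Gen));
   end loop;
end Generation_Demo;
```

With this table, interface is reported as an identifier for Ada_83 and Ada_95 and as a reserved word for Ada_2005 and Ada_2012, while some is reserved only for Ada_2012.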

Also included are a few small utility programs.

Bug reports and proposals for improvements are welcome. Please include the release date in reports (see the What's new file you get when unpacking the zip file).

Updates       Reason
5 July 2011   Lexer: Ada 2012.
30 Oct 2007   Lexer: Very minor update.
26 Jan 2006   Bug fixes in Make_Body.
4 Nov 2005    Only minor restructuring of directories.
23 Oct 2005   Colorize restructured so that it no longer needs Gnat preprocessing.
              Added Gnat project files for easy compilation.
20 Aug 2005   Lexer can now work concurrently on several files.
              Attributes are recognized.
              The additional Ada 2005 reserved words are recognized.
              Java recognition has been removed.
              Make_Body could not consume variant records and thus did not produce a correct body when such a type was present. This is now fixed. It now also recognizes pragma Import as a completion.
              Colorize now also colorizes attributes.
              License changed to GMGPL.
              Restructured directories.
01 Feb 2001   Last version that can lex Java.
              Lexer correctly works on things like Character'('a').
30 July 1998  First release.

(*) The code is shown colorized. The colorizing was done by an application of the lexer, which is also included in the zipped distribution.

Other Lexers

There are two further lexical scanners, one for Ada and one for Java source code, which I built with OpenToken. They differ from the analyzer presented above mainly in the way they treat syntax errors (illegal tokens). Also, because OpenToken is a very general framework for all kinds of analyzers, they do not give as much additional information on (legal) tokens, such as the base of a based number token or information about Java's documentation tags.

OpenToken was originally developed by the following company and was released as open-source software as a service to the community:

FlightSafety International Simulation Systems Division
Broken Arrow, OK USA 918-259-4000

It is now maintained by Ted Dennison. To quote from the OpenToken Project Homepage Readme (version 2.0):

The OpenToken package is a facility for performing token analysis and parsing within the Ada language. It is designed to provide all the functionality of a traditional lexical analyzer/parser generator, such as lex/yacc. But due to the magic of inheritance and runtime polymorphism it is implemented entirely in Ada as withed-in code. No precompilation step is required, and no messy tool-generated source code is created.

Additionally, the technique of using classes of recognizers promises to make most token specifications as simple as making an easy to read procedure call. The most error prone part of generating analyzers, the token pattern matching, has been taken from the typical user's hands and placed into reusable classes. Over time I hope to see the addition of enough reusable recognizer classes that very few users will ever need to write a custom one. Parse tokens themselves also use this technique, so they ought to be just as reusable in principle, although there currently aren't a lot of predefined parse tokens included in OpenToken.

Ada's type safety features should also make misbehaving analyzers and parsers easier to debug. All this will hopefully add up to token analyzers and parsers that are much simpler and faster to create, easier to get working properly, and easier to understand.

There is also a mailing list for OpenToken. For more information see the OpenToken homepage.

OpenToken uses a set of recognizers for the tokens of the syntax you want to analyze. It comes with a set of predefined recognizers for tokens like numerals, keywords, identifiers, strings, etc. Each recognizer is a little pattern matching state machine, hence creating your own recognizers (if necessary at all) should be fairly easy.
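
To give an impression of what "a little pattern matching state machine" means, here is a stand-alone sketch of such a machine for decimal numerals with single underscores between digits, like 1_000. It deliberately does not use OpenToken's recognizer types or any other part of its API; the names State and Is_Numeral are invented for this example.

```ada
with Ada.Text_IO;

procedure Recognizer_Demo is

   --  Hypothetical states of the recognizer.
   type State is (Start, In_Digits, After_Underscore, Failed);

   --  True if Text consists of digits in which single underscores
   --  appear only between two digits, e.g. 1_000.
   function Is_Numeral (Text : String) return Boolean is
      Current : State := Start;
   begin
      for I in Text'Range loop
         case Current is
            when Start | After_Underscore =>
               --  A digit is required here.
               if Text (I) in '0' .. '9' then
                  Current := In_Digits;
               else
                  Current := Failed;
               end if;
            when In_Digits =>
               --  More digits, or one underscore, may follow.
               if Text (I) in '0' .. '9' then
                  null;
               elsif Text (I) = '_' then
                  Current := After_Underscore;
               else
                  Current := Failed;
               end if;
            when Failed =>
               exit;
         end case;
      end loop;

      --  Only a text that ends on a digit is accepted.
      return Current = In_Digits;
   end Is_Numeral;

begin
   Ada.Text_IO.Put_Line ("1_000 : " & Boolean'Image (Is_Numeral ("1_000")));
   Ada.Text_IO.Put_Line ("1__0  : " & Boolean'Image (Is_Numeral ("1__0")));
   Ada.Text_IO.Put_Line ("12_   : " & Boolean'Image (Is_Numeral ("12_")));
end Recognizer_Demo;
```

The first text is accepted and the other two are rejected; a real recognizer would in addition report how many characters it has matched so that the analyzer can pick the longest match.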

The main reason I built these two lexers (you will get them when downloading OpenToken) in addition to my own was to find out how to work within this framework - and I must say it worked like a dream: It was very easy to set them up, in fact a matter of only a few hours. (OK, updating existing recognizers and creating new ones, debugging, testing and polishing consumed another few days, but it was much simpler work than my own Lexer.)

So if you ever want to build a lexer for some kind of formal syntax, I strongly recommend using OpenToken.
