Programming Assignment 6

Course Project: Phase 2 -- The Lexical Analyzer


  • Background:

    • Due Date:
  • Overview:

    For the second step in the compiler project, you will write a scanner for C using the Unix utility lex(1) (or the GNU utility flex(1)).

    The scanner will be coded as a procedure with a driver (main) program calling the scanner for a token. In the next phase, the parser will replace the driver as the caller. To verify the correctness of the translated source program, the driver will also write the tokens to a file, the token file. The ONLY function of the driver is to request a token from the scanner and write the token file; all other processing must be handled by the scanner. Note that you should also pass the original source code through to the token file as comments.
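    The division of labor described above can be sketched as follows. This is only an illustration under stated assumptions: write_tokens and demo_scanner are hypothetical names, the token codes are made up, and demo_scanner stands in for the lex-generated scanner so that the fragment compiles without a lex specification. It borrows one real convention from lex, namely that the scanner returns 0 at end of input.

```c
#include <stdio.h>

/* Hypothetical token codes; a real project would take these from its
   token header rather than defining them here. */
#define ID_tok  257
#define NUM_tok 258

/* Stand-in for the lex-generated scanner (an assumption of this sketch):
   hands back three tokens, then 0 for end of input, as yylex() does. */
static int demo_scanner(void)
{
    static const int tokens[] = { ID_tok, NUM_tok, ID_tok, 0 };
    static size_t next = 0;
    int tok = tokens[next];
    if (tok != 0)
        next++;
    return tok;
}

/* The driver's only job: request one token at a time from the scanner and
   write it to the token file. Returns the number of tokens written. */
static int write_tokens(FILE *token_file, int (*next_token)(void))
{
    int count = 0;
    int tok;
    while ((tok = next_token()) != 0) {
        fprintf(token_file, "%d\n", tok);
        count++;
    }
    return count;
}
```

    In an actual driver, yylex would be passed in place of demo_scanner. Taking the scanner as a function pointer is a design choice of this sketch, not a requirement; it keeps the loop unchanged when the caller is later replaced by the parser.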

  • Debugging and Output:

    Your driver needs to accept and parse at least the following command line options:

    • -d[ls]* Produce debugging output (see below)
    • -o Output the final product to a file

    The -o option is not required; if given, output should go to the named file. Otherwise, output may be placed in a consistent location of your choosing. For the purposes of this assignment, the output file is the token file, and should be written by the driver.

    The -d option, if one is given, may be followed by 0-2 of the letters l and s, corresponding to the Lexer and Symbol table. For each letter given, a corresponding debugging output file shall be generated. Each type of debugging information should be placed in a separate file. The lexer debugging output should consist of a list of tokens and their values as returned by the scanner (lexer). The symbol table debugging output should consist of symbol table dumps at key points during execution (for example, at the beginning and end of blocks).

    Note that for this assignment (and possibly for future work) you should allow the illegal token "!!S" to force a symbol table debugging dump. Any other sequence beginning with "!!" may be used to force other types of debugging output, as you choose.

    You may, of course, elect to produce additional debugging output for each phase, or to accept additional arguments to the -d option. You may also, if you wish, accept additional arguments for whatever uses you like, provided that these are appropriately documented.
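    One possible way to handle these options by hand is sketched below. The struct options type and the parse_args function are hypothetical names. The sketch also makes assumptions the text above does not state: the file name follows -o as a separate argument, the first non-option argument is the source file, and any letter other than l or s after -d is rejected (you are free to accept more).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the driver's settings. */
struct options {
    int debug_lexer;          /* set when 'l' follows -d */
    int debug_symtab;         /* set when 's' follows -d */
    const char *output_path;  /* argument of -o, or NULL if absent */
    const char *input_path;   /* first non-option argument, or NULL */
};

/* Parse argv into *opt. Returns 0 on success, -1 on a malformed option. */
static int parse_args(int argc, char **argv, struct options *opt)
{
    opt->debug_lexer = 0;
    opt->debug_symtab = 0;
    opt->output_path = NULL;
    opt->input_path = NULL;

    for (int i = 1; i < argc; i++) {
        const char *arg = argv[i];
        if (strncmp(arg, "-d", 2) == 0) {
            /* -d may be followed by any mix of 'l' and 's', or nothing. */
            for (const char *p = arg + 2; *p != '\0'; p++) {
                if (*p == 'l')
                    opt->debug_lexer = 1;
                else if (*p == 's')
                    opt->debug_symtab = 1;
                else
                    return -1;
            }
        } else if (strcmp(arg, "-o") == 0) {
            if (i + 1 >= argc)
                return -1;            /* -o given without a file name */
            opt->output_path = argv[++i];
        } else if (arg[0] == '-') {
            return -1;                /* unknown option */
        } else if (opt->input_path == NULL) {
            opt->input_path = arg;
        } else {
            return -1;                /* more than one input file */
        }
    }
    return 0;
}
```

    With this sketch, the arguments -dls -o out.tok in.c turn on both debugging files, direct the token file to out.tok, and take in.c as the source. The standard getopt(3) routine is an alternative if you would rather not scan argv yourself.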

  • Error/Warning Reporting:

    The scanner will detect errors or warnings caused by exceptions to the lexical rules of the language. An error occurs when an illegal character is on the line; a warning occurs when there is an integer overflow or an identifier is too long. The scanner will indicate in the token file that an error has occurred by returning an error token. The scanner will also indicate that an error or warning has occurred by issuing a descriptive message on stderr. The message should indicate the location and type of the problem. You may wish to display the line in question and indicate graphically the error or warning, and its location on that line.

    		FOR i := 1 T' 4
     		                 ^ 41: Unexpected character '
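    A message in the style of the example above could be built as in the following sketch. The function name format_diagnostic is hypothetical. The handout does not say whether the number before the colon is a line number or an error code; this sketch assumes it is a line number. It also assumes a 0-based column and a source line containing no tab characters, since a tab would throw off the caret position.

```c
#include <stdio.h>

/* Build a two-line diagnostic: the offending source line, then a caret
   under the given 0-based column followed by "<line_no>: <message>".
   Returns what snprintf returns (the length the full text needs). */
static int format_diagnostic(char *buf, size_t size, const char *line,
                             int column, int line_no, const char *message)
{
    /* "%*s" applied to an empty string emits `column` spaces, which
       pushes the caret under the character being reported. */
    return snprintf(buf, size, "%s\n%*s^ %d: %s\n",
                    line, column, "", line_no, message);
}
```

    The resulting buffer can be written with fputs(buf, stderr), which keeps diagnostics separate from the token file.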
    		
  • Symbol Table Interaction:

    The scanner will be responsible for constructing the symbol table. The symbol table will contain the strings representing the identifiers. You may wish to add a debugging option that causes the symbol table to be written to a file at the end of scanning. Note that in the next phase, the responsibility for construction of the symbol table will be divided between the scanner and parser, so this code will have to be modified at that time. For now, you may assume that all symbols will go in the base level of the table.
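    A single-level table of identifier strings might look like the sketch below. Everything named here (struct sym_node, sym_intern, sym_dump, the bucket count) is an assumption made for illustration; the assignment does not prescribe a data structure. The sketch uses a chained hash table and returns the same node each time the same identifier is seen.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SYM_BUCKETS 211   /* arbitrary prime chosen for this sketch */

struct sym_node {
    char *name;               /* private copy of the identifier text */
    struct sym_node *next;    /* next node in the same bucket */
};

static struct sym_node *sym_table[SYM_BUCKETS];

static unsigned sym_hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % SYM_BUCKETS;
}

/* Return the node for `name`, creating it on first sight.
   Returns NULL only if memory runs out. */
static struct sym_node *sym_intern(const char *name)
{
    unsigned h = sym_hash(name);
    for (struct sym_node *n = sym_table[h]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;

    struct sym_node *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    size_t len = strlen(name) + 1;
    node->name = malloc(len);
    if (node->name == NULL) {
        free(node);
        return NULL;
    }
    memcpy(node->name, name, len);
    node->next = sym_table[h];
    sym_table[h] = node;
    return node;
}

/* Write every identifier to `out`, one per line; returns the entry count. */
static int sym_dump(FILE *out)
{
    int count = 0;
    for (int i = 0; i < SYM_BUCKETS; i++)
        for (struct sym_node *n = sym_table[i]; n != NULL; n = n->next) {
            fprintf(out, "%s\n", n->name);
            count++;
        }
    return count;
}
```

    A dump routine such as sym_dump is the kind of function the "!!S" token or a debugging option could trigger. Block levels are deliberately left out, matching the assumption that all symbols go in the base level for now.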

  • Information returned by the Scanner:

    When a lexeme is detected, information about the token it represents will be returned to the driver program and, in this phase, written to the output file:

    • You will need to create a header containing definitions for all tokens, as shown below. Each token should correspond to a distinct integer value. You must then include this header in your scanner and your driver. You should never use the numeric values of the tokens; always use the symbolic names. A sample entry in this header might look like:
      			#define ARRAY_tok 256
      			
    • If the token is a reserved word, delimiter or operator, then a scalar will be returned to indicate that token. For example, for the character string, "BEGIN", return the symbolic constant BEGIN_tok.

    • If the token is an identifier, return the scalar value corresponding to an identifier (such as ID_tok) and the symbol table node you constructed from it. Note that supplemental return values must be placed in yylval or some other suitable global.

    • If the token is a numeric string, return the scalar value corresponding to an integer (such as NUM_tok) and the integer value of the token.
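    The return conventions in this list can be illustrated with the sketch below. Only ARRAY_tok and its value come from the sample entry above; the other token values, the union token_value type, the token_val global (playing the role of yylval) and the scan_number function are hypothetical. The sketch also shows one way to detect the integer overflow that should produce a warning. Clamping the value to INT_MAX is a choice made here, not something the assignment specifies.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Token header entries: ARRAY_tok follows the handout's sample; the
   other two values are assumptions of this sketch. */
#define ARRAY_tok 256
#define ID_tok    257
#define NUM_tok   258

struct sym_node;   /* would be defined by the symbol table module */

/* Supplemental value that accompanies the scalar token code. */
union token_value {
    int number;                /* used with NUM_tok */
    struct sym_node *symbol;   /* used with ID_tok */
};

static union token_value token_val;   /* stands in for yylval */

/* Convert a string of decimal digits, store the result in token_val, and
   return NUM_tok. *overflowed is set to 1 if the value does not fit in an
   int (the caller would then issue a warning), otherwise to 0. */
static int scan_number(const char *text, int *overflowed)
{
    errno = 0;
    long value = strtol(text, NULL, 10);
    *overflowed = (errno == ERANGE || value > INT_MAX);
    token_val.number = *overflowed ? INT_MAX : (int)value;
    return NUM_tok;
}
```

    In a lex specification, a function like this would typically be called from the action of the rule that matches digit strings, with yytext as the argument.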
  • Materials to be Submitted:

    On the due date you are to hand in a folder with a title page that lists your team, and then an informational page on where to find your compiler's web page. The web page is to have the following:

    • Table of Contents (first page with links to others)
    • Overview of the document
      • Your name, course, date, etc.
      • Language features of the compiler
      • Implementation (symbol tables, etc.)
      • Assumptions Made (line length, etc.)
      • Restrictions (number of errors before it quits, etc.)
    • Directions to run your program, location of the files
    • Source Code files (a page with links to the files). Each file contains a heading explaining or describing the code.
      • Your test driver
      • Your scanner (lex/flex) description
      • Other code such as the symbol table
    • Script for the run of the program (i.e., learn how to use script; see man script). Your script file should contain at least:
      1. The results for three test cases run without debugging output.
      2. The results for three test cases run with debugging output (that is, include the token file and debugging output from the scanner and symbol table. Make sure you insert the appropriate pragmas to cause symbol table dumps).
      3. A demonstration that your driver handles command line arguments properly.
      4. The source code of your test cases.
    • The listing files (a page with links to the input and listing files)
  • C Language:

    • A list of the Reserved Words for C is here