PA3 — The Parser

PA3 is due 9/25 at 11:59pm Central.

You must complete this assignment in Python.

You may work in a team of up to two people for this assignment. You may work in a team for any or all subsequent programming assignments. You do not need to keep the same teammates. The course staff are not responsible for finding you a willing team. If you want to work on a team, you must register your group on the autograder before submitting!

Goal

For this assignment, you will write a parser using a parser generator. You will describe the Cool grammar in an appropriate input format (specifically, using Python's ply library for lexer and parser generators). You will also write additional code to deserialize the tokens produced by the lexer stage and to serialize the abstract syntax tree produced by your parser.

Specification

You must create three artifacts:

A program, main.py, that takes a single command-line argument (e.g., file.cl-lex). That argument will be an ASCII text Cool tokens file (as described in PA2). The cl-lex file will always be well-formed (i.e., there will be no syntax errors in the cl-lex file itself). However, the cl-lex file may describe a sequence of Cool tokens that does not form a valid Cool program.

Your program must either indicate that there is an error in the Cool program described by the cl-lex file (e.g., a parse error in the Cool file) or emit file.cl-ast, a serialized Cool abstract syntax tree. Your program's main parser component must be constructed by a parser generator. The "glue code" for processing command-line arguments, deserializing tokens and serializing the resulting abstract syntax tree should be written by hand. Invoking python3 main.py file.cl-lex should yield the same output as cool --parse file.cl. Your submission can optionally contain other *.py files.
A plain ASCII text file called readme.txt describing your design decisions and choice of test cases. See the grading rubric. A few paragraphs should suffice.
Testcases good.cl and bad.cl. The first should parse correctly and yield an abstract syntax tree. The second should contain an error.
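
The token-deserializing glue code can be quite small. The following is a minimal sketch, not a reference implementation. It assumes the PA2 layout in which every token is written as a line-number line, then a token-kind line, and then (only for identifier, integer, string and type tokens) one more line holding the lexeme; check that assumption against the PA2 specification. The names load_lines, read_tokens and LEXEME_KINDS, and the token kind names in the sample, are hypothetical.

```python
# Token kinds assumed (per PA2) to be followed by an extra lexeme line.
LEXEME_KINDS = {"identifier", "integer", "string", "type"}


def load_lines(path):
    """Read a .cl-lex file, removing only the line terminators.

    Other whitespace is kept because string lexemes may contain spaces.
    """
    with open(path) as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_tokens(lines):
    """Convert .cl-lex lines into (line_number, kind, lexeme) tuples."""
    tokens = []
    i = 0
    while i < len(lines):
        line_number = int(lines[i])
        kind = lines[i + 1]
        i += 2
        lexeme = kind  # keywords and punctuation have no separate lexeme
        if kind in LEXEME_KINDS:
            lexeme = lines[i]
            i += 1
        tokens.append((line_number, kind, lexeme))
    return tokens


sample = ["3", "class", "3", "type", "List", "3", "lbrace"]
print(read_tokens(sample))
```

Keeping the line number inside every tuple means the parser never has to count newlines itself.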

You must use the ply library in Python. Do not write your entire parser by hand. Your submission is thus a grammar written to leverage the ply library.

Line Numbers

The line number for an expression is the line number of the first token that is part of that expression. Example:

(* Line 5 *) while x <= 
(* Line 6 *)        99 loop 
(* Line 7 *)   x <- x + 1 
(* Line 8 *) pool

The while expression is on line 5, the x <= 99 expression is on line 5, the 99 expression is on line 6, and the x <- x + 1 and x + 1 expressions are on line 7. The line numbers for tokens are present in the serialized token .cl-lex file.

Your parser is responsible for keeping track of the line numbers (both for the output syntax tree and for error reporting).
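
One way to apply the first-token rule is to let every AST node store its line as its first field and have compound nodes copy the line of whichever child or keyword comes first. The sketch below rebuilds the while example above by hand; the tuple layout and the helper names (leaf, binary, assign, while_loop) are illustrative assumptions, not a required design.

```python
def leaf(line, kind, text):
    # A constant or variable: the line of its single token.
    return (line, kind, text)


def binary(kind, left, right):
    # The first token of "left op right" belongs to the left operand.
    return (left[0], kind, left, right)


def assign(var_line, var_name, rhs):
    # The first token of "x <- e" is the variable being assigned.
    return (var_line, "assign", var_name, rhs)


def while_loop(keyword_line, predicate, body):
    # The first token of a loop is the "while" keyword itself.
    return (keyword_line, "while", predicate, body)


predicate = binary("le", leaf(5, "identifier", "x"), leaf(6, "integer", "99"))
total = binary("plus", leaf(7, "identifier", "x"), leaf(7, "integer", "1"))
body = assign(7, "x", total)
loop = while_loop(5, predicate, body)

# while, x <= 99, 99, x <- x + 1, x + 1
print([node[0] for node in (loop, predicate, predicate[3], body, total)])
# prints [5, 5, 6, 7, 7]
```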

Error Reporting

To report an error, write the string

ERROR: line_number: Parser: message

to standard output and terminate the program. You may write whatever you want in the message, but it should be fairly indicative. Example erroneous input:

(* Line 70 *) class Cons inherits List + IO {

Example error report output:

ERROR: 70: Parser: syntax error near +
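
A minimal sketch of producing that report from a ply error hook follows. ply calls a function named p_error with the offending token, or with None when the input ends unexpectedly; tokens expose lineno and value attributes. The helper names, the eof_line parameter, the wording of the message and the exit status of 1 are assumptions, since the specification above only asks you to print the line and terminate.

```python
import sys


def format_parse_error(line_number, message):
    """Build the report line in the required ERROR format."""
    return "ERROR: {}: Parser: {}".format(line_number, message)


def report_parse_error(token, eof_line=0):
    """Print the report to standard output and stop the program.

    token is the ply token that could not be parsed, or None at end of
    input, in which case eof_line (a caller-supplied guess, for example
    the line of the last token read) is reported instead.
    """
    if token is None:
        text = format_parse_error(eof_line, "syntax error at end of input")
    else:
        text = format_parse_error(
            token.lineno, "syntax error near {}".format(token.value))
    print(text)
    sys.exit(1)  # exit status is an assumption, not part of the spec


def p_error(p):
    # ply looks this function up by name when it builds the parser.
    report_parse_error(p)


print(format_parse_error(70, "syntax error near +"))
# prints ERROR: 70: Parser: syntax error near +
```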

The .cl-ast File Format

If there are no errors in file.cl-lex your program should create file.cl-ast and serialize the abstract syntax tree to it. The general format of a .cl-ast file follows the Cool Reference Manual Syntax chart. Basically, we do a pre-order traversal of the abstract syntax tree, writing down every node as we come to it.

We will now describe exactly what to output for each kind of node. You can view this as specifying a set of mutually-recursive tree-walking functions. The notation "superclass:identifier" means "output the superclass using the rule (below) for outputting an identifier". The notation "\n" means "output a newline".

To Output An AST. A Cool AST is a list of classes. Output the list of classes.
To Output A List (of classes, or features, or whatever). Output the number of elements, then a newline, then output each list element in turn.
To Output A Class. Output the class name as an identifier. Then output either:
- no_inherits \n
- inherits \n superclass:identifier
Then output the list of features.
To Output An Identifier. Output the source-file line number, then a newline, then the identifier string, then a newline.
To Output A Feature. Output the name of the feature and then a newline and then any subparts, as given below:
- attribute_no_init \n name:identifier type:identifier
- attribute_init \n name:identifier type:identifier init:exp
- method \n name:identifier formals-list \n type:identifier body:exp
To Output A Formal. Output the name as an identifier on a line and then the type as an identifier on a line.
To Output An Expression. Output the line number of the expression and then a newline. Output the name of the expression and then a newline and then any subparts, as given below:
- assign \n var:identifier rhs:exp
- dynamic_dispatch \n e:exp method:identifier args:exp-list
- static_dispatch \n e:exp type:identifier method:identifier args:exp-list
- self_dispatch \n method:identifier args:exp-list
- if \n predicate:exp then:exp else:exp
- while \n predicate:exp body:exp
- block \n body:exp-list
- new \n class:identifier
- isvoid \n e:exp
- plus \n x:exp y:exp
- minus \n x:exp y:exp
- times \n x:exp y:exp
- divide \n x:exp y:exp
- lt \n x:exp y:exp
- le \n x:exp y:exp
- eq \n x:exp y:exp
- not \n x:exp
- negate \n x:exp
- integer \n the_integer_constant \n
- string \n the_string_constant \n
- identifier \n variable:identifier (note that this is not the same as the integer and string cases above)
- true \n
- false \n
To Output A let Expression. (Output the line number, as usual.) Output let \n. Then output the binding list. To output a binding, do either:
- let_binding_no_init \n variable:identifier type:identifier
- let_binding_init \n variable:identifier type:identifier value:exp
Finally, output the expression that is the body of the let.
To Output A case Expression. (Output the line number, as usual.) Output case \n. Then output the case expression. Then output the case-elements list. To output a case-element, output the variable as an identifier, then the type as an identifier, then the case-element-body as an exp.
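
The let and case rules are the two that carry their own lists, so they are worth sketching. The code below is a hedged illustration of the mutually-recursive functions described above, restricted to let, case, integer and identifier expressions. It assumes a tuple-based AST of the form (line, kind, ...) with identifiers stored as (line, name) pairs; that layout and all function names are hypothetical.

```python
def emit_identifier(out, ident):
    line, name = ident
    out.extend([str(line), name])


def emit_list(out, items, emit_item):
    # Element count first, then each element in turn.
    out.append(str(len(items)))
    for item in items:
        emit_item(out, item)


def emit_binding(out, binding):
    variable, type_name, init = binding
    out.append("let_binding_no_init" if init is None else "let_binding_init")
    emit_identifier(out, variable)
    emit_identifier(out, type_name)
    if init is not None:
        emit_exp(out, init)


def emit_case_element(out, element):
    variable, type_name, body = element
    emit_identifier(out, variable)
    emit_identifier(out, type_name)
    emit_exp(out, body)


def emit_exp(out, exp):
    line, kind = exp[0], exp[1]
    out.extend([str(line), kind])
    if kind == "let":
        emit_list(out, exp[2], emit_binding)    # binding list
        emit_exp(out, exp[3])                   # body of the let
    elif kind == "case":
        emit_exp(out, exp[2])                   # expression being examined
        emit_list(out, exp[3], emit_case_element)
    elif kind == "integer":
        out.append(exp[2])
    elif kind == "identifier":
        emit_identifier(out, exp[2])
    else:
        raise NotImplementedError(kind)


# let x : Int <- 5 in x   (all on line 1)
let_exp = (1, "let",
           [((1, "x"), (1, "Int"), (1, "integer", "5"))],
           (1, "identifier", (1, "x")))
lines = []
emit_exp(lines, let_exp)
print("\n".join(lines))
```

For this input the emitted lines are 1, let, 1, let_binding_init, 1, x, 1, Int, 1, integer, 5, 1, identifier, 1, x.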

Example input:

(* Line 01 *)
(* Line 02 *)
(* Line 03 *)  class List {
(* Line 04 *)     -- Define operations on lists.
(* Line 05 *)
(* Line 06 *)     cons(i : Int) : List {
(* Line 07 *)        (new Cons).init(i, self)
(* Line 08 *)     };
(* Line 09 *)
(* Line 10 *)  };

Example .cl-ast output with comments.

1                      -- number of classes                   
3                      --  line number of class name identifier
List                   --  class name identifier
no_inherits            --  does this class inherit? 
1                      --  number of features
method                 --   what kind of feature? 
6                      --   line number of method name identifier
cons                   --   method name identifier
1                      --   number of formal parameters
6                      --    line number of formal parameter identifier
i                      --    formal parameter identifier
6                      --    line number of formal parameter type identifier
Int                    --    formal parameter type identifier
6                      --   line number of return type identifier
List                   --   return type identifier
7                      --    line number of body expression 
dynamic_dispatch       --    kind of body expression 
7                      --     line number of dispatch receiver expression 
new                    --     kind of dispatch receiver expression  
7                      --      line number of new-class identifier 
Cons                   --      new-class identifier
7                      --     line number of dispatch method identifier
init                   --     dispatch method identifier
2                      --     number of arguments in dispatch 
7                      --      line number of first argument expression
identifier             --      kind of first argument expression
7                      --       line number of the identifier
i                      --       what is the identifier? 
7                      --      line number of second argument expression
identifier             --      kind of second argument expression
7                      --       line number of the identifier
self                   --       what is the identifier?

The .cl-ast format is quite verbose, but it is particularly easy for later stages (e.g., the type checker) to read in again without having to go through all of the trouble of "actually parsing". It will also make it particularly easy for you to notice where things are going awry if your parser is not producing the correct output.

Writing the rote code to output a .cl-ast text file given an AST may take a bit of time but it should not be difficult; our reference implementation does it in 116 lines and cleaves closely to the structure given above.
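
To give a feel for that structure, here is a sketch that serializes the List example above and yields the same lines as the commented example output (without the comments). It covers only the node kinds that example needs: classes, methods, formals, dynamic_dispatch, new and identifier. The tuple-based AST layout and every function name are assumptions made for illustration; this is not the reference implementation.

```python
def emit_identifier(out, ident):
    out.extend([str(ident[0]), ident[1]])


def emit_list(out, items, emit_item):
    out.append(str(len(items)))
    for item in items:
        emit_item(out, item)


def emit_formal(out, formal):
    emit_identifier(out, formal[0])  # parameter name
    emit_identifier(out, formal[1])  # parameter type


def emit_exp(out, exp):
    line, kind = exp[0], exp[1]
    out.extend([str(line), kind])
    if kind == "dynamic_dispatch":
        emit_exp(out, exp[2])               # receiver
        emit_identifier(out, exp[3])        # method name
        emit_list(out, exp[4], emit_exp)    # arguments
    elif kind in ("new", "identifier"):
        emit_identifier(out, exp[2])
    else:
        raise NotImplementedError(kind)


def emit_feature(out, feature):
    kind = feature[0]
    out.append(kind)
    if kind != "method":
        raise NotImplementedError(kind)
    _, name, formals, return_type, body = feature
    emit_identifier(out, name)
    emit_list(out, formals, emit_formal)
    emit_identifier(out, return_type)
    emit_exp(out, body)


def emit_class(out, cls):
    name, parent, features = cls
    emit_identifier(out, name)
    if parent is None:
        out.append("no_inherits")
    else:
        out.append("inherits")
        emit_identifier(out, parent)
    emit_list(out, features, emit_feature)


def serialize(classes):
    out = []
    emit_list(out, classes, emit_class)
    return "\n".join(out) + "\n"


dispatch = (7, "dynamic_dispatch", (7, "new", (7, "Cons")), (7, "init"),
            [(7, "identifier", (7, "i")), (7, "identifier", (7, "self"))])
cons = ("method", (6, "cons"), [((6, "i"), (6, "Int"))], (6, "List"), dispatch)
ast = [((3, "List"), None, [cons])]
print(serialize(ast), end="")
```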

Parser Generators

You must use a parser generator or similar library for this assignment.

A Python parser generator called ply is available, but you must download it yourself (e.g., use pip3 install ply). The autograder has ply installed.

Like most parser generators, ply is derived from yacc (or bison), the original parser generator for C. Thus you may find it handy to refer to the Yacc paper or the Bison manual. When you're reading, mentally translate the C code references into Python.
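
For orientation, here is a toy ply grammar for integer arithmetic; it is a sketch of the general shape, not a fragment of the Cool grammar. In ply, each grammar rule is a function whose docstring holds the production, a module-level tokens sequence names the terminals, and an optional precedence table resolves operator ambiguity. The token names, the tuple-shaped AST, the TokenStream class and the convention of storing (line_number, lexeme) as each token's value are all assumptions. The ply imports are deferred into parse() so that the rule functions can be exercised on plain lists even where ply is not installed; parse() itself needs ply and should be run from a script file.

```python
# Terminals of the toy grammar; a Cool grammar needs the full PA2 token set.
tokens = ("INTEGER", "PLUS", "TIMES")

# Lowest precedence first; both operators are left-associative.
precedence = (
    ("left", "PLUS"),
    ("left", "TIMES"),
)


def p_exp_plus(p):
    "exp : exp PLUS exp"
    p[0] = (p[1][0], "plus", p[1], p[3])  # line of the left operand


def p_exp_times(p):
    "exp : exp TIMES exp"
    p[0] = (p[1][0], "times", p[1], p[3])


def p_exp_integer(p):
    "exp : INTEGER"
    line, lexeme = p[1]  # token values are (line_number, lexeme) pairs
    p[0] = (line, "integer", lexeme)


def p_error(p):
    raise SyntaxError("parse error near {!r}".format(p))


class TokenStream:
    """Stands in for a ply lexer: ply only needs a token() method."""

    def __init__(self, toks):
        self.toks = iter(toks)

    def token(self):
        return next(self.toks, None)  # None tells ply the input has ended


def parse(triples):
    """Parse (kind, line_number, lexeme) triples; requires ply."""
    import ply.lex as lex
    import ply.yacc as yacc

    toks = []
    for kind, line, lexeme in triples:
        tok = lex.LexToken()
        tok.type = kind
        tok.value = (line, lexeme)
        tok.lineno = line
        tok.lexpos = 0
        toks.append(tok)
    parser = yacc.yacc(write_tables=False, debug=False)
    return parser.parse(lexer=TokenStream(toks))
```

With ply installed, calling parse([("INTEGER", 1, "1"), ("PLUS", 1, "+"), ("INTEGER", 2, "2")]) should return a nested tuple rooted at a plus node. Feeding ply a hand-written token stream like this is what lets the parser consume the tokens read from a .cl-lex file rather than lexing source text again.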

Commentary

You can do basic testing as follows:

$ cool --lex file.cl
$ cool --out reference --parse file.cl
$ python3 main.py file.cl-lex
$ diff -b -B -E -w file.cl-ast reference.cl-ast

You may find the reference compiler's --unparse option useful for debugging your .cl-ast files.
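
If you want to run the same comparison from a Python test script, the sketch below roughly approximates the whitespace-insensitive diff above by dropping blank lines and removing all whitespace inside each line before comparing. The function names are hypothetical, and this is only an approximation of what diff does with those flags.

```python
def normalize(text):
    """Return the non-blank lines of text with all whitespace removed."""
    squeezed = ("".join(line.split()) for line in text.splitlines())
    return [line for line in squeezed if line]


def same_ast(mine, reference):
    """Compare two .cl-ast file contents, ignoring whitespace differences."""
    return normalize(mine) == normalize(reference)


# Windows line endings and a trailing blank line do not count as differences.
print(same_ast("1\r\n3\r\nList\r\n", "1\n3\nList\n\n"))  # prints True
print(same_ast("1\n3\nList\n", "1\n4\nList\n"))          # prints False
```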

Hint

If you are failing every negative test case, it is likely that you are not handling cross-platform compatibility correctly on all of your inputs and outputs.

Video Guides

NOTE: Some of these video guides are from a previous offering of a similar course at the University of Virginia. The assignment for this semester has changed slightly. While they are still relevant, you are responsible for completing the assignment according to this course's grading rubric.

A number of Video Guides are provided to help you get started on this assignment on your own. The Video Guides are walkthroughs in which the instructor manually completes and narrates, in real time, the first part of this assignment — including a submission to the grading server. They include coding, testing and debugging elements.

If you are still stuck, you can post on the forum, approach the TA or approach the instructor. The use of online instructional content outside of class weakly approximates a flipped classroom model. Click on a video guide to begin, at which point you can watch it fullscreen or via YouTube if desired.

Python3 + PLY

AST Node classes

Python + PLY

What to Turn In For PA3

You must turn in several files to the autograder:

readme.txt — your README file describing your implementation.
good.cl — a novel positive testcase
bad.cl — a novel negative testcase
Source code of your implementation, including
- main.py -- the main implementation (we will execute it with python3 main.py name_of_test.cl-lex on the autograder)
- optionally, you can include up to 20 more *.py files (e.g., that might be imported by your script).

Grading Rubric

PA3 Grading (out of 130 points):

100 points — for autograder tests (-1 point per incorrect test, minimum score of 0)
10 points — for a clear description in your README

- 10 — thorough discussion of design decisions (e.g., the handling of let) and choice of test cases; a few paragraphs of coherent English sentences should be fine
- 5 — vague or hard to understand; omits important details
- 0 — little to no effort, or submitted an RTF/DOC/PDF file instead of plain TXT

12 points — 6 points each for valid and novel good.cl and bad.cl files
- 6 — wide range of test cases added, stressing most Cool features and an error condition, novel file
- 3 — added some tests, but the scope not sufficiently broad
- 0 — little to no effort, or part of course file resubmitted as test case
8 points — for code cleanliness
- 8 — code is mostly clean and well-commented
- 4 — code is sloppy and/or poorly commented in places
- 0 — little to no effort to organize and document code