
Navigating the Cython Compiler Source Code: short guide

This document aims to provide a quick, concise overview of the core source code of the Cython compiler that still hits all the essentials.
  • Last Update: 2020-05-19


Source files of interest

cython/Cython/Compiler/
 ├── Main.py
 ├── Pipeline.py
 │
 ├── Lexicon.py
 ├── Scanning.py
 ├── Parsing.py
 │
 ├── Nodes.py
 ├── ExprNodes.py
 │
 ├── Visitor.py
 │
 ├── PyrexTypes.py
 ├── Symtab.py
 │
 ├── ParseTreeTransforms.py
 │
 ├── FlowControl.py
 ├── TypeInference.py
 │
 ├── ModuleNode.py
 ├── Code.py
 └── Annotate.py

Core Pipeline

All roads lead to Main.py: every invocation of the Cython compiler goes through it. More precisely, they all call Main.run_pipeline. From there we go straight into Pipeline.py to set up and run the compiler pipeline.

Pipeline.py is where the whole compilation process is directed from: a pipeline of successive transformations is applied to turn the source Cython code into C code.

A pipeline phase is a callable object that takes some data and returns transformed data.

All pipeline steps are set up by various factory functions in Pipeline.py.

Pipeline.run_pipeline is the engine that chains these successive data transformations together: you put Cython code into the pipeline and generated C code comes out.
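
To make the chaining concrete, here is a minimal sketch of what the engine boils down to (illustrative only; the real Pipeline.run_pipeline also handles error reporting, timing and tree printing):

    # Hypothetical, simplified version of the pipeline engine.
    def run_pipeline(pipeline, source):
        data = source
        for phase in pipeline:
            if phase is not None:   # pipelines may contain None placeholders
                data = phase(data)  # each phase is a callable: data in, data out
        return data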

Parsing

The first pipeline step is parsing the Cython source code to produce an Abstract Syntax Tree (AST).

The entry point of parsing, and therefore the pipeline phase callable, is Parsing.p_module.

To understand the parsing process, you have two approaches:

  • the first is to go to Parsing.p_module, which leads you straight to Parsing.p_statement_list and then Parsing.p_statement, where interesting things happen. That’s a top-down kind of approach.

  • if that doesn’t work for you, simply starting at the top of Parsing.py and reading your way down actually yields a fairly coherent order, in a bottom-up kind of way (bottom-up relative to the function calls, not the file).

As a side-note, if you’re more used to a formal grammar with BNF notation, there is a not-quite-finished, out-of-date, and apparently forgotten grammar for the Cython language under Cython/Parser/Grammar.

Before you do any of that though, you should take a look at Scanning.py: all parsing functions in Parsing.py take as argument a Scanning.PyrexScanner object s. It handles emitting a stream of tokens from the Cython source code as a handy abstraction for parsing. The rules for token creation are defined in Lexicon.py.
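
Here is a hedged sketch of the recursive-descent style this results in (the helper names below are modeled on Scanning.py and Parsing.py, but the real functions carry extra arguments and error handling):

    # Illustrative only: roughly how a parsing function consumes tokens.
    def p_pass_statement(s):            # s is a Scanning.PyrexScanner
        pos = s.position()              # remember the source position for the node
        s.expect_keyword('pass')        # consume the 'pass' token or report an error
        return Nodes.PassStatNode(pos)  # build the matching AST node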

And before you delve too deeply into that, if you want to know more about how those tokens are produced, the Plex documentation at https://pythonhosted.org/plex/ is a good starting point. You can then look into cython/Cython/Plex if you need more.

To understand Parsing.py it can be useful to refer to https://docs.python.org/3/reference/, in particular the lexical analysis section, as well as the grammar-explaining sections starting with expressions. Note that the latest Python language features aren’t always supported in Cython yet.

The Abstract Syntax Tree

Parsing.p_module returns an AST in the form of its root node: a ModuleNode.ModuleNode representing a whole module.

The tree is made of nodes. The base class for nodes is Nodes.Node.

The tree structure stems from the fact that each subclass defines a child_attrs class attribute listing the names of the attributes that hold child nodes. Each such attribute can contain either a single node or a list of nodes. When it is a list, the list itself is not taken into account as part of the tree, i.e. the items in the list are direct children of the parent node.

Many expression nodes in ExprNodes.py use the subexprs class attribute in the same role as child_attrs. To make child_attrs work anyway, ExprNodes.ExprNode defines child_attrs as a read-only property that returns the instance’s subexprs.
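
Here is a hedged sketch of what this looks like in Nodes.py (attribute names taken from the actual SingleAssignmentNode, but simplified):

    # Illustrative node declaration: the class lists its child attributes.
    class SingleAssignmentNode(StatNode):
        #  lhs   ExprNode   the assignment target
        #  rhs   ExprNode   the value being assigned
        child_attrs = ["lhs", "rhs"]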

I suggest you take the time to look at a few nodes and just figure out what they’re for. This will inevitably begin to give you an idea of the tree structure. Most node classes have helpful descriptions in comments, and their child_attrs attribute tells you what they call their children in the tree. You can also look in Parsing.py for where the node is instantiated. Note, however, that many nodes are only produced in later compilation stages.

Getting started with the AST

To get you started, here are a few omnipresent expression nodes:

Node                 children attributes   other relevant attributes   matches
NameNode             -                     name: string                references to name
AttributeNode        obj                   attribute: string           obj . attribute
IndexNode            base, index           -                           base [ index ]
SimpleCallNode       function, args, …     -                           function ( args )
UnopNode             operand               operator: string            operator operand
 e.g. AmpersandNode  operand               operator = '&'              & operand
BinopNode            operand1, operand2    operator: string            operand1 operator operand2

As well as a few statement nodes:

Node                  children attributes       other relevant attributes   matches
PassStatNode          -                         -                           pass
SingleAssignmentNode  lhs, rhs                  -                           lhs = rhs
StatListNode          stats                     -                           -
IfStatNode            if_clauses, else_clause   -                           -
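
As an illustration, the statement a = b + 1 parses to roughly the following tree (hedged: the concrete BinopNode subclass and attribute details can vary across Cython versions):

    # Approximate AST for: a = b + 1
    SingleAssignmentNode(
        lhs = NameNode(name='a'),
        rhs = AddNode(                      # a BinopNode subclass, operator '+'
            operand1 = NameNode(name='b'),
            operand2 = IntNode(value='1')))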

Past this point making a nice-looking table becomes more complicated, but don’t let that stop you from taking a look at CStructOrUnionDefNode, CFuncDefNode(FuncDefNode), DefNode(FuncDefNode), …

One other thing: CBaseTypeNode and CDeclaratorNode and their subclasses are used to represent the two main components of any C-style declaration:

  • a type specifier
  • a declarator

As such, you will find them as children of:

  • CVarDefNode: C variable declaration or C function declaration without definition (forward or extern)
  • CArgDeclNode: C function argument declaration
  • CFuncDefNode: C function definition
  • CTypeDefNode: C-style typedef

Declarators especially can be puzzling, as they can be composed together to form ever more complex declarations:

Node                      children attributes           other attributes   example
CNameDeclaratorNode       -                             name: string       'a' in int a
CPtrDeclaratorNode        base                          -                  * base
CReferenceDeclaratorNode  base                          -                  & base
CArrayDeclaratorNode      base, dimension               -                  base [ dimension ]
CFuncDeclaratorNode       base, args, exception_value   -                  base ( args ) except exception_value
CConstDeclaratorNode      base                          -                  only for const pointers: * const base

The parser can distinguish between declarators and similar-looking expressions by context (e.g. the cdef keyword).
If you need to know more about C-style declarations, look them up. You can start here: http://c-faq.com/decl/cdecl1.html.
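
For example, here is how a composed declaration plausibly decomposes (a hedged reading of Parsing.p_c_declarator; check the parser if the exact nesting matters to you):

    # cdef int *a[3]     ('a' is an array of 3 pointers to int)
    #
    # base type node:  int
    # declarator:      CPtrDeclaratorNode(
    #                      base = CArrayDeclaratorNode(
    #                          base = CNameDeclaratorNode(name='a'),
    #                          dimension = <IntNode 3>))
    #
    # During analysis the type is built outside-in: pointer-to-int first,
    # then array-of-3 of that, matching the C reading of the declaration.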

The Visitor Framework

Compilers commonly use a visitor pattern to separate the implementation of operations on the nodes of the tree from the implementation of the nodes themselves. This allows you to implement new operations without changing the node classes.

If you’ve never heard of the visitor pattern, look it up!

Visitor.py implements a visitor framework using introspection, string concatenation and the Method Resolution Order, in a similar way to the description in http://peter-hoffmann.com/2010/extrinsic-visitor-pattern-python-inheritance.html, with added caching for performance.

This is implemented in the base class Visitor.TreeVisitor:

  • to implement a visiting method for nodes that are instances of a class SomeNode (subclasses included), name the method visit_SomeNode.
  • calling visit(node) dispatches the call to the closest matching visiting method according to the MRO.
  • calling visitchildren(node) visits the node’s children according to node.child_attrs.

Visitor.VisitorTransform is a subclass that provides an additional feature: nodes can be replaced by the return value of the visiting method that visited them, and even removed from the tree by returning None. When the visited child is held by its parent not as a single node but as one item in a list of nodes, more than one node can be inserted where the original node was, simply by returning a list of nodes instead of a single one. This only works when the parent attribute is a list containing (among others) the visited node; it doesn’t work when the node is directly referenced by an attribute of its parent (and of course it’s impossible for the root node).
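
A minimal sketch of a transform, assuming the common idiom used throughout ParseTreeTransforms.py (a catch-all visit_Node that recurses and returns the node unchanged):

    from Cython.Compiler.Visitor import VisitorTransform

    class DropPassStatements(VisitorTransform):
        # Illustrative transform: remove every 'pass' statement from the tree.

        def visit_PassStatNode(self, node):
            return None                  # removed from the parent's stats list

        def visit_Node(self, node):      # default case: keep the node, recurse
            self.visitchildren(node)
            return node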

Types

PyrexTypes.py implements classes that represent types. The common base class is PyrexTypes.BaseType, but actual Cython types inherit from PyrexTypes.PyrexType. Inheriting directly from PyrexTypes.BaseType is meant for pseudo types.

Primitive types are pre-instantiated and held in aptly-named global variables towards the end of PyrexTypes.py. This also includes error_type, unspecified_type, and py_object_type.

User-defined types like C functions and extension classes are instantiated on the fly when each function or class is declared. Python types like Python functions simply use py_object_type.

Types provide methods to examine the relationship between two types, as well as methods that will be used to generate C code related to the type.
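
To get a feel for the type objects, you can poke at the pre-instantiated ones (a hedged sketch; the helper names below exist in PyrexTypes.py at the time of writing, but double-check against your version):

    from Cython.Compiler import PyrexTypes

    int_ptr = PyrexTypes.c_ptr_type(PyrexTypes.c_int_type)  # the type "int *"
    print(int_ptr.declaration_code("p"))                    # C declaration: "int *p"
    print(int_ptr.assignable_from(PyrexTypes.c_int_type))   # relationship check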

Scopes

Symtab.py implements classes dealing with scopes.

The base class for scopes is Symtab.Scope.

The following are the main nodes that lead to a scope being created and associated with them:

Node                   Scope                        When?
ModuleNode             ModuleScope                  parsing
FuncDefNode            LocalScope or ClosureScope   declaration analysis
PyClassDefNode         PyClassScope                 declaration analysis
CClassDefNode          CClassScope                  declaration analysis
PropertyNode           PropertyScope                declaration analysis
CStructOrUnionDefNode  StructOrUnionScope           declaration analysis
CppClassNode           CppClassScope                declaration analysis
ComprehensionNode      GeneratorExpressionScope     expression analysis

Scopes are essentially there to map names to Symtab.Entry objects: each time a function, a class, a variable, etc., is declared, a new Entry is created and mapped to its name in the Cython code. Later references to that name can then look the matching entry up.

Symtab.Entry objects automatically hold a reference to the scope to which they belong and the type of the corresponding Cython name.

Scopes have methods to look for existing entries and hold references to their outer_scope, allowing them to look for entries in an enclosing scope:

  • lookup_here: look for an entry in this scope.
  • lookup: look for an entry in this scope or an enclosing one.
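
Put together, the chain behaves roughly like this (a simplified sketch, not the actual implementation, which also consults built-in scopes and handles closures specially):

    # Hypothetical, simplified name lookup across nested scopes.
    def lookup(scope, name):
        entry = scope.lookup_here(name)              # this scope's own entries
        if entry is None and scope.outer_scope is not None:
            entry = lookup(scope.outer_scope, name)  # climb to the enclosing scope
        return entry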

Scopes have methods to declare new entries:

  • declare: create a new entry and add it to the scope; check for re-declarations in the current scope.

A parameter called env conventionally refers to the current scope. That’s for instance the case in the analyse_declarations and analyse_expressions methods of the declaration analysis and expression analysis phases.

Symtab.InnerEntry is used when a name in a nested inner function references a name declared in an outer function (but not a global name). All InnerEntry objects hold a reference to the defining Entry, which shares with them a list holding all its inner entries. An Entry that has inner entries is marked as in_closure. All inner entries are marked as from_closure.

An InnerEntry is created when a name is looked up in a function’s scope and the entry already exists in an outer function’s scope; the inner entry is then added to the inner function’s scope.
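
In source code terms (an illustrative example):

    def outer():
        x = 1          # Entry for 'x' in outer's scope, marked in_closure
        def inner():
            return x   # InnerEntry for 'x' in inner's scope, marked from_closure
        return inner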

Notable Visitors

  • Visitor.PrintTree: just add PrintTree() anywhere in the list of pipeline visitors and the AST will be printed to standard output (see the sketch after this list).

  • ParseTreeTransforms.PostParse: perform basic post-parse rearrangements of the AST and check for syntax errors.
    Notably, create the DefNode implementing the lambda for each LambdaNode and store it in its def_node attribute (the only attribute of LambdaNode considered a child in the AST; until then, a LambdaNode is a leaf and its subexpressions are not visited).
    Also, introduce ParallelAssignmentNode to unpack parallel assignments into separate assignments in which all the rhs must be evaluated before assigning to their lhs.

  • ParseTreeTransforms.MarkClosureVisitor: mark functions or lambdas that contain inner functions, classes or lambdas as needs_closure.
    Also, replace every FuncDefNode that is a generator or coroutine with an appropriate GeneratorDefNode (or subclass).

  • ParseTreeTransforms.AnalyseDeclarationsTransform: declare (almost) all entries using the analyse_declarations framework; scopes and types are created when necessary.

  • FlowControl.ControlFlowAnalysis: launch a separate control flow analysis for each scope. Reaching definitions are computed, mainly to be used in later type inference.

  • ParseTreeTransforms.AnalyseExpressionsTransform: infer the type of all unspecified entries in module, function and scoped expression scopes; then launch the analyse_expressions phase on the corresponding module, function or scoped expression node. This second part performs some type checks, like raising errors when assigning Python objects in a nogil section, as well as inserting CoercionNode objects in the tree, usually to indicate that type coercion code will need to be generated.

  • ParseTreeTransforms.CreateClosureClasses: declare an entry in the module scope to represent the closure class of each function that declares entries referenced in a nested inner function (i.e. that have inner entries), and mark such function nodes as needs_closure (previous needs_closure flags are invalidated first); associate this entry with the outer function’s scope; mark inner functions that reference entries declared in an outer function (and will therefore need a reference to that outer scope) as needs_outer_scope.
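
As promised above, a quick way to see PrintTree in action without a full compilation run is the TreeFragment helper (hedged: this is an internal API and may change between versions):

    from Cython.Compiler.TreeFragment import TreeFragment
    from Cython.Compiler.Visitor import PrintTree

    root = TreeFragment(u"a = b + 1").root  # parse a snippet into an AST
    PrintTree()(root)                       # a pipeline phase is just a callable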

Control Flow Analysis

FlowControl.py implements the control flow analysis carried out by ControlFlowAnalysis. This is what it does:

  1. Build a Control Flow Graph (CFG). Look it up!
  2. Run a reaching definitions algorithm. Look it up!
  3. Store all the reaching definitions relevant to each use of a name in the Cython code in the cf_state attribute of the corresponding NameNode.
  4. Store all assignments and all references to a name respectively in the cf_assignments and cf_references attributes of the corresponding Entry.

You can look it up here for instance: https://www.seas.harvard.edu/courses/cs252/2011sp/slides/Lec02-Dataflow.pdf.

This step is also used to detect variables being “referenced before assignment” and warn about unused results or arguments.
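
This is the kind of situation the analysis detects (an illustrative snippet):

    def f(flag):
        if flag:
            x = 1
        return x   # 'x' may be used uninitialised: flagged by the analysis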

Type Inference

The goal of type inference is to resolve the type of all entries in a scope that have an unspecified_type to an actual type.

In a module scope, it’s really easy: all unspecified_type entries become py_object_type. See TypeInference.PyObjectTypeInferer.

In other scopes, it’s a bit more sophisticated (see TypeInference.SimpleAssignmentTypeInferer):

Use Entry.cf_assignments (thanks to the control flow analysis) to see all the assignments to an unspecified_type entry in the scope, then decide the type of an assignment based on its rhs. The type of the entry will then be a spanning type of all the assignments: a lowest common denominator type. This can always be py_object_type, which plays the role of the universal fallback type.
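
A small example of what the spanning type means in practice (hedged; the exact inferred C type depends on the compiler version and directives):

    def f():
        a = 1      # rhs has a C integer type
        a = 2.0    # rhs has a C floating point type
        # The spanning type of the two assignments is a C double, so 'a'
        # becomes a double; with no common C type, 'a' would fall back
        # to py_object_type.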

If the rhs of an assignment depends on names whose type is still unresolved, use the cf_state of the corresponding NameNode to see the reaching definitions: this gives you a list of assignments to this name that have to be resolved first.

These NameNodes of unspecified_type on which the rhs of an assignment depends are collected using the type_dependencies method framework: it starts in FlowControl.NameAssignment, branches out through the rhs expression nodes, and ends up in each NameNode.type_dependencies, which returns a tuple containing the node itself if its type is unspecified, and an empty tuple otherwise.

Code Generation

Code generation relies on a framework implemented in Code.py.

Code.CCodeWriter is the object to which code is written; code-generating methods usually take such an object as a code argument.

Code.CCodeWriter has the useful feature of insertion points: a new CCodeWriter object linked to the first is spawned, so that everything written to the new one will come after all that was already written to the original when it was spawned, but before all that is written to the original afterwards.

This allows a CCodeWriter insertion point to be created for each section of the generated code and then simply handed to the appropriate methods.

At the end, the contents of the original CCodeWriter, insertion points and all, are copied in order to an actual file, and there you have it.
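
In sketch form (method names from Code.py, usage simplified):

    code = CCodeWriter()
    code.putln("/* first */")
    point = code.insertion_point()   # spawn a linked writer at this spot
    code.putln("/* third */")
    point.putln("/* second: written later, but appears before 'third' */")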

Code generation is mainly driven from methods in ModuleNode.py; these deal with calling into the code generating framework of the module’s component nodes and entries.

The whole process is launched by calling ModuleNode.process_implementation: that is the callable that is returned as a core pipeline phase.

You can get an idea of what the code generation framework is like on the Node side by looking at the descriptive comments in Nodes.StatNode and ExprNodes.ExprNode.

When the --annotate option is active, the Code.CCodeWriter subclass Annotate.AnnotationCCodeWriter implemented in Annotate.py is simply used instead, generating HTML annotations alongside the code. From the point of view of code-generating methods, nothing changes.