Short guide to Cython source code
This document aims to provide a quick, concise overview of the core source code of the Cython compiler that still hits all the essentials.
Source files of interest
cython/Cython/Compiler/
    Main.py
    Pipeline.py

    Lexicon.py
    Scanning.py
    Parsing.py

    Nodes.py
    ExprNodes.py

    Visitor.py

    PyrexTypes.py
    Symtab.py

    ParseTreeTransforms.py

    FlowControl.py
    TypeInference.py

    ModuleNode.py
    Code.py
    Annotate.py
Core Pipeline
All roads lead to Main.py: all invocations of the Cython compiler go through Main.py. More precisely, they all call Main.run_pipeline. From there we go straight into Pipeline.py to set up and run the compiler pipeline.

Pipeline.py is the place from which the whole compilation process is directed: a pipeline of successive transformations is applied to turn the Cython source code into C code. A pipeline phase is a callable object that takes some data and returns transformed data. All pipeline phases are set up by various factory functions in Pipeline.py.

Pipeline.run_pipeline is the engine that chains these successive data transformations together: you put Cython code into the pipeline and generated C code comes out.
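The phase-as-callable idea can be sketched in a few lines. This is a toy illustration, not Cython's actual code; the two phases below are invented for the example:

```python
# Minimal sketch of the chaining idea behind Pipeline.run_pipeline.
# The real function also handles errors and timing; these toy phases
# are invented, not Cython's actual phase list.

def run_pipeline(pipeline, source):
    """Feed `source` through each phase in turn; each phase is a
    callable that takes the current data and returns transformed data."""
    data = source
    for phase in pipeline:
        data = phase(data)
    return data

# Two toy "phases": callables from data to data.
tokenize = lambda text: text.split()
count = lambda tokens: len(tokens)

result = run_pipeline([tokenize, count], "cdef int x = 0")  # -> 5
```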
Parsing
The first pipeline phase parses the Cython source code to produce an Abstract Syntax Tree (AST). The entry point of parsing, and therefore the pipeline phase callable, is Parsing.p_module.

To understand the parsing process, you have two approaches:

- The first is to go to Parsing.p_module, which leads you straight to Parsing.p_statement_list and then Parsing.p_statement, where the interesting things happen. That's a top-down kind of approach.
- If that doesn't work for you, simply starting at the top of Parsing.py and reading your way down actually yields a fairly coherent order, in a bottom-up kind of way (bottom-up relative to the function calls, not the file).

As a side note, if you're more used to a formal grammar in BNF notation, there is a not-quite-finished, out-of-date, and apparently forgotten grammar for the Cython language under Cython/Parser/Grammar.

Before you do any of that though, you should take a look at Scanning.py: all parsing functions in Parsing.py take a Scanning.PyrexScanner object s as an argument. It emits a stream of tokens from the Cython source code as a handy abstraction for parsing. The rules for token creation are defined in Lexicon.py.

If you want to know more about how those tokens are produced, the Plex documentation at https://pythonhosted.org/plex/ is a good starting point. You can then look into cython/Cython/Plex if you need more.

To understand Parsing.py it can be useful to refer to https://docs.python.org/3/reference/, in particular the lexical analysis section, as well as the grammar-explaining sections starting with expressions. Note that the latest Python language features aren't always supported in Cython yet.
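The general style of Parsing.py can be illustrated with a toy recursive-descent sketch. The Scanner class and token values here are invented stand-ins, not the real Scanning.PyrexScanner API:

```python
# Illustrative sketch of the parsing style used in Parsing.py: every
# p_* function takes a scanner `s` and consumes tokens from it. This
# Scanner is a stand-in, not the real Scanning.PyrexScanner.

class Scanner:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0
        self.next()

    def next(self):
        # Advance; `sy` always holds the current token.
        self.sy = self.tokens[self.pos] if self.pos < len(self.tokens) else 'EOF'
        self.pos += 1

def p_statement(s):
    # Dispatch on the current token, as Parsing.p_statement does.
    if s.sy == 'pass':
        s.next()
        return ('PassStatNode',)
    raise SyntaxError("unexpected token: %r" % s.sy)

def p_statement_list(s):
    # Collect statements until end of input.
    stats = []
    while s.sy != 'EOF':
        stats.append(p_statement(s))
    return ('StatListNode', stats)

tree = p_statement_list(Scanner(['pass', 'pass']))
```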
The Abstract Syntax Tree
Parsing.p_module returns the AST in the form of its root node: a ModuleNode.ModuleNode representing a whole module.

A parse tree is made of nodes. The base class for nodes is Nodes.Node. The tree structure stems from the fact that each subclass defines a child_attrs class attribute that lists the names of the attributes holding child nodes. Each such attribute can contain either a single node or a list of nodes. When it is a list of nodes, the list itself is not considered part of the tree, i.e. the items in the list are direct children of the parent node.

Many expression nodes in ExprNodes.py use the subexprs class attribute in the same role as child_attrs. To make child_attrs work anyway, ExprNodes.ExprNode defines child_attrs as a read-only property that returns the instance's subexprs.

I suggest you take the time to look at a few nodes and just figure out what they're for. This will inevitably begin to give you an idea of the tree structure. Most node classes have helpful descriptions in comments, and their child_attrs attribute tells you what they call their children in the tree. You can also look in Parsing.py for where each node is instantiated. Note, however, that many nodes are only produced in later compilation stages.
Getting started with the AST
To get you started, here are a few omnipresent expression nodes:
| Node | children attributes | other relevant attributes | matches |
|------|---------------------|---------------------------|---------|
| NameNode | - | name: string | references to name |
| AttributeNode | obj | attribute: string | obj.attribute |
| IndexNode | base, index | - | base[index] |
| SimpleCallNode | function, args, … | - | function(args) |
| UnopNode | operand | operator: string | operator operand |
| e.g. AmpersandNode | operand | operator = '&' | &operand |
| BinopNode | operand1, operand2 | operator: string | operand1 operator operand2 |
As well as a few statement nodes:
| Node | children attributes | other relevant attributes | matches |
|------|---------------------|---------------------------|---------|
| PassStatNode | - | - | pass |
| SingleAssignmentNode | lhs, rhs | - | lhs = rhs |
| StatListNode | stats | - | - |
| IfStatNode | if_clauses, else_clause | - | - |
Past this point making a nice-looking table becomes more complicated, but don't let that stop you from taking a look at CStructOrUnionDefNode, CFuncDefNode(FuncDefNode), DefNode(FuncDefNode), …

One other thing: CBaseTypeNode and CDeclaratorNode and their subclasses are used to represent the two main components of any C-style declaration:

- a type specifier
- a declarator

As such, you will find them as children of:

- CVarDefNode: C variable declaration, or C function declaration without definition (forward or extern)
- CArgDeclNode: C function argument declaration
- CFuncDefNode: C function definition
- CTypeDefNode: C-style typedef
Declarators especially can be puzzling, as they can be composed together to form ever more complex declarations:

| Node | children attributes | other attributes | example |
|------|---------------------|------------------|---------|
| CNameDeclaratorNode | - | name: string | 'a' in int a |
| CPtrDeclaratorNode | base | - | *base |
| CReferenceDeclaratorNode | base | - | &base |
| CArrayDeclaratorNode | base, dimension | - | base[dimension] |
| CFuncDeclaratorNode | base, args, exception_value | - | base(args) except exception_value |
| CConstDeclaratorNode | base | - | only for const pointers: * const base |

The parser distinguishes between declarators and similar-looking expressions by context (e.g. the cdef keyword).
If you need to know more about C-style declarations, look them up. You can start here: http://c-faq.com/decl/cdecl1.html.
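To see how declarators compose, here is an invented miniature of the idea, not the real Nodes.py code: as in C, the declarator tree is unwound against the base type from the outside in, so for cdef int *a[3] the name a ends up typed as an array of 3 pointers to int.

```python
# Sketch of declarator composition with stand-in classes. Each declarator
# wraps the base type it receives and passes the result inward; the
# innermost name declarator receives the fully built type.

class CNameDeclarator:
    def __init__(self, name):
        self.name = name
    def analyse(self, base_type):
        # Innermost declarator: the name gets the final type.
        return self.name, base_type

class CPtrDeclarator:
    def __init__(self, base):
        self.base = base
    def analyse(self, base_type):
        return self.base.analyse('pointer to ' + base_type)

class CArrayDeclarator:
    def __init__(self, base, dimension):
        self.base, self.dimension = base, dimension
    def analyse(self, base_type):
        return self.base.analyse('array of %d ' % self.dimension + base_type)

# The declarator in `cdef int *a[3]`: a pointer declarator wrapping an
# array declarator wrapping the name.
decl = CPtrDeclarator(CArrayDeclarator(CNameDeclarator('a'), 3))
name, ctype = decl.analyse('int')  # 'a' is an array of 3 pointers to int
```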
The Visitor Framework
Compilers commonly use a visitor pattern to separate the implementation of operations on the nodes of a tree from the implementation of the nodes themselves. This allows you to implement new operations without changing the node classes. If you've never heard of the visitor pattern, look it up!

Visitor.py implements a visitor framework using introspection, string concatenation and the Method Resolution Order (MRO), in a similar way to the description in http://peter-hoffmann.com/2010/extrinsic-visitor-pattern-python-inheritance.html, with added caching for performance.

This is implemented in the base class Visitor.TreeVisitor:

- To implement a visiting method for nodes that are instances of class SomeNode (subclasses included), name the method visit_SomeNode.
- Calling visit(node) dispatches the call to the closest matching visiting method according to the MRO.
- Calling visitchildren(node) visits the node's children according to node.child_attrs.

Visitor.VisitorTransform is a subclass that provides an additional feature: a node can be replaced by the return value of the visiting method that visited it, and even removed from the tree by returning None. When the visited node is held by its parent not as a single node but as one item in a list of nodes, more than one node can be inserted in its place simply by returning a list of nodes. This doesn't work if the node is directly referenced by an attribute of its parent: the attribute has to be a list containing (among others) the visited node (and of course it's impossible for the root node).
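A minimal sketch of the MRO-based dispatch, with the caching and child traversal omitted; the classes below are stand-ins, not the real Visitor.TreeVisitor:

```python
# Sketch of MRO-based visitor dispatch: visit() looks for a
# visit_<ClassName> method for the node's class, then for each base
# class in MRO order. Stand-in classes for illustration only.

class Node: pass
class ExprNode(Node): pass
class NameNode(ExprNode): pass

class TreeVisitor:
    def visit(self, node):
        # Walk the node's MRO until a matching visit_* method is found.
        for cls in type(node).__mro__:
            handler = getattr(self, 'visit_' + cls.__name__, None)
            if handler is not None:
                return handler(node)
        raise RuntimeError('no visit method for %s' % type(node).__name__)

class MyVisitor(TreeVisitor):
    def visit_ExprNode(self, node):
        return 'expr'
    def visit_Node(self, node):
        return 'node'

v = MyVisitor()
a = v.visit(NameNode())  # no visit_NameNode: falls back to visit_ExprNode
b = v.visit(Node())      # matches visit_Node directly
```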
Types
PyrexTypes.py implements the classes that represent types. The common base class is PyrexTypes.BaseType, but actual Cython types inherit from PyrexTypes.PyrexType. Inheriting directly from PyrexTypes.BaseType is meant for pseudo types.

Primitive types are pre-instantiated and held in aptly-named global variables towards the end of PyrexTypes.py. These also include error_type, unspecified_type, and py_object_type.

User-defined types like C functions and extension classes are instantiated on the fly as each function or class is declared. Python types like Python functions simply use py_object_type.

Types provide methods to examine the relationship between two types, as well as methods used to generate C code related to the type.
Scopes
Symtab.py implements the classes dealing with scopes. The base class for scopes is Symtab.Scope.

These are the main nodes that lead to a scope being created and associated with them:

| Node | Scope | When? |
|------|-------|-------|
| ModuleNode | ModuleScope | parsing |
| FuncDefNode | LocalScope or ClosureScope | declaration analysis |
| PyClassDefNode | PyClassScope | declaration analysis |
| CClassDefNode | CClassScope | declaration analysis |
| PropertyNode | PropertyScope | declaration analysis |
| CStructOrUnionDefNode | StructOrUnionScope | declaration analysis |
| CppClassNode | CppClassScope | declaration analysis |
| ComprehensionNode | GeneratorExpressionScope | expression analysis |

Scopes are essentially there to map names to Symtab.Entry objects: each time a function, a class, a variable, etc. is declared, a new Entry is created and mapped to its name in the Cython code. Later references to that name can then look up the matching entry.
Symtab.Entry objects automatically hold a reference to the scope to which they belong and the type of the corresponding Cython name.

Scopes hold a reference to their outer_scope and have methods to look for existing entries, including in enclosing scopes:

- lookup_here: look for an entry in this scope.
- lookup: look for an entry in this scope or an enclosing one.

Scopes also have methods to declare new entries:

- declare: create a new entry and add it to the scope, checking for re-declarations in the current scope.

A parameter called env conventionally refers to the current scope. That is, for instance, the case in the analyse_declarations and analyse_expressions methods of the declaration analysis and expression analysis phases.
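The lookup chain can be sketched as follows; these are simplified stand-ins for Symtab.Scope and Symtab.Entry, not their real implementations:

```python
# Sketch of the scope chain: lookup_here checks only the current scope's
# entries, lookup walks the outer_scope links. Stand-in classes only.

class Entry:
    def __init__(self, name, scope):
        self.name, self.scope = name, scope

class Scope:
    def __init__(self, outer_scope=None):
        self.outer_scope = outer_scope
        self.entries = {}

    def declare(self, name):
        # Create a new entry, rejecting re-declarations in this scope.
        if name in self.entries:
            raise NameError('%r already declared' % name)
        entry = Entry(name, self)
        self.entries[name] = entry
        return entry

    def lookup_here(self, name):
        return self.entries.get(name)

    def lookup(self, name):
        entry = self.lookup_here(name)
        if entry is None and self.outer_scope is not None:
            return self.outer_scope.lookup(name)
        return entry

module = Scope()
module.declare('x')
local = Scope(outer_scope=module)
found = local.lookup('x')         # found in the enclosing module scope
missing = local.lookup_here('x')  # not declared in the local scope itself
```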
Symtab.InnerEntry is used when a name in a nested inner function references a name declared in an outer function (but not a global name). All InnerEntry objects hold a reference to the defining Entry, which shares with them a list holding all its inner entries. An Entry that has inner entries is marked as in_closure; all inner entries are marked as from_closure.

An InnerEntry is created when a name is looked up in a function's scope and the entry already exists in an outer function's scope; the inner entry is then added to the inner function's scope.
Notable Visitors
- Visitor.PrintTree: just add PrintTree() anywhere in the list of pipeline phases and the AST will be printed to standard output.
- ParseTreeTransforms.PostParse: performs basic post-parse rearrangements of the AST and checks for syntax errors. Notably, it creates the DefNode implementing the lambda for each LambdaNode and sets it as their def_node attribute (which is the only attribute of LambdaNode considered a child in the AST, so before that a LambdaNode is a leaf in the AST and its subexpressions are not visited). It also introduces ParallelAssignmentNode to unpack parallel assignments into separate assignments where all the rhs must be evaluated before assigning to their lhs.
- ParseTreeTransforms.MarkClosureVisitor: marks functions or lambdas that contain inner functions, classes or lambdas as needs_closure. It also replaces every FuncDefNode that is a generator or coroutine with an appropriate GeneratorDefNode (or subclass).
- ParseTreeTransforms.AnalyseDeclarationsTransform: declares (almost) all entries using the analyse_declarations framework; scopes and types are created as necessary.
- FlowControl.ControlFlowAnalysis: launches a separate control flow analysis for each scope. Reaching definitions are computed, mainly to be used in later type inference.
- ParseTreeTransforms.AnalyseExpressionsTransform: infers the type of all unspecified entries in module, function and scoped-expression scopes, then launches the analyse_expressions phase on the corresponding module, function or scoped-expression node. This second part performs some type checks, like raising errors when assigning Python objects in a nogil section, and inserts CoercionNode objects in the tree, usually to indicate that type coercion code will need to be generated.
- ParseTreeTransforms.CreateClosureClasses: declares entries in the module scope to represent closure classes for functions that declare entries referenced in a nested inner function (i.e. that have inner entries), and marks such function nodes as needs_closure (previous needs_closure flags are invalidated first); associates this entry with the outer function's scope; marks inner functions that reference entries declared in an outer function (and will therefore need a reference to that outer scope) as needs_outer_scope.
Control Flow Analysis
FlowControl.py implements the control flow analysis carried out by ControlFlowAnalysis. This is what it does:

- Build a Control Flow Graph (CFG). Look it up!
- Run a reaching-definitions algorithm. Look it up!
- Store all the reaching definitions relevant to each use of a name in the Cython code in the cf_state attribute of the corresponding NameNode.
- Store all assignments and all references to a name in the cf_assignments and cf_references attributes, respectively, of the corresponding Entry.

You can look it up here, for instance: https://www.seas.harvard.edu/courses/cs252/2011sp/slides/Lec02-Dataflow.pdf.

This phase is also used to detect variables being "referenced before assignment" and to warn about unused results or arguments.
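Here is a textbook-style reaching-definitions sketch on a hand-built three-block CFG; it is far simplified from what FlowControl.py actually does:

```python
# Tiny reaching-definitions sketch. Each block kills incoming definitions
# of the names it assigns and generates its own; the analysis iterates to
# a fixed point. The CFG below corresponds to:
#   x = 1          (block A)
#   if cond:
#       x = 2      (block B)
#   use(x)         (block C)  <- both definitions of x can reach here

blocks = {'A': [], 'B': ['A'], 'C': ['A', 'B']}  # block -> predecessors
defs = {'A': {('x', 1)}, 'B': {('x', 2)}, 'C': set()}

def reaching(blocks, defs):
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, preds in blocks.items():
            reach_in = set().union(*(out[p] for p in preds)) if preds else set()
            # A definition in this block kills same-name definitions flowing in.
            killed = {name for name, _ in defs[b]}
            new_out = {d for d in reach_in if d[0] not in killed} | defs[b]
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return out

out = reaching(blocks, defs)
reaching_c = out['C']  # both assignments to x reach the use in block C
```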
Type Inference
The goal of type inference is to resolve the type of all entries in a scope that have an unspecified_type to an actual type.

In a module scope, it's really easy: all unspecified_type entries become py_object_type. See TypeInference.PyObjectTypeInferer.

In other scopes, it's a bit more sophisticated (see TypeInference.SimpleAssignmentTypeInferer): use Entry.cf_assignments (thanks to the control flow analysis) to see all the assignments to an unspecified_type entry in the scope, then decide the type of each assignment based on its rhs. The type of the entry will then be a spanning type of all the assignments: a lowest-common-denominator type. This can always be py_object_type, which plays the role of the universal fallback type.

If the rhs of an assignment depends on names whose type is still unresolved, use the cf_state of the corresponding NameNode to see the reaching definitions: this gives you a list of assignments to this name that have to be resolved first.

The NameNodes of unspecified_type on which the rhs of an assignment depends are collected using the type_dependencies method framework: it starts in FlowControl.NameAssignment, branches out through the rhs expression nodes, and ends up in each NameNode.type_dependencies, which returns a tuple containing the node itself if its type is unspecified, and an empty tuple otherwise.
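The spanning-type idea can be sketched as follows; the span rules below are invented for illustration, and the real rules in PyrexTypes.py are considerably richer:

```python
# Sketch of the spanning-type idea in SimpleAssignmentTypeInferer: the
# inferred type of an entry spans the types of all its assignments, with
# a generic Python-object type as the universal fallback. The type names
# and span rules here are invented for illustration.

def spanning_type(t1, t2):
    if t1 == t2:
        return t1
    if {t1, t2} == {'int', 'double'}:
        return 'double'  # a double can hold either assignment
    return 'object'      # fall back to a generic Python object

def infer(assignment_types):
    # Fold all assignment types into one spanning type.
    result = assignment_types[0]
    for t in assignment_types[1:]:
        result = spanning_type(result, t)
    return result

a = infer(['int', 'int'])     # stays 'int'
b = infer(['int', 'double'])  # widened to 'double'
c = infer(['int', 'str'])     # no common C type: falls back to 'object'
```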
Code Generation
Code generation relies on a framework implemented in Code.py. Code.CCodeWriter is the object to which code is written; code-generating methods usually take such an object as a code argument.

Code.CCodeWriter has the useful feature of insertion points: a new CCodeWriter object linked to the first is spawned, such that everything written to the new one will come after everything already written to the original when it was spawned, but before everything written to the original afterwards. This allows an insertion point to be created for each code section and then simply dispatched to the appropriate methods. At the end, the contents of the original CCodeWriter, insertion points and all, are copied in order to an actual file, and there you have it.
Code generation is mainly driven by methods in ModuleNode.py; these deal with calling into the code-generating framework of the module's component nodes and entries. The whole process is launched by calling ModuleNode.process_implementation: that is the callable returned as a core pipeline phase.

You can get an idea of what the code generation framework is like on the Node side by looking at the descriptive comments in Nodes.StatNode and ExprNodes.ExprNode.

When the --annotate option is active, the Code.CCodeWriter subclass Annotate.AnnotationCCodeWriter implemented in Annotate.py is simply used instead, to generate HTML annotations alongside the code. From the point of view of code-generating methods, nothing changes.