Abstract Syntax Tree for Patching Code and Assessing Code Quality

Abdul Qadir

Why should you care?

How do we patch hundreds of thousands of lines of source code easily and at scale? Read about how we used a simple yet powerful data structure, the Abstract Syntax Tree (AST), to create a system that, from a single central point, maps source code dependencies and in turn patches them all.

Abstract

A software system is usually built with assumptions around how dependencies such as the underlying language system, frameworks, libraries etc. are written. Changes in these dependencies may have a ripple effect into the software system itself. For example, recently, the famous Python package pandas released its 1.0.0 version, which has deprecated and changed several functionalities that existed in its previous 0.25.x version. An organization may have many systems using 0.25.x version of pandas. Hence, upgrading it to 1.0.0 will require developers of every system to go through the pandas change documentation and patch their code accordingly.


Since we developers love to automate tedious tasks, it is natural to think of writing a patch script that updates the source code of all the systems according to the changes in the new pandas version. Such a script could scan the source code and do some kind of find+replace, but it would likely be unreliable and not comprehensive. For example, say the patch script needs to rename a function from get to create wherever it is called in the code base. A simple find+replace will replace the word “get” even where it is not a function call. Likewise, find+replace cannot handle statements that spill over multiple lines. We need the patch script to parse the source code with an understanding of the language constructs. In this article, we propose the use of Abstract Syntax Trees (ASTs) to write such patch scripts. Later, we show how ASTs can also be used to assess code quality.
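To make the find+replace problem concrete, here is a minimal sketch. The one-line source string is a made-up example, not code from any real system.

```python
# Naive textual patch: rename the function `get` to `create`.
# The source line below is a hypothetical example.
source = "target = get(budget)\n"

# str.replace has no notion of identifiers, so it also rewrites
# the "get" inside the unrelated names `target` and `budget`.
patched = source.replace("get", "create")
print(patched)
```

Only the call to get should have changed, but the variable names were corrupted as well. A word-boundary regular expression would avoid this particular failure, yet it would still touch strings, comments and attributes that happen to be named get.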

Abstract Syntax Tree (AST)

An Abstract Syntax Tree (AST) is a tree representation of source code (see the Wikipedia page on abstract syntax trees).

Almost every language has a way to generate AST from its code. We use Python to build several critical parts of our systems. Hence, this article uses Python to give examples and highlights, but the learnings from here can be applied to any other language.

Python has a package called ast to generate ASTs. Here is a small tutorial on it.

Code:
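A minimal sketch of this step, assuming the two-line program var = 1 followed by print(var) that the rest of this tutorial refers to. The variable name code is our own choice for illustration.

```python
import ast

# The two-line program used throughout this tutorial (assumed).
code = "var = 1\nprint(var)\n"

# ast.parse returns the root node of the tree.
head = ast.parse(code)
print(type(head))
```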

Output:

So, the head of the AST is a Module object, which makes sense. Let’s dig deeper into it. The ast package provides an ast.dump(node) function that returns a formatted view of the entire tree rooted at node. Let’s call it on the head object and see what we get.

Code:
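A minimal sketch, again assuming the same two-line program. The indent argument is only available from Python 3.9 onwards; on older versions, call ast.dump(head) without it.

```python
import ast

head = ast.parse("var = 1\nprint(var)\n")

# indent=4 pretty-prints the tree (Python 3.9+).
dumped = ast.dump(head, indent=4)
print(dumped)
```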

Output (prettified):

Looking at the ast.dump output, we can see that the head object, which is of type Module, has an attribute body whose value is a list of 2 nodes: one representing var = 1 and the other representing print(var). The first node, representing var = 1, has a targets attribute (a list, because one statement can assign to several targets) whose single element represents the LHS var, and a value attribute representing the RHS 1. Let’s see if we can print the RHS.


Code:
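A minimal sketch, assuming Python 3.8 or newer, where a literal such as 1 is an ast.Constant node whose payload lives in its value attribute.

```python
import ast

head = ast.parse("var = 1\nprint(var)\n")

# First statement of the module: the assignment `var = 1`.
assign = head.body[0]

# assign.value is the RHS node; its own .value holds the literal.
rhs = assign.value.value
print(rhs)
```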

Output:

So, it works as expected. Now let’s try to modify the RHS from value 1 to 2.

Code:
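A minimal sketch, assuming Python 3.8 or newer. Here we mutate the existing constant node in place; replacing assign.value with a new ast.Constant node would work as well.

```python
import ast

head = ast.parse("var = 1\nprint(var)\n")

# Change the literal on the RHS of `var = 1` from 1 to 2.
head.body[0].value.value = 2

dumped_after = ast.dump(head)
print(dumped_after)
```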

Output (prettified):

We can see that the value of the corresponding attribute has changed to 2. Next, we want to convert the AST back to code to get the modified source. To do that, we use a Python package called astunparse, because ast did not provide this functionality before Python 3.9 (which added ast.unparse).

Code:
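A minimal sketch of the round trip. To keep it free of third-party dependencies, it uses ast.unparse from the standard library, which assumes Python 3.9 or newer; with the astunparse package, the equivalent call is astunparse.unparse(head).

```python
import ast

head = ast.parse("var = 1\nprint(var)\n")
head.body[0].value.value = 2

# Turn the modified tree back into source code (Python 3.9+).
# With the astunparse package: astunparse.unparse(head)
new_code = ast.unparse(head)
print(new_code)
```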

Output:

So, the modified code has statement var = 2 instead of var = 1 as expected.

IntelliPatch

Now that we understand ASTs and how to generate them, inspect them, modify them and re-create code from them, let’s go back to the problem of writing patch scripts to modify the code of a system to use pandas 1.0.0 instead of pandas 0.25.x. We call these AST-based patch scripts “IntelliPatch”.


All the backward incompatibilities in pandas 1.0.0 are listed in the pandas 1.0.0 release notes. Let’s take the first backward incompatibility on the list and write an IntelliPatch for it.

Avoid using names from MultiIndex.levels

In pandas 1.0.0, the name of a MultiIndex level cannot be updated using the = operator; instead, it requires the use of Index.set_names().

Code using pandas 0.25.x:

Output:

The above code raises a RuntimeError with pandas 1.0.0. To work with pandas 1.0.0, it should be modified to the code below.

Equivalent code using pandas 1.0.0:

The IntelliPatch needs to do the following:

  1. Create AST of the given code and traverse it.
  2. Identify if any node represents the code of form <var>.levels[<idx>].name = <val> .
  3. Replace the identified node with the one that represents the code of form <var> = <var>.set_names(<val>, level=<idx>).

Below is the IntelliPatch script that does that.

intelli_patch.py
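A minimal sketch of such a script, built on ast.NodeTransformer. It is an illustration under assumptions rather than the production script: the names LevelNameRewriter and intelli_patch are our own for this example, it assumes Python 3.9 or newer (for ast.unparse and the un-wrapped subscript slice), and it only handles the case where <var> is a plain variable name.

```python
import ast


class LevelNameRewriter(ast.NodeTransformer):
    """Rewrite `<var>.levels[<idx>].name = <val>` into
    `<var> = <var>.set_names(<val>, level=<idx>)`."""

    def visit_Assign(self, node):
        self.generic_visit(node)
        if len(node.targets) != 1:
            return node
        target = node.targets[0]
        # Match the shape: Attribute(name) -> Subscript -> Attribute(levels)
        if not (
            isinstance(target, ast.Attribute)
            and target.attr == "name"
            and isinstance(target.value, ast.Subscript)
            and isinstance(target.value.value, ast.Attribute)
            and target.value.value.attr == "levels"
        ):
            return node
        var = target.value.value.value  # expression for <var>
        idx = target.value.slice        # expression for <idx> (Python 3.9+)
        # This sketch only handles a plain variable name as <var>.
        if not isinstance(var, ast.Name):
            return node
        new_node = ast.Assign(
            targets=[ast.Name(id=var.id, ctx=ast.Store())],
            value=ast.Call(
                func=ast.Attribute(
                    value=ast.Name(id=var.id, ctx=ast.Load()),
                    attr="set_names",
                    ctx=ast.Load(),
                ),
                args=[node.value],
                keywords=[ast.keyword(arg="level", value=idx)],
            ),
        )
        return ast.copy_location(new_node, node)


def intelli_patch(source: str) -> str:
    """Return `source` with the MultiIndex level-name assignments rewritten."""
    tree = LevelNameRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Hypothetical input; `mi` stands for any MultiIndex variable.
print(intelli_patch('mi.levels[0].name = "new name"\n'))
```

Because the transformer visits every Assign node in the tree, the same rewrite applies to statements nested inside functions and classes, and to statements spread over several lines.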

Usage Example 1:

Output:

Usage Example 2:

Output:

In usage example 2, note that the statement to be replaced spans more than one line and sits within a function g, which is nested within a function f, which is in turn nested within a class C. IntelliPatch handles this case as well.


One can extend the patch script to take care of all backward incompatibilities in pandas 1.0.0, and then write an outer function that goes through every Python file of a system, reads its code, patches it and writes it back to disk.
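The outer function could look roughly like the sketch below. The name patch_tree and its patch_source parameter are hypothetical; patch_source stands for any function that takes source code as a string and returns the patched string.

```python
from pathlib import Path
from typing import Callable, List


def patch_tree(root: str, patch_source: Callable[[str], str]) -> List[Path]:
    """Apply patch_source to every .py file under root.

    Returns the paths of the files whose content changed.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        original = path.read_text(encoding="utf-8")
        patched = patch_source(original)
        # Write back only when something changed, to keep diffs small.
        if patched != original:
            path.write_text(patched, encoding="utf-8")
            changed.append(path)
    return changed
```

Returning the list of changed files gives the developer a starting point for the review step described next.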

It is important to note that a developer should review the changes made by IntelliPatch before committing them. For example, if the code is hosted on git, the developer should run a git diff and review it.

Impact

At Soroco, we have written 5 IntelliPatch scripts so far that were run on 10 systems. Each script successfully parsed and patched about 150,000 lines of code across 10 systems. In terms of productivity, this effort took one of our engineers three full days to complete. This engineer learnt about ASTs before implementing these solutions.

Of the five scripts, one was unique: a code scrubber rather than a traditional patch. The need stemmed from an external party seeking to review the outline of the code, while we did not want to share the actual logic and specifics of the code. Hence, we wrote a scrubber that removes logic and other key elements from the code while retaining only the imports, class and function definitions, docstrings, type annotations and some very specific information required for the review. The AST thus proved to be a valuable tool for building a code scrubber as well.
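To give a flavour of the idea, here is a much-simplified sketch of a scrubber. It is not the scrubber described above: the names Scrubber and scrub are hypothetical, it assumes Python 3.9 or newer, and it only replaces function bodies, leaving module-level and class-level statements untouched.

```python
import ast


class Scrubber(ast.NodeTransformer):
    """Replace every function body with its docstring (if any) and `...`."""

    def _scrub(self, node):
        doc = ast.get_docstring(node, clean=False)
        body = []
        if doc is not None:
            # Keep the docstring as the first statement.
            body.append(ast.Expr(value=ast.Constant(value=doc)))
        # Stand-in for the removed logic.
        body.append(ast.Expr(value=ast.Constant(value=Ellipsis)))
        node.body = body
        return node

    visit_FunctionDef = _scrub
    visit_AsyncFunctionDef = _scrub


def scrub(source: str) -> str:
    """Return source with the logic inside functions removed."""
    tree = Scrubber().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Signatures, including type annotations and default values, live on the function node itself rather than in its body, so they survive the scrub.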

Limitations

One of the problems of patching code using the ast package of Python is that the round trip through the AST loses all the formatting and comments of the original source code. This can be solved by making the patch script a little smarter. Instead of having it unparse the entire patched AST and write that to disk, we can make it unparse only the nodes it modified and insert the resulting code at the corresponding line numbers in the file. The ast nodes have a lineno attribute that identifies the line at which the patched code should be inserted.
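A minimal sketch of this line-based approach, using the get to create rename from earlier as the patch. The function name rename_calls_in_place is hypothetical. The sketch assumes Python 3.9 or newer, at most one patched statement per line, ASCII text before the statement on its line (column offsets are byte offsets), and it only looks at expression and assignment statements.

```python
import ast


def rename_calls_in_place(source: str, old: str, new: str) -> str:
    """Rename calls to `old` into `new`, rewriting only the affected statements."""
    lines = source.splitlines(keepends=True)
    edits = []
    for stmt in ast.walk(ast.parse(source)):
        # Restrict the sketch to simple statements so no edit contains another.
        if not isinstance(stmt, (ast.Expr, ast.Assign)):
            continue
        calls = [
            node for node in ast.walk(stmt)
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == old
        ]
        if not calls:
            continue
        for call in calls:
            call.func.id = new
        # Positions come from the original parse; only the name was mutated.
        edits.append((stmt.lineno, stmt.end_lineno,
                      stmt.col_offset, stmt.end_col_offset,
                      ast.unparse(stmt)))
    # Apply from the bottom up so earlier line numbers stay valid.
    for start, end, col, end_col, text in sorted(edits, reverse=True):
        prefix = lines[start - 1][:col]    # indentation before the statement
        suffix = lines[end - 1][end_col:]  # trailing comment and newline
        lines[start - 1:end] = [prefix + text + suffix]
    return "".join(lines)
```

Lines that the patch does not touch, including comments and blank lines, are copied through unchanged. A patched statement that spanned several lines is collapsed onto one line, so any comments inside it are still lost.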
If you enjoy reading this article and want to work on similar problems, apply here and come work with us!

Code Quality Assessment

Now that we understand how useful ASTs are for writing intelligent patch scripts, this section explains how they can be used to assess code quality.

Many of the IDEs, linters and code inspectors, like PyCharm and SonarQube, use ASTs to perform code quality checks. We can use ASTs to create our own code quality checks specific to our needs. Below are a few examples:

Example 1: Non self-explanatory variable names

You want the developers in your organization to use good self-explanatory variable names in the code. The most frequent problem that you see in the code is the use of single character variable names like i, j, etc. Below is a script that can check that.

variable_name_check.py
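A minimal sketch of such a check. The function name find_short_names and the min_length parameter are our own for illustration, and the underscore is treated as an acceptable throwaway name, which is an assumption about the policy.

```python
import ast


def find_short_names(source: str, min_length: int = 2):
    """Return sorted (line, name) pairs for names shorter than min_length."""

    def too_short(name):
        # `_` is conventionally a throwaway name, so it is allowed here.
        return len(name) < min_length and name != "_"

    findings = []
    for node in ast.walk(ast.parse(source)):
        # Name nodes in Store context cover assignments, loop variables,
        # with-targets and comprehension variables.
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if too_short(node.id):
                findings.append((node.lineno, node.id))
        # Function parameters are ast.arg nodes, not ast.Name nodes.
        elif isinstance(node, ast.arg) and too_short(node.arg):
            findings.append((node.lineno, node.arg))
    return sorted(findings)


sample = "for i in range(3):\n    total = i\ndef f(x, count):\n    return x\n"
for line, name in find_short_names(sample):
    print(f"line {line}: variable name '{name}' is too short")
```

Checking the Store context means each variable is reported where it is bound, not at every place it is read.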

Usage:

Output:

Example 2: Un-logged except block of code

You want the developers in your organization to add logging whenever an exception is caught. You expect that either the error or the exception function of the logging module is called from every except block. Below is a script that checks that using the AST.

unlogged_except_check.py
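A minimal sketch of such a check. The function name find_unlogged_excepts is hypothetical. As a simplifying assumption, any attribute call named error or exception counts as logging (for example logging.error(...) or logger.exception(...)); the sketch does not verify that the receiver really is a logger.

```python
import ast

LOG_METHODS = {"error", "exception"}


def find_unlogged_excepts(source: str):
    """Return line numbers of except blocks that never call
    .error(...) or .exception(...)."""
    unlogged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Look for a matching call anywhere inside the handler body.
        logged = any(
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and inner.func.attr in LOG_METHODS
            for stmt in node.body
            for inner in ast.walk(stmt)
        )
        if not logged:
            unlogged.append(node.lineno)
    return sorted(unlogged)
```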

Usage:

Output:

This can be taken one step further where if an except code block is found without any logging, then the code quality checker can put the logging in the code by adding a corresponding node in the AST.
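A minimal sketch of that extension, assuming Python 3.9 or newer. The names AddLogging and add_missing_logging and the log message are hypothetical, and the sketch assumes the patched file already imports the logging module.

```python
import ast

LOG_METHODS = {"error", "exception"}


class AddLogging(ast.NodeTransformer):
    """Insert a logging.exception(...) call into except blocks that lack logging."""

    def visit_ExceptHandler(self, node):
        self.generic_visit(node)
        has_log = any(
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and inner.func.attr in LOG_METHODS
            for stmt in node.body
            for inner in ast.walk(stmt)
        )
        if not has_log:
            # Build the new statement by parsing a snippet of source code.
            log_stmt = ast.parse('logging.exception("Unhandled exception")').body[0]
            node.body.insert(0, log_stmt)
        return node


def add_missing_logging(source: str) -> str:
    """Return source with a logging call added to every unlogged except block."""
    tree = AddLogging().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Parsing a small snippet is often the easiest way to create new nodes, since it avoids constructing each node by hand.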

Conclusion

The usefulness of ASTs extends far beyond the discussion in this article. For example, the ASTs of the files in a given system can be used to create a call graph. A call graph created at run-time may not cover all the code paths. But a call graph created statically from ASTs covers all the code paths written in the source, whether or not they are executed, and is thus more comprehensive. The call graph can then be used to generate human-readable documentation of the system. We have built such a functionality at Soroco that we call “LiveDoc”, but that is a topic for another day in another article 🙂
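As a taste of the idea, here is a very small sketch of a static call graph for a single file. It is not LiveDoc: the function name build_call_graph is hypothetical, only direct calls to plain names are recorded (method calls and dynamic dispatch are ignored), and calls made inside a nested function are also attributed to its enclosing function.

```python
import ast
from collections import defaultdict


def build_call_graph(source: str):
    """Map each function name to the set of plain names it calls."""
    graph = defaultdict(set)
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        callees = graph[func.name]  # creates the entry even if nothing is called
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callees.add(node.func.id)
    return dict(graph)


sample_module = "def a():\n    b()\n    c()\ndef b():\n    c()\ndef c():\n    pass\n"
for caller, callees in sorted(build_call_graph(sample_module).items()):
    print(caller, "->", sorted(callees))
```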

