Cleaning up Ph.D. Python

As part of my job, I sometimes help graduate students and postdoctoral scholars to improve their code and their coding skills. The purpose of this post is to provide a high-level summary of what I try to teach them, and the advice that I give them, without delving into low-level, case-specific details.

Most of the people I help would be described as 'intermediate' or 'advanced' coders, but not professionals, and my goal is to guide them toward becoming more professional. This post therefore offers a few tips to help an intermediate programmer become more professional. A brief summary of what I consider to be 'intermediate' and 'advanced' is given below. I will emphasize the Python programming language, since the people I've encountered lately are primarily Python programmers.

Cleaning up. (Photo by Nicholas Pilch)

Stages in the life of a software developer

Based upon my observations, most‡ software engineers, application developers, and scientists progress through the following approximate stages, given sufficient effort and time:

Beginner: A person who is currently in the process of learning to write code for the first time.

Novice: A novice has learned most of the syntax for one language and has started to learn about the standard libraries of the language and some third-party libraries. They have typically learned about some common algorithms and data structures. They may be in the process of learning a second or third programming language. The initial versions of their code are usually filled with syntax errors and frequently yield unexpected behavior.

Intermediate: An intermediate developer has written many small programs and a few fairly complicated ones, spanning multiple files, and thousands of lines of code. When writing in a compiled language, they still frequently experience syntax errors, but far fewer than a novice. Depending upon the language, memory leaks and segmentation faults may be fairly common in early versions of their code. The intermediate programmer has become very familiar with the standard library of their favorite language and they are likely aware of some fairly advanced algorithms and less-frequently-used data structures.

Advanced: An advanced developer does not frequently experience syntax errors in their primary and secondary programming languages (they are most likely familiar with five or more languages). They no longer experience a huge number of runtime errors. They are familiar with many third-party libraries and may currently be in the process of writing a third-party library. They know how to write parallel and multi-threaded code. They also generally know how to take advantage of SIMD instructions in their code and they are familiar with enough details of their target computer architecture to take advantage of optimal memory-access patterns. The importance of writing readable, well-organized, maintainable code begins to grow and they begin to appreciate automated testing.

Professional: At the professional level, code design and implementation becomes more formal and collaborative. Once the algorithms have been determined, most of the difficulty is no longer related to making the code work or writing code that is fast or efficient, since these things have become fairly easy to do; the emphasis shifts to writing code that can be easily read and understood so that other developers can easily contribute and so that the code can be more easily maintained and developed in a collaborative manner. Security also often becomes a significant focus. The professional focuses on the organization and overall design of code, keeping in mind maintainability and the cost of making changes to the code in the future; the goal is to minimize this cost. Code documentation and various types of testing receive as much effort as the code implementation (oftentimes more).

Master / Expert / Guru: The master is extremely familiar with the entire software and hardware stack. They are an expert in one or more languages; they may have even developed their own language. They have worked on several large projects and contributed many thousands of lines of professionally-written code over the course of a decade or more. Rather than writing code, many masters tend to spend most of their time answering questions, reviewing code, and participating in discussions with professionals about future changes to a particular codebase. Some extreme, famous examples include Linus Torvalds, Guido van Rossum, Greg Kroah-Hartman, Ken Thompson, Dennis Ritchie, Bjarne Stroustrup, Pieter Hintjens, Richard Stallman, Donald Knuth, and John Carmack.

‡With the possible exception of people who jump right into programming at a very early age, computer engineers, electrical engineers, and the subset of computer science students who are trained, early-on, to write in assembly language, write operating system kernels, kernel modules, compilers, assemblers, and related code.

General Principles

The following is an outline of some important general principles. To get the full benefit of this, you should read the things that are referenced within and do many web searches in order to fill in the details.

Follow a coding style guide

Following a coding style guide helps your code to be more consistent in appearance and easier to read. Additionally, the style guide usually encourages good coding practices and discourages problematic ones. For Python, most people use PEP 8 or a slightly modified variant of it.

Well-written code doesn't need many comments

In most cases, the presence of comments in code indicates that the code needs to be refactored. General explanatory text belongs in documentation strings, not in the body of a function or method. If you feel the need to explain what a certain quantity is or what the next few lines of code are doing, then the quantity in question should probably be renamed, and the few lines of code in question should probably be put into a separate function or method with a name that explains what the code does and a documentation string explaining how and why.
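As a minimal sketch of this kind of refactoring, compare a commented block with a version whose names and docstrings carry the explanation. The temperature example and all function names here are hypothetical and chosen only for illustration.

```python
# Before: the comments explain what the names fail to say.
def process(d):
    # convert Fahrenheit readings to Celsius
    c = [(x - 32.0) * 5.0 / 9.0 for x in d]
    # average the converted readings
    return sum(c) / len(c)


# After: the names say what happens; the docstrings say how and why.
def fahrenheit_to_celsius(temperature_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temperature_f - 32.0) * 5.0 / 9.0


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def mean_temperature_celsius(readings_f):
    """Return the mean of Fahrenheit readings, expressed in degrees Celsius.

    The readings are converted before averaging so that callers only ever
    handle Celsius values.
    """
    return mean([fahrenheit_to_celsius(t) for t in readings_f])
```

Both versions compute the same number, but the second can be read, tested, and reused one small piece at a time.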

Avoid premature optimization and generalization

One should only optimize code that really needs to be made more efficient, unless the optimization does not negatively impact the clarity of the code. Highly-optimized code tends to be more difficult to read and more difficult to modify than straightforwardly-written code. Optimization should only be done near the end of the development cycle on the components of the code that have been identified as the slowest (or most memory-intensive). Optimizing code that does not need to be optimized is generally a waste of effort and it makes the code more difficult to understand.

Premature generalization is also usually a waste of effort. Only generalize when there is actually a tangible benefit to doing so. Making things more general and abstract is nice and it is oftentimes very beneficial, but try to use good judgment. It can be tempting to make the code design more 'elegant' by generalizing things that don't really need to be generalized, but it usually also ends up making the code more complicated than it needs to be.

Become familiar with common design patterns

Reading about common design patterns can be helpful. In particular, it's helpful to be familiar with object-oriented design patterns. The real value here is realized once you have tried to get into the minds of the authors of the "Gang of Four" book (Design Patterns: Elements of Reusable Object-Oriented Software). You will gradually become better at identifying places in which the design of the code will cause headaches in the future. The ultimate goal is to design flexible, adaptable code that is easy to modify in major ways; rigidly designed code with many strongly coupled components slows down the development process later on.
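To make the idea of loosely coupled components concrete, here is a small sketch of one classic pattern, Strategy, applied to numerical integration. The class and function names are hypothetical; the point is only that average_value does not need to change when a different quadrature rule is supplied.

```python
from abc import ABC, abstractmethod


class QuadratureRule(ABC):
    """Strategy interface: approximates the integral of f over [a, b]."""

    @abstractmethod
    def integrate(self, f, a, b, n):
        """Return an approximation using n sub-intervals."""


class MidpointRule(QuadratureRule):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        return h * sum(f(a + (i + 0.5) * h) for i in range(n))


class TrapezoidRule(QuadratureRule):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        interior = sum(f(a + i * h) for i in range(1, n))
        return h * (0.5 * f(a) + interior + 0.5 * f(b))


def average_value(f, a, b, rule, n=100):
    """Return the mean value of f on [a, b].

    The quadrature rule is passed in, so this function is not tied to
    any particular integration scheme.
    """
    return rule.integrate(f, a, b, n) / (b - a)


def double(x):
    return 2.0 * x


midpoint_result = average_value(double, 0.0, 1.0, MidpointRule())
trapezoid_result = average_value(double, 0.0, 1.0, TrapezoidRule())
```

Adding a new rule later means writing one new class; nothing that calls average_value has to be touched.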

Keep SOLID and the UNIX Philosophy in mind

Study the SOLID design principles and learn about the Unix philosophy. This is another way in which you can learn from the experience of software professionals who have come before you.

Test-driven development

Become familiar with test-driven development. Even if you do not formally practice it, it's a good methodology to know about. Unit tests are most valuable when used in a test-driven development design process. Some people go as far as saying that writing unit tests is almost pointless if you write them after the code is written, but I don't agree with that; unit tests are still very valuable during refactoring and they are a useful form of documentation.
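The following sketch shows what the result of one short test-first cycle might look like, using the standard-library unittest module. The normalize function is a hypothetical example; in a test-driven workflow the test class would be written first, would fail, and would then drive the implementation.

```python
import unittest


def normalize(values):
    """Scale a sequence of numbers so that the values sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


class NormalizeTest(unittest.TestCase):
    # In a test-first workflow these tests exist before normalize() does.
    # They fail at first and then drive the implementation above.

    def test_result_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3, 4])), 1.0)

    def test_preserves_ratios(self):
        self.assertEqual(normalize([1, 3]), [0.25, 0.75])

    def test_rejects_zero_total(self):
        with self.assertRaises(ValueError):
            normalize([0, 0])
```

If this were saved in a file, the tests could be run with python -m unittest followed by the file name. The same tests later act as a safety net during refactoring and as executable documentation of what normalize is supposed to do.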

Consistent, disciplined use of version control

Most developers nowadays are somewhat familiar with Git or another version control system. You should do some web searches and read many articles on the proper use of version control, which will teach you about when to fork, when to commit, what to write in a commit message, etc. After reading what other people recommend regarding version control, figure out what's best for you; decide upon a version control workflow and follow it consistently.

Helpful Tools

A Linter

Pylint is a linter, which is a type of static analysis tool. It can identify potential problems with the code, point out areas that should be refactored, and show style violations. Install using: pip install pylint (preferably inside a virtual environment rather than system-wide with sudo)

A Code Diagrammer

A code diagrammer, or automated UML generator, is a program that can analyze your code and create rough UML diagrams or UML-like diagrams that graphically represent the code. Viewing your code graphically is oftentimes very enlightening. PyReverse can be used to automatically create basic UML diagrams of any Python module that you are trying to understand. PyReverse comes as a sub-component of Pylint.

A Good IDE

Good IDEs (in any language) have the following capabilities, in addition to syntax highlighting, searching, and intelligent code completion:

They create a code model which can be used to refactor the project efficiently. Some IDEs (for other languages, like C++) can perform very complicated refactoring.
They integrate with a debugger so that breakpoints and watchpoints can be set and many variables can be viewed at once.
They allow you to view documentation strings for classes and methods by simply hovering over a name or pressing a special key combination.
They can identify syntax errors, style violations, and potential problems as you type.
They can automatically reformat (beautify) code to enforce a specified coding style.
They integrate with version control systems, like Git.

There are several Python IDEs available. Currently, most people prefer PyCharm.

A Debugger

Even if you are not using an IDE, you can still take advantage of a debugger. The standard Python debugger is pdb. The module ipdb is an improved version. If you use IPython, then you already have ipdb. To use it, simply use the %debug magic function or tell IPython to start the debugger when an exception is raised, using %pdb. I would suggest reading the documentation for pdb and ipdb to learn how to step through lines of code, step into functions, view the current values of variables, etc.
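Outside IPython, the standard-library pdb module can be used in much the same way. The sketch below uses hypothetical function names and opens a post-mortem debugging session at the frame where an exception was raised; it is an illustration of the idea, not a recommended error-handling style.

```python
import pdb


def mean(values):
    """Return the arithmetic mean; fails on an empty sequence."""
    return sum(values) / len(values)


def mean_with_post_mortem(values):
    """Call mean() and open the debugger at the failing frame on error."""
    try:
        return mean(values)
    except ZeroDivisionError:
        # Opens an interactive (Pdb) prompt in the frame that raised.
        # Useful commands: p values, where, up, down, q.
        pdb.post_mortem()
        raise


# Normal calls never touch the debugger.
average = mean_with_post_mortem([1, 2, 3])
```

Calling mean_with_post_mortem([]) in a terminal would drop into the debugger inside mean, where len(values) can be inspected. To pause at a chosen line instead of at a failure, the built-in breakpoint() function (Python 3.7 and later) can be placed at that line.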

A Profiler

Profiling tools can be used to identify how many times each function is called during code execution and how much time was spent in each function. This allows you to find areas of the code that are 'hot' and would benefit from optimization. IPython makes performance profiling easy with the %prun magic function. The memory_profiler module makes memory profiling fairly easy.
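For scripts that run outside IPython, a similar report can be produced with the standard-library cProfile and pstats modules. The following is a minimal sketch; slow_sum_of_squares and profile_report are hypothetical names used only for this example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop, used only as something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_report(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_report(slow_sum_of_squares, 100_000)
print(report)
```

The report lists call counts and the time spent in each function, which is the information needed to decide whether a given function is worth optimizing at all.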

A Beautifier

There are a few standalone beautifiers that can automatically reformat Python code. One is yapf. To install: pip install yapf (preferably inside a virtual environment rather than system-wide with sudo)

An Automatic Documentation Generator

Once you have written appropriately formatted docstrings to document your code, a documentation generator can be used to parse the strings and build HTML (or other formats of) documentation for the code internals and API. By reading through the documentation, you can judge the quality of your docstrings. Overall, the best example of this sort of tool is probably Doxygen, but it is best for C and C++. The best solution for Python appears to be Sphinx, with the autodoc extension. Note that the Napoleon extension allows you some freedom regarding the style of your documentation strings: it lets you use NumPy-style or Google-style docstrings instead of reStructuredText docstrings.
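As an illustration of what such a tool consumes, here is a hypothetical function documented with a NumPy-style docstring. The function itself is made up for this example; only the docstring layout matters.

```python
def kinetic_energy(mass, speed):
    """Compute the classical kinetic energy of a point mass.

    Parameters
    ----------
    mass : float
        Mass of the object, in kilograms.
    speed : float
        Speed of the object, in meters per second.

    Returns
    -------
    float
        Kinetic energy, in joules.

    Raises
    ------
    ValueError
        If `mass` is negative.
    """
    if mass < 0:
        raise ValueError("mass must be non-negative")
    return 0.5 * mass * speed ** 2
```

In a Sphinx project, listing "sphinx.ext.autodoc" and "sphinx.ext.napoleon" in the extensions setting of conf.py is what enables docstrings in this style to be pulled into the generated pages.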

A Continuous Integration Framework

If you are working with other developers (and perhaps even if you aren't), using continuous integration can be helpful. Continuous integration systems can automatically run all of your unit tests, integration tests, and fuzz tests, build your documentation, and package your code for deployment. They can also facilitate the sort of static analysis offered by SonarQube (see below).

A Continuous Inspection Tool

There are several tools that will continually inspect Python code for quality. A popular one is SonarQube, which performs tasks similar to Pylint (it may even use Pylint internally). The problem with SonarQube is that it is not particularly easy to set up; it is more appropriate to run this on a server when working on a large project with many contributors. It can be automatically triggered by a continuous integration system, like Jenkins.

Nathaniel R. Stickley

Software Engineer - Astrophysicist - Data Scientist