Nathaniel R. Stickley

Software Engineer - Astrophysicist - Data Scientist

Lazy Evaluation in Python

As part of a project at work, I had to write some code to load a collection of HDF5 files containing thousands of small cutout images (2D arrays of floating-point values). The application that uses these files (hereafter, the "consumer") will not need all of the cutouts contained within the collection of input files. Ideally, I don't want the consumer to waste time and memory by loading the entire contents of all of the files. I also want the code to be clean and maintainable; if the format of the input files changes, I do not want to change the code within the consumer.

To manage the latter concern, I began implementing a set of classes for reading and abstracting the input files, and I placed these classes into the sub-project containing the program that produces the output files (the "producer"). The producer is primarily maintained by another developer. However, by adding my file-reading classes to the same project and then adding integration tests, I can guarantee that whenever a change to the producer breaks the file reader, the tests will fail and effectively set off an alarm until the file-reading classes have been updated. The code within the consumer does not need to change; it only needs to import the latest version of the file reader.
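As a rough illustration of that kind of integration test, here is a minimal round-trip sketch. Everything in it is hypothetical: `produce`, `read_cutouts`, and `check_round_trip` are names made up for this example, and JSON stands in for HDF5 so that the snippet runs with only the standard library. The point is just that the producer and the reader are exercised together, so a change to the on-disk layout that the reader does not understand makes the check fail.

```python
import json
import os
import tempfile


def produce(path, cutouts):
    # Hypothetical producer: writes cutouts using the current on-disk layout.
    # (JSON is a stand-in for HDF5 in this sketch.)
    with open(path, "w") as f:
        json.dump({"cutouts": cutouts}, f)


def read_cutouts(path):
    # Hypothetical reader: the only code that knows the on-disk layout.
    with open(path) as f:
        return json.load(f)["cutouts"]


def check_round_trip():
    # Write with the producer, read back with the reader, and compare.
    cutouts = {"7": [[1.0, 2.0], [3.0, 4.0]]}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cutouts.json")
        produce(path, cutouts)
        return read_cutouts(path) == cutouts


round_trip_ok = check_round_trip()
```

If someone renames the top-level key in `produce` without updating `read_cutouts`, the reader raises a `KeyError` and the check fails, which is the "alarm" described above.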

To handle the first concern, I used lazy evaluation, specifically lazy initialization. I created a class called CutoutCollection, which handles the file loading and abstraction. The CutoutCollection appears to contain many objects of the class Cutout; however, the Cutout objects are not actually created until they are requested by name (ID number). Furthermore, the contents of each Cutout are not loaded from the underlying HDF5 files until they are requested. Here is a section of the implementation of Cutout:

Note that __init__() sets the internal variables corresponding to loaded quantities to None. When the flux property is accessed, it first calls _delayed_init(), which completes the actual initialization. If the object has already been initialized, _delayed_init() does nothing. The Wcs and Pixels classes also delay their initialization until their member data is requested (not shown here).
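A minimal sketch of this pattern might look like the following. It is not the original implementation: the `loader` argument is a hypothetical callable standing in for the HDF5 read, `fake_loader` and `load_calls` exist only to make the behavior visible, and the Wcs and Pixels parts are left out.

```python
class Cutout:
    """Sketch of a lazily initialized cutout (not the original class)."""

    def __init__(self, cutout_id, loader):
        # Only cheap bookkeeping happens here; nothing is read yet.
        self._id = cutout_id
        self._loader = loader       # hypothetical stand-in for the HDF5 read
        self._flux = None           # loaded quantity, set by _delayed_init()
        self._initialized = False

    def _delayed_init(self):
        # Do nothing if the data has already been loaded.
        if self._initialized:
            return
        self._flux = self._loader(self._id)
        self._initialized = True

    @property
    def flux(self):
        # The first access triggers the load; later accesses reuse the result.
        self._delayed_init()
        return self._flux


load_calls = []


def fake_loader(cutout_id):
    # Stand-in for reading a 2D array of floats from an HDF5 file.
    load_calls.append(cutout_id)
    return [[0.0, 1.0], [2.0, 3.0]]


cutout = Cutout(42, fake_loader)
calls_after_construction = list(load_calls)  # still empty: nothing loaded yet
first = cutout.flux                          # triggers the one and only load
second = cutout.flux                         # served from the cached value
```

Constructing the object costs almost nothing, and the loader runs exactly once no matter how many times `flux` is read.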

The CutoutCollection class looks like this:

The method _make_index_if_necessary() is called whenever the contents of the collection are requested. Until data is requested, the CutoutCollection object contains only references to the HDF5 files that it represents. I could have made this even more lightweight by skipping the up-front index creation and adding objects to the _cutout_index dictionary only when they are requested. I chose not to do that because the current method works well enough, since the Cutout objects also use lazy initialization.
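The behavior described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the original class: `list_ids` and `load_flux` are hypothetical callables standing in for the HDF5 access, `FAKE_FILES` is an in-memory substitute for real files, and the `Cutout` here is a stripped-down stand-in so that the snippet runs on its own.

```python
class Cutout:
    """Stripped-down stand-in for a lazily loaded cutout."""

    def __init__(self, file_name, cutout_id, load_flux):
        self._file_name = file_name
        self._id = cutout_id
        self._load_flux = load_flux
        self._flux = None

    @property
    def flux(self):
        # Load the data on first access only.
        if self._flux is None:
            self._flux = self._load_flux(self._file_name, self._id)
        return self._flux


class CutoutCollection:
    """Sketch of a collection that builds its index on first use."""

    def __init__(self, file_names, list_ids, load_flux):
        # Only references to the files are stored at construction time.
        self._file_names = list(file_names)
        self._list_ids = list_ids      # hypothetical: file name -> cutout ids
        self._load_flux = load_flux    # hypothetical: (file name, id) -> data
        self._cutout_index = None

    def _make_index_if_necessary(self):
        # Build the id -> Cutout index once; the Cutout objects stay unloaded.
        if self._cutout_index is not None:
            return
        self._cutout_index = {
            cutout_id: Cutout(name, cutout_id, self._load_flux)
            for name in self._file_names
            for cutout_id in self._list_ids(name)
        }

    def __getitem__(self, cutout_id):
        self._make_index_if_necessary()
        return self._cutout_index[cutout_id]

    def __len__(self):
        self._make_index_if_necessary()
        return len(self._cutout_index)


# In-memory stand-in for two HDF5 files.
FAKE_FILES = {
    "a.h5": {1: [[0.0, 1.0]], 2: [[2.0, 3.0]]},
    "b.h5": {3: [[4.0, 5.0]]},
}
events = []


def list_ids(file_name):
    events.append(("index", file_name))
    return list(FAKE_FILES[file_name])


def load_flux(file_name, cutout_id):
    events.append(("load", file_name, cutout_id))
    return FAKE_FILES[file_name][cutout_id]


collection = CutoutCollection(["a.h5", "b.h5"], list_ids, load_flux)
events_after_construction = list(events)  # nothing has been touched yet
flux = collection[3].flux                 # builds the index, loads one cutout
```

In this sketch, the first lookup indexes both files but loads only the single cutout that was requested; the other two `Cutout` objects exist in the index without any data behind them.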

You can find some timing results that illustrate the performance of the code here.
