Python's concurrent.futures module

This will be an uncharacteristically short post. Instead of writing a full-fledged article, I wanted to point you to a tip that I just learned from an article that George Seif posted on Medium [1]. In it he introduces Python’s concurrent.futures module, and explains how it can be used to accelerate automated data analysis processes.

One common theme in this blog is the use of the glob module. Glob creates a list of all of the files in a folder that match a pattern, giving you a list of files that Python can iterate through. In this way you can write a program that performs the same set of calculations on every file in that folder, analyzing all of the data in a fraction of the time, and with a fraction of the effort, that it would take to handle each file individually.
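For reference, the single-core glob loop described above might look like the following sketch. The names analyze_file and analyze_folder, and the line-counting "analysis" itself, are placeholders invented for illustration; your own function would go in their place.

```python
import glob
import os


def analyze_file(file_path):
    """Hypothetical stand-in for a real analysis: count the lines in one file."""
    with open(file_path) as f:
        return sum(1 for _ in f)


def analyze_folder(folder):
    """Run analyze_file on every .csv file in folder, one file after another."""
    pattern = os.path.join(folder, "*.csv")
    results = {}
    # glob.glob returns the matching paths in arbitrary order, so sort them
    # to get a repeatable processing order.
    for file_path in sorted(glob.glob(pattern)):
        results[os.path.basename(file_path)] = analyze_file(file_path)
    return results
```

Each file is processed only after the previous one has finished, which is the behavior the rest of this post sets out to speed up.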

By default, Python runs your code in a single process, which keeps only one core of your computer busy at a time. This means that, when iterating through your glob loop, Python analyzes the first data file, then the second, then the third, and so on. What George points out is that the concurrent.futures module can spread that work across all of the cores of the computer in parallel. Its ProcessPoolExecutor starts a pool of worker processes, by default one per processor. One worker analyzes the first file in the glob list. A second analyzes the second file at the same time. A third analyzes the third file at the same time. And so on, until all of your processors are in use; as each worker finishes, it picks up the next file in the list. In cases where you need to analyze thousands, or maybe even millions, of data files, this could result in dramatically faster completion times.
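If you would rather not hand every core to the analysis, ProcessPoolExecutor accepts a max_workers argument that caps the size of the pool. The sketch below is one possible way to pick that number; the helper names choose_worker_count and make_executor, and the idea of reserving one core for the rest of the system, are assumptions made for illustration rather than anything taken from George's article.

```python
import concurrent.futures
import os


def choose_worker_count(reserve_cores=1):
    """Leave reserve_cores processors free, but always keep at least one worker."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, available - reserve_cores)


def make_executor(reserve_cores=1):
    """Build a process pool limited to the chosen number of workers."""
    workers = choose_worker_count(reserve_cores)
    return concurrent.futures.ProcessPoolExecutor(max_workers=workers)
```

Leaving max_workers unset is the simplest choice and matches the behavior described above; a cap like this is only worth adding if the machine needs to stay responsive while the analysis runs.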

The base code needed to use this capability is shown below. The variable “Path” takes the place of the path to your data folder, and “Analysis_Script” takes the place of the function you’re using to analyze the data.

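import concurrent.futures
import glob

with concurrent.futures.ProcessPoolExecutor() as executor:
    # Path must end with a path separator, for example "data/"
    Test_Files = glob.glob(Path + "*.csv")
    # Run Analysis_Script on every file, spread across the worker processes
    executor.map(Analysis_Script, Test_Files)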
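The snippet above leaves out a few details that matter once it is saved as a script, so here is a slightly fuller sketch. It assumes a folder named "data" and uses a placeholder analysis_script that just reports each file's size; both are hypothetical. The executor_class parameter is also my own addition, included so that the same function can be tried with ThreadPoolExecutor, which has the same interface.

```python
import concurrent.futures
import glob
import os

# Hypothetical folder name; replace with the path to your own data.
DATA_FOLDER = "data"


def analysis_script(file_path):
    """Hypothetical stand-in for the real analysis: return (file name, size in bytes)."""
    return os.path.basename(file_path), os.path.getsize(file_path)


def analyze_all(folder, executor_class=concurrent.futures.ProcessPoolExecutor):
    """Run analysis_script on every .csv file in folder and collect the results."""
    test_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not test_files:
        return []  # nothing to do, so do not start a pool at all
    with executor_class() as executor:
        # executor.map yields results in the same order as test_files.
        # Wrapping it in list() waits for every file and re-raises any
        # exception from a worker here instead of silently dropping it.
        return list(executor.map(analysis_script, test_files))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block when they
    # import the script, which matters on platforms that start workers by
    # launching a fresh interpreter.
    for name, size in analyze_all(DATA_FOLDER):
        print(f"{name}: {size} bytes")
```

Two points are worth noting. The analysis function has to be defined at the top level of the module so that it can be sent to the worker processes, and iterating over the results of executor.map is what brings any error raised inside a worker back to the main program.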

For a more complete introduction, including timed results showing the speed improvements, see George’s article. It is cited in the footnotes.

[1] Seif, George. “Here’s how you can get a 2-6x speed-up on your data pre-processing with Python.” Medium.com.