Python's concurrent.futures module

This will be an uncharacteristically short post. Instead of writing a full-fledged article, I wanted to point you to a tip that I just learned from an article that George Seif posted on Medium [1]. In it he introduces Python’s concurrent.futures module, and explains how it can be used to accelerate automated data analysis processes.

One common theme in this blog is the use of the glob package. Glob creates a list of all of the files in a folder, giving you a list of files that you can program Python to iterate through. In this way you can write a program that performs a set of calculations on every file in that folder, analyzing all of the data in a fraction of the time, and with a fraction of the effort, that would be required to do it manually, or even programmatically without glob.
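
As a minimal sketch of that pattern (the folder path here is a placeholder, and Analysis_Script stands in for whatever analysis function you've written):

import glob

# Build a list of every .csv data file in the folder (the path is a placeholder)
Test_Files = glob.glob("C:/Data/My_Project/" + "*.csv")

# Analyze the files one at a time, in sequence
for File in Test_Files:
    Analysis_Script(File)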

By default, Python uses a single core of your computer for its processes. This means that, when iterating through your glob loop, Python will use a single core to analyze the first data file, then the second, then the third, and so on. What George points out is that the concurrent.futures module lets Python use all of the cores of the computer in parallel. This means that one core analyzes the first file in the glob list while a second core analyzes the second file and a third analyzes the third, and so on, until all of your processor cores are in use. In cases where you need to analyze thousands, or maybe even millions, of data files, this can result in dramatically faster completion times.

The base code needed to use this capability is shown below. In the code, the variable Path takes the place of the path to your data folder, and Analysis_Script takes the place of the function you're using to analyze the data.

import concurrent.futures
import glob

if __name__ == "__main__":
    # Build the list of data files, then hand them to a pool of worker processes
    Test_Files = glob.glob(Path + "*.csv")
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(Analysis_Script, Test_Files)
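
(One small addition to George's snippet: because ProcessPoolExecutor launches separate worker processes that re-import your script, it's good practice to put the pool setup under an if __name__ == "__main__": guard, as shown above; without it, the script can fail outright on Windows.)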

For a more complete introduction, including timed results showing the speed improvements, see George’s article. It is cited in the footnotes.

[1] Seif, George. “Here’s how you can get a 2-6x speed-up on your data pre-processing with Python.” Medium.com.

September Announcements

Since the 1000x Faster blog is currently on hiatus while I create new content, I figure that this is a good time to announce a few changes and upcoming projects.

Newsletter

First off, 1000x Faster now has a newsletter that you can sign up for. I expect to use the newsletter for announcements of new projects and new product releases. Examples include announcements of new blog post series discussing new topics (and here's a teaser: make sure to see the last announcement in this post for one of those!), or releases of publications or data analysis tools. Make sure to sign up using the link at the very bottom of the page to stay informed of all the new happenings here at 1000x Faster.

Patreon Account

Thus far I've published this blog, teaching people concepts needed to make their data analysis processes much faster and easier, free of charge. I want to keep it that way, so everybody can learn from it whether they have the money to pay for education or not. At the same time, I do need to monetize 1000x Faster so that I'm rewarded for the time that I spend providing this value to people. In an attempt to balance these goals, I've created a Patreon account so that those who have money and want to support the project can do so. Hopefully this brings in enough money that I can continue providing this content to as many people as possible. If you're interested in supporting the 1000x Faster project, the link to my Patreon account is just above the Newsletter sign up form at the bottom of the page.

New Blog Post Series: Python Foundations

My previous blog post series focused on a fairly advanced topic: how to use Python to automate laboratory data analysis. This is certainly a valuable topic that many people can benefit from, but all of those posts assumed that the reader had the necessary Python tools installed and a basic understanding of how to use them. That series briefly introduced several packages that are used in data analysis automation, but only scratched the surface.

Aiming to support those who are newer to Python programming, I'll be starting a series of blog posts I'm calling "Python Foundations." It will provide a more beginner-level introduction to these topics, including 1) installing my preferred Python tools, 2) pointing the reader to excellent resources for learning Python's basic syntax and commands, and 3) providing detailed tutorials on using many of the packages that are useful for data analysis and plotting. Specific tutorials will be written for important packages such as pandas, bokeh, glob, and matplotlib.

Next Steps

I expect the next blog post to come when I have several of the Python Foundations blog posts created. Hopefully you're excitedly waiting [1].

[1] Excitedly, yes. Though I wouldn't recommend holding your breath. It might be a few weeks.

Analyzing Data Sets With Multiple Test Types

The previous posts have all discussed methods for automating data analysis using Python when all tests are similar. This won’t always be the case. Sometimes tests will be used for different purposes; for example, some tests may collect data for regression development, while others search for behaviors or control logic in specific test cases. This creates an added level of complexity when writing scripts to analyze the data; the approach must be flexible enough to correctly analyze each of these different cases. This post describes how to create a central program which is flexible enough to handle all of these data analysis needs.
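
As a rough sketch of one way to get that flexibility (the test-type names, analysis functions, and file-naming convention below are all hypothetical): keep a dictionary that maps each test type to the analysis function written for it, then look up the right function for each file.

import glob
import os

import pandas as pd

def Analyze_Regression_Test(df):
    # Placeholder: calculations for tests that feed regression development
    pass

def Analyze_Control_Test(df):
    # Placeholder: calculations for tests that exercise specific control logic
    pass

# Map each test type to the function that knows how to analyze it
Analysis_Functions = {"Regression": Analyze_Regression_Test,
                      "Control": Analyze_Control_Test}

for File in glob.glob("C:/Data/My_Project/" + "*.csv"):
    df = pd.read_csv(File)
    # Assumes file names start with the test type, e.g. "Regression_Test12.csv"
    Test_Type = os.path.basename(File).split("_")[0]
    Analysis_Functions[Test_Type](df)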

Generating Regressions from Stored Data

The final Python-based automation of laboratory data analysis topic to discuss is that of generating and validating regressions from the stored data. This is typically the ultimate goal of laboratory data analysis projects, and there are still several things to think through before declaring the project completed. This post will introduce and discuss topics such as identifying the best regression form, different tools for generating regressions, and validating models.

Storing Intermediate Results for Later Analysis

So far, all of the discussion has been in analyzing results from individual tests. The next step is to begin to think bigger picture, and create ways to combine those individual test results into data sets describing the results from the entire project. The first step is storing the individual test results in a logical manner, which facilitates later analysis. This post provides guidance on how to do that.

Checking the Quality of Testing and Analysis Results

One challenge of automated data analysis is checking the results. There is potential for errors both in the testing and in the data analysis, and both kinds can be caught quickly when analyzing data manually. This post provides some methods for doing the same error checking with automated processes, and provides example Python code.

An Introduction to Python Packages that are Useful for Automating Data Analysis

Automating analysis of each individual test relies on the capabilities of several available packages. These packages include glob, pandas, bokeh, and matplotlib. This post provides an introduction to these packages, and future posts will provide a much more thorough description of individual capabilities.

How to Identify the Conditions of Laboratory Tests and Split Large Data Files

When automating laboratory data analysis, it’s critical that the program have a way to identify the conditions of the test. Sometimes this is easier said than done, as file names may consist of nondescript numbers, provide more information about when the test was run than the test itself, or contain data from several tests in a single file. This post provides some ways to overcome these obstacles, complete with example Python code.

Designing Data Sets for Automated Laboratory Data Analysis

Automating laboratory data analysis is either simple or a nightmare depending on how the data set is structured. This post describes some of the fundamental challenges, and provides several possible solutions to make your data science life easier.

The Structure of Automating Laboratory Data Analysis

Since laboratory experimentation and the associated data analysis are a common part of scientific research, the next series of posts will focus on how to automate this process. First, we'll present the structure and big-picture design of a project before moving on to discuss several of the topics in significantly more depth. This series of posts will focus primarily on the data science portion of the project, with some brief discussion of collaborating with the laboratory testers.

The Structure of a Laboratory Experiment Based Project with Automated Data Analysis

Unfortunately, each project must be approached individually, and a detailed yet generic solution doesn't exist. However, there is a fundamental approach that can be applied to every project, with the specific programming (primarily the calculations) changing between projects. The following general procedure provides the structure of an automated data analysis project. Several of these individual steps will be addressed in detail in later posts.

1. Create the test plan

Determine what tests need to be performed to generate the data set needed to answer the research question. This ensures that a satisfactory data set is available when generating regressions at the end of the project, and avoids needing to perform extra tests.

2. Design the data set to allow automation

This includes specifying what signals will be used to identify the most important sections of the tests, or the sections that will be analyzed by the project. This ensures that there will be an easy way to structure the program to identify the results of each individual test.
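
For example (a sketch only; the file and column names are hypothetical), a dedicated flag recorded alongside the measurements makes it trivial to pull out the portion of each test that actually matters:

import pandas as pd

df = pd.read_csv("Test_01.csv")  # hypothetical data file

# Keep only the rows where the data logger flagged steady-state operation
Steady_State_Data = df[df["Steady_State_Flag"] == 1]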

3. Create a clear file naming system

Either create a data printing method that makes identification of the test conditions in each test straightforward, or collaborate with the lab tester to do so. This ensures that the program will be able to identify the conditions of each test, which is necessary for analyzing the data and storing the results.
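
As a sketch of what a clear naming system buys you (this particular convention is made up): if files are named something like "Flow2.5_Temp80.csv", the test conditions can be read straight out of the file name.

import os

File = "C:/Data/My_Project/Flow2.5_Temp80.csv"  # hypothetical file name

# Pull the test conditions out of the file name itself
Name = os.path.basename(File).replace(".csv", "")
Flow_Part, Temp_Part = Name.split("_")
Flow_Rate = float(Flow_Part.replace("Flow", ""))    # 2.5
Temperature = float(Temp_Part.replace("Temp", ""))  # 80.0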

4. Store the resulting data files in a specific folder

This allows use of the Python package "glob" to sequentially open, and analyze the data from each individual test.

5. Analyze the results of individual tests

Create a program to automatically cycle through all of the data files, and analyze each data set. This program will likely use a for loop and glob to automatically analyze every data file. It will likely use pandas to perform the calculations to identify the desired result of the test, and create checks to ensure that the test was performed correctly. It will also likely include plotting features with either bokeh or matplotlib.
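
A minimal sketch of that central loop (the folder path, column names, and result calculation are all placeholders):

import glob

import matplotlib.pyplot as plt
import pandas as pd

Results = []
for File in glob.glob("C:/Data/My_Project/" + "*.csv"):
    df = pd.read_csv(File)

    # Placeholder calculation: the average of a hypothetical measurement column
    Result = df["Measurement"].mean()
    Results.append(Result)

    # Save a plot of the raw data so each test can be inspected visually
    plt.figure()
    plt.plot(df["Time"], df["Measurement"])
    plt.savefig(File.replace(".csv", ".png"))
    plt.close()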

6. Include error checking options

Any number of errors can occur in this process. Maybe some of the tests had errors. Maybe there was a mistake in the programmed calculations. Make life easier by ensuring that the program provides ample outputs for checking the quality of the test results and the subsequent data analysis. This could mean printing plots from each test to allow visual inspection, or adding an algorithm that compares the measured data and calculations to expectations and reports errors.
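
For instance (a sketch with made-up limits and a made-up function name), a simple range check can flag suspect tests for manual review; it could be called for each file inside the Step 5 loop:

def Check_Result(File, Result, Expected_Min=0.0, Expected_Max=100.0):
    # Flag any test whose calculated result falls outside the expected range
    if not Expected_Min <= Result <= Expected_Max:
        print("Check {}: result {} is outside the expected range".format(File, Result))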

7. Store the data logically

The calculated values from each test need to be stored in tables and data files for later use. How these values are stored can either make the remaining steps easy, or impossible. The data should often be stored in different tables that provide the data set needed to later perform regressions.
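
A minimal sketch of that storage step, assuming the per-test conditions and results have already been calculated (all names and values below are placeholders):

import pandas as pd

# One dictionary per test, built up as each test is analyzed (placeholder values)
Test_Results = [{"Flow_Rate": 2.5, "Temperature": 80, "Efficiency": 0.61},
                {"Flow_Rate": 5.0, "Temperature": 80, "Efficiency": 0.68}]

# Combine the individual results into one table and save it for the regression step
Results_Table = pd.DataFrame(Test_Results)
Results_Table.to_csv("C:/Data/My_Project/Results_Table.csv", index=False)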

8. Generate regressions from the resulting data set

Create a program that will open the stored data from Step 7 and create regressions. It should include an algorithm to create each desired regression, matching the data storage structure determined in Step 7. Ensure that this program provides adequate outputs, both statistical and visual, to allow thorough validation of the results.
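
A bare-bones sketch of that program, picking up the hypothetical Step 7 table from above and using a simple polynomial fit (the column names and regression form are placeholders; your project may well call for something more sophisticated):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Open the stored results table from Step 7
Results_Table = pd.read_csv("C:/Data/My_Project/Results_Table.csv")
x = Results_Table["Flow_Rate"]
y = Results_Table["Efficiency"]

# Fit a second-order polynomial and report the coefficients
Coefficients = np.polyfit(x, y, 2)
print("Regression coefficients:", Coefficients)

# Visual output: measured points against the regression curve
plt.scatter(x, y, label="Measured")
plt.plot(np.sort(x), np.polyval(Coefficients, np.sort(x)), label="Regression")
plt.legend()
plt.savefig("Regression_Check.png")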

9. Validate the results

Validate the resulting regressions using the statistical and visual outputs provided in Step 8. Determine whether the model is accurate enough or not. If not, either return to Step 7 and generate different regressions, or Step 1 and add additional tests to create a more comprehensive data set. If the model is accurate enough, publish detailed descriptions of its strengths and weaknesses so that future users understand what situations the model should/should not be used in.
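
One small, hedged sketch of a numerical check for this step (the function name and accuracy target are made up): feed it the measured values and the regression's predictions from Step 8, and it reports the error statistics and whether the fit meets your target.

import numpy as np

def Validate_Regression(measured, predicted, max_allowed_error):
    # Report error statistics and return whether the fit meets the accuracy target
    errors = np.asarray(predicted) - np.asarray(measured)
    print("RMS error:", np.sqrt(np.mean(errors ** 2)))
    print("Worst-case error:", np.max(np.abs(errors)))
    return np.max(np.abs(errors)) <= max_allowed_error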

Next Up: Designing Data Sets to Allow Automation

Those 9 steps provide the framework of a research project with automated data analysis. The upcoming series of posts will dive into the details of specific points. Next week we'll start by exploring step 2, with a thorough discussion of how to design data sets to allow automated data analysis.

 

Welcome to 1000x Faster!

1000x Faster is based on a very simple premise: You have better things to do than manual data analysis. Maybe you'd rather be drawing conclusions from the data and sharing them with your clients. Maybe your preference is for the idea generation and business development side of things. Or maybe higher efficiency in your work yields higher profit margins. Whatever it is that drives you, the goal of 1000x Faster is to help you finish your data analysis in a fraction of the time so you can get to those things.