The Structure of Automating Laboratory Data Analysis

Since laboratory experimentation and the associated data analysis are a common part of scientific research, the next series of posts will focus on how to automate this process. First, we'll present the structure and big-picture design of a project before moving on to discuss several of the topics in significantly more depth. This series will focus primarily on the data science portion of the project, with some brief discussion of collaborating with the laboratory testers.

The Structure of a Laboratory Experiment Based Project with Automated Data Analysis

Unfortunately, each project must be approached individually, and a detailed yet generic solution doesn’t exist. However, there is a fundamental approach that can be applied to every project, with the specific programming (primarily the calculations) changing between projects. The following general procedure provides the structure of an automated data analysis project. Several of these individual steps will be addressed in detail in later posts.

1. Create the test plan

Determine what tests need to be performed to generate the data set needed to answer the research question. This ensures that a satisfactory data set is available when generating regressions at the end of the project, and avoids needing to perform extra tests.

2. Design the data set to allow automation

This includes specifying what signals will be used to identify the most important sections of the tests, or the sections that will be analyzed by the project. This ensures that there will be an easy way to structure the program to identify the results of each individual test.

3. Create a clear file naming system

Either create a data printing method that makes identification of the test conditions in each test straightforward, or collaborate with the lab tester to do so. This ensures that the program will be able to identify the conditions of each test, which is necessary for analyzing the data and storing the results.
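As a minimal sketch of why the naming system matters, the function below reads test conditions back out of a file name. The naming convention (key=value pairs separated by underscores), the condition names "Flow" and "Temp", and the function name are all assumptions made for illustration; any convention works as long as it can be parsed reliably.

```python
import os
import re

# Hypothetical naming convention (an assumption, not a requirement):
#   key=value pairs separated by underscores, e.g. "Flow=2.5_Temp=60.csv"
CONDITION_PATTERN = re.compile(r"([A-Za-z]+)=(-?\d+(?:\.\d+)?)")


def parse_conditions(path):
    """Return the test conditions encoded in a file name as a dict of floats."""
    # Drop the folder and the extension so only the encoded conditions remain
    stem = os.path.splitext(os.path.basename(path))[0]
    return {key: float(value) for key, value in CONDITION_PATTERN.findall(stem)}
```

With this convention, a file named "Flow=2.5_Temp=60.csv" yields a dictionary holding a flow of 2.5 and a temperature of 60.0, which the analysis program can store alongside the calculated results.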

4. Store the resulting data files in a specific folder

This allows use of the Python standard library module "glob" to list every data file in the folder, so that the program can open and analyze the data from each individual test in sequence.

5. Analyze the results of individual tests

Create a program that automatically cycles through all of the data files and analyzes each data set. This program will likely use a for loop and glob to automatically analyze every data file. It will likely use pandas to perform the calculations to identify the desired result of the test, and create checks to ensure that the test was performed correctly. It will also likely include plotting features with either bokeh or matplotlib.
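The loop described here could look roughly like the following sketch. It assumes the data files are CSVs and that each one contains a column named "Temperature"; both are placeholders, as is the calculation itself (a simple mean), which stands in for whatever the project actually needs.

```python
import glob
import os

import pandas as pd


def analyze_folder(folder):
    """Analyze every CSV file in a folder and return one row of results per test."""
    results = []
    # sorted() makes the processing order repeatable; glob itself guarantees no order
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        data = pd.read_csv(path)
        # Placeholder calculation: the mean of a hypothetical "Temperature" column
        results.append({
            "file": os.path.basename(path),
            "mean_temperature": data["Temperature"].mean(),
        })
    return pd.DataFrame(results)
```

Returning one row per test keeps the per-file calculations separate from the storage and regression steps that follow.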

6. Include error checking options

Any number of errors can occur in this process. Maybe some of the tests had errors. Maybe there was a mistake in the programmed calculations. Make life easier by ensuring that the program provides ample outputs to check the quality of the test results and the subsequent data analysis. This could mean printing plots from the test that allow visual inspection, or adding an algorithm that compares the measured data and calculations to expectations and reports errors.
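One simple form of the comparison algorithm mentioned above is a relative tolerance check. The sketch below is only an illustration: the function name, the 10% default tolerance, and the message format are assumptions, and a real project would pick limits that suit its own measurements.

```python
def check_result(name, measured, expected, tolerance=0.10):
    """Return a warning string if measured deviates too far from expected, else None."""
    if expected == 0:
        # A relative deviation is undefined when the expected value is zero
        return f"{name}: expected value is zero, relative check skipped"
    deviation = abs(measured - expected) / abs(expected)
    if deviation > tolerance:
        return f"{name}: measured {measured} deviates {deviation:.0%} from expected {expected}"
    return None
```

Collecting these warnings for every test and printing them at the end of the run gives a quick list of the files that deserve a closer look.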

7. Store the data logically

The calculated values from each test need to be stored in tables and data files for later use. How these values are stored can either make the remaining steps easy, or impossible. The data should often be stored in different tables that provide the data set needed to later perform regressions.
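As one possible layout, each test can become a single row that holds both its conditions and its calculated values. The sketch below assumes pandas and uses hypothetical column names ("Flow", "Temp", "efficiency"); the right structure depends on which regressions are planned.

```python
import pandas as pd


def build_results_table(records, condition_columns):
    """Combine per-test dictionaries into one table with one row per test."""
    table = pd.DataFrame(records)
    # Sort by the test conditions so related tests sit next to each other
    return table.sort_values(condition_columns).reset_index(drop=True)
```

The resulting table can then be written to disk with its to_csv method (for example with index=False) so that the regression program can read it back later.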

8. Generate regressions from the resulting data set

Create a program that will open the stored data from Step 7 and create regressions. It should include an algorithm to create each desired regression, matching the data storage structure determined in Step 7. Ensure that this program provides adequate outputs, both statistical and visual, to allow thorough validation of the results.
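To make the idea of a regression with a statistical output concrete, here is a small sketch that fits a polynomial with NumPy and reports the coefficient of determination. A polynomial fit is only an assumed example; the appropriate model form depends on the research question, and the function name is hypothetical.

```python
import numpy as np


def fit_polynomial(x, y, degree=1):
    """Fit a polynomial by least squares and return (coefficients, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Coefficients are ordered from the highest power down to the constant term
    coefficients = np.polyfit(x, y, degree)
    predicted = np.polyval(coefficients, x)
    # R^2 compares the residual error to the total variation in y
    # (this assumes y is not constant, otherwise the denominator is zero)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return coefficients, r_squared
```

Plotting the predicted values against the measured values alongside this number provides the visual check that a single statistic cannot.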

9. Validate the results

Validate the resulting regressions using the statistical and visual outputs provided in Step 8. Determine whether the model is accurate enough. If not, either return to Step 8 and generate different regressions, or return to Step 1 and add tests to create a more comprehensive data set. If the model is accurate enough, publish detailed descriptions of its strengths and weaknesses so that future users understand which situations the model should and should not be used in.

Next Up: Designing Data Sets to Allow Automation

Those nine steps provide the framework of a research project with automated data analysis. The upcoming series of posts will dive into the details of specific points. Next week we'll start by exploring Step 2, with a thorough discussion of how to design data sets to allow automated data analysis.