Performance Map Tutorial: Analyzing the Individual Data Files

Now that the main data file has been split into individual data files, the next step in our tutorial on automatically creating a performance map for heat pump water heaters (HPWHs) in Python is to analyze the individual files. This will include the following steps:

  • Filtering the data in each file to include only the data needed in each test,

  • Calculating regressions representing the coefficient of performance (COP) of the HPWH in each test,

  • Checking the accuracy of the regression,

  • And storing the results.

If any of the terms above are foreign to you, see Performance Map Tutorial: Creating a Performance Map of Heat Pump Water Heaters for an introduction.

This post will walk through each of the above steps. If you’re following along with the companion data set then, after following these steps, you will have plotted data and regressions showing the COP of the HPWH as a function of water temperature at each of the three different ambient temperatures in the data set.

Package Import Statements

As in the previous post, the first step is to import all of the required packages. The recommended packages were all described in An Introduction to Python Packages that are Useful for Automating Data Analysis. For this part of the project, the recommended packages are:

  1. pandas: This package was already described in Performance Map Tutorial: Splitting the Data Set into Individual Files, and will not be described in detail here.

  2. glob: This package was already described in Performance Map Tutorial: Splitting the Data Set into Individual Files, and will not be described in detail here.

  3. os: This package was already described in Performance Map Tutorial: Splitting the Data Set into Individual Files, and will not be described in detail here.

  4. NumPy: NumPy is a numerical package commonly used in scientific computing. It contains several functions that are useful for understanding large data sets. For this tutorial, the most important functions are polyfit and poly1d. polyfit fits a polynomial of a user-specified order to a data set and returns its coefficients. poly1d turns the coefficients generated by polyfit into a function that returns the value of the regression at a user-specified condition.
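
To make the roles of polyfit and poly1d concrete, here is a minimal sketch on made-up data. The x and y values are invented for illustration and have nothing to do with the HPWH data set; they lie exactly on a known 2nd order curve so the result is easy to check by eye.

```python
import numpy as np

# Synthetic data lying exactly on y = 2x^2 - 3x + 1 (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x**2 - 3.0 * x + 1.0

# polyfit returns coefficients ordered from the highest power to the lowest
coefficients = np.polyfit(x, y, 2)

# poly1d wraps the coefficients in a callable polynomial
regression = np.poly1d(coefficients)

print(coefficients)      # approximately [ 2. -3.  1.]
print(regression(5.0))   # approximately 36.0
```

Note the ordering of the coefficients: the first entry multiplies x squared, and the last entry is the constant term.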

To import these packages and make use of their capabilities for this project, use the following lines of code:

import glob

import os

import pandas as pd

import numpy as np

Reading the Data Files

The next step in the process is to read the data files. This will be done using a method that will be familiar to those who followed Performance Map Tutorial: Splitting the Data Set into Individual Files: we’ll use glob to create a list of all files in the folder, then use a for loop to run through each file in the list. It can be done in four steps:

  1. First set the path. This path tells glob where the files are located, so it searches the correct folder,

  2. Second use glob to create a list of all of the appropriate files in that folder.

  3. Third create a for loop iterating through each of the files stored in the glob list,

  4. Fourth add a line within the for loop reading the data files sequentially.

These steps can be accomplished with the following code. Note that this code assumes the files are located in a certain folder, and the code defining the path will need to be updated to match the location of your data files.

Path = r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet'

Filenames = glob.glob(Path + '/*.csv') #Creates a list of all filenames in the "path" folder

for Filename in Filenames:

    Data = pd.read_csv(Filename)

Note that the rest of the steps in this process will be contained within the for loop, and will be indented accordingly.
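
If you want to try the glob and for loop pattern without the companion data set, the following is a minimal sketch that builds a throwaway folder first. The file names and contents are hypothetical, and os.path.join is used here in place of string concatenation so the separator is correct on any operating system.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with two small .csv files so the loop can run anywhere
folder = tempfile.mkdtemp()
for name in ['Test_A.csv', 'Test_B.csv']:  # hypothetical file names
    pd.DataFrame({'P_Elec (W)': [0.0, 400.0]}).to_csv(os.path.join(folder, name), index=False)

# os.path.join inserts the correct separator for the current operating system
filenames = sorted(glob.glob(os.path.join(folder, '*.csv')))

row_counts = {}
for filename in filenames:
    data = pd.read_csv(filename)
    # Everything that analyzes a single file belongs inside this loop
    row_counts[os.path.basename(filename)] = len(data)

print(row_counts)  # {'Test_A.csv': 2, 'Test_B.csv': 2}
```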

Filtering the Data Set

The split data files include data from the conditioning period at the start of each test, as well as extra data at the end of the final test. This data is not useful for the analysis, and will actually lead to errors in the regressions. It is still in the files because the script that split the data files did not include code to remove it. To correctly analyze the data, we must remove the extraneous data now.

If you open PerformanceMap_HPWH_55.csv and look at the initial data you can see the conditioning period quite clearly. At first, the flow rate of water is 5 gal/min, and the water temperatures in the tank are changing dramatically. This means that cold water is being added to the tank, pushing the hot water out, to prepare for the test. When all of the 125 deg F water is pushed out of the tank and replaced with 72 deg F water, the tank will be ready to begin the test. This state can be seen when the water flow rate reduces to 0 gal/min. The second issue is the ambient air temperature. At the start of the data file, the ambient temperature is close to 72 deg F when the test is supposed to occur at 55 deg F. Continuing to peruse the data files shows that the ambient temperature gradually decreases after the tank reaches the desired temperature. It gradually decreases from roughly 72 deg F to roughly 55 deg F, as specified in the test.

Fortunately, in this data set the imagined tester creating the data did their job correctly. We can see from the P_Elec (W) column that the HPWH started drawing electricity only after the tank temperatures and ambient temperature reached the desired starting conditions. This both gives us some initial confidence in the data, and provides a clear filtering point. Since the only relevant data for this test is when the HPWH is drawing electricity, we can filter our data set to only include that data.

This can be done with a single line of code. Pandas includes capabilities to reduce a data frame to include only the data that meets a boolean condition. The data frame must then be saved to a new data frame, as pandas won’t overwrite the data unless instructed to do so. This can be achieved using the following line of code:

Data = Data[Data['P_Elec (W)'] > 0]

Note that this filter works very well for our example data set, but would work poorly with real experimental data. No measurement is ever 100% precise, meaning that some electricity readings will be greater than 0 W even when no electricity is being consumed. When using real data, the filter should be set to accept data when electricity flow is greater than some larger number, like 50 or 100 W.
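
As a minimal sketch of that more forgiving filter, the example below uses invented readings in which small non-zero values stand in for measurement noise. The 50 W threshold is an assumption for illustration; the right value depends on the noise in your own instrumentation.

```python
import pandas as pd

# Made-up readings: small non-zero values mimic sensor noise while the unit is off
data = pd.DataFrame({'P_Elec (W)': [0.0, 1.3, 0.4, 412.0, 405.5, 2.1]})

threshold = 50  # W, assumed cutoff; choose it from the noise seen in your own data
filtered = data[data['P_Elec (W)'] > threshold]

print(filtered['P_Elec (W)'].tolist())  # [412.0, 405.5]
```

With a threshold of 0 W, the same filter would have kept the three noisy readings as well.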

Filtering the data like this will result in a data frame with an index that starts at a value greater than zero. This makes it hard to manipulate the data later, and can be corrected by resetting the index. The index can be reset with the following two lines of code.

Data = Data.reset_index()

del Data['index']
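
The two lines above can also be written as a single call. The sketch below, on a small made-up data frame, uses the drop argument of reset_index, which discards the old index rather than storing it in a new column that then has to be deleted.

```python
import pandas as pd

data = pd.DataFrame({'P_Elec (W)': [0.0, 0.0, 400.0, 410.0]})
data = data[data['P_Elec (W)'] > 0]      # index is now [2, 3]

# drop=True discards the old index instead of adding it as an 'index' column
data = data.reset_index(drop=True)

print(data.index.tolist())    # [0, 1]
print(data.columns.tolist())  # ['P_Elec (W)']
```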

Analyzing Each Data Set

Analyzing the data set requires several different calculations. As it stands, the data set provides the electricity consumption and temperature of the tank at eight different locations. However, to calculate the COP of the heat pump we need the change in average temperature of the tank and the amount of electricity consumed during each timestep. They also must have the same units. These objectives can be achieved with the following steps:

  1. First calculate the average temperature of the tank. Since each temperature in the data set represents 1/8th of the tank, the average can be calculated by summing the temperatures and dividing by 8,

  2. Second create a new column representing the average temperature of the tank at the previous timestep. This makes it easy to calculate the change in average tank temperature between timesteps. Do this by shifting the average tank temperature data by one row and assigning it to a new column,

  3. Third enter a value in the first row of the new column. This is necessary because pandas didn’t have a value to fill that cell with when performing the shift. Since we know the test starts with the tank at 72 deg F, we can use the .loc function to fill that cell with 72.0 deg F,

  4. Fourth calculate the change in stored energy in the tank between two timesteps. This is done with the equation (Change_Energy) = (Mass_Water) * (SpecificHeat) * (Change_Temperature). Since it’s a HPWH with an 80 gal storage tank, the mass is 80 gal * 8.3176 lb/gal. The specific heat of water in IP units is 0.998 Btu/(lb-F). The change in temperature can be calculated using the two average tank temperature columns.

  5. The electricity consumed during a timestep is equal to the rate of electricity consumption times the duration of the timestep. Our data points are 10 seconds apart in this data set. To calculate the COP we also must convert the electricity consumption from W to Btu, matching the change in stored energy in the tank. This is done with the conversion 1 W = 3.412142 Btu/hr and then converting from hours to seconds,

  6. Finally, the COP is equal to the change in stored energy divided by the electricity used to cause that change in energy.

All of this can be accomplished with the following code. Keep in mind that this all occurs within the for loop and must be indented accordingly.

Data['Average Tank Temperature (deg F)'] = (1./8.) * (Data['T1 (deg F)'] + Data['T2 (deg F)'] + Data['T3 (deg F)'] + Data['T4 (deg F)'] + Data['T5 (deg F)'] + Data['T6 (deg F)'] + Data['T7 (deg F)'] + Data['T8 (deg F)'])

Data['Previous Average Tank Temperature (deg F)'] = Data['Average Tank Temperature (deg F)'].shift(periods = 1)

Data.loc[0, 'Previous Average Tank Temperature (deg F)'] = 72.0

Data['Change in Stored Energy (Btu)'] = (80 * 8.3176) * (0.998) * (Data['Average Tank Temperature (deg F)'] - Data['Previous Average Tank Temperature (deg F)'])

Data['P_Elec (Btu/10s)'] = Data['P_Elec (W)'] * (3.412142/60/60) * 10

Data['COP (-)'] = Data['Change in Stored Energy (Btu)'] / Data['P_Elec (Btu/10s)']
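
A quick way to sanity check these calculations is to run them on a couple of rows where the answer can be worked out by hand. The sketch below uses two invented rows, 10 seconds apart, with all eight tank temperatures equal and a 0.02 deg F rise. It also uses mean(axis=1) as an alternative to summing the eight columns and dividing by 8; the two approaches give the same average.

```python
import pandas as pd

# Two made-up rows, 10 s apart, with all eight tank temperatures equal for simplicity
temperature_columns = ['T{} (deg F)'.format(i) for i in range(1, 9)]
data = pd.DataFrame({column: [72.0, 72.02] for column in temperature_columns})
data['P_Elec (W)'] = [400.0, 400.0]

# mean(axis=1) averages across the eight columns in each row
data['Average Tank Temperature (deg F)'] = data[temperature_columns].mean(axis=1)
data['Previous Average Tank Temperature (deg F)'] = data['Average Tank Temperature (deg F)'].shift(periods=1)
data.loc[0, 'Previous Average Tank Temperature (deg F)'] = 72.0

mass_water = 80 * 8.3176   # lb, 80 gal tank
specific_heat = 0.998      # Btu/(lb-F)
data['Change in Stored Energy (Btu)'] = mass_water * specific_heat * (
    data['Average Tank Temperature (deg F)'] - data['Previous Average Tank Temperature (deg F)'])
data['P_Elec (Btu/10s)'] = data['P_Elec (W)'] * (3.412142 / 3600) * 10
data['COP (-)'] = data['Change in Stored Energy (Btu)'] / data['P_Elec (Btu/10s)']

print(data['COP (-)'].tolist())  # first row is 0, second row is roughly 3.5
```

By hand, the second row stores about 13.28 Btu while consuming about 3.79 Btu of electricity, which gives a COP of roughly 3.5.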

Generating a Regression for the Individual Data Files

The entire point of this process is creating regressions of the data. We will use the data from each individual test to create regressions showing the COP of the heat pump as a function of water temperature at a specific ambient temperature. Since there are three tests with three different ambient temperatures, this gives us the ability to create a rough performance map showing the COP of the HPWH as a function of both temperatures. To do this, we need to generate a regression showing the COP of the heat pump as a function of temperature in each test. This can be done with the following line of code calling the NumPy function polyfit.

Coefficients = np.polyfit(Data['Average Tank Temperature (deg F)'], Data['COP (-)'], 2)

That code calls numpy.polyfit and tells it to store the regression coefficients in the variable Coefficients. It uses the average tank temperature calculated above as the x data, and the COP calculated above as the y data. Finally, the “2” at the end tells polyfit to make it a 2nd order equation.

It’s important to check and make sure that this process was performed correctly. This can be done in two ways.

  1. First, use the terminal window to examine the contents of Coefficients. Depending on the terminal used you may see different formatting, but the results should be: array([ 1.39999998e-04, -1.34000000e-01, 1.65000000e+01]).

  2. Second, use the NumPy.poly1d function to test the results of the regression. This can be done by first converting the coefficients into a regression, and then evaluating the regression at a specific water temperature. Use temperatures from the dataframe to ensure that the values from the regression match the values calculated in the data set. This can be done with the following code.

Regression = np.poly1d(Coefficients)

COP_72 = Regression(72.0)

COP_140 = Regression(140.0)

If you’re following along with the companion data set, COP_72 should be 9.3277 and COP_140 should be 2.2339.
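
Spot checking two points is a good start, and it can be extended to the whole data set. The following is a minimal sketch of one common approach, which is not necessarily the one used later in this series: compute the residuals between the data and the regression, then summarize them with the largest absolute residual and the coefficient of determination (R squared). The temperature and COP values are made up, with a small sine term standing in for measurement noise.

```python
import numpy as np

# Made-up data: a quadratic COP curve with a small deterministic wiggle standing in for noise
temperature = np.linspace(72.0, 140.0, 50)
cop = 1.4e-4 * temperature**2 - 0.134 * temperature + 16.5 + 0.05 * np.sin(temperature)

coefficients = np.polyfit(temperature, cop, 2)
regression = np.poly1d(coefficients)

# Residuals are the differences between the data and the fitted curve
residuals = cop - regression(temperature)
r_squared = 1.0 - np.sum(residuals**2) / np.sum((cop - cop.mean())**2)

print('Largest absolute residual:', np.abs(residuals).max())
print('R squared:', r_squared)
```

An R squared close to 1 and residuals that are small relative to the COP values suggest the 2nd order form is a reasonable fit for the test.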

Saving Results

The final step in the process is storing the results. In this case, we’ll store both the dataframes with the newly calculated values and the coefficients from the regressions. To do this we’ll need to write code that 1) Creates a new folder to store the values, 2) Provides filenames stating what’s in each file, and 3) Saves the files.

Creating a new folder was covered in Performance Map Tutorial: Splitting the Data Set into Individual Files, so it should be familiar. In this case, we’ll create a new folder name by taking the current path and adding a new folder called “Analyzed” in the structure. Since we already have the existing folder in a variable named Path, we can simply add “\Analyzed” to the end of that. Then we use the same code as before to see if that folder exists, and create it if it does not. The following lines of code accomplish that objective.

Folder = Path + '\\Analyzed'

if not os.path.exists(Folder):

    os.makedirs(Folder)

The next step is providing names for the individual files themselves. We want the dataframes to be stored in the new folder, and have “_Analyzed” at the end of the filename to distinguish them from the raw data files. This can be done by combining the Folder variable, which places the file in the new folder, with a section from the Filename variable from our for loop, and an ending of “_Analyzed.csv”. For the coefficients, we want to save them in the same folder with a name of “Coefficients_” plus the temperature at the end of the filename, and .csv at the end. These filenames can be created using the following two lines of code.

Filename_Test = Folder + '\\' + Filename[-26:-4] + '_Analyzed.csv'

Filename_Coefficients = Folder + '\\Coefficients_' + Filename[-6:]
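
The fixed slices above rely on every filename having the same length, so they would cut the name in the wrong place for a test with a three digit ambient temperature. The following is a minimal sketch of an alternative that derives the same names with os.path functions. The paths are hypothetical, and the approach assumes the ambient temperature is the text after the last underscore in the filename.

```python
import os

# Hypothetical paths, written with forward slashes so the example behaves the same on any OS
filename = 'C:/Data/Files_IndividualTests/PerformanceMap_HPWH_55.csv'
folder = 'C:/Data/Files_IndividualTests/Analyzed'

# 'PerformanceMap_HPWH_55' without the folder or the extension
test_name = os.path.splitext(os.path.basename(filename))[0]
# The text after the last underscore is the ambient temperature, here '55'
temperature = test_name.split('_')[-1]

filename_test = os.path.join(folder, test_name + '_Analyzed.csv')
filename_coefficients = os.path.join(folder, 'Coefficients_' + temperature + '.csv')

print(os.path.basename(filename_test))          # PerformanceMap_HPWH_55_Analyzed.csv
print(os.path.basename(filename_coefficients))  # Coefficients_55.csv
```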

Once those filenames are created, the last step is to actually save the files. This can be done using two lines of code calling the pandas dataframe function to_csv, and the NumPy array function tofile. Make sure to specify that you don’t want to save the index of the dataframe, and that you want the separator for the coefficients to be a comma. This makes it easier to read the data later. The files can be saved with the following two lines of code.

Data.to_csv(Filename_Test, index = False)

Coefficients.tofile(Filename_Coefficients, sep = ',')
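
To show why the comma separator makes the coefficients easy to read later, here is a minimal sketch of a round trip: write an array with tofile, read it back with NumPy's loadtxt, and rebuild the regression. The coefficient values and the file name are illustrative only, and the file is written to a temporary folder.

```python
import os
import tempfile

import numpy as np

coefficients = np.array([1.4e-4, -0.134, 16.5])  # illustrative values only

path = os.path.join(tempfile.mkdtemp(), 'Coefficients_55.csv')  # hypothetical file name
coefficients.tofile(path, sep=',')

# The file is a single line of comma-separated text, so loadtxt can read it back
loaded = np.loadtxt(path, delimiter=',')
regression = np.poly1d(loaded)

print(loaded)
```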

Next Steps

In this post we’ve learned how to automatically perform calculations on several data files, create regressions for each file, and save the results. We did it with the example of three data files, which saves some time and tedium. Imagine the potential if using these methods on projects that contain hundreds of data files, instead of only three.

In the next module, we will discuss ways to visually ensure that the data and data analysis methods were performed correctly. This will include plotting the data to see what’s contained in each file and adding the regression to the plot so you can ensure that it closely fits the data set.

Performance Map Tutorial: Splitting The Data Set into Individual Files

As discussed in the previous posts, we are now starting the process of using Python scripting to automatically analyze laboratory data and create a performance map predicting the coefficient of performance of heat pump water heaters. If this is a surprise, see Performance Map Tutorial: Creating a Performance Map of Heat Pump Water Heater Performance which introduces this series of posts.

In many cases the laboratory data that you receive will contain several tests in a single file. This would be very tedious if analyzing the data by hand due to the repetitive, uninteresting process of identifying where each test begins and ends, copy/pasting that data to another file, and repeating until all the tests are separated. In a project with hundreds of tests, this process would take hours and be excruciating. Fortunately, it’s possible to easily write a Python script that does it automatically. And that’s what we’re going to learn to do here.

Splitting the Data Set into Individual Files

The following sections will walk you through all of the steps needed to split your data file into several files and give them names which make it easy to understand what each file contains.

Package Import Statements

First, you need to import the packages that are useful for this process. The recommended packages were all described in An Introduction to Python Packages that are Useful for Automating Data Analysis. For this part of the project, the three required packages are:

  1. glob: glob is a package that is useful for automatically cycling through several different files. This is useful if your laboratory data comes in a few files instead of only one. If it comes in one large file, this package is not necessary. For those who purchased the data set accompanying this tutorial and are following along, that data comes in a single file and using glob is not necessary.

  2. pandas: pandas is the go-to package for all data analysis needs. It is excellent for reading data in as tables and manipulating those data sets. It will be a key part of nearly all work done in this blog.

  3. os: The os package gives Python programmers access to a set of commands controlling the computer’s operating system. There are some commands which will be relevant to this process.

These three packages can all be imported by adding the following three lines of code to your script.

import pandas as pd

import glob

import os

Those three lines will make the commands in those packages available in the program. The portion in the pandas import line saying “as pd” means that pandas can be referenced using “pd” instead of “pandas” throughout the script. This is convenient, as pandas is referenced quite frequently.

Read in the Data

Once the necessary packages are imported and available, the next step is reading the data.

First, use glob to read the data files. Remember that this is done by 1) Specifying the path of the data files, and 2) Specifying the type of files to be read by glob. It can be done using code similar to the following:

Path = r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet'

Filenames = glob.glob(Path + '/*.csv')

Note that the Path variable must state the folder where you have your data located, and will almost certainly need to be changed from the path specified above. The second line instructs the program to find all files of type .csv located in the folder specified with Path and add them to a list called Filenames. The result is that Filenames now contains a list of all .csv files in that folder which can be iterated through to analyze the data.

It’s also necessary to have a test plan stating the conditions of each test. You’ll need to make this file and save it as a .csv file in your work folder. The test plan needs to state the necessary conditions of each test. It’s useful for both communicating with the experimentalist collecting the data, and for automating the data analysis process. For this example, the test plan should look like the one shown in Figure 1.

Figure 1: Test Plan for this Tutorial

The test plan should be read into a pandas data frame using the typical approach. The following code shows how to do this, and store it in the variable “Test_Plan”. Remember that the path used in your program will need to be modified to match the location of the test plan on your computer.

Test_Plan = pd.read_csv(r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet\Additionals\Test_Plan.csv')

Identifying the End of Each Test

The next step in splitting the data set into individual files is identifying the end of each test. This means reading through the data, understanding it well enough to know when one test ends and the next begins, and writing code to break the data files there. In this data set we know that one test ends when the water temperature reaches 140 deg F and the heat pump turns off. This can be identified because the electricity drawn by the heat pump water heater suddenly drops from a few hundred watts to 0.

Before identifying the lines in the data frame representing times when the heat pump shut off, we need to read the data files that have been stored in the glob list and prepare to iterate through them. We can do that with a for loop that runs through each item in the list created earlier. The program will then run through each data file and extract the data in each. The for loop can be called with the following code:

for Filename in Filenames:

    Data = pd.read_csv(Filename)

We now have a data frame called “Data” that contains the data from an individual data file and a for loop iterating through all of the files in the folder called “Path.” Note that all of the future code must be indented, as it takes place in the for loop.

To create a list of the lines in the data frame representing the end of each test, we need to do four things:

  1. Create a column in the data frame that has the electricity data shifted by one row. This will be used to identify the row where the electricity turns off.

  2. Subtract the electricity draw from the electricity draw in the new shifted column. Because the shifted column holds the reading from the next row, the value in this column will be negative in the last row before the heat pump turns off.

  3. Create a list to store the rows representing times when the heat pump turned off.

  4. Add the rows representing times when the heat pump turned off to that list.

These objectives can be accomplished with the following four lines of code:

Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods = -1)

Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']

End_Of_Tests = []

End_Of_Tests = End_Of_Tests + Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()

The end result is a list, called “End_Of_Tests”, that identifies the last row of each test in the currently open data file. Later we will use this list to actually split the data from the file.

The code shown above created a new column in the data frame. This isn’t necessarily a problem, but deleting that column creates a cleaner result when saving the final data frame. It can be removed with the following line:

del Data['P_Elec_Shift (W)']
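
To see what the shift and subtraction do, here is a minimal sketch on invented power readings covering two short tests. The values are made up so that power only ever falls when the heat pump shuts off.

```python
import pandas as pd

# Made-up power readings for two short tests separated by idle periods
data = pd.DataFrame({'P_Elec (W)': [0.0, 400.0, 410.0, 0.0, 0.0, 395.0, 405.0, 0.0]})

# shift(periods=-1) places the next row's reading in the current row
data['P_Elec_Shift (W)'] = data['P_Elec (W)'].shift(periods=-1)
data['P_Elec_Shift (W)'] = data['P_Elec_Shift (W)'] - data['P_Elec (W)']

# A negative difference means the next reading is lower than the current one
end_of_tests = data[data['P_Elec_Shift (W)'] < 0].index.tolist()
print(end_of_tests)  # [2, 6]
```

Rows 2 and 6 are the last rows in which the heat pump is drawing power. Keep in mind that this test flags any decrease in power, so measured data with fluctuating readings would need a stricter condition, such as a minimum size for the drop.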

Splitting the Files and Identifying the Conditions

With the last row of data in each test contained in “End_Of_Tests”, it’s time to write the code breaking out individual tests and identifying the conditions of those tests. To do that, we need to iterate through the “End_Of_Tests” list and break out the sections of the data between the indices. We do this with the following steps:

  1. First, we need to create a loop that iterates through the indices contained in the list. This is done with a standard for loop, and range declaration.

  2. Second, we need to create an if statement that identifies whether it’s the first time through the for loop or not. Keep in mind that the first entry in “End_Of_Tests” is the end of the first test, so the first time through the loop will need special treatment to identify the beginning of that test.

  3. Read the data from the identified sections into a new file containing the data of this specific test.

This can be accomplished with the following code.

for i in range(len(End_Of_Tests)):

    if i == 0:

        File_SingleTest = Data[0:End_Of_Tests[i]+1]

    else:

        File_SingleTest = Data[End_Of_Tests[i-1]+1:End_Of_Tests[i]+1]

Note that this now has the code contained inside a nested for loop. The beginning of this for loop will have to be indented because it’s inside the glob list loop, and the if statements will have to be indented twice because they’re inside this second for loop as well. All future code will have to be indented twice, as it belongs inside this second for loop.

The above code will identify the rows of a given test and add them to a new data frame called “File_SingleTest”. Since it’s nested within a for loop, it will do this once for each test in the data file.
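
The slicing logic can be tried on its own with a small made-up data frame. In this minimal sketch the list of end rows is typed in by hand, standing in for the list that End_Of_Tests would hold, and each test is collected in a list so the result can be inspected.

```python
import pandas as pd

data = pd.DataFrame({'P_Elec (W)': [0.0, 400.0, 410.0, 0.0, 0.0, 395.0, 405.0, 0.0]})
end_of_tests = [2, 6]  # hand-typed last row of each test, for illustration

tests = []
for i in range(len(end_of_tests)):
    if i == 0:
        start = 0                        # the first test starts at the top of the file
    else:
        start = end_of_tests[i - 1] + 1  # later tests start right after the previous end
    # iloc slices by position and excludes the stop value, hence the + 1
    tests.append(data.iloc[start:end_of_tests[i] + 1])

print([len(test) for test in tests])  # [3, 4]
```

The first test holds rows 0 through 2 and the second holds rows 3 through 6, so each test begins with the idle rows that preceded it.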

The next step is reading the data contained in the file, and matching it to the correct test in the test plan. This can be done using the following steps.

  1. Read the data in File_SingleTest to identify the ambient temperature during that test. To do this, we calculate the mean ambient temperature from the last 50 data points.

  2. Compare that ambient temperature to the three ambient temperatures in the test plan. We will do that by creating a new column in the test plan data frame that shows the absolute value of the difference between the value in each row of the test plan and the average value in the last 50 data points of the test.

  3. Identify the row in the test plan with the minimum difference between the plan and the test, and take the value called for in that row of the test plan.

This can be accomplished with the following three lines of code:

Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()

Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)

Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']

Keep in mind that this code is nested inside the two for loops, and executes once for each test period identified in each file of tests. In this way the Temperature_Ambient variable will at some point hold the ambient temperature of each test in the entire project.
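
The matching step can be sketched in isolation as follows. The test plan values and the measured temperature are hypothetical; only the column name comes from the code above. This version keeps the error in a separate Series instead of adding a column to the test plan, which leaves the test plan unchanged.

```python
import pandas as pd

# Hypothetical test plan; only the column name comes from the tutorial
test_plan = pd.DataFrame({'Ambient Temperature (deg F)': [55, 70, 95]})

measured = 55.8  # made-up mean ambient temperature from the end of a test

# Absolute difference between each planned temperature and the measured value
error = (test_plan['Ambient Temperature (deg F)'] - measured).abs()
# idxmin returns the row label of the smallest error
nominal = test_plan.loc[error.idxmin(), 'Ambient Temperature (deg F)']

print(nominal)  # 55
```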

Saving the File With a Descriptive Name

Now that we have the code to separate the different tests and identify the ambient temperature of each, the last step is saving the results. When doing this it’s important to remember to use dynamic names that describe the data contained in the file. In this case the only part that’s changing is the ambient temperature during the test, so that will be the only variable in the code. With that in mind, we can save the data files with the following steps:

  1. First, we need to specify the folder that the data should be saved in. This will form part of the final path for the save process.

  2. Second, we need to make sure that folder exists. We can use the os commands to check for it and create it, if it doesn’t exist.

  3. Specify the filename for the single test. This will be a dynamic filename referencing the ambient temperature of the test in question.

  4. Save the file to the new filename.

These four steps can be completed with five lines of code (Checking to see if the folder exists takes one line of code, creating the folder takes a second):

Folder = Path + "\\Files_IndividualTests"

if not os.path.exists(Folder):

    os.makedirs(Folder)

Filename_SingleTest = "\\PerformanceMap_HPWH_" + str(int(Temperature_Ambient)) + ".csv"

File_SingleTest.to_csv(Folder + Filename_SingleTest, index = False)
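
For readers who prefer not to build paths by hand, the following is a minimal sketch of the same save step using os.path.join and the exist_ok argument of os.makedirs. The data frame contents are made up, and a temporary folder stands in for the Path variable.

```python
import os
import tempfile

import pandas as pd

path = tempfile.mkdtemp()  # stands in for the Path variable
file_single_test = pd.DataFrame({'T_Amb (deg F)': [55.1, 54.9]})  # made-up data
temperature_ambient = 55.0

folder = os.path.join(path, 'Files_IndividualTests')
# exist_ok=True makes the call safe whether or not the folder already exists
os.makedirs(folder, exist_ok=True)

filename_single_test = 'PerformanceMap_HPWH_' + str(int(temperature_ambient)) + '.csv'
file_single_test.to_csv(os.path.join(folder, filename_single_test), index=False)

print(os.listdir(folder))  # ['PerformanceMap_HPWH_55.csv']
```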

And that’s it! The program now has the code it needs to save the files. This means that your program, if you’ve been following along, is fully capable of 1) Opening each .csv data file in the specified folder, 2) Identifying the rows of data corresponding to each test in each of those files, 3) Breaking those data sets into separate files for each individual test, 4) Identifying the test in the plan that matches each file identified in the data set, and 5) Saving each file to a file name that describes the data contained in the test.

The Results

So what do the results look like? If you purchased the accompanying data set and followed along with the process, you should have a folder containing three files. Each of those files represents a different test. They will have descriptive names stating what is in them. The names will state the type of test, namely that they’re results used in the project creating a performance map for heat pump water heaters, and information stating the ambient temperature of each test. The folder containing the data should look like Figure 2.

Figure 2: The Resulting Files for each Individual Test

Each individual file contains a subset of the data presented in Performance Map Tutorial: Introducing the HPWH Performance Map Tutorial Data Set. That post showed a single file with data from three tests. During the post, the temperature of water in the tank and the ambient temperature were highlighted. Figures 3 and 4 show the data set from the first result file. Comparing the water temperatures in Figure 3 and the ambient temperatures in Figure 4 will make it clear that this data matches what was shown in the first test period of that data set.

Figure 3: Water Temperature in the Tank in the First Result File

Figure 4: Ambient Temperature in the First Result File

Next Steps

In this part of the tutorial we discussed how to read data files containing several tests, split them apart into several different files, and store them in files with descriptive names. This laid the groundwork for the next step, which is analyzing each data file and storing the results for later use. We’ll write a script that can open each data file in the folder, filter out the data that doesn’t suit the needs of the project, perform calculations studying the performance of the heat pump, develop regressions predicting the performance of the unit, and check the accuracy of those regressions.

Analyzing Data Sets With Multiple Test Types

The previous posts have all discussed methods for automating data analysis using Python when all tests are similar. This won’t always be the case. Sometimes tests will be used for different purposes; for example, some tests may collect data for regression development, while others search for behaviors or control logic in specific test cases. This creates an added level of complexity when writing scripts to analyze the data; the approach must be flexible enough to correctly analyze each of these different cases. This post describes how to create a central program which is flexible enough to handle all of these data analysis needs.

Generating Regressions from Stored Data

The final topic to discuss in Python-based automation of laboratory data analysis is generating and validating regressions from the stored data. This is typically the ultimate goal of laboratory data analysis projects, and there are still several things to think through before declaring the project completed. This post will introduce and discuss topics such as identifying the best regression form, different tools for generating regressions, and validating models.

Storing Intermediate Results for Later Analysis

So far, all of the discussion has been in analyzing results from individual tests. The next step is to begin to think bigger picture, and create ways to combine those individual test results into data sets describing the results from the entire project. The first step is storing the individual test results in a logical manner, which facilitates later analysis. This post provides guidance on how to do that.

Checking the Quality of Testing and Analysis Results

One challenge of automated data analysis is that of checking the results. There is potential for errors in testing and in data analysis, both of which can be caught quickly when manually analyzing data. This post provides some methods of doing the same error checking with automated processes, and provides example Python code.

An Introduction to Python Packages that are Useful for Automating Data Analysis

Automating analysis of each individual test relies on the capabilities of several available packages. These packages include glob, pandas, bokeh, and matplotlib. This post provides an introduction to these packages, and future posts will provide a much more thorough description of individual capabilities.

How to Identify the Conditions of Laboratory Tests and Split Large Data Files

When automating laboratory data analysis, it’s critical that the program have a way to identify the conditions of the test. Sometimes this is easier said than done, as file names may consist of nondescript numbers, provide more information about when the test was run than the test itself, or contain data from several tests in a single file. This post provides some ways to overcome these obstacles, complete with example Python code.

Designing Data Sets for Automated Laboratory Data Analysis

Automating laboratory data analysis is either simple or a nightmare depending on how the data set is structured. This post describes some of the fundamental challenges, and provides several possible solutions to make your data science life easier.

The Structure of Automating Laboratory Data Analysis

Since laboratory experimentation and the associated data analysis are a common part of scientific research, the next series of posts will focus on how to automate this process. First, we'll present the structure and big-picture design of a project before moving on to discuss several of the topics in significantly more depth. This series of posts will focus primarily on the data science portion of the project, with some brief discussion of collaborating with the laboratory testers.

The Structure of a Laboratory Experiment Based Project with Automated Data Analysis

Unfortunately, each project must be approached individually, and a detailed yet generic solution doesn’t exist. However, there is a fundamental approach that can be applied to every project, with the specific programming (primarily the calculations) changing between projects. The following general procedure provides the structure of an automated data analysis project. Several of these individual steps will be addressed in detail in later posts.

1. Create the test plan

Determine what tests need to be performed to generate the data set needed to answer the research question. This ensures that a satisfactory data set is available when generating regressions at the end of the project, and avoids needing to perform extra tests.

2. Design the data set to allow automation

This includes specifying what signals will be used to identify the most important sections of the tests, or the sections that will be analyzed by the project. This ensures that there will be an easy way to structure the program to identify the results of each individual test.
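As a minimal sketch of what such a signal can look like, the example below assumes that the data set records electric power, and that the device is active whenever power exceeds a threshold. The column names, the 50 W threshold, and the data values are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical test data: the device draws power only during the middle rows.
data = pd.DataFrame({
    "Time (min)": [0, 1, 2, 3, 4, 5],
    "Power (W)": [0.0, 0.0, 410.0, 425.0, 430.0, 0.0],
    "Water Temperature (deg C)": [20.0, 20.0, 21.5, 23.1, 24.8, 24.9],
})

# Assumed rule: the device is considered active when power exceeds 50 W.
POWER_THRESHOLD_W = 50.0

# Keep only the rows where the signal says the important part of the test is running.
active = data[data["Power (W)"] > POWER_THRESHOLD_W].reset_index(drop=True)

print(active)
```

Because the signal is recorded in the data itself, the same filter can be applied to every test file without manually marking start and end times.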

3. Create a clear file naming system

Either create a data printing method that makes identification of the test conditions in each test straightforward, or collaborate with the lab tester to do so. This ensures that the program will be able to identify the conditions of each test, which is necessary for analyzing the data and storing the results.
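To illustrate why a clear naming system matters, here is a short sketch that reads the test conditions back out of a file name. The naming pattern (Ambient_20C_Setpoint_55C.csv), the function name parse_conditions, and the returned keys are assumptions made for this example, not a required convention.

```python
import os
import re

# Hypothetical naming scheme: Ambient_<value>C_Setpoint_<value>C.csv
NAME_PATTERN = re.compile(
    r"Ambient_(?P<ambient>-?\d+(?:\.\d+)?)C_Setpoint_(?P<setpoint>-?\d+(?:\.\d+)?)C"
)

def parse_conditions(path):
    """Return the test conditions encoded in a file name as a dictionary."""
    name = os.path.splitext(os.path.basename(path))[0]
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        # Fail loudly so that misnamed files are noticed instead of silently skipped.
        raise ValueError(f"Unrecognized file name: {name}")
    return {
        "ambient_c": float(match["ambient"]),
        "setpoint_c": float(match["setpoint"]),
    }

conditions = parse_conditions("Data/Ambient_20C_Setpoint_55C.csv")
print(conditions)
```

Raising an error on an unrecognized name is a design choice: it surfaces naming mistakes early rather than letting a test drop out of the analysis unnoticed.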

4. Store the resulting data files in a specific folder

This allows use of the Python package "glob" to sequentially open and analyze the data from each individual test.

5. Analyze the results of individual tests

Create a program to automatically cycle through all of the data files, and analyze each data set. This program will likely use a for loop and glob to automatically analyze every data file. It will likely use pandas to perform the calculations to identify the desired result of the test, and create checks to ensure that the test was performed correctly. It will also likely include plotting features with either bokeh or matplotlib.
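The loop described above might be sketched as follows. To keep the example runnable on its own, it first writes two small, hypothetical data files to a temporary folder; in a real project the folder would already contain the test data. The file names, the "Power (W)" column, and the choice of mean power as the result are assumptions for illustration only.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder of hypothetical test files so the sketch runs on its own.
folder = tempfile.mkdtemp()
for ambient, powers in [(10, [400.0, 420.0]), (20, [380.0, 390.0])]:
    pd.DataFrame({"Power (W)": powers}).to_csv(
        os.path.join(folder, f"Ambient_{ambient}C.csv"), index=False
    )

results = {}
# sorted() makes the processing order repeatable; glob alone does not guarantee an order.
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    data = pd.read_csv(path)
    test_name = os.path.splitext(os.path.basename(path))[0]
    # Placeholder calculation: a real project would compute its own result here.
    results[test_name] = data["Power (W)"].mean()

print(results)
```

Adding a new test to the analysis then only requires saving its data file in the folder; the loop picks it up on the next run.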

6. Include error checking options

Any number of errors can occur in this process. Maybe some of the tests had errors. Maybe there was a mistake in the programmed calculations. Make life easier by ensuring that the program provides ample outputs to check the quality of the test results and the subsequent data analysis. This could mean printing plots from the test that allow visual inspection, or adding an algorithm that compares the measured data and calculations to expectations and reports errors.
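One simple form of the comparison algorithm is sketched below. It assumes the check of interest is whether the average measured ambient temperature stayed within a tolerance of the target; the function name check_test, the 1 °C default tolerance, and the sample values are hypothetical.

```python
def check_test(measured_ambient_c, target_ambient_c, tolerance_c=1.0):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if not measured_ambient_c:
        problems.append("no ambient temperature measurements found")
        return problems
    average = sum(measured_ambient_c) / len(measured_ambient_c)
    if abs(average - target_ambient_c) > tolerance_c:
        problems.append(
            f"average ambient {average:.1f} deg C is more than "
            f"{tolerance_c} deg C from the target {target_ambient_c} deg C"
        )
    return problems

# A test that held its condition, and one that drifted well above the target.
good_test = check_test([19.8, 20.1, 20.3], target_ambient_c=20.0)
bad_test = check_test([23.0, 23.4, 22.9], target_ambient_c=20.0)

print(good_test)
print(bad_test)
```

Returning a list of messages rather than a single pass/fail flag leaves room to add more checks later and to report all of them for each test.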

7. Store the data logically

The calculated values from each test need to be stored in tables and data files for later use. How these values are stored can either make the remaining steps easy, or impossible. The data should often be stored in different tables that provide the data set needed to later perform regressions.
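As one possible approach, the results of each test can be collected as rows and written to a single summary table. The sketch below assumes one row per test condition; the column names, the file name Results_Summary.csv, and the numbers are hypothetical placeholders rather than measured data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical results, one dictionary per analyzed test condition.
rows = [
    {"Ambient Temperature (deg C)": 10, "Water Temperature (deg C)": 30, "COP": 3.1},
    {"Ambient Temperature (deg C)": 10, "Water Temperature (deg C)": 50, "COP": 2.4},
    {"Ambient Temperature (deg C)": 20, "Water Temperature (deg C)": 30, "COP": 3.9},
    {"Ambient Temperature (deg C)": 20, "Water Temperature (deg C)": 50, "COP": 3.0},
]

summary = pd.DataFrame(rows)

# A temporary folder keeps the sketch self-contained; a real project would use its own results folder.
output_path = os.path.join(tempfile.mkdtemp(), "Results_Summary.csv")
summary.to_csv(output_path, index=False)

# Reading the file back confirms that a later regression program can open it.
reloaded = pd.read_csv(output_path)
print(reloaded)
```

Storing the test conditions alongside the result in every row means the regression program does not need to parse file names again.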

8. Generate regressions from the resulting data set

Create a program that will open the stored data from Step 7 and create regressions. It should include an algorithm to create each desired regression, matching the data storage structure determined in Step 7. Ensure that this program provides adequate outputs, both statistical and visual, to allow thorough validation of the results.
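A minimal sketch of the regression step, using NumPy's polyfit and poly1d, is shown below. The data are synthetic and deliberately follow an exact straight line so the fitted coefficients are easy to verify; the variable names and the choice of a first-order fit are assumptions, and real data will usually call for checking several regression forms.

```python
import numpy as np

# Synthetic, hypothetical data that follows COP = 5.0 - 0.04 * T exactly.
water_temp_c = np.array([20.0, 30.0, 40.0, 50.0])
cop = 5.0 - 0.04 * water_temp_c

# Fit a first-order polynomial; coefficients are returned highest order first.
coefficients = np.polyfit(water_temp_c, cop, 1)

# poly1d turns the coefficients into a callable regression.
regression = np.poly1d(coefficients)
predicted = regression(35.0)

print(coefficients)
print(predicted)
```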

9. Validate the results

Validate the resulting regressions using the statistical and visual outputs provided in Step 8. Determine whether the model is accurate enough or not. If not, either return to Step 8 and generate different regressions, or return to Step 1 and add additional tests to create a more comprehensive data set. If the model is accurate enough, publish detailed descriptions of its strengths and weaknesses so that future users understand what situations the model should/should not be used in.
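Two statistical outputs that are commonly used for this kind of validation are the coefficient of determination (R²) and the maximum absolute error. The sketch below computes both for a small set of hypothetical measured and predicted values; what counts as "accurate enough" depends on the project and is not specified here.

```python
import numpy as np

# Hypothetical measured results and the corresponding regression predictions.
measured = np.array([4.2, 3.8, 3.4, 3.0])
predicted = np.array([4.1, 3.9, 3.4, 3.0])

residuals = measured - predicted

# R^2 compares the unexplained variation to the total variation in the measurements.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((measured - measured.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# The largest single miss is often more informative than an average error.
max_abs_error = np.max(np.abs(residuals))

print(r_squared)
print(max_abs_error)
```

Plotting the measured values against the predictions alongside these numbers provides the visual check that the statistics alone cannot.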

Next Up: Designing Data Sets to Allow Automation

Those nine steps provide the framework of a research project with automated data analysis. The upcoming series of posts will dive into the details of specific points. Next week we'll start by exploring Step 2, with a thorough discussion of how to design data sets to allow automated data analysis.