Performance Map Tutorial: Splitting The Data Set into Individual Files

As discussed in the previous posts, we are now starting the process of using Python scripting to automatically analyze laboratory data and create a performance map predicting the coefficient of performance of heat pump water heaters. If this is a surprise, see Performance Map Tutorial: Creating a Performance Map of Heat Pump Water Heater Performance, which introduces this series of posts.

In many cases the laboratory data that you receive will contain several tests in a single file. This would be very obnoxious if analyzing the data by hand due to the repetitive, uninteresting process of identifying where each test begins and ends, copy/pasting that data to another file, and repeating until all the tests are separated. In a project with hundreds of tests, this process would take hours and be excruciating. Fortunately, it’s possible to easily write a Python script that does it automatically. And that’s what we’re going to learn to do here.

Splitting the Data Set into Individual Files

The following sections will walk you through all of the steps needed to split your data file into several files and give them names which make it easy to understand what each file contains.

Package Import Statements

First, you need to import the packages that are useful for this process. The recommended packages were all described in An Introduction to Python Packages that are Useful for Automating Data Analysis. For this part of the project, the three required packages are:

  1. glob: glob is a package that is useful for automatically cycling through several different files. This is useful if your laboratory data comes in a few files instead of only one. If it comes in one large file, this package is not necessary. For those who purchased the data set accompanying this tutorial and are following along, that data comes in a single file and using glob is not necessary.

  2. pandas: pandas is the go-to package for all data analysis needs. It is excellent for reading data in as tables and manipulating those data sets. It will be a key part of nearly all work done in this blog.

  3. os: The os package gives Python programmers access to a set of commands controlling the computer’s operating system. There are some commands which will be relevant to this process.

These three packages can all be imported by adding the following three lines of code to your script.

import pandas as pd

import glob

import os

Those three lines will make the commands in those packages available in the program. The portion in the pandas import line saying “as pd” means that pandas can be referenced using “pd” instead of “pandas” throughout the script. This can be convenient as pandas is referenced quite frequently.

Read in the Data

Once the necessary packages are imported and available, the next step is reading the data.

First, use glob to read the data files. Remember that this is done by 1) Specifying the path of the data files, and 2) Specifying the type of files to be read by glob. It can be done using code similar to the following:

Path = r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet'

Filenames = glob.glob(Path + '/*.csv')

Note that the Path variable must state the folder where you have your data located, and will almost certainly need to be changed from the path specified above. The second line instructs the program to find all files of type .csv located in the folder specified with Path and add them to a list called Filenames. The result is that Filenames now contains a list of all .csv files in that folder which can be iterated through to analyze the data.
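To see how glob behaves before pointing it at real data, here is a minimal sketch. It builds a temporary folder that stands in for your data folder; the file names are made up for illustration. The results are sorted because glob does not promise any particular order.

```python
import glob
import os
import tempfile

# Create a throwaway folder with hypothetical file names for illustration
Path = tempfile.mkdtemp()
for Name in ['Tests_A.csv', 'Tests_B.csv', 'Notes.txt']:
    with open(os.path.join(Path, Name), 'w') as File:
        File.write('placeholder\n')

# Find every .csv file in the folder; the .txt file is ignored
Filenames = sorted(glob.glob(Path + '/*.csv'))

print([os.path.basename(Filename) for Filename in Filenames])
```

Only the two .csv files end up in Filenames, which is the behavior the rest of this post relies on.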

It’s also necessary to have a test plan stating the conditions of each test. You’ll need to make this file and save it as a .csv file in your work folder. The test plan needs to state the necessary conditions of each test. It’s useful for both communicating with the experimentalist collecting the data, and for automating the data analysis process. For this example, the test plan should look like the one shown in Figure 1.

Figure 1: Test Plan for this Tutorial

The test plan should be read into a pandas data frame using the typical approach. The following code shows how to do this, and store it in the variable “Test_Plan”. Remember that the path used in your program will need to be modified to match the location of the test plan on your computer.

Test_Plan = pd.read_csv(r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet\Additionals\Test_Plan.csv')
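If you want to experiment before creating the file, the sketch below reads a stand-in test plan from text instead of from disk. The contents are an assumption: only the column name “Ambient Temperature (deg F)” comes from the code used later in this post, the three temperatures are the ones used in the tutorial data set, and the “Test” column is hypothetical.

```python
import io
import pandas as pd

# Hypothetical contents of Test_Plan.csv; only the column name
# 'Ambient Temperature (deg F)' is taken from the code used later on
Test_Plan_Text = """Test,Ambient Temperature (deg F)
1,55
2,70
3,95
"""

# io.StringIO lets read_csv treat the text as if it were a file on disk
Test_Plan = pd.read_csv(io.StringIO(Test_Plan_Text))

print(Test_Plan)
```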

Identifying the End of Each Test

The next step in splitting the data set into individual files is identifying the end of each test. This means reading through the data, understanding it well enough to know when one test ends and the next begins, and writing code to break the data files there. In this data set we know that one test ends when the water temperature reaches 140 deg F and the heat pump turns off. This can be identified because the electricity drawn by the heat pump water heater suddenly drops from a few hundred watts to 0.

Before identifying the lines in the data frame representing times when the heat pump shut off, we need to read the data files that have been stored in the glob list and prepare to iterate through them. We can do that with a for loop that runs through each item in the list created earlier. The program will then run through each data file and extract the data in each. The for loop can be called with the following code:

for Filename in Filenames:

    Data = pd.read_csv(Filename)

We now have a data frame called “Data” that contains the data from an individual test file and a for loop iterating through all of the files in the folder specified by “Path.” Note that all of the future code must be indented, as it takes place in the for loop.

To create a list of the lines in the data frame representing the end of each test, we need to do four things:

  1. Create a column in the data frame that has the electricity data shifted by one row. This will be used to identify the row where the electricity turns off.

  2. Subtract the electricity draw from the electricity draw in the new shifted column. The value in this column will be negative in the last row before the heat pump turns off, which is the last row of the test.

  3. Create a list to store the rows representing times when the heat pump turned off.

  4. Add the rows representing times when the heat pump turned off to that list.

These objectives can be accomplished with the following four lines of code:

Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods = -1)

Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']

End_Of_Tests = []

End_Of_Tests = End_Of_Tests + Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()

The end result is a list, called “End_Of_Tests”, that identifies the last row of each test in the currently open data file. Later we will use this list to actually split the data from the file.
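To check that the shift-and-subtract approach does what we expect, here is a small sketch run on made-up power readings (the numbers are invented for illustration and are not from the tutorial data set). It contains two heating periods, each ending with the power falling to 0 W.

```python
import pandas as pd

# Made-up power readings: two heating periods, each ending in a drop to 0 W
Data = pd.DataFrame({'P_Elec (W)': [0, 0, 400, 410, 0, 0, 395, 405, 0]})

# Next row's power minus this row's power
Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods=-1)
Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']

# Rows where the power is about to fall are the last rows of each test
End_Of_Tests = Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()

print(End_Of_Tests)
```

This prints [3, 7], the last powered row of each heating period. Note that any decrease in power produces a negative value, so noisier data than this may need a threshold rather than a simple comparison to 0. That is an assumption about real data, not something this tutorial's data set requires.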

The code shown above created a new column in the data frame. This isn’t necessarily a problem, but deleting that column creates a cleaner result when saving the final data frame. It can be removed with the following line:

del Data['P_Elec_Shift (W)']

Splitting the Files and Identifying the Conditions

With the last row of data in each test contained in “End_Of_Tests”, it’s time to write the code breaking out individual tests and identifying the conditions of those tests. To do that, we need to iterate through the “End_Of_Tests” list and break out the sections of the data between the indices. We do this with the following steps:

  1. First, we need to create a loop that iterates through the indices contained in the list. This is done with a standard for loop, and range declaration.

  2. Second, we need to create an if statement that identifies whether it’s the first time through the for loop or not. Keep in mind that the first entry in “End_Of_Tests” is the end of the first test, so the first time through the loop will need special treatment to identify the beginning of that test.

  3. Read the data from the identified sections into a new file containing the data of this specific test.

This can be accomplished with the following code.

for i in range(len(End_Of_Tests)):

    if i == 0:

        File_SingleTest = Data[0:End_Of_Tests[i]+1]

    else:

        File_SingleTest = Data[End_Of_Tests[i-1]+1:End_Of_Tests[i]+1]

Note that this now has the code contained inside a nested for loop. The beginning of this for loop will have to be indented because it’s inside the glob list loop, and the if statements will have to be indented twice because they’re inside this second for loop as well. All future code will have to be indented twice, as it belongs inside this second for loop.

The above code will identify the rows of a given test and add them to a new data frame called “File_SingleTest”. Since it’s nested within a for loop, it will do this once for each test in the data file.
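The slicing logic can be tried on its own with a small made-up data frame. In this sketch the list “Tests” is a hypothetical helper added only so the slices can be inspected afterwards, and the End_Of_Tests values are assumed rather than computed.

```python
import pandas as pd

# Made-up data frame with nine rows and two known end-of-test rows
Data = pd.DataFrame({'P_Elec (W)': [0, 0, 400, 410, 0, 0, 395, 405, 0]})
End_Of_Tests = [3, 7]

Tests = []
for i in range(len(End_Of_Tests)):
    if i == 0:
        # The first test starts at the top of the file
        File_SingleTest = Data[0:End_Of_Tests[i]+1]
    else:
        # Later tests start on the row after the previous test ended
        File_SingleTest = Data[End_Of_Tests[i-1]+1:End_Of_Tests[i]+1]
    Tests.append(File_SingleTest)

for Test in Tests:
    print(Test.index.tolist())
```

The first slice holds rows 0 through 3 and the second holds rows 4 through 7. Any rows after the final end-of-test index (row 8 here) are not assigned to a test.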

The next step is reading the data contained in the file, and matching it to the correct test in the test plan. This can be done using the following steps.

  1. Read the data in File_SingleTest to identify the ambient temperature during that test. To do this, we calculate the mean ambient temperature from the last 50 data points.

  2. Compare that ambient temperature to the three ambient temperatures in the test plan. We will do that by creating a new column in the test plan data frame that shows the absolute value of the difference between the value in each row of the test plan and the average value in the last 50 data points of the test.

  3. Identify the row in the test plan with the minimum difference between the plan and the test, and take the value called for in that row of the test plan.

This can be accomplished with the following three lines of code:

Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()

Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)

Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']

Keep in mind that this code is nested inside the two for loops, and executes once for each test period identified in each file of tests. In this way the Temperature_Ambient variable will at some point hold the ambient temperature of each test in the entire project.
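Here is the matching step run on its own, as a sketch with made-up measurements. The test plan values are the three ambient temperatures used in the tutorial data set; the four measured values are invented for illustration.

```python
import pandas as pd

Test_Plan = pd.DataFrame({'Ambient Temperature (deg F)': [55, 70, 95]})

# Made-up measurements hovering near 70 deg F
File_SingleTest = pd.DataFrame({'T_Amb (deg F)': [69.6, 70.3, 70.1, 69.8]})

# Mean of (up to) the last 50 readings; with fewer rows, all of them are used
Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()

# Distance between each planned temperature and the measured mean
Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)

# Take the planned temperature from the row with the smallest distance
Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']

print(Temperature_Ambient)
```

The measured mean of 69.95 deg F is closest to the 70 deg F row of the plan, so Temperature_Ambient ends up holding 70.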

Saving the File With a Descriptive Name

Now that we have the code to separate the different tests and identify the ambient temperature of each, the last step is saving the results. When doing this it’s important to remember to use dynamic names that describe the data contained in the file. In this case the only part that’s changing is the ambient temperature during the test, so that will be the only variable in the code. With that in mind, we can save the data files with the following steps:

  1. First, we need to specify the folder that the data should be saved in. This will form part of the final path for the save process.

  2. Second, we need to make sure that folder exists. We can use the os commands to check for it and create it, if it doesn’t exist.

  3. Specify the filename for the single test. This will be a dynamic filename referencing the ambient temperature of the test in question.

  4. Save the file to the new filename.

These four steps can be completed with five lines of code (Checking to see if the folder exists takes one line of code, creating the folder takes a second):

Folder = Path + r"\Files_IndividualTests"

if not os.path.exists(Folder):

    os.makedirs(Folder)

Filename_SingleTest = r"\PerformanceMap_HPWH_" + str(int(Temperature_Ambient)) + ".csv"

File_SingleTest.to_csv(Folder + Filename_SingleTest, index = False)

And that’s it! The program now has the code it needs to save the files. This means that your program, if you’ve been following along, is fully capable of 1) Opening each .csv data file in the specified folder, 2) Identifying the rows of data corresponding to each test in each of those files, 3) Breaking those data sets into separate files for each individual test, 4) Identifying the test in the plan that matches each file identified in the data set, and 5) Saving each file to a file name that describes the data contained in the test.
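To see how the pieces fit together, the following sketch assembles the steps from this post into one function and runs it on a tiny made-up data set. The function name “Split_Tests”, the helper names “Start”, “Error”, and “Saved”, and the sample data are all hypothetical. It also builds paths with os.path.join instead of string concatenation, which is a design choice of this sketch and not what the code above does.

```python
import os
import tempfile
import pandas as pd

def Split_Tests(Data, Test_Plan, Folder):
    """Split one data frame into per-test .csv files and return their names."""
    Data = Data.copy()

    # Find the last row of each test from the drop in electric power
    Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods=-1) - Data['P_Elec (W)']
    End_Of_Tests = Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()
    del Data['P_Elec_Shift (W)']

    # Create the output folder if it does not exist yet
    if not os.path.exists(Folder):
        os.makedirs(Folder)

    Saved = []
    for i in range(len(End_Of_Tests)):
        Start = 0 if i == 0 else End_Of_Tests[i-1] + 1
        File_SingleTest = Data[Start:End_Of_Tests[i]+1]

        # Match the measured ambient temperature to the nearest planned value
        Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()
        Error = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)
        Temperature_Ambient = Test_Plan.loc[Error.idxmin(), 'Ambient Temperature (deg F)']

        # Save the test under a name that states its ambient temperature
        Filename_SingleTest = 'PerformanceMap_HPWH_' + str(int(Temperature_Ambient)) + '.csv'
        File_SingleTest.to_csv(os.path.join(Folder, Filename_SingleTest), index=False)
        Saved.append(Filename_SingleTest)
    return Saved

# Made-up data: two short tests at roughly 55 and 95 deg F
Data = pd.DataFrame({
    'P_Elec (W)':    [0, 400, 410, 0, 0, 395, 405, 0],
    'T_Amb (deg F)': [55.2, 54.9, 55.1, 94.0, 95.2, 94.8, 95.1, 95.0],
})
Test_Plan = pd.DataFrame({'Ambient Temperature (deg F)': [55, 70, 95]})

Folder = os.path.join(tempfile.mkdtemp(), 'Files_IndividualTests')
Saved = Split_Tests(Data, Test_Plan, Folder)
print(Saved)
```

With this sample data the function writes two files, one named for the 55 deg F test and one for the 95 deg F test. In a real script you would call it inside the loop over Filenames.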

The Results

So what do the results look like? If you purchased the accompanying data set and followed along with the process, you should have a folder containing three files. Each of those files represents a different test. They will have descriptive names stating what is in them. The names will state the type of test, namely that they’re results used in the project creating a performance map for heat pump water heaters, and information stating the ambient temperature of each test. The folder containing the data should look like Figure 2.

Figure 2: The Resulting Files for each Individual Test

Each individual file contains a subset of the data presented in Performance Map Tutorial: Introducing the HPWH Performance Map Tutorial Data Set. That post showed a single file with data from three tests. During the post, the temperature of water in the tank and the ambient temperature were highlighted. Figures 3 and 4 show the data set from the first result file. Comparing the water temperatures in Figure 3 and the ambient temperatures in Figure 4 will make it clear that this data matches what was shown in the first test period of that data set.

Figure 3: Water Temperature in the Tank in the First Result File

Figure 4: Ambient Temperature in the First Result File

Next Steps

In this part of the tutorial we discussed how to read data files containing several tests, split them apart into several different files, and store them in files with descriptive names. This step laid the groundwork for the next step, which is analyzing each data file and storing the results for later use. That’s what will be covered in the next post. We’ll write a script that can open each data file in the folder, filter out the data that doesn’t suit the needs of the project, perform calculations studying the performance of the heat pump, develop regressions predicting the performance of the unit, and check the accuracy of those regressions.

Performance Map Tutorial: Introducing the HPWH Performance Map Tutorial Data Set

Now that Performance Map Tutorial: Creating a Performance Map for Heat Pump Water Heaters has begun, those interested in following along will benefit from having access to a companion data set. Reading the blog posts would be valuable because it gives exposure to the information, but having a companion set allows you to follow along as you read. This way you can write your own code based on what’s read in the blog posts, do the analysis yourself, see how it works, check your results, and leave the tutorial with the confidence that you can write a program to automatically generate a performance map of heat pump water heaters.

Announcing a Sample Data Set for the Tutorial

To help with that goal, I have published my data set in the Store. This is a manufactured data set containing data representing what happens in performance map testing of heat pump water heaters. It emulates the data sets you would receive from the laboratory and includes important measurements including ambient temperature, inlet and outlet water temperature, water temperature measurements at several depths in the storage tank, water flow rate, and electricity consumption. This data will form the basis of the project as we analyze the data to find the change in energy stored in the tank, identify the coefficient of performance (COP) of the heat pump, and create a performance map predicting the COP as a function of the ambient and storage temperatures.

The data set contains manufactured data representing three different tests. In each test, the water in the tank is pushed out and replaced with 72 deg F water. This is done until all of the water stored in the tank is 72 deg F. After the tank is set at the starting temperature, the ambient temperature is set to the intended air temperature. Once the ambient temperature is at the set temperature, the heat pump is engaged and allowed to bring the water temperature to the set temperature of 140 deg F. Measurements of the electricity tell us how much energy is consumed by the heat pump, and measurements of the water temperature in the tank tell us how much the energy stored in the tank has changed. Therefore, during each test we can calculate the COP of the device with changes in water temperature for a specific ambient temperature. There are three tests to provide this information at three different ambient temperatures.
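As a rough illustration of that calculation, the sketch below works through the arithmetic for one complete heating period. The tank volume and the electricity consumption are hypothetical values picked only for the example, and the water properties are common approximations. The start and end temperatures are the ones stated above.

```python
# Approximate water properties and unit conversion
Density_Water = 8.34        # lb per gallon
SpecificHeat_Water = 1.0    # Btu per lb per deg F
Btu_Per_kWh = 3412.14

# Hypothetical values chosen only for illustration
Volume_Tank = 50.0          # gallons
Electricity_Consumed = 2.5  # kWh used by the heat pump during the test

# Temperatures stated in the data set description
Temperature_Start = 72.0    # deg F
Temperature_End = 140.0     # deg F

# Change in energy stored in the tank, in Btu and then in kWh
Energy_Stored = Volume_Tank * Density_Water * SpecificHeat_Water * (Temperature_End - Temperature_Start)
Energy_Stored_kWh = Energy_Stored / Btu_Per_kWh

# Heat added to the water per unit of electricity consumed
COP = Energy_Stored_kWh / Electricity_Consumed

print(round(Energy_Stored_kWh, 2), round(COP, 2))
```

The real analysis works with the measured temperatures at several depths in the tank and tracks how the COP changes during the test, so treat this only as the underlying energy balance.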

Overview of the Tutorial Data Set

Figure 1 shows the water temperatures in the tank during the three tests. The red lines and text describe what is occurring during the testing. The vertical lines and text at the bottom break the entire test period into the three separate tests, showing the repeated nature of the tests. These three tests are each performed at different ambient temperatures. The same pattern is followed within each of the three tests, and is highlighted with the text in test 1. First the water in the tank is replaced with 72 deg F water. Then, when the water in the tank is all at that temperature, the heat pump is used to heat the water up to 140 deg F. Remember that this is a manufactured data set, not actual test data. Real test data will never be as perfect as this data set, but this set does capture all of the main concepts and allows you to practice the techniques.

Figure 1: Water Temperatures in the Tank During HPWH COP Testing

Figure 2 presents the ambient temperatures during the same test period. In this case the data is red, and the descriptive information is black. As in Figure 1, the data is divided into three sections representing the data from the three tests. The descriptive information within each test states the ambient temperature used for that test period. The three tests were done with 55, 70, and 95 deg F ambient air temperatures. Combined with the data during each individual test, tracking the water temperature as energy is added, this provides the information needed to determine the curve predicting the COP of the heat pump as a function of water temperature at three different ambient temperatures. These curves provide the basis of a performance map for the device.

Figure 2: Ambient Temperatures During HPWH COP testing

Next Steps

The next post will begin the process of analyzing this data set, and providing detailed tutorials so you can follow along using this data set yourself. We will begin with splitting this single data set into three different files, one for each test. During the process, our program will study the data contained within each test and write a file name detailing the test, and providing all information needed for future analysis. This will be done using the techniques described in How to Identify the Conditions of Laboratory Tests and Split Large Data Files. Once this process is complete, future posts will detail how to write a program which analyzes those test files and creates the final performance map.

Python's concurrent.futures module

This will be an uncharacteristically short post. Instead of writing a full-fledged article, I wanted to point you to a tip that I just learned from an article that George Seif posted on Medium [1]. In it he introduces Python’s concurrent.futures module, and explains how it can be used to accelerate automated data analysis processes.

One common theme in this blog is the use of the glob package. Glob creates a list of all of the files in a folder, thereby providing a list of files that you can program Python to iterate through. In this way you can write a program that makes Python perform a set of calculations on every file in that folder, thus analyzing all of the data in a fraction of the time, and with a fraction of the effort that would be required either manually or automatically but without glob.

By default Python uses a single core on your computer for its processes. This means that, when iterating through your glob loop, Python will use a single core to analyze the first data file, then the second, then the third, and so on. What George discovered is that the concurrent.futures package instructs Python to use all of the cores of the computer in parallel. This means that one core would analyze the first file in the glob list. A second core would analyze the second file at the same time. A third would analyze the third file at the same time. And so on, until all of your processors are in use. In cases where you need to analyze thousands, or maybe even millions of data files, this could result in dramatically faster completion times.

The base code needed to use this capability is as shown below. In the code below the variable “Path” takes the place of the path to your data folder, and “Analysis_Script” takes the place of the function you’re using to analyze the data.

with concurrent.futures.ProcessPoolExecutor() as executor:

    Test_Files = glob.glob(Path + "/*.csv")

    executor.map(Analysis_Script, Test_Files)

For a more complete introduction, including timed results showing the speed improvements, see George’s article. It is cited in the footnotes.
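To make the base code easier to try out, here is a slightly fuller sketch. The body of “Analysis_Script” is a placeholder that just counts the lines in a file, and the wrapper “Analyze_Folder” and its “Executor_Class” parameter are hypothetical names introduced for this example; neither comes from George’s article.

```python
import concurrent.futures
import glob

def Analysis_Script(Filename):
    # Placeholder analysis (an assumption): count the lines in one file
    with open(Filename) as File:
        return sum(1 for _ in File)

def Analyze_Folder(Path, Executor_Class=concurrent.futures.ProcessPoolExecutor):
    # Collect the files first, then hand them to the pool of workers
    Test_Files = sorted(glob.glob(Path + '/*.csv'))
    with Executor_Class() as executor:
        # executor.map returns results in the same order as Test_Files
        Results = list(executor.map(Analysis_Script, Test_Files))
    return dict(zip(Test_Files, Results))
```

When using the default process pool on Windows or macOS, the call to Analyze_Folder(Path) should sit inside an if __name__ == '__main__': block so that the worker processes can import the script safely. Passing concurrent.futures.ThreadPoolExecutor as Executor_Class is a convenient way to check the logic without starting separate processes.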

1 Seif, George. “Here’s how you can get a 2-6x speed-up on your data pre-processing with Python.”

Performance Map Tutorial: Creating a Performance Map of Heat Pump Water Heater Performance

One of the biggest challenges when learning a new programming language or technique is finding realistic problems to solve, so one can be confident that (s)he is learning something that will prove to be valuable. To help with that problem, I’ll provide a detailed tutorial teaching the reader to create a performance map predicting the coefficient of performance of the heat pump in heat pump water heaters [1]. This is a task that’s often used in the building energy simulation world to predict the amount of energy, and annual operation cost, of a building. Following the tutorial presented in this series of posts will allow the user to:

  • Learn the art of writing computer programs to automate laboratory data analysis,

  • Obtain a more in-depth understanding of previously mentioned packages commonly used in scientific data analysis, and

  • Understand how these skills can be used in a scientific research career.

Wait, what will we be learning to do?

This tutorial process will walk the user through the process of creating a performance map predicting the coefficient of performance of the heat pump in heat pump water heaters. There are four concepts that must be understood for this to make sense:

What is a heat pump?

A heat pump is a heating and cooling device that essentially uses a refrigerant and a pump to shift heat from one fluid to another. It operates on one fundamental principle: Pressurized refrigerants get very hot, while unpressurized refrigerants get very cold. The heat pump cycle passes through four components (compressor, condenser, expansion valve, and evaporator), as shown in Figure 1 and described in the following steps.

  1. The fluid enters the compressor (i.e., pump) in a cold, gaseous state.

  2. The compressor pumps the gas, causing it to reach a much higher pressure and temperature as it approaches the condenser. To allow the heat pump to function, this temperature is higher than the temperature surrounding the condenser.

  3. The fluid passes through a heat exchanger called a condenser, releasing heat to the surrounding environment. This cools the fluid, though not dramatically.

  4. The hot fluid then passes through an expansion valve, returning it to a much lower pressure and temperature. To allow the heat pump to function, the temperature is colder than the ambient temperature surrounding the evaporator.

  5. The lower temperature fluid passes through another heat exchanger, called an evaporator. Since the fluid entering the evaporator is colder than the surrounding fluid, heat transfers from the surroundings to the refrigerant. As heat enters the refrigerant, it boils and converts back to a gas.

  6. The gas then enters the compressor in a cold, gaseous state and the cycle begins anew.

Figure 1: Schematic of the Refrigeration Cycle in a Heat Pump

Through this process, a heat pump essentially transfers heat from the cold fluid surrounding the evaporator to the hot fluid surrounding the condenser. One example of this is an air conditioner used to cool a home. The evaporator is placed inside the home, and the condenser is placed outside. The cold liquid takes heat from the ~75 °F air in the home. The compressor then pressurizes the gas, causing a higher temperature. The temperature is then so high that the hot gas rejects heat to the outside air, even though that air is much warmer than the air inside the home. The expansion valve then allows the refrigerant to expand and turn into a cold liquid, at which point it re-enters the evaporator to restart the cycle.

What is a Coefficient of Performance?

Heat pumps caused a problem for our traditional understanding of efficiency. This is because they’re a very different technology from what had been used in the past. Traditional heat transfer focused on converting energy from one form to another. For example, burning natural gas to convert chemical energy to thermal energy. Or dropping an object, converting potential energy (Altitude) to kinetic energy (Motion). For any of these examples the maximum efficiency is 100%, because it’s impossible to get more energy out of a system than was present to begin with. You can’t burn gas containing 100 kBtu of thermal energy, and end up with 200 kBtu of heat for instance.

Heat pumps caused a problem due to the very nature of how they function. Electricity is input into the system to power a pump, which compresses the fluid. Thus the traditional definition of efficiency would be calculated using the motion of the fluid and the electrical input. However, in the case of heat pumps, what people really care about is the amount of heat transferred relative to the electrical input. And since no energy is input for the heat transfer directly, it’s entirely possible for the amount of heat moved from one place to the other to exceed the electricity input to the pump.

To overcome this challenge, engineers and scientists created a new term, the Coefficient of Performance (Often abbreviated as “COP”). The COP of a heat pump identifies the amount of heat transfer obtained per unit of electricity input into the pump. It typically exceeds 100%.
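Written as a formula, the definition looks like the following. The symbols are notation chosen here for illustration: Q is the heat transferred and W is the electricity input to the pump, both in the same units.

```latex
\mathrm{COP} = \frac{Q_{\text{heat transferred}}}{W_{\text{electricity input}}}
```

As a hypothetical example, a heat pump that moves 3 kWh of heat while consuming 1 kWh of electricity has a COP of 3, which is 300% in the percentage terms used above.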

What is a Performance Map?

A performance map is a technical term for an equation that states how a value changes as the inputs that drive it change. One example could be how the amount of energy used to heat a house changes with the outdoor air temperature, and the indoor air temperature setpoint. A performance map would be an equation that returns the energy used to heat the building for known indoor and outdoor air temperatures.

Performance maps are important when describing heat pumps because the heat transfer at the condenser and evaporator are both uncontrolled. The heat transfer between the fluid and surrounding air is dependent on the temperature difference. Lower temperatures around the condenser, and higher temperatures around the evaporator lead to higher performance. Since these temperatures are uncontrolled, it’s important to understand how the COP of the heat pump will change as those temperatures change. Then the performance of the heat pump can be predicted, and people can make intelligent choices about what heat pumps to use.

What is a Heat Pump Water Heater?

A heat pump water heater is a type of water heater that uses a heat pump to heat the water. It does this by transferring heat from the surrounding air, to the water held in a storage tank (Typically 50 - 80 gallons). The most common form of heat pump water heater is the configuration called “Integrated” heat pump water heaters. In this case, the condenser is located in the hot water storage tank and the evaporator is located in the surrounding air. This is shown conceptually in Figure 2.

Figure 2: Schematic of a Heat Pump Installed in a Heat Pump Water Heater

Due to the locations of the condenser and the evaporator, the COP of a heat pump water heater will increase as the temperature of the water in the tank decreases and the temperature of the air surrounding the evaporator increases. Since predicting the performance of heat pump water heaters is an important part of building energy simulation, performance maps predicting the COP as a function of these two temperatures are vital.
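To make the idea concrete, here is a sketch of what such a performance map could look like as a function. The linear form and the coefficients a, b, and c are invented placeholders; the actual equation and its coefficients have to come from a regression on test data, which is what later posts in this series work toward.

```python
def COP_Performance_Map(Temperature_Ambient, Temperature_Water,
                        a=4.0, b=0.03, c=-0.02):
    """Illustrative linear performance map; the coefficients are invented."""
    # COP rises with ambient temperature (b > 0) and falls as the
    # water in the tank gets hotter (c < 0)
    return a + b * Temperature_Ambient + c * Temperature_Water

# Evaluate the placeholder map at the three ambient temperatures of the
# tutorial data set, for 120 deg F water in the tank
for Temperature_Ambient in [55, 70, 95]:
    print(Temperature_Ambient, round(COP_Performance_Map(Temperature_Ambient, 120), 2))
```

The only thing to take from the numbers is the trend: the predicted COP increases with ambient temperature and decreases with water temperature, as described above.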

What We Will Cover

This tutorial will walk students through the process of using previously recorded laboratory data to create a performance map for a heat pump water heater. We will walk through each step of the process of writing a program that can perform all of these calculations automatically, thus avoiding the tedium of performing all of the calculations manually. Topics covered will include:

  • Splitting individual tests out of a single data file so they can each be analyzed individually,

  • Analyzing the data from each test, including filtering out unnecessary data, performing calculations, and plotting the results of each test,

  • Both automatically and manually identifying erroneous tests, and replacing them with better data,

  • Storing test results in a central table for later analysis,

  • Using those stored results to create the desired regression (In this case, the performance map of the COP of the heat pump water heater),

  • Validating and visualizing the regression, and

  • Documenting the validity, strengths, and weaknesses of the final regression.

Tutorials walking you through each of these steps will be provided for free on the blog. This will include detailed discussions of the topics, and actual code that can be used to write the desired program. When the process is complete, you should have a program capable of automatically analyzing data.

To make this process more valuable, I will create a data set and answer sheet for the process. The data set will provide sample data similar to what you should expect would come from a laboratory to be used for this process. The answer sheet will consist of the results, and plots generated when running the script, allowing you to check your answers and ensure that your code is working correctly. For those who want to make the most of this guide, those resources will be available in the 1000x Faster store.

This tutorial will assume that the user has Python 2.7, their IDE of choice, and an understanding of the basic syntax and structure of the language.

[1] This sentence included several technical terms, and may be confusing to many. Don’t worry, I’ll be explaining them shortly.

September Announcements

Since the 1000x Faster blog is currently on a hiatus while I create new content, I figure that this is a good time to announce a few changes and upcoming projects.


First off, 1000x Faster now has a newsletter that you can sign up for. This newsletter is expected to be used for announcements of new projects, or new product releases. Examples include announcements of new blog post series discussing new topics (And here's a teaser: Make sure to see the last announcement in this post for one of those!), or releases for publications or data analysis tools. Make sure to sign up using the link at the very bottom of the page to stay informed of all the new happenings here at 1000x Faster.

Patreon Account

Thus far I've published this blog, teaching people concepts needed to make their data analysis processes much faster and easier, free of charge. I want to keep it that way, so everybody can learn from it whether they have the money to pay for education or not. At the same time, I do need to monetize 1000x Faster so that I'm rewarded for the time that I spend providing this value to people. In an attempt to balance these goals, I've created a Patreon account so that those who have money and want to support the project can do so. Hopefully this brings in enough money that I can continue providing this content to as many people as possible. If you're interested in supporting the 1000x Faster project, the link to my Patreon account is just above the Newsletter sign up form at the bottom of the page.

New Blog Post Series: Python Foundations

My previous blog post series focused on a fairly advanced topic: how to use Python to automate laboratory data analysis. This is certainly a valuable topic that many people can benefit from, but all of those posts assumed that the reader had all of the necessary Python tools installed and a basic understanding of how to use them. It briefly introduced several packages that are used in data analysis automation, but only scratched the surface.

Aiming to support those who are newer to Python programming, I'll be starting a series of blog posts I'm calling "Python Foundations." It will provide a more beginner-level introduction, including 1) installing my preferred Python tools, 2) pointing the reader to excellent resources for learning the basic syntax, structure, and commands of Python, and 3) providing detailed tutorials on using many of the useful packages to perform data analysis and plotting. Specific tutorials will be written for important packages such as pandas, bokeh, glob, and matplotlib.

Next Steps

I expect the next blog post to come when I have several of the Python Foundations blog posts created. Hopefully you're excitedly waiting [1].




[1] Excitedly, yes. Though I wouldn't recommend holding your breath. It might be a few weeks.


As most of you have no doubt noticed, the 1000x Faster blog is on a bit of a hiatus right now. It's been a while since I had any projects that involved data analysis, and I haven't had inspiration for new techniques to create or new posts to write. Since I don't want to waste your time with mediocre content, I'm waiting until I have something really good to write before resuming posting.

Don't worry though, I'll be back [1]. A few projects leading to more insights should be coming pretty soon.



[1] Preferably read in your best Arnold Schwarzenegger voice.


Analyzing Data Sets With Multiple Test Types

The previous posts have all discussed methods for automating data analysis using Python when all tests are similar. This won’t always be the case. Sometimes tests will be used for different purposes; for example, some tests may collect data for regression development, while others search for behaviors or control logic in specific test cases. This creates an added level of complexity when writing scripts to analyze the data; the approach must be flexible enough to correctly analyze each of these different cases. This post describes how to create a central program which is flexible enough to handle all of these data analysis needs.

Generating Regressions from Stored Data

The final topic in this discussion of Python-based automation of laboratory data analysis is generating and validating regressions from the stored data. This is typically the ultimate goal of laboratory data analysis projects, and there are still several things to think through before declaring the project completed. This post will introduce and discuss topics such as identifying the best regression form, different tools for generating regressions, and validating models.

Storing Intermediate Results for Later Analysis

So far, all of the discussion has been about analyzing results from individual tests. The next step is to think about the bigger picture and create ways to combine those individual test results into data sets describing the results from the entire project. The first step is storing the individual test results in a logical manner that facilitates later analysis. This post provides guidance on how to do that.

Checking the Quality of Testing and Analysis Results

One challenge of automated data analysis is checking the results. Errors can occur both in testing and in data analysis, and both kinds are caught quickly when analyzing data manually. This post provides some methods of doing the same error checking with automated processes, and provides example Python code.

An Introduction to Python Packages that are Useful for Automating Data Analysis

Automating analysis of each individual test relies on the capabilities of several available packages. These packages include glob, pandas, bokeh, and matplotlib. This post provides an introduction to these packages, and future posts will provide a much more thorough description of individual capabilities.

How to Identify the Conditions of Laboratory Tests and Split Large Data Files

When automating laboratory data analysis, it’s critical that the program have a way to identify the conditions of the test. Sometimes this is easier said than done, as file names may consist of nondescript numbers, provide more information about when the test was run than the test itself, or contain data from several tests in a single file. This post provides some ways to overcome these obstacles, complete with example Python code.

Designing Data Sets for Automated Laboratory Data Analysis

Automating laboratory data analysis is either simple or a nightmare depending on how the data set is structured. This post describes some of the fundamental challenges, and provides several possible solutions to make your data science life easier.

The Structure of Automating Laboratory Data Analysis

Since laboratory experimentation and the associated data analysis are a common part of scientific research, the next series of posts will focus on how to automate this process. First, we'll present the structure and big-picture design of a project before moving on to discuss several of the topics in significantly more depth. This series of posts will focus primarily on the data science portion of the project, with some brief discussion of collaborating with the laboratory testers.

The Structure of a Laboratory Experiment Based Project with Automated Data Analysis

Unfortunately, each project must be approached individually, and a detailed yet generic solution doesn’t exist. However, there is a fundamental approach that can be applied to every project, with the specific programming (primarily the calculations) changing between projects. The following general procedure provides the structure of an automated data analysis project. Several of these individual steps will be addressed in detail in later posts.

1. Create the test plan

Determine what tests need to be performed to generate the data set needed to answer the research question. This ensures that a satisfactory data set is available when generating regressions at the end of the project, and avoids needing to perform extra tests.

2. Design the data set to allow automation

This includes specifying what signals will be used to identify the most important sections of the tests, or the sections that will be analyzed by the project. This ensures that there will be an easy way to structure the program to identify the results of each individual test.

3. Create a clear file naming system

Either create a data printing method that makes identification of the test conditions in each test straightforward, or collaborate with the lab tester to do so. This ensures that the program will be able to identify the conditions of each test, which is necessary for analyzing the data and storing the results.
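As a minimal sketch of what a clear naming system makes possible, the example below reads test conditions straight out of a file name. The naming convention (Name=value pairs separated by underscores) and the condition names Tamb, Twater, and Flow are assumptions made up for illustration; substitute whatever you and your lab tester agree on. It is written for Python 3.

```python
import re

# Hypothetical naming convention: "Tamb=20_Twater=45_Flow=2.5.csv".
# The condition names are assumptions for illustration only.
CONDITION_PATTERN = re.compile(r"([A-Za-z]+)=(-?\d+(?:\.\d+)?)")


def parse_conditions(file_name):
    """Return a dict mapping each condition name in the file name to a float."""
    return {name: float(value)
            for name, value in CONDITION_PATTERN.findall(file_name)}


conditions = parse_conditions("Tamb=20_Twater=45_Flow=2.5.csv")
print(conditions)  # {'Tamb': 20.0, 'Twater': 45.0, 'Flow': 2.5}
```

A file name that doesn't follow the convention simply returns an empty dictionary, which gives the program an easy way to flag files it can't identify.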

4. Store the resulting data files in a specific folder

This allows use of the Python package "glob" to sequentially open and analyze the data from each individual test.

5. Analyze the results of individual tests

Create a program to automatically cycle through all of the data files, and analyze each data set. This program will likely use a for loop and glob to automatically analyze every data file. It will likely use pandas to perform the calculations to identify the desired result of the test, and create checks to ensure that the test was performed correctly. It will also likely include plotting features with either bokeh or matplotlib.
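A rough sketch of the for loop and glob structure described in steps 4 and 5 might look like the following. The column names Heat_W and Power_W, and the calculation performed on them, are hypothetical placeholders; your own data set and calculations will differ.

```python
import glob
import os

import pandas as pd


def analyze_folder(folder):
    """Open every .csv file in `folder` and calculate one result per test."""
    results = []
    # sorted() keeps the processing order the same from run to run
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        data = pd.read_csv(path)
        # Hypothetical calculation: ratio of heat delivered to power consumed.
        # "Heat_W" and "Power_W" are assumed column names, not a standard.
        cop = data["Heat_W"].sum() / data["Power_W"].sum()
        results.append({"File": os.path.basename(path), "COP": cop})
    return pd.DataFrame(results, columns=["File", "COP"])
```

Calling analyze_folder on the folder from step 4 returns a table with one row per data file. The quality checks and plotting mentioned above would be added inside the same loop.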

6. Include error checking options

Any number of errors can occur in this process. Maybe some of the tests had errors. Maybe there was a mistake in the programmed calculations. Make life easier by ensuring that the program provides ample outputs to check the quality of the test results and the data analysis that follows. This could mean printing plots from the test that allow visual inspection, or adding an algorithm that compares the measured data and calculations to expectations and reports errors.
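One possible form of the comparison algorithm is sketched below. It assumes the measured and expected values for a test are available as dictionaries keyed by quantity name. The function name, the quantity names, and the 5% tolerance are all arbitrary choices for illustration.

```python
def check_test(measured, expected, tolerance=0.05):
    """Return a list of messages for values that miss their expected target.

    `measured` and `expected` are dicts keyed by quantity name. `tolerance`
    is the allowed fractional deviation from the expected value (5% here,
    an arbitrary choice).
    """
    errors = []
    for name, target in expected.items():
        if name not in measured:
            errors.append("{}: no measurement found".format(name))
            continue
        allowed = tolerance * abs(target)
        if abs(measured[name] - target) > allowed:
            errors.append("{}: measured {}, expected {} +/- {}".format(
                name, measured[name], target, allowed))
    return errors
```

An empty list means the test passed every check. Anything else can be printed or written to a log so that problem tests are easy to find later.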

7. Store the data logically

The calculated values from each test need to be stored in tables and data files for later use. How these values are stored can either make the remaining steps easy, or impossible. The data should often be stored in different tables that provide the data set needed to later perform regressions.
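As one simple way to do this, the sketch below appends the calculated values from each test to a single CSV table, writing the header row only when the table is first created. The function name and the column names used in the usage note are hypothetical, and the code is written for Python 3.

```python
import csv
import os


def store_result(table_path, row, columns):
    """Append one test's results to a CSV table, adding a header if it's new."""
    is_new = not os.path.exists(table_path)
    with open(table_path, "a", newline="") as table:
        writer = csv.DictWriter(table, fieldnames=columns)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

For example, store_result("results.csv", {"Tamb": 20, "Twater": 45, "COP": 3.1}, ["Tamb", "Twater", "COP"]) adds one row to the table. Keeping the test conditions and the calculated result in the same row means the regression step can read the table back in without any further processing.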

8. Generate regressions from the resulting data set

Create a program that will open the stored data from Step 7 and create regressions. It should include an algorithm to create each desired regression, matching the data storage structure determined in Step 7. Ensure that this program provides adequate outputs, both statistical and visual, to allow thorough validation of the results.
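A minimal sketch of the regression step, assuming a polynomial in a single variable is an adequate regression form (your project may well need something different), could use numpy as shown below. It returns the fitted coefficients along with the R-squared value as one statistical output. The visual outputs are left out to keep the example short.

```python
import numpy as np


def fit_polynomial(x, y, degree=1):
    """Least-squares polynomial fit. Returns (coefficients, r_squared).

    Assumes y is not constant; otherwise r_squared is undefined.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coefficients = np.polyfit(x, y, degree)  # highest power first
    predicted = np.polyval(coefficients, x)
    ss_residual = np.sum((y - predicted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_residual / ss_total
    return coefficients, r_squared
```

The x and y inputs would come from the columns of the tables stored in Step 7. Comparing R-squared across a few candidate values of degree is one quick, though far from complete, way to start evaluating regression forms.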

9. Validate the results

Validate the resulting regressions using the statistical and visual outputs provided in Step 8. Determine whether the model is accurate enough or not. If not, either return to Step 7 and generate different regressions, or Step 1 and add additional tests to create a more comprehensive data set. If the model is accurate enough, publish detailed descriptions of its strengths and weaknesses so that future users understand what situations the model should/should not be used in.

Next Up: Designing Data Sets to Allow Automation

Those 9 steps provide the framework of a research project with automated data analysis. The upcoming series of posts will dive into the details of specific points. Next week we'll start by exploring step 2, with a thorough discussion of how to design data sets to allow automated data analysis.


Welcome to 1000x Faster!

1000x Faster is based on a very simple premise: You have better things to do than manual data analysis. Maybe you'd rather be drawing conclusions from the data and sharing them with your clients. Maybe your preference is for the idea generation and business development side of things. Or maybe higher efficiency in your work yields higher profit margins. Whatever it is that drives you, the goal of 1000x Faster is to help you finish your data analysis in a fraction of the time so you can get to it.