
Performance Map Tutorial: Splitting The Data Set into Individual Files

As discussed in the previous posts, we are now starting the process of using Python scripting to automatically analyze laboratory data and create a performance map predicting the coefficient of performance of heat pump water heaters. If this is a surprise, see Performance Map Tutorial: Creating a Performance Map of Heat Pump Water Heater Performance which introduces this series of posts.

In many cases the laboratory data that you receive will contain several tests in a single file. This would be very obnoxious to analyze by hand due to the repetitive, uninteresting process of identifying where each test starts and ends, copying and pasting that data into another file, and repeating until all the tests are separated. In a project with hundreds of tests, this process would take hours and be excruciating. Fortunately, it’s possible to write a Python script that does it automatically. And that’s what we’re going to learn to do here.

Splitting the Data Set into Individual Files

The following sections will walk you through all of the steps needed to split your data file into several files and give them names which make it easy to understand what each file contains.

Package Import Statements

First, you need to import the packages that are useful for this process. The recommended packages were all described in An Introduction to Python Packages that are Useful for Automating Data Analysis. For this part of the project, the three required packages are:

  1. glob: glob is a package that is useful for automatically cycling through several different files. This matters when your laboratory data comes in several files instead of only one; if it comes in one large file, glob is not necessary. For those who purchased the data set accompanying this tutorial and are following along, that data comes in a single file, so glob is not strictly needed.

  2. pandas: pandas is the go-to package for all data analysis needs. It is excellent for reading data in as tables and manipulating those data sets. It will be a key part of nearly all work done in this blog.

  3. os: The os package gives Python programmers access to a set of commands controlling the computer’s operating system. Several of its commands will be relevant to this process.

These three packages can all be imported by adding the following three lines of code to your script.


import pandas as pd
import glob
import os

Those three lines will make the commands in those packages available in the program. The portion of the pandas import line saying “as pd” means that pandas can be referenced using “pd” instead of “pandas” throughout the script. This is convenient because pandas is referenced quite frequently.

Read in the Data

Once the necessary packages are imported and available, the next step is reading the data.

First, use glob to read the data files. Remember that this is done by 1) Specifying the path of the data files, and 2) Specifying the type of files to be read by glob. It can be done using code similar to the following:

Path = r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet'
Filenames = glob.glob(Path + '/*.csv')

Note that the Path variable must state the folder where your data is located, and will almost certainly need to be changed from the path specified above. The second line instructs the program to find all files of type .csv located in the folder specified with Path and add them to a list called Filenames. The result is that Filenames now contains a list of all .csv files in that folder, which can be iterated through to analyze the data.
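If you want to verify what glob found before moving on, a quick loop like the following prints each matched path (the output is simply whatever .csv files sit in your folder):

for Filename in Filenames:
    print(Filename)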

It’s also necessary to have a test plan stating the conditions of each test. You’ll need to make this file and save it as a .csv file in your work folder. The test plan is useful both for communicating with the experimentalist collecting the data and for automating the data analysis process. For this example, the test plan should look like the one shown in Figure 1.

Figure 1: Test Plan for this Tutorial

The test plan should be read into a pandas data frame using the typical approach. The following code shows how to do this and stores the result in the variable “Test_Plan”. Remember that the path used in your program will need to be modified to match the location of the test plan on your computer.

Test_Plan = pd.read_csv(r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet\Additionals\Test_Plan.csv')
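If Figure 1 is hard to read, the essential content of the test plan can be sketched directly in pandas. This is only a sketch, not a replacement for reading the real Test_Plan.csv: the 'Test' column is an assumed convenience label, while the 'Ambient Temperature (deg F)' column name matches the code later in this post and the three temperatures come from the tutorial data set.

import pandas as pd

# A sketch of the test plan contents. The 'Test' column is an assumed label;
# the temperature column name and the three values match the rest of this tutorial.
Test_Plan_Sketch = pd.DataFrame({
    'Test': [1, 2, 3],
    'Ambient Temperature (deg F)': [55, 70, 95],
})
print(Test_Plan_Sketch)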

Identifying the End of Each Test

The next step in splitting the data set into individual files is identifying the end of each test. This means reading through the data, understanding it well enough to know when one test ends and the next begins, and writing code to break the data files there. In this case we know that a test ends when the water temperature reaches 140 deg F and the heat pump turns off. This can be identified because the electricity drawn by the heat pump water heater suddenly drops from a few hundred watts to 0.

Before identifying the lines in the data frame representing times when the heat pump shut off, we need to read the data files found by glob and prepare to iterate through them. We can do that with a for loop that runs through each item in the list created earlier. The program will then run through each data file and extract the data in each. The for loop can be created with the following code:

for Filename in Filenames:
    Data = pd.read_csv(Filename)

We now have a data frame called “Data” that contains the data from an individual test file, and a for loop iterating through all of the files in the folder specified by “Path.” Note that all of the code that follows must be indented, as shown, because it takes place inside the for loop.

To create a list of the lines in the data frame representing the end of each test, we need to do four things:

  1. Create a column in the data frame that has the electricity data shifted by one row. This will be used to identify the row where the electricity turns off.

  2. Subtract the electricity draw from the electricity draw in the new shifted column. The value in this column will be negative in the last row before the heat pump turns off.

  3. Create a list to store the rows representing times when the heat pump turned off.

  4. Add the rows representing times when the heat pump turned off to that list.

These objectives can be accomplished with the following four lines of code:

    Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods = -1)
    Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']
    End_Of_Tests = []
    End_Of_Tests = End_Of_Tests + Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()

The end result is a list, called “End_Of_Tests”, that identifies the last row of each test in the currently open data file. Later we will use this list to actually split the data from the file.

The code shown above created a new column in the data frame. This isn’t necessarily a problem, but deleting that column creates a cleaner result when saving the final data frame. It can be removed with the following line:

    del Data['P_Elec_Shift (W)']
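To see what this shift-and-subtract approach produces, here is a minimal, self-contained sketch using made-up power readings: two short heating periods, each ending when the power drops to zero. Note that with real, noisy data a strict comparison against 0 could trigger on small dips in power, so a negative threshold might be safer; the manufactured tutorial data avoids that complication.

import pandas as pd

# Made-up power readings: two short heating periods, each ending when the
# electricity draw drops from several hundred watts to zero.
demo = pd.DataFrame({'P_Elec (W)': [450.0, 450.0, 455.0, 0.0, 0.0, 440.0, 450.0, 0.0]})
demo['P_Elec_Shift (W)'] = demo['P_Elec (W)'].shift(periods = -1) - demo['P_Elec (W)']
print(demo[demo['P_Elec_Shift (W)'] < 0].index.tolist())  # [2, 6]: the last row of each test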

Splitting the Files and Identifying the Conditions

With the last row of data in each test contained in “End_Of_Tests”, it’s time to write the code breaking out individual tests and identifying the conditions of those tests. To do that, we need to iterate through the “End_Of_Tests” list and break out the sections of the data between the indices. We do this with the following steps:

  1. First, we need to create a loop that iterates through the indices contained in the list. This is done with a standard for loop and a range declaration.

  2. Second, we need to create an if statement that identifies whether it’s the first time through the for loop or not. Keep in mind that the first entry in “End_Of_Tests” is the end of the first test, so the first time through the loop needs special treatment to identify the beginning of that test.

  3. Read the data from the identified sections into a new file containing the data of this specific test.

This can be accomplished with the following code.

    for i in range(len(End_Of_Tests)):
        if i == 0:
            File_SingleTest = Data[0:End_Of_Tests[i]+1]
        else:
            File_SingleTest = Data[End_Of_Tests[i-1]+1:End_Of_Tests[i]+1]

Note that the code is now contained inside a nested for loop. The for loop declaration is indented once because it’s inside the glob loop, and the if statements are indented further because they’re inside this second for loop as well. All code from here on is indented twice, as it belongs inside this second for loop.

The above code identifies the rows of a given test and adds them to a new data frame called “File_SingleTest”. Since it’s nested within a for loop, it does this once for each test in the data file. A quick sketch of the slice boundaries follows.
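Continuing the toy example from earlier, End_Of_Tests = [2, 6] means the first pass through the loop takes rows 0 through 2 and the second takes rows 3 through 6 (the plain list below stands in for the rows of the data frame):

End_Of_Tests = [2, 6]
rows = list(range(8))  # stand-in for the rows of the data frame
print(rows[0:End_Of_Tests[0]+1])                  # [0, 1, 2]
print(rows[End_Of_Tests[0]+1:End_Of_Tests[1]+1])  # [3, 4, 5, 6]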

The next step is reading the data contained in the file, and matching it to the correct test in the test plan. This can be done using the following steps.

  1. Read the data in File_SingleTest to identify the ambient temperature during that test. To do this, we calculate the mean ambient temperature from the last 50 data points.

  2. Compare that ambient temperature to the three ambient temperatures in the test plan. We will do that by creating a new column in the test plan data frame that shows the absolute value of the difference between the value in each row of the test plan and the average value in the last 50 data points of the test.

  3. Identify the row in the test plan with the minimum difference between the plan and the test, and take the value called for in that row of the test plan.

This can be accomplished with the following three lines of code:

        Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()
        Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)
        Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']

Keep in mind that this code is nested inside the two for loops, and executes once for each test period identified in each file of tests. In this way, the Temperature_Ambient variable will eventually hold the ambient temperature of each test in the entire project.
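As a worked example of the nearest-match lookup, here is a hedged sketch with a toy test plan; the measured value of 54.3 deg F is made up, while the three plan temperatures are the ones used in this tutorial:

import pandas as pd

plan = pd.DataFrame({'Ambient Temperature (deg F)': [55, 70, 95]})
measured = 54.3  # hypothetical mean of the last 50 ambient readings in a test
plan['Error'] = abs(plan['Ambient Temperature (deg F)'] - measured)
print(plan.loc[plan['Error'].idxmin(), 'Ambient Temperature (deg F)'])  # 55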

Saving the File With a Descriptive Name

Now that we have the code to separate the different tests and identify the ambient temperature of each, the last step is saving the results. When doing this it’s important to use dynamic names that describe the data contained in each file. In this case the only part that changes is the ambient temperature during the test, so that will be the only variable portion of the file name. With that in mind, we can save the data files with the following steps:

  1. First, we need to specify the folder that the data should be saved in. This will form part of the final path for the save process.

  2. Second, we need to make sure that folder exists. We can use the os commands to check for it and create it, if it doesn’t exist.

  3. Specify the filename for the single test. This will be a dynamic filename referencing the ambient temperature of the test in question.

  4. Save the file to the new filename.

These four steps can be completed with five lines of code (checking whether the folder exists takes one line of code, and creating the folder takes a second):

        Folder = Path + "\Files_IndividualTests"
        if not os.path.exists(Folder):
            os.makedirs(Folder)
        Filename_SingleTest = "\PerformanceMap_HPWH_" + str(int(Temperature_Ambient)) + ".csv"
        File_SingleTest.to_csv(Folder + Filename_SingleTest, index = False)

And that’s it! The program now has the code it needs to save the files. This means that your program, if you’ve been following along, is fully capable of 1) Opening each .csv data file in the specified folder, 2) Identifying the rows of data corresponding to each test in each of those files, 3) Breaking those data sets into separate files for each individual test, 4) Identifying the test in the plan that matches each file identified in the data set, and 5) Saving each file to a file name that describes the data contained in the test.
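For reference, here is the full script assembled from the snippets above, with all of the indentation in one place. The paths are the tutorial placeholders and will need to be changed to match your computer:

import pandas as pd
import glob
import os

# Placeholder paths from this tutorial; change them to match your computer.
Path = r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet'
Filenames = glob.glob(Path + '/*.csv')
Test_Plan = pd.read_csv(r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet\Additionals\Test_Plan.csv')

for Filename in Filenames:
    Data = pd.read_csv(Filename)

    # Flag the last row of each test: the shifted-and-subtracted power column
    # is negative where the heat pump shuts off.
    Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods = -1)
    Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']
    End_Of_Tests = []
    End_Of_Tests = End_Of_Tests + Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()
    del Data['P_Elec_Shift (W)']

    for i in range(len(End_Of_Tests)):
        # The first entry in End_Of_Tests is the end of the first test,
        # so the first pass starts at row 0.
        if i == 0:
            File_SingleTest = Data[0:End_Of_Tests[i]+1]
        else:
            File_SingleTest = Data[End_Of_Tests[i-1]+1:End_Of_Tests[i]+1]

        # Match the measured ambient temperature to the closest test plan entry.
        Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()
        Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)
        Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']

        # Save each test to a file named for its ambient temperature.
        Folder = Path + "\Files_IndividualTests"
        if not os.path.exists(Folder):
            os.makedirs(Folder)
        Filename_SingleTest = "\PerformanceMap_HPWH_" + str(int(Temperature_Ambient)) + ".csv"
        File_SingleTest.to_csv(Folder + Filename_SingleTest, index = False)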

The Results

So what do the results look like? If you purchased the accompanying data set and followed along with the process, you should have a folder containing three files. Each of those files represents a different test, with a descriptive name stating what it contains: the type of test, namely that these are results used in the project creating a performance map for heat pump water heaters, and the ambient temperature of that test (e.g., PerformanceMap_HPWH_55.csv). The folder containing the data should look like Figure 2.

Figure 2: The Resulting Files for each Individual Test

Each individual file contains a subset of the data presented in Performance Map Tutorial: Introducing the HPWH Performance Map Tutorial Data Set. That post showed a single file with data from three tests, and highlighted the temperature of water in the tank and the ambient temperature. Figures 3 and 4 show the data set from the first result file. Comparing the water temperatures in Figure 3 and the ambient temperatures in Figure 4 makes it clear that this data matches what was shown in the first test period of that data set.

Figure 3: Water Temperature in the Tank in the First Result File

Figure 4: Ambient Temperature in the First Result File

Next Steps

In this part of the tutorial we discussed how to read data files containing several tests, split them apart, and store the individual tests in files with descriptive names. This lays the groundwork for the next step: analyzing each data file and storing the results for later use. That’s what will be covered next. We’ll write a script that can open each data file in the folder, filter out the data that doesn’t suit the needs of the project, perform calculations studying the performance of the heat pump, develop regressions predicting the performance of the unit, and check the accuracy of those regressions.

Performance Map Tutorial: Introducing the HPWH Performance Map Tutorial Data Set

Now that Performance Map Tutorial: Creating a Performance Map for Heat Pump Water Heaters has begun, those interested in following along will benefit from having access to a companion data set. Reading the blog posts alone is valuable because it gives exposure to the information, but having a companion data set allows you to follow along as you read. This way you can write your own code based on what you read in the posts, do the analysis yourself, see how it works, check your results, and leave the tutorial confident that you can write a program to automatically generate a performance map of heat pump water heaters.

Announcing a Sample Data Set for the Tutorial

To help with that goal, I have published my data set in the Store. This is a manufactured data set containing data representing what happens in performance map testing of heat pump water heaters. It emulates the data sets you would receive from the laboratory and includes important measurements including ambient temperature, inlet and outlet water temperature, water temperature measurements at several depths in the storage tank, water flow rate, and electricity consumption. This data will form the basis of the project as we analyze the data to find the change in energy stored in the tank, identify the coefficient of performance (COP) of the heat pump, and create a performance map predicting the COP as a function of the ambient and storage temperatures.

The data set contains manufactured data representing three different tests. In each test, the water in the tank is pushed out and replaced with 72 deg F water until all of the water stored in the tank is at 72 deg F. After the tank is set at the starting temperature, the ambient temperature is set to the intended air temperature. Once the ambient temperature is at the set point, the heat pump is engaged and allowed to bring the water temperature to the set temperature of 140 deg F. Measurements of the electricity tell us how much energy is consumed by the heat pump, and measurements of the water temperature in the tank tell us how much the energy stored in the tank has changed. Therefore, during each test we can calculate the COP of the device as the water temperature changes, for a specific ambient temperature. There are three tests to provide this information at three different ambient temperatures.
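In other words, over any interval the COP is the energy added to the stored water divided by the electricity consumed over that same interval. A tiny illustration with made-up numbers (the variable names are illustrative, not from the tutorial code):

# Illustrative only: COP over an interval, with made-up numbers.
energy_added_to_tank = 1500.0  # Btu, calculated from the water temperature rise
electricity_consumed = 500.0   # Btu, calculated from the power measurements
COP = energy_added_to_tank / electricity_consumed
print(COP)  # 3.0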

Overview of the Tutorial Data Set

Figure 1 shows the water temperatures in the tank during the three tests. The red lines and text describe what is occurring during the testing. The vertical lines and text at the bottom break the entire test period into the three separate tests, showing the repeated nature of the tests. These three tests are each performed at different ambient temperatures. The same pattern is followed within each of the three tests, and is highlighted with the text in test 1. First the water in the tank is replaced with 72 deg F water. Then, when the water in the tank is all at that temperature, the heat pump is used to heat the water up to 140 deg F. Remember that this is a manufactured data set, not actual test data. Real test data will never be as perfect as this data set, but this set does capture all of the main concepts and allows you to practice the techniques.

Figure 1: Water Temperatures in the Tank During HPWH COP Testing

Figure 2 presents the ambient temperatures during the same test period. In this case the data is red, and the descriptive information is black. As in Figure 1, the data is divided into three sections representing the data from the three tests. The descriptive information within each test states the ambient temperature used for that test period. The three tests were done with 55, 70, and 95 deg F ambient air temperatures. Combined with the data during each individual test, tracking the water temperature as energy is added, this provides the information needed to determine the curve predicting the COP of the heat pump as a function of water temperature at three different ambient temperatures. These curves provide the basis of a performance map for the device.

Figure 2: Ambient Temperatures During HPWH COP testing

Next Steps

The next post will begin the process of analyzing this data set, providing detailed tutorials so you can follow along using the data set yourself. We will begin by splitting this single data set into three different files, one for each test. During the process, our program will study the data contained within each test and write a file name detailing the test, providing all of the information needed for future analysis. This will be done using the techniques described in How to Identify the Conditions of Laboratory Tests and Split Large Data Files. Once this process is complete, future posts will detail how to write a program which analyzes those test files and creates the final performance map.

Analyzing Data Sets With Multiple Test Types

The previous posts have all discussed methods for automating data analysis using Python when all tests are similar. This won’t always be the case. Sometimes tests will be used for different purposes; for example, some tests may collect data for regression development, while others search for behaviors or control logic in specific test cases. This creates an added level of complexity when writing scripts to analyze the data; the approach must be flexible enough to correctly analyze each of these different cases. This post describes how to create a central program which is flexible enough to handle all of these data analysis needs.

Generating Regressions from Stored Data

The final topic to discuss in Python-based automation of laboratory data analysis is generating and validating regressions from the stored data. This is typically the ultimate goal of laboratory data analysis projects, and there are still several things to think through before declaring the project completed. This post will introduce and discuss topics such as identifying the best regression form, different tools for generating regressions, and validating models.

Storing Intermediate Results for Later Analysis

So far, all of the discussion has been in analyzing results from individual tests. The next step is to begin to think bigger picture, and create ways to combine those individual test results into data sets describing the results from the entire project. The first step is storing the individual test results in a logical manner, which facilitates later analysis. This post provides guidance on how to do that.

Checking the Quality of Testing and Analysis Results

One challenge of automated data analysis is checking the results. There is potential for errors both in testing and in data analysis, which can be caught quickly when manually analyzing data. This post provides some methods of doing the same error checking with automated processes, and provides example Python code.

An Introduction to Python Packages that are Useful for Automating Data Analysis

Automating analysis of each individual test relies on the capabilities of several available packages. These packages include glob, pandas, bokeh, and matplotlib. This post provides an introduction to these packages, and future posts will provide a much more thorough description of individual capabilities.

How to Identify the Conditions of Laboratory Tests and Split Large Data Files

When automating laboratory data analysis, it’s critical that the program have a way to identify the conditions of the test. Sometimes this is easier said than done, as file names may consist of nondescript numbers, provide more information about when the test was run than the test itself, or contain data from several tests in a single file. This post provides some ways to overcome these obstacles, complete with example Python code.

Designing Data Sets for Automated Laboratory Data Analysis

Automating laboratory data analysis is either simple or a nightmare depending on how the data set is structured. This post describes some of the fundamental challenges, and provides several possible solutions to make your data science life easier.

The Structure of Automating Laboratory Data Analysis

Since laboratory experimentation and the associated data analysis are a common part of scientific research, the next series of posts will focus on how to automate this process. First, we'll present the structure and big-picture design of a project before moving on to discuss several of the topics in significantly more depth. This series of posts will focus primarily on the data science portion of the project, with some brief discussion of collaborating with the laboratory testers.

The Structure of a Laboratory Experiment Based Project with Automated Data Analysis

Unfortunately, each project must be approached individually; a detailed yet generic solution doesn’t exist. However, there is a fundamental approach that can be applied to every project, with the specific programming (primarily the calculations) changing between projects. The following general procedure provides the structure of an automated data analysis project. Several of these individual steps will be addressed in detail in later posts.

1. Create the test plan

Determine what tests need to be performed to generate the data set needed to answer the research question. This ensures that a satisfactory data set is available when generating regressions at the end of the project, and avoids needing to perform extra tests.

2. Design the data set to allow automation

This includes specifying what signals will be used to identify the most important sections of the tests, or the sections that will be analyzed by the project. This ensures that there will be an easy way to structure the program to identify the results of each individual test.

3. Create a clear file naming system

Either create a data printing method that makes identification of the test conditions in each test straightforward, or collaborate with the lab tester to do so. This ensures that the program will be able to identify the conditions of each test, which is necessary for analyzing the data and storing the results.

4. Store the resulting data files in a specific folder

This allows use of the Python package "glob" to sequentially open and analyze the data from each individual test.

5. Analyze the results of individual tests

Create a program to automatically cycle through all of the data files, and analyze each data set. This program will likely use a for loop and glob to automatically analyze every data file. It will likely use pandas to perform the calculations to identify the desired result of the test, and create checks to ensure that the test was performed correctly. It will also likely include plotting features with either bokeh or matplotlib.

6. Include error checking options

Any number of errors can occur in this process. Maybe some of the tests had errors. Maybe there was a mistake in the programmed calculations. Make life easier by ensuring that the program provides ample outputs to check the quality of the test results and the subsequent data analysis. This could mean printing plots from the test that allow visual inspection, or adding an algorithm that compares the measured data and calculations to expectations and reports errors.

7. Store the data logically

The calculated values from each test need to be stored in tables and data files for later use. How these values are stored can make the remaining steps either easy or impossible. The data should often be stored in different tables that provide the data set needed to later perform regressions.

8. Generate regressions from the resulting data set

Create a program that will open the stored data from Step 7 and create regressions. It should include an algorithm to create each desired regression, matching the data storage structure determined in Step 7. Ensure that this program provides adequate outputs, both statistical and visual, to allow thorough validation of the results.

9. Validate the results

Validate the resulting regressions using the statistical and visual outputs provided in Step 8. Determine whether the model is accurate enough or not. If not, either return to Step 7 and generate different regressions, or Step 1 and add additional tests to create a more comprehensive data set. If the model is accurate enough, publish detailed descriptions of its strengths and weaknesses so that future users understand what situations the model should/should not be used in.

Next Up: Designing Data Sets to Allow Automation

Those 9 steps provide the framework of a research project with automated data analysis. The upcoming series of posts will dive into the details of specific points. Next week we'll start by exploring step 2, with a thorough discussion of how to design data sets to allow automated data analysis.