Performance Map Tutorial: Splitting The Data Set into Individual Files

As discussed in the previous posts, we are now starting the process of using Python scripting to automatically analyze laboratory data and create a performance map predicting the coefficient of performance of heat pump water heaters. If this is a surprise, see Performance Map Tutorial: Creating a Performance Map of Heat Pump Water Heater Performance which introduces this series of posts.

In many cases the laboratory data that you receive will contain several tests in a single file. This would be very obnoxious if analyzing the data by hand due to the repetitive, uninteresting process of identifying where each tests and ends, copy/pasting that data to another file, and repeating until all the tests are separated. In a project with hundreds of tests, this process would take hours and be excruciating. Fortunately, it’s possible to easily write a Python script that does it automatically. And that’s what we’re going to learn to do here.

Splitting the Data Set into Individual Files

The following sections will walk you through all of the steps needed to split your data file into several files and give them names which make it easy to understand what each file contains.

Package Import Statements

First, you need to import the packages that are useful for this process. The recommended packages were all described in An Introduction to Python Packages that are Useful for Automating Data Analysis. For this part of the project, the three required packages are:

  1. glob: glob is a package that is useful for automatically cycling through several different files. This is useful if your laboratory data comes in a few files instead of only one. If it comes in one large file, this package is not necessary. For those who purchased the data set accompanying this tutorial and are following along, that data comes in a single file and using glob is not necessary.

  2. pandas: pandas is the go-t package for all data analysis needs. It is excellent for reading data in as tables and manipulating those data sets. It will be a key part of nearly all work done in this blog.

  3. os: The os package gives Python programmers access to a set of commands controlling the computers operating system. There are some commands which will be relevant to this process.

These three packages can all be imported by adding the following three lines of code to your script.


import pandas as pd

import glob

import os

Those three lines will make the commands in those packages available in the program. The portion in the pandas import line saying “as pd” means that pandas can be referenced using “pd” instead of “pandas” throughout the script. This can be convenient as is referenced quite frequently.

Read in the Data

Once the necessary packages are imported and available, the next step is reading the data.

First, use glob to read the data files. Remember that this is done by 1) Specifying the path of the data files, and 2) Specifying the type of files to be read by glob. It can be done using code similar to the following:

Path = r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet'

Filenames = glob.glob(Path + '/*.csv')

Note that the Path variable must state the folder where you have your data located, and will almost certainly need to be changed from the path specified above. The second line instructs the program to find all files of type .csv located in the folder specified with Path and add them to a list called Filenames. The result is that Filenames now contains a list of all .csv files in that folder which can be iterated through to analyze the data.

It’s also necessary to have a test plan stating the conditions of each test. You’ll need to make this file and save it as a .csv file in your work folder. The test plan needs to state the necessary conditions of each test. It’s useful for both communicating with the experimentalist collecting the data, and for automating the data analysis process. For this example, the test plan should look like the one shown in Figure 1.

Figure 1: Test Plan for this Tutorial

The test plan should be read into a pandas data frame using the typical approach. The following code shows how to do this, and store it in the variable “Test_Plan”. Remember that the path used in your program will need to be modified to match the location of the test plan on your computer.

Test_Plan = pd.read_csv(r'C:\Users\JSmith\Desktop\AutomatedDataAnalysisWithPython\Tutorial-HPWHPerformanceMap\DataSet\Additionals\Test_Plan.csv')

Identifying the End of Each Test

The next step in splitting the data set into individual files is identifying the end of each test. This means reading through the data, understanding it well enough to know when one test ends and the next begins, and writing code to break the data files there. In this code we know that one test ends when the water temperature reaches 140 deg F and the heat pump turns off. This can be identified because the electricity drawn by the heat pump water heater suddenly dropping from a few hundred watts to 0.

Before identifying the lines in the data frame representing times when the heat pump shut off, we need to read the data files that have been stored in the glob list and prepare to iterate through them. We can do that with a for loop that runs through each item in the list created earlier. The program will then run through each data file and extract the data in each. The for loop can be called with the following code:

for Filename in Filenames:

Data = pd.read_csv(Filename)

We now have a data frame called “Data” that contains the data from an individual test file and a for loop iterating through all of the files in the folder called “Path.” Note that all of the future called must be indented, as it takes place in the for loop.

To create a list of the lines in the data frame representing the end of each test, we need to do four things:

  1. Create a column in the data frame that has the electricity data shifted by one row. This will be used to identify the row where the electricity turns off.

  2. Subtract the electricity draw from the electricity draw in the new shifted column. The value in this columns will be negative in the row after the heat pump turns off.

  3. Create a list to store the rows representing times when the heat pump turned off.

  4. Add the rows representing times when the heat pump turned off to that list.

These objectives can be accomplished with the following four lines of code:

Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods = -1)

Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']

End_Of_Tests = []

End_Of_Tests = End_Of_Tests + Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()

The end result is a list, called “End_Of_Tests”, that identifies the last row of each test in the currently open data file. Later we will use this list to actually split the data from the file.

The code shown above created a new column in the data frame. This isn’t necessarily a problem, but deleting that column creates a cleaner result when saving the final data frame. It can be removed with the following line:

del Data['P_Elec_Shift (W)']

Splitting the Files and Identifying the Conditions

With the last row of data in each test contained in “End_Of_Testsit’s time to write the code breaking out individual tests and identifying the conditions of those tests. To do that, we need to iterate through the “End_Of_Testslist and break out the sections of the between the indices. We do this with the following steps:

  1. First, we need to create a list that iterates through the indices contained in the list. This is done with a standard for loop, and range declaration.

  2. Second, we need to create an if statement that identifies whether it’s the first time through the for loop or not. Keep in mind that the first entry in “End_Of_Testsis the end of the first test, so the first time through the loop will need special treatment to identify the beginning of that test.

  3. Read the data from the identified sections into a new file containing the data of this specific test.

This can be accomplished with the following code.

for i in range(len(End_Of_Tests)):

if i == 0:

File_SingleTest = Data[0:End_Of_Tests[i]+1]

else:

File_SingleTest = Data[End_Of_Tests[i-1]+1:End_Of_Tests[i]+1]

Note that this now has the code contained inside a nested for loop. The beginning of this for loop will have to be indented because it’s inside the glob list loop, and the if statements will have to be indented twice because they’re inside this second for loop as well. All future code will have to be indented twice, as it belongs inside this second for loop.

The above code will identify the rows of a given tests and add it to a new data frame called “File_SingleTest”. Since it’s nested within a for loop, it will do this once for each test in the data file.

The next step is reading the data contained in the file, and matching it to the correct test in the test plan. This can be done using the following steps.

  1. Read the data in File_SingleTest to identify the ambient temperature during that test. To do this, we calculate the mean ambient temperature from the last 50 data points.

  2. Compare that ambient temperature to the three ambient temperatures in the test plan. We will do that by creating a new column in the test plan data frame that shows the absolute value of the difference between the value in each row of the test plan and the average value in the last 50 data points of the test.

  3. Identify the row in the test plan with the minimum difference between the plan and the test, and take the value called for in that row of the test plan.

This can be accomplished with the following three lines of code:

Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()

Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)

Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']

Keep in mind that this code is nested inside the two for loops, and executes once for each test period identified in each file of tests. In this way the Temperature_Ambient variable will at some point hold the ambient temperature of each test in the entire project.

Saving the File With a Descriptive Name

Now that we have the code to separate the different tests and identify the ambient temperature of each, the last step is saving the results. When doing this it’s important to remember to use dynamic names that describe the data contained in the file. In this case the only part that’s changing is the ambient temperature during the test, so that will be the only variable in the code. With that in mind, we can save the data files with the following steps:

  1. First, we need to specify the folder that the data should be saved in. This will form part of the final path for the save process.

  2. Second, we need to make sure that folder exists. We can use the os commands to check for it and create it, if it doesn’t exist.

  3. Specify the filename for the single test. This will be a dynamic filename referencing the ambient temperature of the test in question.

  4. Save the file to the new filename.

These four steps can be completed with five lines of code (Checking to see if the folder exists takes one line of code, creating the folder takes a second):

Folder = Path + "\Files_IndividualTests"

if not os.path.exists(Folder):

os.makedirs(Folder)

Filename_SingleTest = "\PerformanceMap_HPWH_" + str(int(Temperature_Ambient)) + ".csv"

File_SingleTest.to_csv(Folder + Filename_SingleTest, index = False)

And that’s it! The program now has the code it needs to save the files. This means that your program, if you’ve been following along, is fully capable of 1) Opening each .csv data file in the specified folder, 2) Identifying the rows of data corresponding to each test in each of those files, 3) Breaking those data sets into separate files for each individual test, 4) Identifying the test in the plan that matches each file identified in the data set, and 5) Saving each file to a file name that describes the data contained in the test.

The Results

So what do the results look like? If you purchased the accompanying data set and followed along with the process, you should have a folder containing three files. Each of those files represents a different test. They will have descriptive names stating what is in them. The names will state the type of test, namely that they’re results used in the project creating a performance map for heat pump water heaters, and information stating the ambient temperature of each test. The folder containing the data should look like Figure 2.

Figure 2: The Resulting Files for each Individual Test

Each individual fill contains a subset of the data presented in Performance Map Tutorial: Introducing the HPWH Performance Map Tutorial Data Set. That post showed a single file with data from three tests. During the post, the temperature of water in the tank and the ambient temperature were highlighted. Figures 3 and 4 show the data set from the first result file. Comparing the water temperatures in Figure 3 and the ambient temperatures in Figure 4 will make it clear that this data matches what was shown in the first test period of that data set.

Figure 3: Water Temperature in the Tank in the First Result File

Figure 4: Ambient Temperature in the First Result File

Next Steps

In this part of the tutorial we discussed how to read data files containing several tests, split them apart into several different files, and store them in files with descriptive names. This step laid the groundwork for the next step, which is analyzing each data file and storing the results for later use. That’s what will be covered in the next step. We’ll write a script that can open each data file in the folder, filter out the data the doesn’t suit the needs of the project, perform calculations studying the performance of the heat pump, develop regressions predicting the performance of the unit, and check the accuracy of those regressions.