How to Identify the Conditions of Laboratory Tests and Split Large Data Files

In the last post, we examined how to design data sets to allow automated analysis of laboratory data. The next step in the process, once the data has been collected, is to program your script so that it knows the conditions of each test it's analyzing. The program must be able to determine the test conditions to perform the necessary calculations, and to correctly store data for further analysis later. There are several ways to provide the test condition information to the program, including the following methods:

· Writing the test conditions in the name of the data file,

· Putting the test number in the name of the data file and referencing the test plan,

· Putting a header in each test file, and

· Reading the test conditions from the data set.

These solutions all share the assumption that each test comes in an individual data file. However, sometimes data sets have several tests contained in a single data file. In that case, another algorithm must be developed to split the large data file into individual data files. Some possible approaches to this issue include:

· Splitting the data set based on a known conditioning period between tests, and

· Using control information printed in the test.

These topics will be discussed in further detail in the following sections.

Introducing the Needs of DWHR Data Analysis

As with the other posts, this topic will be discussed in the context of drain water heat recovery* data analysis. There are four main parameters that influence the performance of DWHR devices. They are:

· The drain-side water flow rate,

· The cold-side water flow rate,

· The drain-side inlet temperature, and

· The cold-side inlet temperature.

As a result, any method used to identify the conditions of individual tests must allow the program to identify these four parameters.

Identifying Test Conditions

The following sections provide details on the four different methods of identifying the test conditions that were listed in the introduction.

Writing the Conditions in the Name of the Data File

This strategy uses the name of the data file to store all the details of the data file. This way the analysis program can open the file for a specific test, and identify the conditions at the same time. One major strength of this approach is that the test conditions are not only understood by the program, but easily understood by the user. The downside is that file names can get long, and cumbersome to read.

As an example, assume that a test is performed with the following parameters:

· Drain-side flow rate = 3 gal/min,

· Cold-side flow rate = 3 gal/min,

· Drain-side inlet temperature = 100.4 deg F, and

· Potable-side inlet temperature = 50 deg F.

The following is an example file name that contains this information:


Note that decimal points cannot be used in the filename, so the drain-side inlet temperature is written as “1004” instead of “100.4”.

To read the conditions from this filename, the data analysis script will need an algorithm to read the contents of the file name. This could be done using string splitting to state the exact letters in the string to read, but generally it’s better to create an algorithm to read the values located between two identifiers. These identifiers are the characters used to separate the parameters from the descriptive terms in the file name. In the case of our example, they’re “=” and “_”. The values can then be identified using a function designed for the purpose. Continuing the example, the values for the four parameters can be identified using the following code.

def Find_Between(String, First, Last):


        Start = String.index(First) + len(First)

        End = String.index(Last, Start)

        return String[Start:End]


Flow_Drain = float(Find_Between(filename, 'FlowDrain=', '_FlowCold))

Flow_Cold = float(Find_Between(filename, '_FlowCold=', '_TemperatureDrain'))

Temp_Drain = float(Find_Between(filename, '_TemperatureDrain=', '_TemperaturePotable'))/10

Temp_Potable = float(Find_Between(filename, '_TemperaturePotable=', '.csv'))


The Find_Between function is used to identify the section of a string located between two different sections of the same string. It requires three inputs: 1) The full string that is being searched, 2) A section of the string prior to the section of interest, and 3) A section of the string following the section of interest. It then returns the value of the string located between the descriptors. If it doesn’t find one of the specified descriptors, it returns an error.

The four lines of code following the definition of Find_Between call the function to identify the values of the four parameters as specified in the file name, and converts them to float type. In the case of T_Drain, the value is divided by 10 so the program treats it as 100.4 instead of 1004.

Referencing the Test Number in the File Name

An alternative solution is to print the test number in the file name, and have the program open the test file to identify the conditions. This approach is less complicated to implement, and leads to data files with much shorter names. However, the data files are much less descriptive, making it more challenging to identify the contents of each test manually if that proves to be important.

For example, consider a file named “14 DWHR 30Jan2018.csv”. This file represents DWHR test number 14, performed on 30, January 2018. In order to identify the test conditions, the program needs to identify that it’s test number 14 and identify the conditions of test 14 by referencing the test matrix. The following line of code can identify the test conditions:

Number_Test = int(filename[0:3])

This code reads selects the first two indices in the filename, in this case ‘14’, and converts them to an integer value. Note that in tests with only a single digit, the returned string would be followed by a space. That space is dropped when converting to an integer.

The test matrix must then be referenced to identify the conditions of the test, based on the test matrix. It can be opened using the pandas** .read_csv function, as shown below.

Test_Matrix = pd.read_csv(r’C:\Users\JSmith\DataAnalysis\TestMatrix.csv’)

Note that the path in that code is completely made up, and must be modified to read the location of the actual test matrix.

The final step is reading the conditions of the test out of the test matrix. This is accomplished using the .loc function from pandas. Assume that the test matrix appears as shown in Table 1. The rows showing “…” for test number indicate that the matrix has additional tests, but they aren’t shown.

The conditions of the test can be identified using the following code.

Flow_Drain = Test_Matrix.loc[Number_Test, ‘Drain-Side Flow Rate (gal/min)’]

Flow_Cold = Test_Matrix.loc[Number_Test, ‘Cold-Side Flow Rate (gal/min)’]

Temperature_Drain = Test_Matrix.loc[Number_Test, ‘Drain-Side Temperature (deg F)’]

Temperature_Cold = Test_Matrix.loc[Number_Test, ‘Cold-Side Temperature (deg F)’]

This code will search the Test_Matrix data frame created when using pandas.read_csv to read the data frame for the 14th row, where Test_Number = 14, and pull the value from the appropriate column, as specified in each line of code. In this way the conditions of each test can be identified, and are ready for use in later analysis.

Putting a Header in Each File

A third approach is to put the test conditions in a header contained in each data file. In this method, the program opens the data file, reads the test conditions in the header, closes the data file, and re-opens it while excluding the portion containing the header. The two biggest advantages of this method are that 1) It’s simple to implement, and 2) There are no restrictions on the file name. The downside is that, unless the header is generated automatically, adding the headers to the data file can be quite labor intensive and tedious.

In this case, the data file would appear as shown in Table 2.

The first two rows of the above table provide the header. The top row lists the name of the parameter listed in that column, and the second row presents the nominal value of the test conditions. Row three is left blank as a separator, and row four presents the column names for the measured data in the test. The actual test data begins in row 5, and continues until the end of the data file.

To read the header, the program must first open the entire data file. This can be accomplished using the following code.

Data_File = pandas.read_csv(r’C:\Users\JSmith\DataAnalysis\TestData.csv’)

As before, the path used to locate the data file is a placeholder and needs to be changed to the correct path for the file that the program needs to open. Once the file is open, the header can be read using the following code.

Flow_Drain = Data_File.loc[0, ‘Drain-Side Flow Rate (gal/min)’]

Flow_Cold = Data_File.loc[0, ‘Cold-Side Flow Rate (gal/min)’]

Temperature_Drain = Data_File.loc[0, ‘Drain-Side Temperature (deg F)’]

Temperature_Cold = Data_File.loc[0, ‘Cold-Side Temperature (deg F)’]

Now that the program has identified the test conditions, the final step is to open the data file again without including the header. This is accomplished using the following line.

Data_File = pandas.read_csv(r’C:\Users\JSmith\DataAnalysis\TestData.csv’, header = 3)

The addition of “header = 3” tells the program to bypass the first three rows of the file and begin reading the data at the fourth row, which corresponds to the column names for the measured data. The program can now proceed as if the header did not exist.

Reading the Test Conditions from the Data File

A final option is to read the data that’s in the file, and identify the test conditions based on what occurred. This option is the least accurate, as it runs the risk of misidentifying the conditions, and should be considered as a last resort. The strength of this approach is that it doesn’t require assistance from the lab tester, and can identify the test conditions of unspecified data files.

As with the other approaches, this method needs to identify the key parameters used in testing. In our DWHR example, this means the drain and cold side water flow rate as well as the drain and cold side inlet temperature. These four parameters need to be identified by studying the data set. Figures 1 and 2 present a sample data set which demonstrates how this can be done.

Figure 1: Flow Rates During a Sample Data Set

Figure 2: Temperatures During a Sample Data Set

In the example data set, the steady state portion of the test is performed in the final 920 seconds of the test. The nominal conditions of the test can be approximated by calculating the average value of each point during this period. The four parameters required for DWHR analysis can be calculated using the following code.

Flow_Drain = Data_File[‘Drain-Side Flow Rate (gal/min)’][-50:].mean()

Flow_Cold = Data_File[‘Cold-Side Flow Rate (gal/min)’][-50:].mean()

Temperature_Drain = Data_File[‘Drain-Side Inlet Temperature (deg F)’][-50:].mean()

Temperature_Cold = Data_File[‘Cold-Side Inlet Temperature (deg F)’][-50:].mean()

These four lines of code select the last 50 entries in each data column, and calculate the mean. This provides an average value for that data point during the steady state period of the test, and an approximate value representing the test conditions.

The next step is identifying the nominal conditions used to set the test. Because measured data never perfectly matches the nominal conditions, there will be slight differences between the values specified in the test matrix and the values identified by the analysis script. Using the test matrix to identify the nominal values in the script allows easier tracking of results. This is accomplished by comparing the identified values from the test to the nominal values in the test matrix, and identifying the case will the least difference. Consider the following code, that can be used to identify the nominal Flow_Drain value. This example uses the same test matrix show in Referencing the Test Number in the File Name

Test_Matrix = pd.read_csv(r’C:\Users\JSmith\DataAnalysis\TestMatrix.csv’)

Test_Matrix[‘Difference Flow Drain (gal/min)’] = abs(Test_Matrix[‘Drain-Side Flow Rate (gal/min)’] – Flow_Drain)

Flow_Drain = Test_Plan.loc[Test_Plan['Error_Flow_Drain'][Test_Plan_First:Test_Plan_Final].idxmin(), 'Drain-Side Flow Rate (gal/min)']

Three things happen in this code:

1. The first line reads the test matrix.

2. The second line creates a new column in the table that identifies the absolute value of the difference between the nominal conditions of each test and the monitored conditions of the current test.

3. The third line identifies the row with the least difference between the nominal conditions and measured conditions, and sets Flow_Drain equal to the drain flow rate specified for that test.

Keep in mind that this code returns the nominal conditions that most closely match what occurred during the test. In situations with high measurement uncertainty, or with small differences in nominal conditions between each test, this code may identify the incorrect nominal conditions. Therefore, this approach should be considered the last resort.


Splitting Data Files

All the previously described approaches assume that each test has an independent data file. That doesn’t work if the data is delivered with several tests included in a single large data file. In that case, a script must split the different tests out of the data set and create individual data files. This can be done by identifying a repeatable signal, and creating new data files whenever that signal appears. Two examples of repeatable signals, taken from DWHR research, include:

· Each test features a flush period before beginning. The flush period is used to ensure that the DWHR device starts each test at a repeatable condition. It consists of 4.15 gal/min at 70 deg F on both sides of the device. The flow rate and/or temperature conditions can be used to identify this flush period.

· Control valves are used to direct water flow through the different parts of the test apparatus. One valve configuration indicates a flush period, while another indicates the test period. The flow valve configuration is another option for identifying the flush period.

Note that the second option, using control valve status, is more reliable than using the temperature and/or flow conditions. This is because some tests may use conditions similar to the flush period, and there’s a risk that the script could confuse the two.

Once the identifier is selected, code must be written to locate the instances where this occurs and create new test files accordingly. This can be done by removing the data representing the flush period, and identifying the instances where the index of the table changes. The following code shows an example where the flush occurs when “Solenoid Valve 1 Status (-)” = 0, and the test occurs when “Solenoid Valve 2 Status (-)” = 1.

Data_File = pd.read_csv(r’C:\Users\JSmith\Data.csv’)

Data_File[‘Index’] = Data_File.index

Data_File = Data_File.drop(Data_File[Data_File[‘Solenoid Valve 1 Status (-)’] == 0].index)

Data_File = Data_File.reset_index()

Data_File['Index_Shift'] = Data_File['Index'].shift(periods = -1)

Data_File['Index_Shift'] = Data_File['Index_Shift'] – Data_File['Index']


End_Of_Tests = []

End_Of_Tests = End_Of_Tests + Data_File[Data_File['Index_Shift'] > 1].index.tolist()



That code performs the following:

1. The first two lines read the data file, and create a column titled “Index” which matches the index of the table.

2. The third line removes all rows of the table where “Solenoid Valve 1 Status (-)” = 0, thus removing all portions of the code representing the flush period.

3. The fourth line resets the index of the table, such that it appears as if no rows have been removed. Note that the column titled “Index” is not changed in this step.

4. The fifth line creates a new column titled ‘Index_Shift’ which replicates the values from ‘Index’, but shifted up one row in the table. It allows an easy way to identify how much ‘Index’ changes from one row to the next.

5. The sixth line changes ‘Index_Shift’ to equal ‘Index’ minus ‘Index_Shift’ in any given row. During the steady state portion of a test, this value will equal 1. In cases where the flush period was removed and the data now jumps from one test to the next, ‘Index_Shift’ will be greater than 1.

6. The last three lines create an array called ‘End_Of_Tests’ with entries representing the index of the row which corresponds to the last data readings in any given test. It is filled with the index of any row where ‘Index_Shift’ is greater than 1. It also accepts the final index in the data set, because he ‘Index_Shift’ method will not identify that row as a final reading in a test.

After this code runs, the ‘End_Of_Tests’ array provides a list of all the indexes corresponding to the end of each test. The code then needs to split the data set into individual files, and save them as new .csv files with descriptive names. This can be accomplished using the following steps:

1. Create a for loop that cycles through the ‘End_Of_Tests’ array.

2. Use the techniques described in Reading the Test Conditions from the Data File to identify the nominal conditions of each test.

3. Create a file name for each data file that describes the conditions of the test. The code to do so will look something like:

File_Name = "FlowDrain=" + str(Flow_Drain) + "_FlowPotable=" + str(Flow_Potable) + "_TDrain=" + str(T_Drain) + "_TPotable=" + str(T_Potable)

This creates a set of individual data files that can be analyzed individually, with file names containing the conditions of the test. They are ready to be opened, and understood using the techniques descried in Writing the Conditions in the Name of the Data File.

Next Steps

The last two posts have discussed preparing and understanding data files. These are vital background steps, necessary before any actual data analysis can be performed. In the next post, we'll provide a basic intro to some Python packages used in data analysis. The introduction will be high level, with later posts providing significantly more detail on each.


This blog is an honest attempt to teach you valuable skills that you can use in your career, ranging from computer programming and Python packages, to scientific data analysis and automation, and all the way to simulation model calibration and validation. Since I provide this to you free of charge, I can only do it with support from the readers. Please consider supporting the blog through my Patreon account. I know that asking is obnoxious, but it really helps me keep this blog in operation so I can continue helping you.


*Those interested in drain water heat recovery devices, and how/why they're studied can review a LinkedIn article I wrote about it at

**Pandas is a package of Python functions designed to facilitate scientific data analysis. That's enough information for this post, future posts will provide more detail.