Designing Data Sets for Automated Laboratory Data Analysis

Creating repeatable data sets is vital to automating data analysis. Repeatable data sets allow creation of programs which process the data set in the exact same way, every single time, with no modification. At the same time, all experimental and/or simulation data sets will be complex, with changes throughout the test, or periods that are of more interest than others. Therefore, the goal is to create repeatable signals in a potentially non-repeatable process.

There are several ways to do this. They include using valve status or a control setpoint as a signal stating that the test has changed phases, comparing a measured condition to a setpoint, or using the time of the test. These options will be described using the example of a single test on a drain water heat recovery device.

Introducing an Individual Data Set

This topic will be described using an example data set emulating what occurs during testing of drain water heat recovery devices*. When characterizing one of those devices, the output of interest is the steady state effectiveness. The equation to identify effectiveness is shown in Equation 1. Per Equation 1, the important measurements in this testing are the flow rate, inlet temperature, and outlet temperature on both the drain and potable** sides of the device.



Figures 1 and 2 show example data that could be from a drain water heat recovery test, with Figure 1 showing water flow rate data and Figure 2 showing temperature data. The data sets emulate a test in an equal flow configuration with a 100.4 °F drain water temperature and a 50 °F potable water temperature. To emulate the variability inherent in experimental data, some randomness has been added to the data points created for this sample data set.

For the sake of this discussion, the important thing to note is that there are three phases to this test. The first phase is the conditioning period of the test. During this phase, the entire system is flushed with 70-degree water to ensure that the unit has a stable starting condition. The first phase can be identified because the temperature data is all approximately 70 degrees, and the flow rates are close to 0.5 gallons/minute. This phase continues for the first 410 seconds of the test, before the second phase begins. The second phase is the warm-up period, as the hot and potable water enter the device and the device approaches steady state operation. It can be identified because the flow rates suddenly jump from roughly 0.5 gallons/minute to roughly 3.5 gallons/minute. The third phase is the steady state effectiveness portion of the test. In the third phase, the flow rates are fairly stable at 3.5 gallons/minute, the drain-side inlet temperature is fairly stable at 100.4 degrees, and the potable-side inlet temperature is fairly stable at 50 degrees. Because the system is operating steadily, this phase is used to identify the performance of the device under steady state conditions. It runs from 440 seconds to the end of the test.

Figure 1: Example Water Flow Data from a Typical Test

Figure 2: Example Temperature Data from a Typical Test
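For readers who want to follow along, here is a minimal sketch of how a sample data set like this could be generated. It is not the script used to create the figures. The column names, the noise levels, and the linear ramp during the warm-up period are all assumptions made for illustration, and the outlet temperatures are left out to keep it short.

```python
import numpy as np
import pandas as pd


def make_sample_test(seed=0):
    """Emulate one drain water heat recovery test, one row per second."""
    rng = np.random.default_rng(seed)
    # 440 s before steady state plus 935 s of steady state
    time = np.arange(0, 1376)

    # Fraction of the way from conditioning to test conditions:
    # 0 until 410 s, then a linear ramp (assumed shape) reaching 1 at 440 s.
    ramp = np.clip((time - 410) / 30.0, 0.0, 1.0)

    # Flow jumps when conditioning ends; temperatures follow the ramp.
    flow = np.where(time < 410, 0.5, 3.5)
    drain_inlet = 70.0 + ramp * (100.4 - 70.0)
    potable_inlet = 70.0 + ramp * (50.0 - 70.0)

    def noisy(values, std):
        # Add normally distributed scatter (assumed noise model).
        return values + rng.normal(0.0, std, size=values.size)

    return pd.DataFrame({
        "Time (s)": time,
        "Drain Flow (gal/min)": noisy(flow, 0.02),
        "Potable Flow (gal/min)": noisy(flow, 0.02),
        "Drain Inlet Temperature (deg F)": noisy(drain_inlet, 0.2),
        "Potable Inlet Temperature (deg F)": noisy(potable_inlet, 0.2),
    })


sample = make_sample_test()
```

Fixing the seed makes the "random" data identical on every run, which is handy when testing an analysis script.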

Figure 3 shows the effectiveness calculated using the data shown in Figures 1 and 2, and the effectiveness equation. As can be expected, the effectiveness calculated during the first phase of the test is not valuable. The average effectiveness value is significantly lower than during the third phase, and the scatter in the data is extremely high. The lower average is caused by the very small temperature differences, and the high scatter is caused by the low flow rate. Small changes from one measurement to the next cause high variations in results. The effectiveness results in the second phase are much closer to the results in the third phase, but there is still variation.

Figure 3: Calculated Effectiveness Using the Sample Data

This change in effectiveness during the three phases of the test is important because it is a fundamental challenge to automating data analysis. Since the desired outcome from each test is to calculate the average effectiveness of the device in steady state, this entire data set cannot be used. The calculated average of the entire data set would be very different from the calculated average from the steady state portion. This necessitates filtering the data set so that the analysis focuses on the desired portion, and designing the experiments so that this filtering is possible.

Methods of Isolating the Desired Data Set

There are an infinite number of ways to isolate the desired data set; the trick is to find one that works well for any given application, based on one's needs and control over the available data. This section will describe a few different methods, while comparing their strengths and weaknesses.

Using a Control Signal

The most powerful method of isolating the desired data is printing a control signal in the data set. This control signal states some condition of the test that identifies the appropriate phase of the test. Filtering the data set using that control signal allows the analyzer to reduce the data set to include only the relevant data, and calculate an accurate result.

Using the example drain water heat recovery testing, printing the status of flow control valves would achieve this result. A valve status must change in order for the flow rate through the device to suddenly change from 0.5 gallons/minute to 3.5 gallons/minute at the end of the first phase. Printing this control valve signal gives the program a solid point to identify that the conditioning phase has ended, and the testing phase has begun. Figure 4 shows the same flow data from a typical test, with the control valve signal added in.

Figure 4: Valve Status and Flow Data in a Typical Test

Figure 5 shows the effectiveness data from the same data set, filtered by the valve status signal. Because of the use of that valve signal, this effectiveness data set includes only data when there is flow through the device (Phases 2 and 3). The inclusion of data from Phase 2 would cause a minor amount of error, though nowhere near as much as including the data from Phase 1. The steadiness of the effectiveness data shown in Figure 5 indicates that the average effectiveness over the steady state period would be an accurate calculation.

Figure 5: Effectiveness Data Filtered by a Valve Signal
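A minimal sketch of a valve status filter in pandas is shown below. The column names and the handful of effectiveness values are made up purely to illustrate the filter; they are not results from the sample data set. It also assumes the valve status is printed as 0 during conditioning and 1 once water flows through the device.

```python
import pandas as pd

# Hypothetical rows: valve closed (0) during conditioning, open (1) afterwards.
data = pd.DataFrame({
    "Time (s)": [0, 200, 400, 420, 440, 900, 1375],
    "Valve Status": [0, 0, 0, 1, 1, 1, 1],
    "Effectiveness": [0.10, 0.45, 0.05, 0.48, 0.50, 0.50, 0.50],
})

# Keep only the rows recorded while the valve was open (Phases 2 and 3).
flowing = data[data["Valve Status"] == 1]

mean_all = data["Effectiveness"].mean()
mean_flowing = flowing["Effectiveness"].mean()
print(f"All data: {mean_all:.3f}, valve open only: {mean_flowing:.3f}")
```

Because the filter keys off the signal rather than a time, the same line works even if the conditioning period runs long in a later test.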


Using a Known Time

A second way of isolating the desired portion of the data is filtering the data to select a specific time period. This method works well because it gives the user the most control of the resulting data set. Compared to using a control signal, as was described in the previous section, this method can isolate the data set to solely the third phase without including the second phase. Removing the data from the second phase will slightly increase the accuracy of the result. The downside of using a time-based filter is that it requires the user to identify a time period which is identical in every test. If there is any deviation from the original test plan, the analysis script must be modified accordingly.

The third phase in the example data set began at 440 seconds, so that would be a logical time condition to use to analyze the data. However, there is no way to be certain that each test would proceed identically, and that the third phase would begin by 440 seconds each time. Thus, setting the filter a bit later increases the safety of the analysis. The downside is that the analysis is completed using a smaller quantity of data, but the impact should be negligible as the steady state portion of the test is 935 seconds long. Figure 6 shows the impact of filtering the data to the last 900 seconds of the test.

Figure 6: Effect of Filtering Data to the Last 900s

Using the data from the final 900 seconds of the test would isolate the resulting data set to only the third phase of the test. It would also exclude a small portion of the third phase in the sample data, in case the second phase is slower in other tests. At the same time, 900 seconds converts to 15 minutes, which is more than enough time to identify the steady state operation of the device. Figure 7 shows the effectiveness data when it is filtered to include only the last 900 seconds of the sample data set.

Figure 7: Effectiveness Data Filtered to the Last 900 Seconds
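Here is one way the time-based filter might look in pandas. This is a sketch under assumptions: the function name, the column names, and the placeholder effectiveness values (0.30 before steady state, 0.50 during it) are all hypothetical and chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd


def last_seconds(data, duration, time_column="Time (s)"):
    """Keep only the rows from the final `duration` seconds of a test."""
    cutoff = data[time_column].max() - duration
    return data[data[time_column] >= cutoff]


# Hypothetical test: one row per second, steady state from 440 s onward.
data = pd.DataFrame({"Time (s)": np.arange(0, 1376)})
data["Effectiveness"] = np.where(data["Time (s)"] < 440, 0.30, 0.50)

steady = last_seconds(data, duration=900)
print(steady["Time (s)"].min(), steady["Effectiveness"].mean())
```

Measuring the window back from the end of the test, rather than forward from a fixed start time, is a design choice: the filter still lands in the steady state period if a test runs a little longer than planned, as long as the steady state portion is at least as long as the window.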


Using a Known Setpoint

Another possible solution is the use of a known setpoint to filter the data. This is achieved by creating a filter that selects only data when the measured test conditions are close to the known setpoint of the test. This is a powerful solution because it is more flexible than the time-based filter, allowing the analysis to adapt to changing test methods without modification, while not requiring additional points to be added to the data set like the control signal filter. The downside is the possibility for error from specifying the filter imperfectly. The filter must be designed to accept a wide enough range of data that all valid points are accepted, despite measurement uncertainty, yet narrow enough to reject invalid data points, such as those in the first and second phases.

The third phase in the example data set could be identified using either the flow rate or temperature data. Because the temperature data approaches the setpoint more slowly than the flow rate data, it is best to use the temperature data. This ensures that the data from the second phase is removed, isolating only the third phase. In this example, a filter could be designed based on either the drain or potable side inlet temperature. The temperature change, relative to the temperature in the first phase, is larger on the drain side than the potable side, so using the drain side for the filter creates less chance of filtering error.

To highlight the importance of setting the filter width effectively, Figures 8 and 9 present the effectiveness data using different filter widths. Figure 8 shows all effectiveness data when the drain-side inlet temperature is within 0.75 °F of the 100.4 °F setpoint. The results are very similar to those obtained using the time-based approach, because this particular filtering approach captured all of the data from the third phase of the test. Figure 9 shows all effectiveness data when the drain-side inlet temperature is within 0.25 °F of the 100.4 °F setpoint. The narrower filter seems like a safer choice because it filters out more erroneous data during the second phase; however, the scatter in the data set was also greater than 0.25 °F and this tight filter removed a significant portion of the meaningful data. In this example, the filter width of 0.75 °F performed much better.

Figure 8: Effectiveness Data Filtered to Drain Temperature Setpoint +/- 0.75 °F


Figure 9: Effectiveness Data Filtered to Drain Temperature Setpoint +/- 0.25 °F
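The effect of the filter width can be explored with a short sketch like the one below. The function name, the column names, and the emulated temperature data (a linear warm-up ramp and normally distributed scatter with a 0.3 °F standard deviation) are assumptions for illustration, not the data behind Figures 8 and 9.

```python
import numpy as np
import pandas as pd


def filter_to_setpoint(data, column, setpoint, tolerance):
    """Keep rows where `column` is within +/- `tolerance` of `setpoint`."""
    return data[(data[column] - setpoint).abs() <= tolerance]


# Emulated drain-side inlet temperature: 70 deg F until 410 s, an assumed
# linear ramp to 100.4 deg F by 440 s, then steady, with random scatter.
rng = np.random.default_rng(0)
time = np.arange(0, 1376)
ramp = np.clip((time - 410) / 30.0, 0.0, 1.0)
column = "Drain Inlet Temperature (deg F)"
data = pd.DataFrame({
    "Time (s)": time,
    column: 70.0 + ramp * 30.4 + rng.normal(0.0, 0.3, size=time.size),
})

wide = filter_to_setpoint(data, column, setpoint=100.4, tolerance=0.75)
narrow = filter_to_setpoint(data, column, setpoint=100.4, tolerance=0.25)
print(f"+/- 0.75: {len(wide)} rows kept, +/- 0.25: {len(narrow)} rows kept")
```

Comparing the number of rows each filter keeps against the number of rows expected in the steady state period is a quick sanity check that the width suits the scatter in the measurements.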

General Concerns

There are two general topics which should be considered when designing data sets to allow automation of analysis: the structure of the data files, and the amount of collaboration required from test staff.

Pandas*** tip

The pandas package references different columns in the data frame by name, not by number. This means that the structure of the data files used in the analysis can be changed in any way necessary, so long as the names of the columns used in the analysis remain the same.
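As a small illustration of that point, the sketch below runs the same calculation on two data frames whose columns are in a different order, one of which has an extra column. The column names and values are hypothetical.

```python
import pandas as pd


def mean_drain_inlet(data):
    # The column is looked up by name, so its position does not matter.
    return data["Drain Inlet Temperature (deg F)"].mean()


original = pd.DataFrame({
    "Time (s)": [0, 1, 2],
    "Drain Inlet Temperature (deg F)": [100.2, 100.4, 100.6],
})

# Same measurements, but with an added column and a different column order.
rearranged = pd.DataFrame({
    "Operator Notes": ["", "", ""],
    "Drain Inlet Temperature (deg F)": [100.2, 100.4, 100.6],
    "Time (s)": [0, 1, 2],
})

print(mean_drain_inlet(original), mean_drain_inlet(rearranged))
```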


Some of these approaches require more collaboration from test staff than others. Using a control signal requires that the control signal be included in the data file, while using time requires the timing of each test to be repeatable. Using a known setpoint, however, only requires that the tester perform the test correctly. If collaborating with test staff, especially somebody who isn’t particularly flexible, this could be a major factor in selecting a filtering method.




*Those desiring more information on drain water heat recovery devices, or why anybody would perform experiments on them, should review the following documents.


**If you aren't in the residential water heating industry you might not be familiar with the word "potable." It's the cold water entering your house. Much to my chagrin, the "pot" part is pronounced less like pot and more like boat. Which makes no sense; if it was pronounced like pot it would literally say that it's water you can put in your pot. Which is what it is.

***If you aren't experienced with Python, you may not be familiar with pandas. It's a package of Python functions that are extremely valuable for scientific data analysis. We'll have a series of posts introducing beginners to Python, and detailing features of several packages later.