Perhaps the strongest argument against automation of data analysis, is that of reliability. A computer algorithm that mindlessly performs calculations and analysis won’t identify poorly performed tests or mistakes in analysis, and these oversights will lead to errors in the final project results. This isn’t simply people being afraid of new ways of doing things, it’s a legitimate concern. It’s especially true in the case of laboratory testing; lab tests don’t always progress as planned, and sometimes those mistakes need to be identified during data analysis.
Fortunately, it’s possible to add error checking methods to the program. Some of these techniques, namely printing intermediate outputs and plotting all data, require manual data checking though still significantly less human effort than fully manual data analysis. Others, namely creating an automated data checking algorithm, allow the program to do the error checking on their own significantly reducing data checking time demands. Combined with the natural repeatability of computer programs, these approaches can lead to data analysis processes that are more robust than manual calculations, while also being significantly faster.
The following sections will describe these three approaches to data quality checking in Python programs.
Printing Intermediate Outputs
Printing intermediate outputs is analogous to showing all calculations. This is a good practice because it both facilitates debugging when the program is created, and allows others to check the results and build confidence in the automated data analysis program. Because many people won’t want to review Python code directly, they won’t be able to check the formulas as they might with an Excel spreadsheet. This makes it especially important to provide as many intermediate outputs as possible, so people can check the calculations on their own.
The basic premise behind printing intermediate outputs is to present as many steps in the calculations as possible, in a format similar to Excel. This facilitates checking the results from the code, by making it easier for others to understand. Then they can easily perform their own calculations, and compare their results to the Python output. This is generally performed using the two following steps.
1. Present as many calculation details as possible within the data frame. External variables, or lists may be necessary, but should be used sparingly. Containing all of the data and calculations in a single data frame makes it easier for others to understand and check the calculations.
2. Print the data frame from each test to a unique .csv file.
Some aspects of presenting calculation details within the data frame come naturally. Most calculations will be performed on the data frame, and the results will be stored in the data frame as a direct result. Other aspects require a little extra effort. For instance, constants would more naturally be used as independent variables, but adding a column for them in the data frame allows others to check these constants while examining only the output table.
Saving a new .csv file for each test is a useful way to ensure that all calculations throughout an entire project are saved, rather than overwriting old calculations as results from new tests are analyzed. This is generally done by 1) Creating a new folder as each test is analyzed, to store the results from that test and 2) Saving the results to a .csv file using a dynamic file name, that changes to represent the conditions of the test currently being analyzed. These two goals can be accomplished using the following code. Note that the example code uses the techniques shown in How to Identify Test Conditions and Split Data Files to track the conditions of the test.
Flow_Drain = Find_Between(filename, '_FlowDrain=', '_FlowCold)
Flow_Cold = Find_Between(filename, '_FlowPotable=', '_TemperatureDrain')
Temp_Drain = Find_Between(filename, '_TemperatureDrain=', '_TemperaturePotable')
Temp_Potable = Find_Between(filename, '_TemperaturePotable=', '.csv')
Folder = r’C:\Users\JSmith\DataAnalysis\Data\Flow_Drain=’ + Flow_Drain + ‘Flow_Cold=’ + Flow_Cold + ‘Temp_Drain’ = Temp_Drain + ‘Temp_Potable=’ + Temp_Potable
if not os.path.exists(Folder):
Data.to_csv(Folder + ‘\Flow_Drain=’ + Flow_Drain + ‘Flow_Cold=’ + Flow_Cold + ‘Temp_Drain=’ + Temp_Drain + ‘Temp_Potable=’ + Temp_Potable + ‘.csv’, index = False)
Breaking the preceding code into four steps, it:
1. Uses the techniques from How to Identify Test Conditions and Split Data Files to identify the nominal conditions of the data set currently being analyzed. For the context of this chapter, this ensures that the program has the information it needs to create names of folders and files, and that the results are stored in an orderly fashion which can easily be understood later.
2. Uses the conditions read in Step 1 to create a variable named Folder. The path of Folder is unique to this test, and uses the conditions of the test in the name of the file.
3. Checks to see if there is already a folder matching that name and, if not, creates one.
4. In a real program, the fourth step would take place after the data analysis of the test was completed. That section was skipped in the example code because it is not relevant to this part of the discussion. After the analysis is completed, this step writes the data frame to a new .csv file. The .csv file is located within the new folder, containing the nominal test conditions, and itself contains the nominal test conditions. This rigorous naming structure ensures that the correct data set can easily be located as needed later.
Using Plots to Check Results
One convenient way to check the quality of test data, and the performance of the associated data analysis, is via plots. The advantage of automated data analysis programs is that they can rapidly create plots for hundreds of tests, reducing the manual labor required to generate them individually. This allows the user to rapidly peruse the plots, and rapidly ensure data quality.
The process of automatically generating and savings plots for each test is now something that the reader can do. The general process is to use the techniques described in How to Identify Test Conditions and Split Data Files and An Introduction to Python Packages that are Useful for Automated Data Analysis to create a program that cycles through all the data sets, performs the needed calculations, generates the needed plots, and saves the results. These techniques have been covered already.
The key concept here is to plot enough features of the data set to be able to quickly, and visually check to ensure that the test proceeded correctly. Returning to the drain water heat recovery (DWHR) example (As described in the Introducing an Individual Data Set section of Designing Data Sets for Automated Laboratory Data Analysis), this means that saved plot must allow the user to quickly determine:
1. That the test drain-side and cold-side flow rates closely matched the nominal test conditions called for in the test plan,
2. That the test drain-side and cold-side inlet water temperatures matched the nominal test conditions called for in the test plan,
3. That all parameters were steady enough to ensure quality, steady-state operation,
4. That the filters used to identify the steady state period of the test selected the correct portion of the data set, and
5. That the final steady-state effectiveness value is steady and reasonable.
This goal can be accomplished with three plots.
Figure 1 presents an example plot showing the water flow rate across both sides of the DWHR device. Assume for the sake of this example that the nominal flow rate conditions for this test were 3.5 gal/min on both sides of the device. In Figure 13, both flow rates range between 3.4 and 3.6 gal/min. This is a small amount of variation, within the anticipated range of variation in a test, and thus closely matches the nominal test conditions. Figure 13 shows that the first condition is satisfied. Examples of dissatisfactory operation include an average flow rate of 3.2 gal/min on either side, or scatter ranging from 3.0 to 4.0 gal/min.
Figure 13 also shows that the flow rates meet condition three. While there is a small amount of scatter in the data, as anticipated, the long-term trend is very stable right around 3.5 gal/min. A test would be considered dissatisfactory if, for example, the flow rate had temporarily dropped to 3.0 gal/min before returning to the set flow rate.
Figure 2 serves the same purposes for temperature data. A typical DWHR test calls for 100.4 °F inlet temperature on the drain side, and 50 °F inlet temperature on the potable side. Like the flow rate data, the test is considered valid if the recorded inlet temperatures are close to those nominal conditions, and do not vary dramatically throughout the steady state portion of the test. Both of those conditions are met, so this plot indicates that the recorded temperatures meet conditions two and three.
Figure 3 presents a plot of the calculated effectiveness of the DWHR device. Per the techniques shown in Designing Data Sets to Allow Automation, this data set is filtered to only show data after the valves switch to the test flow rate. The data clearly bounces around an effectiveness rating of 0.34 with some variation. The variation is expected, because there’s some variation in both the temperature and flow rate data. The first few data points make it clear that the unit was in a transition period when the filter first applied, but there are only a few so the impact on calculated average effectiveness is minimal. Aside from those beginning points, the presented effectiveness is quite stable. This plot confirms that conditions four and five are met by this data set.
Having these three plots available allows the user to ensure that this test was performed correctly, and that the data is valid, with mere seconds of active involvement. Including a section in an automated data analysis program that creates and saves the required plots is a very effective and efficient way to ensure the quality of a data set.
Automated Data Checker
The most detailed and automated way to check the quality of the data in a test is to create an automated data checker. This is a script that reviews the measured data to determine what was recorded, compares it to the nominal test conditions, determines if they’re acceptable, and report unacceptable results to the user. This can be advantageous for the user, as it limits the number of plots that need to be manually reviewed. Projects with hundreds of tests can easily yield thousands of plots, and reducing the number that need to be reviewed saves significant amounts of time, tedium, and project budget.
The following code provides an example of how this process can be accomplished. Assume that the program already has a data frame named “Temp” to temporarily store information about questionable results, and a second data frame name “SuspiciousTests” with the full list.
if abs(Temperature_Drain - np.mean(Data['Hot Inlet Temperature (deg F)'])) > Threshold_Difference_Temperature:
Temp.loc[0, 'Filename'] = filename
Temp.loc[0, 'Test Parameters'] = 'D' + str(Flow_Drain) + '-P' + str(Flow_Potable) + '-T' +str(Temperature_Drain)
Temp.loc[0, 'Code'] = 'Temperature_HotInlet_Avg'
Temp.loc[0, 'Value'] = np.mean(Data['Hot Inlet Temperature (deg F)'])
SuspiciousTests = SuspiciousTests.append(Temp)
SuspiciousTests.to_csv(r’C:\Users\JSmith\DataAnalysis\SuspiciousTests.csv’, index = False)
That code works as follows.
1. First, it compares the nominal drain-side inlet temperature, Temperature_Drain, to the average measured drain-side inlet temperature. If the difference is greater than a pre-specified value, set with the variable Threshold_Difference_Temperature, it determines that there was an issue with the test.
2. If there was an issue with the test it records the parameters of the test. These parameters include the filename, nominal test conditions, a code representing the test condition that was not met, and the measured value of that condition. These entries are all added to the SuspiciousTests data frame.
3. After the program processes all the data, SuspiciousTests is printed to a .csv file creating a record of which tests appear to be dis-satisfactory.
The example showed a single check, comparing the average drain inlet temperature to the nominal condition for the test. A thorough program would include checks for the other test parameters, as well as checks of the standard deviation to ensure that all parameters were stable.
In general, it’s best to remember the following guidelines when creating an automated data checking algorithm:
1. It should check all nominal test conditions, to ensure that they are adequate in both average and standard deviation.
2. It should check the results of any filters, to ensure that they caught the correct range of data.
3. It should examine the final calculated output, to ensure that it is within the expected range and to ensure that the test provided a reliable signal.
4. It needs to be checked thoroughly, to ensure that it is identifying problematic tests but not adequate ones, before being trusted.
The previous posts have all focused on analyzing independent results from independent tests. So far we've discussed how to design tests at allow automation, how to identify the conditions of tests from data files, how to split files as needed, some useful Python packages that can be used to analyze the data, and how to automatically ensure that the tests and analysis are valid. The next post will focus on how to logically store those data files to facilitate later analysis of the data sets.
 Catering to dedicated Excel users may inspire nominations for Upper Class Twit of the Year, but it really does make things a lot easier.