Analyzing Data Sets With Multiple Test Types

The previous posts have all discussed methods for automating data analysis using Python when all tests are similar. This won’t always be the case. Sometimes tests will be used for different purposes; for example, some tests may collect data for regression development, while others search for behaviors or control logic in specific test cases. This creates an added level of complexity when writing scripts to analyze the data; the approach must be flexible enough to correctly analyze each of these different cases.

The solution to this issue lies in creating different functions for each different test type, as well as a central script to run them all. The role of the central script is to:

  1. Use a glob loop to iterate through all test files in the folder [1], and
  2. Identify the test type using the techniques in How to Identify the Conditions of Laboratory Tests and Split Large Data Files, and call a function written to analyze that type of test.

This means that it essentially opens all files in the folder, identifies the type of test contained in each file, and calls a script written to analyze that type of file. The scripts analyzing specific file types need to be written using the techniques described in the preceding posts. Their role is to:

  1. Accept inputs passed to it from the central script,
  2. Perform all data analysis calculations,
  3. Perform error checking to ensure that the tests were performed correctly,
  4. Print any plots that may be necessary, and
  5.  Save the desired outputs (data frames and variables that are not saved or returned are lost when the function completes).

 

Analysis Scripts

The role of the analysis scripts is essentially the same as what has previously been discussed. They read the data files, perform data analysis calculations, check the data to ensure there are no errors, plot the results, and save the results as necessary. There are two distinct differences from how they’ve been used in the past. First, the programs must be defined as functions and accept inputs that are passed into them from the central program. Second, the data must be saved by the analysis script before the program returns to the central script.

The first difference is addressed by following the rules for writing functions in Python, which can be found in the Python Function Definition documentation. There are a few specific points to keep in mind:

  • A Python function is defined with the form “def FunctionName(NecessaryVariables):”
    • “def” informs Python that you are defining a new function,
    • “FunctionName” is the name that you give to your function and later use to call that function,
    • “(NecessaryVariables)” represents a list of variables, separated by commas and written within parentheses [2], that the program needs to perform all necessary calculations,
    • “:” is a necessary part of closing the line.
  • Since the individual script will be a called function, it won’t have its own glob loop to identify the file and test conditions. Remember to include these in the list of variables passed to it,
  • As is standard in Python syntax, all code defining the function must be indented, and
  • As tempting as it may be to assume that all necessary packages are imported in the central script, ensure that each individual script imports the packages it needs to avoid confusion later.
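The rules above can be seen in a minimal sketch. The function name "AnalyzeTest", its inputs, and the 'Temperature' column are placeholders invented for illustration; a real analysis function would accept whatever variables the central script passes in.

```python
import pandas as pd

# Minimal sketch of the function-definition rules described above.
def AnalyzeTest(FileName, DataFrame):
    # The colon at the end of the "def" line and the indented body
    # are both required by Python syntax
    MaxTemperature = DataFrame['Temperature'].max()
    return MaxTemperature

# Example call, as the central script might make it
Sample = pd.DataFrame({'Temperature': [20.1, 45.3, 38.0]})
Result = AnalyzeTest('Test_1.csv', Sample)
```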

This leads to a script that follows the following basic pattern:

  1.  def FunctionName(FileName, DataFrame, OtherVariables):
  2.  import pandas, numpy, bokeh, matplotlib and/or other packages as needed for your project,
  3.  Read the file name to make sense of the data using techniques from How to Identify the Conditions of Laboratory Tests and Split Large Data Files,
  4.  Perform all necessary calculations,
  5.  Perform error checking as needed using the techniques from Checking the Quality of Testing and Analysis Results,
  6.  Plot the data using bokeh or matplotlib as briefly described in An Introduction to Python Packages that are Useful for Automating Data Analysis,
  7.  Save the resulting data frame to a .csv file using the naming conventions and logic in Storing Intermediate Data Files for Later Analysis.
  8.  Return to the central program, so it can start the process for the next file.
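The pattern above can be sketched as a hypothetical analysis function. Every name here (the function, its inputs, the column names, the file-naming convention) is invented for illustration, and the plotting step is omitted to keep the sketch short.

```python
import os
import tempfile
import pandas as pd

# A hypothetical analysis function following the pattern described above
def DWHR_Behavior_RepeatedShortDraws(FileName, DataFrame, OutputFolder):
    # Step 3: make sense of the file name (here, just strip the extension)
    TestName = os.path.splitext(os.path.basename(FileName))[0]

    # Step 4: perform the necessary calculations
    DataFrame['Temperature_Rise'] = (DataFrame['Outlet_Temperature'] -
                                     DataFrame['Inlet_Temperature'])

    # Step 5: basic error checking before trusting the results
    if DataFrame['Temperature_Rise'].isna().any():
        print(f'Warning: missing data in {TestName}')

    # Step 7: save the results before returning, or they are lost
    OutputPath = os.path.join(OutputFolder, f'Analyzed_{TestName}.csv')
    DataFrame.to_csv(OutputPath, index=False)

    # Step 8: return control to the central program
    return OutputPath

# Example call with made-up data and a temporary output folder
Sample = pd.DataFrame({'Inlet_Temperature': [10.0, 10.2],
                       'Outlet_Temperature': [18.5, 18.9]})
Folder = tempfile.mkdtemp()
Saved = DWHR_Behavior_RepeatedShortDraws('Test_1.csv', Sample, Folder)
```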

An analysis script following this generic approach can perform data analysis on a specific type of test when called by a central program. Creating one for each type of test in a project provides the basis that the central program needs to coordinate analysis of all test types.

Central Program

The central program then performs the role of loading all tests, identifying the type of test represented by each file, calling the appropriate analysis function, and passing it the appropriate variables.

The first step in the central program is importing the relevant scripts. This is done using the same general approach as importing any other function. In order to easily import a function from a user-made script, it’s important to save the analysis scripts in the same folder as the central program. This then allows the analysis scripts to be imported using a simple import statement. Two examples are shown below:

  •  from DWHR_Behavior import DWHR_Behavior_RepeatedShortDraws
  •  from DWHR_Effectiveness import DWHR_Effectiveness_SteadyState

Those two lines of code import functions that can be used to analyze the behavior and effectiveness of drain water heat recovery (DWHR) [3] devices under different circumstances. These import statements will correctly import the desired functions if a) the folder contains files called “DWHR_Behavior.py” and “DWHR_Effectiveness.py” and b) those two files contain functions called “DWHR_Behavior_RepeatedShortDraws” and “DWHR_Effectiveness_SteadyState.”

The second step is creating a list of file names in the folder, and creating a for loop iterating through them. This is accomplished using the glob package, and the techniques described in An Introduction to Python Packages that are Useful for Automating Data Analysis. This will allow the central program to iterate through each of the files in the folder, and later call each of the analysis scripts as required.
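A minimal sketch of that glob loop is shown below. A temporary folder with two dummy files stands in for the real data folder, and the '*.csv' pattern is an assumption about how the data files are named.

```python
import glob
import os
import tempfile

# Create a temporary folder with two dummy files to stand in for real data
Folder = tempfile.mkdtemp()
for Name in ['Test_1.csv', 'Test_2.csv']:
    open(os.path.join(Folder, Name), 'w').close()

# glob.glob returns a list of file paths matching the pattern
FileNames = sorted(glob.glob(os.path.join(Folder, '*.csv')))
for FileName in FileNames:
    # Each pass through the loop will analyze one test file
    print(os.path.basename(FileName))
```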

The third step is identifying the type of test that the active data file represents. This can be done using a modification of the techniques described in How to Identify the Conditions of Laboratory Tests and Split Large Data Files. That post described identifying the test parameters, but the same approach can be used to read a column in the test matrix that specifies the type of test. The following code shows an example of how to identify the test type from a test file and save it to a variable called "Test_Type" assuming that we have already identified the test number (See the section titled 'Referencing the Test Number in the File Name' for recommendations).

Test_Type = Test_Matrix.loc[Test_Matrix['TestNumber'] == TestNumber, 'Test Type'].item()

That code uses the pandas .loc command to identify the appropriate value in the Test_Matrix data frame. It can be understood with the following explanations:

  • The .loc command uses the following syntax: DataFrame.loc[Row, Column] to identify the value stored in the specific column,
  • Test_Matrix['TestNumber'] == TestNumber uses a boolean check to identify the row where the column 'TestNumber' is equal to the variable 'TestNumber', and
  • The .loc command naturally returns a pandas series which, when printed, includes information such as the row index of the identified cell, the data type, and the name of the column, alongside the value of the cell itself. Adding '.item()' at the end tells pandas to return only the value of the cell.
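This lookup can be demonstrated with a small, made-up test matrix; the test numbers and test types below are invented for illustration.

```python
import pandas as pd

# A small, made-up test matrix to demonstrate the .loc lookup
Test_Matrix = pd.DataFrame({'TestNumber': [1, 2, 3],
                            'Test Type': ['Behavior', 'Effectiveness',
                                          'Behavior']})

TestNumber = 2  # assume this was read from the file name
# Boolean check selects the matching row; .item() extracts the lone value
Test_Type = Test_Matrix.loc[Test_Matrix['TestNumber'] == TestNumber,
                            'Test Type'].item()
print(Test_Type)
```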

The fourth step is identifying the inputs needed by the appropriate analysis script. This will be highly customized, because it depends on the needs of any individual project and script. It will include things like the equipment under test, the parameters of the test, any control logic that may be relevant, and so on.

The final step in the central program is to call the function matching the identified test type. This can be done using a series of if statements that check the test type, each containing the appropriate function call. Remember to include the necessary variables when calling the function. For example, the following code shows how to call the correct function assuming that we have the variable 'Test_Type' from above and the two functions imported earlier.

if Test_Type == 'Behavior':

    DWHR_Behavior_RepeatedShortDraws(NecessaryVariables)

elif Test_Type == 'Effectiveness':

    DWHR_Effectiveness_SteadyState(NecessaryVariables)

And that’s it. Once the central program calls the appropriate function, the analysis scripts take care of the rest.
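Putting the pieces together, the whole central program can be sketched end to end. The dummy analysis functions below stand in for the imported ones, and the 'Test_<number>.csv' file-name convention is an assumption; a real program would build FileNames with a glob loop and import the real analysis functions.

```python
import pandas as pd

# Dummy stand-ins for the imported analysis functions
def DWHR_Behavior_RepeatedShortDraws(FileName):
    return f'Behavior analysis of {FileName}'

def DWHR_Effectiveness_SteadyState(FileName):
    return f'Effectiveness analysis of {FileName}'

# A made-up test matrix mapping test numbers to test types
Test_Matrix = pd.DataFrame({'TestNumber': [1, 2],
                            'Test Type': ['Behavior', 'Effectiveness']})

FileNames = ['Test_1.csv', 'Test_2.csv']  # a glob loop would build this
Results = []
for FileName in FileNames:
    # Identify the test number from the file name
    TestNumber = int(FileName.split('_')[1].split('.')[0])
    # Identify the test type from the test matrix
    Test_Type = Test_Matrix.loc[Test_Matrix['TestNumber'] == TestNumber,
                                'Test Type'].item()
    # Call the analysis function matching the test type
    if Test_Type == 'Behavior':
        Results.append(DWHR_Behavior_RepeatedShortDraws(FileName))
    elif Test_Type == 'Effectiveness':
        Results.append(DWHR_Effectiveness_SteadyState(FileName))
```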

Next Steps

By this point, you should know the fundamentals of automating laboratory data analysis for projects with both single and multiple test types. The techniques previously discussed here will allow you to plan out your projects, and write programs as needed to make the process significantly easier and faster. The next topic is that of creating different scripts for different parts of the project. This will include separate scripts for splitting the files, analyzing the data, validating the regressions, and so on. This topic will be discussed in the next post.

 

 

[1] For an introduction to using glob, visit An Introduction to Python Packages that are Useful for Automating Data Analysis.

[2] E.g. (FileName, DataFrame, Unit)

[3] For more information on how drain water heat recovery devices work, see Introducing an Individual Data Set in Designing Data Sets for Automated Data Analysis.