Storing Intermediate Results for Later Analysis

So far, the discussion has focused on analyzing the results of individual tests. The next step is to think about the bigger picture and to create ways of combining those individual test results into data sets describing the entire project. That starts with storing the individual test results in a logical manner that facilitates later analysis.

There are two general tricks to storing intermediate results for later analysis in an automated process. The first is planning the organizational structure, to ensure that all files can be easily located when needed. The second is using dynamic file names in the code, so that the results are saved to new files with each iteration through the program.

Creating the Folder Hierarchy

Planning the organizational structure essentially means creating a folder hierarchy that makes sense for a given project. For example, say that a project includes performing several experiments on multiple pieces of equipment. The goal is to create regressions emulating the performance of each piece of equipment. In this case, there’s value in creating a folder for each piece of equipment, then storing results from individual tests within the corresponding folders. Figure 1 shows an example of how this folder hierarchy could be structured.

 

Figure 1: Example Folder Hierarchy

Storing Files Using Dynamic Names

The second point to keep in mind is that all references to stored data should use dynamic names: names built from variables, taken from the data set using the techniques described in How to Identify the Conditions of Laboratory Tests and Split Data Files, so that each file name is specific to its data set. For example, a data set may contain data specific to Equipment 2, Test 3. In that case, any code saving results for that data set must use variables to target the "Test 3" subfolder of the "Equipment 2" folder.
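For instance, a file name specific to that data set can be built directly from the variables. The short sketch below is purely illustrative: the values assigned to Equipment and Test are placeholders that would, in practice, come from the data set itself using the techniques mentioned above.

      # Placeholder values; in practice these are identified from the data set being analyzed
      Equipment = 'Equipment 2'
      Test = 'Test 3'

      # Build a file name specific to this data set, e.g. 'Equipment 2_Test 3.csv'
      FileName = Equipment + '_' + Test + '.csv'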

When creating the folder structure, it is necessary to ensure that all of the folders exist. There are two approaches to doing this. The first is to manually create the folders for the project, laying everything out ahead of time. That may be a good approach if it helps to think through the process and create a strong structure, but this is a blog on automating everything! It's easier to let Python do the work: the structure can be created automatically by including the appropriate code in the analysis loop, using the following steps.

1) Import the os package, which provides access to commands that interact with the computer's operating system. This is done with the Python code "import os".

2) Within the analysis loop, use the techniques in How to Identify the Conditions of Laboratory Tests and Split Data Files to determine which test the current data set describes. Using the folder hierarchy in Figure 1 as an example, this might result in a variable Equipment set to "Equipment 2" and a variable Test set to "Test 3". Ensure that both values are stored in their variables as strings.

3) Specify the folder for the current data set, using variables and input from the data set. In our current example, this could be done with the following code:

  Folder = r'C:/Users/JSmith/DataAnalysis/' + Equipment + '/' + Test

4) Determine whether or not the folder exists using the os.path.exists command, and create the folder if needed using the following code:

if not os.path.exists(Folder):
    os.makedirs(Folder)

Those steps create code that will automatically generate all folders needed for the structure. The same techniques can be used to create further levels of subfolders as needed for any given project.
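Putting those steps together, a minimal sketch of the folder-creation code might look like the following. It assumes that Equipment and Test have already been identified for the current data set; the placeholder values shown are used only for illustration.

      import os

      # Placeholder values; in practice these are identified from the current data set
      Equipment = 'Equipment 2'
      Test = 'Test 3'

      # Build the folder path for this data set
      Folder = r'C:/Users/JSmith/DataAnalysis/' + Equipment + '/' + Test

      # Create the folder if it does not already exist
      if not os.path.exists(Folder):
          os.makedirs(Folder)

As a side note, os.makedirs also accepts an exist_ok=True argument, which makes the explicit existence check unnecessary if preferred.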

Then, the results of each test need to be stored in the appropriate folders. The code for saving the results varies between packages. Results can be saved with pandas, bokeh, and matplotlib using the following code examples.

pandas

Data frames have the conveniently named .to_csv method. Readers should consult the pandas documentation for the specific details of how it works, but the general approach is to call the method and specify the file path. For the current example and a data frame called 'Data', this can be done with the following code:

 

Data.to_csv(r'C:/Users/JSmith/DataAnalysis/' + Equipment + '/' + Test + '/' + Equipment + '_' + Test + '.csv')

 

The final portion, '/' + Equipment + '_' + Test + '.csv', was added to the previous code to provide a name for the .csv file placed in the folder. A shorter way to accomplish the same objective, assuming that the previous code was used to define the variable Folder, is: Data.to_csv(Folder + '/' + Equipment + '_' + Test + '.csv').
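Written out as a sketch, and assuming Folder has already been defined as described above, that shorter call looks like the line below. The index=False argument is optional; it simply omits the data frame's row index from the .csv file.

      # Save the results for this test into its folder; index=False drops the row index column
      Data.to_csv(Folder + '/' + Equipment + '_' + Test + '.csv', index=False)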

bokeh

bokeh uses a somewhat more complicated approach to saving files. This provides the ability to store multiple plots within a single file. It is performed using the following steps:

1) Create a gridplot. The gridplot function allows specification of how multiple plots should be arranged within a single file. An outer list specifies the overall grid, while the inner lists specify the plots within each row. For example, a gridplot with two plots on the first row and three on the second would be specified as:

      p = gridplot([[p1, p2], [p3, p4, p5]])

2) Specify the desired file location, and the desired title. Continuing the example of Equipment 2 – Test 3, this could be done with the following code:

      output_file(Folder + '/' + Test + '.html', title=Test + '.html')

3) Save the plot. This is done with the intuitive save() command. The syntax to save the plot in this example is:

      save(p)
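Assembled into a single block, a minimal sketch of the bokeh approach could look like the following. It assumes that the individual plots p1 through p5 have already been created with bokeh's figure function, and that Folder and Test are defined as in the earlier folder-creation code.

      from bokeh.layouts import gridplot
      from bokeh.plotting import output_file, save

      # Arrange the plots: two on the first row, three on the second
      p = gridplot([[p1, p2], [p3, p4, p5]])

      # Point bokeh at the destination .html file for this test
      output_file(Folder + '/' + Test + '.html', title=Test + '.html')

      # Write the gridplot to the file
      save(p)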

matplotlib

matplotlib uses a very simple file saving convention. The command is plt.savefig(). The syntax for this example is:

      plt.savefig(Folder + ‘/’ + Test)
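If no extension is provided, matplotlib falls back to its default format (typically .png). To be explicit, the extension can be included in the file name, as in the short sketch below; the plt.close() call is simply a good habit inside a loop so that figures do not accumulate in memory.

      import matplotlib.pyplot as plt

      # Save the current figure as a .png file for this test
      plt.savefig(Folder + '/' + Test + '.png')

      # Close the figure so plots do not pile up across loop iterations
      plt.close()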

Next Steps

Now that the results from each test are calculated and stored in a logical manner, the next step is to create models of the data set. The next blog post will present some high-level concepts to keep in mind when developing and validating regression models for experimental data sets.