An Introduction to Python Packages that are Useful for Automating Data Analysis

Now that the data sets are designed to allow automation and the program can identify the test conditions, the next step is automating analysis of each individual data file. The advantages of doing so are clear: Higher efficiency allows analysis of more data, leading to more thorough research, with lower budgets while avoiding enough boring repetition to make the Spanish Inquisition proud[1].

Automating analysis of each individual test relies on the capabilities of several available packages. These packages include glob, pandas, bokeh, and matplotlib. glob is a simple yet useful package that allows a user to perform actions on all files in a folder that match a certain type. pandas is a highly versatile package that performs data analysis on tabular data sets, referred to as data frames. bokeh and matplotlib are two plotting tools with different strengths and weaknesses. This chapter will provide a high-level overview of these packages, as well as guidance on using them to analyze individual data files.

glob

glob is a simple package that creates lists of files which match a certain condition. This is extremely useful for automating data analysis, as a for loop can be used to automatically iterate through that list of filenames. The same code then analyzes every file in the list, which is the core of automated data analysis. The logic proceeds as follows:

1. Place all test files in the same folder.

2. Use glob to create a list of the filenames in that folder.

3. Use a for loop to iterate through each filename in that list.

4. Use other packages (as described in future sections) to perform the data analysis on each filename, as the for loop iterates through the glob list.

The glob package was mainly created to present the single function called glob. There are now other forms, such as iglob, but glob itself serves this purpose very well. To use glob, the program must tell it where to search for files, and how to identify those files. The search location can be either a single folder, or a root folder containing subfolders. The files can be identified with the notation ‘*.[FileType]’ where [FileType] is replaced with a filename extension, such as csv, and * tells glob to identify all files of that type. An example of glob's use is as follows:

 

import glob

Folder = r'C:\Users\JSmith\DataAnalysis\Data'

FileNames = glob.glob(Folder + '/*.csv')

 

The definition of Folder tells glob which folder to search. A separator is still needed between the folder and the filename pattern, but a raw string cannot end in a single \, so adding one to the code defining Folder yields a syntax error. A / is therefore added at the start of the pattern in the declaration of FileNames instead. The pattern ‘*.csv’ tells glob to return all .csv files in ‘C:\Users\JSmith\DataAnalysis\Data’. Note that replacing ‘Data’ with ‘*’ would have told glob to search every folder directly inside ‘C:\Users\JSmith\DataAnalysis’ for .csv files. That capability is useful if the desired files are stored in multiple subfolders, instead of a single location.
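As a rough illustration of the two search patterns, the sketch below builds a throwaway folder tree and runs glob against it. The folder and file names are made up for the example, and os.path.join is used here as one possible way to avoid writing the separator by hand.

```python
import glob
import os
import tempfile

# Build a temporary folder tree so the sketch runs anywhere (hypothetical names).
Root = tempfile.mkdtemp()
for SubFolder in ['Data', 'MoreData']:
    os.makedirs(os.path.join(Root, SubFolder))
    for Name in ['Test1.csv', 'Test2.csv', 'Notes.txt']:
        open(os.path.join(Root, SubFolder, Name), 'w').close()

# os.path.join inserts the separator, so none is typed into the folder string.
OneFolder = glob.glob(os.path.join(Root, 'Data', '*.csv'))

# A * in place of the folder name matches every folder directly inside Root.
AllSubFolders = glob.glob(os.path.join(Root, '*', '*.csv'))

print(len(OneFolder), 'files in Data,', len(AllSubFolders), 'files across all subfolders')
```

The .txt files are ignored in both cases because they do not match the ‘*.csv’ pattern.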

After glob creates a list of filenames, the program needs a for loop to cycle through each of the files. This can be accomplished with the following line of code:

 

for FileName in FileNames:

 

Now the program has the structure necessary to iterate through all the test files. The next step is adding some data analysis methods to the for loop using pandas.
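The four steps listed earlier can be sketched as a complete, runnable loop. The temporary folder and file contents below are placeholders invented for the example, and the loop body only records each filename where the real analysis would go.

```python
import glob
import os
import tempfile

# Create a temporary folder with two small .csv files (hypothetical contents).
Folder = tempfile.mkdtemp()
for Name in ['Test1.csv', 'Test2.csv']:
    with open(os.path.join(Folder, Name), 'w') as File:
        File.write('Test Time (s),Flow Rate (gal/min)\n0,1.5\n1,1.6\n')

# sorted() gives a repeatable order; glob itself does not guarantee one.
FileNames = sorted(glob.glob(Folder + '/*.csv'))

Processed = []
for FileName in FileNames:
    # Placeholder for the per-file analysis described in the following sections.
    Processed.append(os.path.basename(FileName))

print(Processed)
```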

pandas

pandas is an extraordinarily thorough package, designed specifically for scientific data analysis. A package this extensive cannot be explained in full here, so this section will focus on some of the basic functionality that is needed. Readers interested in learning about more pandas functions should consult the detailed pandas documentation[2]. We will provide a more detailed discussion of useful pandas features in later posts.

.read_csv

The very first pandas command to use when writing an automated data analysis program is read_csv(). This command does exactly what the name implies; it reads a .csv file. Several details on how pandas should read the .csv file can be expressed as arguments within the parentheses. One example that has been shown previously is the header argument, which instructs pandas to read the header from a specified row instead of from the top row. Those wanting to set a specific column for the index, perhaps a column such as “Time since test start (s)”, can use the index_col argument. Those who prefer working with data sets using date and time for the index will want to use the parse_dates argument to ensure that the dates work correctly when referenced later. But, ultimately, the most important part of using read_csv() is ensuring that the data is read into the data frame for use later.
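A minimal sketch of these three arguments follows. The file contents are invented for the example and are read from a string so the code runs without a file on disk; in a real program the first argument would be a path such as FileName.

```python
import io
import pandas as pd

# Hypothetical file contents: one line of notes above the real header row.
RawText = (
    'Test performed in Lab 2\n'
    'Timestamp,Flow Rate (gal/min)\n'
    '2018-05-25 08:00:00,1.5\n'
    '2018-05-25 08:00:10,1.6\n'
)

Data = pd.read_csv(
    io.StringIO(RawText),   # stands in for a file path
    header=1,               # column names are on the second row (rows count from 0)
    index_col='Timestamp',  # use this column as the index
    parse_dates=True,       # convert the index to datetime values
)

print(Data.index.dtype)
print(Data.columns.tolist())
```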

Mathematical Operations

Once the data frame is open, most mathematical operations can be performed pleasingly easily. Standard mathematical notations work as anticipated. Calculations can be performed on entire columns of the data frame by writing the name of the column as if it were a variable. Constants can be mixed in with the variables, as expected. An example showing this, in the context of calculating heat added to water, is presented below.

 

Data['Heat Transfer (Btu/hr)'] = Data['Flow Rate (gal/min)'] * Density_Water * SpecificHeat_Water * (Data['Outlet Temperature (deg F)'] - Data['Inlet Temperature (deg F)']) * 60

 

The preceding example code can be understood with the following statements:

· “Data” here is the name of the data frame being referenced,

· The notation “DataFrame[‘Text’]” references a column, specified by [‘Text’], in the chosen DataFrame. As an example, Data[‘Flow Rate (gal/min)’] refers to the data column representing “Flow Rate (gal/min)” in the Data data frame,

· The first term, Data[‘Heat Transfer (Btu/hr)’], creates a new column in the data frame “Data” labeled “Heat Transfer (Btu/hr)”,

· The rest of the references to Data are reading columns of measurements,

· Density_Water and SpecificHeat_Water are constant properties of water that were specified earlier in the program, and

· The final ‘* 60’ converts the result from Btu/min to Btu/hr.
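The calculation can be tried on a small made-up data frame, as in the sketch below. The property values are approximate figures assumed for illustration (about 8.33 lb/gal and 1.0 Btu/(lb-deg F)); the original program defines its own values earlier in the code.

```python
import pandas as pd

# Assumed, approximate property values; not taken from the original program.
Density_Water = 8.33        # lb/gal
SpecificHeat_Water = 1.0    # Btu/(lb-deg F)

# Hypothetical measurements standing in for a real test file.
Data = pd.DataFrame({
    'Flow Rate (gal/min)': [2.0, 2.0],
    'Inlet Temperature (deg F)': [60.0, 60.0],
    'Outlet Temperature (deg F)': [70.0, 80.0],
})

# gal/min * lb/gal * Btu/(lb-deg F) * deg F * min/hr = Btu/hr
Data['Heat Transfer (Btu/hr)'] = (
    Data['Flow Rate (gal/min)'] * Density_Water * SpecificHeat_Water
    * (Data['Outlet Temperature (deg F)'] - Data['Inlet Temperature (deg F)'])
    * 60
)

print(Data['Heat Transfer (Btu/hr)'].round(1).tolist())
```

Doubling the temperature rise at the same flow rate doubles the computed heat transfer, which is a quick sanity check on the formula.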

.loc

Another very useful pandas feature is the .loc indexer. It locates a specific cell in the data frame, and can either read a value from that cell or write a value to it. It uses the syntax DataFrame.loc[RowLabel, ColumnName]. With the default index, the row label is simply the row number, counting from 0. The row can be specified explicitly, with an iterating parameter in a for loop, or using a condition. Some examples are shown below.

 

StartTime = Data.loc[0, 'Test Time (s)']

Data.loc[0, 'Test Time (s)'] = 0

 

The first example reads the value from the first row of the data frame, in the column ‘Test Time (s)’ and stores it in the variable StartTime. The second example sets the value for the first row in the column ‘Test Time (s)’ to 0.
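The loop and condition forms can be sketched in the same way. The data values and the 0.1 gal/min noise threshold below are invented for the example.

```python
import pandas as pd

# Hypothetical measurements; the first flow reading is sensor noise.
Data = pd.DataFrame({
    'Test Time (s)': [120, 130, 140],
    'Flow Rate (gal/min)': [0.05, 1.5, 1.6],
})

# Explicit row: read one cell, then shift the column so the test starts at zero.
StartTime = Data.loc[0, 'Test Time (s)']
Data['Test Time (s)'] = Data['Test Time (s)'] - StartTime

# Iterating parameter: visit each row label and zero out readings below the threshold.
for Row in Data.index:
    if Data.loc[Row, 'Flow Rate (gal/min)'] < 0.1:
        Data.loc[Row, 'Flow Rate (gal/min)'] = 0.0

# Condition: a boolean mask selects every row where water is flowing.
Flowing = Data.loc[Data['Flow Rate (gal/min)'] > 0, 'Test Time (s)']

print(Flowing.tolist())
```

For large data frames the condition form is usually preferable to the loop, since pandas evaluates the whole column at once.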

.to_csv

The final function described here is to_csv. This function writes the data frame to a .csv file, where it can be stored for later access. The required syntax is DataFrame.to_csv(FileName), where FileName is replaced with the complete path stating where the file should be saved. to_csv includes several options which can also be specified, but the most important one is the index declaration. Unless the index is heavily used in data analysis, it’s best to leave the index out of the saved data file. In this case, add the option ‘index = False’ within the parentheses.
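A short sketch of saving without the index is shown below. The output path is a temporary location made up for the example; a real program would write to its own results folder.

```python
import os
import tempfile
import pandas as pd

# Hypothetical results standing in for an analyzed test.
Data = pd.DataFrame({'Test Time (s)': [0, 10], 'Flow Rate (gal/min)': [1.5, 1.6]})

# Assumed output location for illustration only.
OutputPath = os.path.join(tempfile.mkdtemp(), 'Test1_Analyzed.csv')
Data.to_csv(OutputPath, index=False)   # index=False leaves the row labels out of the file

# Read back the first line to confirm that no index column was written.
with open(OutputPath) as File:
    FirstLine = File.readline().strip()

print(FirstLine)
```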

Plotting

There are two main plotting packages that are commonly used in scientific data analysis with Python: bokeh and matplotlib. Both packages have their strengths and weaknesses. Since both packages have extensive documentation libraries of their own[3][4], this text will focus on introducing the user to a few key concepts rather than providing a detailed tutorial.

bokeh

bokeh provides a concise, intuitive interface for creating plots. Basic plots can be created with a few specific lines of code. The following example provides guidance on how to create a basic plot, comparing a measured water flow rate to the flow rate specified in the test protocol.

 

p1 = figure(width=1600, height=400, x_axis_label='Time (hr)', y_axis_label='Flow Rate (gal/min)', title='Water Flow Rate (gal/min)')

p1.circle(Data['Test Time (hr)'], Data['Flow Rate (gal/min)'], legend_label='Measured', color='red')

p1.line(Data['Time of Day (hr)'], Data['Water Flow Rate (gal/min)'], legend_label='Specified', color='blue')

 

That code can be understood with the following statements:

· The first line creates a bokeh figure called p1. The terms within the parentheses are used to specify some details of the basic structure of the plot.

· The ‘width’ and ‘height’ arguments set the dimensions of the plot window.

· ‘x_axis_label’ is used to specify the title of the x axis. In this case, the x axis represents the time since the experiment started expressed in hours.

· ‘y_axis_label’ does the same for the y axis. In this case, the y axis will show the water flow rate in gallons per minute.

· The second line adds a data series showing the measured data as red circles. Circles are typically used to represent measured data, as they don’t imply anything about the values between points as a line would. This data set shows up in the legend as “Measured.”

· The third line creates a new data series representing the specified water flow rate as a blue line. Lines are typically used to represent continuous values, such as experimental setpoints or simulation results. This data set is referred to as “Specified” in the legend.

The interactivity of plots is the primary advantage of bokeh. The package includes several tools that can be added to plots, which allow the user to explore the plots in more depth. The following points address some of the potentially powerful bokeh tools. Detailed documentation on bokeh tools is provided on their website[5].

· HoverTool: This tool allows the user to read the values of data points by simply hovering the cursor over the data points in question. As the user moves the cursor across a plot, HoverTool creates a pop-up window showing the values of the data points under the cursor. It can also be set to track the cursor across the x-axis and list the y-coordinate for each point. HoverTool is not among the tools that bokeh adds to a figure by default, so some coding is required to add it and make it display correctly.

· BoxZoomTool: The box zoom tool allows the user to do exactly what it says: click and drag to specify a box to zoom in on. This handy feature allows closer inspection of key points in the data set, without creating entirely new plots.

· PanTool: The pan tool again does exactly what it says: it allows users to click and drag the mouse to pan the plot. This is especially useful when combined with the BoxZoomTool, as the user can then zoom in on a small subset of the data (rather than seeing the entire plot) and pan to the sides to see surrounding data points.
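One way these tools might be attached to a figure is sketched below, assuming a reasonably recent bokeh release. The sample values are invented, and the tooltip labels are only one possible choice; the bokeh tools documentation[5] lists the full set of options.

```python
from bokeh.models import HoverTool
from bokeh.plotting import figure

# Hypothetical sample values standing in for the measured data.
Time = [0.0, 0.5, 1.0, 1.5]
Flow = [1.5, 1.6, 1.4, 1.5]

# Request the pan and box zoom tools by name when creating the figure.
p1 = figure(x_axis_label='Time (hr)', y_axis_label='Flow Rate (gal/min)',
            tools='pan,box_zoom,reset')
p1.scatter(Time, Flow, color='red', legend_label='Measured')

# Tell HoverTool which values to show; @x and @y refer to the plotted columns.
p1.add_tools(HoverTool(tooltips=[('Time (hr)', '@x'), ('Flow Rate (gal/min)', '@y')]))

# List the tools now attached to the figure.
ToolNames = [type(Tool).__name__ for Tool in p1.tools]
print(ToolNames)
```

Calling show(p1), imported from bokeh.plotting, would then open the plot in a browser with these tools in the toolbar.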

matplotlib

matplotlib provides another concise, intuitive way to create plots. The following lines of code provide an example of how to create the same plot as in the bokeh example.

 

fig = plt.figure(figsize = (10,5))

plt.plot(Data['Test Time (hr)'], Data['Flow Rate (gal/min)'], marker = 'o', color = 'red', linestyle = 'None')

plt.plot(Data['Time of Day (hr)'], Data['Water Flow Rate (gal/min)'], marker = 'None', color = 'blue', linestyle = 'solid')

plt.xlabel('Time (hr)')

plt.ylabel('Water Flow Rate (gal/min)')

plt.legend(['Measured', 'Specified'])

 

The following statements explain how the matplotlib code functions:

· The first line creates a figure using the matplotlib.pyplot.figure function (assuming matplotlib.pyplot has been imported as plt). It names the figure ‘fig’ and uses the figsize argument to set the figure size to (10,5), in inches[6].

· The second line adds a data set representing the measured data to the plot. The first entry specifies that Data[‘Test Time (hr)’] be used for the x data, and the second entry specifies Data[‘Flow Rate (gal/min)’] for the y data. The remaining three modifiers state that the data should be represented as red ‘o’s with no line.

· The third line adds a data set representing the water flow profile specified in the test. It instructs matplotlib to use Data[‘Time of Day (hr)’] as the x-axis data and Data[‘Water Flow Rate (gal/min)’] as the y-axis data. The final modifiers state that the data should be represented using blue lines with no markers.

· The fourth and fifth lines state that the x and y axis labels are “Time (hr)” and “Water Flow Rate (gal/min)” respectively.

· The final line creates a legend stating that the first data set is called “Measured” and the second is called “Specified.”

This basic functionality creates a standard, easy to read 2-dimensional plot. However, the biggest advantage of matplotlib is its ability to create 3-dimensional plots. These are extremely useful when creating 2-dimensional regressions, such as the Unequal Flow case of drain water heat recovery[7], as they allow visualization of the more complicated models. To create 3-dimensional plots, the correct matplotlib function needs to be imported. Specifically, Axes3D must be imported from mpl_toolkits.mplot3d. After that, a 3-dimensional plot can be created with the following code.

 

fig = plt.figure(figsize = (10,5))

nDPlot = fig.add_subplot(111, projection='3d')

nDPlot.scatter(Data['Drain Flow Rate (gal/min)'], Data['Cold Flow Rate (gal/min)'], Data['Effectiveness (-)'])

nDPlot.set_xlabel('Drain Flow Rate (gal/min)')

nDPlot.set_ylabel('Cold Flow Rate (gal/min)')

nDPlot.set_zlabel('Effectiveness (-)')

 

The following statements describe how this code functions:

· The first line is a repetition of the code from the 2-dimensional plot, creating the figure and setting the figure size.

· The second line creates a subplot in the figure, and states that it will be 3-dimensional. Since no other subplots are added, this plot fills the entire space specified by the previous line.

· The third line adds data to the figure. It uses Drain Flow Rate as the x data, Cold Flow Rate as the y data, and Effectiveness as the z data.

· The remaining three lines specify the labels for the x, y, and z axes to match the data sets represented.

Another valuable feature of 3-dimensional plots in matplotlib is that they can be made somewhat interactive. The previously described code will create a plot which, if saved, creates an image from a default angle. That default angle is often not the best angle, leaving users wanting the ability to choose a different one. This can be done by typing “%matplotlib qt” into the IPython console of the development environment before running the code. Then, when the program creates the plots, it will open new windows with each of the 3-dimensional plots. The user can then click and drag on the plots to change the view angle, and save the desired version.
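When an interactive window is not available, for instance in a fully automated run, the view angle can also be set in code with the view_init method of the 3-dimensional axes. The sketch below uses invented sample values and arbitrary angles, and selects the non-interactive Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; an assumption made for this sketch
import matplotlib.pyplot as plt

# Hypothetical sample values standing in for the test results.
Drain = [1.0, 2.0, 3.0]
Cold = [1.0, 1.5, 2.0]
Effectiveness = [0.55, 0.50, 0.45]

fig = plt.figure(figsize=(10, 5))
nDPlot = fig.add_subplot(111, projection='3d')
nDPlot.scatter(Drain, Cold, Effectiveness)

# Set the camera: elevation above the x-y plane and rotation about the z axis, in degrees.
nDPlot.view_init(elev=20, azim=45)

print(nDPlot.elev, nDPlot.azim)
```

A reasonable workflow is to find a good angle interactively once, then record those two numbers in the program so every saved plot uses the same view.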

Next Steps

This post has provided a high-level introduction to some packages that are useful for automated data analysis, and how they can be combined to analyze individual data files. It hasn't addressed the main weakness of automated data analysis: the potential for testing and analysis errors to go unnoticed. The next post will address this topic, and provide guidance on several different ways to check the quality of the testing and analysis, to ensure that all results are valid.

 

 

 

[1] Nobody expects the Spanish Inquisition! (Consider this a not-so-subtle reminder that the computer language Python is, in fact, a reference to Monty Python)

[2] http://pandas.pydata.org/pandas-docs/stable/

[3] https://bokeh.pydata.org/en/latest/

[4] https://matplotlib.org/

[5] https://bokeh.pydata.org/en/latest/docs/reference/models/tools.html

[6] Note that bokeh and matplotlib use different conventions for sizing plots. These two examples may yield plots of different sizes.

[7] Drain water heat recovery was introduced in the "Introducing an Individual Data Set" section of this post: https://www.1000x-faster.com/blog/2018/5/25/automated-laboratory-data-analysis-designing-the-data-set