In this post I will attempt to explain how I used `Pandas`

and `Matplotlib`

to quickly generate server requests reports on a daily basis.

**Problem To Be Solved:** Generate a Scatter plot of the number of requests to a particular URL along with the 99th, 95th and 90th percentile of requests for the duration of a day.

**Breakdown:**

**Reading multiple server access log files.****Parsing the timestamp fields so that the graph can be scaled appropriately.****Aggregating the timestamp fields.****Calculating the Quantile values.****Generating the Scatter Plot.**

In this post I will leave out the details related to point number 1 and will concentrate on the remaining points. Since this was a fairly mundane task and had to be done daily for a couple of weeks, I intended on automating the process as well as generating the scatter plot as quickly as possible.

### Parsing the timestamp fields so that the graph can be scaled appropriately

Let’s assume the input file contains only timestamps and the file has been read into a list using the following code.

```
lines = [line.rstrip('\n') for line in open(file_name)]
```

Since the x-axis of the scatter plot will contain timestamp values, the `lines`

`List`

which currently contains string values, needs to be converted into a `List`

of timestamp values. My initial instinct was to use an explicit `for`

loop. In an effort to speed up my script I came across python’s built-in `map()`

function and the concept of `List`

comprehension which reduce the `for`

loop overhead when the loop body is relatively simple, check here for more details. After a bit of benchmarking I settled on using the `List`

comprehension method.

1
2
3

from datetime import timedelta, datetime
dt_lst = [datetime.strptime(date_str, '%H:%M:%S') for date_str in lines]

### Aggregating the datetime fields

In this aggregation step the goal was to perform a group-by on the timestamps in order to calculate the number of requests per second. To achieve this I used `Numpy`

which is a package for scientific computing and `Pandas`

which is a data analysis library. The `Series`

and `DataFrame`

objects from the `Pandas`

library and the `ones()`

method from the `Numpy`

package are used to generate the data structure show in *Figure 1*, the code snippet below contains the details.

*Note: I could have probably achieved the same result using only Pandas.*

1
2
3
4
5
6
7
8
9
10
11
12
13

import pandas as pd
import numpy as np
from pandas import DataFrame, Series
# Create a Series named "Request_Time"
sr_dt = Series(dt_lst, name='Request_Time')
# Create a DataFrame using the Request_Time Series
df = DataFrame(sr_dt)
# Create an array of 1's using Numpy
count = sr_dt.size
ones = np.ones(count, dtype=int)
# Add the ones array to the DataFrame with the header "Counts"
df['Counts'] = ones

The `DataFrame`

object is a tabular data structure with labeled rows and columns, you may think of it as an Excel Spreadsheet. After creating the `DataFrame`

`df`

you will end up with a structure as displayed below.

*Figure 1: Pandas DataFrame*

The next step is to group-by the *Request_Time* column and sum up the counts, this is achieved by calling the `groupby`

method on the `DataFrame`

object.

1

grouped = df.groupby('Request_Time').count()

This gives us an aggregated `DataFrame`

as displayed below

### Calculating the Quantile values

Calculating the quantile values on the aggregated `DataFrame`

can be done by calling the `quantile()`

method which returns a `DataFrame`

containing the value and the `dtype`

.

1
2
3

ninety_five_quant = grouped.quantile(.95)[0] # [0] since we only need the quantile value
ninety_ninth_quant = grouped.quantile(.99)[0] # [0] since we only need the quantile value
ninety_eight_quant = grouped.quantile(.98)[0] # [0] since we only need the quantile value

### Generating the Scatter Plot

The Scatter plot is generated using the `Matplotlib`

library. Since we are plotting timestamp values on the x-axis we will use the `plot_date()`

method.

1

ax.plot_date(x, y, xdate=True, ydate=False, color='skyblue')

The x-axis contains the timestamp values, the y-axis contains the request counts, `xdate=True`

indicates the x-axis contains date values.

*Note: The Matplotlib library contains extensive documentation on all it’s API’s as well as a vast array of examples, hence I will refrain from going into more details in this post, you may check this link for more details.*

*The snippet below contains code related to rendering the Scatter Plot. I have taken a very simplistic and naive approach since it was good enough for my requirement.*

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

import matplotlib.pyplot as plt
total_req = 'Total request: ' + str(count)
nine_five_quant_str = '95th Quantile: ' + str(ninety_five_quant)
nine_eight_quant_str = '98th Quantile: ' + str(ninety_eight_quant)
nine_nine_quant_str = '99th Quantile: ' + str(ninety_ninth_quant)
# x and y axis values are extracted from the grouped DataFrame
x = grouped.index
y = grouped.values
print 'Plotting Graph..'
fig = plt.figure()
fig.suptitle('Scatter Plot', fontsize=14, fontweight='bold')
ax = fig.add_subplot(111)
fig.subplots_adjust(top=0.85)
ax.set_xlabel('Request Time')
ax.set_ylabel('Request Count')
text_x_axis_value = 0.9
ax.text(text_x_axis_value, 0.90, total_req, horizontalalignment='center', verticalalignment='center', transform = ax.transAxes)
ax.text(text_x_axis_value, 0.88, nine_five_quant_str, horizontalalignment='center', verticalalignment='center', transform = ax.transAxes)
ax.text(text_x_axis_value, 0.86, nine_eight_quant_str, horizontalalignment='center', verticalalignment='center', transform = ax.transAxes)
ax.text(text_x_axis_value, 0.84, nine_nine_quant_str, horizontalalignment='center', verticalalignment='center', transform = ax.transAxes)
ax.plot_date(x, y, xdate=True, ydate=False, color='skyblue')
plt.show()

### Summary

In this post we have seen how:

`List`

comprehension can give us better performance over an explicit`for`

loop when the loop body is relatively simple. Please check here for more details on the topic.- The
`Pandas`

library can be used to calculate aggregates and quantiles. We used the`Series`

and`DataFrame`

objects which are the core data structures of the`Pandas`

library to achieve this. - Generate a Time-Series Scatter Plot using the
`Matplotlib`

library.