Comparing series data

Problem: You collect a time series for a compartment in a CompartmentedModel using Monitor, or some other series data. You then want to compare these series between different experimental runs, or find the mean of several series. How do you wrangle the data into a form that can be worked with?

Solution: This is really a pandas question, but one that’s extremely common in network science and so very relevant to epydemic.

The issue is that time series land in a result set as arrays. This makes them hard to manipulate. What we would probably prefer is a DataFrame whose rows are the series and whose columns are the observation times, degrees, or whatever: extracting a sub-set of rows from the original result set DataFrame and then re-formatting them into a more suitable shape.

Fortunately pandas provides the functions we need. It can create a DataFrame from a list of lists, and we can then rename the columns to be meaningful.

As an example, suppose we’ve used Monitor to capture the progression of an SIR epidemic on an ER network:

from epydemic import SIR, Monitor, ProcessSequence, ERNetwork, StochasticDynamics
from epyc import Lab

lab = Lab()

N = int(1e5)
kmean = 50

lab[ERNetwork.N] = N
lab[ERNetwork.KMEAN] = kmean
lab[SIR.P_INFECTED] = 0.001
lab[SIR.P_INFECT] = 0.0001
lab[SIR.P_REMOVE] = 0.001
lab[Monitor.DELTA] = 1
lab['repetitions'] = range(10)

e = StochasticDynamics(ProcessSequence([SIR(), Monitor()]), ERNetwork())
lab.runExperiment(e)

This will result in a result set with result columns that capture the size of each locus in the SIR model – the SI edges and I nodes – over time. We can retrieve the column name of the time series for the I locus by:

c = Monitor.timeSeriesForLocus(SIR.INFECTED)

If we extract the result set for this experiment it will have 10 rows (one per repetition), and we can project-out just the time series using:

df = lab.dataframe()
tss = df[Monitor.timeSeriesForLocus(SIR.INFECTED)]

Then each row will have a single column, which will itself be an array – not quite what we wanted, so now we need to “explode” this list-like column into a row in a new DataFrame with a more convenient shape:

from pandas import DataFrame

infecteds = DataFrame(tss.values.tolist()).rename(columns=lambda i: i)

This converts the data to “wide” format, creating a DataFrame with the same rows but with a column per sample rather than one column with an array of samples.

Note

pandas has a method called DataFrame.explode that does something slightly different to what we do here, and so isn’t suitable (unfortunately).

The columns are labelled by the index of the sample point (we need to do this as otherwise the column labels get confused): since Monitor samples each experiment at the time time points, we could instead use the experimental times as column labels if we wanted. by changing the renaming. But there’s a problem: although all the samples happen at the same times, some experimental runs are longer than others. We therefore need to extract the sample points of the longest experimental run, and use these as column labels. Shorter runs will then get NaN (“not a number”) values for times that they didn’t sample, and this won’t affect any statistics we compute later. The incantation to do this is:

ts = df.loc[(df[Monitor.OBSERVATIONS].apply(len) == df[Monitor.OBSERVATIONS].apply(len).max())].iloc[0][Monitor.OBSERVATIONS]

(The outer df.loc[] extracts all the rows for which the length of the Monitor.OBSERVATIONS column matches the maximum length, and then uses .iloc[0] to extract the first such row – there might be several, but by definition they’ll all be the same – and returns its values.) We can use these for renaming if we want to:

infecteds = DataFrame(tss.values.tolist(), columns=ts)

We’ll also use this time series for the x-axis ticks on the plot.

Putting it all together we can, for example, generate a plot of the means and error bars of the size of a compartment over time, taken across a range of experiments. We’ll create this plot across a short range and space-out the samples to make a clearer plot, and use seaborn to tweak the presentation style.

import matplotlib
import matplotlib.pyplot as plt
import seaborn
matplotlib.style.use('seaborn')

fig = plt.figure(figsize=(7, 5))

maxt = 5000
step = 200

df = lab.dataframe()
ts = df.loc[(df[Monitor.OBSERVATIONS].apply(len) == df[Monitor.OBSERVATIONS].apply(len).max())].iloc[0][Monitor.OBSERVATIONS]
infecteds = DataFrame(df[Monitor.timeSeriesForLocus(SIR.INFECTED)].values.tolist()).rename(columns=lambda i: ts[i])

# compute the mean and standard error of the time series using pandas
infectedMeans = list(infecteds.mean())
infectedErrors = list(infecteds.std())

# plot the mean with error bars
plt.errorbar(ts[:maxt:step], [I / N for I in infectedMeans[:maxt:step]],
             yerr=[e / N for e in infectedErrors[:maxt:step]],
             color='blue', marker='.',
             ecolor='red', capsize=2)

plt.ylabel('$I(t)$')
plt.xlabel('$t$')
plt.title(f'Mean infected fraction of population over time (with error bars)\n$(N = {N}, \\langle k \\rangle = {kmean})$')

_ = plt.show()

This is just one choice of plotting style, of course: some people prefer to plot the raw data rather than just summary statistics. We can do this too, using the raw exploded data as well as the summaries:

fig = plt.figure(figsize=(7, 5))

# plot the raw data time series
for i in range(len(infecteds)):
    cs = list(infecteds.iloc[i])
    plt.scatter(ts[:maxt:step], [I / N for I in cs[:maxt:step]],
                color='lightblue', marker='.',                # raw data in a light shade
                label='raw' if i == 0 else None)              # only one label for all the raw plots

# plot the mean on top
plt.plot(ts[:maxt:step], [I / N for I in infectedMeans[:maxt:step]],
         color='blue', marker='.',                            # mean in a darker shade
         label='mean')

plt.ylabel('$I(t)$')
plt.xlabel('$t$')
plt.title(f'Mean infected fraction of population over time (with raw data)\n$(N = {N}, \\langle k \\rangle = {kmean})$')

_ = plt.show()

This has the advantage of showing any outliers in the raw data that the summary statistics will mask. This can be important when interpreting the data.