10 Estimating Models

10.3 Evaluating Goodness-of-fit

No single method provides a complete assessment of goodness-of-fit; rather, model fitting should always be seen as a loop, in which we start by fitting a model, then examine the results, attempt to diagnose any model failures, and then improve our models accordingly. In the following, we describe different methods for assessing the goodness-of-fit.

Before turning to specific methods, we note that it is very important to evaluate the goodness of fit on data that was not used for fitting. The part of the data used for fitting the model parameters is called the training set and the part of the data reserved to evaluate the goodness of fit is called the test set. Data in the test set is said to be predicted by the model, while data in the training set is simply reproduced. Simply adding parameters to the model increases the quality of the fit on the training set. Given a sufficient number of parameters, the model might be able to reproduce the training set perfectly, but that does not mean that data in the test set is well predicted. In fact, it is usually the opposite: overly complicated models that are ‘overfit’ on the training data (i.e., which fit not only the reproducible signal in the training set but also the noise) will often do a bad job generalizing to and predicting new data in the test set. Thus, in the following we assume that the goodness-of-fit quantities are computed using ‘cross-validation’: parameters are estimated using the training set, and the goodness of fit is then quantified on the test set.

10.3.1 Comparing Spiking Membrane Potential Recordings

Given a spiking membrane potential recording, we can use traditional measures such as the squared error between model and recorded voltage to evaluate the goodness of fit. This approach, however, implicitly assumes that the remaining error has a Gaussian distribution (recall the close relationship between Gaussian noise and the squared error, discussed above). Under diffusive noise, we have seen (Chapter 8) that membrane potential distributions are Gaussian only when all trajectories started at the same point and none have reached threshold. Also, a small jitter in the firing time of the action potential implies a large error in the membrane potential, much larger than the typical subthreshold membrane potential variations. For these two reasons, the goodness-of-fit in terms of subthreshold membrane potential away from spikes is considered separately from the goodness-of-fit in terms of the spike times only.

In order to evaluate how well the model predicts the subthreshold membrane potential, we must compare the average error with the intrinsic variability. To estimate the first of these two quantities, we compute the squared error between the recorded membrane potential $u^{\rm exp}(t)$ and the model membrane potential $u^{\rm mod}(t)$, with spikes forced in the model at the observed spike times. Since spike times in the model are made synchronous with the experimental recordings, all voltage traces start at the same point. The Gaussian assumption thus being justified, we can average the squared error over all recorded times $t$ that are not too close to an action potential:

{\rm RMSE}_{\rm nm}=\sqrt{\frac{1}{T_{\Omega_1}N_{\rm rep}}\sum_{i=1}^{N_{\rm rep}}\int_{\Omega_1}\left(u^{\rm exp}_i(t)-u^{\rm mod}_i(t)\right)^2{\rm d}t}   (10.37)

where $\Omega_1$ refers to the ensemble of time bins at least 5 ms before or after any spike and $T_{\Omega_1}$ is the total number of time bins in $\Omega_1$. ${\rm RMSE}_{\rm nm}$ has index ${\rm n}$ for ‘neuron’ and index ${\rm m}$ for ‘model’; it estimates the error between the real neuron and the model.
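In practice, Eq. (10.37) amounts to masking out a window around every spike and averaging the squared voltage error over the remaining bins. The following minimal sketch does this for a single repetition, assuming voltages sampled on a common time grid; the function name and array layout are illustrative, only the 5 ms margin defining $\Omega_1$ comes from the text.

```python
import numpy as np

def rmse_nm(u_exp, u_mod, spike_times, dt, margin=5e-3):
    """RMSE between recorded and model voltage on Omega_1 (Eq. 10.37),
    for one repetition; average over repetitions for the full estimate."""
    t = np.arange(len(u_exp)) * dt
    mask = np.ones(len(u_exp), dtype=bool)
    for t_f in spike_times:
        mask &= np.abs(t - t_f) > margin   # drop bins within 5 ms of a spike
    return np.sqrt(np.mean((u_exp[mask] - u_mod[mask]) ** 2))
```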

To estimate the second quantity, we compare membrane potentials recorded during repeated presentations of the same stimulus. Despite the variability in spike timing, it is usually possible to find times sufficiently far from the spikes of every repetition and compute the averaged squared error

{\rm RMSE}_{\rm nn}=\sqrt{\frac{2}{T_{\Omega_2}N_{\rm rep}(N_{\rm rep}-1)}\sum_{i=1}^{N_{\rm rep}}\sum_{j=1}^{i-1}\int_{\Omega_2}\left(u^{\rm exp}_j(t)-u^{\rm exp}_i(t)\right)^2{\rm d}t}   (10.38)

where $\Omega_2$ refers to the ensemble of time bins far from the spike times of any repetition and $T_{\Omega_2}$ is the total number of time bins in $\Omega_2$. Typically, 20 ms before and 200 ms after a spike is considered sufficiently far. Note that with this approach we implicitly assume that spike-afterpotentials have vanished 200 ms after a spike. However, as we will see in Chapter 11, spike-afterpotentials can extend over more than one second, so that the above assumption is a rather bad approximation. Because the earlier spiking history will affect the membrane potential, the ${\rm RMSE}_{\rm nn}$ calculated in Eq. (10.38) is an overestimate.

To quantify the predictive power of the model, we finally compare the model error with the intrinsic error by taking the ratio

{\rm RMSER}=\frac{{\rm RMSE}_{\rm nn}}{{\rm RMSE}_{\rm nm}}   (10.39)

The root-mean-squared-error ratio (RMSER) reaches one if the model error matches the intrinsic error. An RMSER smaller than one indicates that the model could still be improved. Values larger than one are possible because ${\rm RMSE}_{\rm nn}$ is an overestimate of the true intrinsic error.
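A sketch of the companion quantities, under the same assumptions as above: `u_reps` is a list of voltage traces from repeated presentations of the same stimulus, and `mask` is a boolean array for $\Omega_2$ (bins far from every spike of every repetition); all names are illustrative.

```python
import numpy as np
from itertools import combinations

def rmse_nn(u_reps, mask):
    """Intrinsic error (Eq. 10.38): squared error averaged over all
    pairs of repetitions and over the time bins in Omega_2."""
    sq = [np.mean((u_i[mask] - u_j[mask]) ** 2)
          for u_i, u_j in combinations(u_reps, 2)]
    return np.sqrt(np.mean(sq))

# Eq. (10.39): RMSER close to one means the model error is comparable
# to the intrinsic variability of the neuron.
# rmser = rmse_nn(u_reps, mask) / rmse_nm(u_exp, u_mod, spike_times, dt)
```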

10.3.2 Spike Train Likelihood

The likelihood is the probability of generating the observed set of spike times $S(t)$ with the current set of parameters in our stochastic neuron model. It was defined in Eq. (9.10) of Sect. 9.2, which we reproduce here:

L^n(S|\theta)=\prod_{t^{(i)}\in S}\rho(t^{(i)}|S,\theta)\,\exp\left[-\int_0^T\rho(s|S,\theta)\,{\rm d}s\right]   (10.40)

where we write $\rho(t^{(i)}|S,\theta)$ to emphasize that the firing intensity at $t^{(i)}$ depends on the stimulus and the spike history, as well as on the model parameters $\theta$.

The likelihood $L^n$ is a conditional probability density and has units of inverse time to the power of $n$ (where $n$ is the number of observed spikes). To arrive at a more interpretable measure, it is common to compare $L^n$ with the likelihood of a homogeneous Poisson model with constant firing intensity $\rho_0=n/T$, i.e., a Poisson process that is expected to generate the same number of spikes in the observation interval $T$. The difference in log-likelihood between the neuron model and the Poisson model is finally divided by the total number $n$ of observed spikes in order to obtain a quantity with units of ‘bits per spike’:

\frac{1}{n}\log_2\frac{L^n(S|\theta)}{\rho_0^n\,e^{-\rho_0 T}}   (10.41)

This quantity can be interpreted as an instantaneous mutual information between the spike count in a single time bin and the stimulus, given the parameters. It is thus interpreted as the gain in predictability provided by the set of model parameters $\theta$. One advantage of using the log-likelihood of the conditional firing intensity is that it does not require multiple stimulus repetitions. It is especially useful for comparing the performance of different models on a given data set: better models achieve higher cross-validated likelihood scores.
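As an illustration, suppose the conditional intensity $\rho(t|S,\theta)$ has been evaluated on a fine time grid over the test set; Eq. (10.41) is then a few lines of code. The function name and the discretized-intensity representation are assumptions of this sketch.

```python
import numpy as np

def bits_per_spike(rho, spike_bins, dt):
    """Log-likelihood gain over a homogeneous Poisson process, in bits
    per spike (Eq. 10.41). rho: intensity per time bin; spike_bins:
    integer indices of the bins containing the observed spikes."""
    n = len(spike_bins)
    T = len(rho) * dt
    ll_model = np.sum(np.log(rho[spike_bins])) - np.sum(rho) * dt  # Eq. (10.40)
    rho0 = n / T                             # Poisson rate matched to the data
    ll_poisson = n * np.log(rho0) - rho0 * T
    return (ll_model - ll_poisson) / (n * np.log(2))
```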

10.3.3 Time-rescaling Theorem

For a spike train with spikes at $t^{(1)}<t^{(2)}<\dots<t^{(n)}$ and with firing intensity $\rho(t|S,\theta)$, the time-rescaling transformation $t\to\Lambda(t)$ is defined as

\Lambda(t)=\int_0^t\rho(x|S,\theta)\,{\rm d}x\,.   (10.42)

It is a useful and somewhat surprising result that the rescaled spike times $\Lambda(t^{(k)})$ (i.e., $\Lambda$ evaluated at the measured firing times) form a Poisson process with unit rate (73). A corollary of this time-rescaling theorem is that the intervals

\Lambda(t^{(k)})-\Lambda(t^{(k-1)})   (10.43)

are independent random variables with a unit-mean exponential distribution (cf. Chapter 7). Rescaling the time axis once more with the transformation

z_k=1-\exp\left[-\left(\Lambda(t^{(k)})-\Lambda(t^{(k-1)})\right)\right]   (10.44)

yields independent random variables uniformly distributed on the interval from zero to one.

Therefore, if the model $\rho(t|S,\theta)$ is a valid description of the spike train $S(t)$, then the resulting $z_k$ should have the statistics of a sequence of independent uniformly distributed random variables. As a first step, one can verify that the $z_k$’s are independent by looking at the serial correlation of the rescaled interspike intervals or by using a scatter plot of $z_{k+1}$ against $z_k$. Testing whether the $z_k$’s are uniformly distributed can be done with a Kolmogorov-Smirnov (K-S) test. The K-S statistic evaluates the distance between $P(z)$, the empirical cumulative distribution function of the $z_k$, and the cumulative distribution function of the reference distribution. In our case, the reference is the uniform distribution, so that its cumulative is simply $z$. Thus,

D=\sup_z\left|P(z)-z\right|.   (10.45)

The K-S statistic converges to zero as the empirical distribution $P(z)$ converges to the reference. The K-S test then compares $D$ with the critical values of the Kolmogorov distribution. Figure 10.6 illustrates two examples: one where the empirical distribution is far from a uniform distribution and one where the model rescales time correctly. See (173) for a multivariate version of this idea that is applicable to the case of coupled neurons. To summarize, the time-rescaling theorem together with the K-S test provides a useful goodness-of-fit measure for spike train data, with confidence intervals, that does not require multiple repetitions.
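The whole procedure fits in a few lines. The sketch below assumes the same discretized intensity as in the likelihood example above; the returned critical value is the standard large-sample 5% threshold of the Kolmogorov distribution. All names are illustrative.

```python
import numpy as np

def ks_time_rescaling(rho, spike_bins, dt):
    """Time-rescaling goodness-of-fit (Eqs. 10.42-10.45)."""
    Lambda = np.cumsum(rho) * dt                              # Eq. (10.42)
    z = np.sort(1.0 - np.exp(-np.diff(Lambda[spike_bins])))   # Eq. (10.44)
    k = len(z)
    # K-S distance between the empirical CDF of the z_k and the uniform CDF
    d_plus = np.max(np.arange(1, k + 1) / k - z)
    d_minus = np.max(z - np.arange(0, k) / k)
    D = max(d_plus, d_minus)                                  # Eq. (10.45)
    return D, 1.36 / np.sqrt(k)   # statistic and approximate 95% bound
```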

Fig. 10.6: Time-rescaling theorem as a goodness-of-fit measure, illustrating the K-S test by plotting the cumulative probability of $z_k$ as a function of the quantiles. A. Rescaling time using an inadequate model does not result in a uniform distribution of $z_k$, as can be seen by comparing the empirical distribution (thick black line) with the diagonal. Dashed lines illustrate 95% confidence bounds. B. As in A, but with a better rescaling of time; the empirical distribution follows the cumulative of the uniform distribution within the confidence bounds.

10.3.4 Spike Train Metric

Evaluating the goodness-of-fit in terms of the log-likelihood or the time-rescaling theorem requires that we know the conditional firing intensity $\rho(t|S,\theta)$ accurately. For biophysical models such as those of Chapter 2, complemented with a source of variability, the firing intensity is not available analytically. The intensity can be estimated numerically by simulating the model with different realizations of the noise, or by solving a Fokker-Planck equation, but this is sometimes impractical.

Another approach for comparing spike trains involves defining a metric between spike trains. Multiple spike timing metrics have been proposed, with different interpretations. A popular metric was proposed by Victor and Purpura (1996). Here, we describe an alternative framework for the comparison of spike trains that makes use of vector space ideas, rather than more general metric spaces.

Fig. 10.7: Distance and angular separation between spike trains seen as vectors. A. Three schematic spike trains, where the first two have the same number of spikes and roughly the same timing. B. The spike trains in A represented as vectors: $\mathbf{S}_1$ and $\mathbf{S}_2$ have the same length but slightly different orientations due to small differences in spike timing. The third spike train $\mathbf{S}_3$ is much longer due to its larger number of spikes; it lies at a distance $D_{23}$ from $\mathbf{S}_2$, with $D_{23}^2=||\mathbf{S}_2-\mathbf{S}_3||^2$, and at angle $\theta$.

Let us consider spike trains as vectors in an abstract vector space, with these vectors denoted in boldface: $\mathbf{S}$. A vector space is said to have an inner (or scalar) product if for each vector pair $\mathbf{S}_i$ and $\mathbf{S}_j$ there exists a unique real number $\left(\mathbf{S}_i,\mathbf{S}_j\right)$ satisfying the following axioms: commutativity, distributivity, associativity and positivity. There are multiple candidate inner products satisfying the above axioms. The choice of inner product will be related to the type of metric being considered. For now, consider the general form

\left(\mathbf{S}_i,\mathbf{S}_j\right)=\int_0^T\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}K_\Delta(s,s')\,S_i(t-s)\,S_j(t-s')\,{\rm d}s\,{\rm d}s'\,{\rm d}t,   (10.46)

where $K_\Delta$ is a two-dimensional coincidence kernel with scaling parameter $\Delta$, and $T$ is the maximum length of the spike trains. $K_\Delta$ is required to be a non-negative function with a global maximum at $s=s'=0$. Moreover, $K_\Delta(s,s')$ should fall off rapidly, so that $K_\Delta(s,s')\approx 0$ for all $s,s'>\Delta$. One class of examples are the separable kernels $K_\Delta(s,s')=k_1(s)k_2(s')$. For instance, $k_1(s)=k_2(s)=\frac{1}{\Delta}e^{-s/\Delta}\Theta(s)$ is the kernel used by van Rossum (531). The scaling parameter $\Delta$ must be small, much smaller than the length $T$ of the spike train.

For the comparison of spike trains seen as vectors, the notions of angular separation, distance, and norm are particularly important. The squared norm of a spike train is written $||\mathbf{S}_i||^2=\left(\mathbf{S}_i,\mathbf{S}_i\right)$. With $K_\Delta(s,s')=\delta(s)\delta(s')$, we observe that $\left(\mathbf{S}_i,\mathbf{S}_i\right)=\int_0^T S_i(t)\,{\rm d}t=n_i$, where $n_i$ is the number of spikes in $\mathbf{S}_i$. The norm of a spike train is therefore related to the total number of spikes it contains. The Victor and Purpura metric is of a different form than the one discussed here, but it has similar properties (see exercises).
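For concreteness, here is a minimal sketch of the inner product of Eq. (10.46) for the separable van Rossum kernel: each spike train, stored as a binary array with one entry per time bin, is convolved with $k(s)=\frac{1}{\Delta}e^{-s/\Delta}\Theta(s)$, and the filtered traces are multiplied and integrated. The function names and the binary-array representation are assumptions of this sketch.

```python
import numpy as np

def filtered(spikes, dt, delta):
    """Convolve a binary spike array with the causal exponential
    kernel k(s) = exp(-s/delta)/delta for s >= 0 (van Rossum)."""
    s = np.arange(0, 10 * delta, dt)          # truncate the kernel at 10*delta
    k = np.exp(-s / delta) / delta
    return np.convolve(spikes, k)[:len(spikes)]

def inner(S_i, S_j, dt, delta=0.01):
    """Inner product (S_i, S_j) of Eq. (10.46) with K(s,s') = k(s)k(s')."""
    return dt * np.sum(filtered(S_i, dt, delta) * filtered(S_j, dt, delta))
```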

The norm readily defines a distance, $D_{ij}$, between two spike trains:

D_{ij}^2=||\mathbf{S}_i-\mathbf{S}_j||^2=\left(\mathbf{S}_i-\mathbf{S}_j,\mathbf{S}_i-\mathbf{S}_j\right)=||\mathbf{S}_i||^2+||\mathbf{S}_j||^2-2\left(\mathbf{S}_i,\mathbf{S}_j\right).   (10.47)

The right-hand side of Eq. (10.47) shows that the distance between two spike trains is maximal when $\left(\mathbf{S}_i,\mathbf{S}_j\right)$ is zero. On the other hand, $D_{ij}^2$ becomes zero only when $S_i=S_j$, which implies $\left(\mathbf{S}_i,\mathbf{S}_j\right)=\left(\mathbf{S}_i,\mathbf{S}_i\right)=\left(\mathbf{S}_j,\mathbf{S}_j\right)$. Again consider $K_\Delta(s,s')=\delta(s)\delta(s')$: then $D_{ij}^2$ is the total number of spikes in $\mathbf{S}_i$ and $\mathbf{S}_j$, reduced by 2 for each spike in $\mathbf{S}_i$ that coincides with one in $\mathbf{S}_j$. For the following, it is useful to think of the squared distance between spike trains as a number of non-coincident spikes.

The cosine of the angle between $\mathbf{S}_i$ and $\mathbf{S}_j$ is

\cos\theta_{ij}=\frac{\left(\mathbf{S}_i,\mathbf{S}_j\right)}{||\mathbf{S}_i||\,||\mathbf{S}_j||}.   (10.48)

This angular separation relates to the fraction of coincident spikes. Fig. 10.7 illustrates the concepts of angle and distance for spike trains seen as vectors.
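With the inner product above, the distance of Eq. (10.47) and the angular separation of Eq. (10.48) follow directly (again a sketch, reusing `inner()` from the previous listing):

```python
import numpy as np

def distance(S_i, S_j, dt, delta=0.01):
    """D_ij of Eq. (10.47), via the expansion into inner products."""
    d2 = (inner(S_i, S_i, dt, delta) + inner(S_j, S_j, dt, delta)
          - 2.0 * inner(S_i, S_j, dt, delta))
    return np.sqrt(max(d2, 0.0))    # guard against round-off error

def cos_angle(S_i, S_j, dt, delta=0.01):
    """Cosine of the angle between two spike trains, Eq. (10.48)."""
    return inner(S_i, S_j, dt, delta) / np.sqrt(
        inner(S_i, S_i, dt, delta) * inner(S_j, S_j, dt, delta))
```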

Fig. 10.8: The spike density vector. A. A set of $N$ spike trains ($\mathbf{S}_1$, $\mathbf{S}_2$, ..., $\mathbf{S}_N$) is combined to yield an estimate of the spike density $\hat{\boldsymbol{\nu}}$. In the limit $N\rightarrow\infty$, the spike density converges to the instantaneous firing rate $\nu$. B. Schematic representation of the quantities in A. The variability $V$ measures the scatter of the individual spike trains around their mean $\hat{\boldsymbol{\nu}}_X$.

10.3.5 Comparing Sets of Spike Trains

Metrics such as the $D_{ij}$ described above can quantify the similarity between two spike trains. In the presence of variability, however, a simple comparison of two spike trains is not sufficient. Instead, the spike train similarity measure must be maximally sensitive to differences in the underlying stochastic processes.

We want to know if spike trains generated from a neuron model could well have been generated by a real neuron. We could simply calculate the distance between a spike train from the neuron and a spike train from the model, but neurons are noisy and we will find a different distance each time we repeat the recording. To achieve better statistics, we can compare a set of spike trains from the model with a set of spike trains from the neuron.

Let the two sets of spike trains be denoted by $X$ and $Y$, containing $N_X$ and $N_Y$ spike trains, respectively. First, it is useful to define some characteristics of such sets. A natural quantity to consider is the average of the squared norms of the spike trains within a set, say $X$:

\hat{L}_X=\frac{1}{N_X}\sum_{i=1}^{N_X}||\mathbf{S}_i^{(x)}||^2,   (10.49)

where the hat ‘$\hat{\ }$’ denotes that the quantity is an experimental estimate. We note that $\hat{L}_X$ is related to the average spike count. $L_X$ is exactly the average spike count if the inner product satisfies (i) $\int\!\!\int K_\Delta(s,s')\,{\rm d}s\,{\rm d}s'=1$ and (ii) $K_\Delta(s,s')=0$ whenever $|s-s'|$ is greater than the minimum interspike interval of any of the spike trains considered. The interpretation $L_X\sim$ spike count is helpful for the discussion in the remainder of this section. Furthermore, the averaged spike-train vector

\hat{\boldsymbol{\nu}}_X=\frac{1}{N_X}\sum_{i=1}^{N_X}\mathbf{S}_i^{(x)}   (10.50)

is another occurrence of the spike density seen in Chapter 7. It defines the instantaneous firing rate of the spiking process, $\nu(t)=\langle\hat{\boldsymbol{\nu}}\rangle$.

Fig. 10.9: Distance and angular separation between spike densities. A. Two spike densities corresponding to sums of the spike trains labeled $X$ and $Y$. B. Schematic representation of the densities $\hat{\boldsymbol{\nu}}_X$ and $\hat{\boldsymbol{\nu}}_Y$ seen as vectors, separated by the squared distance $||\hat{\boldsymbol{\nu}}_X-\hat{\boldsymbol{\nu}}_Y||^2$ and the angle $\theta$. The variability is shown as a gray cloud centered on the vectors.

In the vector space, $\hat{\boldsymbol{\nu}}_X$ can be thought of as lying at the center of the cloud of spike trains seen as vectors (Fig. 10.9); note that other definitions of a ‘mean spike train’ are possible (561). The size of the cloud quantifies the variability of the spike timing. The variability is defined as the variance

\hat{V}_X=\frac{1}{N_X-1}\sum_{i=1}^{N_X}||\mathbf{S}_i^{(x)}-\hat{\boldsymbol{\nu}}_X||^2.   (10.51)

A set of spike trains where the spikes always occur at the same times has low variability; when the spikes occur with some jitter around a given time, the variability is larger. Variability relates to reliability. While the variability is a positive quantity that cannot exceed $L_X$, the reliability is usually defined between zero and one, where one means perfectly reliable spike timing: $\hat{R}_X=1-\hat{V}_X/\hat{L}_X$.

Finally, we come to a measure of the match between the sets of spike trains $X$ and $Y$. The discussion of the neural code in Chapter 7 suggests that neuron models should reproduce the detailed time structure of the PSTH. We therefore define

\hat{M}=\frac{2\left(\hat{\boldsymbol{\nu}}_X,\hat{\boldsymbol{\nu}}_Y\right)}{\hat{R}_X\hat{L}_X+\hat{R}_Y\hat{L}_Y}.   (10.52)

We have $M$ (for match) equal to one if $X$ and $Y$ have the same instantaneous firing rate; the smaller $M$, the greater the mismatch between the spiking processes. The quantity $R_XL_X$ can be interpreted as a number of reliable spikes. Since $\left(\hat{\boldsymbol{\nu}}_X,\hat{\boldsymbol{\nu}}_Y\right)$ is interpreted as a number of coincident spikes between $X$ and $Y$, we can still regard $M$ as a factor counting the fraction of coincident spikes. A similar quantity can be defined for metrics that cannot be cast into an inner product (357).

If the kernel $K_\Delta(s,s')$ is chosen to be $k_g(s)k_g(s')$, where $k_g$ is a Gaussian distribution of width $\Delta$, then $M$ relates to a mean square error between PSTHs filtered with $k_g$. Therefore, the kernel used in the definition of the inner product (Eq. (10.46)) can be related to the smoothing filter of the PSTH (see Exercises).
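Putting the pieces together, the following sketch evaluates Eqs. (10.49)-(10.52) for two sets of binary spike arrays. It reuses `filtered()` from the inner-product sketch above; substituting a Gaussian kernel there makes the link to smoothed PSTHs explicit. All names are illustrative.

```python
import numpy as np

def set_stats(trains, dt, delta):
    """Density, mean squared norm and reliability of a set of spike trains."""
    F = np.array([filtered(s, dt, delta) for s in trains])
    nu = F.mean(axis=0)                               # density, Eq. (10.50)
    L = np.mean([dt * np.sum(f * f) for f in F])      # Eq. (10.49)
    V = dt * np.sum((F - nu) ** 2) / (len(F) - 1)     # variability, Eq. (10.51)
    return nu, L, 1.0 - V / L                         # reliability R

def match(X, Y, dt, delta=0.01):
    """Match factor M of Eq. (10.52) between two sets of spike trains."""
    nu_x, L_x, R_x = set_stats(X, dt, delta)
    nu_y, L_y, R_y = set_stats(Y, dt, delta)
    return 2.0 * dt * np.sum(nu_x * nu_y) / (R_x * L_x + R_y * L_y)
```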