Bat echolocation call identification for biodiversity monitoring: a probabilistic approach

Bat echolocation call identification is important in developing efficient, cost-effective methods for large-scale bioacoustic surveys for global biodiversity monitoring and conservation planning. Such methods need to provide interpretable probabilistic predictions of species, since they will be applied across many different taxa in a diverse set of applications and environments. We develop such a method using a multinomial probit likelihood with independent Gaussian process priors and study its feasibility on a data set from an ongoing study comprising 21 species, five families and 1800 bat echolocation calls collected from Mexico, a hotspot of bat biodiversity. We propose an efficient approximate inference scheme based on the expectation propagation algorithm and observe that the overall methodology significantly improves on currently adopted approaches to bat call classification: it is fully probabilistic and generalizes easily across different species and call types. Implementation of this method has the potential to provide robust species identification tools for acoustic bat monitoring programmes across a range of taxa and spatial scales.


Introduction
In the face of severe declines in populations of many wildlife species (Butchart et al., 2010; Tittensor et al., 2014), monitoring changes in ecological communities through time and space is critically important for conservation planning and decision making (Magurran et al., 2010). Bats (order Chiroptera), with over 1200 species, are the second-largest order of mammals (Simmons, 2005). Bats are considered to be important indicators of wider environmental health as they play key roles in ecosystems, providing many ecosystem services such as pollination and regulation of insect populations. Bats can also be monitored non-invasively by acoustic sensors, as many species actively use calls and the interpretation of their echoes to detect, localize and classify objects (echolocation) (Griffin, 1944). Acoustic monitoring schemes are therefore useful conservation tools to monitor and determine the effect of anthropogenic activity on biodiversity (Barlow et al., 2015; Newson et al., 2015; Jones et al., 2013; Amorim et al., 2014). However, there are many challenges to implementing such schemes, not least of which is the paucity of software tools to detect and classify bat calls to species. Scaling up existing schemes from local and regional levels to international and global scales requires substantial expansion of call reference libraries for echolocating species, alongside more sophisticated methods for detecting and classifying species.
Although echolocation is used by other taxa, the complexity and diversity of echolocation call design is unparalleled in bats. Bats echolocate by emitting frequencies between 9 and 212 kHz and show considerable diversity in call design. Calls can generally be categorized by their duration, bandwidth and use of harmonics (Maltby et al., 2009). Some of the variation in calls can be explained by shared evolutionary history: closely related species have similar call designs (Jones and Teeling, 2006). Other call variation is due to adaptation to particular tasks: for example, call frequency is influenced by the size of the objects to be detected, call duration by how far away objects are, and bandwidth (the range of frequencies in the call) by how well the species must deal with extremely cluttered environments (Jones and Holderied, 2007; Maltby et al., 2009). However, calls also show great intraspecific variation and flexibility caused by habitat, geography, sex and age, and in some cases call designs overlap greatly between species, which makes species identification very challenging (Murray et al., 2001; Schnitzler et al., 2003). Previous research on bat echolocation call identification has approached the problem from a data classification perspective. The most studied methods extract call parameters from spectrograms and then employ discriminant function analysis, support vector machines (SVMs) or artificial neural networks for supervised classification (Walters et al., 2012; Parsons and Jones, 2000; Fenton and Bell, 1981). These existing classification tools typically cover a small set of species, and point estimates for the model parameters are obtained by using high quality recordings from well-curated collections of bat calls.
Such methods do not attach probabilities to species identifications and, since point estimates are used, they do not generalize well to bat calls recorded in a diverse set of environments. Other studies, e.g. Skowronski and Harris (2006), have used hidden Markov and Gaussian mixture models. Although such methods provide probabilistic outputs, they are generative in nature, modelling the distribution of the bat calls in the collection. Ng and Jordan (2002) have shown that classification methods modelling the discriminant function directly generalize better than methods modelling the distribution of the training data, unless there is substantial domain knowledge; such knowledge is available in the speech recognition community but not for bat echolocation calls. There is an opportunity to tackle this problem by applying other methods from the statistics and machine learning literature, such as generalized linear models (McCullagh and Nelder, 1989), Gaussian processes (GPs) (Rasmussen and Williams, 2005) and random forests (Dietterich, 2000). However, there are several factors which need to be considered before choosing a particular methodology.
Firstly, it is important that the uncertainty of species identification is quantified correctly, as this has conservation and planning implications as well as being a useful input for a number of habitat suitability analyses, such as species distribution models (Dudik et al., 2007). Therefore the method should assign a species identification probability to each new sample, allowing researchers to control accuracy by setting acceptance thresholds. Moreover, models that are trained on a call reference library of high quality recordings will have to generalize to noisy, unstructured data collected under a diverse set of environments. To achieve this, any method should quantify, and take into account when making predictions, the uncertainty associated with model parameters estimated on a specific data set. Generalized linear models and GPs are theoretically well understood, with well-studied methods for parameter inference using Markov chain Monte Carlo (MCMC) (Dobson and Barnett, 2008) and approximate Bayesian inference methods (Rasmussen and Williams, 2005). Moreover, GPs do not require a priori specification of the functional form of the predictor function, e.g. linear or polynomial; rather, a prior distribution on functions is specified via a GP prior. This is an important characteristic, since bat calls are represented by their two-dimensional spectrograms, which exhibit highly non-linear features (Fig. 1 in Section 7).
In this study we use 1800 bat calls collected from 449 bats of 21 species in five families from Mexico, a hotspot of global bat biodiversity representing a highly diverse call assemblage. The data are presented in Section 6 and the signal processing methodology in Section 7. We apply a multinomial probit regression model with a GP prior (Girolami and Rogers, 2006) in Section 2, and we discuss how inference for such a model can be performed in Section 3. The nature of the model and the large numbers of classes and samples pose several problems for well-established methods for exact MCMC inference using the pseudomarginal algorithm of Andrieu and Roberts (2009); we discuss these issues in detail in Section 4. For this particular application we resort to approximate Bayesian inference methods, in particular the expectation propagation (EP) algorithm of Minka (2001), in Section 5. Experiments and results are presented in Section 8, and we conclude with a discussion and remarks on the challenges of Bayesian inference in large problems in Section 9.

Multinomial probit with latent Gaussian processes
Let $y$ and $X$ be observed data, where $y = (y_1, \ldots, y_N)^T$ with $y_n \in \{1, \ldots, C\}$ and $X = (x_1, \ldots, x_N)$ with $x_n \in \mathbb{R}^D$. The observed variables $y_n$ indicate bat species; for example, $y_n = c$ indicates that the $n$th observation is from the $c$th species in the data set, and $x_n$ are the respective covariates, usually $D$ measurements from the $n$th call's spectrogram. Also let $f : \mathbb{R}^D \to \mathbb{R}^C$ with latent values $f(x_n) = f_n = (f_n^1, f_n^2, \ldots, f_n^C)^T$ such that, when transformed by a sigmoid-like function, they give the class probabilities $p(y_n \mid f_n)$.
The multinomial probit model assumes that the species indicator variables $y_n$ have a multinomial distribution with probabilities given by a transformation of the latent function values $f_n$:
$$p(y_n \mid f_n) = \int \mathcal{N}(u_n \mid 0, 1) \prod_{j=1,\, j \neq y_n}^{C} \Phi(u_n + f_n^{y_n} - f_n^j)\, \mathrm{d}u_n. \tag{1}$$
In equation (1), $\mathcal{N}$ denotes the Gaussian density function and $\Phi$ its cumulative distribution function. There are alternative sigmoid functions to equation (1), such as the softmax function $\exp(f_n^{y_n}) / \sum_{j=1}^{C} \exp(f_n^j)$. However, equation (1) is convenient for obtaining a Gibbs sampler (Albert and Chib, 1993) or approximate Bayesian inference algorithms (Girolami and Rogers, 2006; Riihimaki et al., 2013).
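As an illustration of equation (1), the one-dimensional integral over the auxiliary variable $u_n$ can be approximated by simple numerical quadrature. The following sketch is ours, not part of the original analysis, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def multinomial_probit_lik(f_n, y_n, n_grid=400):
    """Numerically approximate p(y_n | f_n) from equation (1).

    f_n : (C,) array of latent function values for one call.
    y_n : observed class index (0-based).
    """
    # Riemann-sum grid over the auxiliary variable u_n.
    u = np.linspace(-8.0, 8.0, n_grid)
    du = u[1] - u[0]
    # Integrand: N(u | 0, 1) * prod_{j != y_n} Phi(u + f^{y_n} - f^j).
    integrand = norm.pdf(u)
    for j in range(len(f_n)):
        if j != y_n:
            integrand = integrand * norm.cdf(u + f_n[y_n] - f_n[j])
    return float(np.sum(integrand) * du)
```

Because the class probabilities must sum to 1 over $c = 1, \ldots, C$, summing this quantity over all classes provides a convenient numerical check.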
For the latent function values we assume independent zero-mean GP priors for each species, similarly to Rasmussen and Williams (2005); i.e. a priori we assume that latent variables from different species are independent. However, in the posterior, latent variables will not in general be independent, because of the interactions of the latent variables in the likelihood (1). Collecting all latent function values in $f = (f_1^1, \ldots, f_N^1, f_1^2, \ldots, f_N^2, \ldots, f_1^C, \ldots, f_N^C)^T$, the GP prior is
$$p(f \mid X, \theta) = \mathcal{N}\{f \mid 0, K(\theta)\}, \tag{2}$$
where $K(\theta)$ is a $CN \times CN$ block covariance matrix with block matrices $K^1(\theta), \ldots, K^C(\theta)$, each of size $N \times N$, on its diagonal. Elements $K^c_{i,j}$ define the prior covariance between the latent function values $f_i^c$ and $f_j^c$, governed by a covariance function $k(x_i, x_j \mid \theta)$ with unknown parameters $\theta$. A different covariance function can be used for each diagonal block of the covariance matrix, corresponding to different values of the species indicator variables.
The only requirement for the covariance function $k(x_i, x_j \mid \theta)$ is that it is positive definite. Here we describe in more detail the two types of covariance function used in this study. One of the most frequently used covariance functions is the squared exponential function
$$k(x_i, x_j \mid \theta) = \sigma^2 \exp\left(-\frac{\|x_i - x_j\|^2}{2 l^2}\right),$$
where $\sigma^2$ and $l$ are the magnitude and length scale parameters and $\theta = (\sigma^2, l)^T$. The length scale parameter controls the smoothness of the function: it measures, in the units of the input space, how far one must move for the function value to change drastically. The magnitude parameter controls the size of this change.
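For concreteness, the squared exponential covariance and the block-diagonal prior covariance $K(\theta)$ of equation (2) can be sketched as follows. This is an illustrative implementation of standard formulas, with names of our choosing.

```python
import numpy as np
from scipy.linalg import block_diag

def sq_exp_kernel(X1, X2, sigma2, length):
    """Squared exponential covariance: sigma2 * exp(-||x_i - x_j||^2 / (2 l^2))."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma2 * np.exp(-0.5 * d2 / length ** 2)

def prior_covariance(X, params):
    """CN x CN block-diagonal K(theta): one N x N block per species,
    each block with its own (sigma2, length) parameter pair."""
    return block_diag(*(sq_exp_kernel(X, X, s2, l) for s2, l in params))
```

The off-diagonal blocks are zero, reflecting the prior independence of latent values across species.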
The squared exponential function can be generalized to have one length scale parameter for each dimension of the input space, i.e.
$$k(x_i, x_j \mid \theta) = \sigma^2 \exp\left\{-\sum_{d=1}^{D} \frac{(x_{i,d} - x_{j,d})^2}{2 l_d^2}\right\},$$
where $\theta = (\sigma^2, l_1, \ldots, l_D)^T$. Such covariance functions have an interesting interpretation: the larger a length scale parameter, the smoother the function is in that dimension. In the limit $l_d \to \infty$ the function becomes constant in $x_d$, informing us that the particular input dimension carries no information for the task at hand. Thus, once estimates or posteriors of the parameter values have been found, we can inspect their values or distributions to assess the relevance of each dimension to the problem at hand. Finally, we can combine covariance functions that are defined on different sets of covariates or measurements for the same problem to form new valid covariance functions. For example, given two sets of measurements $X^{(1)}$ and $X^{(2)}$ for the same data, we can construct a composite covariance
$$k(x_i, x_j) = w_1\, k_1(x_i^{(1)}, x_j^{(1)} \mid \theta^{(1)}) + w_2\, k_2(x_i^{(2)}, x_j^{(2)} \mid \theta^{(2)}),$$
where the relative weights $w_1$ and $w_2$ sum to 1. The parameters of the covariance function then become $\theta = (w_1, w_2, \theta^{(1)}, \theta^{(2)})^T$. This is an important property of GPs for applications such as that considered here, since we often have different representations of the data which are not necessarily compatible. For example, in our application we have measurements extracted from the spectrograms of calls, but we also have the whole spectrogram. These two representations capture different aspects of the information in the call, and we would like to include both in our analysis. Information that is contained in the measurements is also contained in the spectrogram, but transformed in a non-linear way using expert knowledge, whereas some information in the spectrogram is not covered by the measurements at all. Using a weighted combination of the kernel functions allows inference about which representation is more suitable for the problem, after we obtain estimates or posterior samples for the weights.
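A sketch of the covariance function with one length scale per dimension and of the convex combination of two kernels built on different representations; again this is our illustration, not the paper's code.

```python
import numpy as np

def ard_sq_exp(X1, X2, sigma2, lengths):
    """Squared exponential with one length scale per input dimension.

    A very large length scale in dimension d makes the kernel (nearly)
    constant in x_d, i.e. that dimension is effectively irrelevant.
    """
    z = (X1[:, None, :] - X2[None, :, :]) / np.asarray(lengths)
    return sigma2 * np.exp(-0.5 * np.sum(z ** 2, axis=-1))

def composite_kernel(K1, K2, w1):
    """Convex combination w1*K1 + (1 - w1)*K2 of two covariance matrices
    computed on two representations of the same calls."""
    return w1 * K1 + (1.0 - w1) * K2
```

A convex combination of valid covariance matrices is again a valid covariance matrix, which is what makes the weighted construction legitimate.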

Parameter and predictive inference
Inference for multinomial probit regression with latent GPs entails two tasks: parameter inference and prediction. The only free parameters in the model are the covariance function parameters $\theta$ which, as discussed in the previous section, are interpretable. Given observed data $y$ and $X$, and a prior distribution $p(\theta)$, inference about the parameters can be made by calculating the posterior
$$p(\theta \mid X, y) \propto p(\theta) \int p(y \mid f)\, p(f \mid X, \theta)\, \mathrm{d}f, \tag{3}$$
where $p(y \mid f) = \prod_{n=1}^{N} p(y_n \mid f_n)$, which involves integrating out the latent function values $f$. By inspecting the posterior, or by obtaining an estimate $\hat{\theta}$, e.g. by taking $\hat{\theta}$ to be the maximum of the posterior, we can infer the most likely parameters for the data and interpret the results as in the previous section.
For making predictions for new, previously unobserved, data $x_\ast$ we need the predictive distribution for the unobserved species indicator variable $y_\ast$,
$$p(y_\ast = c \mid x_\ast, X, y) = \int p(y_\ast = c \mid f_\ast)\, p(f_\ast \mid x_\ast, X, f, \theta)\, p(\theta, f \mid X, y)\, \mathrm{d}f_\ast\, \mathrm{d}f\, \mathrm{d}\theta, \tag{4}$$
which involves integrating out both the latent variables $f_\ast$ and $f$ and the parameters $\theta$. In the predictive distribution (4) the second term in the integral is the GP predictive distribution for the new latent function values $f_\ast$. To simplify notation, let $K$ be the covariance matrix evaluated at $\theta$ and let $Q_\ast$ be the $CN \times C$ block-diagonal matrix
$$Q_\ast = \begin{pmatrix} k_\ast^1 & 0 & \cdots & 0 \\ 0 & k_\ast^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & k_\ast^C \end{pmatrix},$$
where $k_\ast^c$ is the vector whose $n$th element is $k^c(x_n, x_\ast \mid \theta)$, with $c$ indexing values of the species indicator variables $y$.
From the properties of multivariate normal variables, $f_\ast$ is also normally distributed, with
$$p(f_\ast \mid x_\ast, X, f, \theta) = \mathcal{N}(f_\ast \mid Q_\ast^T K^{-1} f,\; \Sigma_\ast), \qquad \Sigma_\ast = K_{\ast\ast} - Q_\ast^T K^{-1} Q_\ast,$$
where $K_{\ast\ast}$ is the $C \times C$ prior covariance of $f_\ast$. The last term in the integral (4) is the joint posterior of the latent function values and parameters, $p(\theta, f \mid X, y) \propto p(y \mid f)\, p(f \mid X, \theta)\, p(\theta)$.
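The conditional distribution of $f_\ast$ is the standard multivariate normal conditional, with mean $Q_\ast^T K^{-1} f$ and covariance equal to the prior covariance at $x_\ast$ minus $Q_\ast^T K^{-1} Q_\ast$. A minimal, numerically stable sketch (ours, with illustrative names):

```python
import numpy as np

def gp_predict_latent(K, Q_star, K_star_star, f):
    """Conditional mean and covariance of f* given f (standard GP result).

    K: CN x CN train covariance; Q_star: CN x C cross-covariance;
    K_star_star: C x C prior covariance at x*; f: CN latent values.
    """
    # Solve against K via Cholesky rather than forming K^{-1} explicitly.
    L = np.linalg.cholesky(K + 1e-9 * np.eye(K.shape[0]))
    A = np.linalg.solve(L, Q_star)   # L^{-1} Q_star
    v = np.linalg.solve(L, f)        # L^{-1} f
    mean = A.T @ v                   # Q*^T K^{-1} f
    cov = K_star_star - A.T @ A      # K** - Q*^T K^{-1} Q*
    return mean, cov
```

The small jitter added to the diagonal guards against numerical loss of positive definiteness in the Cholesky factorization.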
The necessary integrations for the parameter posterior and predictive distribution cannot be computed in closed form since the likelihood function p.y|f/ is not conjugate with the normal distribution. Therefore, we must resort either to Monte Carlo integration or approximate algorithms for GPs. Both options are discussed in the subsequent sections.

Markov chain Monte Carlo inference for the multinomial probit model
To perform posterior parameter inference for the multinomial probit model with latent GPs we must obtain samples from the posterior distribution (3). Sampling directly from the marginal posterior for the parameters $\theta$ is not possible, since the integration cannot be carried out analytically, but we can set up a Markov chain to sample from the joint posterior $p(\theta, f \mid X, y)$. Once the chain has converged, we can obtain the marginal for $\theta$ by simply discarding the samples for $f$. We can then inspect the histograms and posterior plots for the parameters and interpret them as discussed above.
For predictive inference we require $N_S$ samples from the joint posterior $p(\theta, f \mid X, y)$. Denoting the $i$th sample from the joint posterior by $\theta^{(i)}, f^{(i)}$, a Monte Carlo estimate of distribution (4) can be obtained by
$$p(y_\ast = c \mid x_\ast, X, y) \approx \frac{1}{N_S} \sum_{i=1}^{N_S} \int p(y_\ast = c \mid f_\ast)\, p(f_\ast \mid x_\ast, X, f^{(i)}, \theta^{(i)})\, \mathrm{d}f_\ast. \tag{5}$$
An important aspect of inference is to design efficient MCMC sampling schemes to obtain samples from the joint posterior. Obtaining samples with joint proposals on both $f$ and $\theta$ is problematic, since it is extremely unlikely to propose a set of latent function values $f$ and parameters $\theta$ that are compatible with each other. This is due to the strong dependence of $f$ on $\theta$ through the covariance function, as well as the high dimensionality of $f$. Such proposals therefore have very high rejection rates, making the chain very slow to converge.
An alternative is to design a Gibbs sampler (Robert and Casella, 2005) which at each iteration samples from the conditionals $p(f \mid X, y, \theta)$ and $p(\theta \mid X, y, f)$. Gibbs samplers for latent Gaussian models have been extensively studied in the literature (Papaspiliopoulos et al., 2007; Yu and Meng, 2011; Murray and Adams, 2010; Filippone and Girolami, 2014), and an extensive comparison of their efficiency and properties can be found in Filippone et al. (2013). Despite the considerable interest in MCMC inference for latent Gaussian models, there is little research on multinomial models for classification, especially for applications with large numbers of classes and observations such as the application considered in this paper.
A common problem with Gibbs samplers for latent Gaussian models is the strong dependence between the parameters and the latent function values, which results in slowly converging and poorly mixing chains (Filippone et al., 2013). Indeed, Filippone et al. (2012) reported chains of 5 million samples even with the reparameterizations of Papaspiliopoulos et al. (2007) and Yu and Meng (2011). This renders Gibbs sampling a non-viable approach for our application, where the number of observations is so high that even a single sample from the posterior requires a long computation time.
This strong dependence can be avoided by sampling $\theta$ from its marginal distribution, with $f$ integrated out. Unfortunately this is not possible when the likelihood function $p(y \mid f)$ is not conjugate with the multivariate normal distribution. Andrieu and Roberts (2009) showed that we can obtain samples from the exact posterior by using an estimate of the marginal, provided that it is unbiased. This has been exploited by Filippone and Girolami (2014) for binary probit classification with GPs, where it was shown to be the most efficient method compared with other Gibbs sampling algorithms. In the remainder of this section we show how such a sampler can be designed for multinomial probit classification and discuss its feasibility for the application considered in this paper.

Feasibility of the pseudomarginal Gibbs sampling
To sample from the marginal posterior of the parameters $\theta$ we can design a Metropolis–Hastings sampler (Robert and Casella, 2005). Using a proposal distribution $\pi(\theta' \mid \theta)$, we generate proposed samples $\theta'$ which are accepted with probability
$$\alpha = \min\left\{1,\; \frac{p(y \mid X, \theta')\, p(\theta')\, \pi(\theta \mid \theta')}{p(y \mid X, \theta)\, p(\theta)\, \pi(\theta' \mid \theta)}\right\},$$
where $p(y \mid X, \theta) = \int p(y \mid f)\, p(f \mid X, \theta)\, \mathrm{d}f$. Andrieu and Roberts (2009) showed that we can obtain a chain that converges to the correct posterior by replacing the marginal $p(y \mid X, \theta)$ with an unbiased estimate $\hat{p}(y \mid X, \theta)$. Such an estimate can be computed by an importance sampling algorithm with $N_I$ samples from an approximating distribution $q(f \mid y, X, \theta)$:
$$\hat{p}(y \mid X, \theta) = \frac{1}{N_I} \sum_{i=1}^{N_I} \frac{p(y \mid f^{(i)})\, p(f^{(i)} \mid X, \theta)}{q(f^{(i)} \mid y, X, \theta)}, \qquad f^{(i)} \sim q(f \mid y, X, \theta).$$
An important aspect of the efficiency of the pseudomarginal algorithm is the variance of the estimate $\hat{p}(y \mid X, \theta)$: the larger the variance, the higher the rejection rate for proposed samples. The variance of the estimate reduces to 0 when $q(f \mid y, X, \theta)$ is proportional to $p(y \mid f)\, p(f \mid X, \theta)$ (Robert and Casella, 2005). This is achieved when $q(f \mid y, X, \theta)$ is the posterior of $f$, from which we cannot sample directly for the multinomial probit model. Therefore we need to find an easy-to-sample distribution which approximates the posterior of $f$ as closely as possible. Both the EP algorithm (Minka, 2001) and the Laplace approximation (LA) (Williams and Barber, 1998) approximate the posterior for the latent variables with a multivariate normal distribution. Kuss and Rasmussen (2005) evaluated different approximations for binary probit classification with GPs, whereas Filippone and Girolami (2014) evaluated pseudomarginal algorithms based on the LA and EP approximations, also for the binary probit model. The results from both studies indicate that the EP algorithm better approximates the posterior and also results in a more efficient pseudomarginal algorithm than the LA.
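The importance sampling estimate of the marginal likelihood can be sketched as follows, assuming a Gaussian approximating distribution $q$ (e.g. from EP) and a function returning $\log p(y \mid f)$; all names are illustrative and this is not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_pseudo_marginal(log_lik, prior_mean, prior_cov, q_mean, q_cov,
                        n_samples=500, seed=0):
    """Log of the unbiased importance-sampling estimate of
    p(y|X,theta) = int p(y|f) p(f|X,theta) df, sampling f from q."""
    rng = np.random.default_rng(seed)
    f_samples = rng.multivariate_normal(q_mean, q_cov, size=n_samples)
    # Log importance weights: log p(y|f) + log prior(f) - log q(f).
    log_w = np.array([
        log_lik(f)
        + multivariate_normal.logpdf(f, prior_mean, prior_cov)
        - multivariate_normal.logpdf(f, q_mean, q_cov)
        for f in f_samples
    ])
    # Average the weights with the log-sum-exp trick for stability.
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))
```

When $q$ equals the prior and the likelihood is identically 1, every weight is exactly 1 and the estimate is exact; in general the closer $q$ is to the posterior of $f$, the smaller the variance.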
The variance of the pseudomarginal estimate is greatly affected by the total number of species, $C$, and the number of observations, $N$, as both affect the dimension of the latent function values $f$: the larger the dimension of $f$, the more samples from the approximating distribution are needed to reduce the variance. An additional complication that arises in our application is that the one-dimensional integral in the multinomial probit function (1) is not analytically tractable. In the binary probit function the integral can be computed analytically, since there is only one normal cumulative distribution function term in the integrand. Since equation (1) involves only a one-dimensional integral, we use Gaussian quadrature to obtain an estimate; however, this increases the computational overhead of the overall algorithm. We note that instead of the multinomial probit function we could have used the softmax function, but this would not allow us to derive the EP approximation for the posterior of $f$ and would restrict us to the LA.
In Table 1 we show the variance of the log-pseudomarginal for various sample sizes. We also compare the variance on the full data set that is used in this study (21 species and 1432 observations) with the variance on a subset with only three species and 200 observations. The approximating distribution was obtained by the EP algorithm that is described in the next section. We can see that even with 1000 samples the variance is substantial, leading to very high rejection rates for our sampler. Although there are several methods for reducing the variance of Monte Carlo estimates, e.g. Assaraf and Caffarel (1999) and Mira et al. (2013), we do not pursue this further, since such methods would further complicate implementation and add computational overhead per iteration which, as discussed below, is already prohibitive. The computation time for calculating a single estimate of the pseudomarginal for $N_I = 500$ using the full data set of 21 species and 1432 observations is approximately 4.9 min on a four-core Intel i7 processor with 16 Gbytes of random-access memory. The computation time is largely dominated by the EP approximation, which scales as $O(CN^3)$, with the sampling of the approximate posterior and the Monte Carlo estimate of the likelihood adding significant constant terms. We have tried parallel and graphics processing unit implementations of the EP approximation; however, the benefits of both were diminished by the large communication overheads, either between different nodes in the cluster or between the main memory and graphics processing unit memory. Additionally, for predictive inference we must sum over all posterior samples and also integrate out the latent function values $f_\ast$ for the new observations $x_\ast$, which is also performed by using Monte Carlo sampling.
All of the above indicate that full Bayesian inference for high dimensional latent variable models with large data sets remains a very challenging problem, both computationally and statistically. Moreover, Chopin and Ridgway (2015) recently showed that for binary classification problems the EP approximation provides very good results and is sometimes preferable to full MCMC inference because of its low computational cost and ease of implementation.

Approximate Bayesian inference with expectation propagation
Approximate Bayesian inference for latent variable models relies on point estimates of the parameters $\theta$ instead of samples from the posterior. Such estimates are usually obtained by finding the parameters that maximize the posterior or some approximation to it, i.e. $\hat{\theta} = \arg\max_\theta\, p(\theta \mid X, y)$. Predictive inference is then conditional on the parameter estimate:
$$p(y_\ast = c \mid x_\ast, X, y) = \int p(y_\ast = c \mid f_\ast)\, p(f_\ast \mid x_\ast, X, f, \hat{\theta})\, p(f \mid X, y, \hat{\theta})\, \mathrm{d}f_\ast\, \mathrm{d}f. \tag{6}$$
As discussed in the previous section, the posterior (3) for the parameters cannot be obtained analytically, since it requires integration over the latent variables $f$. Similarly, we must also integrate over $f$ in equation (6), where the last term in the integral is the posterior of the latent variables conditioned on the parameters:
$$p(f \mid X, y, \theta) = \frac{1}{Z}\, p(f \mid X, \theta) \prod_{n=1}^{N} p(y_n \mid f_n). \tag{7}$$
Here $Z$ is the normalizing constant, or marginal likelihood,
$$Z = \int p(f \mid X, \theta) \prod_{n=1}^{N} p(y_n \mid f_n)\, \mathrm{d}f,$$
which is proportional to the posterior of the parameters. Thus, if we find a convenient approximation of the posterior for $f$ and its normalizing constant, we can obtain a parameter estimate by maximizing the marginal and analytically integrate out $f$ in the predictive distribution. Following Riihimaki et al. (2013), in the EP method the posterior is approximated by
$$q_{\mathrm{EP}}(f \mid X, y, \theta) = \frac{1}{Z_{\mathrm{EP}}}\, p(f \mid X, \theta) \prod_{n=1}^{N} \tilde{t}_n(f_n \mid \tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n), \tag{8}$$
where $\tilde{t}_n(f_n \mid \tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n) = \tilde{Z}_n\, \mathcal{N}(f_n \mid \tilde{\mu}_n, \tilde{\Sigma}_n)$ are local likelihood approximation terms with parameters $\tilde{Z}_n$, $\tilde{\mu}_n$ and $\tilde{\Sigma}_n$. In other words, each multinomial probit likelihood term in equation (7) is approximated by a scaled Gaussian function.
After initializing the parameters of the approximate terms, each term is updated sequentially. First, the $n$th term is removed from the approximate posterior to give the cavity distribution
$$q_{-n}(f_n) \propto q_{\mathrm{EP}}(f_n \mid X, y, \theta)\, \tilde{t}_n(f_n \mid \tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n)^{-1}. \tag{9}$$
Distribution (9) is then combined with the exact $n$th likelihood term to form the tilted distribution
$$\hat{p}_n(f_n) = \hat{Z}_n^{-1}\, q_{-n}(f_n)\, p(y_n \mid f_n). \tag{10}$$
A Gaussian approximation $\hat{q}_n(f_n)$ to $\hat{p}_n(f_n)$ is then obtained by matching their moments, and the parameters of the $n$th approximate term are updated so that the mean and covariance of the approximate marginal $q_{\mathrm{EP}}(f_n)$ are consistent with $\hat{q}_n(f_n)$, i.e.
$$\tilde{t}_n^{\,\mathrm{new}} \propto \hat{q}_n(f_n)\, q_{-n}(f_n)^{-1}.$$
At every update step the approximate posterior (8) is also updated.
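The moment-matching step is easiest to see in the binary probit case, where the moments of the tilted distribution are available in closed form (Rasmussen and Williams, 2005). The multinomial case discussed below replaces this step with an inner approximation; the following sketch of the binary case is ours and is only meant to illustrate one EP site update.

```python
import numpy as np
from scipy.stats import norm

def probit_tilted_moments(y, mu_cav, var_cav):
    """Moments of the tilted distribution N(f | mu_cav, var_cav) * Phi(y f),
    with y in {-1, +1}: analytic in the binary probit case.

    Returns (Z_hat, tilted mean, tilted variance).
    """
    z = y * mu_cav / np.sqrt(1.0 + var_cav)
    ratio = norm.pdf(z) / norm.cdf(z)
    mean = mu_cav + y * var_cav * ratio / np.sqrt(1.0 + var_cav)
    var = var_cav - var_cav ** 2 * ratio * (z + ratio) / (1.0 + var_cav)
    return norm.cdf(z), mean, var
```

These moments then determine the updated site parameters via the ratio $\hat{q}_n(f_n)\, q_{-n}(f_n)^{-1}$, exactly as in the update rule above.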
Unlike the binary probit case, where the tilted distribution is univariate and its moments are easy to compute, the tilted distribution for the multinomial probit model is $C$ dimensional. Previous work on EP approximations for the multinomial probit model (Girolami and Zhong, 2007) further approximated the moments of the tilted distribution by using the LA, which assumes that the distribution can be closely approximated by a multivariate normal distribution. Seeger and Jordan (2004) used $C$-dimensional numerical integration, which is problematic for large values of $C$. Kim and Ghahramani (2006) replaced the multinomial probit function with a threshold function, which results in an EP algorithm similar to that for the binary problem. However, the algorithm scales as $O\{(C-1)^3 N^3\}$, which again becomes prohibitively expensive for large $C$.
The recent work of Riihimaki et al. (2013) proposed to approximate the tilted distribution itself with EP. They called their method nested EP, since an inner EP algorithm is used at each iteration of the outer EP algorithm to approximate the moments of the tilted distribution. However, they showed that the algorithm can be seen as a standard EP algorithm with $N(C-1)$ approximate terms, and thus the inner EP algorithm requires only a single iteration, provided that the approximate term parameters are not reinitialized but are started from their previous values. Furthermore, they showed that the structure of the site precision matrices $\tilde{\Sigma}_n^{-1}$ is such that an efficient implementation which scales as $O\{(C-1)N^3\}$ is possible. This is similar to the computational complexity of the LA, even though EP gives a better approximation to the posterior. Details of the approximation to the tilted distribution and the EP algorithm can be found in the on-line supplementary material of this paper or in Riihimaki et al. (2013).
Given the EP approximation to the posterior of $f$ and its marginal likelihood, we can use them to estimate parameters and to obtain the predictive distribution (6). The integration over $f$ in expression (6) becomes analytically tractable once the last term in the integral is replaced with the EP approximation. However, we still need to integrate out $f_\ast$; this integral can be approximated in the same way as the normalizing constant of the tilted distribution is computed within the EP algorithm. More details and an efficient implementation are provided in the supplementary material.

Description of the data set
Bat echolocation calls were recorded across north and central Mexico from June to November 2012 and from February to May 2013. We used 10 mist nets erected at ground level (0-3 m), set from sunset to sunrise, to capture bats. Live-trapped bats were measured and identified to species level by using field keys (Medellín et al., 2008; Ceballos and Oliva, 2005); bat taxonomy followed Simmons (2005). We constructed an echolocation call library by recording the calls of captured individuals using two different techniques. Echolocation calls were recorded with a Pettersson 1000x bat detector (Pettersson Elektronik AB, Uppsala, Sweden), set to record calls manually, in real time and at full spectrum, with a 500-kHz sampling rate. Each recording consists of multiple calls from a single individual bat.
In total our data set consists of 21 species in five families, 449 individual bats and 8429 calls (Table 2). Care must be taken when splitting the data into training and test sets during cross-validation to ensure that calls from the same individual are not in both sets; for this reason we split our data set by recordings instead of calls. For species with fewer than 100 recordings we include as many calls as possible, up to a maximum of 100 calls per species. After this selection process each fold has approximately 1400 calls for training and 200 calls for testing from 21 species. The raw data as well as the post-processed and fivefold cross-validation sets are available to download as supplementary material for this paper from http://www.engage-project.org/.
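The recording-level split can be sketched as follows, with a hypothetical array of recording identifiers (one per call); this mirrors the constraint described above but is not the paper's code.

```python
import numpy as np

def split_by_recording(recording_ids, test_fraction=0.2, seed=0):
    """Train/test split at the recording level, so that calls from the same
    recording (and hence the same individual) never appear in both sets.

    recording_ids: one identifier per call (hypothetical example data).
    Returns boolean train and test masks over the calls.
    """
    rng = np.random.default_rng(seed)
    unique = np.unique(recording_ids)
    rng.shuffle(unique)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_recs = set(unique[:n_test])
    test_mask = np.array([r in test_recs for r in recording_ids])
    return ~test_mask, test_mask
```

Splitting at the call level instead would leak near-duplicate calls from the same bat into the test set and inflate the apparent accuracy.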

Signal processing and data representation
The vector representation $x_n$ for each call is constructed by extracting call parameters from the spectrograms, following Walters et al. (2012), using Sonobat version 3.0 software (Szewczak, 2010). The spectrogram of a call is calculated by using a Hamming window of size 256 with 95% overlap and a fast Fourier transform length of 512 (Fig. 1). The frequency range of the spectrogram is thresholded by removing frequencies below 5 kHz and above 210 kHz; this reduces the size of the spectrogram and removes low and high frequency noise, since no bats echolocate outside this frequency range. In total 31 parameters are calculated, including call duration in milliseconds, the highest and lowest frequencies of the call, total frequency spread, the frequency with maximum amplitude and the frequencies at the start and end of the call (see the on-line supplementary material, Section 6, for full details).

Fig. 1. Example of spectrograms of calls from four different species; see the text for details on spectrogram computation.

All 31 call parameters are extracted by using the Sonobat version 3.0 software and concatenated in the vector $x_n$, and a squared exponential kernel with individual length scales for each parameter is used for the GP classifier.
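The spectrogram settings described above can be reproduced, for example, with `scipy.signal.spectrogram`; the sketch below is ours, and a 95% overlap of the 256-sample Hamming window corresponds to `noverlap = 243`.

```python
import numpy as np
from scipy.signal import spectrogram

def call_spectrogram(call, fs=500_000):
    """Magnitude spectrogram with the settings described in the text:
    Hamming window of 256 samples, 95% overlap, FFT length 512,
    keeping only the 5-210 kHz band."""
    f, t, S = spectrogram(call, fs=fs, window="hamming", nperseg=256,
                          noverlap=243, nfft=512, mode="magnitude")
    keep = (f >= 5_000) & (f <= 210_000)
    return f[keep], t, S[keep, :]
```

With a 500-kHz sampling rate and an FFT length of 512, the frequency resolution is roughly 1 kHz per bin, which is adequate for the call parameters described above.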
Although extracting call parameters from call spectrograms captures some of the call characteristics and shape, a large amount of information is discarded, e.g. the harmonics. An alternative to characterizing a call by predefined parameters is to use its spectrogram directly. However, owing to differences in call duration, the spectrograms would need to be normalized to the same length by using some form of interpolation. In this work we instead borrow ideas from speech recognition (Sakoe and Chiba, 1978) and previous work on bird call classification (Damoulas et al., 2010) and employ the dynamic time warping (DTW) kernel to compare two calls' spectrograms directly.
Given two calls i and j from the library and their spectrograms S_i and S_j, where S_i ∈ C^{F×W} with F being the number of frequency bands and W the number of windows, the dissimilarity matrix D_{i,j} ∈ R^{W×W} is constructed such that its (w, v)th entry is the distance between the magnitudes of the wth window of S_i and the vth window of S_j,

D_{i,j}(w, v) = || |S_i(·, w)| − |S_j(·, v)| ||.    (12)

DTW uses the dissimilarity matrix to stretch or expand spectrogram S_i over time to match S_j, calculating the optimal warping path with the smallest alignment cost c_{i,j} by dynamic programming. Fig. 2 illustrates the optimal warping path for two calls in the library.

Fig. 2. Example of the DTW optimal warping path for two call spectrograms from the same species: (a), (b) call spectrograms; (c) dissimilarity matrix D_{i,j} and optimal warping path.
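The dissimilarity matrix and alignment cost can be sketched as below. This is an assumed implementation using column-wise Euclidean distances and the standard three-move DTW recursion; the paper's exact step pattern and any slope constraints may differ:

```python
import numpy as np

def dtw_alignment_cost(S_i, S_j):
    """DTW alignment cost between two magnitude spectrograms of shapes
    (F, W_i) and (F, W_j). Column-wise Euclidean distances form the
    dissimilarity matrix; dynamic programming finds the cheapest
    monotone warping path from (0, 0) to (W_i-1, W_j-1)."""
    W1, W2 = S_i.shape[1], S_j.shape[1]
    # Dissimilarity matrix: D[w, v] = ||S_i[:, w] - S_j[:, v]||
    D = np.linalg.norm(S_i[:, :, None] - S_j[:, None, :], axis=0)
    # Cumulative cost with the three admissible moves
    # (advance in i, advance in j, advance in both)
    C = np.full((W1, W2), np.inf)
    C[0, 0] = D[0, 0]
    for w in range(W1):
        for v in range(W2):
            if w == 0 and v == 0:
                continue
            best_prev = min(
                C[w - 1, v] if w > 0 else np.inf,
                C[w, v - 1] if v > 0 else np.inf,
                C[w - 1, v - 1] if w > 0 and v > 0 else np.inf)
            C[w, v] = D[w, v] + best_prev
    return C[-1, -1]
```

Two identical spectrograms align along the diagonal at zero cost, while any temporal stretching or compression incurs a positive cost.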
For each call we construct a vector representation x_n by computing the optimal warping paths with all N calls from the library and concatenating the alignment costs such that x_n = (c_{n,1}, ..., c_{n,N}). We then use the squared exponential covariance function for the covariance matrix of the GP classifier.
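Building the DTW feature vector and the squared exponential covariance can be sketched as follows. The helper names are illustrative, and a single shared length scale is shown for brevity (the unit-length normalization of the cost vector is described in the next section):

```python
import numpy as np

def dtw_feature_vector(costs_to_library):
    """x_n = (c_{n,1}, ..., c_{n,N}), normalized to unit length."""
    x = np.asarray(costs_to_library, dtype=float)
    return x / np.linalg.norm(x)

def squared_exponential(X, sigma2=1.0, length_scale=1.0):
    """Gram matrix k(x, x') = sigma2 * exp(-||x - x'||^2 / (2 l^2))
    for the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return sigma2 * np.exp(-0.5 * sq / length_scale ** 2)
```

Each call is thus represented by its alignment costs against the whole library, and the GP covariance is computed between these cost vectors.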

Results and interpretation
We compare the classification accuracy of the multinomial probit regression with GP prior classifier by using the two representations that were discussed in Section 7. The values of the call parameters are normalized to have zero mean and standard deviation 1 by subtracting the mean and dividing by the standard deviation of the call parameters in the training set. For the 32 covariance function parameters σ² and λ_1, ..., λ_31 we use independent gamma priors with shape parameter 1.5 and scale parameter 10. For the DTW representation each call vector of optimal alignment costs is normalized to unit length and independent gamma(1.5, 10) priors are used for the magnitude and length-scale covariance function parameters. We also combine the covariance functions for both representations by using a linear combination as discussed in Section 2. The weights for the linear combination of the DTW and call parameters kernel functions are restricted to be positive and to sum to 1 and a flat Dirichlet prior is used.
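As an illustration of this covariance structure, the following sketch builds the automatic-relevance-determination squared exponential kernel (one length scale per call parameter) and the convex combination of the two kernel matrices. This is an assumed implementation, not the authors' code, and the GP inference itself is omitted:

```python
import numpy as np

def ard_se_kernel(X, sigma2, length_scales):
    """Squared exponential with an individual length scale per
    dimension (automatic relevance determination)."""
    Z = X / np.asarray(length_scales)          # rescale each parameter
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return sigma2 * np.exp(-0.5 * sq)

def combined_kernel(K_params, K_dtw, w):
    """Convex combination of the two Gram matrices, with weight
    w in [0, 1] on the DTW kernel (weights are positive and sum to 1)."""
    return w * K_dtw + (1.0 - w) * K_params
```

Because a convex combination of positive semidefinite matrices is positive semidefinite, the combined matrix remains a valid covariance for the GP.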
As a baseline we also compare with a multiclass SVM classifier by using the LibSVM software library (Chang and Lin, 2011). For the SVM we use the call parameters and the DTW representations with automatic relevance determination and squared exponential kernels respectively, and we also combine the two kernels similarly to the GP model that was discussed above. We use the same MATLAB code to precompute the kernels for the SVM as that used for the GP models; in that way we ensure as fair a comparison as possible. All kernel parameters and the regularization parameter C of the SVM were optimized with Bayesian optimization (Snoek et al., 2012) by using the MATLAB code that was provided by Gardner et al. (2014). Bayesian optimization for SVMs allows us to exploit kernels with many parameters, such as automatic relevance determination, in contrast with the commonly used method of grid search over a prespecified set of values. We have also trained SVMs with a squared exponential kernel for call shape parameters and DTW by using a grid search; however, the cross-validated error rate is almost the same, albeit with higher variance. Table 3 compares the misclassification rate by using various data representations and two different classification methods. Results are averages of a fivefold cross-validation. In all cases multinomial probit regression with GP priors is superior to SVMs. Therefore, although we sacrifice computational time and algorithmic simplicity, we do not sacrifice accuracy, while gaining significantly in interpretability. We can see that the DTW representation is significantly better for characterizing the species variations, achieving better classification accuracy irrespective of the classifier. However, results improve further when information from the call parameters is also considered.
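The precomputed-kernel SVM baseline can be sketched as below. We use scikit-learn's SVC (which wraps LibSVM) in place of the MATLAB/LibSVM setup of the paper, and the data and kernel here are synthetic stand-ins, so everything in this sketch is illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical tiny problem standing in for the precomputed call kernels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(10, 5))

def se_gram(A, B, length_scale=1.0):
    """Squared exponential Gram matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

# K_train is n x n between training calls; K_test is m x n between
# test calls and the same training calls (row/column order matters).
K_train = se_gram(X_train, X_train)
K_test = se_gram(X_test, X_train)

svm = SVC(C=1.0, kernel="precomputed")   # C is the soft-margin parameter
svm.fit(K_train, y_train)
pred = svm.predict(K_test)
```

Precomputing the Gram matrices in one place and feeding identical matrices to both classifiers is what makes the GP/SVM comparison fair: any difference in accuracy is then attributable to the classifier, not the kernel computation.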
This highlights both the importance of combining information from different representations and the value of prior knowledge in constructing call parameters. Additionally, the optimized weights for the kernel combination significantly favour the DTW covariance function, with a weight of about 0.9, in contrast with the call parameters, with a weight of about 0.1 (Fig. 3). If we fix the weight parameters to equal values we obtain a classification error rate of 0.22 ± 0.031, highlighting the importance of the DTW kernel matrix.
The independent length scales also allow us to interpret the discriminatory power of the 31 call parameters (Fig. 4). We believe that such an analysis is not only useful for validating the design of acoustic classification tools but also helps us to understand how different call designs have evolved within bats (Maltby et al., 2009). The call parameters with estimated values below 0.5 are parameters which describe call slope (PrcntMaxAmpDur, LedgeDuration, PrcntKneeDur, TimeFromMaxToFc and EndSlope; on-line supplementary material section 6). Call slope reflects the relative amount of the call that is at a constant frequency (a horizontal or narrowband call) or is frequency modulated (a vertical or broadband call). Different echolocating bat species have calls that are a mixture of constant frequency and frequency-modulated components which are adaptive to their particular environments. For example, one way of distinguishing targets from other objects in cluttered environments is to emit a narrowband call with a long duration, enabling detection of the fluttering wings of an insect against a still background. Alternatively, the bandwidth of the echolocation call can be increased (either through the bandwidth of a single call or through the use of harmonics), which helps to resolve different size classes (Maltby et al., 2009). Thus slope parameters might reflect the interspecific evolution of bat calls for different habitats and different targets and may provide an effective way to distinguish between species in acoustic classification tools. Recent studies of bat classification have also suggested that call shape might be a potential focus for new classification tools (Obrist et al., 2007; Walters et al., 2013; MacLeod et al., 2013; Lundy et al., 2011) and our study adds further support to this idea. In Fig. 5 the confusion matrix from the best of the fivefold classification results, with a 15% misclassification rate, is shown.
Similar results are obtained for the remaining folds but are not shown. There is an overall high accuracy for all classes. Misclassification rates are higher for species within the same family than for species of different families, which is consistent with the idea that some variation in call design is due to evolutionary constraints (closely related species are more similar) (Jones and Teeling, 2006; Jung et al., 2014). This suggests that including an evolutionary constraint, e.g. determined by phylogenetic relationships, might aid bat classification tools. We found higher misclassification rates for species within Vespertilionidae, which is well known to contain species with similar calls (Walters et al., 2012). For example, the vespertilionid Lasiurus xanthinus, class 18, had a relatively high misclassification rate and was often misclassified as Antrozous pallidus, class 13, which needs to be investigated further. However, we also found that the very similar calls of other species within Vespertilionidae, the Myotis species, are easily discriminated. This is a group whose species are traditionally extremely challenging to classify (Walters et al., 2012). However, as we used only a limited number of species, the success of our approach should be explored with larger numbers of Myotis species.

Fig. 5. Confusion matrix for the best of the fivefold classification results, with species numbered as in Table 2; bold boxes group species of the same family, and the two Myotis species are also grouped to highlight the good misclassification rates obtained; the number in the ith row and jth column is a count of the calls in the test set known to be in species i but predicted in species j.

Conclusion
Automatic bat call classification methods can significantly impact the way that bioacoustic surveys are designed and open up new opportunities for global, long-term biodiversity monitoring programmes. However, for such methods to be applicable they need to provide interpretable probabilistic outputs and to characterize properly the underlying parameter uncertainty, so that they generalize to different operational environments (e.g. different geographic regions or diverse bat assemblages). In this paper we have shown that multinomial probit regression with GP priors has these attributes. Although exact Bayesian inference is still a challenging problem for this model, efficient approximate inference algorithms are available and show promising results. Moreover, recent results (Chopin and Ridgway, 2015) show that the EP approximation that is used in this paper is a suitable alternative for cases where exact inference is not possible. The high accuracy that was obtained in this study in separating difficult-to-classify species (e.g. Myotis) lays the groundwork for further development of an automatic identification tool and suggests promising applications to a larger set of species. Additionally, the fully probabilistic output better quantifies the uncertainty in classification, making species monitoring and subsequent conservation planning more accurate.