Dirichlet Process Mixtures of Order Statistics with Applications to Retail Analytics

The rise of "big data" has led to the frequent need to process and store datasets containing large numbers of high dimensional observations. Due to storage restrictions, these observations might be recorded in a lossy-but-sparse manner, with information collapsed onto a few entries which are considered important. This results in informative missingness in the observed data. Our motivating application comes from retail analytics, where the behaviour of product sales is summarised by the price elasticity of each product with respect to a small number of its top competitors. The resulting data are vectors of order statistics, due to only the top few entries being observed. Interest lies in characterising the behaviour of a product's competitors, and clustering products based on how their competition is spread across the market. We develop nonparametric Bayesian methodology for modelling vectors of order statistics that utilises a Dirichlet Process Mixture Model with an Exponentiated Weibull kernel. Our approach allows us added flexibility for the distribution of each vector, while providing parameters that characterise the decay of the leading entries. We implement our methods on a retail analytics dataset of cross-elasticity coefficients, and our analysis reveals distinct types of behaviour across the different products of interest.


Introduction
The field of retail analytics is concerned with understanding the purchasing behaviour of consumers, for purposes such as sales prediction, inventory management and coupon personalization (Silver et al., 2013; Rudin et al., 2013; Gunawardana and Shani, 2009; Huh and Rusmevichientong, 2009; Bajari et al., 2015; Ferreira et al., 2015). Retail analytics is a particularly challenging area for machine learning, since vast amounts of data are collected at various operational levels, ranging from individual customer transaction data to aggregated sales data across whole sectors. This means that companies are interested in developing efficient summaries of their data, to mitigate storage and computational costs (Akcay, 2013; Intel, 2014).
A particularly important example involves the price elasticity coefficients that are generated by sales prediction models. Given a set of products which are offered to consumers, the sales of each product typically depend on both its own price and the price of each of its competitors. The cross-elasticity of two products is a measure of the dependence that their prices have on their respective sales. In theory, companies would store this information as a matrix which contains the cross-elasticities for each pair of products. However, in practice, computing price elasticities can be computationally prohibitive when there are a large number of products, so companies often instead make use of highly tailored black box sparse regression sales models (Liu et al., 2013; Beheshti-Kashi et al., 2015) and only measure the cross-elasticity for a small number of each product's competitors, with the remaining entries of the matrix treated as missing or negligible.
Although the resulting output matrix of coefficients can be successful at providing accurate predictions of future sales, the inherent informative missingness implies that a global interpretation and understanding of the behaviour of the market may not be directly available. Here we are particularly interested in clustering groups of similar products together according to the distribution of their competition in the market, but also in identifying products for which potentially important competitors may have been missed out of the observed cross-elasticity matrix.
Formally, the form of the cross-elasticity data at hand is such that, for each product $i$, we observe a decreasing set of entries (i.e. observed order statistics) of a larger vector, which has been censored for sparsity to only the top few entries. Mathematically speaking, the data are of the form $X = \{x_{i,1:n} : x_{i,n-l_i+1} \le x_{i,n-l_i+2} \le \dots \le x_{i,n}$, with $x_{i,j}$ censored to 0 for $1 \le j \le n-l_i\}$, where $x_i$ is the cross-elasticity vector of dimension $n$ for product $i$, which has $l_i$ uncensored ordered entries, with the remainder being censored. Heterogeneity among products stems both from the rate at which competition decays and from the actual number of uncensored cross-elasticities. Existing price sensitivity analyses focus on reducing these vectors to summary statistics of elasticity coefficients from fitted demand models (Andreyeva et al., 2010; Oliveira et al., 2007). To our knowledge, there is currently no methodology to cluster these entire vectors with minimal information loss, which would allow flexibility in handling the varying lengths of cross-elasticity vectors as well as incorporating the censoring mechanism to provide information about the censored entries.
In this paper we develop non-parametric Bayesian models by interpreting our observed data as realizations of variable length order statistics sequences. We shall show that this succinctly handles partial censoring and enables computationally straightforward inference on the unobserved entries of the cross-elasticity matrix. Our approach uses tools from survival analysis to address inherent censoring mechanisms, together with non-parametric Bayesian Dirichlet process mixture models (DPMMs) that enable products to be clustered into distinct groups for analysis. Using the exponentiated Weibull (EW) distribution as a mixture kernel (Mudholkar and Srivastava, 1993), we can account for both light and heavy tail behaviour that is apparent in the data. As we shall discuss later, the EW distribution has several unique properties which make it ideal for modelling order statistics. We develop efficient sampling mechanisms by adapting algorithm 8 of Neal (2000) and provide interpretation and visualization tools for summarizing and presenting the output. Our approach fully characterizes sales sensitivities by incorporating all the information from the entire cross-elasticity vector, offering two distinct benefits. Firstly, by interpreting these elasticity vectors as order statistic sequences we can directly cluster products in terms of their entire cross-elasticity vectors and conveniently handle their varying length nature. Secondly, it provides a framework for predicting censored entries which can shed light on potentially important competitors which have been censored.
Although we focus on the retail analytics application, our methodology is general and is relevant in any situation with informative missingness where only the top few order statistics of each observation vector are observed. This includes applications such as sports analytics, where only the top few performances, athletes or teams are observed (Malcata and Hopkins, 2014) or the stylometry analysis of literary texts which often focuses on analysing the frequencies of the top few most common words (Narayanan et al., 2012).
The rest of the paper is organized as follows: Section 2 outlines our motivation for clustering elasticity coefficients in the retail analytics setting. Section 3 covers the properties of uniform order statistics that are relevant to our model and reviews the relevant literature, in particular drawing parallels to survival and reliability analysis. Section 4 provides a background of the pertinent characteristics of the EW distribution and its relevance as a kernel to variable length order statistics sequences. Section 5 covers the theory of DPMMs and further provides the non-parametric mixture model of variable length order statistics sequences along with prior distributions specification. We outline the algorithm that is used in posterior inference in Section 6. Section 7 illustrates our methods on a simulated and a real retail analytics data set of cross-elasticity vectors. dunnhumby ltd, a customer science company, allowed us access to the anonymized cross-elasticity coefficient output of a set of products derived from the loyalty card transactions of a leading UK supermarket retailer. Section 8 provides a summary of our methods with potential extensions and applications of the work.
The programs that were used to analyse the data and some synthetic data can be obtained from https://rss.onlinelibrary.wiley.com/hub/journal/14679876/seriesc-datasets

Motivation
It is common in retail analytics to characterize products on the basis of how sensitive their sales are to the prices of their competitors and how customers interact with their product range. Analytics teams are constantly striving to develop models and inference methods that provide insight into how price fluctuations propagating throughout stores will impact the sales of products whose prices have not changed (Persson, 1995; Ferreira et al., 2015). Clustering products on the basis of their price sensitivity profile can provide a segmentation of a retailer's product range. This ultimately aids store planners in deciding on the value of a given pricing or display combination, as it provides information on how a product's sales are likely to react to deviations in the prices of other products. For instance, a poor display combination could be one that consists entirely of products whose sales are primarily driven by the prices of their competition. This would lead to margin cannibalism, where profit made on one product is offset by the loss of profit on another product. Information from product clusterings would enable better pricing and display optimization.
One approach to such a sales sensitivity analysis is to cluster products in terms of their direct and cross-elasticity coefficients. Existing work on analysing sales sensitivities of retail goods, and using these to partition product ranges, has focused on defining summary statistics which capture many of the important aspects of price sensitivity profiles (Andreyeva et al., 2010). For example, Oliveira et al. (2007) investigated the heterogeneity in direct (but not cross-) elasticities across products and across consumer groups. Similar work investigated variations in category level summaries of cross-elasticity coefficients and the effect of sales sensitivity across store, demographic and product category levels (Hoch et al., 1995; Guerrero-López et al., 2017). However, much of the important information on price sensitivity profiles lies in the entire cross-elasticity vectors and cannot be captured in summary statistics.

[Table 1. Toy examples of variable length order statistic sequences (products include Tommy's puffs, Chef's BBQ crisps and Supermarket puffs). For each product we have columns of ordered elasticity coefficients $\phi_i$ and $\phi_i \eta_{ij}$, along with the respective sequences of $\eta_{ij}$, which demonstrate the decreasing nature of data from model (1). The number of potential cross-competitors is set to $n_i = 6$, although the number of terms censored to 0 differs. Importantly, the set of competitors can differ for each of the products and, in instances where there is a shared competitor (as with Supermarket puffs in this case), the value of $\phi_i \eta_{ij}$, as well as its position in the ordering, need not be consistent across products.]
The data set that we have access to through dunnhumby ltd comprises cross-elasticity vectors for a set of products from a leading UK supermarket chain. Although the precise mechanics of how these estimates are obtained are highly engineered within their proprietary model, the general form of the model is given by refined versions of the Working-Leser regression (Working, 1943; Leser, 1963):

$$\log(S_{i,t}) = -\phi_i \log(Q_{i,t}) + \sum_{j=1}^{n_i} \phi_i \eta_{ij} \log(P_{i,j,t}) + f(Q_{i,1:T}, P_{i,1:n_i,1:T}) + \epsilon_{i,t}, \qquad (1)$$

where, for each product $i$ and discrete time $t$ (in days), $S_{i,t}$ denotes its sales, $Q_{i,t}$ its price, $P_{i,j,t}$ the price of its $j$th competitor product, $\phi_i$ its direct elasticity and $\eta_{ij}$ product $j$'s relative cross-elasticity with product $i$ (as a multiple of the direct elasticity). The term $\epsilon_{i,t}$ represents time-dependent error. We use the notation $1{:}n$ to denote the set $1, \dots, n$. The function $f(\cdot)$ involves data aggregation and seasonality patterns that are relevant to retail sales, as well as additional information on display combinations and promotions; our focus here is not on this function, but rather on the post-processing of the output of the regression model. Here $n_i$ is the number of competitor products of product $i$, which are preselected by using expert knowledge that is encoded in an algorithm, to avoid using the entire set of products, which is computationally prohibitive because of the complexity of $f(\cdot)$. For the purposes of this study, and to ease notation in later sections, we assume that competitor products are labelled such that product $i$'s cross-elasticity coefficients $\eta_{ij}$ are increasing in magnitude, and that all products have the same potential number of competitors, i.e. $n_i = n$, $i = 1, \dots, N$. The cross-elasticity coefficients are estimated by using shrinkage methods for sparsity, so that only $l_i$ of the $\eta$s are non-zero, with the remainder exactly equal to 0.
Table 1 provides two toy examples of variable length order statistic sequences in the context of cross-elasticities. Fitting this model to data provides us with a vector of cross-elasticities for each product, where some of the entries may be 0 because of sparsity. Our goal is then to cluster products according to these cross-elasticity vectors. Although, in theory, one can perform clustering alongside the regression, this is computationally prohibitive in the current context because of the highly tailored model fitting that is involved, so we treat the regression fitting as 'black box' fitting and work with the cross-elasticity vectors directly. To address the fact that, because of computational limitations, competitor products are preselected by using expert knowledge and are subject to error, we treat the 0 entries as missing minor competitors (with smaller cross-elasticity coefficients than the observed coefficients). This results in a clustering framework whereby observation vectors have different numbers of non-missing entries.
Since cross-elasticity vectors arise as the outcome of penalized regression, it is natural to assume that coefficients are shrunk to 0 as the result of a penalization threshold. For example, in the simplest case of best-subset selection with an orthonormal design matrix, non-zero coefficients are exactly equal to the top order statistics of the corresponding ordinary least squares estimates. With this in mind, we treat the observed non-zero cross-elasticity coefficients as the top order statistics of an underlying vector of length n. The l observed coefficients of the leading product competitors are thus modelled as the top l order statistics of a set of n independent and identically distributed (IID) observations from an unknown underlying distribution. To account for the fact that a different number l of entries may be observed in each vector, we assume that l also follows a probability distribution, independently of the actual entries.

Order statistics of continuous distributions
The order statistics of a random sample are the reordered observations in terms of increasing size. More concretely, given a continuous random variable $X$ and observations $x_{1:n} \sim_{\mathrm{IID}} X$, the order statistics $x_{(1)}, \dots, x_{(n)}$ are given by

$$x_{(1)} < x_{(2)} < \dots < x_{(n)}. \qquad (2)$$

The $j$th order statistic of ordering (2) is denoted by $x_{(j)}$; thus $x_{(1)}$ and $x_{(n)}$ are the smallest and largest observations respectively. Given the density function $f(x)$ and distribution function $F(x)$ of a continuous random variable $X$, the density of the $j$th order statistic $x_{(j)}$, denoted by $f_{(j)}(x)$, is given by (Arnold et al., 1992)

$$f_{(j)}(x) = \frac{n!}{(j-1)!\,(n-j)!}\, F(x)^{j-1}\, \{1 - F(x)\}^{n-j}\, f(x). \qquad (3)$$

One of our key modelling assumptions is that a partially observed cross-elasticity vector of length $n$ with $l$ non-zero entries in fact corresponds to the top $l$ order statistics of a random sample of size $n$. We term each of these vectors of the top $l$ order statistics variable length order statistics sequences, and denote them as $x = (x_{(n)}, \dots, x_{(n-(l-1))})$. We also denote the $j$th order statistic of sequence $x$ by $x_{(j)}$. The density of $x \mid l$ is denoted by $f_{(n):(n-l+1)}$ and is given by

$$f_{(n):(n-l+1)}(x \mid l) = \frac{n!}{(n-l)!}\, F(x_{(n-l+1)})^{n-l} \prod_{j=1}^{l} f(x_{(n+j-l)}). \qquad (4)$$

Since $x_{(n-j)} \mid x_{(n-j+1)}$ is independent of $x_{(n)}, x_{(n-1)}, \dots, x_{(n-j+2)}$, equation (4) implies that the density of the conditional distribution of $x_{(n-j)} \mid x_{(n-j+1)}, l$ for $j < l$ (denoted $f_{(n-j)|(n-j+1)}$) is given by

$$f_{(n-j)|(n-j+1)}(x_{(n-j)} \mid x_{(n-j+1)}, l) = (n-j)\, f(x_{(n-j)})\, \frac{F(x_{(n-j)})^{n-j-1}}{F(x_{(n-j+1)})^{n-j}}, \qquad (5)$$

and thus the density of the joint sample $x \mid l$ can also be expressed in hierarchical format:

$$f_{(n):(n-l+1)}(x \mid l) = f_{(n)}(x_{(n)}) \prod_{j=1}^{l-1} f_{(n-j)|(n-j+1)}(x_{(n-j)} \mid x_{(n-j+1)}, l). \qquad (6)$$

Finally, the joint distribution of $x$ can be combined with the $n-l$ zero entries of $x$ through

$$f(x) = p(l)\, f_{(n):(n-l+1)}(x \mid l), \qquad (7)$$

where $p(l)$ is the probability mass function over the length of the sequence. Here we assume that $l$ and the magnitudes of the non-zero entries of $x$ are independent.
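As a quick sanity check on this factorisation, the joint density (4) and the hierarchical form (6) can be compared numerically. The sketch below uses a unit exponential base distribution purely for illustration (the kernel used in the paper is the EW distribution of Section 4), and the function names `joint_top_l` and `hierarchical_top_l` are our own:

```python
import math

def f(x):  # Exponential(1) density: a stand-in base distribution for illustration
    return math.exp(-x)

def F(x):  # Exponential(1) distribution function
    return 1.0 - math.exp(-x)

def joint_top_l(xs, n):
    """Density (4) of the top l order statistics xs = (x_(n), ..., x_(n-l+1))."""
    l = len(xs)
    smallest = xs[-1]  # x_(n-l+1)
    dens = math.factorial(n) / math.factorial(n - l) * F(smallest) ** (n - l)
    for x in xs:
        dens *= f(x)
    return dens

def hierarchical_top_l(xs, n):
    """Same density via the hierarchical factorisation (6)."""
    l = len(xs)
    dens = n * f(xs[0]) * F(xs[0]) ** (n - 1)   # marginal density of the maximum
    for j in range(1, l):                        # conditionals (5)
        x, prev = xs[j], xs[j - 1]
        dens *= (n - j) * f(x) * F(x) ** (n - j - 1) / F(prev) ** (n - j)
    return dens

xs = [2.3, 1.7, 0.9]   # x_(6) > x_(5) > x_(4), so l = 3 out of n = 6
print(joint_top_l(xs, 6), hierarchical_top_l(xs, 6))
```

The two evaluations agree because the telescoping product of the conditional densities (5) reproduces the constant $n!/(n-l)!$ and the factor $F(x_{(n-l+1)})^{n-l}$ in (4).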
Much work has been done on the theoretical properties of order statistics (Beutner and Kamps, 2009), with applications to areas such as modelling software reliability (Wilson and Samaniego, 2007), the reliability of aircraft propulsion systems (Warr and Collins, 2014) and recommender systems (Caron and Teh, 2012). A particularly relevant strand of order statistics research, which bears resemblance to our problem set-up, is that of k out of n systems in reliability analysis. A k out of n system models the failure of k out of n components within a finite time horizon. The set of k ordered failure times (censored or not) can then be modelled as the observed order statistics of a base distribution. Much of the relevant non-parametric work has focused on flexibly learning the underlying base distributions (Wilson and Samaniego, 2007; Barghout et al., 1998) and on building hierarchical versions of these models (Ghosh and Tiwari, 2007). In the k out of n framework, a standard assumption is that each sequence or system produces the same marginal order statistics, whereas we would like to allow additional flexibility.
In the current context, we observe the top few order statistics of the cross-elasticity vector, with the remaining entries treated as missing. This type of data is akin to the format of models in survival analysis, where the probability of survival decreases over time and observations may be right censored. One aspect that is important to the success of Bayesian non-parametric models in survival analysis is the choice of kernel, as it determines whether the relevant statistics and survival functions are recoverable; as a consequence, much attention is paid to this choice. Notably, a hierarchical structure in the base measure was introduced by De Iorio et al. (2004), whereas Hanson (2006) and Kottas (2006) used gamma and Weibull kernels within a DPMM framework respectively. The EW distribution was among the first extensions of the Weibull distribution shown to model non-monotone hazards (Mudholkar and Srivastava, 1993), which in our context correspond to order statistics terms whose modes exist but which are not necessarily light tailed.

Exponentiated Weibull distribution
Following the formulation of our observations as order statistics of random samples, the choice of the underlying distribution of $X$ determines the behaviour of the corresponding order statistics. Here we are interested in a distribution which can allow for a range of light and heavy tail behaviour and provide interpretable analytical expressions for the distribution of its order statistics. We thus assume that these random samples are distributed according to the EW distribution. A random variable $X$ is distributed according to the EW distribution, denoted $X \sim \mathrm{EW}(\alpha, \beta, \lambda)$, if its probability density and distribution functions are given by

$$f(x \mid \alpha, \beta, \lambda) = \frac{\alpha \beta}{\lambda} \left(\frac{x}{\lambda}\right)^{\beta-1} \left[1 - \exp\left\{-\left(\frac{x}{\lambda}\right)^{\beta}\right\}\right]^{\alpha-1} \exp\left\{-\left(\frac{x}{\lambda}\right)^{\beta}\right\}, \quad x > 0, \qquad (8)$$

$$F(x \mid \alpha, \beta, \lambda) = \left[1 - \exp\left\{-\left(\frac{x}{\lambda}\right)^{\beta}\right\}\right]^{\alpha}. \qquad (9)$$

The EW distribution extends the standard Weibull distribution through the inclusion of the additional parameter $\alpha$, which enables the distribution to have a wide range of tail behaviours. As in the Weibull distribution, $\lambda$ is a scale parameter, whereas $\beta$ controls the tail behaviour of the distribution: distributions are heavy tailed for $\beta < 1$ and light tailed otherwise. Furthermore, decreasing $\beta$ monotonically increases the mean, variance, kurtosis and skew of the EW distribution. The effect of $\alpha$ depends on both the value of $\alpha\beta$ and whether $\alpha < 1$; increasing $\alpha$ increases symmetry around the mean and mode. These different modal, asymptotic and tail behaviours (Nassar and Eissa, 2003) are summarized in Table 2. Fig. 1 demonstrates various density plots for differing combinations of $(\alpha, \beta, \lambda)$; various asymptotic, modal and tail behaviours are observed.
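A minimal implementation of the density (8) and distribution function (9) follows; this is our own sketch rather than the authors' code, and the function names `ew_pdf` and `ew_cdf` are assumptions:

```python
import math

def ew_pdf(x, alpha, beta, lam):
    """Exponentiated Weibull density (8); alpha is the extra shape parameter,
    beta the Weibull shape and lam the scale."""
    z = (x / lam) ** beta
    return (alpha * beta / lam) * (x / lam) ** (beta - 1) \
        * (1 - math.exp(-z)) ** (alpha - 1) * math.exp(-z)

def ew_cdf(x, alpha, beta, lam):
    """Exponentiated Weibull distribution function (9)."""
    return (1 - math.exp(-(x / lam) ** beta)) ** alpha

# alpha = 1 recovers the standard Weibull distribution function
print(ew_cdf(2.0, 1.0, 1.5, 1.0), 1 - math.exp(-2.0 ** 1.5))
```

Differentiating `ew_cdf` in `x` recovers `ew_pdf`, which gives a convenient numerical check of the implementation.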

Exponentiated Weibull distribution application to order statistics
There are some key properties of the EW distribution that lead to useful applications to order statistics and variable length order statistics sequences. The joint density (4) under the EW distribution for fixed order sequences of lengths l is given by where f is the EW density function (8).
The EW distribution handles censoring naturally, since the censored, joint and conditional densities under the EW distribution belong to the same family. For example, if $x_i \sim_{\mathrm{IID}} \mathrm{EW}(\alpha, \beta, \lambda)$ for $i = 1, 2, \dots, n$, then the maximum satisfies $x_{(n)} \sim \mathrm{EW}(n\alpha, \beta, \lambda)$, since $F(x)^n = [1 - \exp\{-(x/\lambda)^{\beta}\}]^{n\alpha}$. Similarly, the conditional distributions (5) are also readily available in closed form. This means that the properties and interpretability of the EW distribution carry over transparently to its order statistics. Finally, the EW distribution can account for both light and heavy tails, enabling us to capture different types of decay behaviour in the elasticity vectors. Fig. 2 provides some examples of order statistics sequences, which demonstrate the various decay and tail behaviours that can be produced under the EW kernel.
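The closure of the maximum under the EW family is easy to check: $F(t)^n$ is exactly the $\mathrm{EW}(n\alpha, \beta, \lambda)$ distribution function, and an inverse-CDF sampler derived from (9) confirms this by simulation. The helper names and parameter values below are illustrative, not from the paper:

```python
import math
import random

def ew_cdf(x, alpha, beta, lam):
    return (1 - math.exp(-(x / lam) ** beta)) ** alpha

def ew_sample(alpha, beta, lam, rng):
    """Inverse-CDF sampler for EW(alpha, beta, lam): invert (9) at u ~ U(0, 1)."""
    u = rng.random()
    return lam * (-math.log(1 - u ** (1 / alpha))) ** (1 / beta)

alpha, beta, lam, n = 0.7, 1.5, 2.0, 5
rng = random.Random(1)

# Algebraically, P(max <= t) = F(t)^n, which is exactly the EW(n*alpha, beta, lam) CDF.
t = 3.0
print(ew_cdf(t, alpha, beta, lam) ** n, ew_cdf(t, n * alpha, beta, lam))

# Monte Carlo confirmation: empirical CDF of the maximum of n draws.
m = 20000
maxima = [max(ew_sample(alpha, beta, lam, rng) for _ in range(n)) for _ in range(m)]
emp = sum(x <= t for x in maxima) / m
print(emp)
```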

Non-parametric mixture model of variable length order statistics sequences
As outlined in Section 2, our ultimate goal is to cluster variable length order statistics sequences, here arising through the behaviour of the cross-elasticity vectors of different products. For this, we use the EW distribution as a representation of cross-elasticity decay behaviour. However, to account for different behaviour across products, we additionally cluster products that potentially correspond to the same EW distribution. We opt for a mixture modelling framework to allow fully model-based uncertainty to propagate through the clustering inference. To account for an unknown number of underlying components, we use a Bayesian non-parametric mixture modelling formulation, guiding the number of clusters through a prior on the base distribution of a component. We thus model the entire set of cross-elasticity vectors non-parametrically as a DPMM (Antoniak, 1974).
A DP is a distribution over random probability measures (Ferguson, 1973), parameterized by a base distribution $G_0$ over a measurable space $\Theta$ and a concentration parameter $\nu$, denoted $G \sim \mathrm{DP}(\nu G_0)$. Realizations from $\mathrm{DP}(\nu G_0)$ are centred near the base distribution $G_0$, i.e., for any measurable set $A \subset \Theta$, $E[G(A)] = G_0(A)$. The concentration parameter $\nu$ controls the degree to which realizations from $\mathrm{DP}(\nu G_0)$ are close to $G_0$. Sethuraman (1994) established a convenient formulation of a DP known as the stick breaking construction, which expresses a distribution $G \sim \mathrm{DP}(\nu G_0)$ as

$$G = \sum_{i=1}^{\infty} w_i \delta_{\theta_i}, \qquad w_i = v_i \prod_{j<i} (1 - v_j), \qquad v_i \sim_{\mathrm{IID}} \mathrm{beta}(1, \nu),$$

where $\delta_x$ is the Dirac measure of mass centred at $x$ and $\theta_i \sim_{\mathrm{IID}} G_0$. A DPMM was proposed by Antoniak (1974) as a mixture model with a DP prior over the random mixing distribution. A DPMM can be expressed hierarchically as

$$x_i \mid \theta_i \overset{\mathrm{ind.}}{\sim} \pi(x_i \mid \theta_i), \quad i = 1, \dots, N, \qquad \theta_i \mid G \sim_{\mathrm{IID}} G, \qquad G \mid \nu, \omega \sim \mathrm{DP}\{\nu G_0(\cdot \mid \omega)\}, \qquad \nu \sim F_1, \quad \omega \sim F_2,$$

where $\pi$ is the response distribution of $x$ and $F_1$ and $F_2$ are independent priors for the parameters $\nu$ and $\omega$ respectively. The unique values of a vector $\theta$ are referred to as $\theta^*$, and we use $\theta^*_{C_i}$ to denote $\theta_i$ given an allocation of observation $i$ to cluster $C_i$. DPMMs can be used to estimate a density $p(x)$ by a countably infinite mixture of kernel functions by placing a DP prior over the mixing distribution. Thus, given a density $p(x)$ to estimate, and a family of densities $f(x \mid \theta)$ parameterized by $\theta$,

$$p(x) = \int_{\Theta} f(x \mid \theta)\, G(\mathrm{d}\theta) = \sum_{i=1}^{\infty} w_i f(x \mid \theta_i).$$
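A truncated stick breaking draw is simple to sketch. The code below is illustrative only: the base measure is taken to be standard normal for simplicity (the model's actual base distribution over EW parameters is specified later), and the truncation rule (stop once the leftover stick mass falls below a tolerance) is our own choice:

```python
import random

def stick_breaking(nu, base_sampler, rng, tol=1e-8):
    """Truncated stick breaking draw from DP(nu * G0): returns (weights, atoms).

    base_sampler(rng) draws an atom theta_i ~ G0; sticks are broken until the
    leftover mass falls below tol, then the remainder sits on one final atom."""
    weights, atoms, remaining = [], [], 1.0
    while remaining > tol:
        v = rng.betavariate(1.0, nu)     # v_i ~ beta(1, nu)
        weights.append(remaining * v)    # w_i = v_i * prod_{j<i} (1 - v_j)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - v
    weights.append(remaining)            # leftover mass on one extra atom
    atoms.append(base_sampler(rng))
    return weights, atoms

rng = random.Random(0)
w, th = stick_breaking(2.0, lambda r: r.gauss(0.0, 1.0), rng)
print(len(w), sum(w))
```

Smaller `nu` concentrates the mass on a few large sticks (realizations far from $G_0$), while larger `nu` spreads it over many small ones, matching the role of the concentration parameter described above.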

The model
We now propose a DPMM of variable length order statistics sequences on mixtures of distributions satisfying equation (10). Placing a $\mathrm{DP}(\nu G_0)$ prior on the distributions (10) is an attractive approach to handling the complex multimodalities, decay rates and variable lengths that order statistics sequences can exhibit, as discussed in Section 4. The DPMM of variable length order statistics sequences can thus be expressed in the hierarchical format (6) by

$$x_i \mid l_i, \alpha_i, \beta_i, \lambda_i \sim f_{(n):(n-l_i+1)}(\cdot \mid l_i, \alpha_i, \beta_i, \lambda_i), \qquad l_i \mid w_i \sim 1 + \mathrm{binomial}(n-1, w_i), \qquad (\alpha_i, \beta_i, \lambda_i, w_i) \mid G \sim_{\mathrm{IID}} G, \qquad G \sim \mathrm{DP}(\nu G_0). \qquad (14)$$

Model (14) can also be expressed through the iterative formulation

$$x_{i,(n)} \sim \mathrm{EW}(n\alpha_i, \beta_i, \lambda_i), \qquad x_{i,(n-j)} \mid x_{i,(n-j+1)} \sim f_{(n-j)|(n-j+1)}(\cdot \mid x_{i,(n-j+1)}, l_i), \quad j = 1, \dots, l_i - 1, \qquad (15)$$

which follows from equation (6). We treat the lengths $l$ and observations $x_{i,(j)}$ of $x$ as independent, to allow detection of competitor omissions and to ease computation. Since cross-elasticity coefficients are identically distributed a priori, each individual coefficient has the same probability of being censored, leading to a binomial prior on $l_i$; to avoid the degenerate case of empty cross-elasticity vectors, we force one of the Bernoulli trials to be 1. The base distribution $G_0$ is a key aspect of the $\mathrm{DP}(\nu G_0)$, as it specifies the prior over the $(\alpha, \beta, \lambda, w)$ atoms which define the cluster structure of the model; here we specify $G_0$ as

$$G_0(\alpha, \beta, \lambda, w) = \mathrm{gamma}(\alpha \mid \alpha_1, \alpha_2)\, \mathrm{gamma}(\beta \mid \beta_1, \beta_2)\, \mathrm{gamma}(\lambda \mid \lambda_1, \lambda_2)\, \mathrm{beta}(w \mid a, b). \qquad (16)$$

The hyperparameters $(a, b, \alpha_1, \alpha_2, \beta_1, \beta_2, \lambda_1, \lambda_2)$ are treated as fixed, chosen depending on the modelling context and reflecting prior expertise. The prior for $\nu$ is assumed to be $\mathrm{gamma}(\tau_1, \tau_2)$, allowing the relationship $E[N^* \mid \nu] \simeq \nu \log\{(\nu + N)/\nu\}$ (Escobar and West, 1995), where $N^*$ is the number of occupied clusters, to inform our prior expectation of the number of clusters.
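The observation model can be sketched generatively: given a cluster's atom $(\alpha, \beta, \lambda, w)$, draw $l = 1 + \mathrm{binomial}(n-1, w)$ and keep the top $l$ of $n$ IID EW draws. The sketch below uses our own inverse-CDF sampler and arbitrary parameter values, purely for illustration:

```python
import math
import random

def ew_sample(alpha, beta, lam, rng):
    """Inverse-CDF draw from EW(alpha, beta, lam)."""
    u = rng.random()
    return lam * (-math.log(1 - u ** (1 / alpha))) ** (1 / beta)

def draw_sequence(alpha, beta, lam, w, n, rng):
    """One variable length order statistics sequence from the observation model:
    l ~ 1 + binomial(n-1, w) (one trial forced to 1, so l >= 1), then the top l
    of n IID EW(alpha, beta, lam) draws in decreasing order
    (x_(n), ..., x_(n-l+1)); the censored entries are simply dropped."""
    l = 1 + sum(rng.random() < w for _ in range(n - 1))
    draws = sorted((ew_sample(alpha, beta, lam, rng) for _ in range(n)), reverse=True)
    return draws[:l]

rng = random.Random(7)
seq = draw_sequence(1.2, 0.8, 1.0, 0.4, 10, rng)
print(len(seq), seq)
```

Larger `w` yields longer observed sequences (fewer censored competitors), while `alpha`, `beta` and `lam` govern how quickly the leading entries decay.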

Posterior inference
We now present an efficient Markov chain Monte Carlo (MCMC) procedure for obtaining samples from the posterior $p(\alpha, \beta, \lambda, w, \nu \mid X)$ under the model proposed in expression (14), with $X = \{x_{i,1:n} : x_{i,n-l_i+1} \le x_{i,n-l_i+2} \le \dots \le x_{i,n}$, with $x_{i,j}$ censored to 0 for $1 \le j \le n-l_i\}$, where $x_i$ includes the variable length order statistics sequence of length $l_i$ (uncensored ordered entries), with the remaining $n - l_i$ being censored. There are three steps to obtaining samples from $p(\alpha, \beta, \lambda, w, \nu \mid X)$ at each MCMC iteration: sampling the atoms $(\alpha, \beta, \lambda, w)$ of the $\mathrm{DP}(\nu G_0)$ for each order statistics sequence, sampling the clusterwise atoms for each of the unique clusters (as induced by the $\mathrm{DP}(\nu G_0)$) and, finally, sampling the concentration parameter $\nu$. We initiate by using the Polya urn exposition of a DP (Blackwell and MacQueen, 1973), taking a Gibbs sample of the $\theta_i = (\alpha_i, \beta_i, \lambda_i, w_i)$ atoms associated with observation $x_i$ by using

$$\theta_i \mid \theta_{-i}, x_i \sim \sum_{k=1}^{N^*} q_k \delta_{\theta^*_k} + q^*_0 H_i, \qquad q_k \propto N^*_k\, f(x_i \mid \theta^*_k), \qquad q^*_0 \propto \nu \int f(x_i \mid \theta)\, G_0(\mathrm{d}\theta). \qquad (17)$$

Here $f(x_i \mid \theta) = f_{(n):(n-l_i+1)}(x_i \mid l_i, \alpha, \beta, \lambda)\, p(l_i \mid w)$, where $f_{(n):(n-l_i+1)}$ is specified in equation (10) and the conditional distribution is $p(l_i \mid w) = \binom{n-1}{l_i-1} w^{l_i-1} (1-w)^{n-l_i}$. $H_i$ is the posterior distribution for $\theta$ based on the prior distribution $G_0$ (16) with likelihood $f(x_i \mid \theta, \nu)$. Here $\theta_{-i}$ denotes the vectorized atoms of $\theta$ excluding the $i$th atom $\theta_i$, $\{\theta^*_1, \dots, \theta^*_{N^*}\}$ denotes the unique values of $\theta_i$, $N^*$ the number of unique clusters induced by the DP and $N^*_k$ the number of points assigned to atom $\theta^*_k$. As calculating the integral $q^*_0$ is intractable, we use algorithm 8 (Neal, 2000) to approximate $q^*_0$ by a weighted mixture of likelihoods, taking $c$ auxiliary components sampled from the prior distribution $G_0$.
Concretely, samples $\theta_j \sim_{\mathrm{IID}} G_0$ for $j = N^*+1, \dots, N^*+c$ are drawn, which reduces equation (17) to taking a sample from the multinomial distribution given by $P(\theta_i = \theta^*_k \mid \theta_{-i}, x_i, \theta^*_1, \dots, \theta^*_{N^*+c})$, which corresponds to

$$P(\theta_i = \theta^*_k \mid \theta_{-i}, x_i, \theta^*_1, \dots, \theta^*_{N^*+c}) \propto \begin{cases} N^*_k\, f(x_i \mid \theta^*_k), & k = 1, \dots, N^*, \\ (\nu/c)\, f(x_i \mid \theta^*_k), & k = N^*+1, \dots, N^*+c. \end{cases}$$

The number of auxiliary components $c$ determines the accuracy with which $q^*_0$ is approximated. The $\theta_k$ atoms are then updated for each of the unique clusters $k = 1, \dots, N^*$, to avoid the inefficiencies associated with having to pass through extremely low probability states to reach higher probability states. This is achieved by taking a single sample from the posterior $p(\theta_k \mid \nu, x_{\{i: C_i = k\}})$ for each $k = 1, \dots, N^*$. As taking exact samples from $p(\theta_k \mid \nu, x_{\{i: C_i = k\}})$ is intractable for our choice of kernel (10) and prior $G_0$ (16), the Metropolis-Hastings algorithm is used to sample from $p(\theta_k \mid \nu, x_{\{i: C_i = k\}})$ for $k = 1, \dots, N^*$. This involves taking sufficient burn-in samples until convergence to the stationary posterior distribution is satisfactory, at which point $\theta_k$ is taken as the last sample from the Metropolis-Hastings procedure. Finally, $\nu$ is updated in line with the auxiliary variable approach of Escobar and West (1995). Further details of the posterior inference steps are included in Appendix A.
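A single allocation draw of algorithm 8 can be sketched as follows. To keep the example self-contained we swap in a one-dimensional Gaussian kernel and base measure in place of the EW kernel (10) and prior (16); the function names, base measure $N(0, 3^2)$ and parameter values are ours:

```python
import math
import random

def gauss_pdf(x, mu, sd=1.0):
    # Gaussian kernel density, standing in for f(x_i | theta) purely for illustration
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def assign_obs(x, atoms, counts, nu, c, rng):
    """One algorithm-8 allocation draw for a single observation x.

    atoms/counts describe the occupied clusters with x itself removed; c auxiliary
    atoms are drawn fresh from the base measure (here N(0, 3^2)), each carrying
    prior mass nu / c.  Returns the sampled atom for x."""
    aux = [rng.gauss(0.0, 3.0) for _ in range(c)]
    weights = [n_k * gauss_pdf(x, mu) for n_k, mu in zip(counts, atoms)]
    weights += [(nu / c) * gauss_pdf(x, mu) for mu in aux]
    total = sum(weights)
    u, acc = rng.random() * total, 0.0
    for w_k, mu in zip(weights, atoms + aux):
        acc += w_k
        if u <= acc:
            return mu
    return aux[-1]

rng = random.Random(3)
theta = assign_obs(0.1, atoms=[0.0, 5.0], counts=[8, 2], nu=1.0, c=3, rng=rng)

# With x near the heavily occupied atom at 0.0, the existing cluster should win most draws.
draws = [assign_obs(0.1, [0.0, 5.0], [8, 2], 1.0, 3, rng) for _ in range(500)]
frac_existing = sum(t == 0.0 for t in draws) / 500
print(theta, frac_existing)
```

Choosing an auxiliary atom corresponds to opening a new cluster; in the full sampler its parameters would then be refreshed by the Metropolis-Hastings update described above.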

Data examples
We now illustrate how our methodology works in practice by performing two simulation studies, before proceeding to a real world retail analytics data set. In the first example (Section 7.1) we generate data from our model. In the second example (Section 7.2) we generate data by using a gamma distribution as a kernel (rather than an EW distribution). We then fit our model to both data sets by using vague priors.

Simulated data 2
This simulated example differs from the first simulation study in that the data are simulated from a mixture of gamma distributions rather than a mixture of EW distributions. The purpose of fitting our model to data from a gamma mixture is to test the inference in a less favourable setting and to establish whether the EW kernel is sufficiently flexible to capture the decay of order statistics sequences arising from mixtures that are not EW. The mixture components $\theta^*_1$, $\theta^*_2$ and $\theta^*_3$ are selected to produce simulated mixtures that imitate the mixtures (18).
We use the steps outlined in Section 6 for parameter inference and perform 10000 MCMC iterations with 200 burn-in samples, thinning every 10 samples after the burn-in samples have been discarded. We present the MCMC output based on the inference methodology of Section 6 on the simulated data of Sections 7.1 and 7.2. Since individual clusters are identifiable only up to permutations, an additional identifiability criterion is required to perform clusterwise inference. We implement algorithm 2 of Lau and Green (2007) on the posterior samples to derive an 'optimal' partition. Lau and Green (2007) selected the partition of the observations into clusters $C^*$ (up to permutation) which minimizes a linear loss function of the posterior marginal coincidence probabilities. This is equivalent to maximizing

$$\sum_{(i,j) \in M} \mathbb{1}\{C_i = C_j\}\, (\rho_{ij} - K), \qquad (20)$$

where $M = \{(i,j) : i < j;\ i, j \in \{1, \dots, N\}\}$, $K = b/(a+b) \in [0,1]$, $b$ is the penalty for misclassifying two points into different clusters (when they should be together), $a$ is the penalty for misclassifying two points into the same cluster (when they should not be), $C$ is a given clustering of the observations (up to permutation) and $\rho_{ij}$ is the posterior coincidence probability between points $i$ and $j$. This optimal partition defines cluster assignments $C^*_i$ for each observation $x_i$ and is used in later sections to compute clusterwise point estimates of various quantities of interest. As equation (20) needs to be maximized over $K$, we maximize equation (20) over each $K \in \{0.1, 0.2, \dots, 0.9\}$ and select the $K$ and $C^*$ that give the maximal value.
The partition $C^*$ obtained through the above algorithm determines an optimal number of clusters and an optimal allocation of each observation to a cluster. Posterior distributions of clusterwise parameters are then obtained by sweeping through the MCMC samples and, at each iteration, averaging over all parameter values associated with each observation within a cluster of $C^*$.
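The coincidence probabilities $\rho_{ij}$ and the score (20) are straightforward to compute from posterior label samples. The sketch below (our own helper names, applied to a toy set of label samples) illustrates that a partition agreeing with the posterior co-clustering scores higher than one that does not:

```python
def coincidence_matrix(label_samples):
    """Posterior coincidence probabilities rho_ij estimated from MCMC cluster
    labels; label_samples holds one label vector per (thinned) iteration."""
    n = len(label_samples[0])
    rho = [[0.0] * n for _ in range(n)]
    for labels in label_samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    rho[i][j] += 1.0
    s = len(label_samples)
    return [[v / s for v in row] for row in rho]

def lau_green_score(partition, rho, K):
    """Score (20): sum over pairs clustered together of (rho_ij - K)."""
    n = len(partition)
    return sum(rho[i][j] - K
               for i in range(n) for j in range(i + 1, n)
               if partition[i] == partition[j])

samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
rho = coincidence_matrix(samples)
good = lau_green_score([0, 0, 1, 1], rho, K=0.5)
bad = lau_green_score([0, 1, 0, 1], rho, K=0.5)
print(good, bad)
```

In practice the maximization over candidate partitions and over $K$ is what algorithm 2 of Lau and Green (2007) performs; the scoring function itself is as simple as shown here.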
Figs 3 and 4 present plots of the MCMC output, including the histogram of the number of occupied clusters N*, a heat map of the posterior marginal coincidence probabilities (cluster co-membership) and density estimates of the order statistics sequences x_i; the density plots demonstrate that the EW kernel successfully describes the mixture of decay sequences even though the data are generated from a mixture of gamma distributions. The maximum a posteriori estimate of the number of clusters is indeed 3, with a clear separation of the observations into three clusters, and the marginal density estimates closely match the corresponding histograms of the data. Table 3 summarizes our MCMC output for the simulated study of Section 7.1 using the clusters defined by C*. The estimates are very close to the true parameter values, which are also contained within the 95% credibility intervals, indicating that the inference works effectively.

Retail analytics data set
We apply our order statistics clustering method to a retail analytics data set from a leading UK supermarket chain. Access to the anonymized data was provided by dunnhumby. The data set consists of the cross-elasticities for a category of supermarket products, in the format described in Section 2:

X = {η_{i,1:n} : η_{i,(n−l_i+1)}, …, η_{i,(n)} observed, with η_{i,(j)} censored to 0 for 1 ≤ j ≤ n − l_i},

where only the top l_i order statistics of each cross-elasticity vector η_i have been observed. To enable a straightforward interpretation we focus on the snacks category, which consists of N = 275 (out of thousands of) products, so that the observations consist of N = 275 vectors of cross-elasticity coefficients. For this study, a maximum of n = 10 competitors is considered a priori, to reflect a product's most significant competitors. The snacks category has the following product line breakdown: 22.5% traditional flavoured crisps (salted, cheese and onion, and salt and vinegar), 33.1% exotic flavoured crisps (crisps excluding traditional flavours), 8.73% tortillas, 8.00% popcorn, 7.64% nuts, 4.73% dips, 2.18% pretzels and 13.1% other peripheral quick snack products. Fig. 5 shows summary plots for the snacks category in this study, although other categories will show different behaviour. Specifically, we show histograms of the lengths l_i of the η_i as well as of the top two terms of the sequences. The histogram of the top order statistics demonstrates spikes centred near 0.0 and 1.0, suggesting possible multimodality.
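To make the censoring format concrete, the following sketch (our own illustration, with made-up numbers) maps a raw cross-elasticity vector to its observed order statistics representation, censoring the lower n − l_i entries to zero:

```python
import numpy as np

def censor_to_top(eta, l):
    """Keep only the top-l entries of a length-n cross-elasticity vector,
    returned as an increasing order statistics sequence with the lower
    n - l entries censored to zero."""
    eta = np.sort(np.asarray(eta, dtype=float))  # eta_(1) <= ... <= eta_(n)
    out = np.zeros_like(eta)
    if l > 0:
        out[-l:] = eta[-l:]
    return out

# e.g. with n = 4 and l_i = 2, only the top two coefficients survive:
# censor_to_top([0.4, 0.05, 0.9, 0.2], 2) gives [0.0, 0.0, 0.4, 0.9]
```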

Omitted competitors and aggregate competition
We introduce two statistics that are relevant to the retail analytics setting: omitted competitors and mean aggregate competition. These notions have key interpretations in the retail analytics context and will enable us to assess model fit. As discussed in Section 2, censoring of lower order statistics in the cross-elasticity vector occurs through penalized regression. However, potentially important competitor products may have been inadvertently omitted from the regression equation, meaning that the cross-elasticity vector should have included additional uncensored entries. The objective of the omitted competitors statistic OC is to assess whether the truncation has occurred prematurely, by predicting the subsequent term of the observed order statistics sequence (i.e. η_{i,(n−l_i)} of η_i) and assessing whether this predicted value is sufficiently large. Concretely, we say that an elasticity vector contains omitted competitors if its variable length order statistics sequence satisfies

OC = E[η̃_{(n−l̃)}] > ε

for some truncation constant ε > 0, where η̃_{(n−l̃)} represents the random quantity of the (n − l̃)th order statistic of n IID EW(α, β, λ) samples with l̃ ∼ 1 + Binomial(n − 1, w). In other words, η̃ has the same distribution as η, but without any censoring. Thus the OC-statistic represents the expected value of the first censored term of a cross-elasticity vector η̃, were we to have observed it. The value of ε should be chosen to represent a 'small value' within the modelling context. We set ε = 0.05 (and it will be fixed for our subsequent analysis) as a sensible truncation threshold, as it implies that, if log-price deviations of the next competitor are expected to account for more than 5% of equivalent log-price changes of the product's own cross-elasticity coefficient ϕ_i, we conclude that this is a significant omission in the sales model.
One of the benefits of interpreting the cross-elasticities as variable length order statistic sequences is that casting the censored observations into a missing data framework makes the OC-statistic natural to define. The OC-statistic crucially relies on being able to predict the subsequent value of a cross-elasticity vector were it to be observed. By capturing the sequential decay of these decreasing sequences, the variable length order statistic sequence model enables inference on subsequent entries of the cross-elasticity vectors that flexibly incorporates the rates of decay across the previous entries.
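The OC-statistic can be estimated by Monte Carlo. The sketch below is our own illustration: it assumes the EW parameterization F(x) = (1 − exp(−(λx)^β))^α, with λ acting as a rate (consistent with larger λ corresponding to a smaller mean, as noted later in the analysis), and samples by inverting this CDF:

```python
import numpy as np

rng = np.random.default_rng(0)

def rexpweibull(size, alpha, beta, lam, rng=rng):
    """Draw EW(alpha, beta, lam) variates by inverting the assumed CDF
    F(x) = (1 - exp(-(lam * x)**beta))**alpha."""
    u = rng.uniform(size=size)
    return (-np.log1p(-u ** (1.0 / alpha))) ** (1.0 / beta) / lam

def oc_statistic(alpha, beta, lam, w, n=10, reps=20000, rng=rng):
    """Monte Carlo estimate of OC = E[eta_(n - l)], the expected value
    of the first censored order statistic, with l ~ 1 + Binomial(n-1, w).
    When l = n there is no censored term and the replicate contributes 0."""
    l = 1 + rng.binomial(n - 1, w, size=reps)
    x = np.sort(rexpweibull((reps, n), alpha, beta, lam, rng), axis=1)
    idx = n - l - 1                     # 0-based index of eta_(n - l)
    val = np.where(idx >= 0, x[np.arange(reps), np.maximum(idx, 0)], 0.0)
    return val.mean()
```

A fitted cluster would then be flagged as exhibiting competitor omission when the resulting estimate exceeds ε = 0.05.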
Definition 2: aggregate competition. One of the primary interests of the analysis is characterizing products in terms of the sensitivity of their sales with respect to their competitors' prices. We introduce the notion of aggregate competition AC to summarize the total effect of competition on a product's sales through its competitors' price changes. We achieve this by defining the aggregate competition of product i as the sum of the top l cross-elasticity coefficients. Concretely, the AC for a cross-elasticity vector distribution is given by

AC = E[ Σ_{j=n−l̃+1}^{n} η̃_{(j)} ],

the expected sum of the top l̃ order statistics. AC can be thought of as the total percentage effect that log-price deviations of the top l elasticity terms (where l is the expected number of competitor terms) have with respect to equivalent changes in the product's own log-price. For example, if a product's AC is 0.25, then a log-price decrease of 1 unit across each of its competitors would require the product's own log-price to decrease by 0.25 to offset the loss of sales that its competitors' price changes would have caused. Thus a large AC indicates that a product's sales are significantly affected by its competitors' prices.
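AC can likewise be estimated by Monte Carlo under the fitted kernel. This sketch is our own illustration: it assumes the EW parameterization F(x) = (1 − exp(−(λx)^β))^α with λ a rate, and draws l̃ ∼ 1 + Binomial(n − 1, w) per replicate:

```python
import numpy as np

rng = np.random.default_rng(1)

def ac_statistic(alpha, beta, lam, w, n=10, reps=20000, rng=rng):
    """Monte Carlo estimate of aggregate competition: the expected sum of
    the top l order statistics of n iid EW(alpha, beta, lam) draws, with
    l ~ 1 + Binomial(n - 1, w).  EW variates are drawn by inverting the
    assumed CDF F(x) = (1 - exp(-(lam * x)**beta))**alpha."""
    u = rng.uniform(size=(reps, n))
    x = np.sort((-np.log1p(-u ** (1.0 / alpha))) ** (1.0 / beta) / lam,
                axis=1)
    l = 1 + rng.binomial(n - 1, w, size=reps)
    csum = np.cumsum(x[:, ::-1], axis=1)   # cumulative sums, largest first
    return csum[np.arange(reps), l - 1].mean()
```

As a sanity check on the sketch, with α = β = 1 the kernel reduces to an Exponential(λ) distribution, so setting w = 1 (all n entries observed) recovers an expected total of n/λ.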

Markov chain Monte Carlo output
We present the MCMC output for the real retail cross-elasticity data set, using the following priors for (α, β, λ, w) and ν (the DP scale parameter): (α, β, λ, w) ∼ Gamma(7, 7/10) × Gamma(0.5, 1) × Gamma(1, 1) × Beta(2, 3) and ν ∼ Gamma(5, 1) respectively. These priors are selected to reflect a prior expectation of the decay and typical length of the cross-elasticity vectors in the retail analytics context. Specifically, the priors over (α, β) are selected to reflect prior knowledge of the modal nature of the coefficients and the expected heavy-tailed nature of the cross-elasticity coefficients. In addition, these priors were chosen to be more restrictive than in the simulated examples, to overcome the strong dependence between α and β which, for small data sets such as this, leads to weak identifiability. In particular, the prior for α is centred near 10 as before, but with a variance of about 14. The complementary parameter β is centred at 0.5 (corresponding to a heavy-tailed EW mixture), with a variance of 0.5. In other words, mixtures are 'shrunk' towards smaller values of β, i.e. towards assuming that there are no omitted competitors, unless the data strongly suggest otherwise. The prior w ∼ Beta(2, 3) is selected to favour observed cross-elasticity vectors of length 4-5. The prior for λ (purely a scale parameter) is chosen uninformatively as before, and the prior for ν is chosen such that, a priori, between three and 30 clusters (roughly) are expected. Fig. 6 presents a histogram of the number of unique clusters N* and a heat map of the posterior marginal coincidence probabilities (cluster co-membership). We see that the maximum a posteriori number of clusters is 3, with two large clusters and one small cluster. Table 4 provides the category breakdown of each cluster, together with the number of observations in each, as well as OC- and AC-values. It also includes the posterior means and 2.5% and 97.5% posterior credible limits of (α, β, λ, w) for each of the optimal clusters.
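The claim that ν ∼ Gamma(5, 1) implies roughly between three and 30 clusters a priori can be checked numerically via the standard Dirichlet process identity E[K_N | ν] = Σ_{i=0}^{N−1} ν/(ν + i); this sketch is our own illustration, not the authors' code:

```python
import numpy as np

def dp_expected_clusters(nu, N):
    """Prior expected number of occupied clusters in a Dirichlet process
    mixture with concentration nu and N observations:
    E[K_N | nu] = sum_{i=0}^{N-1} nu / (nu + i)."""
    return float(np.sum(nu / (nu + np.arange(N))))

# Push the Gamma(5, 1) prior on nu through to the implied expected
# cluster count for the N = 275 snack products.
rng = np.random.default_rng(2)
nu_draws = rng.gamma(shape=5.0, scale=1.0, size=10000)
k_prior = np.array([dp_expected_clusters(nu, 275) for nu in nu_draws])
lo, hi = np.percentile(k_prior, [2.5, 97.5])
```

The central 95% interval of the implied expected cluster count is broadly consistent with the stated 'between three and 30 clusters' range.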
Fig. 5 shows density estimates of the number of observed entries of the cross-elasticity vectors, as well as of the top two observed values in each vector, showing that our model captures these observed quantities very well. In particular, in Fig. 5 we observe a spike of very small values for the top order statistics η_(10) and η_(9), which the model can accommodate through a small value of αβ. Fig. 7 provides heat maps of pairwise posterior distributions of the parameters, which demonstrate a neat separation between the pairwise atoms. Interestingly, we observe that larger values of λ (corresponding to a smaller mean) are associated with larger values of w; in contrast, values of β are inversely associated with values of w, suggesting that the censoring in this case is largely driven by β.
To assess model fit, we calculate the posterior predictive p-values (Meng, 1994) of AC for each of the clusters defined by C*. Posterior predictive p-values involve generating replicates X_rep from the predictive distribution p(X_rep | α, β, λ, w) for each MCMC sample and calculating the p-value 2[1 − P{T(X_rep) > T(X) | X}] for some test statistic T(X): in this case the aggregate competition. Fig. 8 plots the observed aggregate competition AC for each cluster against histograms of AC-statistics computed on predictive replicates of X. The observed values all fall comfortably within the 95% prediction intervals. Trace plots of the (α, β, λ, w) atoms across the unique clusters are included in Appendix B.
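A minimal sketch of this p-value calculation, given replicate statistics T(X_rep) collected across MCMC iterations (we fold the two-sided value into [0, 1], a common convention):

```python
import numpy as np

def posterior_predictive_pvalue(T_obs, T_reps):
    """Two-sided posterior predictive p-value 2*[1 - P{T(X_rep) > T(X) | X}],
    folded so that the returned value lies in [0, 1]."""
    T_reps = np.asarray(T_reps, dtype=float)
    tail = np.mean(T_reps > T_obs)      # Monte Carlo tail probability
    return 2.0 * min(tail, 1.0 - tail)
```

An observed statistic near the centre of the replicate histogram yields a p-value near 1, while one in either tail yields a small p-value, flagging misfit.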

Retail analytics discussion
Considering the clusters given by C* and linking them to the corresponding categories, we see interesting breakdowns. The first cluster has a high concentration of traditional flavoured crisps and no nut products, whereas the second cluster has a markedly below-average representation of traditional crisps. The third cluster comprises nuts, pretzels and the other product categories.
The first and second clusters appear not to have competitor products omitted from their regression models, since OC_1 = 0.038 < ε and OC_2 = 0.031 < ε, indicating that we do not expect any of the unobserved cross-elasticities to be of any significance. However, the third cluster exhibits competitor omission, since OC_3 = 0.96 > ε. This implies that, according to the model, we expect to find at least one more competitor with a non-negligible cross-elasticity.
The posterior mean parameters of the first cluster are α_1 = 0.16 and β_1 = 1.51 with w_1 = 0.34 and an aggregate competition of AC_1 = 0.55, which point to a light-tailed distribution of cross-elasticities whose probability density diverges at 0. This is in line with the fact that this cluster largely consists of traditional crisps, a fiercely competitive product line in which products have multiple substitutes and thus a high degree of sales sensitivity is expected. The second cluster exhibits similar behaviour, with posterior mean parameters α_2 = 0.06, β_2 = 1.71, w_2 = 0.24 and an aggregate competition of AC_2 = 0.42, also implying a light-tailed distribution whose probability density diverges at 0. The third cluster is rather different; its posterior mean parameters α_3 = 5.73 and β_3 = 10.88 suggest a light-tailed distribution with mode away from 0. It largely consists of vectors with only a single cross-elasticity entry (through w_3 = 0.016), although the model suggests that an additional competitor may have been missed (or does not exist). Finally, its aggregate competition (despite the missing competitor) is AC_3 = 1.11, so log-price changes of these leading competitor products account for 1.11 times the equivalent changes in the product's own log-price. These parameters suggest that these products are substitutes, i.e. products bought as an alternative only because other equivalent products are unavailable or too expensive. With respect to the expected values of the order statistic entries themselves, we observe similar order statistic patterns between the first and second clusters; the leading order statistic entry accounts for a roughly similar proportion of the leading direct elasticity (28% and 30% respectively); however, the decay rate between the subsequent order statistics of the first cluster is significantly slower than that of the second cluster (roughly 55-60% of the previous value, compared with 40-45%).
This observation about the decay rate between subsequent order statistic entries is consistent with the discrepancy between the first and second clusters' AC-statistics, as well as with the first cluster comprising food items that traditionally have a higher number of competitors than those in the second cluster. As before, the third cluster differs significantly from the first and the second. Its first order statistic entry accounts for 98% of its leading direct elasticity, and the decay rate between successive order statistic entries is slower, each of these features being significantly different from those of the previous clusters.
Retailers also wish to understand the behaviour of their product range at a less granular level, e.g. at the category level. Clustering of cross-elasticity profiles provides a means to extract a new summary profile for a subset of products through a principled data-driven approach. Crucially, these summaries can help store planners and business specialists in the retail analytics domain to understand better the optimal pricing and display combinations. For example, products in the third cluster are highly sensitive to specific competitor products but are otherwise unaffected by the bulk of products around them. In contrast, products in the first and second clusters are cannibalized by their competitor products, meaning that increasing the sales of one product decreases the sales of another, with the second cluster being more robust to these price changes than the first.

Summarizing remarks
We have presented a Bayesian non-parametric mixture model for censored ordered data, using the EW distribution as a kernel. Our approach enables flexible modelling of cross-elasticity coefficients without the need to specify the number of components, and it lends itself to meaningful interpretation. We implemented our methods on a data set of cross-elasticities, focusing on quantities of interest in the retail analytics context, such as aggregate competition and potential omitted competitors. Our model captured several interesting features of the data through the corresponding clustering.
These methods can potentially be extended in several directions. Firstly, one could introduce structure between the distribution of the lengths of the order statistics sequences and the kernel distribution. This may enable borrowing of information between these two sources, although it would become more computationally cumbersome. Secondly, one could relax the assumption of ordered observations to account for observations that are only ordered in expectation. Although this was not appropriate in the cross-elasticity context, in applications such as sports analytics it may be more reflective of the data: for example, the best athlete will not always have the best performance at a competition; instead, the ranking corresponds to average performance. Finally, we would like to explore combinations of different product categories to investigate similarities in market behaviour between otherwise disparate products.

[Appendix B figure: trace plots of the (α, β, λ, w) atoms across the unique clusters on dunnhumby's cross-elasticity data of the snacks category at iteration t, and the trace of N*_t. We plot the square-root traces of (α, β, λ) to induce similar scales for graphical convenience. All plots indicate sufficient mixing and satisfactory convergence.]