A general framework for updating belief distributions

Summary We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered as a special case. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data‐generating mechanism. For instance, when the object of interest is low dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.

1. Introduction. Data sets are increasing in size and modelling environments are becoming more complex. This presents opportunities for Bayesian statistics but also major challenges, perhaps the greatest of which is the requirement to define the true sampling distribution, or likelihood, for the data generator $f_0(x)$, regardless of the study objective. So even if the task is inference for a low-dimensional statistic of the population, the analyst is required to model the complete data distribution and, moreover, assume that the model is "true". We propose a coherent procedure for general Bayesian inference that does not require complete knowledge of $f_0(x)$ and which connects information in the data to the value of an unknown object or parameter of interest via the use of loss functions. By "coherent" we mean that all relevant information is contained in the posterior probability distribution. For ease of exposition we shall use the terminology "parameter of interest" and "statistic of interest" interchangeably. We show how the approach leads to conventional Bayesian updating when the true likelihood is known but allows for rational updating of beliefs in much more general settings.
The central tenet of our paper is this: as applications get more complex, Bayesian analysts will increasingly be forced to forsake the notion that they can precisely model all aspects of the data. Settling for a misspecified model undermines the traditional Bayesian approach, leading to problems with both the interpretability and the reliability of the posterior distribution. If the analyst acknowledges this then they should seek an alternative coherent way to proceed. We aim to contribute to this task.
1.1 The idea. Let $\theta$ denote a parameter or statistic of interest, for example the mean or median of a population $F_0(x)$, and let $x$ denote a set of observables from $F_0(x)$, with $F_0$ unknown. We are interested in a formal, optimal way to update prior beliefs $\pi(\theta)$ to posterior beliefs $\pi(\theta \mid x)$ given $x$.
Bayesian inference proceeds through knowledge of a complete, true model for $f_0(x)$. This is often parameterised via a sampling distribution $f(x; \beta)$ and a prior $\pi(\beta)$, such that
$$f_0(x) = \int f(x; \beta)\, \pi(d\beta),$$
and following de Finetti we know that all exchangeable distributions can be modelled in such form; see, for example, Bernardo and Smith (1994). Then inference for the statistic of interest, $\theta$, can occur via
$$\theta = g[f(\cdot\,; \beta)],$$
where $g[\cdot]$ defines the statistic; for example, if $\theta$ is the mean then $g[f(\cdot\,; \beta)] = \int x f(x; \beta)\, dx$; or if $\theta$ denotes the median then $g[f(\cdot\,; \beta)] = F_\beta^{-1}(0.5)$. Following the Savage axioms (Savage, 1954) the Bayesian update can be shown to be the rational way to proceed. However, $f_0(x)$ may be unknown, $x$ may contain a vast number of data points and $\beta$ might be high-dimensional. Taken together, this makes the Bayesian approach somewhat cumbersome.
We are interested in the rational updating of beliefs, $\pi(\theta) \to \pi(\theta \mid x)$, under more realistic and manageable assumptions. To do so we relax the assumption that $f_0(x)$ is known and make use of loss functions to connect information in data to parameters of interest. Informally for now, we write such loss functions as $l(\theta, x)$, and we will discuss specific types later in the paper. We shall consider the reporting of subjective beliefs, $\pi(\theta \mid x)$, as an action made under uncertainty and use decision theory to guide the optimal action; see, for example, Hirshleifer and Riley (1992).
To outline the theory, let $\nu$ denote a probability measure on the state space of $\theta$. We shall construct a loss function to select an optimal posterior measure $\hat\nu(\theta)$ given a prior $\pi(\theta)$ and data $x$. To achieve this we construct a loss function $L(\nu; \pi, x)$ on the space of probability measures on $\theta$, and then present
$$\hat\nu = \arg\min_{\nu} L(\nu; \pi, x)$$
as the optimal "honest" representation of beliefs about the unknown value of $\theta$ given the prior information represented via the belief distribution $\pi$ and data $x$. As it is widely assumed that the data $x$ are a piece of information independent of that which gave rise to the prior, it is appropriate to consider an additive, or cumulative, loss function of the form
$$L(\nu; \pi, x) = h_1(\nu, x) + h_2(\nu, \pi), \tag{1}$$
where $h_1$ and $h_2$ are themselves loss functions representing fidelity-to-data and fidelity-to-prior, respectively. See, for example, Berger (1993) for more on the use of loss functions within decision theory. Under this approach the analyst needs to specify $h_1$ and $h_2$ in such a way that they proceed in an optimal, rational, and coherent manner. We can deal immediately with the loss function $h_2(\nu, \pi)$. Somewhat remarkably, as proved later, for coherent inference $h_2$ must be the Kullback-Leibler divergence (Kullback and Leibler, 1951), given by
$$h_2(\nu, \pi) = d_{KL}(\nu, \pi) = \int_\Theta \log\left\{\frac{\nu(d\theta)}{\pi(d\theta)}\right\} \nu(d\theta).$$
Regarding $h_1$, since $\nu(\theta)$ is a probability measure representing beliefs about $\theta$, the only choice is to take the loss-to-data $h_1(\nu, x)$ as the expected loss (see von Neumann and Morgenstern, 1944) of $l(\theta, x)$; that is,
$$h_1(\nu, x) = \int_\Theta l(\theta, x)\, \nu(d\theta),$$
with the particular types of the loss function $l(\theta, x)$ to be discussed later. In general the form of $l(\theta, x)$ will be problem specific, as discussed in Section 3.
Substituting in $h_1$ and $h_2$, the cumulative loss function is then given by
$$L(\nu; \pi, x) = \int_\Theta l(\theta, x)\, \nu(d\theta) + d_{KL}(\nu, \pi). \tag{2}$$
Surprisingly, but quite easy to show, the minimizer of $L(\nu; \pi, x)$ is given by
$$\hat\nu(d\theta) = \frac{\exp\{-l(\theta, x)\}\, \pi(d\theta)}{\int_\Theta \exp\{-l(\theta, x)\}\, \pi(d\theta)}. \tag{3}$$
This has the form of a Bayesian update but where the negative of the complete log-likelihood, $-\log f(x; \beta)$, is replaced by a loss function $l(\theta, x)$ targeting the parameter of interest. As is usual in decision problems involving the use of loss functions, it is incumbent on the decision maker to ensure solutions exist. So $l(\theta, x)$ needs to be constructed so that $0 < \int_\Theta \exp\{-l(\theta, x)\}\, \pi(d\theta) < +\infty$.
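The update (3) is straightforward to evaluate numerically when $\theta$ is one-dimensional. The following is a minimal sketch rather than the authors' implementation: the grid, the $N(0, 10^2)$ prior, the simulated heavy-tailed data and the choice of absolute loss (which targets the median) are all assumptions made for illustration, and `general_bayes_update` is a hypothetical helper name.

```python
import numpy as np

def general_bayes_update(theta, prior, loss):
    """Evaluate nu(theta) proportional to exp(-loss) * prior on a uniform grid."""
    log_post = np.log(prior) - loss
    log_post -= log_post.max()            # stabilise the exponential
    post = np.exp(log_post)
    dtheta = theta[1] - theta[0]
    return post / (post.sum() * dtheta)   # normalise to integrate to one

# Assumed data: heavy-tailed sample whose population median is 2.
rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=50) + 2.0

# Assumed prior: N(0, 10^2), evaluated and normalised on the grid.
theta = np.linspace(-5.0, 10.0, 3001)
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * (theta / 10.0) ** 2)
prior /= prior.sum() * dtheta

# Cumulative absolute loss sum_i |x_i - theta|, one value per grid point.
loss = np.abs(x[None, :] - theta[:, None]).sum(axis=1)

posterior = general_bayes_update(theta, prior, loss)
post_mean = (theta * posterior).sum() * dtheta
```

No sampling model for the data is specified anywhere in the sketch; only the loss and the prior enter the update, and the resulting belief distribution concentrates near the sample median.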
Whereas the Bayesian approach requires the construction of a probability model for all possible outcomes conditional on all unknown states of nature, the approach here requires the construction of loss functions given the outcomes for only the parameter of interest. This allows the decision maker to concentrate on modeling only those quantities that are important to the task to hand.

1.2 Connections with other work.
There is a vast amount of literature on procedures for robustly estimating a parameter of interest by minimizing the cumulative loss
$$\sum_{i=1}^n l(\theta, x_i).$$
See, for example, Hüber (2009), where we note that the primary aim is not modeling the data but rather estimating a statistic. This is an advantage when a probability model for the data is too hard to formulate. We are presenting a Bayesian extension of this idea. Since we are interested in a belief distribution for $\theta$ given data, and have further information provided by $\pi$, we claim the appropriate Bayesian version is given by (2). Some of the ideas presented in the paper have been considered in a less general setting by Zhang (2006a, 2006b) and Jiang and Tanner (2008). In Section IV of Zhang (2006a) an estimation procedure named Information Risk Minimization, also known as a Gibbs posterior, is described; it has the same form as (3). This is our procedure when data is regarded as stochastic. Zhang then concentrates on the properties of the Gibbs posterior.
Further theoretical work is done in Zhang (2006b). In Jiang and Tanner (2008) a Gibbs posterior is studied in comparison with a true Bayesian posterior where the model is assumed to be misspecified. The claim is that posterior performance of a Bayesian model can be unreliable when misspecified, whereas a Gibbs posterior which targets points of interest can have better performance. The comparison involves variable selection for high-dimensional classification problems involving a logit model.
We build on the work of Zhang (2006a, 2006b) and Jiang and Tanner (2008) in a number of important directions. The first is that we develop an approach for inference and statistical applications rather than studying the theoretical properties of the posterior under misspecification. We provide a principled approach to scale the relative information in the data to information in the prior; that is left as an arbitrary free parameter in Zhang (2006a, 2006b) and Jiang and Tanner (2008). We show that in order to remain coherent, the modeller must adopt the Kullback-Leibler divergence as the loss between prior $\pi$ and $\nu$. Finally, we demonstrate how to incorporate non-stochastic information into the cumulative loss function, which provides a definition of a conditional probability in the presence of non-stochastic information.
Another similar construct to $L(\nu; \pi, x)$ is provided by Zellner (1988), who presents what is essentially a loss function for the posterior distribution using ideas of information processing from prior to posterior. The motivation is different and relies on notions of information present in log probabilities and log likelihoods, which may not be compatible as noted by J.M. Bernardo in the discussion of Zellner's paper. Furthermore, our derivation of the loss function allows a broader interpretation of the elements, which does not require the existence of a probability distribution for the observation.
Concerns that the specification of a complete model for the data generating distribution is unachievable date back to de Finetti (1937) and the notion of "prevision". In his work de Finetti considers conditional expectation as the fundamental primitive, or statistic, of interest on which prior beliefs are expressed and updated. Recently other researchers have further developed this approach under the field of Bayesian linear statistics; see Goldstein and Wooff (2007).
There has been increasing awareness of the restrictive assumptions that formal Bayesian analysis entails. Royall and Tsou (2003) describe procedures for adjusting likelihood functions when the model is misspecified. More recently, Doucet and Shepherd (2012) and Muller (2012) consider formal approaches to pseudo-Bayesian methods using sandwich estimators to update subjective beliefs, motivated by robustness to model misspecification; see also Hoff and Wakefield (2013). Ribatet et al. (2009) consider pseudo-Bayesian approaches with composite likelihoods.
Several authors, for example Key et al. (1999), have considered issues with Bayesian updating with proxy models $f(x; \theta)$ when $(x_i)$ is known not to arise from $f(x; \theta)$ for any value of $\theta$. That is, there is no $\theta$ conditional on which $x$ is from $f(x; \theta)$. This is referred to as the M-open case in Bernardo and Smith (1994). One suggested solution is to use methods based on approximations, and Key et al. (1999) describe one such idea using a cross-validation approach. While this may be pragmatic it does have some shortcomings. Most serious is that there is little back-up theory, and this has repercussions in that the update suffers from a lack of coherence. Another approach is to ignore the problem; that is, assume the observations are coming from $f(x; \theta)$ even though it is known they are not. According to Goldstein (1981), "there is no obvious meaning for Bayesian analysis in this case". The disaster of making horribly wrong inference can be protected against to some extent by model selection; that is, postulating a number of models for $f_0(x)$, say $f_j(x; \theta_j)$, with corresponding priors $\pi_j(\theta_j)$, and model probabilities $(p_j)$, for $j = 1, \ldots, M$. But as Key et al. (1999) point out, how does one construct $\pi_j(\theta_j)$ and $p_j$ when one knows none of the postulated models are correct? So the Bayesian update breaks down in that nothing has any interpretation.
A recent popular idea is to use Bayesian nonparametrics; see Ghosh and Ramamoorthi (2003) and Hjort et al. (2010) for reviews. The idea here is to make the class of modeling densities $f(x)$ very large by constructing a prior directly on a space of density functions, written as $\pi(df)$, which has such a large support that it is reasonable to assume $f_0(x)$ lies in the support. A well known model is the infinite mixture model, whereby $\pi(df)$ generates random density functions of the type
$$f(x) = \int K(x; z)\, dP(z),$$
where $K(\cdot\,; z)$ is a density for each $z$, often the normal density with $z$ denoting the mean and variance, and $P$ is a random distribution function, usually of the type
$$P = \sum_{l=1}^{\infty} w_l\, \delta_{z_l},$$
and the prior is assigned to $(w_l, z_l)_{l=1}^{\infty}$. Here the $(w_l)$ are weights and sum to unity. The Dirichlet process, Ferguson (1973), is widely used in such contexts; see Lo (1984) and Escobar (1988) for the origins of the model and sampling based algorithms for estimating the model. While this methodology has made rapid developments in recent years, including the development of sampling algorithms, for complex data structures there are still issues about just how large the supports are, how complicated inference can be, and how to construct priors which capture reasonable beliefs about dependencies in the data. Moreover this still requires the specification of complete beliefs on $f_0(x)$ even when the objective is inference for a summary statistic of the data distribution.
Finally, we note that it is informative to view the selection of $\hat\nu$, i.e.
$$\hat\nu = \arg\min_{\nu} \left\{ \int_\Theta l(\theta, x)\, \nu(d\theta) + d_{KL}(\nu, \pi) \right\}, \tag{4}$$
as trading off fidelity to the data and fidelity to the prior. This highlights connections with penalised likelihood and regularized regression; see, for example, Hastie et al. (2009). But whereas in penalised likelihood the objective is to select a single parameter estimate $\hat\theta$, the general Bayesian approach (4) selects a probability distribution $\nu(\theta)$.
The layout of the remainder of the paper is as follows. In Section 2 we discuss how (3) arises as the unique minimiser of expected loss. In Section 3 we discuss forms for the loss-to-data functions and calibration. Section 4 then considers general forms of data, such as partial information and non-stochastic information. Section 5 provides some numerical illustrations including inference based on the Cox proportional hazards model and inference about the median of a distribution function. Section 6 concludes with a discussion on a number of points.
2. Information in the prior. Here we discuss the choice of the Kullback-Leibler divergence as being appropriate for quantifying the loss-to-prior $h_2(\nu, \pi)$ in (1). With $n$ independent pieces of information $x = (x_1, \ldots, x_n)$ we take the cumulative loss as
$$L(\nu; \pi, x) = h_2(\nu, \pi) + \sum_{i=1}^n h_1(\nu, x_i),$$
where $h_1$ will be taken in the integral form, i.e. the average or expected loss:
$$h_1(\nu, x_i) = \int_\Theta l(\theta, x_i)\, \nu(d\theta).$$
Now, adhering to the "likelihood principle" (see Bernardo and Smith, 1994), for any $0 < m < n$, all the information contained in $(x_1, \ldots, x_m)$ is to be found in $\nu_m$, where $\nu_m$ minimizes
$$h_2(\nu, \pi) + \sum_{i=1}^m h_1(\nu, x_i),$$
and hence it follows that $\hat\nu$ must also minimize
$$h_2(\nu, \nu_m) + \sum_{i=m+1}^n h_1(\nu, x_i),$$
where $\nu_m$ now serves as the prior for future information $(x_{m+1}, \ldots, x_n)$.
For coherence, the solution from $L$ for all cases of $m$ must be the same. To derive the form of $h_2$ we start with the family of $g$-divergences, that is
$$h_2(\nu, \pi) = \int_\Theta g\!\left(\frac{d\nu}{d\pi}\right) d\pi,$$
where $g$ is a convex function from $(0, \infty)$ to the real line and $g(1) = 0$; see Ali and Silvey (1966). For this coherence to be in force, it is necessary that the discrepancy $h_2$ is the Kullback-Leibler divergence. To be more precise, the following theorem can be stated.
Theorem. Let the loss $L(\nu; \pi, [x_1, x_2])$ be defined by (5) and (6). Moreover, let $\hat\nu_{(\pi, x_1, x_2)}$ be the probability measure that minimizes the loss among the probability measures on $\Theta$ that are absolutely continuous with respect to $\pi$. Similarly, let $\hat\nu_{(\pi, x_1)}$ and $\hat\nu_{(\hat\nu_{(\pi, x_1)}, x_2)}$ be the probability measures minimizing the loss $L(\nu; \pi, x_1)$ and $L(\nu; \hat\nu_{(\pi, x_1)}, x_2)$, respectively. Assume that
$$\hat\nu_{(\pi, x_1, x_2)} = \hat\nu_{(\hat\nu_{(\pi, x_1)}, x_2)}$$
for every probability measure $\pi$ on $\Theta$ and for every choice of the loss functions $h_1(\nu, x_1)$ and $h_1(\nu, x_2)$ such that $\hat\nu_{(\pi, x_1, x_2)}$, $\hat\nu_{(\pi, x_1)}$ and $\hat\nu_{(\hat\nu_{(\pi, x_1)}, x_2)}$ are all properly defined. Then $h_2$ is the Kullback-Leibler divergence.
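The coherence requirement can be checked numerically for the Kullback-Leibler choice, using the exponential form of the minimizer in (3): updating on two observations at once should give the same measure as updating on the first and then using that result as the prior for the second. The sketch below is only an illustration of this property, not a proof; the grid, the standard normal prior, the two data values and the absolute loss are assumptions, and `update` is a hypothetical helper.

```python
import numpy as np

def update(theta, prior, loss):
    """Minimizer of expected loss plus KL divergence to the prior, on a grid."""
    log_post = np.log(prior) - loss
    post = np.exp(log_post - log_post.max())
    return post / (post.sum() * (theta[1] - theta[0]))

theta = np.linspace(-4.0, 4.0, 2001)
dtheta = theta[1] - theta[0]

# Assumed prior: standard normal, normalised on the grid.
prior = np.exp(-0.5 * theta ** 2)
prior /= prior.sum() * dtheta

# Two assumed pieces of information and an assumed loss l(theta, x) = |theta - x|.
x1, x2 = 0.7, -0.3

def loss(x):
    return np.abs(theta - x)

# All data at once versus one observation at a time.
batch = update(theta, prior, loss(x1) + loss(x2))
sequential = update(theta, update(theta, prior, loss(x1)), loss(x2))

max_gap = np.max(np.abs(batch - sequential))
```

Up to floating-point error the two routes agree, which is the property the Theorem shows to be available only under the Kullback-Leibler divergence within the family of $g$-divergences.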
By virtue of this Theorem, which is proved in Appendix A, for coherence it is required to take $h_2$ to be the Kullback-Leibler divergence. So, in the case of $m = 0$, we have
$$L(\nu; \pi, x) = \sum_{i=1}^n \int_\Theta l(\theta, x_i)\, \nu(d\theta) + d_{KL}(\nu, \pi),$$
where $\pi$ is the initial choice of probability measure representing beliefs about $\theta$ in the absence of $x$.
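The solution to this minimization problem is easy to find and is given by
$$\hat\nu(d\theta) = \frac{\exp\{-\sum_{i=1}^n l(\theta, x_i)\}\, \pi(d\theta)}{\int_\Theta \exp\{-\sum_{i=1}^n l(\theta, x_i)\}\, \pi(d\theta)},$$
and this is the solution since one can see that
$$L(\nu; \pi, x) = d_{KL}(\nu, \hat\nu) - \log \int_\Theta \exp\Big\{-\sum_{i=1}^n l(\theta, x_i)\Big\}\, \pi(d\theta),$$
where the second term does not depend on $\nu$ and the first is minimized by $\nu = \hat\nu$.
3. Information in the data. In this section we will consider the form of $h_1$ in (1) that connects information in the data to the value of the unknown $\theta$. We shall consider three broad situations: first, when the analyst really believes they know the complete family of distributions from which $(x_i)$ arose, the so-called M-closed scenario; second, when $f_0(x)$ is unknown but a complete likelihood $f(x; \theta)$ is being used as a proxy model, the so-called M-open perspective; and finally, when the statistic $\theta$ does not index a complete sampling distribution or proxy model for $x$.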
3.1 M-closed and self-information loss. When the analyst knows the family from which $(x_i)$ arose then the Bayesian approach to learning is fully justified, well known and widely used as a statistical approach to inference; the book of Bernardo and Smith (1994) is comprehensive. Here we recall the essence of it: a parameter $\theta \in \Theta$ of a density function $f(x; \theta)$ is unknown and beliefs about it are encapsulated via a prior distribution $\pi(\theta)$.
Once (conditionally) independent samples $(x_1, \ldots, x_n)$ are observed from the density function $f(x; \theta)$, the prior is updated to the posterior distribution via Bayes' Theorem, given by
$$\pi(\theta \mid x_1, \ldots, x_n) = \frac{\prod_{i=1}^n f(x_i; \theta)\, \pi(\theta)}{\int_\Theta \prod_{i=1}^n f(x_i; \theta)\, \pi(\theta)\, d\theta},$$
where $\prod_{i=1}^n f(x_i; \theta)$ is the likelihood function. The posterior then represents revised beliefs taking into account both the prior distribution and the observations. Mathematically, it is an application of Bayes' Theorem via the standard definition of conditional probability.
So the Bayesian update works and is applicable in the case when the $(x_i)$ come from the density $f(x; \theta)$ for some $\theta \in \Theta$. In Bernardo and Smith (1994) this is referred to as the M-closed view. To see how Bayes arises in our framework, we would need to construct a loss function $l(\theta, x)$ with the knowledge that $x$ came from $f(x; \theta)$. It is well known that the "honest" loss function in this case is the self-information, or logarithmic, loss function, given by
$$l(\theta, x) = -\log f(x; \theta).$$
See Bernardo (1979) and Merhav and Feder (1998) for more on the self-information loss function. This amounts to the use of proper scoring rules to ensure that the analyst remains honest in expressing subjective beliefs, under which we recover the Bayesian updating rule. However, there are different ideas behind our derivation of this rule, with different assumptions being made. Most crucially, we need the $(x_i)$ to provide independent pieces of information to maintain the credibility of the cumulative loss function.
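That the self-information loss recovers the usual Bayesian posterior can be verified in a conjugate case. The sketch below assumes a $N(\theta, 1)$ model, a $N(0, \tau^2)$ prior with $\tau^2 = 4$ and simulated data, none of which come from the paper; it plugs $l(\theta, x) = -\log f(x; \theta)$ into the general update on a grid and compares the result with the closed-form normal posterior.

```python
import numpy as np

# Assumed setting: x_i ~ N(theta, 1), prior theta ~ N(0, tau2).
rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=1.0, size=20)
tau2 = 4.0

theta = np.linspace(-10.0, 10.0, 20001)
dtheta = theta[1] - theta[0]
log_prior = -0.5 * theta ** 2 / tau2

# Self-information loss summed over the data; additive constants that do
# not depend on theta are dropped because they cancel on normalisation.
loss = 0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)

log_post = log_prior - loss
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta

grid_mean = (theta * post).sum() * dtheta
grid_var = ((theta - grid_mean) ** 2 * post).sum() * dtheta

# Standard conjugate normal posterior for comparison.
n = len(x)
exact_var = 1.0 / (n + 1.0 / tau2)
exact_mean = exact_var * x.sum()
```

The grid-based mean and variance agree with the conjugate formulas to numerical precision, as expected when the loss is the negative log-likelihood.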

3.2 M-open and the use of proxy models.
As has been mentioned by many authors, for example Key et al. (1999), issues with the Bayesian rule arise when $f(x; \theta)$ is known not to be the family of densities from which the $(x_i)$ come. Equivalently, there is no $\theta$ conditional on which $x$ is from $f(x; \theta)$. This is referred to as the M-open case in Bernardo and Smith (1994). In many situations, the correct sampling density, $f_0(x)$, is unknown or unavailable or too complex to work with. There are a number of ways to attempt to resolve this issue from a Bayesian perspective.
One idea is to use methods based on approximations, and Key et al. (1999) describe one such idea using a cross-validation approach. While this may be a suitable idea which can work in practice it does have some shortcomings. Most serious is that there is little back-up theory, and this has repercussions in that the update suffers from a lack of coherence. Another approach is to ignore the problem; that is, assume the observations are coming from $f(x; \theta)$ even though it is known they are not. According to Goldstein (1981), "there is no obvious meaning for Bayesian analysis in this case". The disaster of making horribly wrong inference can be protected against to some extent by model selection; that is, postulating a number of models for $f_0(x)$, say $f_j(x; \theta_j)$, with corresponding priors $\pi_j(\theta_j)$, and model probabilities $(p_j)$, for $j = 1, \ldots, M$. But as Key et al. (1999) point out, how does one construct $\pi_j(\theta_j)$ and $p_j$ when one knows none of the postulated models are correct? So the Bayesian update breaks down in that nothing has any interpretation.
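We show in Appendix B that it is possible to learn about $\theta_0$, the parameter value taking the family $f(\cdot\,; \theta)$ closest to $f_0(\cdot)$ in the Kullback-Leibler sense, since an infinite collection of $(x_i)$ yields $f_0(\cdot)$ via the empirical distribution function and so $\theta_0$ will be found with sufficient samples. Then we would wish the sequence of $\nu(d\theta)$ to accumulate about $\theta_0$. So what is the appropriate loss $l(\theta, x)$ in the case where we are trying to learn about the value of $\theta_0$? The loss function $l(\theta, x) = -\log f(x; \theta)$ is still the right choice, for the limit of the standardized cumulative loss based on a sequence of observations is given by
$$\lim_{n \to \infty} -\frac{1}{n} \sum_{i=1}^n \log f(x_i; \theta) = -\int \log f(x; \theta)\, dF_0(x),$$
which is minimized by $\theta_0$.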
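A small simulation illustrates how the minimizer of the standardized cumulative self-information loss settles on $\theta_0$ even when the proxy family is wrong. The setting is an assumption chosen for convenience: the data come from a Gamma(2, 1) distribution while the proxy model is exponential with rate $\theta$, so that $-\log f(x; \theta) = -\log\theta + \theta x$ and the limiting minimizer is $\theta_0 = 1/E(x) = 0.5$.

```python
import numpy as np

# Assumed truth F_0: Gamma(shape=2, scale=1), which has mean 2.
rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.0, size=200_000)

# Candidate rates for the (misspecified) exponential proxy model.
theta = np.linspace(0.05, 2.0, 1951)

def avg_loss(sample):
    """Standardized cumulative loss: mean of -log f(x_i; theta) over the sample."""
    return -np.log(theta) + theta * sample.mean()

# Grid minimizer of the average loss for increasing sample sizes.
theta_hat = {n: theta[np.argmin(avg_loss(x[:n]))] for n in (100, 10_000, 200_000)}

theta_0 = 0.5   # minimizer of -log(theta) + theta * E[x] with E[x] = 2
```

No value of $\theta$ makes the exponential model correct here, yet the minimizers in `theta_hat` approach `theta_0` as the sample grows, which is the value about which the sequence of $\nu(d\theta)$ would be expected to accumulate.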
When an approximate model $f(x; \theta)$ has been supposed, it is often prudent to consider a number of models, say $f_j(x; \theta_j)$ for $j = 1, \ldots, M$, as we have mentioned previously. We can deal with this in a simple way. So let $\theta = (\theta_1, \ldots, \theta_M)$ and let $\pi(\theta)$ be the prior distribution for $\theta$ on $\Theta = \cup_{j=1}^M \Theta_j$. This would be constructed by considering beliefs about which $\theta_j$ from $f_j(\cdot\,; \theta_j)$ takes this family closest to $f_0(\cdot)$. The model $f(x; \theta)$ would then be given by $f(x; \theta) = f_j(x; \theta_j)$ for $\theta \in \Theta_j$, and the $(p_j)$ would now be the probabilities describing beliefs about which model provides the closest density to $f_0(\cdot)$. Hence, unlike the Bayesian approach to model selection in the M-open case, all the quantities to be specified have clear interpretation. We can recover the Bayesian update when we take, for each $j$, the loss $l(\theta_j, x) = -\log f_j(x; \theta_j)$.
So while the Bayesian approach has some issues to deal with depending on whether the M-open or M-closed view holds, for us the distinction is irrelevant. If one adopts $\theta_0$ as the parameter value taking the family closest to $f_0(\cdot)$ then one does not need to worry whether one is in M-open or M-closed, since if $f(\cdot\,; \theta)$ is the true family then obviously $\theta_0$ reverts to the true parameter value. This point is crucial, since for the Bayesian it may be that one simply does not know if one is in the M-open or M-closed view (though strictly speaking this puts one in M-open) and then one needs a framework in which the same approach is adopted and justified regardless of which view is held. We have provided such a framework.
3.3 M-free. Often the analyst might not wish to express a full probability model for the data, either because it is too cumbersome or too problematic. A motivating example is inference for the median of a population of iid data. However, the analyst knows the object or statistic $\theta$ that they wish to express beliefs about. It is incumbent on them to choose a specification for $l(\theta, x)$ that provides greatest information on the unknown value. The relevant literature is in the area of Robust Statistics, and loss functions can be found in the literature pertaining to M-estimation and estimating equations. See, for example, Hüber (2009). We refer to this setting as M-free to highlight the model-free aspect of inference.
An important class of loss functions is provided by the M-estimators for a location parameter, Hüber (1964). So rather than using the loss function $-\log f(x_i; \theta)$, a function $\rho(x_i; \theta)$ is used in an attempt to obtain robust estimation, rather than the traditional maximum likelihood estimator, which can be suspect if the model is incorrect. This idea has been generalized to the class of estimating equations, whereby the estimate of $\theta$ is obtained by minimizing $\sum_{i=1}^n \rho(x_i; \theta)$.
Our approach, which mirrors this classical robust procedure, would use the loss function
$$l(\theta, x) = \sum_{i=1}^n \rho(x_i; \theta),$$
with solution provided by
$$\hat\nu(d\theta) \propto \exp\Big\{-\sum_{i=1}^n \rho(x_i; \theta)\Big\}\, \pi(d\theta).$$
For example, one possible application would be the Generalized Estimating Equations; see Liang and Zeger (1986). These apply to grouped observations $(x_{i1}, \ldots, x_{in_i})$, where $\theta = (\beta, \phi, \alpha)$, for some link function $g$ with $g(\mu_{ij}(\beta)) = x_{ij}\beta$, some correlation matrix $R_i(\alpha)$ and diagonal matrix $A_i$ with $j$th entry given by $a(\mu_{ij}(\beta))$, where $a$ is a specified variance function and $\phi$ is a scale parameter; the loss for each group is specified through these working quantities. There is by now an abundance of literature on M-estimation, estimating equations and generalized estimating equations. Our point is that all such equations can be viewed as loss functions connecting independent units with parameters of interest. Hence, all fit within our framework: we would extend the loss function to include the prior $\pi$ and obtain an explicit expression for $\nu(d\theta)$. In cases when the parameter estimation is done via iterative methods, which is typically the case, Markov chain Monte Carlo methods would take their place as our strategy for sampling from $\nu(d\theta)$.
In essence, this is the practical innovation of the framework we are proposing. We are claiming that any loss function of the type $\sum_{i=1}^n \rho(x_i; \theta)$ can be extended to the Bayesian type updating mechanism. The $\theta_0$ of interest is implicitly assumed to be the limit of the sequence of minimizers of the cumulative losses. This would be the minimizer of $\int \rho(x; \theta)\, dF_0(x)$ and hence the prior beliefs are being expressed about this unknown value. Then the loss function $l(\theta, x) = \rho(x; \theta)$ ensures the updates are indeed "moving towards" $\theta_0$. To complete the picture, the decision maker is assumed to be happy to make a decision given the minimizer of $\int \rho(x; \theta)\, dF_0(x)$.
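As an illustration of extending an M-estimation loss to a belief update, the sketch below samples from $\nu(d\theta) \propto \exp\{-w \sum_i \rho(x_i; \theta)\}\pi(d\theta)$ with a random-walk Metropolis algorithm. Everything specific here is an assumption for illustration: the Huber form of $\rho$ with tuning constant 1.345, the weight $w = 1$, the $N(0, 10^2)$ prior, the contaminated data and the sampler settings. The function names are hypothetical.

```python
import numpy as np

def huber(r, c=1.345):
    """Huber's rho: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * (a - 0.5 * c))

def log_target(theta, x, w=1.0, prior_sd=10.0):
    """Log of exp{-w * sum_i rho(x_i - theta)} times an assumed N(0, prior_sd^2) prior."""
    return -w * huber(x - theta).sum() - 0.5 * (theta / prior_sd) ** 2

def metropolis(x, n_iter=5000, step=0.3, seed=0):
    """Random-walk Metropolis sampler for the one-dimensional target above."""
    rng = np.random.default_rng(seed)
    theta = np.median(x)
    lp = log_target(theta, x)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        lp_prop = log_target(prop, x)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept or reject
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws

# Assumed data: 95 points centred at 0 plus 5 gross outliers near 20.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(20.0, 1.0, 5)])

draws = metropolis(x)
robust_mean = draws[1000:].mean()   # discard an initial burn-in
```

The draws concentrate near the bulk of the data rather than near the sample mean, which the outliers pull towards 1; the scale of the resulting distribution depends on the weight $w$, whose calibration is a separate question.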
3.4 M-free calibration of relative losses. In the M-closed and M-open settings the use of the self-information loss $l(\theta, x) = -\log f(x; \theta)$ results in a fully specified form for (3). However, in the M-free setting there is an issue about the scale of the loss function $h_1$, which is a consequence of the apparent arbitrariness in the weight of $h_1(\nu, x)$ relative to $h_2(\nu, \pi)$, in that we are free to multiply either by an arbitrary factor. So equivalently we are interested in the loss function $w\, l(\theta, x)$ for some $w > 0$. The question is how to select $w$, noting that $w$ controls the relative weight of loss-to-data to loss-to-prior. Of course, such an issue does not arise in the classical literature on estimation using such loss functions since there is no combining with different styles of loss functions. However, the calibration of different types of loss function is not a problem unique to this setting. It arises in many applied contexts, possibly the most well known being in health economics, where losses pertaining to costs need to be balanced against losses pertaining to health benefits. There are a number of ideas for the choice of $w$ and we discuss them here.
3.4.1 Annealing. In the literature on Gibbs posteriors, the weighting parameter is labelled as a "temperature" and selected subjectively. There are clear connections here with the use of "power priors" (Ibrahim & Chen, 2000), where
$$\nu(d\theta) \propto \Big\{\prod_{i=1}^n f(x_i; \theta)\Big\}^{w} \pi(d\theta).$$
Such an idea has also been discussed in Walker and Hjort (2001). It is evident what $w$ achieves: if $0 < w < 1$ then the loss-to-prior is given more prominence than in the Bayesian update and the data will be less influential.
In the extreme case when $w = 0$ we retain the prior throughout. On the other hand, when $w > 1$ the loss $-\log f(x; \theta)$ is given more prominence than in the Bayesian update, and in the extreme case when $w$ is very large $\nu$ accumulates about the maximum likelihood estimator for the model; that is, $\nu(d\theta) \approx \delta_{\hat\theta}(d\theta)$, where $\hat\theta$ maximizes $\prod_{i=1}^n f(x_i; \theta)$. Alternative ideas for setting $w$ include a data dependent assignment based on cross-validation and a random assignment once the parameter has appeared in the Gibbs posterior. That is, one considers
$$\nu(d\theta, dw) \propto \exp\{-w\, l(\theta, x)\}\, \pi(d\theta)\, \pi_w(dw)$$
for some probability measure $\pi_w(dw)$.
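The role of $w$ described above can be seen directly by tempering a simple model. The following sketch assumes a $N(\theta, 1)$ likelihood, a standard normal prior and simulated data (all illustrative choices), and reports the mean and standard deviation of $\nu$ for several values of $w$.

```python
import numpy as np

# Assumed setting: x_i ~ N(theta, 1) with a N(0, 1) prior on theta.
rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, size=30)

theta = np.linspace(-8.0, 8.0, 16001)
dtheta = theta[1] - theta[0]
log_prior = -0.5 * theta ** 2

# Cumulative self-information loss, up to a constant not depending on theta.
loss = 0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)

def tempered(w):
    """Grid evaluation of nu proportional to exp(-w * loss) * prior."""
    lp = log_prior - w * loss
    p = np.exp(lp - lp.max())
    return p / (p.sum() * dtheta)

summary = {}
for w in (0.0, 0.1, 1.0, 10.0, 1000.0):
    p = tempered(w)
    mean = (theta * p).sum() * dtheta
    sd = np.sqrt(((theta - mean) ** 2 * p).sum() * dtheta)
    summary[w] = (mean, sd)
```

With $w = 0$ the prior is returned unchanged, $w = 1$ gives the ordinary Bayesian posterior, and for very large $w$ the distribution collapses onto the maximum likelihood estimate, here the sample mean.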

3.4.2 Unit information loss.
Here we discuss an assignment that is still subjective but is more targeted and direct. The subjective choice is based on a prior evaluation of the expected value of $l(\theta, x)$.
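To aid in the calibration of the loss functions and the selection of $w$ we can consider the following. Write the loss function with an additional term $\log \pi(\hat\theta)$, which is a constant, and where $\hat\theta$ maximizes $\pi(\theta)$, so that the cumulative loss becomes
$$L(\nu; \pi, x) = \int_\Theta \left[ w\, l(\theta, x) + \log\{\pi(\hat\theta)/\pi(\theta)\} \right] \nu(d\theta) + \int_\Theta \log \nu(\theta)\, \nu(d\theta).$$
In order to calibrate the information in the data relative to the prior we now assume that both loss functions, $l(\theta, x)$ and $\log\{\pi(\hat\theta)/\pi(\theta)\}$, are nonnegative, and we standardise $l(\theta, x)$ such that $\min_\theta l(\theta, x) = 0$ for any $x$. If this is not the case then we replace $l(\theta, x)$ by $l(\theta, x) - l(\hat\theta_x, x)$, where now $\hat\theta_x$ minimizes $l(\theta, x)$. Hence, we can regard
$$w\, l(\theta, x) + \log\{\pi(\hat\theta)/\pi(\theta)\}$$
as a loss function for $\theta$ with information provided by $x$ and $\pi$. So, assuming that $l(\theta, x) \geq 0$, we want to calibrate the two loss functions given by $w\, l(\theta, x)$ and $\log\{\pi(\hat\theta)/\pi(\theta)\}$.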
These are two loss functions for $\theta$ and, to adhere to the notion that at the outset, before we have data, there is a single piece of information, we can calibrate the two losses by making the expected losses match. That is, whether someone takes a $\theta$ and is penalized by the loss $\log\{\pi(\hat\theta)/\pi(\theta)\}$, or takes a $(\theta, x)$ and is penalized by the loss $w\, l(\theta, x)$, at the outset the expected losses should match. They are confronted by two choices of loss with one piece of information and thus the losses can be calibrated by ensuring their expected losses coincide. The connection between expected information and expected loss can be found in Bernardo (1979).
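Thus $w$ can be set by ensuring
$$E\{w\, l(\theta, x)\} = E\left[\log\{\pi(\hat\theta)/\pi(\theta)\}\right].$$
Here $E$ is with respect to a joint belief in $x$ and $\theta$, say $m(x, \theta)$, the marginal for $\theta$ of which is $\pi(\theta)$. So
$$w = \frac{\int \log\{\pi(\hat\theta)/\pi(\theta)\}\, \pi(d\theta)}{\int\!\!\int l(\theta, x)\, m(dx, d\theta)}.$$
One empirical choice is then given by
$$w = \frac{\int \log\{\pi(\hat\theta)/\pi(\theta)\}\, \pi(d\theta)}{\int\!\!\int l(\theta, x)\, \pi(d\theta)\, dF_n(x)}.$$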
Let us consider an example, where $l(\theta, x) = (\theta - x)^2$ with $\pi(\theta) = N(\theta \mid 0, \tau^2)$ and with $m(x \mid \theta)$ being any density with mean $\theta$ and variance $\sigma^2$. Then we can evaluate
$$E\left[\log\{\pi(\hat\theta)/\pi(\theta)\}\right] = E\{\theta^2/(2\tau^2)\} = \tfrac{1}{2} \quad \text{and} \quad E\{l(\theta, x)\} = \sigma^2, \quad \text{so that} \quad w = \frac{1}{2\sigma^2}.$$
Hence, this calibration idea yields the "correct" value of $1/(2\sigma^2)$ in this case. This construction requires the user specification of a joint density $m(dx, d\theta)$ which in some circumstances may prove difficult. One further suggestion is to replace the prior evaluation of the expected datum-loss with the observed unit information loss given $x$, where $\hat\theta_x = \arg\min_\theta [\sum_i l(\theta, x_i)]$ is the data-loss estimate of $\theta$ and $p$ is the dimension of $\theta$.
3.4.3 Hierarchical loss. Another way to proceed is to extend the loss function to include $w$ as an unknown parameter. Standard ideas here would suggest we take
$$l(\theta, w, x) = w\, l(\theta, x) - \xi\, l(w)$$
for some $\xi > 0$. We would appear to be making no progress since we now have a $\xi$ to assign. However, this is akin to the hierarchical Bayesian model where uncertainty is propagated via hyper-prior distributions to robustify the ultimate prior choice at some level. Hence, the allocation of a $\xi$ would not be as crucial as the assignment of a $w$. For example, as $w$ is a scale parameter on loss-to-data, taking $l(w) = \log w$ the solution is given by
$$\nu(\theta, w \mid x, \pi) \propto w^{\xi} \exp\{-w\, l(\theta, x)\}\, \pi(\theta, w),$$
and given that $w^{\xi}$ can be absorbed into the prior $\pi$ it is perfectly reasonable to assess $\xi$ subjectively. That is, it seems unreasonable to accept that $\pi$ can be chosen subjectively but that $\xi$ cannot.
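The squared-loss calibration example from the unit information discussion can also be checked by simulation. The sketch below matches the two expected losses by Monte Carlo, taking $m(x \mid \theta)$ to be normal purely for convenience (the argument only needs mean $\theta$ and variance $\sigma^2$); the values of $\tau$ and $\sigma$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
tau, sigma = 3.0, 2.0          # assumed prior and sampling standard deviations

# Draw (theta, x) from the joint belief m(x, theta) = pi(theta) m(x | theta).
theta = rng.normal(0.0, tau, size=1_000_000)
x = theta + sigma * rng.standard_normal(theta.size)

# Loss-to-prior log{pi(theta_hat)/pi(theta)} for a N(0, tau^2) prior,
# whose mode is theta_hat = 0.
prior_loss = theta ** 2 / (2.0 * tau ** 2)

# Loss-to-data l(theta, x) = (theta - x)^2.
data_loss = (theta - x) ** 2

# Match expected losses: w * E[data_loss] = E[prior_loss].
w_mc = prior_loss.mean() / data_loss.mean()
w_exact = 1.0 / (2.0 * sigma ** 2)
```

The Monte Carlo ratio reproduces $1/(2\sigma^2)$ and, as the closed-form calculation indicates, does not depend on $\tau$.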

3.4.4 Operational characteristics.
The idea here is to set $w$ so that the posterior quantiles match up at some level of error to frequentist confidence intervals based on the estimation of $\theta$ via minimizing the loss
$$\sum_{i=1}^n l(\theta, x_i).$$
So, if $C_\alpha(w, x_1, \ldots, x_n)$ is the $100(1 - \alpha)\%$ level confidence interval for $\theta$, then we would select the $w$ such that the posterior distribution of $\theta$, with parameter $w$, is such that
$$\nu_w\{\theta \in C_\alpha(w, x_1, \ldots, x_n) \mid x_1, \ldots, x_n\} = 1 - \alpha.$$
See, for example, the review article by Datta and Sweeting (2005) for references to probability matching priors and posteriors, and Ribatet et al. (2009) for ideas in pseudo-Bayesian approaches with composite likelihoods.
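A rough sketch of this matching idea for a location parameter follows. It assumes squared loss $l(\theta, x) = (\theta - x)^2$, a flat prior, and the usual normal-approximation interval $\bar{x} \pm z\, s/\sqrt{n}$ as the frequentist target; under these assumptions $\nu_w$ is $N(\bar{x}, 1/(2wn))$, so the matched weight should be close to $1/(2s^2)$. The bisection search and grid are illustrative choices, not a prescribed algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0.0, 2.0, size=40)                 # assumed data
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

z = 1.959963984540054                              # 97.5% standard normal quantile
ci_width = 2.0 * z * s / np.sqrt(n)                # target 95% interval width

theta = np.linspace(xbar - 5.0, xbar + 5.0, 20001)
loss = ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)

def posterior_width(w):
    """Width of the central 95% interval of nu_w under a flat prior."""
    lp = -w * (loss - loss.min())
    cdf = np.cumsum(np.exp(lp))
    cdf /= cdf[-1]
    lo, hi = np.interp([0.025, 0.975], cdf, theta)
    return hi - lo

# The width decreases as w grows, so bisect on the log scale.
lo_w, hi_w = 1e-3, 1e2
for _ in range(60):
    mid = np.sqrt(lo_w * hi_w)
    if posterior_width(mid) > ci_width:
        lo_w = mid
    else:
        hi_w = mid
w_matched = np.sqrt(lo_w * hi_w)
```

In this simple case the matched weight agrees with the unit information value $1/(2\sigma^2)$ once $\sigma^2$ is replaced by its sample estimate.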
4. General forms of information. In this section we discuss more general forms of information to condition on, rather than a complete stochastic data sample $x$ from unknown $F_0(x)$. In particular, we provide a definition of conditional probability when non-stochastic information is available, and discuss updating using partial information in a data set.
4.1 Conditional probability distributions and non-stochastic data. The theory of conditional probability distributions is a well-established mathematical theory that provides a procedure to update probabilities taking into account new information. Such a procedure is available only if the information which is used to update the probability concerns stochastic events; that is, events to which a probability is assigned. In other words, such information needs to be already included in the probability model. In this section, we shall show how the approach can be used to define conditional probability distributions based on non-stochastic information.
Information about $\theta$ may arrive in the form of non-stochastic data, such as when an expert declares "$\theta$ is close to 0". This type of information has been discussed by a number of authors and is known to be problematic for the Bayesian, especially when such information arises after or during the arrival of stochastic observations $(x_i)$. We cite the paper by Diaconis and Zabell (1982) and in particular refer the reader to the example in Section 1.1 of their paper.
Denote by $I$ a piece of information for which no probability model for each $\theta$ is possible. In other words, it is not and cannot be determined to be stochastic in any way. However, a loss function $l(\theta, I)$ can be assigned. Our theory does not preclude a loss function based on such a piece of information. The answer $\nu(\theta)$ based on $I$ and $\pi$ can then be considered as a means of defining a conditional probability distribution in the presence of non-stochastic information. This section develops this argument.
Before proceeding, we introduce the notation for this section, which differs from that used so far in order to put the discussion in a broader context than simply Bayesian statistical updating. Let $Y$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$, which will be the outcome of interest, taking values in a measurable space $(\mathbb{Y}, \mathcal{Y})$ with probability distribution $P_Y$. Hence, $P_Y$ represents initial belief about the outcome concerning $Y$. Now, assume that the outcome of another random variable, say $X$, is known. So, let $X$ be a random variable from $(\Omega, \mathcal{F}, P)$ into $(\mathbb{X}, \mathcal{X})$ with probability distribution $P_X$, and the additional information $I$ about $Y$ will be assumed to be an outcome of $X$. Then it is possible to update the unconditional distribution of $Y$ to the probability distribution of $Y$ given $X$.
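In probability theory, a conditional distribution of $Y$ given $X$ is a map $p$ from $\mathcal{Y} \times \mathbb{X}$ into $\mathbb{R}$ such that:
• for each $x$ in $\mathbb{X}$, $p(\cdot, x)$ is a probability measure on $\mathcal{Y}$;
• for each $B$ in $\mathcal{Y}$, $p(B, \cdot)$ is a measurable function on $\mathbb{X}$ and $P(X \in A, Y \in B) = \int_A p(B, x)\, P_X(dx)$ for every $A$ in $\mathcal{X}$,
where $P_X$ denotes the probability distribution of $X$.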
The conditional distribution is known to be essentially unique, i.e. unique only up to almost sure equality. This is a consequence of $X$ being stochastic.
In fact, as Feller (1971, p. 160) points out, if, for instance, the distribution of $X$ is concentrated on a subset $\mathbb{X}_0$ of $\mathbb{X}$, no natural definition of $p(B, x)$ is possible for $x$ outside $\mathbb{X}_0$. Nevertheless, in individual cases, there usually exists a natural choice dictated by regularity requirements. Moreover, it is well known that conditional distributions do not always exist unless some conditions are satisfied by the spaces $(\mathbb{X}, \mathcal{X})$ and $(\mathbb{Y}, \mathcal{Y})$. For more information about conditional probability distributions, see, for instance, Feller (1971) or Billingsley (1995).
Here, we will consider the case in which there are two σ-finite measures $\mu_1$ and $\mu_2$ on $\mathcal{F}$ such that the probability distribution of $(X, Y)$ is absolutely continuous with respect to $\mu_1 \times \mu_2$. Denote its density by $f$. This is a general framework which includes most applications and enables us to find easily an expression for the conditional distributions. Generally, $\mathbb{X}$ and $\mathbb{Y}$ are subsets of $\mathbb{R}^k$, for some $k$, and $\mu_1$ and $\mu_2$ are the corresponding Lebesgue measures.
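If $f$ is the density of the probability distribution of $(X, Y)$ with respect to $\mu_1 \times \mu_2$, then one can take

$$p(B, x) = \frac{\int_B f(x, y)\, \mu_2(dy)}{\int_{\mathbb{Y}} f(x, y)\, \mu_2(dy)} \qquad (10)$$

for every $B$ in $\mathcal{Y}$ and every $x$ in $\mathbb{X}$ such that

$$0 < \int_{\mathbb{Y}} f(x, y)\, \mu_2(dy) < \infty. \qquad (11)$$

Note that $p(\cdot, x)$ is absolutely continuous w.r.t. $\mu_2$ and its density is

$$y \mapsto \frac{f(x, y)}{\int_{\mathbb{Y}} f(x, y')\, \mu_2(dy')} \qquad (12)$$

for every $x$ in $\mathbb{X}$ satisfying (11). The density (12), which is called the conditional density of $Y$ given $X$, is what is used in most applications to find an expression for the conditional distribution. Therefore, (10) deserves to be considered as the "practical definition" of conditional probability distribution. Indeed, it is the natural version of the conditional distribution of $Y$ given $X$ whenever a joint density $f$ exists for $X$ and $Y$.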
Clearly, this approach relies on the joint distribution of $X$ and $Y$ and therefore is not available when $X$ is replaced by some non-stochastic information $I$. Moreover, even if $I$ coincides with an outcome of the random variable $X$, to define the conditional distribution of $Y$ given $X$ it is required to know all the possible alternatives of $I$, that is, all the outcomes of $X$. It is also required to assess the joint distribution of $X$ and $Y$ or the conditional distribution of $X$ given $Y$. This is quite easy if, for instance, $I$ is known to be an outcome of some well-defined random experiment. In many situations, however, one has seen the outcome of $X$ and, in order to establish an update of the distribution of $Y$, one needs to retrospectively ponder and imagine a joint probability model. This difficulty arises in different puzzles such as, for instance, Freund's puzzle of the two aces, introduced by Freund (1965). For other puzzles about conditional probabilities, see, for instance, Gardner (1959).
These puzzles have been widely used to discuss the concept of conditional probability. Hutchison (1999, 2008) emphasizes that the updating process needs to take into account the circumstances under which the truth of $I$ was conveyed. Also, Bar-Hillel and Falk (1982) claim that to know how the knowledge was obtained is "a crucial ingredient to select the appropriate model". These scholars present different views about the concept of conditionalization, but all agree on the fact that there would not be a problem if it was known how the information $I$ became available, and therefore one could build a model including $I$.
The concept of conditional probability distributions is certainly appropriate as a procedure to update probabilities on the basis of any new information that was already included in the probability model. But it can be difficult to construct a model that considers all possible relevant information that could become available in the future. Therefore, the problem arises when one obtains some new and possibly unexpected information and wants to use it to update a probability distribution. Indeed, it does not seem appropriate to assess the probability of something which has already been observed.
First, we shall now show that if instead $I$ is the outcome of a random variable $X$ and there is a joint density $f$ for $(X, Y)$, then one can recover as a particular case the conditional distribution of $Y$ given $X$.
If there is a joint density $f$ for $(X, Y)$, then the conditional distribution (10) of $Y$ given $X$ arises as the solution of a decision theoretic problem related to a loss function of the form (2). For every $x$ in $\mathbb{X}$ satisfying (11), such a loss function is: where $S$ is the set of all $y$ in $\mathbb{Y}$ such that $0 < f_Y(y) < \infty$, $P_Y$ is the probability distribution of $Y$, and $\nu$ is a probability measure on $\mathcal{Y}$ absolutely continuous w.r.t. $P_Y$. The loss (13) is of the form (2) with where $I_S(y)$ is equal to 1 or 0 depending on whether $y$ belongs to $S$ or not. For every $x$ in $\mathbb{X}$ satisfying (11), the conditional distribution $p(\cdot, x)$ given by (10) minimizes the loss (13).
If the random variable $X$ is replaced by some non-stochastic information $I$, then the self-information loss (14) cannot be defined, but one can still resort to a loss function of the form (2) by choosing a different loss $l(\theta, I)$. So, the approach introduced in Section 2 provides a general definition of conditional distributions based on non-stochastic information.

Partial information.
We now consider a partial information problem.
Here the parameter of interest is $\theta$, yet the information $I$ collected is more informative; it is possible to identify $I_\Theta \subset I$ which provides all the information about $\theta$. One is therefore interested in constructing the loss function $l(\theta, I_\Theta)$. A particular example to be looked at in detail is the proportional hazards model. If the model is that the hazard function is $g_0(t) \exp\{g(\theta, z_i)\}$ for individual $i$, where $z_i$ is the covariate value for individual $i$, $g_0$ is the baseline hazard function and $g(\theta, z)$ is the regression function, then the information about $\theta$ is provided by individual $i$ failing from the set of possible failures $S_i = \{j : t_j \ge t_i\}$, where $t_i$ is the time of failure of individual $i$, and these are assumed to be different for each individual. The assumption that allows us to use the partial information is that the failure times provide information about $\theta$ only through the sets $\{S_i\}$. Hence, there are $k \le n$ pieces of information, where $k$ is the number of individuals whose failure time is known, and it is usual to denote this by setting $\delta_i = 1$.
Using the partial self-information or logarithmic loss function, we have and so the solution to the decision problem is given by This approach is new; it has not been adopted by Bayesians for lack of a formal motivation. In Appendix C we consider other stylized models.

Illustrations.
In this section we discuss the application of our approach to two important inferential problems. The first is an analysis of variation in survival times of colon cancer patients incorporating genetic information as potential predictors. The second is joint inference on a set of quantiles. In both cases we claim that the choice of loss function is well founded (and unique) and that there is no traditional Bayesian interpretation of the updates we are implementing. Yet the updates we employ do allow us to learn about the specified parameters of interest. All of the models used to generate results are available as open source code in R.
5.1 Colon cancer genetic survival analysis. Colon cancer is a major worldwide disease with increasing prevalence, particularly within western societies. Exploring the genetic contribution to variation in survival times following incidence of the cancer may shed light on the disease aetiology and underlying disease heterogeneity. To this aim, collaborators at the Wellcome Trust Centre for Human Genetics, University of Oxford, obtained survival times on 918 cancer patients with germline genotype data at hundreds of thousands of markers genome-wide. For demonstration purposes we only consider one chromosome's worth of data containing 15,608 genotype measurements.
The data table $X$ then has $n = 918$ rows and $p = 15{,}608$ columns, where $(X)_{ij} \in \{0, 1, 2\}$ denotes the genotype of the $i$th individual at the $j$th marker. Alongside this we have the corresponding $(n \times 2)$ response table of survival times $Y$, with a column of event times $y_{i1} \in \mathbb{R}^+$ and a column of indicator variables $y_{i2} \in \{0, 1\}$ denoting whether the event is observed or right-censored at $y_{i1}$.
To explore association between genetic variation and time-to-event we employ a loss function derived under proportional hazards, treating the baseline hazard as a nuisance parameter. This is based on the Cox proportional hazards (PH) model, one of the most widely used methods in survival analysis since its introduction in Cox (1972). In this log-linear model the hazard rate at time $t$ for an individual with covariate $x = \{x_1, \ldots, x_p\}$ is defined as $h(t \mid x) = h_0(t) \exp(x^\top \beta)$, where $h_0(t)$ is a baseline hazard function. In the seminal work of Cox (1972), $h_0(t)$ is treated as a nuisance parameter (or process) that does not enter into the partial likelihood for estimating the parameters of interest $\beta$.
Using our construction we can consider only the order of events as partial information relevant to the regression coefficients $\beta$, via the cumulative loss function, where $R_i$ denotes the risk set, those individuals still at risk (neither failed nor censored) just before time $t_i$, and in this way obtain a conditional distribution $\pi(\beta \mid x)$.
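To make the cumulative loss concrete, the following sketch evaluates the negative log partial likelihood of the Cox model, $\sum_{i:\,\text{event}} \big[\log \sum_{j \in R_i} \exp(x_j^\top \beta) - x_i^\top \beta\big]$ with $R_i = \{j : t_j \ge t_i\}$. This standard form is an assumption about the loss intended here, the function name is hypothetical, and ties in the event times are assumed absent.

```python
import math


def cox_partial_loss(beta, times, events, X):
    """Negative log partial likelihood (assumed form of the cumulative loss).

    beta   : list of p coefficients
    times  : list of n observed times
    events : list of n indicators, 1 = event observed, 0 = right-censored
    X      : list of n covariate rows, each of length p
    """
    # Linear predictors x_j' beta for every individual.
    eta = [sum(b * xij for b, xij in zip(beta, row)) for row in X]
    loss = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored individuals contribute only through risk sets
        risk = [eta[j] for j, t_j in enumerate(times) if t_j >= t_i]
        m = max(risk)  # log-sum-exp for numerical stability
        log_sum = m + math.log(sum(math.exp(e - m) for e in risk))
        loss += log_sum - eta[i]
    return loss


# Toy data (hypothetical): four individuals, one covariate.
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 1]
X = [[0.0], [1.0], [2.0], [1.0]]

loss_at_zero = cox_partial_loss([0.0], times, events, X)
print(loss_at_zero)  # at beta = 0 this is the sum of log risk-set sizes
```

A general Bayesian posterior for $\beta$ is then proportional to $\exp\{-\text{loss}(\beta)\}\,\pi(\beta)$; the baseline hazard never needs to be specified.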

Single marker association.
As is standard practice, e.g. Balding (2006), we initially investigate the evidence of genetic association by testing each of the 15,608 markers in turn using a univariate model with loss, for each of the $j = 1, \ldots, 15{,}608$ genetic markers. An advantage of our approach is the incorporation of prior information into the analysis. In most modern genome-wide genetic association studies we expect a priori that the coefficient values of predictive markers will be small, as otherwise we would have detected association of the marker using historic linkage based methods with lower resolution but higher power. Hence, we have additional information on the coefficient values. For unknown markers truly associated with survival we assume, and set $v_j = 0.5$ for our study, reflecting beliefs that associated coefficients will be modest. For each marker we now include an indicator variable, $\delta_j \in \{0, 1\}$, that specifies whether there is any association at the corresponding marker or not. This defines a hierarchical prior with, otherwise, and our prior $\pi(\delta_j)$ reflects beliefs about whether the corresponding $\beta_j$ will be zero or not. For now we shall simply assume $\pi(\delta_j = 1) = 0.5$, although we note it is straightforward to incorporate genetic prior information here.
In this way we can use our framework to calculate a posterior measure $\pi(\delta_j, \beta_j \mid x, y)$ for each marker. Interest lies in the evidence for a non-zero effect, i.e., in the marginal, In particular we can define the general Bayes Factor of association at the $j$th marker as, The one-dimensional integral in the numerator is simple to evaluate using quadrature or Monte Carlo methods. However, with a large sample size and over 15,000 integrals to calculate it is convenient to adopt a Laplace approximation to the integral, namely, where $\hat{\beta}_j$ is the MAP estimator, the mode of the posterior $\pi(\beta_j \mid \delta_j, x, y)$, and $\hat{\Sigma}_j$ is an estimate of the Hessian at the mode. Both the MAP estimate and the Hessian can be calculated efficiently under our loss and normal prior for $\beta_j$. We calculated the general Bayes Factors for each marker and in Fig (1) we plot the log Bayes Factors over the chromosome. While there is considerable variation, we observe strong evidence of association around marker 10,000.
To test whether the Laplace approximation is accurate we selected 500 markers at random and ran a Monte Carlo importance sampler with importance distribution $N(\hat{\beta}_j, \hat{\Sigma}_j^{-1})$ and 500 samples. Fig (2) indicates that the Laplace approximation appears accurate. This is not so surprising given we have 918 observations and a single parameter.
It is interesting to compare the evidence of association provided by the Bayes Factors in Fig (1) with that obtained using a conventional Cox PH partial-likelihood based test. In Fig (3) we plot the log Bayes Factors versus $-\log_{10}$ p-values obtained from a likelihood ratio test. We can see general agreement, especially at the markers with strongest association, as one would expect for a large sample size. Interestingly, there appears to be greater dispersion at markers of weaker association. In Fig (4) we highlight the region of weaker association and colour the points by the standard error of the maximum likelihood estimate. We can see a tendency for markers with less information (greater standard error) to be attenuated towards a log Bayes Factor of 0 under the general Bayesian approach. This is further highlighted in Fig (5), where we plot the standard error against log Bayes Factors. Markers with high standard error relate to genotypes of rarer alleles, and the attenuation reflects a greater degree of uncertainty for association at these markers that contain less information.
Returning to the "hit region" showing strongest association around marker 10,000, in Fig (6) we see the portion of the graph from Fig (1) containing 800 markers around the marker of strongest association. Due to high collinearity between markers it is not clear whether the signal of association arises from a single effect correlated with others, or from multiple independent association signals. In order to investigate this we developed multiple marker methods.
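The Laplace step can be sketched in a few lines. The code below assumes the general Bayes Factor has the form $\int \exp\{-l(\beta)\}\, N(\beta \mid 0, v)\, d\beta \,/\, \exp\{-l(0)\}$ and approximates the numerator by $\exp\{-h(\hat{\beta})\} \sqrt{2\pi / h''(\hat{\beta})}$, where $h$ is the negative log integrand. The function names, the finite-difference Newton search, and the quadratic test loss are all illustrative assumptions, not the authors' R implementation.

```python
import math


def laplace_log_integral(neg_log_f, start=0.0, h=1e-4, max_iter=100):
    """Laplace approximation to log ∫ exp(-neg_log_f(b)) db in one dimension.

    The mode is found by Newton's method with central finite differences,
    which assumes neg_log_f is smooth and convex near the mode.
    """
    b = start
    for _ in range(max_iter):
        grad = (neg_log_f(b + h) - neg_log_f(b - h)) / (2.0 * h)
        hess = (neg_log_f(b + h) - 2.0 * neg_log_f(b) + neg_log_f(b - h)) / h ** 2
        step = grad / hess
        b -= step
        if abs(step) < 1e-8:
            break
    hess = (neg_log_f(b + h) - 2.0 * neg_log_f(b) + neg_log_f(b - h)) / h ** 2
    return -neg_log_f(b) + 0.5 * math.log(2.0 * math.pi / hess), b


def log_general_bayes_factor(loss, v):
    """Log Bayes Factor of beta ~ N(0, v) against beta = 0 (assumed form)."""
    def neg_log_f(b):
        # cumulative loss plus negative log density of the N(0, v) prior
        return loss(b) + 0.5 * b * b / v + 0.5 * math.log(2.0 * math.pi * v)
    log_numerator, b_map = laplace_log_integral(neg_log_f)
    return log_numerator + loss(0.0), b_map  # dividing by exp(-loss(0))


# Quadratic toy loss, for which the Laplace approximation is exact.
a, c, v = 40.0, 0.3, 0.5
log_bf, b_map = log_general_bayes_factor(lambda b: 0.5 * a * (b - c) ** 2, v)
exact = -0.5 * math.log(1.0 + a * v) - a * c * c / (2.0 * (1.0 + a * v)) + 0.5 * a * c * c
print(log_bf, exact)
```

For a real marker the toy loss would be replaced by the univariate partial-likelihood loss, and the approximation would no longer be exact.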
R code to calculate Bayes Factors for single marker association using Laplace and Monte Carlo Importance Sampling is available.

Multiple marker variable selection.
With the aim of determining whether there are multiple markers underlying the signal of association in Fig (6), we consider a model using potentially all 800 markers in the region and phrase the problem as a variable selection task under a partial likelihood (loss), in which the user suspects that some of the $p = 800$ recorded covariates (15) may not be relevant to variation in survival times.
In the non-Bayesian paradigm, variable selection can proceed by defining a cost function, such as AIC or BIC, that adjusts fit to the data by the number of covariates in the model. Inference proceeds using an optimization algorithm, such as forward or stepwise selection, to find a model that minimises the cost. More recently, penalized-likelihood methods have proved popular (Tibshirani, 1997; Fan and Li, 2002), where the partial likelihood is maximised subject to some constraint on the norm of the regression coefficients defined by some appropriate sparsity-inducing metric.
Despite the enormous impact of Cox PH models and the importance of variable selection, the Bayesian literature in this area is very limited. This is because of the lack of a theoretical foundation to treat $h_0(t)$ as a nuisance parameter, leading to either ad hoc methods or the full specification of a joint probability model. For instance, Faraggi and Simon (1998) and Volinsky et al. (1997) adopt pseudo-Bayesian approaches. Volinsky et al. (1997) take the BIC as an approximation to the marginal likelihood and use a branch and bound algorithm to find a set of models with differing sets of covariates with high BIC scores. The difficulty here is that, while the methods are important and well motivated, they are ultimately ad hoc. Moreover, prior information on $\pi(\beta)$ does not enter into the calculation of the BIC, meaning that an important aspect of the Bayesian approach is lost.
In contrast, Ibrahim et al. (1999) consider variable selection within a full joint model using a prior specification of a gamma process for the baseline hazard. This provides a formal Bayesian solution, but inference is then conditional on, and sensitive to, the specification of the prior on $h_0(t)$, something the partial-likelihood model explicitly avoids.
Here we use the partial information relevant to the regression coefficients $\beta$ via the cumulative loss function, where $R_i$ denotes the risk set, those individuals still at risk (neither failed nor censored) just before time $t_i$. As in Section 5.1.1 we assume proper priors $\pi(\beta)$ on the regression coefficients, where $\delta_j \in \{0, 1\}$ is an indicator variable on covariate relevance with, where $\mathrm{Bn}(\cdot)$ denotes the Bernoulli distribution, but we now treat $\{\delta_1, \ldots, \delta_{800}\}$ as a vector in a joint model. In this way the posterior $\pi(\delta \mid x)$ quantifies beliefs about which variables are important to the regression. We use Markov chain Monte Carlo (MCMC) to draw samples approximately from $\pi(\beta, \delta \mid x)$, from which the marginal distribution on $\delta$ can be examined. In particular we make use of an efficient joint updating proposal, $q(\delta', \beta' \mid \delta)$, within the MCMC as $q(\delta', \beta' \mid \delta) = q(\delta' \mid \delta)\, q(\beta' \mid \delta')$, where $q(\delta' \mid \delta)$ proposes a local move to add, remove, or swap one variable per MCMC iteration in or out of the current model indexed by $\delta$, and $q(\beta' \mid \delta')$ is a joint independence Metropolis update proposal, where $\{\tilde{\beta}_{\delta'}, \tilde{V}_{\delta'}\}$ are the MAP and approximate information matrix obtained from the combination of log-partial-loss and normal prior. The joint proposal is then accepted with probability,
We ran our MCMC algorithm for 100,000 iterations with prior parameter settings $\{v_j = 0.5, a_j = 1/800\}$ for all $j = 1, \ldots, p$, equivalent to a prior assumption of a single associated marker. In Fig (7) we show the marginal inclusion probability, after discarding 10,000 samples as a burn-in. The algorithm showed an overall acceptance rate of 8% for proposed moves. The model suggests overwhelming evidence for a single marker in the region of index 10200, but also weaker evidence of independent signal in a couple of other regions. R code to perform the reversible jump MCMC multiple variable sampling for the Cox PH partial likelihood with normal priors is available on request.
5.2 Joint inference for quantiles and the Bayesian Boxplot. We discuss this illustration for three reasons. The first is that there is a unique loss function for learning about a set of quantiles, countering the notion that loss functions are arbitrary. The second is that there is no traditional Bayesian version for updating a set of quantiles which can coincide with our approach. Finally, we show how boxplots, one of the most widely used exploratory graphical tools, can be enhanced by taking into account uncertainty in the plot due to a finite sample size.
Let us start with the median alone. The unique loss function for learning about the median of a distribution function is given by $l(\theta, x) = w|\theta - x|$ for some $w > 0$. Hence, the posterior distribution is given by $\pi(\theta \mid x) \propto \pi(\theta) \exp\{-w \sum_{i=1}^n |\theta - x_i|\}$. One might be tempted to argue that this is merely a Bayesian update using the Laplace distribution and hence falls within the Bayesian paradigm. This is correct, but it would put the Bayesian in an awkward quandary if she knew, for example, that the observations were coming from a normal distribution.
In fact we are, as we have stated previously, not assigning a probability model for $x$. To make this distinction more explicit, let us consider the situation where we want to learn about the three quartiles $(\theta_1, \theta_2, \theta_3)$ jointly, where $\theta_1$ is the lower quartile, $\theta_2$ the median, and $\theta_3$ the upper quartile. The prior will be denoted by $\pi(\theta_1, \theta_2, \theta_3)$, which would obviously include the constraint $\theta_1 < \theta_2 < \theta_3$. The loss function $l(\theta, x)$ in this case, treating the learning of the quartiles with equal importance, is given by $l(\theta, x) = w\{0.75(\theta_1 - x)_+ + 0.25(x - \theta_1)_+ + 0.5|\theta_2 - x| + 0.25(\theta_3 - x)_+ + 0.75(x - \theta_3)_+\}$ for some $w > 0$, where $(a)_+ = \max(a, 0)$. Then the posterior distribution is given by $\pi(\theta_1, \theta_2, \theta_3 \mid x) \propto \pi(\theta_1, \theta_2, \theta_3) \exp\{-\sum_{i=1}^n l(\theta, x_i)\}$. This cannot be obtained by any Bayesian model that has currently been proposed. It is certainly therefore not classifiable as a Bayesian update.
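The quartile loss can be checked directly: its cumulative version should be minimized at the sample quartiles. The sketch below uses the standard check (pinball) loss, in which the weight on $(x - \theta)_+$ equals the target quantile level $q$ and the weight on $(\theta - x)_+$ equals $1 - q$. Function names and data are illustrative; the search over data points relies on the cumulative loss being piecewise linear and convex in each component.

```python
def pinball(theta, x, q):
    """Check loss for quantile level q: q*(x - theta)+ + (1 - q)*(theta - x)+."""
    return q * max(x - theta, 0.0) + (1.0 - q) * max(theta - x, 0.0)


def quartile_loss(theta, x, w=1.0):
    """Joint loss for (lower quartile, median, upper quartile), equally weighted."""
    t1, t2, t3 = theta
    return w * (pinball(t1, x, 0.25) + pinball(t2, x, 0.5) + pinball(t3, x, 0.75))


def cumulative_loss(theta, xs, w=1.0):
    return sum(quartile_loss(theta, x, w) for x in xs)


# Seven hypothetical observations; a minimizer can always be found among
# the data points, so it suffices to search ordered triples of them.
xs = [3.0, 1.0, 7.0, 5.0, 2.0, 6.0, 4.0]
candidates = [(a, b, c) for a in xs for b in xs for c in xs if a < b < c]
best = min(candidates, key=lambda th: cumulative_loss(th, xs))
print(best)  # the 2nd, 4th and 6th order statistics
```

Note that `pinball(t, x, 0.5)` equals $0.5|t - x|$, so the middle term reproduces the median loss of the previous paragraph.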
We can illustrate the utility of this by considering a boxplot. In Fig (8) we show a boxplot of data taken from the example used in the MATLAB help file for the function boxplot.m, in the statistics toolbox. The plot illustrates the distribution of miles per gallon (MPG) from records of a selection of cars taken in the 1970s, broken down by manufacturing country. The data set is available as carbig.mat in MATLAB; we have omitted the 'England' group, which contains only 1 observation.
The boxplot is one of the most important and widely used graphical tools for summarising the distribution of data and highlighting potential differences in the distributions across groups, but there is traditionally no uncertainty displayed in the summary statistics of the distributions used in the boxplot. In fact, for this data there are only 13 observations for "French" cars while there are 249 observations for the "USA", yet the conventional boxplot fails to inform on this.
We then implemented a Metropolis-Hastings MCMC algorithm to sample from the posterior π(θ 1 , θ 2 , θ 3 |x), for each of the 6 groups of cars shown in Fig ( 8), using 100,000 samples with a 50,000 sample burn-in.
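A sampler of this kind can be sketched with a random-walk Metropolis algorithm. The version below is only an illustration under stated assumptions: a flat prior on the ordered region $\theta_1 < \theta_2 < \theta_3$, loss scale $w = 1$, the standard check loss for each quartile, and synthetic data in place of the MPG records. The step size, chain length and function names are arbitrary choices, not the authors' settings.

```python
import math
import random


def pinball(theta, x, q):
    """Check loss for quantile level q: q*(x - theta)+ + (1 - q)*(theta - x)+."""
    return q * max(x - theta, 0.0) + (1.0 - q) * max(theta - x, 0.0)


def log_posterior(theta, xs, w=1.0):
    """log of exp(-w * cumulative loss) with a flat prior on t1 < t2 < t3 (assumed)."""
    t1, t2, t3 = theta
    if not (t1 < t2 < t3):
        return -math.inf  # outside the support of the prior
    loss = sum(pinball(t1, x, 0.25) + pinball(t2, x, 0.5) + pinball(t3, x, 0.75)
               for x in xs)
    return -w * loss


def metropolis(xs, init, n_iter=6000, burn_in=1000, step=0.8, seed=0):
    """Random-walk Metropolis: joint Gaussian proposal on all three quartiles."""
    rng = random.Random(seed)
    theta, lp = tuple(init), log_posterior(init, xs)
    samples = []
    for it in range(n_iter):
        proposal = tuple(t + rng.gauss(0.0, step) for t in theta)
        lp_new = log_posterior(proposal, xs)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
        if it >= burn_in:
            samples.append(theta)
    return samples


xs = [float(i) for i in range(1, 26)]  # synthetic data: 1, 2, ..., 25
samples = metropolis(xs, init=(7.0, 13.0, 19.0))
post_means = [sum(s[k] for s in samples) / len(samples) for k in range(3)]
print(post_means)
```

Credible intervals for each quartile can then be read off the empirical quantiles of the corresponding column of `samples`.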
In Fig (9) we show our "Bayesian boxplot", which includes the original boxes (empirical estimates) overlaid with 95% credible intervals for $(\theta_1, \theta_2, \theta_3)$. Credible intervals are shown as extended dotted lines from the empirical estimates, with a small diamond denoting the edge of the interval. In comparison with Fig (8) we see that Fig (9) contains much more information. For example, we see that while in Fig (8) the median MPG of Italian and Swedish cars looks different, in fact the 95% credible intervals overlap in Fig (9). In addition we see that there is considerable overlap in the distribution of medians between Sweden and the USA; in general, comparisons of medians or distributions in the conventional boxplot are obscured and confounded by sample size.
The MCMC samples approximately from $\pi(\theta_1, \theta_2, \theta_3 \mid x)$ for France and the USA are shown in Figs (10) and (11). The data set for France contains 13 observations and hence there is much greater uncertainty in the posterior marginals. Moreover, looking at the joint densities of $(\theta_1, \theta_2)$ and $(\theta_2, \theta_3)$ we can see the constraints imposed by the prior. In contrast, due to the higher sample size the posterior samples for the USA are tighter and hence exhibit less dependence. An interesting extension would be to include hierarchical priors on the quartiles whereby one could borrow strength across groups.
6. Discussion. We have provided a basis for general learning and the updating of information using belief probability distributions. Loss functions constructed on spaces of probability measures allow for coherent updating. Specifically, information is connected to the parameter of interest via a loss function and this is the fundamental concept, replacing the restrictive connection based on probability models. We can recover precisely the traditional updating rules, such as the Bayes rule, when we select the self-information loss function, when it is appropriate to do so.
The assumptions we make are minimal: that information can be connected to unknown parameters via loss functions, and that individuals then act rationally by minimizing their expected loss. If information is assumed to come from some probability model then we can accommodate this within our framework by appealing to the self-information loss function, equivalent to the negative log-likelihood, and so we can argue that loss functions are sufficient for learning mechanisms currently in use.
The scope of our findings provides extensive generalizations to the Bayes updating rule. For the Bayesian, the difficulty of constructing a probability model, with all the implications about assigning probability one to events, can be compared to the ease of introducing a loss function, which has no further implications. A probability model needs to assert a sample space with alternatives and assign probabilities to all outcomes. On the other hand, a loss function can be constructed after the information has been received and determined solely for the known information, without need to consider which alternative information could have been received. Yet, surprisingly, both approaches can coincide, which suggests the Bayesian support theory is more than is really needed.
More generally, we can use loss functions currently employed in a classical context for robust estimation; for example, generalized estimating equations. We can also deal appropriately with partial information, where only a part of some observed information is useful or relevant for learning about a particular parameter of interest on which the decision making process is based.
We have developed a rigorous approach to updating beliefs where we are required only to think about which is the best parameter from a chosen model needed to make a decision rather than have to think about a nonexistent true model parameter which coincides with the true data generating mechanism.
6.1 Optimal Decisions. Let us now recap the story from a slightly different perspective when observations are independent and identically distributed from $F_0(x)$ and an action $a \in A$ is to be made. The decision maker is happy to make an action if the minimizer, $\theta_0$, of $\int l(\theta, x)\, dF_0(x)$ is known, for some loss function $l(\theta, x)$. This action is based on the utility function $u(a, \theta)$ and hence the action would be the one maximizing $u(a, \theta_0)$.
With $\theta_0$ not being known, as $F_0$ is not known, a prior distribution $\pi(\theta)$ is constructed expressing beliefs about the location of $\theta_0$. Then, with data $(x_i)_{i=1}^n$, the loss function picking out the appropriate probability measure $\nu(\theta)$, with which to provide an action $a$ through the maximization of expected utility, i.e. $U(a) = \int_\Theta u(a, \theta)\, \nu(d\theta)$, itself minimizes the loss function In this way it is seen that the sequence of $\nu(\theta)$ should accumulate about $\theta_0$.
To us, now, there seems to be no reason whatsoever why $l(\theta, x)$ should be exclusively based on a probability distribution. For example, if we want the median then $l(\theta, x) = |\theta - x|$; if we want the mean then $l(\theta, x) = (\theta - x)^2$; whereas if we want the $\theta$ taking us closest in Kullback-Leibler divergence to $f_0$, then $l(\theta, x) = -\log f(x; \theta)$.
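The correspondence between loss functions and targets can be seen by minimizing the empirical cumulative loss. The sketch below does this with a simple ternary search, which is adequate here because each cumulative loss is convex with a unique minimizer; the $N(\theta, 1)$ family used for the self-information loss, the data, and the function names are illustrative assumptions.

```python
import math


def argmin_convex(f, lo, hi, tol=1e-9):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


xs = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical data: median 3, mean 4


def absolute_loss(theta):
    return sum(abs(theta - x) for x in xs)


def squared_loss(theta):
    return sum((theta - x) ** 2 for x in xs)


def self_information_loss(theta):
    # -log f(x; theta) for the N(theta, 1) family (an illustrative choice)
    return sum(0.5 * (x - theta) ** 2 + 0.5 * math.log(2.0 * math.pi) for x in xs)


theta_median = argmin_convex(absolute_loss, min(xs), max(xs))
theta_mean = argmin_convex(squared_loss, min(xs), max(xs))
theta_kl = argmin_convex(self_information_loss, min(xs), max(xs))
print(theta_median, theta_mean, theta_kl)
```

The absolute loss picks out the sample median and the squared loss the sample mean; for this particular family the self-information loss also targets the mean, since its negative log density is a squared loss up to a constant.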

Conclusion.
We acknowledge we have presented a general framework which at first sight might appear to sanction "anything goes". This is wrong. We have replaced a subjective probability model with an objective loss function, since the parameter of interest is typically defined by the statistical problem. In this case, the loss function connecting the information to the parameter is unique. See, for example, Section 5.2, in the case of the parameter of interest being the median. On the other hand, there is no unique probability distribution to use to first model the data and then use this to estimate the median.
When the interest is in a parameter indexing a family of densities and the parameter to target is the one which makes this family closest to the true model, then the unique loss function in this case is the self-information loss, which yields the Bayesian update.
We believe it is more fundamental to identify parameters of interest through loss functions and the corresponding information available. The alternative route through a probability model is, we argue, highly restrictive and leads to narrow types of Bayesian updating and, moreover, is more arbitrary. The necessary supporting theory for us is minimal, namely the construction and minimization of loss functions, whereas the theory supporting the use of probability models is more intricate and restrictive.
For our idea, two assumptions in order to achieve this almost sure accumulation are:
1. The likelihood ratio satisfies $n^{-1} \sum_{i=1}^n \log\{f(x_i; \hat{\theta})/f(x_i; \theta_0)\} \to 0$ a.s., where $\hat{\theta}$ is the maximum likelihood estimator; that is, $\hat{\theta}$ maximizes $\prod_{i=1}^n f(x_i \mid \theta)$. We of course assume that this exists in the first place.
2. The best parameter $\theta_0$ is in the support of the prior, so $\pi(\theta : 0 < D(f_0(x), f(x; \theta)) < \delta + \eta) > 0$ for all $\eta > 0$.
The first concerns the convergence of the maximum likelihood estimator to the best parameter $\theta_0$. The topic is dealt with by White (1982), who gives conditions under which $\hat{\theta} \to \theta_0$ a.s., and the additional assumptions under which condition 1. is satisfied.
Condition 2. is clearly a support condition, so that the prior actually does put mass in a suitable neighborhood of θ 0 .
It is sufficient to consider the following problem. Take out a neighborhood $N$ about $\theta_0$ so that now the parameter closest to $f_0$ has a Kullback-Leibler distance $\delta^* > \delta$, and label the parameter as $\theta_0^*$. We will now show that $I_{n1}/I_{n2} \to +\infty$ a.s. where and where, for example, $\pi_N$ is $\pi$ restricted to the set $N$. Now, using assumption 2., and following ideas in Barron, Schervish and Wasserman (1999), it can be shown that $I_{n1} \ge e^{-nc}$ a.s. for all large $n$, for any $c > \delta$. Also, based on assumption 1.,
C2 Hierarchical model. A random effects hierarchical model will be similar to the above described regression model; yet here we would have the $f(x \mid z, \theta)$ in a particular form given by $f(x_i \mid z_i, \theta) = \int_B f(x_i \mid z_i, \beta_i, \theta)\, f(\beta_i \mid z_i, \theta)\, d\beta_i$.
We retain $I_i = (x_i, z_i)$. One determines here that there is a $\theta$ to be learnt about which involves an unobserved set of $(\beta_i)$. We can define $\theta_0$ as in the regression case; that is, the $\theta$ which minimizes In this case it would be quite challenging to find an alternative to the $l(\theta, (x_i, z_i)) = -\log f(x_i \mid z_i, \theta)$ loss function. For inference one can solve for $\nu(d\theta)$ and then set up the joint measure $\nu(d\theta, \beta_1, \ldots,$ to allow inference via Markov chain Monte Carlo methods, for example.
C3 Time series model. Now let us consider a time series setting whereby it is deemed that $x_i$ depends on $(x_{i-1}, \ldots, x_{i-p})$; that is, a $p$-autoregressive model. In this case $I_i = (x_i, x_{i-1}, \ldots, x_{i-p})$ and if we model the observations through $f(x_i \mid x_{i-1}, \ldots, x_{i-p}, \theta)$ then the Bayesian update arises by taking $l(\theta, I_i) = -\log f(x_i \mid x_{i-1}, \ldots, x_{i-p}, \theta)$. In this case the target $\theta_0$ will be the parameter minimizing Of course this assumes that the order is known to be $p$, and typically this will be unknown. Writing the true model as $f_0(x_i \mid x_{i-1}, \ldots, x_1)$ we can construct a general model as where $w_p$ is the probability that the correct order is $p$ and $\theta = (\theta_1, \theta_2, \ldots)$, so $\pi(\theta) = \prod_{p=1}^{\infty} \pi_p(\theta_p)$ and $\pi_p(\theta_p)$ represents the beliefs about which $\theta_p$ takes $f_p(\cdot \mid x_{i-1}, \ldots, x_{i-p}, \theta_p)$ closest to $f_0(\cdot \mid x_{i-1}, \ldots, x_{i-p})$, conditional on the truth of $p$ being the correct order. The Bayesian update now arises by taking $l(\theta, (x_i, x_{i-1}, \ldots, x_1)) = -\log f(x_i \mid x_{i-1}, \ldots, x_1, \theta)$.
C4 Grouped data model. Here we consider the case when we have repeated observations on independent units; so $I_i = (x_{i1}, \ldots, x_{in_i}, z_i)$, where the $z_i$ are unit specific covariates. If it is assumed that the $(x_{ij})_j$ are conditionally independent given an unobserved parameter $\beta_i$ then we would have a model of the type and we recover the Bayesian update when we take $l(\theta, (x_i, z_i)) = -\log f(x_i \mid z_i, \theta)$, and the interpretation for the prior $\pi(\theta)$ is again to do with beliefs about where the $\theta$ taking this model closest to the true model is to be located.
This section shows that it is possible to undertake Bayesian inference with models in the M-open view by taking the logarithmic loss functions associated with these models. The interpretation of $\theta$ is different, however. We construct prior distributions and learn about the best parameter $\theta_0$ which takes us closest to the true model. It is assumed the data can give the true model completely and therefore there is access to $\theta_0$.

Figure 9: General Bayesian Boxplot of cars MPG data using Unit Information Loss