Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings
Summary
In the practice of point prediction, it is desirable that forecasters receive a directive in the form of a statistical functional. For example, forecasters might be asked to report the mean or a quantile of their predictive distributions. When evaluating and comparing competing forecasts, it is then critical that the scoring function used for these purposes be consistent for the functional at hand, in the sense that the expected score is minimized when following the directive. We show that any scoring function that is consistent for a quantile or an expectile functional can be represented as a mixture of elementary or extremal scoring functions that form a linearly parameterized family. Scoring functions for the mean value and probability forecasts of binary events constitute important examples. The extremal scoring functions admit appealing economic interpretations of quantiles and expectiles in the context of betting and investment problems. The Choquet‐type mixture representations give rise to simple checks of whether a forecast dominates another in the sense that it is preferable under any consistent scoring function. In empirical settings it suffices to compare the average scores for only a finite number of extremal elements. Plots of the average scores with respect to the extremal scoring functions, which we call Murphy diagrams, permit detailed comparisons of the relative merits of competing forecasts.
1 Introduction
Over the past two decades, a broad transdisciplinary consensus has developed that forecasts ought to be probabilistic in nature, i.e. they ought to take the form of predictive probability distributions over future quantities or events (Gneiting and Katzfuss, 2014). Nevertheless, a wealth of applied settings require point forecasts, be it for reasons of decision making, tradition, reporting requirements or ease of communication. In this situation, a directive is required about the specific feature or functional of the predictive distribution that is being sought. Here, a functional is a potentially set-valued mapping F ↦ T(F) from a class ℱ of probability distributions to the real line ℝ, with the mean or expectation functional, quantiles and expectiles being key examples. Competing point forecasts are then compared by using a non‐negative scoring function S(x,y) that represents the loss or penalty when the point forecast x is issued and the observation y realizes. A critically important requirement on the scoring function is that it be consistent for the functional T relative to the class ℱ, in the sense that

E_F{S(t, Y)} ⩽ E_F{S(x, Y)} (1)

for all F ∈ ℱ, all t ∈ T(F) and all x ∈ ℝ. (Throughout the paper, the notation E_F indicates that the expectation is taken with respect to Y∼F.) If equality in expression (1) implies that x ∈ T(F), then the scoring function is strictly consistent. Thus, under a strictly consistent scoring function, a forecaster optimizes her expected score by giving a truthful and accurate assessment of the functional T(F).
Squared error, S(x,y) = (x − y)², is strictly consistent for the mean or expectation functional relative to the class of probability distributions with finite variance. However, there are many alternatives. In a classical paper, Savage (1971) showed that, subject to weak regularity conditions, a scoring function is consistent for the mean functional if and only if it is of the form

S(x,y) = ϕ(y) − ϕ(x) − ϕ′(x)(y − x), (2)

where ϕ is a convex function with subgradient ϕ′; squared error arises when ϕ(t) = t². Holzmann and Eulert (2014) proved that, when forecasts make ideal use of nested information bases, the forecast with the broader information basis is preferable under any consistent scoring function.
However, in real world settings, as pointed out by Patton (2015), forecasts are hardly ever ideal, and the ranking of competing forecasts may depend on the choice of the scoring function. This had already been observed by Murphy (1977), Schervish (1989) and Merkle and Steyvers (2013), among others, in the important special case of a binary predictand, where y=1 corresponds to a success and y=0 to a non‐success, so that the mean of the predictive distribution provides a probability forecast for a success. As there is no obvious reason for one consistent scoring function to be preferred over any other, this raises the question of which one of the many alternatives to use.


To address this question, we show that every scoring function of the form (2) admits a representation as a mixture of elementary scoring functions

S_θ(x,y) = |y − θ| 1{min(x,y) ⩽ θ < max(x,y)}, θ ∈ ℝ.

Here and in what follows, we write (t)_+ = max(t, 0) for the positive part of t ∈ ℝ and 1{A} for the indicator function of the event A. Thus every scoring function consistent for the mean can be written as a weighted average over elementary or extremal scores S_θ. As an important consequence of the mixture representation, a point forecast that is preferable in terms of each S_θ is preferable in terms of any consistent scoring function. The elementary score S_θ(x,y) is the distance of y from θ if x is on the opposite side of θ from y, and otherwise it is zero. It can be seen as the loss, relative to an oracle, in an investment problem with cost basis θ and revenue y. If x and y are on the same side of θ, the forecast x entails the same decision that would be taken by an oracle, yielding a loss of zero. Otherwise, the deficit relative to the oracle is |y − θ| if losses due to the two types of error are equally weighted. In the generalization to be discussed in this paper, they may be weighted unequally, which yields scoring functions for expectile functionals; see Section 2.3.
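A minimal sketch of the elementary score for the mean just described (the function name is our own):

```python
# Elementary score for the mean: the distance of y from theta when the
# forecast x and the realization y lie on opposite sides of theta, else zero.
def elementary_score(theta, x, y):
    return abs(y - theta) if min(x, y) <= theta < max(x, y) else 0.0

# Investment reading: with cost basis theta = 2 and forecast x = 3, the
# forecaster invests; if the revenue realizes as y = 1 < theta, an oracle
# would have abstained, and the regret is |y - theta| = 1.
print(elementary_score(2.0, 3.0, 1.0))  # 1.0 (x and y on opposite sides)
print(elementary_score(2.0, 3.0, 5.0))  # 0.0 (same side: no regret)
```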
Suppose, for instance, that we observe forecast cases (x_i⁽¹⁾, x_i⁽²⁾, y_i) for i=1,…,n, where x_i⁽¹⁾ and x_i⁽²⁾ are competing point forecasts and y_i is the subsequent outcome. We may compare the two forecasts graphically, by plotting the respective empirical scores

s_j(θ) = (1/n) Σ_{i=1,…,n} S_θ(x_i⁽ʲ⁾, y_i), j = 1, 2, (3)

at finitely many values of θ; see Section 3.4. An example of this type of display, which we term a Murphy diagram, is shown in Fig. 1, where we consider point forecasts of wind speed at a major wind energy centre.

Fig. 1. Murphy diagrams for competing point forecasts of wind speed at a major wind energy centre, obtained with a regime switching space–time (RST) or auto‐regressive (AR) technique (Gneiting et al., 2006); the functional considered is the mean of the predictive distribution: (a) empirical scores s_j(θ) in expression (3) versus θ; (b) score differences along with pointwise 95% confidence intervals (a negative difference means that the regime switching forecast is preferable; for details, see Sections 3.3 and 3.4)
More generally, for both quantiles and expectiles the apparent wealth of consistent scoring functions can be reduced to a one‐dimensional family of readily interpretable elementary scores, in the sense that every consistent scoring function can be represented as a mixture from that family. The case of the mean or expectation functional, which includes probability forecasts for binary events as a further special case, is nested by the expectile functional. Traditionally, the expectile at level α ∈ (0,1) is the weighted centre of mass of a probability distribution when the probabilities to the right are weighted by α and the probabilities to the left by 1−α. Equivalently, the expectile is the number with respect to which the weighted squared deviation is minimized, where the squares of deviations to the right are weighted by α and those to the left by 1−α.
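The two equivalent characterizations of the expectile can be made concrete in code. The sketch below (our own helper, not from the paper) computes the expectile of a sample by bisection on the balance condition α E{(Y − t)_+} = (1 − α) E{(t − Y)_+}:

```python
# Expectile of a sample at level alpha, via bisection on the balance
# condition alpha * E(Y - t)_+ = (1 - alpha) * E(t - Y)_+.
def expectile(ys, alpha, tol=1e-10):
    lo, hi = min(ys), max(ys)
    def imbalance(t):
        up = sum(max(y - t, 0.0) for y in ys) / len(ys)
        down = sum(max(t - y, 0.0) for y in ys) / len(ys)
        return alpha * up - (1 - alpha) * down   # decreasing in t
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if imbalance(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

ys = [0.0, 1.0, 2.0, 10.0]
print(round(expectile(ys, 0.5), 6))  # 3.25, the sample mean
```

At α = 1/2 the balance condition forces E(Y − t) = 0, which recovers the mean; larger α weight the probabilities to the right more heavily and push the expectile upwards.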
The remainder of the paper is organized as follows. Section 2 is devoted to the key theoretical development, in which we state and discuss the mixture representations, relate to Choquet theory and order sensitivity, and provide economic interpretations of the elementary scores and the associated functionals. In Section 3, we apply the mixture representations to study forecast rankings and propose the aforementioned Murphy diagram for forecast comparisons. Illustrations on data examples follow in Section 4, where we revisit meteorological and economic case‐studies in the work of Gneiting et al. (2006), Rudebusch and Williams (2009) and Patton (2015). The paper closes with a discussion in Section 5. Proofs and computational details are deferred to appendices.
The data that are analysed and the programs that were used to analyse them can be obtained from
2 Consistent scoring functions for quantiles and expectiles
Before focusing on the specific cases of quantiles and expectiles, we review general background material on the assessment of point forecasts, with emphasis on consistent scoring functions.
2.1 Consistent scoring functions
We first introduce notation and explain conventions. Let 𝒫 denote the class of the probability measures on the Borel–Lebesgue sets of the real line ℝ. For simplicity, we do not distinguish between a measure F ∈ 𝒫 and the associated cumulative distribution function (CDF). We follow standard conventions and assume that CDFs are right continuous. A function S defined on a rectangle D ⊆ ℝ × ℝ is called a scoring function if S(x,y) ⩾ 0 for all (x,y) ∈ D with S(x,y) = 0 if x = y. Here, S(x,y) is interpreted as the loss or cost that is accrued when the point forecast x is issued and the observation y realizes. The scoring function is regular if it is jointly measurable and left continuous in its first argument x for every y.
In point prediction problems, it is rarely evident which functional of the predictive distribution should be reported. Guidance can be given implicitly, by specifying a loss function, or explicitly, by specifying a functional. The notion of consistency originates in this setting. Specifically, a functional is a potentially set-valued mapping F ↦ T(F) ⊆ ℝ on a class ℱ ⊆ 𝒫 on which the mapping is well defined. Usually, the functional is single valued, as in the case of the mean functional, where we take ℱ as the class 𝒫₁ of the probability measures with finite first moment. More generally, the expectile at level α ∈ (0,1) of a probability measure F ∈ 𝒫₁ is the unique solution t to the equation

α E_F{(Y − t)_+} = (1 − α) E_F{(t − Y)_+};

the case α = 1/2 corresponds to the mean functional (Newey and Powell, 1987). In the case of quantiles, the functional might be set valued. Specifically, the quantile functional at level α ∈ (0,1) maps a probability measure F to the closed interval with lower limit q_α⁻(F) = inf{t : F(t) ⩾ α} and upper limit q_α⁺(F) = sup{t : F(t) ⩽ α}. The two limits differ only when the level set {t : F(t) = α} contains more than one point, so typically the functional is single valued. Any number between q_α⁻(F) and q_α⁺(F) represents an α‐quantile and will be denoted q_α(F).
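For a discrete distribution the lower and upper quantiles can be read off the CDF directly; a small sketch (our own construction, with an illustrative atom configuration) that returns the quantile interval:

```python
# Lower and upper alpha-quantiles of a discrete distribution with atoms at
# vals (sorted increasingly) and masses probs. The two limits differ only
# when the CDF is flat at height alpha.
def quantile_interval(vals, probs, alpha, eps=1e-12):
    cum = 0.0
    for i, (v, p) in enumerate(zip(vals, probs)):
        cum += p
        if cum >= alpha - eps:
            if abs(cum - alpha) < eps and i + 1 < len(vals):
                return v, vals[i + 1]   # flat level set: a whole interval
            return v, v                 # unique alpha-quantile
    return vals[-1], vals[-1]

print(quantile_interval([0, 1, 2], [0.5, 0.25, 0.25], 0.5))  # (0, 1)
print(quantile_interval([0, 1, 2], [0.5, 0.25, 0.25], 0.6))  # (1, 1)
```

The first call hits the flat stretch of the CDF at height 0.5, so every point of [0, 1] is a median; the second level is crossed strictly, so the quantile is unique.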
A scoring function S is consistent for the functional T relative to the class ℱ if

E_F{S(t, Y)} ⩽ E_F{S(x, Y)} (4)

for all F ∈ ℱ, all t ∈ T(F) and all point forecasts x. It is strictly consistent if it is consistent and equality in expression (4) implies that x ∈ T(F). A functional T that admits a strictly consistent scoring function is called elicitable and can then be represented as the solution to an optimization problem, in that

T(F) = arg min_x E_F{S(x, Y)}.
In what follows, we restrict attention to the quantile and expectile functionals. These are critically important in a gamut of applications, including quantile and expectile regression in general, and least squares (i.e. mean) and probit and logit (i.e. binary probability) regression in particular.
2.2 Mixture representations
The classes of the consistent scoring functions for quantiles and expectiles have been described by Savage (1971), Thomson (1979) and Gneiting (2011), and we review the respective characterizations in the setting of Gneiting (2011), where further detail is available.
Subject to the conditions stated there, a regular scoring function S is consistent for the α‐quantile functional if and only if it is of the form

S(x,y) = (1{y < x} − α){g(x) − g(y)}, (5)

where g is a non‐decreasing function. The key example arises when g is the identity, yielding the asymmetric piecewise linear or pinball score

S(x,y) = (1{y < x} − α)(x − y) (6)

of quantile regression. Similarly, a regular scoring function S is consistent for the α‐expectile functional if and only if it is of the form

S(x,y) = |1{y < x} − α| {ϕ(y) − ϕ(x) − ϕ′(x)(y − x)}, (7)

where ϕ is a convex function with subgradient ϕ′. The key example arises when ϕ(t) = t², where expression (7) becomes the asymmetric squared error

S(x,y) = |1{y < x} − α| (x − y)², (8)

which, for α = 1/2, equals the squared error of ordinary least squares regression up to the factor 1/2.
In view of the representations (5) and (7), the scoring functions that are consistent for quantiles and expectiles are parameterized by the non‐decreasing functions g and the convex functions ϕ with subgradient ϕ′ respectively. In general, neither g nor ϕ and ϕ′ are uniquely determined. We therefore select special versions of these functions. Furthermore, in the interest of simplicity we generally assume that the point forecasts and observations attain values in the entire real line ℝ, adding comments in cases where there are finite boundary points. Let 𝒢 denote the class of all left‐continuous non‐decreasing real functions g, and let 𝒞 denote the class of all convex real functions ϕ with subgradient ϕ′ ∈ 𝒢. This last condition is satisfied when ϕ′ is chosen to be the left‐hand derivative of ϕ, which exists everywhere and is left continuous by construction.
In what follows, we use the symbol 𝒮_α^Q to denote the class of the scoring functions S of the form (5), where g ∈ 𝒢. Similarly, we write 𝒮_α^E for the class of the scoring functions S of the form (7), where ϕ ∈ 𝒞. For all practical purposes, the families 𝒮_α^Q and 𝒮_α^E can be identified with the classes of the regular scoring functions that are consistent for quantiles and expectiles respectively. These classes appear to be rather large. However, in either case the apparent multitude can be reduced to a one‐dimensional family of elementary scoring functions, in the sense that every consistent scoring function admits a representation as a mixture of elementary elements.
Theorem 1.
- (Quantiles): any member of the class 𝒮_α^Q admits a representation of the form

S(x,y) = ∫ S_θ^{Q,α}(x,y) dH(θ), (9)

where H is a non‐negative measure and

S_θ^{Q,α}(x,y) = (1{y < x} − α)(1{θ < x} − 1{θ < y}), θ ∈ ℝ. (10)

The mixing measure H is unique and satisfies dH(θ) = dg(θ) for θ ∈ ℝ, where g is the non‐decreasing function in the representation (5). Furthermore, we have H(x) − H(y) = S(x,y)/(1−α) for x > y.
- (Expectiles): any member of the class 𝒮_α^E admits a representation of the form

S(x,y) = ∫ S_θ^{E,α}(x,y) dH(θ), (11)

where H is a non‐negative measure and

S_θ^{E,α}(x,y) = |1{y < x} − α| |y − θ| 1{min(x,y) ⩽ θ < max(x,y)}, θ ∈ ℝ. (12)

The mixing measure H is unique and satisfies dH(θ) = dϕ′(θ) for θ ∈ ℝ, where ϕ′ is the left‐hand derivative of the convex function ϕ in the representation (7). Furthermore, we have H(x) − H(y) = −∂₂S(x,y)/(1−α) for x > y, where ∂₂ denotes the left‐hand derivative with respect to the second argument.
Note that the relationships (9) and (11) hold pointwise. In particular, the respective integrals are pointwise well defined. This is because, for fixed x and y, the functions θ ↦ S_θ^{Q,α}(x,y) and θ ↦ S_θ^{E,α}(x,y) are right continuous, non‐negative and uniformly bounded with bounded support, and because the non‐decreasing functions g and ϕ′ define non‐negative measures dg and dϕ′ that assign finite mass to any finite interval. In particular, given any non‐negative measure H that assigns finite mass to any finite interval, the representations (9) and (11) generate members of the classes 𝒮_α^Q and 𝒮_α^E respectively. Strict consistency is obtained in case H assigns positive mass to any finite interval.
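The mixture representations can be verified numerically. The sketch below integrates the elementary scores over θ by a midpoint rule and recovers the asymmetric piecewise linear score, which corresponds to g(t) = t, and the asymmetric squared error, which corresponds to ϕ(t) = t² with dϕ′(θ) = 2 dθ; the test point (x, y, α) is an arbitrary choice of ours.

```python
# Elementary quantile and expectile scores, following the definitions above
# (booleans coerce to 0/1 in the arithmetic).
def elem_quantile(theta, x, y, a):
    return ((y < x) - a) * ((theta < x) - (theta < y))

def elem_expectile(theta, x, y, a):
    return abs((y < x) - a) * abs(y - theta) * (min(x, y) <= theta < max(x, y))

x, y, a = 1.3, -0.4, 0.7
lo, hi = min(x, y), max(x, y)       # both elementary scores vanish elsewhere
n = 2000
h = (hi - lo) / n
mids = [lo + (i + 0.5) * h for i in range(n)]

pinball = ((y < x) - a) * (x - y)               # pinball score, g(t) = t
asym_sq = abs((y < x) - a) * (x - y) ** 2       # asymmetric squared error

mix_q = sum(elem_quantile(t, x, y, a) for t in mids) * h         # dH = dtheta
mix_e = sum(2 * elem_expectile(t, x, y, a) for t in mids) * h    # dH = 2 dtheta

print(round(mix_q, 6), round(pinball, 6))  # 0.51 0.51
print(round(mix_e, 6), round(asym_sq, 6))  # 0.867 0.867
```

The agreement is exact up to floating-point error because the integrands are piecewise linear in θ, which the midpoint rule handles without bias.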
In the case of quantiles, the asymmetric piecewise linear scoring function (6) corresponds to the choice g(t) = t in equation (5), so the mixing measure H in the representation (9) is the Lebesgue measure. The elementary scoring function S_θ^{Q,α} arises when g(t) = 1{θ < t}, i.e. when H is a one‐point measure in θ.
The case α = 1/2 of the expectile functional recovers the mean or expectation functional, for which existing parametric subfamilies emerge as special cases of our mixture representation. Patton's (2015) exponential Bregman family arises when ϕ(t) = a⁻² exp(at) for a ≠ 0 in equation (7). The mixing measure H in the representation (11) then has Lebesgue density h(θ) = exp(aθ) for θ ∈ ℝ. For Patton's (2011) family of homogeneous scoring functions for positive outcomes, the mixing measure has Lebesgue density h(θ) = θ^{b−2} for θ > 0, remarkably with no case distinction being required. The elementary scoring function S_θ^{E,α} emerges when ϕ(t) = (t − θ)_+ in equation (7); here the mixing measure in representation (11) is a one‐point measure in θ.
From a theoretical perspective, a natural question is whether the mixture representations (9) and (11) can be considered Choquet representations in the sense of functional analysis (Phelps, 2001). A Choquet representation is a special, non‐redundant type of mixture representation. Specifically, a member S of a convex class 𝒮 of scoring functions is an extreme point of 𝒮 if it cannot be written as an average of two other members, i.e. if S = (S₁ + S₂)/2 with S₁, S₂ ∈ 𝒮 implies S₁ = S₂ = S. Our mixture representations qualify as Choquet representations if the elementary scores S_θ^{Q,α} and S_θ^{E,α} form extreme points of the underlying classes of scoring functions. This cannot possibly be true for our classes 𝒮_α^Q and 𝒮_α^E because they are invariant under dilations and hence admit trivial average representations built with multiples of one and the same scoring function. Therefore, the families 𝒢 and 𝒞 need to be restricted suitably. Specifically, let the class 𝒢₁ consist of all functions g ∈ 𝒢 that satisfy the normalization lim g(t) = 0 as t → −∞ and lim g(t) = 1 as t → ∞. Similarly, let 𝒞₁ denote the family of all ϕ ∈ 𝒞 such that ϕ(0) = 0 and ϕ′ ∈ 𝒢₁. These classes are convex and so are the associated subclasses of the families 𝒮_α^Q and 𝒮_α^E, which we denote by 𝒮_α^{Q,1} and 𝒮_α^{E,1} respectively. The elementary scores S_θ^{Q,α} and S_θ^{E,α} evidently are members of these restricted families.
Proposition 1.
- (Quantiles): for every α ∈ (0,1) and θ ∈ ℝ, the scoring function S_θ^{Q,α} is an extreme point of the class 𝒮_α^{Q,1}.
- (Expectiles): for every α ∈ (0,1) and θ ∈ ℝ, the scoring function S_θ^{E,α} is an extreme point of the class 𝒮_α^{E,1}.
In the important special case in which the outcome y attains values in {0,1} and the point forecast x ∈ [0,1] represents a probability for the event {y = 1}, the mixture representation for the mean functional reduces to

S(x,y) = ∫_{(0,1)} S_θ(x,y) dH(θ), (13)

where H is a non‐negative measure on the unit interval and

S_θ(x,y) = (1 − θ) 1{y = 1, x ⩽ θ} + θ 1{y = 0, x > θ}, θ ∈ (0,1), (14)

equals the elementary score S_θ^{E,1/2}, up to a multiplicative factor. The Brier or quadratic score,

S(x,y) = (x − y)², (15)

arises when H is twice the Lebesgue measure on the unit interval. Furthermore, the expected elementary score can be evaluated explicitly, in that

E_F{S_θ(x, Y)} = (1 − θ) p 1{x ⩽ θ} + θ (1 − p) 1{x > θ}, (16)

where p = P(Y = 1). This relationship can facilitate computations, particularly in synthetic settings, as we exemplify in Appendix B.
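For binary events the bookkeeping is simple enough to check directly; the sketch below (our own helper names) confirms that integrating the two‐valued elementary score over the threshold θ ∈ (0,1) yields half the Brier score:

```python
# Elementary score for a probability forecast x of the binary event {y = 1}:
# a missed event costs 1 - theta, a false alarm costs theta.
def elem_binary(theta, x, y):
    if y == 1 and x <= theta:
        return 1.0 - theta
    if y == 0 and x > theta:
        return theta
    return 0.0

def area(x, y, n=20000):
    """Midpoint-rule integral of the elementary score over theta in (0, 1)."""
    h = 1.0 / n
    return sum(elem_binary((i + 0.5) * h, x, y) for i in range(n)) * h

x, y = 0.3, 1
print(round(area(x, y), 4), round(0.5 * (x - y) ** 2, 4))  # 0.245 0.245
```

For y = 1 the integral is (1 − x)²/2 and for y = 0 it is x²/2, which is half the Brier score in either case.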
2.3 Economic interpretation
Our results in the previous section give rise to natural economic interpretations of the extremal scoring functions S_θ^{Q,α} and S_θ^{E,α}, along with the quantile and expectile functionals themselves. In either case, the interpretation relates to a binary betting or investment decision with random outcome y.
For the extremal quantile score S_θ^{Q,α} in expression (10), the pay‐off takes only two possible values, relating to a bet on whether or not the outcome y will exceed the event threshold θ. Specifically, consider the following pay‐off scheme, which is realized in spread betting and in prediction markets (Wolfers and Zitzewitz, 2008), where c > 0 denotes the stake of the bet.
- If Quinn refrains from betting, his pay‐off will be 0, independently of the outcome y.
- If Quinn enters the bet and y ⩽ θ realizes, he loses his wager, (1 − α)c.
- If Quinn enters the bet and y > θ realizes, he is paid the stake c, for a gain of αc.
The extremal score S_θ^{Q,α} emerges naturally from this pay‐off scheme. To demonstrate this, we shift attention from positively oriented pay‐offs to negatively oriented regrets, which we define as the difference between the pay‐off for an oracle and Quinn's pay‐off. Here the term oracle refers to a (hypothetical) omniscient bettor who enters the bet if and only if y > θ realizes, which would yield an ideal pay‐off of αc if y > θ and 0 otherwise. Quinn's regret equals the extremal score S_θ^{Q,α}(x,y), except for an irrelevant multiplicative factor. This is illustrated in the bottom left‐hand matrix in Table 1 and corresponds to the classical, simple cost–loss decision model (Richardson, 2012). In decision theoretic terms, the distinction between pay‐off and regret is inessential, because the difference depends on the outcome y only. In either case, the optimal strategy is to enter the bet if and only if

1 − F(θ) ⩾ 1 − α, (17)

where the exceedance probability 1 − F(θ) is computed from Quinn's predictive CDF F for the future outcome y. (For simplicity, we assume that F is strictly increasing.) In summary, Quinn is willing to accept the bet if the α‐quantile q_α(F) of his predictive distribution is at least as large as the event threshold θ.
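Quinn's decision rule can be sketched as follows, assuming the stylized pay‐offs above, a wager of (1 − α)c against a gain of αc; the numerical values are purely illustrative:

```python
# Expected pay-off of entering the bet, given the predictive probability
# F_theta = F(theta) of the event {y <= theta}: win alpha*c with probability
# 1 - F(theta), lose the wager (1 - alpha)*c with probability F(theta).
def expected_payoff(F_theta, alpha, c=1.0):
    return (1 - F_theta) * alpha * c - F_theta * (1 - alpha) * c

alpha = 0.8
print(expected_payoff(0.7, alpha) > 0)  # True:  F(theta) = 0.7 <= alpha, enter
print(expected_payoff(0.9, alpha) > 0)  # False: F(theta) = 0.9 > alpha, abstain
```

The sign of the expected pay‐off flips exactly at F(θ) = α, which is the quantile criterion (17).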
| | Quantiles | | Expectiles | |
|---|---|---|---|---|
| | y⩽θ | y>θ | y⩽θ | y>θ |
| Monetary pay‐off | | | | |
| x⩽θ | 0 | 0 | 0 | 0 |
| x>θ | −(1−α)c | αc | (1−r̄)(y−θ) | (1−r)(y−θ) |
| Score (regret) | | | | |
| x⩽θ | 0 | αc | 0 | (1−r)(y−θ) |
| x>θ | (1−α)c | 0 | (1−r̄)(θ−y) | 0 |
-
†Monetary pay‐offs are positively oriented, whereas scores are negatively oriented regrets relative to an oracle. For quantiles, the regret equals the extremal score S_θ^{Q,α}(x,y), where c > 0 is the stake, up to a multiplicative factor. For expectiles, the regret is S_θ^{E,α}(x,y), where α = (1−r)/(2−r−r̄), again up to a multiplicative factor.
For the extremal expectile score S_θ^{E,α} in expression (12), the pay‐off is real valued. Specifically, suppose that Eve considers investing a fixed amount θ in a start‐up company, in exchange for an unknown future amount y of the company's profits or losses. The pay‐off structure then is as follows.
- If Eve refrains from the deal, her pay‐off will be 0, independently of the outcome y.
- If Eve invests and y ⩽ θ realizes, her pay‐off is negative, at (1 − r̄)(y − θ). Here, θ − y is the sheer monetary loss, and the factor 1 − r̄ accounts for Eve's reduction in income tax, with r̄ ∈ [0, 1) representing the deduction rate. (In financial terms, the loss acts as a tax shield. The linear functional form that is assumed here is not unrealistic, even though it is simpler than many real world tax schemes, where non‐linearities may arise from tax exemptions, progression, etc.)
- If Eve invests and y > θ realizes, her pay‐off is positive, at (1 − r)(y − θ), where r ∈ [0, 1) denotes the tax rate that applies to her profits.
To interpret the extremal score S_θ^{E,α} in this setting, we again shift attention to regrets relative to an omniscient investor or oracle who enters the deal if and only if y > θ occurs, which would yield the ideal pay‐off (1 − r)(y − θ)_+. As seen in the bottom right‐hand matrix of Table 1, Eve's regret equals the extremal score S_θ^{E,α}(x,y), up to a multiplicative factor. This implies that Eve's optimal decision rule is to enter the deal if and only if the expectile at level

α = (1 − r)/(2 − r − r̄) (18)

of her predictive CDF F exceeds the deal's fixed cost θ. Therefore, expectiles induce optimal decision rules in investment problems with fixed costs and differential tax rates for profits versus losses. The mean arises in the special case when α = 1/2 in expression (18). It corresponds to situations in which losses are fully tax deductible (r̄ = r) and nests situations without taxes (r = r̄ = 0). Tough taxation settings where r > r̄ shift Eve's incentives towards not entering the deal and correspond to expectiles at levels α < 1/2. For example, if losses cannot be deducted at all (r̄ = 0), whereas profits are taxed at a positive rate r, Eve will invest only if the expectile at level α = (1 − r)/(2 − r) < 1/2 of her predictive CDF F exceeds the deal's fixed costs θ. Note that we permit the case θ < 0, which may reflect subsidies or tax credits, say.
The elementary score S_θ for probability forecasts of a binary event in expression (14) is obtained as the further special case that arises when α = 1/2, θ ∈ (0,1) and y ∈ {0,1}. Then |y−θ| ∈ {θ, 1−θ}, so the pay‐offs in the bottom right‐hand matrix of Table 1 attain only two possible values. Hence, θ can be interpreted as a cost–loss ratio. We emphasize that this interpretation is specific to the binary case. In the general setting where y is continuous, θ takes the role of an event threshold, whereas α governs the relative costs of underprediction versus overprediction relative to this threshold.
The above interpretation of expectiles attaches an economic meaning to this class of functionals, which thus far seems to have been missing; for example, Schulze Waltrup et al. (2015), page 434, noted that ‘expectiles lack an intuitive interpretation’. In a notable exception, Bellini and Di Bernardino (2015) offered a succinct financial interpretation of expectiles which is of the same spirit as ours. The foregoing discussion may also bear on the debate about the revision of the Basel protocol for banking regulation, which involves contention about the choice of the functional of in‐house risk distributions that banks are supposed to report to regulators (Embrechts et al., 2014). Recently, expectiles have been proposed as potential candidates, as it has been proved that expectiles at levels α ⩾ 1/2 are the only elicitable law‐invariant coherent risk measures (Ziegel, 2014; Bellini and Bignozzi, 2015; Delbaen et al., 2015). See McNeil et al. (2015) for a recent treatment of these concepts and Fissler et al. (2016) for a discussion of the use of consistent scoring functions in financial regulation.
2.4 Order sensitivity
The extremal scoring functions S_θ^{Q,α} and S_θ^{E,α} are not only consistent for their respective functional; they in fact also enjoy the stronger property of order sensitivity. Generally, a scoring function S is order sensitive for the functional F ↦ T(F) relative to the class ℱ if, for all F ∈ ℱ, all t ∈ T(F) and all x₁, x₂ ∈ ℝ with either t ⩽ x₁ ⩽ x₂ or x₂ ⩽ x₁ ⩽ t,

E_F{S(x₁, Y)} ⩽ E_F{S(x₂, Y)},

so that the expected score cannot decrease as the point forecast moves away from the functional value. The scoring function is strictly order sensitive if, whenever the inequalities between x₁ and x₂ are strict, the inequality between the expected scores is strict as well. As before, we denote the class of the Borel probability measures on ℝ by 𝒫, and we write 𝒫₁ for the subclass of the probability measures with finite first moment.
Proposition 2.
- (Quantiles): for every α ∈ (0,1) and θ ∈ ℝ, the extremal scoring function S_θ^{Q,α} is order sensitive for the α‐quantile functional relative to 𝒫.
- (Expectiles): for every α ∈ (0,1) and θ ∈ ℝ, the extremal scoring function S_θ^{E,α} is order sensitive for the α‐expectile functional relative to 𝒫₁.
Owing to the mixture representations (9) and (11), the order sensitivity of the extremal scoring functions transfers to all regular consistent scoring functions. Strict order sensitivity applies if the functions g and ϕ′ in equations (5) and (7) respectively are strictly increasing. For suitably large classes ℱ of probability distributions, the respective condition is also necessary. Analogous relationships hold in regard to (strict) consistency.
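Order sensitivity can be checked on a toy example (the four‐point distribution below is a hypothetical choice of ours): moving a quantile forecast away from the median never decreases the expected elementary score.

```python
# Expected elementary quantile score under a discrete distribution, as the
# forecast x moves away from the median on one side.
def elem_quantile(theta, x, y, a):
    return ((y < x) - a) * ((theta < x) - (theta < y))

vals = [0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.5, 0.3, 0.1]    # median of this distribution is t = 1
a, theta = 0.5, 1.5

def expected_score(x):
    return sum(p * elem_quantile(theta, x, v, a) for v, p in zip(vals, probs))

scores = [expected_score(x) for x in [1.0, 1.6, 2.5]]   # forecasts moving right
print(scores)  # non-decreasing sequence, starting at the median
```

The sequence is weakly rather than strictly monotone: the elementary score only reacts when the forecast crosses the threshold θ, which is exactly why strict order sensitivity requires mixing over all θ.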
Recent studies of elicitability have revealed that (strict) order sensitivity and (strict) consistency are equivalent in quite general settings (Nau, 1985; Lambert, 2013; Steinwart et al., 2014; Bellini and Bignozzi, 2015). These results rely on continuity conditions on the scoring function and do not readily apply in our framework.
3 Forecast rankings
In this section, we turn to the task of comparing and ranking forecasts. Before applying our mixture representations to this problem, we introduce the prediction space setting of Gneiting and Ranjan (2013) and define notions of forecast dominance.
3.1 Prediction spaces
Consider tuples of the form

(F₁, F₂, Y), (19)

where the predictive distributions F₁ and F₂ for the real‐valued outcome Y utilize information sets 𝒜₁ and 𝒜₂ respectively, with each 𝒜_j being a σ-field on the sample space Ω. In measure theoretic language, the information sets correspond to sub‐σ-fields, and each F_j is a CDF‐valued random quantity measurable with respect to 𝒜_j. The joint distribution of the quantities in expression (19) is encoded by a probability measure ℚ on the sample space Ω. In this setting, a predictive distribution F_j is ideal relative to 𝒜_j if it corresponds to the conditional distribution of the outcome Y under ℚ given 𝒜_j. An extended, more realistic notion of prediction space that allows for serial dependence between forecast–observation tuples has recently been introduced by Strähl and Ziegel (2015), along with far‐reaching generalizations of the concepts of ideality and calibration.
In a nutshell, a prediction space specifies the joint distribution of tuples of the form (19). To give an example, Table 2 revisits a scenario studied by Gneiting et al. (2007) and Gneiting and Ranjan (2013). The only difference is that we let the random variable τ attain the values −2 and 2, rather than the values −1 and 1. Here, the outcome is generated as Y = μ + ε, where μ and ε are independent standard normal random variables. The perfect forecaster is ideal relative to the σ-field that is generated by the random variable μ. The unfocused and sign‐reversed forecasters also have knowledge of μ but fail to be ideal. The climatological forecaster, issuing the unconditional distribution of the outcome Y as predictive distribution, is ideal relative to the uninformative σ-field that is generated by the empty set.
| Forecaster | Predictive distribution | α‐quantile | Mean | Prob(Y>y) |
|---|---|---|---|---|
| Perfect | N(μ, 1) | μ + z_α | μ | 1−Φ(y−μ) |
| Climatological | N(0, 2) | √2 z_α | 0 | 1−Φ(y/√2) |
| Unfocused | (1/2){N(μ, 1) + N(μ+τ, 1)} | q_α^{μ,τ} | μ+τ/2 | 1−(1/2){Φ(y−μ)+Φ(y−μ−τ)} |
| Sign reversed | N(−μ, 1) | −μ + z_α | −μ | 1−Φ(y+μ) |
-
†The outcome is generated as Y = μ + ε, where μ and ε are independent standard normal random variables. The random variable τ attains the values −2 and 2 with probability 1/2 each, independently of μ and Y. For α ∈ (0,1) and τ ∈ {−2,2}, we let z_α = Φ^{−1}(α) and write q_α^{μ,τ} for the α‐quantile of the mixture distribution (1/2){N(μ, 1) + N(μ+τ, 1)}, where Φ denotes the CDF of the standard normal distribution.
Any predictive distribution F can be reduced to a point forecast by extracting the sought functional T(F). In what follows, we focus on quantiles, the mean or expectation functional and probability forecasts of the binary event that the outcome exceeds a threshold value. The respective point forecasts for the perfect, climatological, unfocused and sign‐reversed forecaster are shown in Table 2.
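The point forecasts of Table 2 can be compared in simulation; the sketch below draws from the scenario and evaluates squared error, a consistent scoring function for the mean (sample size and seed are arbitrary choices of ours):

```python
# Simulate the Table 2 scenario and score the mean forecasts mu (perfect),
# 0 (climatological) and -mu (sign reversed) under squared error.
import random
random.seed(1)

n = 20000
sq = {"perfect": 0.0, "climatological": 0.0, "sign_reversed": 0.0}
for _ in range(n):
    mu = random.gauss(0.0, 1.0)
    y = mu + random.gauss(0.0, 1.0)        # outcome Y = mu + eps
    sq["perfect"] += (mu - y) ** 2
    sq["climatological"] += (0.0 - y) ** 2
    sq["sign_reversed"] += (-mu - y) ** 2
for k in sq:
    sq[k] /= n
print(sq)  # approximately 1.0, 2.0 and 5.0 respectively
```

The population values are E(ε²) = 1, E(Y²) = 2 and E{(2μ + ε)²} = 5, so the simulation reproduces the expected ranking: perfect before climatological before sign reversed.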
Similarly, we consider a point prediction space, where the elements of the sample space Ω can be identified with tuples of the form

(x₁, x₂, Y), (20)

in which x₁ and x₂ represent point forecasts and utilize information sets 𝒜₁ and 𝒜₂ respectively. For simplicity, we let x₁ and x₂ be single valued. Extensions to set‐valued random quantities, as might occur in the case of quantiles, are straightforward. The joint distribution of the point forecasts and the observation in expression (20) is specified by the probability measure ℚ. Similarly, it is sometimes useful to consider a mixed prediction space, by specifying the joint distribution ℚ of tuples of the form

(F₁, …, F_k, x₁, …, x_m, Y), (21)

where F₁, …, F_k represent CDF‐valued random quantities, and x₁, …, x_m represent point forecasts.
3.2 Notions of forecast dominance
In the case of probabilistic forecasts, the role of the scoring function is played by a scoring rule S(F, y), which is a suitably measurable function that assigns a loss or penalty when we issue the predictive distribution F and y realizes. A scoring rule S is proper relative to the class ℱ if

E_G{S(G, Y)} ⩽ E_G{S(F, Y)} (22)

for all F, G ∈ ℱ. Any scoring function S that is consistent for a functional T induces a proper scoring rule, by defining S(F, y) = S(t, y) for F ∈ ℱ and any fixed t ∈ T(F).
Definition 1.(predictive CDFs). Let F₁ and F₂ be probabilistic forecasts, and let Y be the outcome, in a prediction space. Then F₁ dominates F₂ relative to a class 𝒮 of proper scoring rules if E{S(F₁, Y)} ⩽ E{S(F₂, Y)} for every S ∈ 𝒮.
We now turn to quantiles and expectiles and the respective families 𝒮_α^Q and 𝒮_α^E of the regular consistent scoring functions for these functionals.
Definition 2.
- (Quantiles): let x₁ and x₂ be point forecasts, and let Y be the outcome, in a point prediction space. Then x₁ dominates x₂ as an α‐quantile forecast if E{S(x₁, Y)} ⩽ E{S(x₂, Y)} for every scoring function S ∈ 𝒮_α^Q.
- (Expectiles): let x₁ and x₂ be point forecasts, and let Y be the outcome, in a point prediction space. Then x₁ dominates x₂ as an α‐expectile forecast if E{S(x₁, Y)} ⩽ E{S(x₂, Y)} for every scoring function S ∈ 𝒮_α^E.
It is important to note that the expectations in the definitions are taken with respect to the joint distribution of the forecasts and the outcome. To cover time series settings, as commonly encountered in practice, the definitions can be applied to the more general prediction space setting for serial dependence that was introduced and studied by Strähl and Ziegel (2015). The dominance notions provide partial orderings for the predictive distributions F₁ and F₂ in expression (19) and the point forecasts x₁ and x₂ in expression (20). In the special case of probability forecasts of a binary event, related notions of sufficiency and dominance have been studied by DeGroot and Fienberg (1983), Vardeman and Meeden (1983), Schervish (1989), Feuerverger and Rahman (1992), Krämer (2005) and Bröcker (2009). Essentially, a probabilistic forecast that dominates another is preferable, or at least not inferior, in any type of decision that involves the respective predictive distributions. (To see this, note that any utility function induces a proper scoring rule via the Bayes act. Details of the construction are given in section 3 of Dawid (2007) and section 2.2 of Gneiting and Raftery (2007).) In the case of quantiles or expectiles, a point forecast that dominates another is preferable, or at least not inferior, in any type of decision problem that depends on the respective predictive distributions via the considered functional only. Adaptations to functionals other than quantiles or expectiles are straightforward.
Under which conditions does a forecast dominate another? Holzmann and Eulert (2014) recently showed that, if two predictive distributions are ideal, then the one with the richer information set dominates the other. Furthermore, the result carries over to ideal forecasters’ induced point predictions, including but not limited to the cases of quantiles and expectiles that we consider here. To give an example in the setting of Table 2, the perfect and the climatological forecasters are ideal relative to the σ‐fields that are generated by μ and generated by the empty set respectively. Therefore, the perfect forecaster dominates the climatological forecaster, in any of the above senses.
Tsyplakov (2014) went on to show that, if a predictive distribution is ideal relative to a certain information set, then it dominates any predictive distribution measurable with respect to the information set. Again, the result carries over to the induced point forecasts. In the setting of Table 2, the perfect forecaster is ideal relative to the σ‐field generated by the random variables μ and τ. The climatological, unfocused and sign‐reversed forecasters are measurable with respect to this σ‐field, and so they are dominated by the perfect forecaster, in any of the above senses.
In the practice of forecasting, predictive distributions are hardly ever ideal, and information sets may not be nested, as emphasized by Patton (2015). Therefore, the above theoretical results are not readily applicable, and distinct scoring rules, or distinct consistent scoring functions, may yield distinct forecast rankings, as in empirical examples given by Schervish (1989), Merkle and Steyvers (2013) and Patton (2015), among others. Furthermore, in general it is not feasible to check the validity of the expectation inequalities in the definitions for every proper scoring rule S ∈ 𝒮, or every consistent scoring function S ∈ 𝒮_α^Q or S ∈ 𝒮_α^E.
Fortunately, in the case of quantile and expectile forecasts, the mixture representations in theorem 1 reduce checks for dominance to the respective one‐dimensional families of elementary scoring functions.
Corollary 1.
- (Quantiles): in a point prediction space, x₁ dominates x₂ as an α‐quantile forecast if E{S_θ^{Q,α}(x₁, Y)} ⩽ E{S_θ^{Q,α}(x₂, Y)} for every θ ∈ ℝ.
- (Expectiles): in a point prediction space, x₁ dominates x₂ as an α‐expectile forecast if E{S_θ^{E,α}(x₁, Y)} ⩽ E{S_θ^{E,α}(x₂, Y)} for every θ ∈ ℝ.
To give an example of how corollary 1 can be applied, suppose that the outcome Y has conditional distribution F given a σ-field 𝒜, and let t denote an α‐quantile of F. Suppose furthermore that the point forecasts x₁ and x₂ are measurable with respect to 𝒜. By corollary 1, in concert with proposition 2 and a conditioning argument, x₁ dominates x₂ as an α‐quantile forecast if with probability 1 either

t ⩽ x₁ ⩽ x₂ or x₂ ⩽ x₁ ⩽ t.

In the scenario of Table 2, the argument can be put to work in the case α = 1/2 that corresponds to median and mean forecasts. Specifically, let 𝒜 be the σ-field that is generated by μ, so that the conditional median and mean of the outcome Y equal μ, and let x₁ = 0 and x₂ = −μ be the forecasts of the climatological and the sign‐reversed forecasters respectively. Invoking the order sensitivity argument, we see that the climatological forecaster dominates the sign‐reversed forecaster for both median and mean predictions.
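A Monte-Carlo version of this dominance check, using the elementary score for the mean on a grid of thresholds (grid, sample size, seed and noise tolerance are our own choices):

```python
# Check that the climatological mean forecast (0) is nowhere worse than the
# sign-reversed forecast (-mu), threshold by threshold, up to sampling noise.
import random
random.seed(7)

def elem_mean(theta, x, y):
    # elementary score for the mean (alpha = 1/2), up to a factor
    return abs(y - theta) if min(x, y) <= theta < max(x, y) else 0.0

n = 20000
draws = []
for _ in range(n):
    mu = random.gauss(0.0, 1.0)
    y = mu + random.gauss(0.0, 1.0)
    draws.append((mu, y))

thetas = [-2 + 0.5 * i for i in range(9)]   # grid over [-2, 2]
worse = 0
for th in thetas:
    clim = sum(elem_mean(th, 0.0, y) for _, y in draws) / n
    rev = sum(elem_mean(th, -mu, y) for mu, y in draws) / n
    if clim > rev + 0.01:                   # allow for simulation noise
        worse += 1
print(worse)  # 0: the climatological forecast is nowhere worse
```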
3.3 The Murphy diagram as a diagnostic tool
Let X₁, …, X_l denote point forecasts for the outcome Y, and let the probability measure ℚ represent their joint distribution. In the case of probability forecasts, we use the more suggestive notation p₁, …, p_l for the forecasts.
- For quantile forecasts at level α ∈ (0,1), we plot the graph of the expected elementary quantile score

θ ↦ E{S^Q_θ(X_j, Y)}, θ ∈ ℝ, (23)

for j=1,…,l. By corollary 1, part (a), forecast X_j dominates forecast X_k if and only if the curve (23) for X_j lies on or below that for X_k at every θ ∈ ℝ. The area under the curve (23) equals the expected asymmetric piecewise linear score (6).
- For expectile forecasts at level α ∈ (0,1), we plot the graph of the expected elementary expectile score

θ ↦ E{S^E_θ(X_j, Y)}, θ ∈ ℝ, (24)

for j=1,…,l. By corollary 1, part (b), forecast X_j dominates forecast X_k if and only if the curve (24) for X_j lies on or below that for X_k at every θ ∈ ℝ. The area under the curve (24) equals half the expected asymmetric squared error (8).
- For probability forecasts of a binary event, we plot the graph of the expected elementary score

θ ↦ E{S_θ(p_j, Y)}, θ ∈ (0,1), (25)

for j=1,…,l. By corollary 1, part (b), the probability forecast p_j dominates p_k if and only if the curve (25) for p_j lies on or below that for p_k at every θ ∈ (0,1). The area under the curve (25) equals half the expected Brier score (15).
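As a numerical sanity check of the last statement, the elementary score for probability forecasts charges θ for a false positive (p>θ and y=0) and 1−θ for a false negative (p⩽θ and y=1); integrating over θ ∈ (0,1) should then recover half the Brier score, (p−y)²/2. A minimal sketch under this assumed form:

```python
def elementary_prob_score(p, y, theta):
    # assumed cost form: theta for a false positive, 1 - theta for a false negative
    if y == 0 and p > theta:
        return theta
    if y == 1 and p <= theta:
        return 1.0 - theta
    return 0.0

def integrated_score(p, y, m=20000):
    # midpoint rule over theta in (0, 1); the rule is exact on every cell except
    # the single cell containing the discontinuity at theta = p
    return sum(elementary_prob_score(p, y, (k + 0.5) / m) for k in range(m)) / m
```

The integral agrees with (p−y)²/2 up to the discretization error of order 1/m.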
In the context of probability forecasts for binary weather events, displays of this type have a rich tradition that can be traced to Thompson and Brier (1955) and Murphy (1977). More recent examples include Schervish (1989), Feuerverger and Rahman (1992), Richardson (2000), Wilks (2001), Mylne (2002) and Berrocal et al. (2010), among many others. Murphy (1977) distinguished three kinds of diagram that reflect the economic decisions involved. The negatively oriented expense diagram shows the mean raw loss or expense of a given forecast scheme; the positively oriented value diagram takes the unconditional or climatological forecast as reference and plots the difference in expense between this reference forecast and the forecast at hand and, lastly, the relative value diagram plots the ratio of the utility of a given forecast and the utility of an oracle forecast. The displays that were introduced above are similar to the value diagrams of Murphy, and we refer to them as Murphy diagrams. Our Murphy diagrams are by default negatively oriented and plot the expected elementary score for competing quantile, expectile and probability forecasters. For better visual appearance, we generally connect the left‐ and right‐hand limits at the jump points of the empirical score curves.
Fig. 2 shows Murphy diagrams for the perfect, climatological, unfocused and sign‐reversed forecasters in Table 2. We compare point predictions for the mean or expectation functional, and the quantile at level α=0.90, along with probability forecasts for the binary event that the outcome exceeds the threshold value 2. Analytic expressions for the expected scores are given in Appendix B. As proved in the previous section, the perfect forecaster dominates the other forecasters for all functionals considered. The expected score curves for the climatological and the unfocused, and for the unfocused and the sign‐reversed forecasters, intersect in all three cases, so there are no order relationships between these forecasters. Finally, the Murphy diagrams suggest that the climatological forecaster dominates the sign‐reversed forecaster for all three functionals and, in the case of the mean functional, the order sensitivity argument in the previous section confirms the visual impression. In the cases of the quantile and probability forecasts, final confirmation would need to be based on tedious analytic investigations of the asymptotic behaviour of the expected score functions.

Fig. 2. Murphy diagrams for the perfect, climatological, unfocused and sign‐reversed forecasters in the setting of Table 2
When comparing two forecasters j and k, it is instructive to plot the difference of the respective expected elementary score curves,

θ ↦ E{S_θ(X_j, Y)} − E{S_θ(X_k, Y)}. (26)

3.4 Murphy diagrams for empirical forecasters
Consider now empirical data of the form

{(x_{i1}, …, x_{il}, y_i): i = 1, …, n}, (27)

where x_{1j},…,x_{nj} are the jth forecaster's point predictions, for j=1,…,l, and y_1,…,y_n are the respective outcomes. Thus, we have l competing forecasters, and each of them issues a set of n point predictions. A convenient interpretation of the empirical setting is as a special case of a point prediction space, in which the tuples (X₁, …, X_l, Y) in expression (20) attain each of the values in expression (27) with probability 1/n. Then the probability measure ℚ is the corresponding empirical measure and, with this identification, the (average) empirical scores

s_j(θ) = (1/n) Σ_{i=1}^n S_θ(x_{ij}, y_i), j = 1, …, l,

where S_θ is either the elementary quantile score, the elementary expectile score or the elementary score for probability forecasts, become the expected elementary scores from expressions (23), (24) and (25) respectively. Accordingly, we say that forecaster j empirically dominates forecaster k if s_j(θ) ⩽ s_k(θ) for all θ. When comparing the two forecasters j and k, it is convenient to show a Murphy plot of the equivalent of the difference (26), namely the empirical score differential

d(θ) = s_j(θ) − s_k(θ), (28)

where again S_θ is any of the three elementary scores.
Murphy diagrams can be used efficiently to show a lack of domination when forecasters’ expected elementary score curves intersect. However, in general it is not possible to conclude domination, unless the visual impression is supported by tedious analytic investigations of the behaviour of the expected score functions as θ→±∞. Fortunately, these complications do not arise in the empirical case, where dominance can be established by comparing the empirical score functions at a well‐defined finite set of arguments only, as follows.
Corollary 2.
- (Quantiles): the forecast x_{·j} empirically dominates the forecast x_{·k} for α‐quantile predictions if the respective empirical elementary scores satisfy s_j(θ) ⩽ s_k(θ) for every θ in the finite set {x_{1j},…,x_{nj}} ∪ {x_{1k},…,x_{nk}} ∪ {y_1,…,y_n}.
- (Expectiles): the forecast x_{·j} empirically dominates the forecast x_{·k} for α‐expectile predictions if s_j(θ) ⩽ s_k(θ) for every θ in this set and in the left‐hand limit as θ approaches each of its elements. In the case α = 1/2, evaluations at the outcomes y_1,…,y_n can be omitted.
To see why these results hold, note that in either case the score differential θ ↦ d(θ) in expression (28) is right continuous, and that it vanishes unless min(x, y) ⩽ θ < max(x, y) for one of the forecast–outcome pairs involved. Furthermore, in the case of quantiles the differential is piecewise constant, with no other jump points than the forecast values x_{1j},…,x_{nj}, x_{1k},…,x_{nk} or the outcomes y_1,…,y_n. Similarly, in the case of expectiles the differential is piecewise linear, with no other jump points than the forecast values, and no other change of slope than at the outcomes. The change of slope disappears when α = 1/2. Fig. 3 illustrates the behaviour of the score differential in the cases of the median and the mean.
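Corollary 2, part (a), can be sketched directly: the empirical score difference for quantile forecasts is piecewise constant and right continuous, so it suffices to evaluate it at the forecast and outcome values. The score form and the data below are illustrative assumptions.

```python
def quantile_score(x, y, theta, alpha):
    # assumed form of the elementary quantile score
    return ((1.0 if y < x else 0.0) - alpha) * (
        (1.0 if theta < x else 0.0) - (1.0 if theta < y else 0.0))

def mean_diff(xs1, xs2, ys, theta, alpha):
    # empirical score differential, as in expression (28)
    return sum(quantile_score(a, y, theta, alpha) - quantile_score(b, y, theta, alpha)
               for a, b, y in zip(xs1, xs2, ys)) / len(ys)

def empirically_dominates(xs1, xs2, ys, alpha):
    # corollary 2, part (a): evaluating at the forecast and outcome values suffices,
    # since the differential is piecewise constant with jumps only at these knots
    knots = sorted(set(xs1) | set(xs2) | set(ys))
    return all(mean_diff(xs1, xs2, ys, t, alpha) <= 0 for t in knots)
```

A dense grid of thresholds yields the same verdict, at far greater cost.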

Fig. 3. Empirical score differential in expression (28) for the median and the mean functionals

Fig. 4 refers to the probability forecast data of Merkle and Steyvers (2013), where p_{ij} is forecaster j's stated probability for world event i to materialize, and y_i is the respective binary realization. By corollary 2, part (b), dominance relationships can be inferred by evaluating the empirical score differentials
at the forecasters’ stated probabilities. We note that ID 3 empirically dominates ID 6 and ID 8, and that ID 5 empirically dominates ID 10. The remaining pairwise comparisons do not give rise to dominance relationships. The induced partial order between the IDs applies to comparisons under any proper scoring rule, as reflected by the rankings in Table 1 of Merkle and Steyvers (2013). Fig. 4(b) considers joint comparisons. We see that ID 3 attains the lowest score over a wide range of θ. However, ID 2, ID 5, ID 7 and ID 9 show the unique best empirical score for other values of θ and, therefore, have superior economic utility under the associated cost–loss ratios.

Fig. 4. (a) Pairwise dominance comparisons among the forecasters (ID 3, ID 6, ID 8, ID 5, ID 10 and others) and (b) best forecast ID(s) as a function of θ (unique best score or shared best score); for example, ID 9 attains the unique best score for θ ∈ [0.02,0.04) and ID 10 attains the shared best score for θ ∈ [0.91,1)
It seems desirable to complement a Murphy diagram by formal tests of forecast dominance in the underlying population, with possible null hypothesis H₀ that forecast j dominates forecast k. Intuitively, large positive values of the mean score difference d(θ) in expression (28) speak against H₀. Tests thus could be based on any functional T defined on the paths of the stochastic process {d(θ): θ ∈ ℝ} that is monotone, in that d₁(θ) ⩽ d₂(θ) for all θ implies that T(d₁) ⩽ T(d₂), where d₁ and d₂ are functions of θ. For example, one might choose T(d) = sup_θ d(θ) or T(d) = ∫ max{d(θ), 0} dθ and reject H₀ if T(d) exceeds some critical value, which in general is difficult to determine. One possibility is to utilize randomization, drawing on the idea that in a one‐sided test it should suffice to control the error of the first kind at the boundary of the null hypothesis, where there is no difference in the predictive performance, so that an exchange of labels would not change the distribution of the test statistic. The critical value can then be determined from the distribution of T(d*), where d*(θ) = (1/n) Σ_{i=1}^n ε_i {S_θ(x_{ij}, y_i) − S_θ(x_{ik}, y_i)} with independent random signs ε₁,…,ε_n. Although permutation tests of this type have intuitive appeal and are easy to implement, there remain conceptual issues, concerning, for example, the specifics of the null hypothesis being tested. In any case, tests for forecast dominance based on extremal scoring functions, perhaps in concert with the bootstrap or the Westfall–Young method (Westfall and Young, 1993; Cox and Lee, 2008), deserve further investigation.
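One way the sign‐flip randomization idea might be implemented, with T chosen as the supremum of the mean score difference over a θ grid; the statistic, the number of draws and the data layout are illustrative choices, not prescriptions from the paper:

```python
import random

def sup_statistic(diffs, thetas, signs=None):
    # diffs[i][k] is the elementary score difference for case i at threshold thetas[k];
    # the test statistic is T(d) = sup over the grid of the mean difference
    n = len(diffs)
    means = []
    for k in range(len(thetas)):
        s = sum((signs[i] if signs else 1) * diffs[i][k] for i in range(n)) / n
        means.append(s)
    return max(means)

def randomization_pvalue(diffs, thetas, n_draws=199, seed=7):
    # approximate the null distribution of T by flipping the signs of the
    # per-case score differences, as described in the text
    rng = random.Random(seed)
    n = len(diffs)
    observed = sup_statistic(diffs, thetas)
    hits = 0
    for _ in range(n_draws):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        if sup_statistic(diffs, thetas, signs) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_draws)
```

A small p‐value indicates that the observed supremum is too large to be explained by a performance tie.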
4 Empirical examples
We now demonstrate the use of Murphy diagrams in economic and meteorological case‐studies in time series settings. In each example, interest is in a comparison of two forecasts, and so we show Murphy diagrams for the empirical scores and their difference. The jagged visual appearance stems from the behaviour of the empirical score functions just explained and depends on the number n of forecast cases. We supplement the Murphy diagram for a difference by confidence intervals based on Diebold and Mariano (1995) tests with a heteroscedasticity and auto‐correlation robust variance estimator (Newey and West, 1987). The approach of Diebold and Mariano (1995) views empirical data of the form (27) as a sample from an underlying population and tests the hypothesis of equal expected scores. The confidence bands are pointwise and have a nominal level of 95%.
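A bare‐bones version of the Diebold–Mariano comparison at a fixed θ might look as follows; the Bartlett‐weighted Newey–West estimator and the lag truncation are standard choices, although implementations differ in small‐sample corrections:

```python
import math

def newey_west_variance(d, lags):
    # HAC variance of the sample mean of d, with Bartlett kernel weights
    n = len(d)
    mean = sum(d) / n
    e = [x - mean for x in d]
    total = sum(v * v for v in e) / n  # lag-0 autocovariance
    for k in range(1, lags + 1):
        gk = sum(e[t] * e[t - k] for t in range(k, n)) / n
        total += 2.0 * (1.0 - k / (lags + 1.0)) * gk
    return total / n

def dm_statistic(s1, s2, lags=2):
    # Diebold-Mariano statistic for equal expected scores at a fixed theta;
    # s1 and s2 are the per-case elementary scores of the two forecasters
    d = [a - b for a, b in zip(s1, s2)]
    mean = sum(d) / len(d)
    return mean / math.sqrt(newey_west_variance(d, lags))
```

The statistic is compared with standard normal critical values; the pointwise confidence bands in the figures arise from inverting this test at every θ.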
4.1 Mean forecasts of inflation
In macroeconomics, subjective expert forecasts often compare favourably with statistical forecasting approaches; see Faust and Wright (2013) for evidence and discussion. For the USA, the Survey of Professional Forecasters (SPF) run by the Federal Reserve Bank of Philadelphia is a key source of data; see, for example, Engelberg et al. (2009). Patton (2015) used SPF data to illustrate the use of various scoring functions that are consistent for the mean functional.
Motivated by Patton's analysis, we analyse quarterly SPF mean forecasts for the annual inflation rate of the consumer price index over the next 12 months in the USA. We compare the SPF forecasts with forecasts from another survey, the Michigan Survey of Consumers, based on data from the third quarter of 1982 to the third quarter of 2014, for a test period of 129 quarters. Our implementation choices are as in section 5 of Patton (2015), except that we update the data set to cover the observations for the second and third quarters in 2014, and that we use the slightly newer fourth quarter of 2014 vintage for the consumer price index realizations. Fig. 5(a) shows the forecasts along with the realizing values.

Fig. 5. Point forecasts and realizations in the three case‐studies: (a) mean forecasts of inflation (Patton (2015); n=129; SPF; Michigan survey; actual); (b) probability of recession (Rudebusch and Williams (2009); n=186; SPF; probit; actual recessions); (c) 90% quantile of wind speed (Gneiting et al. (2006); n=5136; regime switching space–time; auto‐regressive; actual), restricted to a subperiod in the summer of 2003
The Murphy diagrams are shown in Figs 6(a) and 6(d). In Fig. 6(a), the curves for the empirical elementary scores of the SPF and the Michigan survey intersect prominently, suggesting that neither of the two surveys empirically dominates the other. In Fig. 6(d), the confidence intervals for the score differences are fairly broad and include zero for all values of θ. Note that the SPF is preferred for smaller values, whereas the Michigan forecast is preferred for larger values of θ. To interpret these results, consider the event threshold θ=6. A forecast x attains a non‐zero extremal score S_θ(x, y) in expression (7) if x ⩽ θ < y or y ⩽ θ < x. Fig. 5(a) and the more detailed display in Fig. 7 identify five quarters when the SPF incurs a non‐zero penalty, compared with two quarters only for the Michigan survey. Interestingly, the threshold θ=6 has become less relevant over time, in that forecasts and realizations have remained below 6% from 1991 onwards.


Fig. 7. Quarters in which the forecasts attain a non‐zero extremal score at the threshold θ=6: non‐zero extremal score for both the SPF and the Michigan forecast; non‐zero score for the SPF forecast only; realization
4.2 Probability forecasts of recession
We now relate to the rich literature on binary regression and prediction and analyse probability forecasts of US recessions, as proxied by negative real gross domestic product growth. The SPF covers probability forecasts for this event since the fourth quarter of 1968. Following Rudebusch and Williams (2009), we compare current quarter probability forecasts from the SPF with forecasts from a probit model based on the term spread, i.e. the difference between long‐ and short‐term interest rates. We follow Rudebusch and Williams (2009) in all choices of data and implementation, except that we update their sample through the second quarter of 2014, for a test period of 186 quarters. Detailed economic and/or statistical justification for these choices can be found in Rudebusch and Williams (2009).
Fig. 5(b) shows the SPF and probit‐model‐based probability forecasts for a recession, with the grey vertical bars indicating actual recessions. During recessionary periods, the SPF tends to assign higher forecast probabilities than the probit model. Also, the SPF tends to assign lower forecast probabilities during non‐recessionary periods. The Murphy diagrams in Figs 6(b) and 6(e) show that the SPF attains lower empirical elementary scores
at all thresholds θ ∈ (0,1). The confidence intervals for the score differences exclude zero for small values of the cost–loss ratio θ and confirm the superiority of the SPF over the probit model for current quarter forecasts. This can partly be attributed to the fact that SPF panelists have access to timely within‐quarter information that is not available to the probit model. As demonstrated by Rudebusch and Williams (2009), the relative performance of the probit model improves at longer forecast horizons, where within‐quarter information plays a lesser role.
4.3 Quantile forecasts for wind speed
We return to the meteorological example in Fig. 1, but instead of the mean or expectation functional we now consider quantile forecasts at level α=0.90. We compare the regime switching space–time (RST) approach that was introduced by Gneiting et al. (2006) with a simple auto‐regressive (AR) benchmark for 2‐h‐ahead forecasts of hourly average wind speed at the Stateline wind energy centre in the Pacific Northwest of the USA. Gneiting et al. (2006) referred to the specifications considered here as RST‐D‐CH and AR‐D‐CH. This terminology indicates that the methods account for the diurnal cycle and conditional heteroscedasticity. The data set, evaluation period, estimation and forecast methods for this example are identical to those in Gneiting et al. (2006), and we refer to Gneiting et al. (2006) for detailed descriptions. Both methods yield predictive distributions, from which we extract the quantile forecasts. The evaluation period ranges from May 1st to November 30th, 2003, for a total of 5136 hourly forecast cases.
Fig. 5(c) shows the quantile forecasts and realizations. The quantile forecasts exceed the outcomes at about the nominal level, at 89.7% for the RST forecast and 90.9% for the AR forecast, indicating good calibration. However, the RST forecasts are sharper, in that the average forecast value over the evaluation period is 9.2 m s⁻¹, compared with 9.7 m s⁻¹ in the case of the AR forecast. To see why the sharpness interpretation applies here, note that wind speed is a non‐negative quantity, so the lower prediction interval at level α ∈ (0,1) ranges from 0 to the α‐quantile, whence smaller quantiles translate into shorter, more informative prediction intervals and sharper predictive distributions. These observations suggest the superiority of the RST forecasts over the benchmark AR forecasts, and the Murphy diagrams for the empirical elementary scores
in Figs 6(c) and 6(f) confirm this intuition, in line with what we saw in Fig. 1 for the mean functional.
5 Discussion
We have studied mixture representations of Choquet type for the scoring functions that are consistent for quantiles and expectiles, including the case of the mean or expectation functional, and nesting probability forecasts for binary events as a further special case. A particularly interesting aspect of these results is that they allow an economic interpretation of consistent scoring functions in terms of betting and investment problems. Our interpretation of expectiles in the context of investment problems with fixed costs and differential tax rates appears to be original and may bear on the current debate about the revision of the Basel protocol for banking regulation.
From a general applied perspective, Gneiting (2011), page 757, had argued that, if point forecasts are to be issued and evaluated,
‘it is essential that either the scoring function be specified ex ante, or an elicitable target function be named, such as the mean or a quantile of the predictive distribution, and scoring functions be used that are consistent for the target functional’.
Patton (2015), page 1, took this argument a step further, by positing that
‘rather than merely specifying the target functional, which narrows the set of relevant loss functions only to the class of loss functions consistent for that functional … forecast consumers or survey designers should specify the single specific loss function that will be used to evaluate forecasts’.
This is a very valid point. Whenever forecasters are to be compensated for their efforts in one way or another, the scoring function ought to be disclosed. To give an example of this best practice, the participants of forecast competitions hosted on the Kaggle platform (www.kaggle.com) are routinely informed about the relevant scoring function before the start of the competition. See, for example, Hong et al. (2014) for a description of the global energy forecasting competition 2012.
However, many situations remain in which point forecasters receive directives in the form of a functional, without an accompanying scoring function being available. This might be because the forecasts are utilized by a myriad of communities, a situation that is often faced by national and international weather centres, because costs and losses are unknown or confidential, because the goal is general methodological development, as opposed to a specific applied task, because interest centres on an understanding of forecasters’ behaviours and performance or simply because of negligence of best practices. In such settings, our findings suggest the routine use of new diagnostic tools in the evaluation and ranking of forecasts, which we call Murphy diagrams. Interest sometimes centres on decompositions of expected or empirical scores into uncertainty, resolution, and reliability components, as studied by DeGroot and Fienberg (1983), Bröcker (2009) and Bentzien and Friederichs (2014), among others. Extensions of Murphy diagrams in these directions may be worthwhile.
As discussed in Section 3.2, nested information sets are sufficient for forecast dominance. However, the converse is not true, in that, if a forecaster dominates another, the respective information sets need not be nested. Specifically, if a forecaster has access to a highly informative explanatory variable, but not to a weakly informative variable, then she may dominate a competitor who can access the weakly informative variable only, even though the information sets are not nested. Explicit examples of this type can readily be constructed. From a broader perspective, it would be of interest to study any implications of forecast dominance on information sets.
Our results also bear on estimation problems, in that scoring functions connect naturally to M‐estimation (Huber, 1964; Koltchinskii, 1997). An interesting observation is that the loss functions that have traditionally been employed for estimation in quantile regression, ordinary least squares regression and expectile regression, namely the asymmetric piecewise linear and squared error scoring functions (6) and (8), correspond to the choice of the Lebesgue measure in the mixture representations (9) and (11) respectively. This is in contrast with binary regression, where estimation is typically based on the logarithmic score, which corresponds to the choice of the infinite measure with density 1/{θ(1−θ)} in the mixture representation (13), rather than the Lebesgue or uniform measure that yields (half) the Brier score (15). Quite generally, this raises the question of the optimal choice of the loss or scoring function to be used for estimation in regression problems. Focusing on the binary case, Hand and Vinciotti (2003), Buja et al. (2005), Lieli and Springborn (2013) and Elliott et al. (2015) have considered the use of economically justifiable criteria.
The interpretations that were developed in the present paper can help to design economically or societally relevant criteria in more general settings. For example, the elementary expectile score
in expression (12) depends on x and y via the absolute deviation between the event threshold θ and the observation y only, and therefore might be interpreted in terms of the original unit in any applied problem. Owing to the mixture representation (11), any consistent scoring function can be associated with a weighting of thresholds, as encoded by the mixing measure. The choice of the mixing measure requires careful consideration of the decision problem at hand, and it seems difficult to provide general guidance. As noted, squared error corresponds to Lebesgue measure. In applications, non‐uniform measures with finite mass may provide more realistic descriptions. The weighting becomes irrelevant when there are dominance relationships between competing forecasters, which can be checked via Murphy diagrams. As a caveat, dominance is a strong requirement, and empirical dominance may not be observed very commonly in practice. In such cases, Murphy diagrams can still provide informal clues to critical threshold values θ, which can then be investigated in detail, as illustrated in our inflation example.
Mixture representations of Choquet type can be found for other more general classes of consistent scoring functions. For instance, our results extend to the class of functionals known as generalized quantiles or M‐quantiles (Breckling and Chambers, 1988; Koltchinskii, 1997; Bellini et al., 2014; Steinwart et al., 2014), which subsume both quantiles and expectiles. Related, but more complex, mixture representations apply in the case of scoring functions that are consistent for multi‐dimensional functionals, as recently studied by Fissler and Ziegel (2015).
A natural question is whether analogous mixture representations exist for proper scoring rules. A scoring rule S(F, y) assigns a loss or penalty when we issue the predictive CDF F and y realizes and, for a scoring rule to be proper, the expectation inequalities in expression (22) need to hold. As we have seen, a predictive distribution for a binary variable can be identified with a probability forecast, so representation (13) applies and the answer is well known to be positive in this case. However, an extension from probability forecasts of binary to ternary or general discrete variables does not appear to be feasible, owing to results by Johansen (1974) and Bronshtein (1978) in convex analysis. (In a nutshell, Savage (1971) showed that, in the case of k+1 categories, the proper scoring rules for probability forecasts essentially are parameterized by the convex functions on the unit simplex in ℝ^{k+1}. Johansen (1974) and Bronshtein (1978) proved that if k⩾2 then the extremal members of that class lie dense.) Despite this negative result, a closer look at a popular score is encouraging. Specifically, the widely used continuous ranked probability score (Matheson and Winkler, 1976),

CRPS(F, y) = ∫ {F(θ) − 1(y ⩽ θ)}² dθ,

applies to real‐valued outcomes. For simplicity, let us assume that F has unique quantiles. We may then invoke the mixture representation (13) along with relationships (14) and (16) to yield

CRPS(F, y) = 2 ∫₀¹ ∫ S^Q_{α,θ}(F⁻¹(α), y) dθ dα,

so that the continuous ranked probability score arises as a mixture of the elementary quantile scores across all levels α and thresholds θ.
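The mixture identity can be checked numerically for, say, a standard normal predictive distribution, for which the continuous ranked probability score has a well‐known closed form. The sketch below integrates the asymmetric piecewise linear (pinball) score over all levels α, which is the inner θ integral in closed form; the bisection quantile routine is a dependency‐free convenience.

```python
import math

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_quantile(alpha):
    # invert the CDF by bisection (crude but dependency free)
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pinball(alpha, x, y):
    # asymmetric piecewise linear (quantile) score
    return ((1.0 if y < x else 0.0) - alpha) * (x - y)

def crps_closed_form(y):
    # closed-form CRPS of the standard normal forecast
    return y * (2.0 * Phi(y) - 1.0) + 2.0 * phi(y) - 1.0 / math.sqrt(math.pi)

def crps_from_quantile_scores(y, m=4000):
    # CRPS as twice the integral of the pinball score over all levels alpha,
    # i.e. the Choquet mixture of elementary quantile scores with Lebesgue weight
    total = 0.0
    for k in range(m):
        a = (k + 0.5) / m
        total += pinball(a, norm_quantile(a), y)
    return 2.0 * total / m
```

Both routes agree up to the discretization error of the midpoint rule.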
Acknowledgements
This work has been funded by the European Union Seventh Framework Programme under grant agreement 290976. We thank the Klaus Tschira Foundation for infrastructural support at the Heidelberg Institute for Theoretical Studies, and we are grateful to four referees for constructive comments on an earlier version of the manuscript.
Appendix A: Proofs
The specific structure of the scoring functions in expressions (5) and (7) permits us to focus on the case
in the subsequent proofs, with the general case α ∈ (0,1) then being immediate.
A.1. Proof of theorem 1
, and the relationship H(x)−H(y)=S(x,y)/(1−α) for x>y are straightforward consequences of the fact that, for every
and
,

the Bregman‐type function of two variables
(29)
for
and the relationship
for x>y are immediate consequences of the fact that, for all
and x<y,

A.2. Proof of proposition 1
, where
and
are of the form (5) with associated functions
. Then

we have
if y⩽x, and
if x⩽y, where j=1,2. It follows that
in the first case,
in the second case and
in the third case. This coincides with the value distribution of g(x)−g(y) when
, whence indeed
.
, where
and
are of the form (7) with associated functions
. Let
be defined as in expression (29). Then


, we may apply the same argument as in the quantile case to show that
, whence
.
A.3. Proof of proposition 2
in expression (10) suppose first that
. Since


, and under that condition we have F(θ)⩽α and
, whence the desired expectation inequality. The case
is handled analogously.
in expression (12) we assume first that
, where t denotes the α‐expectile of F. Since


Appendix B: Details for the synthetic example

We provide details for the expected scores E{S(X, Y)} in the scenario of Table 2, where the point forecast X is the α‐quantile or the mean of the CDF‐valued random quantity F. The scoring function S is the elementary quantile scoring function in expression (10) or the elementary scoring function in expression (12). For example, if X is a quantile forecast for Y at level α ∈ (0,1), then the expected elementary score admits a closed‐form expression, (30), and analogous expressions hold for event probabilities, also.
| Forecast | α‐quantile | Mean |
|---|---|---|
| F | | |
| Perfect | | |
| Climatological | | |
| Unfocused | | |
| Sign reversed | | |
†For α ∈ (0,1), the entries are expressed in terms of Φ and φ, which denote the CDF and the probability density function of the standard normal distribution respectively.
References
Discussion on the paper by Ehm, Gneiting, Jordan and Krüger
Christopher A. T. Ferro (University of Exeter)
Knowing how well forecasts perform can guide responses to forecasts and inform attempts to improve forecasts. It helps society, therefore, to have good ways of evaluating forecast performance and this paper is a welcome contribution to the field. The authors introduce elegant characterizations of scoring functions that are consistent for quantiles and expectiles, the main advantage of these characterizations being that they support the use of Murphy diagrams to check whether one forecaster dominates another. Do the authors see useful interpretations of Murphy diagrams beyond checks for dominance? I give some thoughts on this below.
Consider using probability forecasts, p ∈ [0,1], of binary outcomes, y ∈ {0,1}, to decide whether or not to bet on the event {y=1}. Suppose that it costs c to bet and that the return is r if we bet and y=1. Our profit is −c if we bet and y=0, r−c if we bet and y=1, and 0 if we do not bet, whereas our regret is c if we bet and y=0, r−c if we do not bet and y=1, and 0 otherwise. If the cost–return ratio c/r is θ and we use the Bayes rule to decide whether to bet (i.e. bet if and only if p>θ) then S_θ(p, y) is our regret expressed as a proportion of the return, r. This provides an interpretation of the score curves in Fig. 6(b). For example, for most cost–return ratios, the Survey of Professional Forecasters forecasts make on average about 2% of the return less than a perfect forecaster. Similar interpretations hold for profit matrices corresponding to other forms of bet.
The overall score is obtained by integrating the score curve over θ after weighting by a measure, H. If H is a distribution function then we can think of it as a distribution of cost–return ratios associated with a population of bettors and the overall score may be interpreted as the average regret (as a proportion of the return) that is felt by this population (Schervish, 1989). The Brier score, for example, is (twice) the average regret (as a proportion of the return) felt by a population of bettors with a uniform distribution of cost–return ratios. Plotting the score curve against H(θ) instead of θ would make the area under the curve equal the overall score.
For α‐quantiles x, we choose whether or not to bet on the event {y>θ} and so θ is now an event threshold. If the cost and return of the bet are as before then our regret is c if we bet and y≤θ, r−c if we do not bet and y>θ, and 0 otherwise. If the cost–return ratio is 1−α and we use the Bayes rule to decide whether to bet (i.e. bet if and only if x>θ) then S_θ(x, y) is again our regret expressed as a proportion of the return. For example, in Fig. 6(c) the regime switching space–time forecasts make on average about 3% of the return less than a perfect forecaster when the cost–return ratio is 0.1. If H is a distribution function then the overall score is the average regret (as a proportion of the return) that is felt by a population of bettors who all have cost–return ratio 1−α and whose event thresholds θ are distributed according to H. Note that the overall score inherits its units from H, so the score may not be dimensionless for other choices of H.
For α‐expectiles x, again we choose whether or not to bet on the event {y>θ} but now our profit depends on |y−θ|. Suppose that it will cost c|y−θ| to bet and that the return is r(y−θ) if we bet and y>θ. Our regret is c(θ−y) if we bet and y≤θ, (r−c)(y−θ) if we do not bet and y>θ, and 0 otherwise. Here, c and r are the unit cost and unit return of the bet. If the unit cost–return ratio c/r is 1−α and we use the Bayes rule to decide whether to bet (i.e. bet if and only if x>θ) then S_θ(x, y) is our regret expressed relative to the unit return, but measured in the same units as y. For example in Fig. 6(a) when θ=2% the Survey of Professional Forecasters forecasts make on average about 0.1% times the unit return less than a perfect forecaster when the unit cost–return ratio is 0.5. Interpretations of the overall score follow as before.
Finally, as the elementary scores are non‐zero only when the forecast yields a ‘false positive’ result (y⩽θ<x or p>θ and y=0) or a ‘false negative’ result (x⩽θ<y or p⩽θ and y=1) we can distinguish the contributions from these two types of error on the Murphy diagrams. For example, Fig. 8 shows that the relatively poor performance of the probit forecasts is due largely to producing more false negative results than the Survey of Professional Forecasters forecasts.
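Ferro's false positive/false negative decomposition is easy to compute for probability forecasts; the sketch below again assumes the cost form θ for a false positive and 1−θ for a false negative, so that the two contributions add up to the mean elementary score.

```python
def fp_fn_contributions(ps, ys, theta):
    # split the mean elementary score at threshold theta into false positive
    # (p > theta, y = 0) and false negative (p <= theta, y = 1) contributions
    n = len(ys)
    fp = sum(theta for p, y in zip(ps, ys) if y == 0 and p > theta) / n
    fn = sum(1.0 - theta for p, y in zip(ps, ys) if y == 1 and p <= theta) / n
    return fp, fn
```

Plotting fp and fn separately against θ reproduces the kind of display shown in Fig. 8.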

Fig. 8. Empirical elementary score curves (probit; Survey of Professional Forecasters) and the respective contributions from false positive results
I hope that these ideas add to the utility of Murphy diagrams and I gladly propose the vote of thanks.
Philip Dawid (University of Cambridge)
It might have been thought that there is little new to say about point estimation. This impressive paper demonstrates that this is not so. In particular, when a functional of a distribution has the special property of being elicitable, we can bring to bear a whole battery of useful techniques for motivating and evaluating its assessment.
Here I concentrate on the more theoretical aspects of this work. The authors show that, for some special cases of elicitable functional—mean, percentile, expectile—the property of being a consistent scoring function can be fully characterized, and such a scoring function can be uniquely expressed as a mixture of a one‐dimensional family of extremal scoring functions. I shall show how these results may be extended to other cases.
Consider a sample space 𝒴 ⊆ ℝ, and a family 𝒫 (I prefer to use P, rather than the authors' F) of distributions over 𝒴. Suppose given a real‐valued functional T on 𝒫. (I shall use t, rather than x, for a value or estimate of the functional; I also suppose that T is single valued.) Suppose further that there is an identifying function V(t,y) such that
- V(t,y) is non‐decreasing in t and
- t=T(P) satisfies V(t,P)=0,
where V(t,P) denotes E_P{V(t,Y)}. This is the case for the examples that are treated in the paper, as follows: mean, V(t,y)=(t−y); α‐quantile, V(t,y)=1{y⩽t}−α; α‐expectile, V(t,y)=|1{y⩽t}−α|(t−y). More generally, under differentiability conditions, if we have a consistent scoring function S(t,y) for T(P) we can take V(t,y)=∂S/∂t.
For θ ∈ ℝ, define
S_θ(t,y) = {1(t>θ) − 1(y>θ)} V(θ,y).
Proposition 3.S_θ is a consistent and order sensitive scoring function for T.
Proof.Denote the expectation of S_θ(t,Y) under Y∼P by S_θ(t,P). Then S_θ(t,P) = 1(t>θ) V(θ,P) + c(P), where c(P) does not depend on t.
- Consistency: first note that V(θ,P)⩽0 for θ<T(P) and V(θ,P)⩾0 for θ⩾T(P).
- If θ⩾T(P), then S_θ(T(P),P) = c(P) ⩽ 1(t>θ)V(θ,P) + c(P) = S_θ(t,P).
- If θ<T(P), then S_θ(T(P),P) = V(θ,P) + c(P) ⩽ 1(t>θ)V(θ,P) + c(P) = S_θ(t,P). In either case, S_θ(T(P),P) ⩽ S_θ(t,P). So S_θ is consistent.
- Order sensitivity: as a function of t, S_θ(t,P) has a single jump, of size V(θ,P), at t=θ. If θ<T(P), then V(θ,P)⩽0, so S_θ(t,P) is non‐increasing in t on (−∞,T(P)]; hence, for t₁⩽t₂⩽T(P), S_θ(t₂,P)⩽S_θ(t₁,P). If θ⩾T(P), then V(θ,P)⩾0, so S_θ(t,P) is non‐decreasing in t on [T(P),∞); hence, for T(P)⩽t₁⩽t₂, S_θ(t₁,P)⩽S_θ(t₂,P).
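The consistency claim is easy to check numerically. The sketch below assumes the extremal form S_θ(t,y) = {1(t>θ) − 1(y>θ)} V(θ,y) with the identifying functions for the mean and a quantile, and evaluates the expected extremal score on a large sample standing in for P:

```python
import numpy as np

rng = np.random.default_rng(0)
y_sample = rng.normal(1.0, 2.0, size=200_000)  # large sample standing in for P

def V_mean(theta, y):
    # identifying function for the mean: V(t, y) = t - y
    return theta - y

def V_quantile(theta, y, alpha=0.8):
    # identifying function for the alpha-quantile: V(t, y) = 1(y <= t) - alpha
    return (y <= theta).astype(float) - alpha

def expected_extremal_score(t, y, theta, V):
    # assumed form: S_theta(t, y) = {1(t > theta) - 1(y > theta)} V(theta, y),
    # averaged over the sample y standing in for P
    ind_t = 1.0 if t > theta else 0.0
    return np.mean((ind_t - (y > theta)) * V(theta, y))
```

For every threshold θ, the average extremal score evaluated at the sample mean (or sample quantile) should be no larger than at any other report t.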
Corollary 3.Let H be a measure on ℝ, and define
S(t,y) = ∫ S_θ(t,y) dH(θ). (31)
Then S is a consistent and order sensitive scoring function. Expanding the integrand yields
S(t,y) = ∫ from min(t,y) to max(t,y) of |V(θ,y)| dH(θ). (32)
Now formula (32), with the absolutely continuous choice dH(t)=h(t) dt (h≥0), recovers (modulo an unimportant additive function of y) the classes of consistent scoring functions for the cases that are considered in the paper. For the α‐percentile, we obtain formula (5) of the paper on taking h=g′, whereas for the α‐expectile (the mean, when α=1/2), formula (7) of the paper arises on taking 2h=φ″ and integrating by parts. (I note, however, that, if H also has a discrete component, equation (32) yields further consistent scoring functions that are not of the given form.)
On differentiating equation (32) with respect to t we see that H is uniquely determined by S. Reverting to the form (31), we deduce that S is uniquely expressible as a mixture over the set of extremal consistent scoring functions {S_θ: θ ∈ ℝ}.
A remaining question is whether every consistent scoring function for T is equivalent to one of the form (31) or (32). This will be so under appropriate conditions (Osband, 1985; Osband and Reichelstein, 1985). When it is, we can apply the general methods of the paper (Murphy diagrams etc.) to more general elicitable functionals.
I have one final query about the necessity of elicitability. It is easy to show that the variance functional T(P)=var(P) is not elicitable, and the same will hold for any reasonable measure of spread. Does that mean that we should not attempt to measure the spread of a distribution?
It will be apparent that I have been greatly stimulated by this paper. I have much pleasure in seconding the vote of thanks.
The vote of thanks was passed by acclamation.
K. Mitchell and C. A. T. Ferro (University of Exeter)
A notable aspect of the authors’ analysis is the expression of consistent scoring functions for quantiles and expectiles as Choquet representations: weighted combinations of certain extremal scoring functions. Such a characterization is distinct from (though related to) the more familiar Savage characterization of consistent scoring functions (see, for example, Savage (1971), Osband (1985), Gneiting and Raftery (2007), Gneiting (2011) and Frongillo and Kash (2014)).
The authors query whether similar Choquet representations may be found for proper scoring rules. When the observation is a binary variable, such a characterization is well known, having been derived by Shuford et al. (1966) and extended by Schervish (1989) (see also Lindley (1982)). Unfortunately, the authors anticipate a negative result for multivalued observations.
(33)

Obtaining a result similar to equation (33) for probability vectors would, therefore, allow for a Choquet representation of proper scoring rules for multivalued observations. But key to the derivation of equation (33) is that the intervals [(k−1)/N,k/N] admit a total order ([(k−1)/N,k/N]≤[(j−1)/N,j/N] ⇔ k≤j) and that, under this total order, S is order sensitive.
Intervals of [0,1]ⁿ for n>1 admit only a partial order, frustrating attempts to obtain a result such as equation (33) for n>1. This seems further to negate the possibility of a Choquet representation for all proper scoring rules of multivalued observations.
The authors’ final equation, however, neatly exemplifies another approach: a forecast for a multivalued observation is replaced by a sequence of forecasts for a binary observation and the proper scoring rule is represented as an aggregation of the proper scoring rules of the separate binary observations; then, the proper scoring rule for each binary observation is replaced by its Choquet representation. If such a reduction is available in general, it seems that a proper scoring rule for a multivalued observation admits a Choquet representation with respect to a product measure. A formal treatment of this approach would be of much interest.
Frank Critchley (The Open University, Milton Keynes), Paul Marriott (University of Waterloo), Radka Sabolova and Germain Van Bever (The Open University, Milton Keynes)
In welcoming tonight's paper, we would like to indicate potentially fruitful points of contact with information geometry and, in so doing, to highlight questions that arise naturally in this connection. Essentially, this involves working with pairs of distributions, rather than scalars (x,y) (see Section 3), in particular, identifying a realization y with the cumulative distribution function degenerate at that value.
Although applicable more generally, for brevity, we focus our remarks on the mean consistent scoring function S given by equation (2). When ϕ is strictly convex, S has the form of the type of divergence defined by Bregman (1967). Divergence measures such as these are a cornerstone of information geometry and have found wide application in optimization problems, image and signal processing, machine learning and statistical inference. Again, there are important dualities among divergence functions and their corresponding affine parameters, and strong links with convex analysis. For further details on information geometry and divergences see, for example, Kass and Vos (2011), Critchley and Marriott (2015), and the extensive references therein.
- Restricting attention to strictly convex ϕ, are there useful analogues for scoring functions of well‐known duality and convex analysis results for divergences?
- Conversely, do Bregman divergences enjoy mixture representations, analogous to those for scoring functions? If so,
- might the paper be developed to provide guidance in choosing between them, and
- can such extremal mixture representations be used to good effect in, for example, optimization contexts and, more particularly, in solving associated stationary equations?
- The elegant directness of the proof of theorem 1 prompts the natural questions: what is the most general result of this kind? Further, what other instances of it are of potential interest? In particular, just as mixture representations based on expectiles have direct analogues for quantiles, are there analogous representations for non‐Bregman divergences?
In closing, it is a pleasure to thank the authors for their stimulating paper.
Kent Osband (RiskTick, Birmingham)
How can we incentivize diligent, truthful forecasts x of expected outcomes y when we cannot distinguish ex post discrepancies from random errors? Contracts that generate maximal expected pay‐offs for truthful reporting, regardless of the detailed beliefs, are known as consistent scoring rules. Remarkably, for most statistics of practical interest, every convex function ϕ can be mapped to a consistent scoring rule and vice versa. Specifically, the subgradient to ϕ(x) describes the pay‐offs from reporting x, the subgradient's value at y=X marks the expected pay‐off when the best forecast is X and the highest expected pay‐off is ϕ(X).
Although this result has been known for decades, relatively little progress has been made in identifying appropriate ϕ or even systematically studying the trade‐offs. In search of insight, this paper takes a novel tack. After controlling for location and scale, it decomposes each ϕ into a mixture of primitives max(x−θ,0) over the various possible thresholds θ; the mixing weights are effectively dϕ′(θ). For each primitive, the corresponding penalty score works out to 0 when x and y are on the same side of θ and to |y−θ| when they are on opposite sides of θ.
This is a neat way of thinking about scores. It links every dϕ′(θ) to the value of knowing whether the outcome is greater or less than θ. Given these values, it is straightforward to construct ϕ.
Here are a few thoughts on how to value dϕ′(θ). First, note that the expected cost from a small forecast error δ around the true X is approximately proportional to ϕ″(X)δ²/2. If forecaster j has predictive distribution F_j with mean X_j and variance close to the lower bound given by Fisher information, 1/{n_j I(X)}, then the expected cost of forecast imprecision is approximately ϕ″(X)/{2n_j I(X)}. If forecaster j can take what amounts to n_j independent observations at a constant cost per observation, her chosen n_j will equate that marginal cost with the marginal reduction in expected imprecision cost.
Next, imagine that the people soliciting the forecasts have some prior beliefs about x that amalgamate the various forecaster beliefs on a broadly equal basis and, moreover, that they wish to encourage roughly the same diligence n by all forecasters. It then makes sense to equate each dϕ′(θ) to a common multiple of the aggregate Fisher information I(θ).
Debashis Paul (University of California at Davis) and Anandamayee Majumdar (Soochow University, Suzhou)
One can compare prediction procedures based on univariate data Y by comparing expected elementary score functions, with respect to the joint distribution:
(34)
Here S^Q denotes the elementary quantile score and S^E denotes the elementary expectile score, defined in equations (10) and (12) respectively. In expression (34) we treat S^Q and S^E as functions of (α,θ) to emphasize the point that it may be beneficial to categorize prediction methods by their relative performance with respect to one quantile (say, the median) versus another (say, the 95th percentile), with analogous comparisons in terms of expectiles. Determining the range of α over which a particular prediction method is dominant can be important especially when forecasting errors have asymmetric implications, e.g. when the variable of interest is the amount of precipitation, or the concentration of a pollutant.
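Empirical versions of the score surfaces in expression (34) are straightforward to compute. A sketch, with the elementary scores written in a standard form (the exact scaling relative to equations (10) and (12) of the paper is an assumption):

```python
import numpy as np

def elem_quantile(x, y, alpha, theta):
    # elementary alpha-quantile score: (1{y < x} - alpha)(1{theta < x} - 1{theta < y})
    return ((y < x).astype(float) - alpha) * ((theta < x).astype(float) - (theta < y))

def elem_expectile(x, y, alpha, theta):
    # elementary alpha-expectile score: |1{y < x} - alpha| |y - theta| on the
    # event min(x, y) <= theta < max(x, y)
    between = (np.minimum(x, y) <= theta) & (theta < np.maximum(x, y))
    return np.abs((y < x).astype(float) - alpha) * np.abs(y - theta) * between

def score_surface(x, y, alphas, thetas, elem):
    # empirical mean elementary score as a function of (alpha, theta)
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    return np.array([[elem(x, y, a, th).mean() for th in thetas] for a in alphas])
```

Comparing two such surfaces over a grid of (α, θ) shows for which quantile or expectile levels one prediction method is preferable.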
Comparisons in terms of α can also be helpful for narrowing down a class of statistical models designed to capture certain characteristics of the distribution of observations. As a specific example, we consider the class of double‐zero expectile normal processes that were introduced by Majumdar and Paul (2015). Such a process is a stationary process with sub‐Gaussian tail behaviour such that the pth expectile of marginal distributions is 0 for a given p ∈ (0,1). The parameter p controls the degree and direction of asymmetry of the marginal distributions. We formulated a Bayesian inference procedure based on Markov chain Monte Carlo sampling in a spatial regression setting where the residual process is a double‐zero expectile normal process and found that accurate estimation of p is challenging. Comparisons of predictions corresponding to different values of p, based on empirical versions of the expected score functions in expression (34), can facilitate determination of a range of plausible p (or an informative prior for p). This process can be enhanced further through specification of a range of α, depending on whether the ability to predict the central part of the data or the extremes is considered more important.
Roberto Casarin (University Ca’ Foscari, Venice) and Francesco Ravazzolo (Free University of Bozen‐Bolzano)
The authors are to be congratulated on their excellent intuition, which has culminated in the development of quite a general approach to consistent scoring of competing forecasting models. Their approach based on extremal scoring functions is in the spirit of the existing literature and in particular of the seminal papers of Gneiting and Ranjan (2011, 2013). We believe that the proposed consistent scoring functions can be applied to achieve new density calibration and density combination schemes.
Let F₁ and F₂ denote the predictive distributions from two predictive models and F the distribution of Y. Following the notation that was used by the authors, one could consider the map
(35)
with θ a threshold parameter and ξ a combination or calibration parameter vector. If the parameter ξ is indexing a family of distributions obtained by applying non‐decreasing functions g_ξ with g_ξ(0)=0 and g_ξ(1)=1 to a predictive distribution, then we have a calibration scheme. If ξ is indexing the family of mixture distributions ξF₁+(1−ξ)F₂, then we obtain a combination scheme. Optimal calibration and optimal combination can be defined as
(36)
Let B_ξ be the incomplete beta function with parameters ξ=(μϕ,(1−μ)ϕ), and q_α the α‐quantile of the calibrated sign‐reversed model, where the calibration map is B_ξ and ξ(θ) is a solution of expression (36). Then the Murphy diagram of the calibrated model can be computed.
As an example of combination, consider two forecasters with distributions N(−1,1) and N(2,2) respectively, and use the extremal scoring rule to find the optimal combination weight ν. The Murphy diagram of the optimal predictive pooling model is then evaluated at q_α, where q_α is the α‐quantile of ν Φ(q+1)+(1−ν) Φ{(q−2)/√2}.

Fig. 9. Murphy diagrams for the perfect and sign‐reversed forecasters given in Table 2 of the paper, and for the calibrated forecaster obtained by applying a beta calibration function to the sign‐reversed forecaster

Fig. 10. Murphy diagrams for the perfect, biased and combined forecasters
In their paper, the authors sketch some possible extensions. We recommend as a further exciting and stimulating research line the use of consistent scoring functions for model combination and/or model calibration with the aim of improving on Mitchell and Hall (2005), Hall and Mitchell (2007), Geweke and Amisano (2010, 2011), Billio et al. (2013) and Fawcett et al. (2015). We are therefore very pleased to thank the authors for their work.
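The combination scheme above can be sketched in code. For the N(−1,1) and N(2,2) example, the pooled quantile is found by bisection on the mixture cumulative distribution function, and the weight ν is chosen to minimize the average elementary quantile score; this is a grid-search sketch with hypothetical data, not the discussants' exact procedure:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pool_cdf(q, nu):
    # linear pool of the N(-1, 1) and N(2, 2) predictive distributions
    return nu * norm_cdf(q + 1.0) + (1.0 - nu) * norm_cdf((q - 2.0) / sqrt(2.0))

def pool_quantile(alpha, nu, lo=-20.0, hi=20.0, iters=80):
    # alpha-quantile of the pooled distribution by bisection
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if pool_cdf(mid, nu) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def elem_quantile(x, y, alpha, theta):
    # elementary alpha-quantile score
    return ((y < x).astype(float) - alpha) * ((theta < x).astype(float) - (theta < y))

def optimal_weight(y, alpha, theta, grid=np.linspace(0.0, 1.0, 101)):
    # grid search for the weight nu minimising the mean elementary score
    y = np.asarray(y, float)
    scores = [elem_quantile(np.full_like(y, pool_quantile(alpha, nu)), y,
                            alpha, theta).mean()
              for nu in grid]
    return float(grid[int(np.argmin(scores))])
```

Repeating the grid search over a range of θ traces out how the optimal weight depends on the elementary score chosen.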
The following contributions were received in writing after the meeting.
Miguel de Carvalho (Pontificia Universidad Católica de Chile, Santiago) andAntónio Rua (Banco de Portugal, and NOVA School of Business and Economics, Lisbon)
As the authors show, the forecasts in the inflation example perform better on some regions of Θ but not on others. Typically in cases where there is no clear‐cut forecast dominance, one might wonder how the Murphy diagram of forecast combinations compares. For example, how does the Murphy diagram of the average of both forecasts compare with the Survey of Professional Forecasters and Michigan forecasts? As can be seen in Fig. 11(a), the average of forecasts performs better on some regions of Θ but not on others. One could ask: 'Is there any other convex combination performing "better"? How do we define "better" in terms of the Murphy diagram?' To approach these questions consider the convex combination of the Survey of Professional Forecasters and Michigan forecasts with weight w on the former, and—extending ideas from Section 3.3—define the area under the Murphy diagram, A(w), and the maximum of the Murphy diagram, B(w), of the combined forecast
with w ∈ [0,1]. Smaller values of these summaries of the Murphy diagram are compatible with good forecast accuracy. Indeed, if there were a value of w for which the combination of forecasts coincided with the data, then A(w)=0 and B(w)=0. Thus, a natural way of defining the 'best' convex linear combination of forecasts by using Murphy diagrams is through the weight w* minimizing A(w), or through the minimax criterion of minimizing B(w). We call the resulting forecast a Murphy optimal combination forecast. For example, for the inflation forecasts the Murphy optimal combination improves on A(1)=0.41 (Survey of Professional Forecasters), A(0)=0.49 (Michigan) and on the mean of forecasts, as well as on B(1)=0.163 and B(0)=0.195. See Figs 11(b) and 11(d) for the plots of A(w) and B(w) over the interval [0,1].
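A grid-search sketch of the Murphy optimal combination follows; the θ-grid spanning the range of the data and the α=0.5 default are illustrative choices, not the discussants' exact settings:

```python
import numpy as np

def elem_quantile(x, y, alpha, theta):
    # elementary alpha-quantile score
    return ((y < x).astype(float) - alpha) * ((theta < x).astype(float) - (theta < y))

def murphy_summaries(x1, x2, y, w, alpha=0.5, n_theta=201):
    # A(w): area under, and B(w): maximum of, the empirical Murphy diagram
    # of the combined forecast w*x1 + (1 - w)*x2
    x1, x2, y = (np.asarray(a, float) for a in (x1, x2, y))
    xw = w * x1 + (1.0 - w) * x2
    thetas = np.linspace(min(y.min(), xw.min()), max(y.max(), xw.max()), n_theta)
    diag = np.array([elem_quantile(xw, y, alpha, th).mean() for th in thetas])
    area = float(np.sum(0.5 * (diag[1:] + diag[:-1]) * np.diff(thetas)))
    return area, float(diag.max())

def murphy_optimal_weight(x1, x2, y, criterion="area", grid=np.linspace(0, 1, 101)):
    # choose w minimising A(w) ("area") or B(w) ("max")
    idx = 0 if criterion == "area" else 1
    vals = [murphy_summaries(x1, x2, y, w)[idx] for w in grid]
    return float(grid[int(np.argmin(vals))])
```

If one of the component forecasts coincides with the data, the search recovers the corresponding corner weight, since A and B then vanish exactly.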

Fig. 11. (a) Murphy diagrams for the inflation example (Survey of Professional Forecasters; Michigan; average combination forecast; Murphy optimal combination forecast), (b) area under the Murphy diagram, (c) Murphy diagrams for the inflation example (Survey of Professional Forecasters; Michigan; Murphy optimal combination forecast) and (d) maximum of the Murphy diagram
Shih‐Kang Chao and Guang Cheng (Purdue University, West Lafayette)
Suppose that the data follow
(37)
(38)
(39)
with 1(·) the indicator function, and with the innovations independent of the σ‐algebra generated by the past observations.
The elementary score used in the Murphy diagram chimes with these two VaR‐standards, where
(40)
Writing the score as the sum of two terms (i) and (ii), we find that (i) measures 'coverage', i.e. condition (a), and that (ii) measures the quality of x mimicking the dynamics of y, i.e. condition (b).
The Murphy diagram (Fig. 12) demonstrates that one of the two estimates 'dominates' the other. This dominance pattern can be linked with standards (a) and (b).

Fig. 12. Murphy diagrams for the two quantile estimates with the scores defined in expression (40); the vertical line marks θ=−5.337
θ ∈ [−5.337,−2]: the large score deviation in this region is mainly due to term (i), caused by the fixed value of one of the estimates for most t.
θ ∈ (−2,6]: the score deviation is much less in this region because the two VaRs have similar terms (i) (due to condition (a)) and similar terms (ii), the latter because the two estimates are close for most t.
Suppose now that the data no longer follow an AR(1) model, that x₁ and x₂ are two estimates for a certain quantile of y, and that x₁ dominates x₂ at t=0. This is equivalent to E{d_t(θ)} ⩽ 0 for all θ, where d_t(θ) denotes the difference of the elementary scores of x₁ and x₂ at time t. To determine the critical point after which x₂ dominates x₁, we may consider a cumulative sum statistic (Siegmund, 1985):
(41)
for some number b. Of course, an appropriate choice of (b,T) requires further study.
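A sketch of the cumulative sum detection step behind expression (41); the aggregation of d_t(θ) over a fixed θ-grid and the stopping rule are illustrative choices:

```python
import numpy as np

def elem_quantile(x, y, alpha, theta):
    # elementary alpha-quantile score (scalar version)
    return (float(y < x) - alpha) * (float(theta < x) - float(theta < y))

def score_differences(x1, x2, y, alpha, thetas):
    # d_t averaged over a theta grid: S_theta(x1_t, y_t) - S_theta(x2_t, y_t)
    return [float(np.mean([elem_quantile(a, yt, alpha, th)
                           - elem_quantile(c, yt, alpha, th)
                           for th in thetas]))
            for a, c, yt in zip(x1, x2, y)]

def cusum_change_point(d, b):
    # first time the cumulative sum of score differences exceeds the boundary b
    # (a sketch of the statistic in expression (41)); None if no crossing
    c = np.cumsum(np.asarray(d, float))
    hits = np.nonzero(c >= b)[0]
    return int(hits[0]) if hits.size else None
```

As the discussants note, calibrating the boundary b against the monitoring horizon requires further study.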
H. M. Christensen (University of Oxford)
As a meteorologist, I found this a very interesting and stimulating paper, with important ramifications for the verification of weather forecasts. Apart from the obvious usefulness of the newly proposed Murphy diagrams, the paper clarifies the importance of specifying the user's functional of interest, and not simply the forecast scenario.
Consider two forecasters competing for business with a wind energy company. The company requires warning if wind speeds will exceed 60 m.p.h., as in this case they must act if they want to prevent damage to the turbines. Their cost–loss ratio is 0.05.
Forecaster A presents the company with the probability that the winds will exceed 60 m.p.h. Forecaster B presents the company with the 95th quantile of his forecast probability distribution function, which can be compared with the cut‐off wind speed. Both forecasts are tailored to the same scenario, but they fall into the two different classes of point forecasts outlined in the paper.
The energy company can rank the two forecasters by comparing their expected profits when decisions are made by using each forecast in turn, given the stated cost–loss ratio and cut‐off wind speed. In addition, for forecaster A, they can use a Murphy diagram to consider how the probability forecast would perform at a variety of cost–loss ratios θ for the given cut‐off threshold. This could be of interest if the turbines became cheaper to replace, or the cost of electricity changed. For forecaster B, they can use a Murphy diagram to consider how the quantile forecast would perform at a variety of wind cut‐offs, θ—important if developments allowed safe use of turbines at higher wind speeds.
However, it is not possible to compare the performance of forecasters A and B except at the original threshold and cost–loss ratio, unless the full predictive probability distribution function were available, in which case equation (16) could be invoked. The user must specify which type of point forecast is of more use to them, to allow for fair comparison and to test whether one point forecast dominates the other. It is likely that the dominance of one forecaster over the other depends on the point forecast requested. In explicitly stating which point forecast is required (probability or quantile), the competing forecasters are provided with an additional goal to use in improving their forecast model and calibrating the resultant forecasts.
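Ranking forecaster A at the stated cost-loss ratio amounts to averaging an elementary score; a minimal sketch, with the cost normalization (false alarm costs θ, miss costs 1−θ) an assumption:

```python
import numpy as np

def elementary_prob_score(p, event, theta=0.05):
    """Average economic (cost-loss) score of probability forecasts p for binary
    events at cost-loss ratio theta: the user acts when p > theta; a false
    alarm costs theta, a miss costs 1 - theta (normalization assumed)."""
    p = np.asarray(p, float)
    event = np.asarray(event, bool)
    false_alarm = theta * ((p > theta) & ~event)
    miss = (1.0 - theta) * ((p <= theta) & event)
    return float((false_alarm + miss).mean())
```

Evaluating the same forecasts over a grid of θ values traces out the Murphy diagram that the company could consult if its cost-loss ratio changed.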
Matei Demetrescu (University of Kiel)
I shall gladly elaborate on some ideas of this thought-provoking contribution. First, an extension of theorem 1, parts (a) and (b), to what could be called the order‐p class, p=3,4,…, is sketched. Second, an economic interpretation complementing that given in Section 2.3 is provided for generic scoring functions. (Throughout, regularity conditions are assumed implicitly.)
The extension builds on the asymmetric power function (see for example Elliott et al. (2005))
S_p(x,y) = |1(y⩽x) − α| |x−y|^p, (42)
which relates to asymmetric linear (p=1) and to asymmetric quadratic (p=2) scoring functions. The optimal point forecast x* under S_p is the unique root of
E{|1(Y⩽x) − α| |x−Y|^{p−1} sgn(x−Y)} = 0. (43)
The order‐p class consists of scoring functions consistent with such 'higher order expectiles'.
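Assuming a first-order condition of the form E{|1(Y⩽x)−α| |x−Y|^{p−1} sgn(x−Y)} = 0 (an assumption about the exact form of expression (43)), the order-p functional can be computed by bisection, since that identification function is non-decreasing in x:

```python
import numpy as np

def power_identification(x, y, alpha, p):
    # assumed first-order condition for the order-p 'expectile':
    # mean of |1(y <= x) - alpha| * sign(x - y) * |x - y|**(p - 1)
    w = np.abs((y <= x).astype(float) - alpha)
    return float(np.mean(w * np.sign(x - y) * np.abs(x - y) ** (p - 1)))

def power_expectile(y, alpha, p, iters=200):
    # bisection works because the identification function is non-decreasing in x
    y = np.asarray(y, float)
    lo, hi = y.min() - 1.0, y.max() + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if power_identification(mid, y, alpha, p) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For p=2 and α=0.5 the root is the sample mean, and raising α pushes the order-p functional upwards, as for ordinary expectiles.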
For p ⩾ 3, let φ have a positive, non‐decreasing pth‐order derivative. Then,
(44)
(45)
One can then proceed with the function defined in equation (45) instead of the Bregman‐type Φ (equation (29)). A full treatment of the mixture representations for p=3,4,… is beyond the scope of this note.
Instead, consider the following economic interpretation of some S(x,y), for which a Choquet representation need not exist explicitly. Let Y model the distribution of the net monetary pay‐off of an uncertain investment; invest when the predicted pay‐off T(F) is larger than the pay‐off of doing nothing, i.e. T(F)>0. This parallels the decision rules given in Section 2.3 but with different thresholds. Considering alternatively a utility‐based framework, one invests only when the expected utility E[U(Y)] of the investment is larger than the utility U(0) of the certain pay‐off, i.e. E[U(Y)−U(0)]>0.



(46)
With linear utility U, one recovers T(F)=E_F(Y), matching risk neutrality where only the expected pay‐off matters.
Francis X. Diebold (University of Pennsylvania, Philadelphia)
The paper is stimulating and welcome. It most definitely pushes us forward. Given the space constraints, I shall focus on just one of its themes: the ranking of imperfect forecasts. The authors emphasize that Choquet‐type mixture representations give rise to simple checks of whether one point forecast dominates another in the sense that it is preferable under any consistent loss function. Econometricians have grappled for some time with related dominance issues and have developed tests for whether one forecast ‘stochastically dominates’ another; see Corradi and Swanson (2013), Lee et al. (2014) and Jin et al. (2015), and the references therein.
Unfortunately, however, the typical situation (indeed noted by the authors) is that no forecast dominates all others. Hence the issue of 'loss function choice' remains, and rankings of competing forecasts will in general depend on the loss function chosen. Typically, little thought is given to the loss function L(e). Instead, Gauss's centuries‐old quadratic loss, L(e)=e², remains routinely invoked, primarily for mathematical convenience.
Diebold and Shin (2015) suggest comparing distributions of forecast errors instead. Let F(e) denote the cumulative distribution function of the forecast error e, and let F₀(e)=1(e⩾0) (the unit step function at 0), which is the error cumulative distribution function that corresponds to perfect forecasts. (If e is governed by F₀, then e=0 with probability 1.) The idea is simply to compare F(e) with F₀(e) and to favour the forecast whose F(e) is 'closest' to F₀(e). Formally, we favour the forecast with smallest 'stochastic error distance' SED, given by the integral of |F(e)−F₀(e)| over e from −∞ to ∞.
The SED‐approach yields useful insights with important practical implications. In particular, it directs attention away from squared error loss and towards absolute error loss. Indeed the key result of Diebold and Shin (2015) is that SED‐loss equals expected absolute error loss, i.e. SED(F,F₀)=E|e|. Among other things, this suggests shifting attention away from conditional mean forecasts and towards conditional median forecasts.
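The identity SED = E|e| is easy to confirm numerically with an empirical error cumulative distribution function:

```python
import numpy as np

def sed(errors, n_grid=200_001):
    # stochastic error distance: integral of |F(e) - F0(e)| de, with F the
    # empirical cdf of the forecast errors and F0 the unit step at 0
    e = np.sort(np.asarray(errors, float))
    lo, hi = min(e[0], 0.0) - 1.0, max(e[-1], 0.0) + 1.0
    grid = np.linspace(lo, hi, n_grid)
    F = np.searchsorted(e, grid, side="right") / e.size   # empirical cdf
    F0 = (grid >= 0.0).astype(float)                      # perfect-forecast cdf
    return float(np.sum(np.abs(F - F0)) * (grid[1] - grid[0]))  # Riemann sum
```

Up to the discretization error of the Riemann sum, `sed(errors)` matches the mean absolute error of the same sample.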
Dalia Ghanem (University of California at Davis)
To choose between two point forecasts, typically presented as a quantile or expectile of a predictive distribution, a decision maker employs a scoring function. In practical decision-making settings, the choice of scoring function may play a crucial role in the ranking of the two forecasts. The practical problem, however, is that there is usually no justification for choosing one consistent scoring function over many others. Ehm and his co‐authors propose a theoretically sound and empirically attractive method to choose between competing point forecasts. From a practical perspective, one of the key advantages of the proposed method is that it admits an easily interpretable graphical representation, the Murphy diagram.
There are two key results in this paper. The first is that any consistent scoring function for the functionals of interest can be represented as a mixture over a one‐parameter family of extremal scoring functions. The second key result states that comparing the sample analogues of the extremal scores at a finite number of points is sufficient to establish that one empirical forecast dominates another. These two results together bring forward an intuitive and tractable method to compare competing forecasts without relying on a particular choice of scoring function.
In practice, one empirical forecast may dominate another over part of the support of the extremal scoring function, whereas the latter dominates over another part. In these settings, valid inference plays a crucial role. The current paper uses pointwise confidence bands, but the authors do not discuss why pointwise inference is appropriate; naturally, this depends on the inference question that one seeks to answer. One argument in favour of pointwise inference is that each point in the support of the extremal scoring function corresponds to a particular scoring function, and the researcher seeks to test whether the two forecasts lead to equal values for each particular scoring function. If that is so, then one may use pointwise inference but must account for multiple testing. In contrast, uniform inference may be appropriate if one is interested in comparing the extremal scoring functions 'as a whole'. This is especially important since the points at which the extremal scoring functions are evaluated are realizations of random variables and their number increases with sample size. A careful treatment of these inference issues is beyond the scope of the current paper but would be a valuable direction for future research.
Wolfgang K. Härdle (Humboldt‐Universität zu Berlin and Singapore Management University) and Chen Huang (Humboldt‐Universität zu Berlin)
When evaluating and ranking the performance of forecasters, the choice of a scoring function that measures loss properly is a crucial question, and it has motivated the authors of this paper to establish general guidance for this issue. For statistical functionals, for example the mean, or quantiles and expectiles more generally, a representation of consistent scoring functions is provided as a weighted mixture over extremal scores. In other words, all consistent scoring functions can be reduced in this way to a one‐dimensional family of extremal scores. The authors also show that such representations can be considered Choquet representations. Consequently, one point forecast dominates another if it is preferred in terms of each elementary score. This leads to a simple and uniform standard for assessing different point forecasts, avoiding the influence of the choice of scoring function. In this paper the authors focus on the cases of quantile and expectile functionals and give intuitive economic interpretations for them.
Corresponding to empirical measures, forecasts can easily be compared in terms of the averages of the extremal scores. The results of such comparisons are made clearly visible in the so‐called Murphy diagrams, which show the average scores over samples as well as the score differences together with pointwise confidence bands. The paper illustrates three examples, in line with the meteorological and economic studies in three previous works, to examine whether the benchmarks outperform the alternatives consistently.
We congratulate and thank the authors for their breakthrough in evaluating the forecasting performance of quantiles and expectiles. It would be nice if their methodology could be applied to forecast tail event curves (conditional quantile or expectile curves in the context of functional data analysis). Studying extreme behaviour by tail event curves has recently attracted increasing attention in many applications, such as temperature and wind forecasting, derivative pricing and risk management. Taking information on tails into consideration can help to improve forecasting precision, and the methodology may well become a widely used tool in many areas of industry.
Hajo Holzmann (Philipps‐Universität Marburg) and Bernhard Klar (Karlsruher Institut für Technologie, Karlsruhe)
The authors have provided us with an interesting paper on the comparison of distinct forecasts of quantiles and expectiles. Recently, Gneiting (2011) made a strong point that such forecast comparisons should be based on consistent scoring functions. However, there are a variety of consistent scoring functions for both quantiles and expectiles, and empirical forecast rankings may depend on the particular choice.
Therefore, the authors introduce the notion of forecast dominance for all consistent scoring functions and show that, in the case of quantiles or expectiles, this can be checked by considering appropriate one‐parameter families of elementary scoring functions.
We feel that this notion of forecast dominance is very strong and can often not be expected to hold even in situations where one forecast should clearly be favoured over another. Consider the example given in Table 2 in the paper, and let us focus on forecasting the median. Introduce another forecaster, call her extreme, who issues 10μ. Murphy diagrams for the expected score as well as for empirical scores for samples of size n=30 are given in Fig. 13. The curves of the perfect and the extreme forecaster all touch each other at ϑ=0, and the empirical curves in Figs 13(b)–13(d) actually all intersect, even in the 'reasonable' subinterval [−1,1] of values for the parameter ϑ. We simulated the probability that the Murphy diagrams intersect (not only touch) in the interval [−1,1] for distinct sample sizes; the results are in Table 4. Whereas the probability converges to 0 for the climatological and the sign‐reversed forecasters, it actually comes close to 1 for the extreme forecaster. In contrast, the probability that the classical median score, the absolute loss, of the perfect forecast exceeds that of one of the other forecasters quickly tends to 0, as shown in Table 5.

Fig. 13. Murphy diagrams for the perfect, climatological, sign‐reversed and extreme forecasters: (a) theoretical scores; (b)–(d) scores for distinct samples of size n=30
Table 4. Simulated probability (%) that the empirical Murphy diagrams of the perfect forecaster and the given competitor intersect in [−1,1]

| n | Climatological | Sign reversed | Extreme |
|---|---|---|---|
| 10 | 63.5 | 21.5 | 85.2 |
| 20 | 58.4 | 8.2 | 91.9 |
| 30 | 51.2 | 3.0 | 94.4 |
| 50 | 39.1 | 0 | 96.6 |
| 100 | 20.8 | 0 | 98.1 |
| 200 | 5.2 | 0 | 98.9 |
Table 5. Simulated probability (%) that the absolute loss of the perfect forecaster exceeds that of the given competitor

| n | Climatological | Sign reversed | Extreme |
|---|---|---|---|
| 10 | 8.7 | 0.6 | 0 |
| 20 | 2.9 | 0 | 0 |
| 30 | 0.8 | 0 | 0 |
| 50 | 0 | 0 | 0 |
Thus, although Murphy diagrams are useful descriptive tools for comparing forecasts, the formal notion of forecast dominance which involves a large family of consistent but not strictly consistent scoring functions should be used with care.
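The simulation behind Table 4 can be sketched as follows; the data-generating mechanism, μ ~ N(0,1) and Y|μ ~ N(μ,1), is assumed from the paper's Table 2 setting:

```python
import numpy as np

def elem_median(x, y, theta):
    # elementary score for the median (alpha = 0.5)
    return ((y < x).astype(float) - 0.5) * ((theta < x).astype(float) - (theta < y))

def diagrams_intersect(x1, x2, y, thetas):
    # do the empirical Murphy diagrams of two forecasts cross (not merely touch)?
    d = np.array([elem_median(x1, y, th).mean() - elem_median(x2, y, th).mean()
                  for th in thetas])
    return bool((d > 1e-12).any() and (d < -1e-12).any())

def crossing_probability(forecast_factor, n=30, reps=500, seed=1):
    # assumed Table 2 setting: mu ~ N(0,1), Y | mu ~ N(mu, 1); the perfect
    # forecaster issues the median mu, the competitor issues forecast_factor * mu
    rng = np.random.default_rng(seed)
    thetas = np.linspace(-1.0, 1.0, 41)
    hits = 0
    for _ in range(reps):
        mu = rng.normal(size=n)
        y = mu + rng.normal(size=n)
        hits += diagrams_intersect(mu, forecast_factor * mu, y, thetas)
    return hits / reps
```

With `forecast_factor=10` (the extreme forecaster) the crossing probability at n=30 is close to 1, and with `forecast_factor=-1` (the sign-reversed forecaster) it is close to 0, mirroring Table 4.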
Ian Jolliffe (University of Exeter)
I add my congratulations to the authors on a stimulating paper. It is fitting that they have named their key diagram the Murphy diagram. Before his untimely death in 1997, Allan Murphy not only did a tremendous amount of work in the field of what atmospheric scientists call forecast verification, but he was also a central figure in bringing together the statistical and climate communities in a series of 3‐yearly International Meetings on Statistical Climatology, the 13th of which is scheduled to take place in 2016.
I have just two practical questions on the paper. I may be wrong but it seemed to me that a dominated forecast is somewhat like an inadmissible decision strategy in decision theory. Typically, consideration of admissibility eliminates some strategies but rarely leaves only one admissible strategy. Similarly, I would have expected dominance rarely to give an unequivocal best forecast. However, the paper has examples of single best forecasts and I was wondering whether the authors could give some idea of how often this happens.
Different aspects of a forecast may be of practical interest, even to the same user, e.g. reliability and resolution in the decomposition of the Brier score for probability forecasts (Murphy, 1973) or mean‐square error and correlation for continuous forecasts (Murphy, 1988). One forecasting system may do better for one aspect, but a different forecasting system may be better on another. Do the authors have any thoughts on what should be done in such situations, if anything, beyond simply recording relative performance of the forecasting systems for each aspect?
Victor Richmond R. Jose (Georgetown University, Washington DC)
- One of the topics that has been alluded to but perhaps not given much attention is that of calibration. For quantiles, calibration can be studied by matching the percentage of times that realizations fall above or below reported forecasts to α. For expectiles, the notion of calibration is not as direct but perhaps can be generated if one were to view expectiles as quantiles of a closely related function (Jones, 1994). In some earlier works on forecast dominance in the probability setting, the notion of calibration appears to be inseparable from dominance (e.g. Schervish (1989)). One then wonders whether the notion of calibration is relevant when dealing with quantiles and expectiles. If so, could calibration be meaningfully incorporated in this setting?
- Some generalizations and extensions come to mind when dealing with order sensitivity and dominance. For a scoring rule S, order sensitivity could be generalized by carefully defining a distance function d associated with S. This may be a good starting exercise to think about how these concepts can be extended more generally as newer definitions of quantiles and expectiles appear, for example, in the multivariate setting.
- Consider two predictive distributions F_1 and F_2 about Y; what conditions are needed for F_1 to attain a lower minimum score under S compared with F_2, i.e. when would

min_x E_{F_1}{S(x, Y)} ≤ min_x E_{F_2}{S(x, Y)},

where the minima are attained at x_k = T(F_k) for k=1,2? Jose and Winkler (2009) showed that the necessary and sufficient condition for this is that the two distributions must be stochastically ordered in the dilation sense (Shaked and Shanthikumar, 2007). It would be interesting to see whether such a type of dominance could also be extended to the expectile scoring rules as well as the extremal versions of both rules.
Alex Lenkoski and Thordis L. Thorarinsdottir (Norwegian Computing Center, Oslo)
We congratulate the authors on an excellent paper which addresses several issues that we regularly encounter in our work. The Murphy diagram offers an informative way of assessing predictive performance and we are interested to hear the authors’ thoughts regarding its use to improve the forecast in settings where there is not a single dominating forecast. To illustrate this, we consider the daily return on IBM's stock from January 3rd, 1962, to December 24th, 2015: a time series of 13589 points.
Let Y_t be the return on day t. We consider forecasting models of the form (47), indexed by a decay parameter ν. The challenge with model (47) is in choosing the tuning parameter ν. As an example, Fig. 14 shows the Murphy diagram for ν=0.96 and ν=0.99 for the 0.5%-quantile of the daily returns, where the scores are averaged over the final 10000 time points. For very low values of θ, the model with ν=0.96 performs better, as the more rapidly decaying model performs better at times with high volatility. Similarly, as θ increases, the inclusion of more historic information in model (47) improves the performance. These results are significant at the 95% level (results not shown).

Fig. 14. Murphy diagrams for ν=0.96 and ν=0.99 for 0.5%-quantile predictions
In this situation, it seems that the parameter ν should either be chosen dynamically or the forecasts could be combined, potentially to outperform each constituent method. Do you have any suggestions for how to use results such as those shown in Fig. 14 to create a new, improved forecast?
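The empirical Murphy diagram comparison described in this comment can be sketched in a few lines. The elementary quantile score below follows the extremal representation in the paper; the data and the two constant forecasts are synthetic placeholders, not the IBM return series or model (47).

```python
import numpy as np

def elementary_quantile_score(x, y, theta, alpha):
    """Extremal score for the alpha-quantile: positive only when the
    threshold theta separates the forecast x from the realization y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((y < x).astype(float) - alpha) * \
           ((theta < x).astype(float) - (theta < y).astype(float))

def murphy_curve(x, y, alpha, thetas):
    """Average elementary score for each theta (one Murphy curve)."""
    return np.array([elementary_quantile_score(x, y, t, alpha).mean()
                     for t in thetas])

# Synthetic heavy-tailed "returns" standing in for the real data
rng = np.random.default_rng(0)
y = 0.01 * rng.standard_t(df=5, size=5000)
alpha = 0.005
good = np.full_like(y, np.quantile(y, alpha))  # well-calibrated forecast
bad = 0.5 * good                               # insufficiently cautious forecast
thetas = np.linspace(-0.1, 0.0, 101)
curve_good = murphy_curve(good, y, alpha, thetas)
curve_bad = murphy_curve(bad, y, alpha, thetas)
# Plotting curve_good and curve_bad against thetas gives the Murphy diagram;
# one curve lying below the other for all thetas indicates empirical dominance.
```

Since the elementary scores are pointwise non-negative, each curve is non-negative, and the well-calibrated forecast attains the smaller average score over the grid.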
Han Liu (Princeton University)
I congratulate the authors on making this thought-provoking contribution. In particular, I found the discussion of the Choquet-type mixture representation in Sections 1 and 2.2 very intriguing. Here I comment on possible implications of this representation for the field of statistical machine learning. Specifically, I first present a new supervised learning framework based on the mixture representation. I then raise several questions that remain to be answered.
Suppose that we observe samples (y_i, x_i), i = 1, …, n, where y_i ∈ ℝ is the response variable and x_i is a d-dimensional covariate. Denote (Y, X) to be the population quantities. Let S(x, y) = (x − y)² be the squared error loss and f a measurable function; define the risk R(f) = E{S(f(X), Y)}. Let F be a set of functions; define f* = arg min_{f ∈ F} R(f). In the machine learning community, f* is in general deemed to be the optimal predictor within the model class F. An important machine learning task is to construct an estimator f̂ based on the samples which satisfies R(f̂) − R(f*) → 0.
However, f* is optimal only with respect to the squared error loss. As has been insightfully pointed out by the authors, the squared error loss is just a special case of a larger family of scoring functions S_θ(x, y), indexed by θ ∈ ℝ, that are consistent for the mean functional. In addition, the authors show that any scoring function S that is consistent for the mean can be represented as

S(x, y) = ∫ S_θ(x, y) dH(θ)

for some non-negative measure H. This suggests a target f*_ℋ that is simultaneously robust to a set of scoring functions:

f*_ℋ = arg min_{f ∈ F} sup_{H ∈ ℋ} ∫ E{S_θ(f(X), Y)} dH(θ),

where ℋ is a set of non-negative measures. An empirical risk minimizer is defined as

f̂_ℋ = arg min_{f ∈ F} sup_{H ∈ ℋ} ∫ {n⁻¹ Σ_{i=1}^n S_θ(f(x_i), y_i)} dH(θ).
- Under what conditions on F and ℋ is the empirical risk minimizer f̂_ℋ computationally tractable?
- In which situations is f*_ℋ preferred to f*?
- What is the optimal rate of convergence for estimating f*_ℋ? This rate may depend on the volumes of both F and ℋ.
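A toy version of this minimax programme can be written down directly. The elementary score for the mean below uses the scaling in which the squared error equals twice its integral over θ; the model class (constant predictors) and the two mixing measures are illustrative choices of mine, not anything prescribed by the paper or the discussant.

```python
import numpy as np

def elem_mean_score(x, y, theta):
    """Elementary score for the mean functional, scaled so that
    (x - y)^2 = 2 * integral over theta of S_theta(x, y): the score equals
    |y - theta| exactly when theta lies between forecast x and outcome y."""
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    return np.abs(y - theta) * ((lo <= theta) & (theta < hi))

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=2000)
thetas = np.linspace(-3.0, 5.0, 81)

# Two illustrative mixing measures over theta: a uniform one (recovering
# squared error up to scaling) and one emphasizing large thresholds.
measures = {
    "uniform": np.full(thetas.size, 1.0 / thetas.size),
    "upper": np.exp(thetas) / np.exp(thetas).sum(),
}

def minimax_risk(c):
    """Worst-case weighted empirical risk of the constant predictor c."""
    avg_scores = np.array([elem_mean_score(c, y, t).mean() for t in thetas])
    return max(float(w @ avg_scores) for w in measures.values())

grid = np.linspace(0.0, 2.0, 201)
c_hat = grid[np.argmin([minimax_risk(c) for c in grid])]
# c_hat is the minimax empirical risk minimizer within the constant class.
```

The inner supremum over a finite set of measures keeps the problem tractable; richer classes ℋ lead directly to the tractability questions raised above.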
Jorge Mateu (University Jaume I, Castellón)
The authors are to be congratulated on a valuable contribution and thought‐provoking paper on this timely topic of point prediction that forecasters have to face in a variety of transdisciplinary problems. As the authors state, there is an ample consensus that forecasts ought to be probabilistic in nature, taking the form of predictive probability distributions over future quantities or events.
I shall focus on a particular methodological problem to add some food for thought. The problem relates to evaluating and comparing competing forecasts, and how a consistent scoring function can be developed and properly used. Motivated by the approach of Strähl and Ziegel (2015) for time series, consider the case where we do have spatial dependence.
Analysis of environmental phenomena for risk assessment usually involves the construction of indicators that are related to structural characteristics of extremal events defined by exceedances over critical thresholds. Recurrence and persistence, among others, are examples of such characteristics which provide information about the distribution patterns of extremal events. Formally, these concepts are intimately related to the geometrical characteristics of the excursion sets defined by threshold exceedances over a given (bounded) domain (Angulo and Madrid, 2010). Given the fragmented nature of threshold exceedance sets, depending on the variation properties that are inherited by sample paths from the probabilistic structure of a random field and on the threshold considered, marked point processes provide a powerful framework for the analysis of their structural properties (Madrid et al., 2012). This approach can be exploited to help to establish the bridge between the construction and interpretation of risk indicators and the properties of the underlying random field generating critical events. Connected components of a threshold exceedance set can be treated as single isolated events, with some geometrical properties such as size and contour roughness being considered as possible marks of interest for complementary analysis of diverse forms of heterogeneity and anisotropy. Hence, a variety of marked point process characteristics can be used to describe some features of interest, in particular for risk assessment purposes.
Marked point process models provide predictive probability distributions of these types of exceedances. When facing the problem of a point prediction of such an exceedance in a region of interest based on neighbouring spatial dependence, consistent and extremal scoring functions for functionals such as quantiles or even expectiles are of key interest. Open questions, such as defining appropriate scoring functions and selecting a scoring function that dominates its competitors, are then worth considering.
Xiaochun Meng (University of Oxford)
The authors are to be congratulated on an excellent paper. It is interesting to see that the Murphy diagram can be used to compare estimators for the full set of proper scoring functions. My research focuses on quantile estimation for time series, and in this context the authors note that the quantile regression check loss function is not the only proper scoring rule. Three issues concern me. Firstly, I am unsure how one should proceed if the Murphy diagram does not show clear dominance of one method over another. Perhaps dominance in one part of the diagram is preferable to dominance in another. Secondly, I am interested in the implications of the authors’ work for the empirical average scoring function, which is obtained by taking the sum of the scoring function for each day in a certain period. For a typical time series of daily financial returns, the distribution of the return on one day is unlikely to be exactly the same as the distribution of the return on a different day, and these distributions are likely to be notably different if there is a reasonable period of time between the two days. When evaluating a quantile forecasting method by using a post‐sample period of, say, 500 days, it is not clear to me whether the check loss function still serves as a proper scoring function. Thirdly, I am curious whether the authors’ work on evaluation measures has implications for the choice of loss function to use in model estimation.
Edgar Merkle (University of Missouri, Columbia)
I congratulate the authors on a clear paper that impacts a variety of disparate fields. I especially appreciate the notion of forecaster dominance and see many applications to the evaluation of both forecasters and general statistical models that make probabilistic predictions.
It seems that further connections could be made between dominance and forecast aggregation or combination. If forecaster A dominates forecaster B (in the sense of definition 2), then does this imply that forecaster A is better than all convex combinations of forecasters A and B? Previous researchers (e.g. DeGroot and Fienberg (1983) and Schervish (1989)) hinted that we cannot immediately answer this question even when we know that one forecaster dominates the other.
- forecaster ID 3 empirically dominates forecasters ID 6 and ID 8, and
- forecaster ID 5 empirically dominates forecaster ID 10.
For both these relationships, the mean Brier score of the dominating forecaster is better than the mean Brier score when we average the dominating forecaster with the corresponding dominated forecasters (for example, forecaster ID 3 has a better Brier score than the average of forecasters ID 3, ID 6 and ID 8). However, when we include other forecasters in the average, results can change. For example, the mean Brier score that is associated with the average of all 10 forecasters is better than the mean Brier score that is associated with the average of all forecasters except ID 6 and ID 8. So forecasters ID 6 and ID 8 can help the full group, even though they are dominated by another forecaster within the group. It seems that we would have to tell a forecast consumer that forecaster ID 3 is clearly better than forecasters ID 6 and ID 8, but we should still keep forecasters ID 6 and ID 8 around to counteract the others.
In addition to helping us to discern dominance, perhaps the empirical scores could be used to develop some weighting of forecasters that leads to the lowest score, on average, as θ goes from 0 to 1. Such developments may prove useful for producing aggregate forecasts that are robust to choice of scoring rule, which is an interest of many forecast consumers.
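Merkle's observation about combinations can be made concrete with a small synthetic check: by pointwise convexity of the squared error in the forecast, the Brier score of an equally weighted combination can never exceed the average of the individual Brier scores. The forecasters below are simulated, not those from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.8, size=1000)                   # latent event probabilities
y = rng.binomial(1, p).astype(float)                   # binary outcomes
f_a = np.clip(p + rng.normal(0.0, 0.1, p.size), 0, 1)  # noisy forecaster A
f_b = np.clip(p + rng.normal(0.0, 0.1, p.size), 0, 1)  # noisy forecaster B

def brier(f):
    """Mean Brier score of probability forecasts f against outcomes y."""
    return float(np.mean((f - y) ** 2))

combined = brier((f_a + f_b) / 2)
# Jensen's inequality, applied pointwise, guarantees
#   combined <= (brier(f_a) + brier(f_b)) / 2,
# and with roughly independent errors the combination typically beats both.
```

This is exactly why a dominated forecaster can still improve a pooled forecast: dominance compares forecasters individually, while combination benefits from error diversity.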
Monica Musio (Università degli Studi di Cagliari)
In the following comments I use the same notation as in the paper.
Suppose that T(F) is an estimable functional, and S(t,Y) is a consistent scoring function for T. This is just a loss function for a decision problem, whose Bayes act (for simplicity here assumed unique), when Y∼F, is T(F), by consistency. By general theory (see Dawid (1986)) this implies that S{T(F), y} is a proper scoring rule, whose expectation E_{Y∼G}[S{T(F), Y}] when Y∼G is minimized (though in general not uniquely) by the honest choice F=G.
, and a parametric family
, we can construct an associated M‐estimator of θ. For independent identically distributed data
, this is given by

is the empirical distribution function of the data. In the case at hand,
. By consistency, this is minimized for any θ such that

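As a minimal sketch of this M-estimation idea, take a hypothetical location family with T(F_θ) = θ and the check (pinball) loss, which is consistent for the α-quantile; minimizing the empirical score then recovers T(F̂), i.e. the empirical quantile.

```python
import numpy as np

def check_loss(x, y, alpha):
    """Average check (pinball) loss of point forecast x, a scoring
    function that is consistent for the alpha-quantile."""
    u = y - x
    return float(np.mean(np.where(u >= 0, alpha * u, (alpha - 1.0) * u)))

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=4000)
alpha = 0.8

# M-estimation over a grid of candidate location parameters theta,
# taking T(F_theta) = theta for this simple location family.
grid = np.linspace(0.0, 10.0, 2001)
theta_hat = grid[np.argmin([check_loss(t, y, alpha) for t in grid])]
# By consistency, theta_hat matches the empirical 0.8-quantile of the
# data up to the grid resolution.
```

The same template works for any consistent scoring function S and functional T; only the loss inside the minimization changes.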
Andrew Patton (New York University and Duke University, Durham)
I congratulate the authors on an insightful and stimulating paper. Their ‘Choquet representation’ opens the way to new ways of thinking about forecast comparison, model estimation and more. By showing how a class of loss functions (scoring rules) can be represented as a mixture of ‘elementary scores’, where the measure governing the mixture is indexed by a scalar parameter, consumers (and producers) of forecasts may gain new insights into whether one forecast dominates another. I think that ‘Murphy diagrams’ will make a useful addition to the empirical researcher's toolkit.
Like all estimated quantities, though, it is important to consider sampling variation. In some applications, this may be less important: in Fig. 6(d) we observe that the difference in average elementary scores crosses zero, indicating that the ranking of these two forecasts is sensitive to the choice of the elementary score parameter (θ), and the need for methods of inference is not so strongly felt. In Fig. 6(e), however, we observe that the difference in average elementary scores is non‐positive for all values of θ, hinting that one forecast (the Survey of Professional Forecasters forecast, in this case) may be preferred to the other (a forecast based on a probit model) for any consistent scoring rule. The pointwise confidence intervals that are given in Fig. 6(e) are clear and easily interpreted guidelines, but for a formal test of forecast dominance (i.e. dominance across the range of all elementary score parameters) we need a joint test. Considering such tests would be very difficult without the authors’ Choquet representation and, indeed, as the authors note, they remain challenging even with this new tool. But similar hypotheses arise in other applications, e.g. in testing for first‐ and second‐order stochastic dominance of asset returns (see Linton et al. (2010) for a recent contribution to this literature), and perhaps the methods that are used in those applications may be adapted for use here.
As the authors note, it may not be common to observe forecast dominance in practice, and I agree with their conjecture. With the growing number of on‐line surveys, and as economic (and other) surveys grow and gather respondents, however, one could imagine that forecast dominance may become more relevant in the future. Implementing tests of forecast dominance may also improve decision making: combination forecasts have been found to work well in many economic applications (see Timmermann (2006) for a survey), and these combinations can be improved by excluding truly awful individual forecasts. Tests of forecast dominance, perhaps taking into account the multitude of individual forecasts in such an application, could yield great gains.
Justinas Pelenis (Institute for Advanced Studies, Vienna)
I congratulate the authors on this timely and interesting contribution to the forecast ranking literature. For the cases of expectiles and quantiles they present a novel and very important result demonstrating how any consistent scoring function of a quantile or expectile would admit a unique representation as a mixture of elementary elements. This important result limits the need to explore a variety of consistent scoring functions and allows us to focus simply on the elementary quantile and expectile scoring functions.
The usefulness of this mixture representation becomes apparent when we consider ranking competing forecasts of a target functional. If one forecast is preferred over a competing forecast for all elementary scoring functions, then it is ranked above the competitor for any consistent scoring function of that functional. The authors claim that
‘the weighting becomes irrelevant in the case that there are dominance relationships between competing forecasts, ...’
suggesting that the choice of the mixing measure H(·) for forecast comparison is irrelevant when there is a dominance relationship between the competing forecasts.
Nonetheless, I suggest that the mixing measure might still be of relevance even in the case when there is a dominance relationship between competing forecasts. The mixing measure might matter when we consider M-estimation methods based on consistent scoring functions. Suppose that we are interested in the α-expectile, and two competing misspecified models A and B and model-based forecasts are to be considered. It is plausible that M-estimation based on one consistent scoring function would deliver that the forecasts from model A dominate those from model B, whereas M-estimation based on a different consistent scoring function would deliver that the forecasts from model B dominate those from model A. It seems plausible that the dominance could be reversed depending on the consistent scoring function that is used for estimation. The finding that one set of forecasts dominates another set of forecasts might be an artefact of the mixing measure that is used for estimation, and the weighting could be considered relevant even in the cases when dominance is observed.
It is true that the methods that are considered in this paper apply only to the comparison of forecasts and not to the comparison of models; nonetheless a practitioner might be interested in estimating a set of models and producing model‐based forecasts. Therefore, one should explore the consequences of the choice of the mixing measure H(·) and I second the authors’ suggestion that ‘... the question of the optimal choice of loss or scoring function to be used for estimation in regression problems’ is of considerable interest for future research.
Pierre Pinson (Technical University of Denmark, Lyngby)
The authors are to be congratulated for an intriguing and exciting contribution to the science of forecasting and more particularly to forecast verification. Some fundamental contributions to this field can be traced back to the 1950s. Even so, some of the key issues in forecasting still appear to be open problems today. I mainly think here of the link between forecast quality and value. Those were elegantly defined by Murphy (1993), in short, as the objective correspondence of the forecasts with the process observations for the former, and as the increased benefits from integrating such forecasts in decision processes for the latter. Linking these two is definitely not trivial in a general sense, even though for specific example problems and toy models one can nicely illustrate an existing (or non‐existing) connection. For forecasters having to deal with a wealth of decision makers with different decision problems and loss functions, it can never be possible to consider all potential problems and loss functions to assess the value of their forecast to these decision makers.
Consequently, the intriguing nature of that contribution lies in the fact that the authors investigated and found another rather elegant path to make a better connection between forecast quality and value. Elementary scoring functions and Choquet‐type mixture representations allow defining a simple toolbox to assess whether a forecaster (or a set of forecasts) dominates another, under any consistent scoring function. Even though it cannot readily tell whether this forecaster will yield higher value in all potential decision processes, the strength of the authors’ result is something that brings us closer to ensuring that forecasts seen as having higher quality should eventually yield higher value, whatever their loss function.
On the basis of this contribution, one is left wondering how practical this result and the so-called Murphy diagrams may be in empirical forecast comparisons. The authors mention that empirical dominance corresponds to one elementary scoring curve lying under another, for all θ. How informative really is the case where curves intersect? Besides, this result is given in a univariate set-up only, as if, when making a decision at time t, the decision maker would consider a single variable at time t+k only. In many practical applications, decision makers must account jointly for information from many variables, possibly various lead times, locations, etc. I therefore wonder whether such results and diagnostic tools could generalize to multivariate set-ups.
Barbara Rossi (Istitució Catalana de Recerca i Estudis Avançats–Universitat Pompeu Fabra, Barcelona, Barcelona Graduate School of Economics, and Centre de Recerca en Economia Internacional, Barcelona)
Previous literature has found that the ranking of competing forecasts may depend on the choice of the scoring function (e.g. Patton (2015)). This finding clearly has the problematic implication that, for a given loss function, a forecast may be superior to another, but, at the same time, the latter forecast may be superior to the former when a different scoring function is used.
This paper makes a very useful point: it shows that every consistent scoring function can be written as a weighted average of elementary scoring functions, where the weights depend on a parameter θ. Thus, if the average elementary scoring functions of a forecast are lower than those of a competing forecast for every value of θ, then the former is superior, no matter what the choice of the loss function is. This can be easily analysed by depicting the differences of the scoring functions of competing forecasts as a function of θ, as shown for example in Fig. 6(d) in the paper. In fact, if, for example, the difference of the elementary scoring functions of the Survey of Professional Forecasters and the Michigan survey forecasts of inflation is negative for every value of θ, then the Survey of Professional Forecasters provides better forecasts, no matter which scoring function is used to evaluate the forecasts.
I would like to complement the analysis made in this paper by observing that, empirically, the relative ranking of competing forecasts may also depend on the sample period. Rossi (2013) has shown that the relative ranking of competing forecasting models of US inflation, for example, may change over time, for a given scoring function. The very interesting results in this paper could therefore be extended one step further. In fact, one could plot the differences in the elementary scoring functions in rolling windows, as opposed to the full out‐of‐sample period, and then verify that the individual scoring function differences are negative at each point in time by using Giacomini and Rossi's (2010) fluctuation test. Such an analysis would be useful to determine at which points in time the ranking of competing forecasts is independent of the choice of the scoring function and when it is not. This may provide additional valuable information to study the reliability of a particular forecasting model no matter which scoring function and sample period the forecaster considers.
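Rossi's suggestion amounts to tracking the rolling mean of elementary score differences over time, for each θ. The sketch below uses synthetic score differences with a built-in regime change; in practice the fluctuation test of Giacomini and Rossi (2010) would supply formal critical values.

```python
import numpy as np

def rolling_mean(a, window):
    """Rolling average computed with cumulative sums."""
    c = np.cumsum(np.insert(a, 0, 0.0))
    return (c[window:] - c[:-window]) / window

rng = np.random.default_rng(0)
t = np.arange(1000)
# Synthetic per-period score differences d_t between two forecasts
# (negative values favor forecast 1), with the ranking flipping at t = 500.
d = np.where(t < 500, -0.1, 0.1) + rng.normal(0.0, 1.0, t.size)
rolled = rolling_mean(d, window=100)
# Plotting `rolled` over time reveals when each forecast is ahead; repeating
# this for each theta gives a time-indexed family of Murphy comparisons.
```

A ranking that is stable across both θ and the window position is then robust to the choice of scoring function and of sample period.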
Ville A. Satopää (University of Pennsylvania, Philadelphia)
The scoring function is often selected without a justification. This is particularly unfortunate because, as the authors of the current paper point out, different choices can lead to different relative rankings of the competing forecasting methods. Motivated by this, the authors express a large class of scoring functions as different mixtures of simple, interpretable extremal scoring functions indexed by a parameter θ. Furthermore, by considering a subclass of scoring functions, they can establish a Choquet representation. Unfortunately, the motivation and real importance of this representation are not well described, especially given that Choquet is mentioned in the title of the paper. Nonetheless, the mixture representation is a remarkable contribution because it allows the practitioner to test whether some forecasting method simultaneously dominates its competitors over an entire class of scoring functions.
Unfortunately, however, as the authors state in their discussion, ‘dominance may not be very commonly observed in practice’. Therefore the practitioner most probably observes multiple methods, each dominating over different values of θ. To proceed, the practitioner must then decide which values of θ are more important for the application at hand, i.e. the practitioner must choose a mixing measure, which is essentially equivalent to picking a scoring function. Therefore the initial problem of choosing a scoring function is typically not avoided.
Choosing a mixing measure, of course, requires a clear understanding of what different values of θ represent. For this, the authors interpret the different extremal functions in terms of different betting games. Even though the discussion therein is very interesting, the interpretations are overall rather cumbersome for comparing forecasting methods across different values of θ. This questions whether a Murphy diagram for probability forecasts of binary events, such as Fig. 4(a), should be used over a regular reliability plot. After all, a reliability plot describes the forecasting method in terms of simple traits such as sharpness, overconfidence and underconfidence, which together are suggestive of the overall Brier score. This intuitive appeal makes it easier for the practitioner to evaluate any trade‐offs and to proceed with a final method.
Milan Stehlík (Johannes Kepler University, Linz, and University of Valparaíso)
I congratulate the authors on this paper, which introduces readers to a challenging world of consistent scoring functions for quantiles and expectiles! I would like to point out the intrinsic relationship between information divergences and scores. If the underlying distributions are from a regular exponential family (see Barndorff‐Nielsen (1978)), with the sufficient statistics t(y) for the canonical parameter γ, then we can directly relate scoring function S(x,y) between the point forecast x and the realized observation y to Kullback–Leibler (KL) divergence I(y,x) of the observed vector y in the sense of Pázman (1993).
Then, the squared error scoring function S(x, y) = (x − y)² corresponds to I(y,x) for linear regression with normal errors (see Stehlík (2003), page 151).
Consider also the family of scoring functions on page 511 of the paper, which for the particular point b=0 has the form of the KL divergence

I(y, x) = y/x − log(y/x) − 1,

corresponding to gamma-distributed observations with mean x and fixed shape parameter. The exact distribution of the KL divergence of the observed vector y has been studied in Stehlík (2003), and such studies can give optimal tests for forecast dominance, as a complement to the Murphy diagram. An application of this divergence with an underlying distribution for methane emission can be found in Sabolová et al. (2015).
Thus, aside from a useful study of scoring functions, it will be useful to mention their direct relationship to information theory, in particular KL divergences and KL scores. The Bregman score has been extensively studied (Gneiting and Raftery, 2007; Murata et al., 2004; Hendrickson and Buehler, 1971); however, the KL score could also be integrated in the whole picture. Decompositions of information divergences from a broader perspective should be of interest (see for example Stehlík (2012)). Thus the relationship between ϕ-divergence and statistical information in the sense of DeGroot (see DeGroot (1970)) can be obtained. In DeGroot and Fienberg (1983) or Weijs et al. (2010), a decomposition of the divergence score was presented. This was inspired by a decomposition of the Brier score (see Brier (1950)) into uncertainty, reliability and resolution.
In general, aside from consistency of scores, one should understand the different learning mechanisms hidden behind different choices of scores, together with their distributional properties. This is needed for good Basel practices, which may require, for example, consistent and robust M-estimation (see Beran et al. (2014) for an example).
Ingo Steinwart (University of Stuttgart)
I first thank the authors for an interesting and stimulating paper. As they briefly indicate in their discussion on best practices, scoring functions not only occur in the forecasting community but also in machine learning or statistical learning theory. Since there, scoring functions, or, in their terminology, loss functions, have been investigated independently for quite some time (see for example Bartlett et al. (2006), Tewari and Bartlett (2005), Zhang (2004), Steinwart (2007), Reid and Williamson (2011), Calauzènes et al. (2012, 2013), Scott (2012), Duchi et al. (2013), Menon and Williamson (2014), Agarwal and Agarwal (2015), Frongillo and Kash (2015) and Ramaswamy and Agarwal (2016) and the many more reference mentioned therein), I would like to use this contribution to describe briefly the role of scoring functions in machine learning.
In machine learning, the goal is, for a given scoring function S and data D = {(x_1, y_1), …, (x_n, y_n)} drawn from some unknown distribution P on X × Y, to find a function f_D such that its average score, or risk,

R_S(f_D) = E_{(x,y)∼P}[S{f_D(x), y}],

is close to the minimal risk R*_S = inf_f R_S(f). Assume that there is an f*_S satisfying R_S(f*_S) = R*_S and that f*_S is almost surely unique, i.e. f*_S(x) is almost surely a property of P(·|x) that can be elicited by S. Now, using data D, we can rarely hope to obtain f_D = f*_S, and thus we need a performance measure that quantifies how close f_D and f*_S are. Here, three fundamentally different possibilities exist and are considered in machine learning.
- (a) We are purely interested in S-risk minimization, in which case the excess risk

R_S(f_D) − R*_S, (48)

is the sole quantity of interest.
- (b) Our goal is to estimate f*_S, in which case we typically describe the quality of f_D by some suitable norm, i.e. ‖f_D − f*_S‖.
- (c) We are actually interested in S̃-risk minimization for a different scoring function S̃, but, for example, for algorithmic reasons we can only perform S-risk minimization.
Note that, in both (b) and (c), the scoring function S serves only as an instrument for solving the actual problem at hand, and therefore it is natural to ask which S is the ‘best instrument’, how expression (48) relates to the actual problem for different choices of S, and whether suitable scoring functions S exist that have additional, algorithmically interesting properties such as convexity and/or differentiability. The Choquet-type representation given by Ehm and his colleagues may make it possible to attack these questions with a new set of tools.
Samuel L. Ventura and Rebecca Nugent (Carnegie Mellon University, Pittsburgh)
The authors provide a detailed, theoretical framework for evaluating, comparing and ranking forecasts with scoring functions. Highlighting that forecasters might be required to report a mean or quantile of a distribution of predictions, they assert that scoring functions must be used when comparing forecasts and they provide several theoretical results relating to the consistency of quantiles and expectiles scoring functions, and conditions under which they can be expressed in mixture (and Choquet) representations. We thank them for their efforts, as their results should be of interest to both industry and academic researchers.
The authors’ efforts to make their results accessible and interpretable to a broad audience are evident in their two detailed profit–loss economic examples which, in conjunction with the easily understandable pay‐off structure overview in Table 1, allow readers to interpret properly the authors’ theoretical results in real world contexts. In future work, it would be interesting to see a similar discussion of the theoretical results on order sensitivity in a real world context. For example, one could use the set‐up of the authors’ first profit‐loss example to demonstrate the results on order sensitivity for quantiles and expectiles in proposition 2.
The theoretical results and empirical extensions relating to ranking forecasts are perhaps the most important work in this paper. Although (as the authors point out) some past work has been done in this area, the authors establish both theoretical and empirical rules for forecast dominance and discuss how visualization tools like the Murphy diagram can be used to help future researchers to compare forecasts. These findings should be highly valued by both the broader research community and industry, where ranking forecasts is of growing importance. That said, the focus on forecast dominance may come at the expense of ranking groups of non-dominant forecasts. Equation (28) requires specific values of θ to rank forecasts. How do we rank forecasts when θ is unknown, or independently of θ? Is there some test statistic that can be derived by considering all possible values of θ when comparing forecasts? How can this information be incorporated in visualization tools like Murphy diagrams? Finally, can variability around the expected score be similarly incorporated (e.g. via confidence bands) to compare forecasters better visually?
Given that dominance is an ‘all‐or‐nothing’ property, there are likely to be many practical applications where the better forecaster is obvious despite strict dominance not holding. How dominant is dominant enough?
Johanna F. Ziegel (University of Bern)
I congratulate the authors on a great piece of work. It provides genuinely new insights into the evaluation of forecasts for quantiles and expectiles, which may be relevant and influential well beyond the applications that were studied in Sections 3.3, 3.4 and 4. The authors point to many directions for future research in their discussion, of which the considerations concerning the choice of scoring functions in regression problems appear particularly interesting.
The concept of forecast dominance that was introduced in definition 2 is intriguing. The authors give a collection of sufficient criteria for situations where forecast dominance occurs in Section 3.2. However, the question of how strong a requirement forecast dominance really is merits further study. For instance, can one construct an interesting example of two forecasters with non-nested information sets such that one dominates the other? A second interesting example would be one of two forecasters, one dominating the other, possibly with nested information sets, but neither of them ideal.
Let V = V(t, y) be an oriented identification function for the functional at hand; see Steinwart et al. (2014), section 3, for the definition of an oriented identification function. Any sufficiently smooth consistent scoring function can be written as a mixture
S(x, y) = ∫ S_v(x, y) dH(v)
of the elementary scoring functions S_v(x, y) = (1{v ⩽ x} − 1{v ⩽ y}) V(v, y) which, parameterized by v, are consistent. In this context it is natural to pose the question of whether there is a version of Osband's principle without strong differentiability assumptions on the scoring function. Possibly, using an approximation argument, one might be able to show that all consistent scoring functions satisfying minimal smoothness assumptions can be written in this mixture form.
The authors replied later, in writing, as follows.
We organize our reply under the following headings:

- interpretation,
- forecast dominance, i.e. strength of the notion, sampling variability and inference,
- forecast combinations and dominance graphs,
- Choquet representations and
- other points.
1. Interpretation
- For quantile forecasts at level α ∈ (0,1), the Murphy diagram illustrates how much less a forecast makes on average, compared with an oracle, in a binary betting problem at the event threshold θ, in the monetary unit of the betting return.
- For probability forecasts of a binary event, the Murphy diagram illustrates how much less a forecast makes on average, compared with an oracle, in the respective betting problem with cost–loss ratio θ ∈ (0,1), in the monetary unit of the betting return.
We emphasize that the interpretation is in monetary terms, regardless of the original unit of the predictand, y: multiplying the value on the diagram's vertical axis by the monetary unit of the betting return yields the monetary regret. Consider the stock market example of Lenkoski and Thorarinsdottir, and suppose that this unit equals $1000. The Murphy diagram shows that, for a bet with threshold −0.03, the two forecasts earn about $6 less than an oracle on average. This difference seems small, and it can be explained by the fact that Lenkoski and Thorarinsdottir consider quantile forecasts at level α = 0.005. At so extreme a level the bet is an unattractive lottery, which one should mostly reject. Hence the decision to be made is not difficult, which explains why both forecast models incur only a small monetary regret compared with the oracle. In the wind speed example in our paper (the bottom rows of Figs 5 and 6), we have α = 0.90, which corresponds to a much more attractive lottery. For a bet with threshold 12, the regime switching space–time forecast earns about $20 less on average, and the auto-regressive forecast about $30 less, than an oracle.
Mean and expectile forecasts. In the case of mean and expectile forecasts, Osband links economic cost to statistical efficiency considerations. However, the interpretation of the Murphy diagram is less straightforward, as the elementary score is now in the original unit of the predictand, y. When y is monetary, our interpretation in terms of an investment problem with differential tax rates applies. Otherwise, a detour via financial options may allow for such an interpretation. As any consistent scoring function admits a mixture representation in terms of elementary scoring functions, it admits an interpretation in terms of the original unit in the applied problem at hand. In particular, this holds for squared error, thereby raising the intriguing possibility of interpreting (mean‐) squared error directly in the original unit, rather than its square.
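As a concrete check of the last point, the mixture representation of squared error can be verified numerically. The sketch below takes the elementary score for the mean in the form S_θ(x, y) = |y − θ| if θ lies between x and y, and 0 otherwise (one common way of writing the extremal score, possibly up to a constant factor relative to expression (12)); numerically integrating 2 S_θ over θ then recovers (x − y)²:

```python
import numpy as np

def elementary_score_mean(x, y, theta):
    """|y - theta| if theta lies between forecast x and outcome y, else 0."""
    lo, hi = min(x, y), max(x, y)
    return abs(y - theta) if lo <= theta < hi else 0.0

def squared_error_via_mixture(x, y, grid_step=1e-3):
    """Recover (x - y)**2 as 2 * integral of S_theta(x, y) over theta."""
    thetas = np.arange(min(x, y), max(x, y), grid_step)
    scores = np.array([elementary_score_mean(x, y, t) for t in thetas])
    return 2.0 * scores.sum() * grid_step

x, y = 1.7, -0.4
print(squared_error_via_mixture(x, y))  # close to (1.7 + 0.4)**2 = 4.41
```

The Riemann sum converges to the exact squared error as the grid step shrinks, which illustrates how a Bregman-type score arises as a mixture of the two-valued elementary scores.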
2. Forecast dominance: strength of the notion, sampling variability, and inference
Theoretical example. In response to Ziegel's request, we give a simple example of two forecasts with non-nested information sets such that one dominates the other. The general idea is that, if one forecast has exclusive access to a highly informative explanatory variable, then it may dominate a competitor with exclusive access to a weakly informative variable, even though the information sets may not be nested. Specifically, let Y = X₁ + X₂ + ε, where X₁, X₂ and ε are independent Gaussian variables with mean 0 and variances 2, 1 and 1 respectively. Suppose that Anne's forecast of the mean of Y is given by X₁, whereas Bob's forecast is given by X₂. The expected elementary scores for the mean functional can be computed explicitly and yield the Murphy diagram in Fig. 15.

Fig. 15. Murphy diagram for the theoretical example: expected elementary scores for Anne's forecast, Bob's forecast and the simple average of the two forecasts
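The expected elementary scores in this example can also be approximated by simulation. The sketch below assumes the elementary score for the mean functional in the form S_θ(x, y) = |y − θ| if θ lies between the forecast x and the outcome y, and 0 otherwise (one common parameterization, possibly up to a constant factor); Anne's Murphy diagram curve should then lie below Bob's for every θ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Data-generating process from the example: Y = X1 + X2 + eps, with
# independent zero-mean Gaussians of variances 2, 1 and 1 respectively.
x1 = rng.normal(0.0, np.sqrt(2.0), n)
x2 = rng.normal(0.0, 1.0, n)
y = x1 + x2 + rng.normal(0.0, 1.0, n)

def elementary_score_mean(x, y, theta):
    """Elementary score for the mean functional: |y - theta| if theta
    lies between forecast x and outcome y, and 0 otherwise."""
    between = (np.minimum(x, y) <= theta) & (theta < np.maximum(x, y))
    return np.where(between, np.abs(y - theta), 0.0)

thetas = np.linspace(-4.0, 4.0, 81)
anne = [elementary_score_mean(x1, y, t).mean() for t in thetas]
bob = [elementary_score_mean(x2, y, t).mean() for t in thetas]
```

Plotting `anne` and `bob` against `thetas` should reproduce the shape of the curves in Fig. 15, with Anne's forecast dominating Bob's.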
Practical relevance, sampling variability and choice of sampling period. Diebold, Jolliffe, Patton and Pinson, among others, note that forecast dominance is a strong requirement that may not be met all too commonly in practice. Furthermore, sampling variability may obscure dominance relations, as noted by Ventura and Nugent and illustrated very nicely in the simulation example by Holzmann and Klar. Rossi points out that forecast rankings tend to be unstable over time, and she suggests accounting for this possibility when analysing Murphy diagrams. For this, we have implemented the fluctuation test of Giacomini and Rossi (2010), using elementary scores as loss functions, with values θ in the centre of the support of the predictand. In a nutshell, the fluctuation test concerns the null hypothesis that two forecasts perform equally well at all time points. Fig. 16 presents the results for mean forecasts of inflation and wind speed. The null hypothesis of equal performance is not rejected for the inflation data, whereas, for the wind data, the test statistic repeatedly falls below the critical value. Thus, the interpretations in our paper are qualitatively robust across different choices of the sampling period. A next step is to account for the possibility that forecast rankings depend on both the sampling period and the mixture parameter θ, and we encourage future work along these lines.

Fig. 16. Fluctuation test for mean forecasts of inflation and wind speed: test statistic defined in equation (1) of Giacomini and Rossi (2010); critical values from Giacomini and Rossi's (2010) Table 1 (two-sided test; 5% level; μ=0.5)
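A simplified version of the fluctuation statistic can be sketched as follows. This sketch standardizes by the full-sample standard deviation rather than the HAC variance estimator of Giacomini and Rossi (2010), and the function name and default window fraction μ = 0.5 are our choices:

```python
import numpy as np

def fluctuation_statistics(d, mu=0.5):
    """Rolling standardized means of the loss differential series d: for
    each window of length m = floor(mu * n), compute
    sqrt(m) * mean(window) / sd(d).  Large absolute values in some window
    indicate locally unequal predictive performance of the two forecasts."""
    d = np.asarray(d, dtype=float)
    n = d.size
    m = int(np.floor(mu * n))
    sd = d.std(ddof=1)
    return np.array([np.sqrt(m) * d[i:i + m].mean() / sd
                     for i in range(n - m + 1)])
```

Under equal performance the rolling statistics fluctuate around 0; a persistent score advantage of one forecast pushes every window's statistic away from 0.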
Formal inference. The above comment demonstrates the need for formal inference tools and tests for forecast dominance, as called for by Ghanem, Patton, and Ventura and Nugent, among others. The permutation test that was sketched at the end of our Section 3.4 provides such a tool, even though we concede that the null hypothesis formally tested, namely the invariance of the joint distribution of the vectors of score differences under arbitrary changes of sign in the n components, is quite restrictive. Still, asymptotically as n grows large, the effective discrepancy from the quite liberal null hypothesis of mean 0 score differences may be much reduced.
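A minimal sign-flip permutation test along these lines can be sketched as follows; the function name and implementation details are ours. It randomly flips the signs of the score differences and compares the observed mean difference with the resulting permutation distribution:

```python
import numpy as np

def sign_flip_test(d, n_perm=10_000, seed=0):
    """Permutation test of the null hypothesis that the distribution of
    the score differences d is invariant under arbitrary sign changes.
    Returns a two-sided p-value based on the mean score difference."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_means = np.abs((signs * d).mean(axis=1))
    # add-one correction avoids p-values of exactly 0
    return (1 + np.sum(perm_means >= observed)) / (n_perm + 1)
```

For score differences centred at 0 the p-value is typically large; a systematic score advantage of one forecast drives it towards 0.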
3. Forecast combinations and dominance graphs

For the mean functional, the elementary scores are monotone step functions of the forecast, whence S_θ(w₁x₁ + … + w_l x_l, y) ⩽ max{S_θ(x₁, y), …, S_θ(x_l, y)} whenever w₁x₁ + … + w_l x_l, with non-negative weights summing to 1, is a convex combination of l individual forecasts. An analogous inequality holds in the case of quantiles. Therefore, a linearly combined forecast performs no worse than the worst individual forecast. Similar insurance-type properties of forecast combinations have been noted in other contexts (e.g. Kascha and Ravazzolo (2010), page 237) and provide a partial explanation for the empirical success of linear combination formulae. The examples by de Carvalho and Rua nicely illustrate that weights for one value of θ may not perform well for other values. Hence their proposal to use some summary measure when estimating combination formulae is natural. Casarin and Ravazzolo take a different route, by making the combination weights depend on θ.
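The insurance-type property can be checked mechanically: as a function of the forecast, the elementary score for the mean is a two-valued monotone step function, so a convex combination never scores worse than the worst component. A sketch, again assuming the |y − θ|-type parameterization of the elementary score (our choice; expression (12) may differ by a constant factor):

```python
import numpy as np

def elementary_score_mean(x, y, theta):
    """|y - theta| if theta lies between forecast x and outcome y, else 0."""
    return abs(y - theta) if min(x, y) <= theta < max(x, y) else 0.0

rng = np.random.default_rng(3)
for _ in range(1000):
    forecasts = rng.normal(0.0, 1.0, 5)      # l = 5 individual forecasts
    weights = rng.dirichlet(np.ones(5))      # convex combination weights
    y, theta = rng.normal(0.0, 1.0, 2)
    combined = float(weights @ forecasts)
    # the combined forecast never scores worse than the worst individual one
    assert elementary_score_mean(combined, y, theta) <= max(
        elementary_score_mean(x, y, theta) for x in forecasts)
```

The argument behind the assertion: the combined forecast lies between the smallest and the largest individual forecast, and the elementary score is monotone in the forecast, so its value at the combination is bounded by its value at one of the extremes.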
Elimination strategies: weeding out poor forecasters. Merkle and Patton ask whether Murphy diagrams can be used to identify subsets of forecasters which should enter a combination formula. In the simple example in Fig. 15, an equally weighted average of Anne's and Bob's forecasts dominates Bob, and it attains better scores than Anne for many values of θ. Hence, despite being dominated by Anne, Bob's forecasts can still be useful in a combination context. This suggests that there may be no simple analytical justification for eliminating dominated forecasters. Nevertheless, as mentioned by Patton, eliminating dominated forecasts may be a very useful strategy in empirical settings with many forecasters and overlapping information sets, such as in big on‐line surveys (Tetlock and Gardner, 2015).
Dominance graphs. As noted in the paper, the dominance and empirical dominance relations induce partial orders among forecasts. The partial order can be represented in the form of a directed graph, which we call a dominance graph, as illustrated in Fig. 17 for the very simple, synthetic example of the 10 probability forecasts in Table A.1 of Merkle and Steyvers (2013). Forecast ID 3 dominates IDs 6 and 8, and ID 5 dominates ID 10; the forecast IDs at the top of the graph are those that are not dominated. In big surveys with possibly redundant forecasts, dominance graphs may become much more complex and much more informative. They give rise to simple pruning algorithms for the identification of forecasts that are to be combined or considered further. In the simplest case, one might restrict attention to the subset of the forecasts that are dominated by at most two other forecasters, say.
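Empirical dominance relations, and the resulting graph, can be computed directly from a matrix of average elementary scores. In the sketch below the function name, the tolerance argument and the score values are ours and purely illustrative (they are not the data of Merkle and Steyvers (2013)):

```python
import numpy as np

def dominance_graph(avg_scores, tol=0.0):
    """avg_scores: array of shape (k, m) with average elementary scores of
    k forecasters over a grid of m values of theta.  Forecaster i dominates
    forecaster j if i's average score is <= j's for every theta, with strict
    inequality for at least one.  Returns {i: set of j dominated by i}."""
    k = avg_scores.shape[0]
    graph = {i: set() for i in range(k)}
    for i in range(k):
        for j in range(k):
            if i != j and np.all(avg_scores[i] <= avg_scores[j] + tol) \
                    and np.any(avg_scores[i] < avg_scores[j]):
                graph[i].add(j)
    return graph

# three forecasters, four values of theta (illustrative numbers)
scores = np.array([[1.0, 2.0, 1.5, 0.5],
                   [1.2, 2.5, 1.5, 0.7],
                   [0.9, 2.6, 1.4, 0.6]])
print(dominance_graph(scores))  # {0: {1}, 1: set(), 2: set()}
```

Pruning then amounts to discarding every forecaster that appears as a target in sufficiently many edge sets of the graph.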

4. Choquet representations
Generalizations. Several discussants ask for analogues of our mixture representations in various settings, with reference to other types of univariate functionals (Critchley and his colleagues, Dawid, Demetrescu and Ziegel), multivariate functionals (Mitchell and Ferro, and Pinson) and divergences (Critchley and his colleagues and Stehlík).
Further types of elicitable functionals. Dawid and Ziegel consider other types of functionals in the general univariate case, and Demetrescu in the specific case built around the asymmetric power loss function. Starting from any single-valued functional that admits an identification function V = V(t, y) which is non-decreasing in t, Dawid's construction yields an explicit mixture representation in terms of a linearly parameterized family of elementary scoring functions that derive from V. Ziegel argues along the same lines, using the weaker condition on V of being oriented in the sense of equation (4) in Steinwart et al. (2014), which still implies consistency. She also notes the connection to Osband's principle and to the development in Steinwart et al. (2014), where the class of all elicitable real-valued functionals is characterized, subject to quite stringent conditions. Steinwart et al. (2014) showed that the convexity of level sets, which was shown by Osband (1985) to be a necessary condition for elicitability, is sufficient for the existence of an oriented identification function. Once the latter has been given, Dawid's and Ziegel's reverse construction yields consistent scoring functions of virtually the same form as obtained by Steinwart et al. (2014); see their equations (8), (12) and (18). Whether or not the construction exhausts all consistent scoring functions for a given functional appears to depend on intricate regularity conditions.
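The construction from an identification function can be made concrete in code. The sketch below uses the standard identification function V(t, y) = 1{y ⩽ t} − α for the α-quantile and forms elementary scores S_v(x, y) = (1{v ⩽ x} − 1{v ⩽ y}) V(v, y); the indicator convention at ties is our choice, and the numerical checks use continuous draws so that boundary cases do not arise:

```python
import numpy as np

def V_quantile(alpha):
    """Oriented identification function for the alpha-quantile."""
    return lambda t, y: (1.0 if y <= t else 0.0) - alpha

def elementary_score(V):
    """Construction S_v(x, y) = (1{v <= x} - 1{v <= y}) * V(v, y)."""
    return lambda v, x, y: (int(v <= x) - int(v <= y)) * V(v, y)

alpha = 0.8
S = elementary_score(V_quantile(alpha))

# The elementary scores are nonnegative and vanish for a perfect forecast,
# as required of consistent scoring functions.
rng = np.random.default_rng(4)
for v, x, y in rng.normal(0.0, 1.0, (1000, 3)):
    assert S(v, x, y) >= 0.0
    assert S(v, y, y) == 0.0
```

Integrating S_v over v recovers the asymmetric piecewise linear (pinball) score, which illustrates the mixture representation for the quantile functional.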
Multicomponent functionals. Mixture representations for consistent scoring functions in the case of vector-valued elicitable functionals and related settings are highly desirable. However, extensions from the one- to higher dimensional cases are not possible in general. This is because proper scoring rules (and hence consistent scoring functions) arise as supergradients of concave functionals, and simple mixture representations of concave (or convex) functions as used in our paper are unavailable even in two dimensions. A notable exception concerns interval forecasts in the form of the 1−2α central prediction interval, which corresponds to the two-dimensional functional with components given by the univariate quantile functionals at levels α and 1−α respectively. By proposition 4.2 of Fissler and Ziegel (2015), any consistent scoring function in this setting is the sum of a consistent scoring function for the α-quantile and a consistent scoring function for the (1−α)-quantile, so our results and representations carry over immediately. Although this is a rather special case from a theoretical perspective, it is one of strong applied relevance, as discussed in Section 6.2 of Gneiting and Raftery (2007).
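The decomposition can be probed numerically: summing a consistent scoring function for the α-quantile and one for the (1−α)-quantile (below, the standard asymmetric piecewise linear scores, our choice of representative) yields an interval score whose expectation is minimized by the pair of true quantiles. A Monte Carlo sketch for standard Gaussian outcomes:

```python
import numpy as np

def pinball(alpha, x, y):
    """Asymmetric piecewise linear score, consistent for the alpha-quantile."""
    return (np.where(y < x, 1.0, 0.0) - alpha) * (x - y)

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, 400_000)
alpha = 0.1

def interval_score(lo, hi):
    """Sum of consistent scores for the alpha- and (1 - alpha)-quantiles."""
    return (pinball(alpha, lo, y) + pinball(1 - alpha, hi, y)).mean()

# the true central 80% prediction interval for N(0, 1): about (-1.2816, 1.2816)
best = interval_score(-1.2816, 1.2816)
assert best < interval_score(-2.0, 1.2816)   # lower endpoint too low
assert best < interval_score(-1.2816, 2.0)   # upper endpoint too high
assert best < interval_score(-0.5, 0.5)      # interval too narrow
```

The averages confirm that perturbing either endpoint away from the corresponding quantile increases the expected interval score.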

Divergences. Analogous constructions apply in the case of divergences, with a family of elementary divergences parameterized by θ > 0. An immediate consequence, for measures P absolutely continuous with respect to Q, is a corresponding mixture representation of the divergence in terms of these elementary elements.
Mixture representations versus Choquet representations. For all practical purposes, it is the feasibility of a mixture representation in terms of parsimoniously parameterized elementary objects that matters in the above settings. From a more theoretical perspective, it is of interest to seek additional assumptions under which the mixture representations are in fact Choquet representations in the sense of functional analysis (Phelps, 2001). In this context, it is not immediate to us whether the aforementioned approach via Osband's principle yields Choquet representations, though it clearly yields mixture representations of practical use and appeal. To qualify as a Choquet representation, Dawid's elementary scoring functions need to be the extreme points of suitably restricted convex sets of scoring functions, a condition that in general seems difficult to check.
5. Other points
Estimation. Several discussants point to the role of scoring functions in estimation problems from the perspectives of statistics (Meng and Musio), machine learning (Liu and Steinwart) and econometrics (Pelenis). For example, Liu's proposal aims at estimates that are robust with respect to the choice of the scoring function, and Pelenis addresses the issue of dominance relations among forecasts based on estimates from misspecified models. Clearly, the question of the optimal choice of the loss or scoring function to be employed for estimation in regression problems continues to offer substantial challenges.
Miscellanea. Jolliffe, Jose and Satopää discuss the role of calibration or reliability. To assess these facets of forecast quality, reliability diagrams in the case of probability forecasts and exceedance tabulations in the case of quantile forecasts remain the methods of choice, particularly when the goal is to diagnose strengths and weaknesses of predictive models. In contrast, the key application of Murphy diagrams is in the comparison and ranking of competing forecasts. We second Mateu's call for theoretical and methodological tools that can cope with spatial dependences in the forecasts. Härdle and Huang hint at the challenging problem of forecast verification for tail events, which has recently been addressed by Lerch et al. (2015).
Conclusion
We are much obliged to our colleagues for an inspiring interdisciplinary discussion with contributions from the perspectives of statistics, machine learning, information theory, economics and meteorology. It is our hope that the discussion stimulates further interest in an area, at once classical and topical, for which Philip Dawid coined the very appropriate term ‘reverse decision theory’ at the Pre-Ordinary Meeting in December 2015. We look forward to hearing about future methodological, theoretical and applied progress in this exciting field. An R package accompanying our paper (Jordan and Krüger, 2016) facilitates the implementation of Murphy diagrams in a wide range of applications.
Appendix A: Proofs
The specific structure of the scoring functions in expressions (5) and (7) permits us to focus on the case
in the subsequent proofs, with the general case α ∈ (0,1) then being immediate.
A.1. Proof of theorem 1
, and the relationship H(x)−H(y)=S(x,y)/(1−α) for x>y are straightforward consequences of the fact that, for every
and
,

the Bregman‐type function of two variables
(29)
for
and the relationship
for x>y are immediate consequences of the fact that, for all
and x<y,

A.2. Proof of proposition 1
, where
and
are of the form (5) with associated functions
. Then

we have
if y⩽x, and
if x⩽y, where j=1,2. It follows that
in the first case,
in the second case and
in the third case. This coincides with the value distribution of g(x)−g(y) when
, whence indeed
.
, where
and
are of the form (7) with associated functions
. Let
be defined as in expression (29). Then


, we may apply the same argument as in the quantile case to show that
, whence
.
A.3. Proof of proposition 2
in expression (10) suppose first that
. Since


, and under that condition we have F(θ)⩽α and
, whence the desired expectation inequality. The case
is handled analogously.
in expression (12) we assume first that
, where t denotes the α‐expectile of F. Since


Appendix B: Details for the synthetic example

We consider forecasts X of the α-quantile or the mean of the CDF-valued random quantity F. The scoring function S is the elementary quantile scoring function in expression (10) or the elementary expectile scoring function in expression (12). For example, if X is a quantile forecast for Y at level α ∈ (0,1), then the expected score is given by expression (30); an analogous formula applies for event probabilities, also.
| Forecast | α‐quantile | Mean |
|---|---|---|
| F | | |
| Perfect | | |
| Climatological | | |
| Unfocused | | |
| Sign reversed | | |

†For α ∈ (0,1), Φ and φ denote the CDF and the probability density function of the standard normal distribution respectively.
References in the discussion