Volume 78, Issue 3
Original Article

Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings

Werner Ehm

Heidelberger Institut für Theoretische Studien, Heidelberg, Germany

Tilmann Gneiting

Corresponding Author

Heidelberger Institut für Theoretische Studien, Heidelberg, Germany

Karlsruher Institut für Technologie, Karlsruhe, Germany

Address for correspondence: Tilmann Gneiting, Computational Statistics Group, Heidelberg Institute for Theoretical Studies, Schloss‐Wolfsbrunnenweg 35, 69118 Heidelberg, Germany. E‐mail: Tilmann.Gneiting@h-its.org
Alexander Jordan

Heidelberger Institut für Theoretische Studien, Heidelberg, Germany

Karlsruher Institut für Technologie, Karlsruhe, Germany

Fabian Krüger

Heidelberger Institut für Theoretische Studien, Heidelberg, Germany

First published: 10 May 2016
[Read before The Royal Statistical Society at a meeting organized by the Research Section on Wednesday, December 9th, 2015, Professor C. Jennison in the Chair ]

Summary

In the practice of point prediction, it is desirable that forecasters receive a directive in the form of a statistical functional. For example, forecasters might be asked to report the mean or a quantile of their predictive distributions. When evaluating and comparing competing forecasts, it is then critical that the scoring function used for these purposes be consistent for the functional at hand, in the sense that the expected score is minimized when following the directive. We show that any scoring function that is consistent for a quantile or an expectile functional can be represented as a mixture of elementary or extremal scoring functions that form a linearly parameterized family. Scoring functions for the mean value and probability forecasts of binary events constitute important examples. The extremal scoring functions admit appealing economic interpretations of quantiles and expectiles in the context of betting and investment problems. The Choquet‐type mixture representations give rise to simple checks of whether a forecast dominates another in the sense that it is preferable under any consistent scoring function. In empirical settings it suffices to compare the average scores for only a finite number of extremal elements. Plots of the average scores with respect to the extremal scoring functions, which we call Murphy diagrams, permit detailed comparisons of the relative merits of competing forecasts.

1 Introduction

Over the past two decades, a broad transdisciplinary consensus has developed that forecasts ought to be probabilistic in nature, i.e. they ought to take the form of predictive probability distributions over future quantities or events (Gneiting and Katzfuss, 2014). Nevertheless, a wealth of applied settings require point forecasts, be it for reasons of decision making, tradition, reporting requirements or ease of communication. In this situation, a directive is required about the specific feature or functional of the predictive distribution that is being sought.

We follow Gneiting (2011) and consider a functional to be a potentially set‐valued mapping T(F) from a class of probability distributions $\mathcal{F}$ to the real line $\mathbb{R}$, with the mean or expectation functional, quantiles and expectiles being key examples. Competing point forecasts are then compared by using a non‐negative scoring function S(x,y) that represents the loss or penalty when the point forecast x is issued and the observation y realizes. A critically important requirement on the scoring function is that it be consistent for the functional T relative to the class $\mathcal{F}$, in the sense that
$$\mathbb{E}_F\, S(t, Y) \le \mathbb{E}_F\, S(x, Y) \qquad (1)$$
for all probability distributions $F \in \mathcal{F}$, all t ∈ T(F) and all $x \in \mathbb{R}$. (Throughout the paper, the notation $\mathbb{E}_F$ indicates that the expectation is taken with respect to Y ∼ F.) If equality in expression (1) implies that x ∈ T(F), then the scoring function is strictly consistent. Thus, under a strictly consistent scoring function, a forecaster optimizes her expected score by giving a truthful and accurate assessment of the functional T(F).
To give a prominent example, the ubiquitous squared error scoring function $S(x,y) = (x-y)^2$ is strictly consistent for the mean or expectation functional relative to the class of probability distributions with finite variance. However, there are many alternatives. In a classical paper, Savage (1971) showed that, subject to weak regularity conditions, a scoring function is consistent for the mean functional if and only if it is of the form
$$S(x,y) = \phi(y) - \phi(x) - \phi'(x)(y-x) \qquad (2)$$
where the function ϕ is convex with subgradient $\phi'$; squared error arises when $\phi(t) = t^2$. Holzmann and Eulert (2014) proved that, when forecasts make ideal use of nested information bases, the forecast with the broader information basis is preferable under any consistent scoring function.
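The Savage form (2) can be checked numerically. The following minimal sketch assumes a toy three-point distribution and two illustrative convex functions ϕ (both are choices made here, not taken from the paper); it searches a grid for the forecast with the smallest expected score, which should sit next to the mean in both cases.

```python
import math

def bregman_score(x, y, phi, dphi):
    """Savage form (2): phi(y) - phi(x) - phi'(x) * (y - x)."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

def expected_score(x, outcomes, phi, dphi):
    """Expected score of the point forecast x when outcomes are equally likely."""
    return sum(bregman_score(x, y, phi, dphi) for y in outcomes) / len(outcomes)

# Two illustrative convex functions (hypothetical picks for this sketch).
squared = (lambda t: t * t, lambda t: 2.0 * t)   # reduces (2) to (x - y)^2
exponential = (math.exp, math.exp)               # phi(t) = exp(t)

outcomes = [0.0, 1.0, 4.0]                        # toy distribution
mean = sum(outcomes) / len(outcomes)
grid = [i / 100.0 for i in range(-100, 501)]      # candidate forecasts

for phi, dphi in (squared, exponential):
    best = min(grid, key=lambda x: expected_score(x, outcomes, phi, dphi))
    print("grid minimizer:", best, "mean:", mean)
```

Both scores agree on the optimal forecast, although they generally assign different penalties to suboptimal forecasts, which is why forecast rankings can depend on the choice of ϕ.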

However, in real world settings, as pointed out by Patton (2015), forecasts are hardly ever ideal, and the ranking of competing forecasts might depend on the choice of the scoring function. This had already been observed by Murphy (1977), Schervish (1989) and Merkle and Steyvers (2013), among others, in the important special case of a binary predictand, where y=1 corresponds to a success and y=0 to a non‐success, so that the mean of the predictive distribution provides a probability forecast for a success. As there is no obvious reason for a consistent scoring function to be preferred over any other, this raises the question which one of the many alternatives to use.

Our work is motivated by the quest for guidance in this setting. Theoretically, the key result is that, subject to unimportant regularity conditions, any function of the form (2) admits a mixture representation of the form
$$S(x,y) = \int_{-\infty}^{\infty} S_\theta(x,y)\,\mathrm{d}H(\theta)$$
where H is a non‐negative measure, and
$$S_\theta(x,y) = (y-\theta)_+ - (x-\theta)_+ - (y-x)\,\mathbf{1}(\theta < x) = \begin{cases} |y-\theta|, & \min(x,y) \le \theta < \max(x,y), \\ 0, & \text{otherwise}, \end{cases}$$
for $\theta \in \mathbb{R}$. Here and in what follows, we write $(t)_+ = \max(t,0)$ for the positive part of $t \in \mathbb{R}$ and $\mathbf{1}(A)$ for the indicator function of the event A. Thus every scoring function consistent for the mean can be written as a weighted average over elementary or extremal scores $S_\theta$. As an important consequence of the mixture representation, a point forecast that is preferable in terms of each $S_\theta$ is preferable in terms of any consistent scoring function. The elementary score $S_\theta$ is the distance of y from θ if x is on the opposite side from y, and otherwise it is zero. It can be seen as the loss, relative to an oracle, in an investment problem with cost basis θ and revenue y. If x and y are on the same side of θ, the forecast x entails the same decision that would be taken by an oracle, yielding a loss of zero. Otherwise, the deficit relative to the oracle is |y − θ| if losses due to the two types of error are equally weighted. In the generalization to be discussed in this paper, they may be weighted unequally, which yields scoring functions for expectile functionals; see Section 2.3.
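As a sanity check on the mixture representation, the sketch below assumes the elementary score equals |y − θ| when θ lies between the forecast and the outcome (more precisely, min(x,y) ≤ θ < max(x,y)) and zero otherwise, as described in the text. It integrates this score against a measure with constant Lebesgue density 2, which is the density ϕ''(θ) of ϕ(t) = t², and compares the result with squared error. The integration routine and its parameters are illustrative choices.

```python
def elementary_score(x, y, theta):
    """Elementary score: |y - theta| if theta separates x and y, else 0."""
    if min(x, y) <= theta < max(x, y):
        return abs(y - theta)
    return 0.0

def mixture_score(x, y, density, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of the elementary score
    against a mixing measure with Lebesgue density `density` on [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * step
        total += elementary_score(x, y, theta) * density(theta)
    return total * step

x, y = 0.3, 1.7
approx = mixture_score(x, y, lambda theta: 2.0, -5.0, 5.0)
print(approx, (x - y) ** 2)  # the two numbers should nearly coincide
```

Other densities give other consistent scoring functions for the mean; the integration range only needs to cover the interval between x and y, since the elementary score vanishes elsewhere.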
In empirical settings, point forecasts are compared on the basis of their average scores. Specifically, let us consider a sequence of triplets $(x_{i1}, x_{i2}, y_i)$ for i=1,…,n, where $x_{i1}$ and $x_{i2}$ are competing point forecasts and $y_i$ is the subsequent outcome. We may compare the two forecasts graphically, by plotting the respective empirical scores
$$s_j(\theta) = \frac{1}{n} \sum_{i=1}^{n} S_\theta(x_{ij}, y_i) \qquad (3)$$
for j=1 and j=2 versus θ. The empirical score vanishes if |θ| is sufficiently large and, to establish non‐inferiority, it suffices to compare $s_1(\theta)$ and $s_2(\theta)$ at finitely many values of θ; see Section 3.4. An example of this type of display, which we term a Murphy diagram, is shown in Fig. 1, where we consider point forecasts of wind speed at a major wind energy centre.
Fig. 1. Murphy diagrams for the comparison of point forecasts of wind speed at the Stateline wind energy centre, using a regime switching space–time or auto‐regressive technique (Gneiting et al., 2006) (the functional considered is the mean of the predictive distribution): (a) empirical scores $s_j(\theta)$ in expression (3) versus θ; (b) score differences along with pointwise 95% confidence intervals (a negative difference means that the regime switching forecast is preferable; for details, see Sections 3.3 and 3.4)
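The curves in a Murphy diagram are averages of elementary scores over the forecast cases, evaluated on a grid of thresholds θ. A minimal sketch for the mean functional follows; the toy forecasts, outcomes and grid are invented for illustration and are unrelated to the wind speed data.

```python
def elementary_score(x, y, theta):
    """Elementary score for the mean: |y - theta| if theta separates x and y."""
    if min(x, y) <= theta < max(x, y):
        return abs(y - theta)
    return 0.0

def murphy_curve(forecasts, outcomes, thetas):
    """Empirical score in expression (3), evaluated at each threshold theta."""
    n = len(outcomes)
    return [
        sum(elementary_score(x, y, theta) for x, y in zip(forecasts, outcomes)) / n
        for theta in thetas
    ]

# Hypothetical data: a sharp forecast and a constant (climatological) forecast.
outcomes = [1.0, 2.0, 3.0, 4.0]
sharp = [1.1, 2.1, 2.9, 4.2]
constant = [2.5, 2.5, 2.5, 2.5]

thetas = [k / 10.0 for k in range(0, 51)]
curve_sharp = murphy_curve(sharp, outcomes, thetas)
curve_constant = murphy_curve(constant, outcomes, thetas)

for theta, a, b in zip(thetas, curve_sharp, curve_constant):
    print(theta, round(a, 4), round(b, 4))
```

Plotting the two curves against θ gives panel (a) of such a diagram, and their difference gives panel (b) without the confidence band. A fixed grid is used here for simplicity; the finite set of thresholds that suffices for an exact comparison is discussed in Section 3.4.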

More generally, for both quantiles and expectiles the apparent wealth of consistent scoring functions can be reduced to a one‐dimensional family of readily interpretable elementary scores, in the sense that every consistent scoring function can be represented as a mixture from that family. The case of the mean or expectation functional, which includes probability forecasts for binary events as a further special case, is nested by the expectile functional. Traditionally, the expectile at level α ∈ (0,1) is the weighted centre of mass of a probability distribution when the probabilities to the right are weighted by α and the probabilities to the left by 1−α. Equivalently, the expectile is the number with respect to which the weighted squared deviation is minimized, where the squares of deviations to the right are weighted by α and those to the left by 1−α.
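To make the definition concrete, the following sketch computes the expectile of a finite sample. It uses the first‐order condition of the weighted squared deviation criterion, namely that α times the total excess above t equals 1−α times the total shortfall below t, and solves it by bisection. The function name and tolerance are illustrative choices.

```python
def expectile(sample, alpha, tol=1e-10):
    """Expectile at level alpha of the empirical distribution of `sample`.

    Solves alpha * sum((y - t)_+) = (1 - alpha) * sum((t - y)_+) for t.
    """
    def imbalance(t):
        right = sum(max(y - t, 0.0) for y in sample)  # mass to the right of t
        left = sum(max(t - y, 0.0) for y in sample)   # mass to the left of t
        return alpha * right - (1.0 - alpha) * left   # decreasing in t

    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample = [0.0, 1.0, 4.0]
print(expectile(sample, 0.5))  # coincides with the sample mean
print(expectile(sample, 0.8))  # shifted towards the right tail
```

At α = 1/2 the two weights agree and the condition reduces to the usual balance equation for the mean.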

The remainder of the paper is organized as follows. Section 2 is devoted to the key theoretical development, in which we state and discuss the mixture representations, relate to Choquet theory and order sensitivity, and provide economic interpretations of the elementary scores and the associated functionals. In Section 3, we apply the mixture representations to study forecast rankings and propose the aforementioned Murphy diagram for forecast comparisons. Illustrations on data examples follow in Section 4, where we revisit meteorological and economic case‐studies in the work of Gneiting et al. (2006), Rudebusch and Williams (2009) and Patton (2015). The paper closes with a discussion in Section 5. Proofs and computational details are deferred to appendices.

The data that are analysed and the programs that were used to analyse them can be obtained from

http://wileyonlinelibrary.com/journal/rss-datasets

2 Consistent scoring functions for quantiles and expectiles

Before focusing on the specific cases of quantiles and expectiles, we review general background material on the assessment of point forecasts, with emphasis on consistent scoring functions.

2.1 Consistent scoring functions

We first introduce notation and explain conventions. Let $\mathcal{F}_0$ denote the class of the probability measures on the Borel–Lebesgue sets of the real line $\mathbb{R}$. For simplicity, we do not distinguish between a measure $F \in \mathcal{F}_0$ and the associated cumulative distribution function (CDF). We follow standard conventions and assume that CDFs are right continuous. A function S defined on a rectangle $D = I \times I \subseteq \mathbb{R}^2$ is called a scoring function if S(x,y)⩾0 for all (x,y) ∈ D with S(x,y)=0 if x=y. Here, S(x,y) is interpreted as the loss or cost that is accrued when the point forecast x is issued and the observation y realizes. The scoring function is regular if it is jointly measurable and left continuous in its first argument x for every y.

In point prediction problems, it is rarely evident which functional of the predictive distribution should be reported. Guidance can be given implicitly, by specifying a loss function, or explicitly, by specifying a functional. The notion of consistency originates in this setting.

Consider a functional $T$ on a class $\mathcal{F} \subseteq \mathcal{F}_0$ on which the mapping is well defined. Usually, the functional is single valued, as in the case of the mean functional where we take $\mathcal{F}$ as the class $\mathcal{F}_1$ of the probability measures with finite first moment. More generally, the expectile at level α ∈ (0,1) of a probability measure $F \in \mathcal{F}_1$ is the unique solution t to the equation
$$\alpha \int_t^{\infty} (y-t)\,\mathrm{d}F(y) = (1-\alpha) \int_{-\infty}^{t} (t-y)\,\mathrm{d}F(y)$$
where $\alpha = 1/2$ corresponds to the mean functional (Newey and Powell, 1987). In the case of quantiles, the functional might be set valued. Specifically, the quantile functional at level α ∈ (0,1) maps a probability measure F to the closed interval $[q_\alpha^-(F), q_\alpha^+(F)]$, with lower limit $q_\alpha^-(F) = \sup\{t : F(t) < \alpha\}$ and upper limit $q_\alpha^+(F) = \sup\{t : F(t) \le \alpha\}$. The two limits differ only when the level set $F^{-1}(\alpha)$ contains more than one point, so typically the functional is single valued. Any number between $q_\alpha^-(F)$ and $q_\alpha^+(F)$ represents an α‐quantile and will be denoted $q_\alpha(F)$.
The scoring function S is consistent for a functional T relative to the class $\mathcal{F}$ if
$$\mathbb{E}_F\, S(t, Y) \le \mathbb{E}_F\, S(x, Y) \qquad (4)$$
for all probability measures $F \in \mathcal{F}$, all t ∈ T(F) and all point forecasts $x \in \mathbb{R}$. A functional T that admits a strictly consistent scoring function is called elicitable and can then be represented as the solution to an optimization problem, in that
$$T(F) = \arg\min_{x}\, \mathbb{E}_F\, S(x, Y).$$
Hence, if the goal is to minimize expected loss, the optimal strategy is to follow the requested directive in the form of a functional.

In what follows, we restrict attention to the quantile and expectile functionals. These are critically important in a gamut of applications, including quantile and expectile regression in general, and least squares (i.e. mean) and probit and logit (i.e. binary probability) regression in particular.

2.2 Mixture representations

The classes of the consistent scoring functions for quantiles and expectiles have been described by Savage (1971), Thomson (1979) and Gneiting (2011), and we review the respective characterizations in the setting of Gneiting (2011), where further detail is available.

Up to mild regularity conditions, a scoring function S is consistent for the quantile functional at level α ∈ (0,1) relative to the class $\mathcal{F}_0$ if and only if it is of the form
$$S(x,y) = (\mathbf{1}(y<x) - \alpha)\,(g(x) - g(y)) \qquad (5)$$
where g is non‐decreasing. The most prominent example arises when g(t)=t, which yields the asymmetric piecewise linear scoring function
$$S(x,y) = (\mathbf{1}(y<x) - \alpha)\,(x - y) \qquad (6)$$
that lies at the heart of quantile regression (Koenker and Bassett, 1978; Koenker, 2005). Similarly, a scoring function is consistent for the expectile at level α ∈ (0,1) relative to the class $\mathcal{F}_1$ if and only if it is of the form
$$S(x,y) = |\mathbf{1}(y<x) - \alpha|\,\bigl(\phi(y) - \phi(x) - \phi'(x)(y-x)\bigr) \qquad (7)$$
where ϕ is convex with subgradient $\phi'$. The key example arises when $\phi(t) = t^2$, where
$$S(x,y) = |\mathbf{1}(y<x) - \alpha|\,(x-y)^2 \qquad (8)$$
This is the loss function that is used for estimation in expectile regression (Newey and Powell, 1987; Efron, 1991), including the case $\alpha = 1/2$ of ordinary least squares regression.
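The two key examples, the asymmetric piecewise linear score (6) and the asymmetric squared error (8), are easy to code, and their consistency can be illustrated on an empirical distribution. The sketch below is illustrative only: the sample and the search grid are made up, and a grid search stands in for a proper optimizer.

```python
def indicator(condition):
    return 1.0 if condition else 0.0

def pinball_loss(x, y, alpha):
    """Asymmetric piecewise linear score, expression (6)."""
    return (indicator(y < x) - alpha) * (x - y)

def asymmetric_squared_loss(x, y, alpha):
    """Asymmetric squared error, expression (8)."""
    return abs(indicator(y < x) - alpha) * (x - y) ** 2

def average_loss(loss, x, sample, alpha):
    return sum(loss(x, y, alpha) for y in sample) / len(sample)

sample = [float(k) for k in range(10)]        # outcomes 0, 1, ..., 9
grid = [k / 100.0 for k in range(0, 901)]     # candidate point forecasts

# The 0.3-quantile of this sample is the whole interval [2, 3].
best_q = min(grid, key=lambda x: average_loss(pinball_loss, x, sample, 0.3))
# At alpha = 1/2 the expectile is the mean, here 4.5.
best_e = min(grid, key=lambda x: average_loss(asymmetric_squared_loss, x, sample, 0.5))
print(best_q, best_e)
```

The average piecewise linear score is flat on the interval [2, 3], so any point in it is a minimizer; this is the set‐valued behaviour of the quantile functional in a small example.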

In view of the representations (5) and (7), the scoring functions that are consistent for quantiles and expectiles are parameterized by the non‐decreasing functions g and the convex functions ϕ with subgradient $\phi'$ respectively. In general, neither g nor ϕ and $\phi'$ are uniquely determined. We therefore select special versions of these functions. Furthermore, in the interest of simplicity we generally assume that $I = \mathbb{R}$, adding comments in cases where there are finite boundary points. Let $\mathcal{I}$ denote the class of all left‐continuous non‐decreasing real functions g, and let $\mathcal{C}$ denote the class of all convex real functions ϕ with subgradient $\phi' \in \mathcal{I}$. This last condition is satisfied when $\phi'$ is chosen to be the left‐hand derivative of ϕ, which exists everywhere and is left continuous by construction.

In what follows, we use the symbol $\mathcal{S}^{Q}_{\alpha}$ to denote the class of the scoring functions S of the form (5) where $g \in \mathcal{I}$. Similarly, we write $\mathcal{S}^{E}_{\alpha}$ for the class of the scoring functions S of the form (7) where $\phi \in \mathcal{C}$. For all practical purposes, the families $\mathcal{S}^{Q}_{\alpha}$ and $\mathcal{S}^{E}_{\alpha}$ can be identified with the classes of the regular scoring functions that are consistent for quantiles and expectiles respectively. These classes appear to be rather large. However, in either case the apparent multitude can be reduced to a one‐dimensional family of elementary scoring functions, in the sense that every consistent scoring function admits a representation as a mixture of elementary elements.

Theorem 1.

  1. (Quantiles): any member of the class $\mathcal{S}^{Q}_{\alpha}$ admits a representation of the form
    $$S(x,y) = \int_{-\infty}^{\infty} S^{Q}_{\alpha,\theta}(x,y)\,\mathrm{d}H(\theta) \qquad (9)$$
    where H is a non‐negative measure and
    $$S^{Q}_{\alpha,\theta}(x,y) = (\mathbf{1}(y<x) - \alpha)\,(\mathbf{1}(\theta<x) - \mathbf{1}(\theta<y)) = \begin{cases} 1-\alpha, & y \le \theta < x, \\ \alpha, & x \le \theta < y, \\ 0, & \text{otherwise}. \end{cases} \qquad (10)$$
    The mixing measure H is unique and satisfies dH(θ)=dg(θ) for $\theta \in \mathbb{R}$, where g is the non‐decreasing function in the representation (5). Furthermore, we have H(x)−H(y)=S(x,y)/(1−α) for x>y.
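  2. (Expectiles): any member of the class $\mathcal{S}^{E}_{\alpha}$ admits a representation of the form
    $$S(x,y) = \int_{-\infty}^{\infty} S^{E}_{\alpha,\theta}(x,y)\,\mathrm{d}H(\theta) \qquad (11)$$
    where H is a non‐negative measure and
    $$S^{E}_{\alpha,\theta}(x,y) = |\mathbf{1}(y<x) - \alpha|\,\bigl((y-\theta)_+ - (x-\theta)_+ - (y-x)\,\mathbf{1}(\theta<x)\bigr) = \begin{cases} (1-\alpha)\,|y-\theta|, & y \le \theta < x, \\ \alpha\,|y-\theta|, & x \le \theta < y, \\ 0, & \text{otherwise}. \end{cases} \qquad (12)$$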
    The mixing measure H is unique and satisfies $\mathrm{d}H(\theta) = \mathrm{d}\phi'(\theta)$ for $\theta \in \mathbb{R}$, where $\phi'$ is the left‐hand derivative of the convex function ϕ in the representation (7). Furthermore, we have $H(x) - H(y) = -\partial_2 S(x,y)/(1-\alpha)$ for x>y, where $\partial_2$ denotes the left‐hand derivative with respect to the second argument.

Note that the relationships (9) and (11) hold pointwise. In particular, the respective integrals are pointwise well defined. This is because for $x, y \in \mathbb{R}$ the functions $\theta \mapsto S^{Q}_{\alpha,\theta}(x,y)$ and $\theta \mapsto S^{E}_{\alpha,\theta}(x,y)$ are right continuous, non‐negative and uniformly bounded with bounded support, and because the non‐decreasing functions g and $\phi'$ define non‐negative measures dg and $\mathrm{d}\phi'$ that assign finite mass to any finite interval. In particular, given any non‐negative measure H that assigns finite mass to any finite interval, the representations (9) and (11) generate members of the classes $\mathcal{S}^{Q}_{\alpha}$ and $\mathcal{S}^{E}_{\alpha}$ respectively. Strict consistency is obtained in case H assigns positive mass to any finite interval.

In the case of quantiles, the asymmetric piecewise linear scoring function corresponds to the choice g(t)=t in equation (5), so the mixing measure H in the representation (9) is the Lebesgue measure. The elementary scoring function $S^{Q}_{\alpha,\theta}$ arises when $g(t) = \mathbf{1}(\theta < t)$, i.e. when H is a one‐point measure in θ.
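The statement about the Lebesgue measure can be verified numerically. The sketch below assumes that the elementary quantile score takes the value 1−α when y ≤ θ < x, the value α when x ≤ θ < y, and zero otherwise; it integrates this score over θ with a midpoint rule and compares the result with the asymmetric piecewise linear score. Function names and the integration range are illustrative.

```python
def extremal_quantile_score(x, y, theta, alpha):
    """Elementary quantile score at threshold theta (assumed piecewise form)."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0

def pinball_loss(x, y, alpha):
    """Asymmetric piecewise linear score, i.e. g(t) = t in (5)."""
    return ((1.0 if y < x else 0.0) - alpha) * (x - y)

def lebesgue_mixture(x, y, alpha, lo=-5.0, hi=5.0, n=20000):
    """Midpoint-rule integral of the elementary score over theta in [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(
        extremal_quantile_score(x, y, lo + (k + 0.5) * step, alpha)
        for k in range(n)
    )

for x, y, alpha in [(1.2, -0.4, 0.25), (-0.4, 1.2, 0.25), (0.5, 0.5, 0.9)]:
    print(lebesgue_mixture(x, y, alpha), pinball_loss(x, y, alpha))
```

Replacing the uniform weighting by a one‐point mass at a single θ returns the elementary score itself, which is the other extreme mentioned in the text.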

In the case of expectiles, the mixing measure for the asymmetric squared error scoring function is twice the Lebesgue measure. The choice $\alpha = 1/2$ recovers the mean or expectation functional, for which existing parametric subfamilies emerge as special cases of our mixture representation. Patton's (2015) exponential Bregman family,
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0095
which nests the squared error loss in the limit as a→0, corresponds to the choice urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0096 in equation (7). The mixing measure H in the representation (11) then has Lebesgue density h(θ)=exp(aθ) for $\theta \in \mathbb{R}$. For Patton's (2011) family
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0098
of homogeneous scoring functions on the positive half‐line the mixing measure has Lebesgue density urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0099, remarkably with no case distinction being required. The elementary scoring function $S^{E}_{\alpha,\theta}$ emerges when $\phi(t) = (t-\theta)_+$ in equation (7); here the mixing measure in representation (11) is a one‐point measure in θ.
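The claim that twice the Lebesgue measure yields the asymmetric squared error can be checked by direct integration. The short derivation below is a sketch that assumes the elementary expectile score equals (1−α)|y−θ| for y ≤ θ < x, α|y−θ| for x ≤ θ < y, and zero otherwise.

```latex
% Case y < x: only thresholds y <= theta < x contribute.
\int_{y}^{x} (1-\alpha)\,(\theta - y)\; 2\,\mathrm{d}\theta
  = (1-\alpha)\,(x-y)^2 ,
\\[4pt]
% Case x < y: only thresholds x <= theta < y contribute.
\int_{x}^{y} \alpha\,(y - \theta)\; 2\,\mathrm{d}\theta
  = \alpha\,(y-x)^2 ,
\\[4pt]
% Both cases combined:
\int_{-\infty}^{\infty} S^{E}_{\alpha,\theta}(x,y)\; 2\,\mathrm{d}\theta
  = |\mathbf{1}(y<x) - \alpha|\,(x-y)^2 .
```

The factor 2 is the second derivative of ϕ(t) = t², in line with the general rule that the mixing measure is generated by the left‐hand derivative of ϕ.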

From a theoretical perspective, a natural question is whether the mixture representations (9) and (11) can be considered Choquet representations in the sense of functional analysis (Phelps, 2001). A Choquet representation is a special, non‐redundant type of mixture representation. Specifically, a member S of a convex class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0102 is an extreme point of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0103 if it cannot be written as an average of two other members, i.e. if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0104 with urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0105 implies urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0106. Our mixture representations qualify as Choquet representations if the elementary scores urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0107 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0108 form extreme points of the underlying classes of scoring functions. This cannot possibly be true for our classes urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0109 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0110 because they are invariant under dilations and hence admit trivial average representations built with multiples of one and the same scoring function. Therefore, the families urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0111 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0112 need to be restricted suitably. Specifically, let the class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0113 consist of all functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0114 such that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0115 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0116. 
Similarly, let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0117 denote the family of all urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0118 such that ϕ(0)=0 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0119. These classes are convex and so are the associated subclasses of the families urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0120 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0121, which we denote by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0122 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0123 respectively. The elementary scores urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0124 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0125 evidently are members of these restricted families.

Proposition 1.

  1. (Quantiles): for every α ∈ (0,1) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0126, the scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0127 is an extreme point of the class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0128.
  2. (Expectiles): for every α ∈ (0,1) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0129, the scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0130 is an extreme point of the class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0131.

We thus have furnished Choquet representations for the consistent scoring functions for quantiles and expectiles. In the extant literature, such Choquet representations have been known in the binary case only, where y=1 corresponds to a success and y=0 to a non‐success, so that the mean p ∈ [0,1] of the predictive distribution provides a probability forecast for a success. In this setting, the Savage representation (7) for the members of the class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0132 reduces to
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0133
The mixture representation (11) can then be written as
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0134(13)
where H is a non‐negative measure and
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0135(14)
Up to unimportant conventions regarding coding, scaling and gain–loss orientation, this recovers the well‐known mixture representation of the proper scoring rules for probability forecasts of binary events (Shuford et al., 1966; Schervish, 1989). Different choices of the mixing measure yield the standard examples of scoring rules in this case; see Buja et al. (2005) and Table 1 in Gneiting and Raftery (2007). The widely used Brier score,
$$S(p,y) = (p-y)^2 \qquad (15)$$
arises when H is twice the Lebesgue measure.
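For the Brier score, the mixture can be worked out by hand. The derivation below is a sketch under an assumption about the form of the elementary binary score, written here as $S_\theta(p,y)$: it is taken to equal θ if y=0 and θ<p, to equal 1−θ if y=1 and p≤θ, and to vanish otherwise, with θ ranging over (0,1).

```latex
% Outcome y = 0: thresholds below the forecast p contribute theta.
\int_{0}^{p} \theta \; 2\,\mathrm{d}\theta = p^2 = (p-0)^2 ,
\\[4pt]
% Outcome y = 1: thresholds at or above p contribute 1 - theta.
\int_{p}^{1} (1-\theta) \; 2\,\mathrm{d}\theta = (1-p)^2 = (p-1)^2 .
```

In both cases the integral equals $(p-y)^2$, which is the Brier score.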
We close the section by noting a fundamental connection between the extremal scoring rules for quantiles, expectiles and probabilities in expressions (10), (12) and (14) respectively. Specifically, given any predictive CDF F and outcome urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0137,
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0138(16)
for every α ∈ (0,1) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0139. This relationship can facilitate computations, particularly in synthetic settings, as we exemplify in Appendix B.

2.3 Economic interpretation

Our results in the previous section give rise to natural economic interpretations of the extremal scoring functions $S^{Q}_{\alpha,\theta}$ and $S^{E}_{\alpha,\theta}$, along with the quantile and expectile functionals themselves. In either case, the interpretation relates to a binary betting or investment decision with random outcome y.

In the case of the extremal quantile scoring function $S^{Q}_{\alpha,\theta}$ in expression (10), the pay‐off takes only two possible values, relating to a bet on whether or not the outcome y will exceed the event threshold θ. Specifically, consider the following pay‐off scheme, which is realized in spread betting in prediction markets (Wolfers and Zitzewitz, 2008).
  1. If Quinn refrains from betting, his pay‐off will be 0, independently of the outcome y.
  2. If Quinn enters the bet and y ≤ θ realizes, he loses his wager, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0143.
  3. If Quinn enters the bet and y>θ realizes, his winnings are urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0144, for a gain of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0145.
The top left‐hand matrix in Table 1 summarizes Quinn's pay‐off under the decision rule "enter the bet if and only if x>θ", where x is Quinn's point forecast. This pay‐off scheme is equivalent to the extremal scoring function $S^{Q}_{\alpha,\theta}$. To demonstrate this, we shift attention from positively oriented pay‐offs to negatively oriented regrets, which we define as the difference between the pay‐off for an oracle and Quinn's pay‐off. Here the term oracle refers to a (hypothetical) omniscient bettor who enters the bet if and only if y>θ realizes, which would yield an ideal pay‐off urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0147 if y>θ and 0 otherwise. Quinn's regret equals the extremal score $S^{Q}_{\alpha,\theta}$ except for an irrelevant multiplicative factor. This is illustrated in the bottom left‐hand matrix in Table 1 and corresponds to the classical, simple cost–loss decision model (Richardson, 2012). In decision theoretic terms, the distinction between pay‐off and regret is inessential, because the difference depends on the outcome y only. In either case, the optimal strategy is to choose urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0149, where
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0150(17)
and the quantile urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0151 is computed from Quinn's predictive CDF F for the future outcome y. (For simplicity, we assume that F is strictly increasing.) In summary, Quinn is willing to accept the bet if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0152.
Table 1. Overview of pay‐off structures for decision rules of the form enter the bet or invest if and only if x>θ†

| | Quantiles, y ≤ θ | Quantiles, y > θ | Expectiles, y ≤ θ | Expectiles, y > θ |
|---|---|---|---|---|
| Monetary pay‐off | | | | |
| x ≤ θ | 0 | 0 | 0 | 0 |
| x > θ | $-\rho_L$ | $\rho_G$ | $-(1-r_L)(\theta-y)$ | $(1-r_G)(y-\theta)$ |
| Score (regret) | | | | |
| x ≤ θ | 0 | $\rho_G$ | 0 | $(1-r_G)(y-\theta)$ |
| x > θ | $\rho_L$ | 0 | $(1-r_L)(\theta-y)$ | 0 |

  • †Monetary pay‐offs are positively oriented, whereas scores are negatively oriented regrets relative to an oracle. For quantiles, the regret equals the extremal score $S^{Q}_{\alpha,\theta}(x,y)$, where $\alpha = \rho_G/(\rho_L+\rho_G)$, up to a multiplicative factor. For expectiles, the regret is $S^{E}_{\alpha,\theta}(x,y)$, where $\alpha = (1-r_G)/(2-r_L-r_G)$, again up to a multiplicative factor.
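The equivalence between Quinn's regret and the extremal quantile score can be checked numerically. The following Python sketch assumes the piecewise form of the extremal score from expression (5), which is not reproduced in this section: 1−α if y ≤ θ < x, α if x ≤ θ < y, and 0 otherwise. The names `rho_l` and `rho_g` for Quinn's stake and winnings, and the numerical values, are illustrative assumptions.

```python
def extremal_quantile_score(x, y, alpha, theta):
    """Assumed piecewise form of the extremal quantile score."""
    if y <= theta < x:      # forecast above the threshold, outcome not
        return 1.0 - alpha
    if x <= theta < y:      # outcome above the threshold, forecast not
        return alpha
    return 0.0


def quinn_payoff(x, y, theta, rho_l, rho_g):
    """Pay-off under the rule 'enter the bet if and only if x > theta'."""
    if x <= theta:
        return 0.0
    return rho_g if y > theta else -rho_l


def quinn_regret(x, y, theta, rho_l, rho_g):
    """Oracle pay-off minus Quinn's pay-off; the oracle bets iff y > theta."""
    oracle = rho_g if y > theta else 0.0
    return oracle - quinn_payoff(x, y, theta, rho_l, rho_g)


rho_l, rho_g, theta = 3.0, 1.0, 0.0       # hypothetical stake, winnings, threshold
alpha = rho_g / (rho_l + rho_g)           # quantile level implied by the stakes
for x in (-1.0, 1.0):
    for y in (-1.0, 1.0):
        regret = quinn_regret(x, y, theta, rho_l, rho_g)
        scaled = (rho_l + rho_g) * extremal_quantile_score(x, y, alpha, theta)
        print(x, y, regret, scaled)       # the last two columns agree
```

In this sketch the multiplicative factor linking regret and score is the sum of the stake and the winnings.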
In the case of the extremal expectile scoring function $S^{E}_{\alpha,\theta}$ in expression (7), the pay‐off is real valued. Specifically, suppose that Eve considers investing a fixed amount θ in a start‐up company, in exchange for an unknown future amount y of the company's profits or losses. The pay‐off structure then is as follows.
  1. If Eve refrains from the deal, her pay‐off will be 0, independently of the outcome y.
  2. If Eve invests and y ≤ θ realizes, her pay‐off is negative, at $-(1-r_L)(\theta-y)$. Here, θ − y is the sheer monetary loss, and the factor $1-r_L$ accounts for Eve's reduction in income tax, with $r_L$ representing the deduction rate. (In financial terms, the loss acts as a tax shield. The linear functional form that is assumed here is not unrealistic, even though it is simpler than many real world tax schemes, where non‐linearities may arise from tax exemptions, progression, etc.)
  3. If Eve invests and y>θ realizes, her pay‐off is positive, at $(1-r_G)(y-\theta)$, where $r_G$ denotes the tax rate that applies to her profits.
The top right‐hand matrix in Table 1 shows Eve's pay‐off under the decision rule enter the deal if and only if x>θ, where x is Eve's point forecast. To show that the pay‐off is equivalent to the extremal scoring function $S^{E}_{\alpha,\theta}$, we again shift attention to regrets relative to an omniscient investor or oracle who enters the deal if and only if y>θ occurs, which would yield the ideal pay‐off $(1-r_G)\max(y-\theta, 0)$. As seen in the bottom right‐hand matrix, Eve's regret equals the extremal score $S^{E}_{\alpha,\theta}(x,y)$, up to a multiplicative factor. This implies that Eve's optimal decision rule is to enter the deal if and only if the expectile at level
$$\alpha = \frac{1-r_G}{2-r_L-r_G} \tag{18}$$
of her predictive CDF F exceeds θ.

Therefore, expectiles induce optimal decision rules in investment problems with fixed costs and differential tax rates for profits versus losses. The mean arises in the special case when $\alpha = 1/2$ in expression (18). It corresponds to situations in which losses are fully tax deductible ($r_L = r_G$) and nests situations without taxes ($r_L = r_G = 0$). Tough taxation settings where $r_L < r_G$ shift Eve's incentives towards not entering the deal and correspond to expectiles at levels $\alpha < 1/2$. For example, if losses cannot be deducted at all ($r_L = 0$), whereas profits are taxed at a rate of $r_G = 1/2$, Eve will invest only if the expectile at level $\alpha = 1/3$ of her predictive CDF F exceeds the deal's fixed costs θ. Note that we permit the case θ<0, which may reflect subsidies or tax credits, say.
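Eve's decision rule can be illustrated with a small Python sketch. It assumes that the level in expression (18) is the ratio of the after‐tax profit factor to the sum of the after‐tax profit and loss factors, and it replaces the predictive CDF F by a handful of made‐up, equally likely scenarios. The function names and the scenario values are hypothetical. The expectile is found by bisection on its defining first‐order condition.

```python
def expectile_level(r_l, r_g):
    """Level alpha implied by deduction rate r_l and tax rate r_g."""
    return (1.0 - r_g) / (2.0 - r_l - r_g)


def sample_expectile(ys, alpha, iterations=200):
    """alpha-expectile of the empirical distribution of ys, by bisection.

    The expectile t solves alpha * E(Y - t)_+ = (1 - alpha) * E(t - Y)_+.
    """
    def imbalance(t):
        gains = sum(max(y - t, 0.0) for y in ys)
        losses = sum(max(t - y, 0.0) for y in ys)
        return alpha * gains - (1.0 - alpha) * losses   # decreasing in t

    lo, hi = min(ys), max(ys)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_payoff(ys, theta, r_l, r_g):
    """Eve's average after-tax pay-off from investing, over scenarios ys."""
    total = 0.0
    for y in ys:
        if y > theta:
            total += (1.0 - r_g) * (y - theta)
        else:
            total -= (1.0 - r_l) * (theta - y)
    return total / len(ys)


scenarios = [-4.0, -1.0, 0.5, 3.0, 6.0]    # made-up, equally likely outcomes
alpha = expectile_level(0.0, 0.5)           # no deduction, profits taxed at 1/2
for theta in (-1.0, 0.0, 1.0):
    invest = sample_expectile(scenarios, alpha) > theta
    profitable = expected_payoff(scenarios, theta, 0.0, 0.5) > 0.0
    print(theta, invest, profitable)        # both rules give the same answer
```

For these scenarios the expectile rule and the sign of the expected after‐tax pay‐off lead to the same decision at each of the three values of θ.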

The elementary score $S^{B}_{\theta}$ for probability forecasts of a binary event in expression (14) is obtained as the further special case that arises when $\alpha = 1/2$ and y ∈ {0,1}. Then |y − θ| ∈ {θ,1−θ}, so the pay‐offs in the bottom right‐hand matrix of Table 1 attain only two possible values. Hence, θ can be interpreted as a cost–loss ratio. We emphasize that this interpretation is specific to the binary case. In the general setting where y is continuous, θ takes the role of an event threshold, whereas α governs the costs of underprediction versus overprediction relative to this threshold.

The above interpretation of expectiles attaches an economic meaning to this class of functionals, which thus far seems to have been missing; for example, Schulze Waltrup et al. (2015), page 434, noted that ‘expectiles lack an intuitive interpretation’. In a notable exception, Bellini and Di Bernardino (2015) offered a succinct financial interpretation of expectiles which is of the same spirit as ours. The foregoing discussion may also bear on the debate about the revision of the Basel protocol for banking regulation, which involves contention about the choice of the functional of in‐house risk distributions that banks are supposed to report to regulators (Embrechts et al., 2014). Recently, expectiles have been proposed as potential candidates, as it has been proved that expectiles at level $\alpha \ge 1/2$ are the only elicitable law invariant coherent risk measures (Ziegel, 2014; Bellini and Bignozzi, 2015; Delbaen et al., 2015). See McNeil et al. (2015) for a recent treatment of these concepts and Fissler et al. (2016) for a discussion of the use of consistent scoring functions in financial regulation.

2.4 Order sensitivity

The extremal scoring functions $S^{Q}_{\alpha,\theta}$ and $S^{E}_{\alpha,\theta}$ are not only consistent for their respective functional; they in fact also enjoy the stronger property of order sensitivity. Generally, a scoring function S is order sensitive for the functional F ↦ T(F) relative to the class $\mathcal{F}$ if, for all $F \in \mathcal{F}$, all t ∈ T(F), and all $x_1, x_2 \in \mathbb{R}$,
$$x_2 \le x_1 \le t \;\Longrightarrow\; \mathbb{E}_F\, S(x_1, Y) \le \mathbb{E}_F\, S(x_2, Y)$$
and
$$t \le x_1 \le x_2 \;\Longrightarrow\; \mathbb{E}_F\, S(x_1, Y) \le \mathbb{E}_F\, S(x_2, Y).$$
The order sensitivity is strict if these conditions continue to hold when the inequalities involving $x_1$ and $x_2$ are strict. As before, we denote the class of the Borel probability measures on $\mathbb{R}$ by $\mathcal{F}_0$, and we write $\mathcal{F}_1$ for the subclass of the probability measures with finite first moment.

Proposition 2.

  1. (Quantiles): for every α ∈ (0,1) and $\theta \in \mathbb{R}$, the extremal scoring function $S^{Q}_{\alpha,\theta}$ is order sensitive for the α‐quantile functional relative to $\mathcal{F}_0$.
  2. (Expectiles): for every α ∈ (0,1) and $\theta \in \mathbb{R}$, the extremal scoring function $S^{E}_{\alpha,\theta}$ is order sensitive for the α‐expectile functional relative to $\mathcal{F}_1$.
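As an informal illustration of the quantile case, the following Python sketch evaluates the expected extremal quantile score exactly under a small discrete distribution and checks both implications of the order sensitivity definition on finite grids. It assumes the piecewise form of the extremal score (1−α if y ≤ θ < x, α if x ≤ θ < y, and 0 otherwise). The distribution, the grids and the function names are made up for the example, and a grid check is of course no substitute for the proof.

```python
def expected_score(x, alpha, theta, support, probs):
    """Expected extremal quantile score of forecast x under a discrete law."""
    total = 0.0
    for y, p in zip(support, probs):
        if y <= theta < x:
            total += p * (1.0 - alpha)
        elif x <= theta < y:
            total += p * alpha
    return total


def is_order_sensitive(t, alpha, support, probs, forecasts, thetas, tol=1e-12):
    """Check both implications of the definition, with t as the target value."""
    for theta in thetas:
        for x1 in forecasts:
            for x2 in forecasts:
                ordered = (x2 <= x1 <= t) or (t <= x1 <= x2)
                if ordered and (
                    expected_score(x1, alpha, theta, support, probs)
                    > expected_score(x2, alpha, theta, support, probs) + tol
                ):
                    return False
    return True


support, probs = [0.0, 1.0, 2.0, 3.0], [0.1, 0.2, 0.3, 0.4]
grid = [k / 2 for k in range(-2, 9)]        # -1.0, -0.5, ..., 4.0
# The median of this distribution is 2, so the check passes for t = 2 ...
print(is_order_sensitive(2.0, 0.5, support, probs, grid, grid))
# ... and fails when a value that is not the median is used as the target.
print(is_order_sensitive(0.0, 0.5, support, probs, grid, grid))
```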

Owing to the mixture representations (9) and (11), the order sensitivity of the extremal scoring functions transfers to all regular consistent scoring functions. Strict order sensitivity applies if the functions g and $\phi'$ in expressions (5) and (7) respectively are strictly increasing. For suitably large classes $\mathcal{F}$, the respective condition is also necessary. Analogous relationships hold in regard to (strict) consistency.

Recent studies of elicitability have revealed that (strict) order sensitivity and (strict) consistency are equivalent in quite general settings (Nau, 1985; Lambert, 2013; Steinwart et al., 2014; Bellini and Bignozzi, 2015). These results rely on continuity conditions on the scoring function and do not readily apply in our framework.

3 Forecast rankings

In this section, we turn to the task of comparing and ranking forecasts. Before applying our mixture representations to this problem, we introduce the prediction space setting of Gneiting and Ranjan (2013) and define notions of forecast dominance.

3.1 Prediction spaces

A prediction space is a probability space tailored to the study of forecasting problems. Following the seminal work of Murphy and Winkler (1987), the prediction space setting of Gneiting and Ranjan (2013) considers the joint distribution of forecasts and observations. See also Gneiting and Katzfuss (2014), section 2.1, for a brief informal description. Here we first focus on probabilistic forecasts F, which we identify with the associated CDFs for the real‐valued outcome Y. The elements of the sample space Ω can be identified with tuples of the form
$$(F_1, \ldots, F_k, Y) \tag{19}$$
where the predictive distributions $F_1, \ldots, F_k$ utilize information sets $\mathcal{A}_1, \ldots, \mathcal{A}_k \subseteq \mathcal{A}$ respectively, with $\mathcal{A}$ being a σ‐field on the sample space Ω. In measure theoretic language, the information sets correspond to sub‐σ‐fields, and $F_j$ is a CDF‐valued random quantity measurable with respect to $\mathcal{A}_j$. The joint distribution of the quantities in expression (19) is encoded by a probability measure $\mathbb{Q}$ on $(\Omega, \mathcal{A})$. In this setting, a predictive distribution $F_j$ is ideal relative to $\mathcal{A}_j$ if it corresponds to the conditional distribution of the outcome Y under $\mathbb{Q}$ given $\mathcal{A}_j$. An extended, more realistic notion of prediction space that allows for serial dependence between forecast–observation tuples has recently been introduced by Strähl and Ziegel (2015), along with far‐reaching generalizations of the concepts of ideality and calibration.

In a nutshell, a prediction space specifies the joint distribution of tuples of the form (19). To give an example, Table 2 revisits a scenario studied by Gneiting et al. (2007) and Gneiting and Ranjan (2013). The only difference is that we let the random variable τ attain the values −2 and 2, rather than the values −1 and 1. Here, the outcome is generated as $Y \mid \mu \sim \mathcal{N}(\mu, 1)$ where $\mu \sim \mathcal{N}(0, 1)$. The perfect forecaster is ideal relative to the σ‐field that is generated by the random variable μ. The unfocused and sign‐reversed forecasters also have knowledge of μ but fail to be ideal. The climatological forecaster, issuing the unconditional distribution of the outcome Y as predictive distribution, is ideal relative to the uninformative σ‐field that is generated by the empty set.

Table 2. Example of a prediction space with four competing forecasters†

| Forecaster | Predictive distribution | α‐quantile | Mean | Prob(Y>y) |
|---|---|---|---|---|
| Perfect | $\mathcal{N}(\mu, 1)$ | $\mu + z_\alpha$ | μ | 1−Φ(y−μ) |
| Climatological | $\mathcal{N}(0, 2)$ | $\sqrt{2}\, z_\alpha$ | 0 | 1−Φ(y/√2) |
| Unfocused | $\tfrac{1}{2}\{\mathcal{N}(\mu, 1) + \mathcal{N}(\mu+\tau, 1)\}$ | $\mu + q_{\alpha,\tau}$ | μ+τ/2 | $1 - \Phi_\tau(y-\mu)$ |
| Sign reversed | $\mathcal{N}(-\mu, 1)$ | $-\mu + z_\alpha$ | −μ | 1−Φ(y+μ) |

  • †The outcome is generated as $Y \mid \mu \sim \mathcal{N}(\mu, 1)$, where $\mu \sim \mathcal{N}(0, 1)$. The random variable τ attains the values −2 and 2 with probability $1/2$, independently of μ and Y. For α ∈ (0,1) and τ ∈ {−2,2}, we let $z_\alpha = \Phi^{-1}(\alpha)$, $\Phi_\tau(x) = \tfrac{1}{2}\{\Phi(x) + \Phi(x-\tau)\}$ and $q_{\alpha,\tau} = \Phi_\tau^{-1}(\alpha)$, where Φ denotes the CDF of the standard normal distribution.

Any predictive distribution F can be reduced to a point forecast by extracting the sought functional T(F). In what follows, we focus on quantiles, the mean or expectation functional and probability forecasts of the binary event that the outcome exceeds a threshold value. The respective point forecasts for the perfect, climatological, unfocused and sign‐reversed forecaster are shown in Table 2.
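The reduction from predictive distributions to point forecasts can be sketched in Python for the four forecasters. The sketch assumes the predictive distributions of the scenario in Gneiting et al. (2007): N(μ,1) for the perfect forecaster, N(0,2) for the climatological forecaster, an equal‐weight mixture of N(μ,1) and N(μ+τ,1) for the unfocused forecaster, and N(−μ,1) for the sign‐reversed forecaster. The function names are hypothetical, and the mixture quantile is obtained by bisection because it has no closed form.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal, provides cdf and inv_cdf


def mixture_quantile(alpha, mu, tau, iterations=200):
    """alpha-quantile of the equal-weight mixture of N(mu,1) and N(mu+tau,1)."""
    def cdf(x):
        return 0.5 * (STD_NORMAL.cdf(x - mu) + STD_NORMAL.cdf(x - mu - tau))

    lo = min(mu, mu + tau) - 10.0
    hi = max(mu, mu + tau) + 10.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def point_forecasts(mu, tau, alpha, threshold):
    """alpha-quantile, mean and exceedance probability for each forecaster."""
    z = STD_NORMAL.inv_cdf(alpha)
    phi = STD_NORMAL.cdf
    return {
        "perfect": {"quantile": mu + z, "mean": mu,
                    "prob": 1.0 - phi(threshold - mu)},
        "climatological": {"quantile": sqrt(2.0) * z, "mean": 0.0,
                           "prob": 1.0 - phi(threshold / sqrt(2.0))},
        "unfocused": {"quantile": mixture_quantile(alpha, mu, tau),
                      "mean": mu + tau / 2.0,
                      "prob": 1.0 - 0.5 * (phi(threshold - mu)
                                           + phi(threshold - mu - tau))},
        "sign_reversed": {"quantile": -mu + z, "mean": -mu,
                          "prob": 1.0 - phi(threshold + mu)},
    }


# One illustrative draw of (mu, tau); level 0.90 and threshold 2 as in Fig. 2.
for name, values in point_forecasts(0.3, 2.0, 0.90, 2.0).items():
    print(name, values)
```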

In practice, point forecasts might be an end in themselves, i.e. they might have been issued without there being an underlying predictive distribution. To accommodate such cases, we define a point prediction space to be a probability space $(\Omega, \mathcal{A}, \mathbb{Q})$, where the elements of the sample space Ω can be identified with tuples of the form
$$(X_1, \ldots, X_l, Y) \tag{20}$$
where the random variables $X_1, \ldots, X_l$ represent point forecasts and utilize information sets $\mathcal{A}_1, \ldots, \mathcal{A}_l \subseteq \mathcal{A}$ respectively. For simplicity, we let $X_1, \ldots, X_l$ be single valued. Extensions to set‐valued random quantities, as might occur in the case of quantiles, are straightforward. The joint distribution of the point forecasts and the observation in expression (20) is specified by the probability measure $\mathbb{Q}$. Similarly, it is sometimes useful to consider a mixed prediction space, by specifying the joint distribution $\mathbb{Q}$ of tuples of the form
$$(F_1, \ldots, F_k, X_1, \ldots, X_l, Y) \tag{21}$$
where $F_1, \ldots, F_k$ represent CDF‐valued random quantities, and $X_1, \ldots, X_l$ represent point forecasts.

3.2 Notions of forecast dominance

We now define notions of forecast dominance, starting with probabilistic forecasts that take the form of predictive CDFs, and then turning to point forecasts. In the former setting, a scoring rule $\mathsf{S}(F, y)$ is a suitably measurable function that assigns a loss or penalty when we issue the predictive distribution F and y realizes. A scoring rule $\mathsf{S}$ is proper if
$$\mathbb{E}_G\, \mathsf{S}(G, Y) \le \mathbb{E}_G\, \mathsf{S}(F, Y) \tag{22}$$
for all probability measures F and G in its domain of definition (Gneiting and Raftery, 2007). Proper scoring rules therefore encourage honest and careful assessments. As is well known, a scoring function S that is consistent for a single‐valued functional T relative to a class $\mathcal{F}$ induces a proper scoring rule, by defining $\mathsf{S}(F, y) = S(T(F), y)$ for $F \in \mathcal{F}$ and $y \in \mathbb{R}$.

Definition 1. (Predictive CDFs). Let $F_1$ and $F_2$ be probabilistic forecasts, and let Y be the outcome, in a prediction space. Then $F_1$ dominates $F_2$ relative to a class $\mathcal{S}$ of proper scoring rules if $\mathbb{E}_{\mathbb{Q}}\, \mathsf{S}(F_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, \mathsf{S}(F_2, Y)$ for every $\mathsf{S} \in \mathcal{S}$.

We now turn to quantiles and expectiles and the respective families $\mathcal{S}^{Q}_{\alpha}$ and $\mathcal{S}^{E}_{\alpha}$ of the regular consistent scoring functions for these functionals.

Definition 2.

  1. (Quantiles): let $X_1$ and $X_2$ be point forecasts, and let Y be the outcome, in a point prediction space. Then $X_1$ dominates $X_2$ as an α‐quantile forecast if $\mathbb{E}_{\mathbb{Q}}\, S(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S(X_2, Y)$ for every scoring function $S \in \mathcal{S}^{Q}_{\alpha}$.
  2. (Expectiles): let $X_1$ and $X_2$ be point forecasts, and let Y be the outcome, in a point prediction space. Then $X_1$ dominates $X_2$ as an α‐expectile forecast if $\mathbb{E}_{\mathbb{Q}}\, S(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S(X_2, Y)$ for every scoring function $S \in \mathcal{S}^{E}_{\alpha}$.

It is important to note that the expectations in the definitions are taken with respect to the joint distribution of the forecasts and the outcome. To cover time series settings, as commonly encountered in practice, the definitions can be applied to the more general prediction space setting for serial dependence that was introduced and studied by Strähl and Ziegel (2015). The dominance notions provide partial orderings for the predictive distributions $F_1, \ldots, F_k$ in expression (19) and the point forecasts $X_1, \ldots, X_l$ in expression (20). In the special case of probability forecasts of a binary event, related notions of sufficiency and dominance have been studied by DeGroot and Fienberg (1983), Vardeman and Meeden (1983), Schervish (1989), Feuerverger and Rahman (1992), Krämer (2005) and Bröcker (2009). Essentially, a probabilistic forecast that dominates another is preferable, or at least not inferior, in any type of decision that involves the respective predictive distributions. (To see this, note that any utility function induces a proper scoring rule via the Bayes act. Details of the construction are given in section 3 of Dawid (2007) and section 2.2 of Gneiting and Raftery (2007).) In the case of quantiles or expectiles, a point forecast that dominates another is preferable, or at least not inferior, in any type of decision problem that depends on the respective predictive distributions via the considered functional only. Adaptations to functionals other than quantiles or expectiles are straightforward.

Under which conditions does a forecast dominate another? Holzmann and Eulert (2014) recently showed that, if two predictive distributions are ideal, then the one with the richer information set dominates the other. Furthermore, the result carries over to ideal forecasters’ induced point predictions, including but not limited to the cases of quantiles and expectiles that we consider here. To give an example in the setting of Table 2, the perfect and the climatological forecasters are ideal relative to the σ‐fields that are generated by μ and generated by the empty set respectively. Therefore, the perfect forecaster dominates the climatological forecaster, in any of the above senses.

Tsyplakov (2014) went on to show that, if a predictive distribution is ideal relative to a certain information set, then it dominates any predictive distribution measurable with respect to the information set. Again, the result carries over to the induced point forecasts. In the setting of Table 2, the perfect forecaster is also ideal relative to the larger σ‐field generated by the random variables μ and τ, because τ is independent of μ and Y. The climatological, unfocused and sign‐reversed forecasters are measurable with respect to this σ‐field, and so they are dominated by the perfect forecaster, in any of the above senses.

In the practice of forecasting, predictive distributions are hardly ever ideal, and information sets may not be nested, as emphasized by Patton (2015). Therefore, the above theoretical results are not readily applicable, and distinct scoring rules, or distinct consistent scoring functions, may yield distinct forecast rankings, as in empirical examples given by Schervish (1989), Merkle and Steyvers (2013) and Patton (2015), among others. Furthermore, in general it is not feasible to check the validity of the expectation inequalities in the definitions, for any proper scoring rule $\mathsf{S} \in \mathcal{S}$, or consistent scoring function $S \in \mathcal{S}^{Q}_{\alpha}$, or $S \in \mathcal{S}^{E}_{\alpha}$.

Fortunately, in the case of quantile and expectile forecasts, the mixture representations in theorem 1 reduce checks for dominance to the respective one‐dimensional families of elementary scoring functions.

Corollary 1.

  1. (Quantiles): in a point prediction space, $X_1$ dominates $X_2$ as an α‐quantile forecast if $\mathbb{E}_{\mathbb{Q}}\, S^{Q}_{\alpha,\theta}(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S^{Q}_{\alpha,\theta}(X_2, Y)$ for every $\theta \in \mathbb{R}$.
  2. (Expectiles): in a point prediction space, $X_1$ dominates $X_2$ as an α‐expectile forecast if $\mathbb{E}_{\mathbb{Q}}\, S^{E}_{\alpha,\theta}(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S^{E}_{\alpha,\theta}(X_2, Y)$ for every $\theta \in \mathbb{R}$.

The reduction to a one‐dimensional problem suggests graphical comparisons via Murphy diagrams. Before we discuss this tool, we note that order sensitivity can sometimes be invoked to prove dominance. For example, consider the mixed prediction space setting (21) with k=1 and l=2. Suppose that the CDF‐valued random quantity F is ideal relative to the σ‐field urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0286, and let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0287 denote its α‐quantile. Suppose furthermore that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0288 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0289 are measurable with respect to urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0290. By corollary 1, in concert with proposition 1 and a conditioning argument, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0291 dominates urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0292 as an α‐quantile forecast if with probability 1 either
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0293
or
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0294
holds true. An analogous argument applies in the case of the α‐expectile.

In the scenario of Table 2, the argument can be put to work in the case $\alpha = 1/2$ that corresponds to median and mean forecasts. Specifically, let F be the perfect forecast, which has median and mean μ, take the σ‐field to be the one that is generated by μ, and let $X_1 = 0$ and $X_2 = -\mu$. Invoking the order sensitivity argument, we see that the climatological forecaster dominates the sign‐reversed forecaster for both median and mean predictions.

3.3 The Murphy diagram as a diagnostic tool

As noted, corollary 1 suggests graphical tools for the comparison of quantile and expectile forecasts, including the special case of the mean or expectation functional, and the further special case of probability forecasts of a binary event. We describe these diagnostic tools in the setting of a point prediction space (20), where $X_1, \ldots, X_l$ denote point forecasts for the outcome Y, and the probability measure $\mathbb{Q}$ represents their joint distribution. In the case of probability forecasts, we use the more suggestive notation $p_1, \ldots, p_l$ for the forecasts.
  1. For quantile forecasts at level α ∈ (0,1), we plot the graph of the expected elementary quantile score $S^{Q}_{\alpha,\theta}$,
    $$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{Q}_{\alpha,\theta}(X_j, Y) \tag{23}$$
    for j=1,…,l. By corollary 1, part (a), forecast $X_i$ dominates forecast $X_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in \mathbb{R}$. The area under $s_j(\theta)$ equals the expected asymmetric piecewise linear score (6).
  2. For expectile forecasts at level α ∈ (0,1), we plot the graph of the expected elementary expectile score $S^{E}_{\alpha,\theta}$,
    $$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{E}_{\alpha,\theta}(X_j, Y) \tag{24}$$
    for j=1,…,l. By corollary 1, part (b), forecast $X_i$ dominates forecast $X_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in \mathbb{R}$. The area under $s_j(\theta)$ equals half the expected asymmetric squared error (8).
  3. For probability forecasts of a binary event, we plot the graph of the expected elementary score $S^{B}_{\theta}$,
    $$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{B}_{\theta}(p_j, Y) \tag{25}$$
    for j=1,…,l. By corollary 1, part (b), the probability forecast $p_i$ dominates $p_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for θ ∈ (0,1). The area under $s_j(\theta)$ equals half the expected Brier score (15).
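The area statement for quantile forecasts can be verified on a small made‐up data set. The Python sketch below assumes the piecewise form of the extremal quantile score (1−α if y ≤ θ < x, α if x ≤ θ < y, and 0 otherwise) and takes expression (6) to be the usual asymmetric piecewise linear score, (1(y<x) − α)(x − y). Because the empirical Murphy curve is piecewise constant and right continuous, its area can be computed exactly from the values at the sorted forecasts and outcomes. All names and numbers are illustrative.

```python
def extremal_quantile_score(x, y, alpha, theta):
    """Assumed piecewise form of the extremal quantile score."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0


def murphy_curve(xs, ys, alpha, thetas):
    """Average extremal score at each theta: one curve of a Murphy diagram."""
    n = len(ys)
    return [
        sum(extremal_quantile_score(x, y, alpha, t) for x, y in zip(xs, ys)) / n
        for t in thetas
    ]


def area_under_curve(xs, ys, alpha):
    """Exact area: the curve is constant between consecutive knots."""
    knots = sorted(set(xs) | set(ys))
    heights = murphy_curve(xs, ys, alpha, knots[:-1])   # value at left endpoints
    return sum(h * (b - a) for h, a, b in zip(heights, knots[:-1], knots[1:]))


def mean_pinball_score(xs, ys, alpha):
    """Average asymmetric piecewise linear score."""
    total = sum(((1.0 if y < x else 0.0) - alpha) * (x - y)
                for x, y in zip(xs, ys))
    return total / len(ys)


forecasts = [1.0, 2.5, 0.0, 4.0]     # made-up 0.9-quantile forecasts
outcomes = [0.5, 3.0, 1.5, 4.0]      # made-up outcomes
print(area_under_curve(forecasts, outcomes, 0.9))
print(mean_pinball_score(forecasts, outcomes, 0.9))
```

Plotting the output of `murphy_curve` against a grid of θ values for each competing forecaster gives the diagram itself.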

In the context of probability forecasts for binary weather events, displays of this type have a rich tradition that can be traced to Thompson and Brier (1955) and Murphy (1977). More recent examples include Schervish (1989), Feuerverger and Rahman (1992), Richardson (2000), Wilks (2001), Mylne (2002) and Berrocal et al. (2010), among many others. Murphy (1977) distinguished three kinds of diagram that reflect the economic decisions involved. The negatively oriented expense diagram shows the mean raw loss or expense of a given forecast scheme; the positively oriented value diagram takes the unconditional or climatological forecast as reference and plots the difference in expense between this reference forecast and the forecast at hand and, lastly, the relative value diagram plots the ratio of the utility of a given forecast and the utility of an oracle forecast. The displays that were introduced above are similar to the value diagrams of Murphy, and we refer to them as Murphy diagrams. Our Murphy diagrams are by default negatively oriented and plot the expected elementary score for competing quantile, expectile and probability forecasters. For better visual appearance, we generally connect the left‐ and right‐hand limits at the jump points of the empirical score curves.

Fig. 2 shows Murphy diagrams for the perfect, climatological, unfocused and sign‐reversed forecasters in Table 2. We compare point predictions for the mean or expectation functional, and the quantile at level α=0.90, along with probability forecasts for the binary event that the outcome exceeds the threshold value 2. Analytic expressions for the expected scores are given in Appendix B. As proved in the previous section, the perfect forecaster dominates the other forecasters for all functionals considered. The expected score curves for the climatological and the unfocused, and for the unfocused and the sign‐reversed forecasters, intersect in all three cases, so there are no order relationships between these forecasters. Finally, the Murphy diagrams suggest that the climatological forecaster dominates the sign‐reversed forecaster for all three functionals and, in the case of the mean functional, the order sensitivity argument in the previous section confirms the visual impression. In the cases of the quantile and probability forecasts, final confirmation would need to be based on tedious analytic investigations of the asymptotic behaviour of the expected score functions.

Fig. 2. Murphy diagrams for the forecasters in Table 2: the functionals considered are (a), (b) the mean, (c) the quantile at level α=0.90 and (d) the probability of the binary event Y⩾2 (⋮, equivalent extremal scores; see expressions (14) and (16)); curves are shown for the perfect, climatological, unfocused and sign‐reversed forecasters
By default, our Murphy diagrams show the expected elementary scores. If interest focuses on binary comparisons, it is natural to consider Murphy diagrams for the difference,
$$\theta \mapsto s_1(\theta) - s_2(\theta) \tag{26}$$
between the expected elementary scores of two point forecasters.

3.4 Murphy diagrams for empirical forecasters

We now turn to the comparison and ranking of empirical forecasts. Specifically, we consider tuples
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0323(27)
where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0324 are the jth forecaster's point predictions, for j=1,…,l, and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0325 are the respective outcomes. Thus, we have l competing forecasters, and each of them issues a set of n point predictions. A convenient interpretation of the empirical setting is as a special case of a point prediction space, in which the tuples urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0326 in expression (20) attain each of the values in expression (27) with probability 1/n. Then the probability measure urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0327 is the corresponding empirical measure and, with this identification, the (average) empirical scores
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0328
where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0329 is either urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0330, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0331 or urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0332, become the expected elementary scores from expressions (23), (24) and (25) respectively. Accordingly, we say that forecaster urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0333empirically dominates forecaster urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0334 if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0335 for all urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0336. When comparing the two forecasters urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0337 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0338, it is convenient to show a Murphy plot of the equivalent of the difference (26), namely
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0339
where
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0340 (28)
for i=1,…,n, and again urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0341 is either urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0342, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0343 or urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0344.
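The empirical quantities above can be sketched in a few lines of Python. The sketch assumes that the elementary scores take the piecewise forms of expressions (10) and (12) as we read them: a penalty of 1−α or α for quantiles, additionally scaled by |y−θ| for expectiles, incurred only when the threshold θ separates forecast and outcome. The function names, the toy data and the threshold grid are illustrative assumptions and not part of the paper.

```python
import numpy as np

def elementary_quantile_score(x, y, theta, alpha):
    """Assumed form of expression (10):
    1 - alpha if y <= theta < x, alpha if x <= theta < y, and 0 otherwise."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    over = (y <= theta) & (theta < x)    # forecast above threshold, outcome at or below it
    under = (x <= theta) & (theta < y)   # forecast at or below threshold, outcome above it
    return (1 - alpha) * over + alpha * under

def elementary_expectile_score(x, y, theta, alpha):
    """Assumed form of expression (12): the quantile-type penalty scaled by |y - theta|."""
    y = np.asarray(y, float)
    return elementary_quantile_score(x, y, theta, alpha) * np.abs(y - theta)

def murphy_curve(x, y, thetas, alpha, score=elementary_quantile_score):
    """Average empirical elementary score at each threshold in `thetas`."""
    return np.array([score(x, y, t, alpha).mean() for t in thetas])

# Toy data (hypothetical): two forecasters, n = 4 forecast cases.
y = np.array([1.0, 2.0, 3.0, 4.0])
x1 = np.array([1.5, 2.0, 2.5, 4.5])
x2 = np.array([3.0, 0.0, 5.0, 2.0])
thetas = np.linspace(-1.0, 6.0, 141)

s1 = murphy_curve(x1, y, thetas, alpha=0.5)
s2 = murphy_curve(x2, y, thetas, alpha=0.5)
difference = s1 - s2  # negative values favour forecaster 1 at that threshold
```

Plotting `s1` and `s2`, or `difference`, against `thetas` gives the kind of display described in the text, here for the median.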

Murphy diagrams can be used efficiently to show a lack of domination when forecasters’ expected elementary score curves intersect. However, in general it is not possible to conclude domination, unless the visual impression is supported by tedious analytic investigations of the behaviour of the expected score functions as θ→±∞. Fortunately, these complications do not arise in the empirical case, where dominance can be established by comparing the empirical score functions at a well‐defined finite set of arguments only, as follows.

Corollary 2.

  (a) (Quantiles): the forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0345 empirically dominates urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0346 for α‐quantile predictions if
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0347
    for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0348.
  (b) (Expectiles): the forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0349 empirically dominates urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0350 for α‐expectile predictions if
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0351
    for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0352 and in the left‐hand limit as urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0353. In the case urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0354 evaluations at urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0355 can be omitted.

To see why these results hold, note that in either case the score differential urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0356 is right continuous, and that it vanishes unless urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0357. Furthermore, in the case of quantiles urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0358 is piecewise constant with no other jump points than urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0359 or urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0360. Similarly, in the case of expectiles urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0361 is piecewise linear with no other jump points than urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0362 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0363, and no other change of slope than at urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0364. The change of slope disappears when urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0365. Fig. 3 illustrates the behaviour of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0366 in the cases of the median and the mean.
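For the quantile case, the finite check can be sketched as follows. The sketch assumes the piecewise form of the elementary quantile score (1−α if y ⩽ θ < x, α if x ⩽ θ < y, 0 otherwise) and takes the candidate thresholds to be all forecast values and outcomes, which is where a right‐continuous, piecewise constant score differential can jump. The function names and toy data are hypothetical.

```python
import numpy as np

def mean_quantile_score(x, y, theta, alpha):
    # Assumed elementary quantile score, averaged over the forecast cases.
    over = (y <= theta) & (theta < x)
    under = (x <= theta) & (theta < y)
    return np.mean((1 - alpha) * over + alpha * under)

def empirically_dominates(x1, x2, y, alpha, tol=1e-12):
    """True if forecaster 1's mean elementary quantile score is no larger than
    forecaster 2's at every candidate jump point (forecast values and outcomes)."""
    x1, x2, y = (np.asarray(a, float) for a in (x1, x2, y))
    candidates = np.unique(np.concatenate([x1, x2, y]))
    return all(
        mean_quantile_score(x1, y, t, alpha) <= mean_quantile_score(x2, y, t, alpha) + tol
        for t in candidates
    )

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y.copy()                       # hypothetical forecaster who always hits the outcome
noisy = np.array([3.0, 0.0, 5.0, 2.0])   # hypothetical competitor
```

Because the differential is constant between consecutive candidates and vanishes outside their range, agreement at the candidates settles the comparison for every θ. The expectile case would additionally need left‐hand limits at the same points, which this sketch does not cover.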

Fig. 3. General shape of the score differential urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0367 in expression (28) for the median and mean functionals

To give an example, we consider the 10 forecasters in Table A.1 of Merkle and Steyvers (2013), encoded as ID 1–ID 10, each of whom issues probability forecasts for 21 binary events. The data are artificial but mimic forecasters in the aggregate contingent estimation system, which is a Web‐based survey that solicited probability forecasts for world events from the general public. The Murphy diagram in Fig. 4(a) shows the empirical score curves
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0368
where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0369 is forecaster j's stated probability for world event i to materialize, and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0370 is the respective binary realization. By corollary 2, part (b), dominance relationships can be inferred by evaluating urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0371 at the forecasters’ stated probabilities. We note that ID 3 empirically dominates ID 6 and ID 8, and that ID 5 empirically dominates ID 10. The remaining pairwise comparisons do not give rise to dominance relationships. The induced partial order between the IDs applies to comparisons under any proper scoring rule, as reflected by the rankings in Table 1 of Merkle and Steyvers (2013). Fig. 4(b) considers joint comparisons. We see that ID 3 attains the lowest score over a wide range of θ. However, ID 2, ID 5, ID 7 and ID 9 show the unique best empirical score under urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0372 for other values of θ and, therefore, have superior economic utility under the associated cost–loss ratios.

Fig. 4. (a) Murphy diagram for the probability forecasters ID 1–ID 10 in Table A.1 of Merkle and Steyvers (2013), with ID 3, ID 6, ID 8, ID 5 and ID 10 distinguished from the other forecasters, and (b) best forecast ID(s) under urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0373, marked as unique or shared best score (for example, ID 9 attains the unique best score for θ ∈ [0.02,0.04) and ID 10 attains the shared best score for θ ∈ [0.91,1))

It seems desirable to complement a Murphy diagram by formal tests of forecast dominance in the underlying population, with possible null hypothesis urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0374 that forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0375 dominates forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0376. Intuitively, large positive values of the mean score difference urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0377 speak against urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0378. Tests thus could be based on any functional T defined on the paths of the stochastic process urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0379 that is monotone, in that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0380 implies that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0381, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0382 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0383 are functions of θ. For example, one might choose urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0384 or urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0385 and reject urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0386 if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0387 exceeds some critical value, which in general is difficult to determine. One possibility is to utilize randomization, drawing on the idea that in a one‐sided test it should suffice to control the error of the first kind at the boundary of the null hypothesis, where there is no difference in the predictive performance, so that an exchange of labels would not change the distribution of the test statistic. The critical value can then be determined from the distribution of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0388, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0389 with independent random signs urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0390. 
Although permutation tests of this type have intuitive appeal and are easy to implement, there remain conceptual issues, concerning, for example, the specifics of the null hypothesis being tested. In any case, tests for forecast dominance based on extremal scoring functions, perhaps in concert with the bootstrap or the Westfall–Young method (Westfall and Young, 1993; Cox and Lee, 2008), deserve further investigation.
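The randomization idea can be sketched as follows, under assumptions that go beyond the text: the score differences are evaluated on a finite threshold grid, the monotone functional is taken to be the supremum over that grid, and the cases are treated as exchangeable. The function name, the number of draws and the seed are hypothetical, and the sketch does not resolve the conceptual issues noted above.

```python
import numpy as np

def sign_randomization_pvalue(d, n_draws=2000, seed=0):
    """d: array of shape (n, m) holding score differences d_i(theta_k) on a threshold grid.
    Statistic: supremum over the grid of the mean score difference."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, float)
    n = d.shape[0]
    observed = d.mean(axis=0).max()
    # Each row of `signs` swaps the labels of the two forecasters case by case.
    signs = rng.choice([-1.0, 1.0], size=(n_draws, n))
    randomized = (signs @ d / n).max(axis=1)
    # Share of randomized statistics at least as large as the observed one.
    return (1 + np.sum(randomized >= observed)) / (1 + n_draws)
```

A small returned value speaks against the hypothesis that the first forecast dominates the second. For time series data the independent sign flips ignore serial dependence, so the result should be read as indicative only.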

4 Empirical examples

We now demonstrate the use of Murphy diagrams in economic and meteorological case‐studies in time series settings. In each example, interest is in a comparison of two forecasts, and so we show Murphy diagrams for the empirical scores and their difference. The jagged visual appearance stems from the behaviour of the empirical score functions just explained and depends on the number n of forecast cases. We supplement the Murphy diagram for a difference by confidence intervals based on Diebold and Mariano (1995) tests with a heteroscedasticity and auto‐correlation robust variance estimator (Newey and West, 1987). The approach of Diebold and Mariano (1995) views empirical data of the form (27) as a sample from an underlying population and tests the hypothesis of equal expected scores. The confidence bands are pointwise and have a nominal level of 95%.
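A pointwise band of this kind can be sketched as below. The sketch assumes a Bartlett‐kernel (Newey–West) long‐run variance estimate with a user‐chosen truncation lag and a normal approximation; the lag used in our computations is not stated here, so the `lag` argument, like the function names, is a hypothetical choice.

```python
import numpy as np

def newey_west_variance(d, lag):
    """Bartlett-kernel (Newey-West) estimate of the long-run variance of the series d."""
    d = np.asarray(d, float)
    n = d.size
    u = d - d.mean()
    lrv = u @ u / n                                   # lag-0 autocovariance
    for h in range(1, lag + 1):
        weight = 1.0 - h / (lag + 1.0)                # Bartlett weights
        lrv += 2.0 * weight * (u[h:] @ u[:-h]) / n    # weighted lag-h autocovariance
    return lrv

def pointwise_band(d, lag, z=1.959964):
    """d: (n, m) score differences on a threshold grid; returns (mean, lower, upper)."""
    d = np.asarray(d, float)
    n, m = d.shape
    mean = d.mean(axis=0)
    se = np.sqrt([max(newey_west_variance(d[:, k], lag), 0.0) / n for k in range(m)])
    return mean, mean - z * se, mean + z * se
```

The default `z` corresponds to a nominal pointwise level of 95%; the band is not simultaneous across thresholds.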

4.1 Mean forecasts of inflation

In macroeconomics, subjective expert forecasts often compare favourably with statistical forecasting approaches; see Faust and Wright (2013) for evidence and discussion. For the USA, the Survey of Professional Forecasters (SPF) run by the Federal Reserve Bank of Philadelphia is a key source of data; see, for example, Engelberg et al. (2009). Patton (2015) used SPF data to illustrate the use of various scoring functions that are consistent for the mean functional.

Motivated by Patton's analysis, we analyse quarterly SPF mean forecasts for the annual inflation rate of the consumer price index over the next 12 months in the USA. We compare the SPF forecasts with forecasts from another survey, the Michigan Survey of Consumers, based on data from the third quarter of 1982 to the third quarter of 2014, for a test period of 129 quarters. Our implementation choices are as in section 5 of Patton (2015), except that we update the data set to cover the observations for the second and third quarters in 2014, and that we use the slightly newer fourth quarter of 2014 vintage for the consumer price index realizations. Fig. 5(a) shows the forecasts along with the realizing values.

Fig. 5. Point forecasts and realizations in the empirical examples: (a) mean inflation (Patton (2015); n=129; SPF, Michigan survey and actual values); (b) probability of recession (Rudebusch and Williams (2009); n=186; SPF, probit and actual recessions); (c) 90% quantile of wind speed (Gneiting et al. (2006); n=5136; regime switching space–time, auto‐regressive and actual values), restricted to a subperiod in the summer of 2003

The Murphy diagrams are shown in Figs 6(a) and 6(d). In Fig. 6(a), the curves for the empirical elementary score urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0391 of the SPF and the Michigan survey intersect prominently, suggesting that neither of the two surveys empirically dominates the other. In Fig. 6(d), the confidence intervals for the score differences are fairly broad and include zero for all values of θ. Note that the SPF is preferred for smaller values, whereas the Michigan forecast is preferred for larger values of θ. To interpret these results, consider the event threshold θ=6. A forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0392 attains a non‐zero extremal score urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0393 in expression (7) if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0394 or urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0395. Fig. 5(a) and the more detailed display in Fig. 7 identify five quarters when the SPF incurs a non‐zero penalty, compared with two quarters only for the Michigan survey. Interestingly, the threshold θ=6 has become less relevant over time, in that forecasts and realizations have remained below 6% from 1991 onwards.

Fig. 6. Murphy diagrams in the empirical examples ((a)–(c) scores; (d)–(f) score differences (less than 0 means that SPF or regime switching is preferable)): (a), (d) mean inflation (Patton (2015); n=129); (b), (e) probability of recession (Rudebusch and Williams (2009); n=186); (c), (f) 90% quantile of wind speed (Gneiting et al. (2006); n=5136)

Fig. 7. Mean forecasts of inflation for the third quarter of 1982 to the first quarter of 1992, marking quarters with a non‐zero extremal score urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0396 for both the SPF and the Michigan forecast, quarters with a non‐zero score for the SPF forecast only, and the realizations

4.2 Probability forecasts of recession

We now relate to the rich literature on binary regression and prediction and analyse probability forecasts of US recessions, as proxied by negative real gross domestic product growth. The SPF covers probability forecasts for this event since the fourth quarter of 1968. Following Rudebusch and Williams (2009), we compare current quarter probability forecasts from the SPF with forecasts from a probit model based on the term spread, i.e. the difference between long‐ and short‐term interest rates. We follow Rudebusch and Williams (2009) in all choices of data and implementation, except that we update their sample through the second quarter of 2014, for a test period of 186 quarters. Detailed economic and/or statistical justification for these choices can be found in Rudebusch and Williams (2009).

Fig. 5(b) shows the SPF and probit‐model‐based probability forecasts for a recession, with the grey vertical bars indicating actual recessions. During recessionary periods, the SPF tends to assign higher forecast probabilities than the probit model. Also, the SPF tends to assign lower forecast probabilities during non‐recessionary periods. The Murphy diagrams in Figs 6(b) and 6(e) show that the SPF attains lower empirical elementary scores urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0397 at all thresholds θ ∈ (0,1). The confidence intervals for the score differences exclude zero for small values of the cost–loss ratio θ and confirm the superiority of the SPF over the probit model for current quarter forecasts. This can partly be attributed to the fact that SPF panelists have access to timely within‐quarter information that is not available to the probit model. As demonstrated by Rudebusch and Williams (2009), the relative performance of the probit model improves at longer forecast horizons, where within‐quarter information plays a lesser role.

4.3 Quantile forecasts for wind speed

We return to the meteorological example in Fig. 1, but instead of the mean or expectation functional we now consider quantile forecasts at level α=0.90. We compare the regime switching space–time (RST) approach that was introduced by Gneiting et al. (2006) with a simple auto‐regressive (AR) benchmark for 2‐h‐ahead forecasts of hourly average wind speed at the Stateline wind energy centre in the Pacific Northwest of the USA. Gneiting et al. (2006) referred to the specifications considered here as RST‐D‐CH and AR‐D‐CH. This terminology indicates that the methods account for the diurnal cycle and conditional heteroscedasticity. The data set, evaluation period, estimation and forecast methods for this example are identical to those in Gneiting et al. (2006), and we refer to Gneiting et al. (2006) for detailed descriptions. Both methods yield predictive distributions, from which we extract the quantile forecasts. The evaluation period ranges from May 1st to November 30th, 2003, for a total of 5136 hourly forecast cases.

Fig. 5(c) shows the quantile forecasts and realizations. The quantile forecasts exceed the outcomes at about the nominal level, at 89.7% for the RST forecast and 90.9% for the AR forecast, indicating good calibration. However, the RST forecasts are sharper, in that the average forecast value over the evaluation period is 9.2 m s⁻¹, compared with 9.7 m s⁻¹ in the case of the AR forecast. To see why the sharpness interpretation applies here, note that wind speed is a non‐negative quantity, so the lower prediction interval at level α ∈ (0,1) ranges from 0 to the α‐quantile, whence smaller quantiles translate into shorter, more informative prediction intervals and sharper predictive distributions. These observations suggest the superiority of the RST forecasts over the benchmark AR forecasts, and the Murphy diagrams for the empirical elementary scores urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0400 in Figs 6(c) and 6(f) confirm this intuition, in line with what we saw in Fig. 1 for the mean functional.
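The two summary measures used in this comparison are simple to compute. The sketch below assumes that coverage is measured as the share of cases in which the outcome does not exceed the quantile forecast, and that sharpness is summarized by the average forecast value; the function name and the toy numbers are hypothetical and unrelated to the Stateline data.

```python
import numpy as np

def coverage_and_sharpness(q, y):
    """q: quantile forecasts; y: outcomes.
    Returns (share of cases with y <= q, average forecast value)."""
    q, y = np.asarray(q, float), np.asarray(y, float)
    return np.mean(y <= q), q.mean()

# Toy wind speeds in m/s (hypothetical).
coverage, mean_forecast = coverage_and_sharpness(
    [10.0, 8.0, 9.0, 12.0], [7.0, 9.0, 8.5, 11.0]
)
```

For a calibrated α‐quantile forecast the coverage should be close to α, and among calibrated forecasts of a non‐negative quantity a smaller average forecast indicates a sharper forecast.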

5 Discussion

We have studied mixture representations of Choquet type for the scoring functions that are consistent for quantiles and expectiles, including the case of the mean or expectation functional, and nesting probability forecasts for binary events as a further special case. A particularly interesting aspect of these results is that they allow an economic interpretation of consistent scoring functions in terms of betting and investment problems. Our interpretation of expectiles in the context of investment problems with fixed costs and differential tax rates appears to be original and may bear on the current debate about the revision of the Basel protocol for banking regulation.

From a general applied perspective, Gneiting (2011), page 757, had argued that, if point forecasts are to be issued and evaluated,

‘it is essential that either the scoring function be specified ex ante, or an elicitable target function be named, such as the mean or a quantile of the predictive distribution, and scoring functions be used that are consistent for the target functional’.

Patton (2015), page 1, took this argument a step further, by positing that

‘rather than merely specifying the target functional, which narrows the set of relevant loss functions only to the class of loss functions consistent for that functional … forecast consumers or survey designers should specify the single specific loss function that will be used to evaluate forecasts’.

This is a very valid point. Whenever forecasters are to be compensated for their efforts in one way or another, the scoring function ought to be disclosed. To give an example of this best practice, the participants of forecast competitions hosted on the Kaggle platform (www.kaggle.com) are routinely informed about the relevant scoring function before the start of the competition. See, for example, Hong et al. (2014) for a description of the global energy forecasting competition 2012.

However, many situations remain in which point forecasters receive directives in the form of a functional, without an accompanying scoring function being available. This might be because the forecasts are utilized by a myriad of communities, a situation that is often faced by national and international weather centres, because costs and losses are unknown or confidential, because the goal is general methodological development, as opposed to a specific applied task, because interest centres on an understanding of forecasters’ behaviours and performance or simply because of negligence of best practices. In such settings, our findings suggest the routine use of new diagnostic tools in the evaluation and ranking of forecasts, which we call Murphy diagrams. Interest sometimes centres on decompositions of expected or empirical scores into uncertainty, resolution, and reliability components, as studied by DeGroot and Fienberg (1983), Bröcker (2009) and Bentzien and Friederichs (2014), among others. Extensions of Murphy diagrams in these directions may be worthwhile.

As discussed in Section 3.2, nested information sets are sufficient for forecast dominance. However, the converse is not true, in that, if a forecaster dominates another, the respective information sets need not be nested. Specifically, if a forecaster has access to a highly informative explanatory variable, but not to a weakly informative variable, then she may dominate a competitor who can access the weakly informative variable only, even though the information sets are not nested. Explicit examples of this type can readily be constructed. From a broader perspective, it would be of interest to study any implications of forecast dominance on information sets.

Our results also bear on estimation problems, in that scoring functions connect naturally to M‐estimation (Huber, 1964; Koltchinskii, 1997). An interesting observation is that the loss functions that have traditionally been employed for estimation in quantile regression, ordinary least squares regression and expectile regression, namely the asymmetric piecewise linear and squared error scoring functions (6) and (8), correspond to the choice of the Lebesgue measure in the mixture representations (9) and (11) respectively. This is in contrast with binary regression, where estimation is typically based on the logarithmic score, which corresponds to the choice of the infinite measure with density urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0401 in the mixture representation (13), rather than the Lebesgue or uniform measure that yields (half) the Brier score (15). Quite generally, this raises the question of the optimal choice of the loss or scoring function to be used for estimation in regression problems. Focusing on the binary case, Hand and Vinciotti (2003), Buja et al. (2005), Lieli and Springborn (2013) and Elliott et al. (2015) have considered the use of economically justifiable criteria.
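The correspondence between the Lebesgue measure and the asymmetric piecewise linear score can be checked numerically. The sketch assumes the piecewise form of the elementary quantile score (1−α if y ⩽ θ < x, α if x ⩽ θ < y, 0 otherwise) and approximates the integral over θ by a midpoint rule on a bounded interval; the integration limits, grid size and function names are illustrative assumptions.

```python
import numpy as np

def elementary_quantile_score(x, y, theta, alpha):
    # Assumed elementary quantile score; `theta` may be an array of thresholds.
    over = (y <= theta) & (theta < x)
    under = (x <= theta) & (theta < y)
    return (1 - alpha) * over + alpha * under

def pinball_loss(x, y, alpha):
    # Asymmetric piecewise linear score: (1{y < x} - alpha) * (x - y).
    return ((y < x) - alpha) * (x - y)

def lebesgue_mixture(x, y, alpha, lo=-10.0, hi=10.0, m=200_000):
    # Midpoint rule for the integral of the elementary score over theta in [lo, hi].
    step = (hi - lo) / m
    thetas = lo + step * (np.arange(m) + 0.5)
    return elementary_quantile_score(x, y, thetas, alpha).sum() * step
```

For forecast and outcome inside the integration interval, `lebesgue_mixture(x, y, alpha)` agrees with `pinball_loss(x, y, alpha)` up to discretization error. Under the analogous assumed form of the elementary expectile score, the same integral gives |1(y<x)−α|(x−y)²/2, a multiple of the asymmetric squared error.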

The interpretations that were developed in the present paper can help to design economically or societally relevant criteria in more general settings. For example, the elementary expectile score urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0402 in expression (12) depends on x and y via the absolute deviation between the event threshold θ and the observation y only and therefore might be interpreted in terms of the original unit in any applied problem. Owing to the mixture representation (11), any consistent scoring function can be associated with a weighting of thresholds, as encoded by the mixing measure. The choice of the mixing measure requires careful consideration of the decision problem at hand, and it seems difficult to provide general guidance. As noted, squared error corresponds to Lebesgue measure. In applications, non‐uniform measures with finite mass may provide more realistic descriptions. The weighting becomes irrelevant in case there are dominance relationships between competing forecasters, which can be checked via Murphy diagrams. As a caveat, the dominance relationship appears to be strong, and empirical dominance may not be very commonly observed in practice. In such cases, Murphy diagrams can still provide informal clues to critical threshold values θ, which can then be investigated in detail, as illustrated in our inflation example.

Mixture representations of Choquet type can be found for other more general classes of consistent scoring functions. For instance, our results extend to the class of functionals known as generalized quantiles or M‐quantiles (Breckling and Chambers, 1988; Koltchinskii, 1997; Bellini et al., 2014; Steinwart et al., 2014), which subsume both quantiles and expectiles. Related, but more complex, mixture representations apply in the case of scoring functions that are consistent for multi‐dimensional functionals, as recently studied by Fissler and Ziegel (2015).

An interesting question is whether there might be mixture representations in terms of interpretable elementary scores for proper scoring rules. As noted, a scoring rule urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0403 assigns a loss or penalty when we issue the predictive CDF F and y realizes and, for a scoring rule to be proper, the expectation inequalities in expression (22) need to hold. As we have seen, a predictive distribution for a binary variable can be identified with a probability forecast, so representation (13) applies and the answer is well known to be positive in this case. However, an extension from probability forecasts of binary to ternary or general discrete variables does not appear to be feasible, owing to results by Johansen (1974) and Bronshtein (1978) in convex analysis. (In a nutshell, Savage (1971) showed that, in the case of k+1 categories, the proper scoring rules for probability forecasts essentially are parameterized by the convex functions on the unit simplex in urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0404. Johansen (1974) and Bronshtein (1978) proved that if k⩾2 then the extremal members of that class lie dense.) Despite this negative result, a closer look at a popular score is encouraging. Specifically, the widely used continuous ranked probability score (Matheson and Winkler, 1976),
$$\mathrm{CRPS}(F,y)=\int_{-\infty}^{\infty}\{F(\theta)-\mathbb{1}(y\leq\theta)\}^{2}\,\mathrm{d}\theta,$$
equals the integral of the Brier score (15) for the induced probability forecast, namely F(θ), of the binary event {Y ⩽ θ} over all thresholds θ ∈ ℝ. For simplicity, let us assume that F has unique quantiles. We may then invoke the mixture representation (13) along with relationships (14) and (16) to yield
urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0407
An expected or empirical continuous ranked probability score then corresponds to the volume under the surface spanned by the Murphy diagrams for all α‐quantile predictions, or by the Murphy diagrams for all threshold‐determined binary probability forecasts. Depending on the order of integration, the mixture representation recovers the quantile or the threshold decomposition of the continuous ranked probability score (Gneiting and Ranjan, 2011) after evaluating the first integral. More complex weighting schemes depending on θ and α can be employed, for a general family of proper scoring rules that can be economically motivated and justified. Related ideas have recently been put forward in the hydrologic and meteorological literatures (Laio and Tamea, 2007; Bradley and Schwartz, 2011; Smet et al., 2012).
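The threshold decomposition can be illustrated numerically for a normal predictive distribution. The sketch integrates the Brier score of the induced probability forecast F(θ) for the event {Y ⩽ θ} over a bounded range of thresholds and compares the result with the standard closed form of the continuous ranked probability score for the normal distribution. The integration limits, grid size and function names are assumptions made for the illustration.

```python
import math
import numpy as np

def normal_cdf(t, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def crps_by_thresholds(cdf, y, lo, hi, m=100_000):
    """Midpoint-rule integral over theta of the Brier score (F(theta) - 1{y <= theta})^2."""
    step = (hi - lo) / m
    thetas = lo + step * (np.arange(m) + 0.5)
    p = np.array([cdf(t) for t in thetas])   # induced probability forecasts F(theta)
    event = (y <= thetas).astype(float)      # binary outcomes 1{y <= theta}
    return np.sum((p - event) ** 2) * step

def crps_normal(y, mu=0.0, sigma=1.0):
    """Closed form for a normal predictive distribution (standard result)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

With limits far in the tails, for instance `crps_by_thresholds(normal_cdf, 0.5, -10.0, 10.0)`, the numerical integral agrees with `crps_normal(0.5)` up to discretization error.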

Acknowledgements

This work has been funded by the European Union Seventh Framework Programme under grant agreement 290976. We thank the Klaus Tschira Foundation for infrastructural support at the Heidelberg Institute for Theoretical Studies, and we are grateful to four referees for constructive comments on an earlier version of the manuscript.

    Appendix A: Proofs

    The specific structure of the scoring functions in expressions (5) and (7) permits us to focus on the case urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0408 in the subsequent proofs, with the general case α ∈ (0,1) then being immediate.

    A.1. Proof of theorem 1

    In the case of quantiles, the mixture representation (9), the fact that dH(θ)=dg(θ) for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0409, and the relationship H(x)−H(y)=S(x,y)/(1−α) for x>y are straightforward consequences of the fact that, for every urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0410 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0411,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0412
    As the increments of H are determined by S, the mixing measure is unique.
    Turning now to the case of expectiles, we associate with any function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0413 the Bregman‐type function of two variables
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0414 (29)
    Then the mixture representation (11), the fact that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0415 for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0416 and the relationship urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0417 for x>y are immediate consequences of the fact that, for all urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0418 and x<y,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0419
    The case x>y is handled analogously, and the case x=y is trivial. Finally, as the increments of H are determined by S, the mixing measure is unique.

    A.2. Proof of proposition 1

    In the case of the elementary quantile scoring function (10), suppose that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0420, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0421 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0422 are of the form (5) with associated functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0423. Then
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0424
    As urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0425 we have urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0426 if yx, and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0427 if xy, where j=1,2. It follows that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0428 in the first case, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0429 in the second case and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0430 in the third case. This coincides with the value distribution of g(x)−g(y) when urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0431, whence indeed urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0432.
    In the case of the elementary expectile scoring function (12), suppose that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0433, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0434 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0435 are of the form (7) with associated functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0436. Let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0437 be defined as in expression (29). Then
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0438
    Taking left‐hand derivatives with respect to y, we obtain
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0439
    As urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0440, we may apply the same argument as in the quantile case to show that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0441, whence urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0442.

    A.3. Proof of proposition 2

    In the case of the elementary quantile scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0443 in expression (10) suppose first that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0444. Since
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0445
    we have
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0446
    The second factor on the right‐hand side vanishes unless urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0447, and under that condition we have F(θ)⩽α and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0448, whence the desired expectation inequality. The case urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0449 is handled analogously.
    In the case of the elementary expectile scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0450 in expression (12) we assume first that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0451, where t denotes the α‐expectile of F. Since
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0452
    we obtain
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0453
    As the first term on the right‐hand side is strictly increasing in θ and has a unique zero at the α‐expectile of F, the proof can be completed in the same way as above.

    Appendix B: Details for the synthetic example

    Here we give details for the synthetic example that was introduced in Table 2 and discussed throughout Section 3. Table 3 shows analytic expressions for the expected score
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0454
    where F is either the perfect, the climatological, the unfocused or the sign‐reversed forecaster, and the functional T(F) is either the α‐quantile urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0455 or the mean urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0456 of the CDF‐valued random quantity F. The scoring function S is the elementary quantile scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0457 in expression (10) or the elementary scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0458 in expression (12). For example, if X is a quantile forecast for Y at level α ∈ (0,1) then
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0459 (30)
    decomposes into three terms, the first depending on the outcome only, the second depending on the forecast only and the third accounting for the joint distribution. In view of the relationships (14) and (16), the foregoing discussion covers the case of the extremal scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0460 for event probabilities, also.
    Table 3. Expected extremal scores in the prediction space example of Table 2
    Forecast α‐quantile Mean
    F urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0461 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0462
    Perfect urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0463 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0464
    Climatological urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0465 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0466
    Unfocused urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0467 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0468
    Sign reversed urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0469 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0470
    †For α ∈ (0,1) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0471, we let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0472, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0473 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0474, where Φ and φ denote the CDF and the probability density function of the standard normal distribution respectively.

    Discussion on the paper by Ehm, Gneiting, Jordan and Krüger

    Christopher A. T. Ferro (University of Exeter)

    Knowing how well forecasts perform can guide responses to forecasts and inform attempts to improve forecasts. It helps society, therefore, to have good ways of evaluating forecast performance and this paper is a welcome contribution to the field. The authors introduce elegant characterizations of scoring functions that are consistent for quantiles and expectiles, the main advantage of these characterizations being that they support the use of Murphy diagrams to check whether one forecaster dominates another. Do the authors see useful interpretations of Murphy diagrams beyond checks for dominance? I give some thoughts on this below.

    Consider using probability forecasts, p ∈ [0,1], of binary outcomes, y ∈ {0,1}, to decide whether or not to bet on the event {y=1}. Suppose that it costs urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1001 to bet and that the return is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1002 if we bet and y=1. Our profit is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1003 if we bet and y=0, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1004 if we bet and y=1, and 0 if we do not bet, whereas our regret is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1005 if we bet and y=0, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1006 if we do not bet and y=1, and 0 otherwise. If the cost–return ratio urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1007 is θ and we use the Bayes rule to decide whether to bet (i.e. bet if and only if p>θ) then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1008 is our regret expressed as a proportion of the return, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1009. This provides an interpretation of the score curves in Fig. 6(b). For example, for most cost–return ratios, the Survey of Professional Forecasters forecasts make on average about 2% of the return less than a perfect forecaster. Similar interpretations hold for profit matrices corresponding to other forms of bet.

    The overall score is obtained by integrating the score curve over θ after weighting by a measure, H. If H is a distribution function then we can think of it as a distribution of cost–return ratios associated with a population of bettors and the overall score may be interpreted as the average regret (as a proportion of the return) that is felt by this population (Schervish, 1989). The Brier score, for example, is (twice) the average regret (as a proportion of the return) felt by a population of bettors with a uniform distribution of cost–return ratios. Plotting the score curve against H(θ) instead of θ would make the area under the curve equal the overall score.
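
    The statement about the Brier score can be checked numerically. The following is a minimal sketch, not part of the contribution itself. It assumes that the elementary score for a probability forecast equals θ when we bet (p>θ) and y=0, equals 1−θ when we do not bet (p≤θ) and y=1, and is 0 otherwise, which matches the regret as a proportion of the return described above. The function names are hypothetical.

```python
def elementary_binary_score(p, y, theta):
    """Regret as a proportion of the return when betting iff p > theta (assumed form)."""
    if p > theta and y == 0:
        return theta          # we bet and the event did not occur
    if p <= theta and y == 1:
        return 1.0 - theta    # we did not bet although the event occurred
    return 0.0


def integrated_score(p, y, n_grid=20_000):
    """Midpoint-rule integral of the elementary score over theta in (0, 1),
    i.e. the average regret for uniformly distributed cost-return ratios."""
    h = 1.0 / n_grid
    return h * sum(elementary_binary_score(p, y, (i + 0.5) * h) for i in range(n_grid))


for p, y in [(0.3, 0), (0.3, 1), (0.8, 1)]:
    brier = (p - y) ** 2
    # Twice the average regret should agree with the Brier score up to grid error.
    print(p, y, brier, 2.0 * integrated_score(p, y))
```

    Replacing the uniform grid by draws from another distribution H of cost–return ratios would give the corresponding overall score under the same assumptions.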

    For α‐quantiles x, we choose whether or not to bet on the event {y>θ} and so θ is now an event threshold. If the cost and return of the bet are as before then our regret is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1010 if we bet and y≤θ, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1011 if we do not bet and y>θ, and 0 otherwise. If the cost–return ratio is 1−α and we use the Bayes rule to decide whether to bet (i.e. bet if and only if x>θ) then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1012 is again our regret expressed as a proportion of the return. For example, in Fig. 6(c) when urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1013 the regime switching space–time forecasts make on average about 3% of the return less than a perfect forecaster when the cost–return ratio is 0.1. If H is a distribution function then the overall score is the average regret (as a proportion of the return) that is felt by a population of bettors who all have cost–return ratio 1−α and whose event thresholds θ are distributed according to H. Note that the overall score inherits its units from H, so the score may not be dimensionless for other choices of H.
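
    The quantile case can be sketched in the same way. The code below is an illustration under stated assumptions: the elementary quantile score is taken to be 1−α if y≤θ<x (we bet and lose), α if x≤θ<y (we do not bet and miss the return) and 0 otherwise, and the data are made up. Averaging this score over forecast cases at each threshold gives one point of an empirical score curve.

```python
def elementary_quantile_score(x, y, theta, alpha):
    """Regret as a proportion of the return for a bet on {y > theta} (assumed form)."""
    if y <= theta < x:
        return 1.0 - alpha   # bet placed because x > theta, but y <= theta
    if x <= theta < y:
        return alpha         # no bet because x <= theta, but y > theta
    return 0.0


def murphy_curve(forecasts, outcomes, thetas, alpha):
    """Average elementary quantile score at each threshold theta."""
    n = len(outcomes)
    return [
        sum(elementary_quantile_score(x, y, t, alpha)
            for x, y in zip(forecasts, outcomes)) / n
        for t in thetas
    ]


# Hypothetical 0.9-quantile forecasts and outcomes.
forecasts = [1.0, 2.0, 3.0]
outcomes = [0.5, 2.5, 3.5]
print(murphy_curve(forecasts, outcomes, thetas=[0.75, 2.25], alpha=0.9))
```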

    For α‐expectiles x, again we choose whether or not to bet on the event {y>θ} but now our profit depends on |y−θ|. Suppose that it will cost urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1014 to bet and that the return is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1015 if we bet and y>θ. Our regret is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1016 if we bet and y≤θ, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1017 if we do not bet and y>θ, and 0 otherwise. Here, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1018 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1019 are the unit cost and unit return of the bet. If the unit cost–return ratio urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1020 is 1−α and we use the Bayes rule to decide whether to bet (i.e. bet if and only if x>θ) then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1021 is our regret expressed relative to the unit return, but measured in the same units as y. For example, in Fig. 6(a) when θ=2% the Survey of Professional Forecasters forecasts make on average about 0.1% times the unit return less than a perfect forecaster when the unit cost–return ratio is 0.5. Interpretations of the overall score follow as before.
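
    For expectiles the regret scales with the distance between the outcome and the threshold. The sketch below assumes that the elementary expectile score equals (1−α)|y−θ| if y≤θ<x, α|y−θ| if x≤θ<y and 0 otherwise; the function name and the numbers are illustrative only.

```python
def elementary_expectile_score(x, y, theta, alpha):
    """Regret relative to the unit return, in the units of y (assumed form)."""
    if y <= theta < x:
        return (1.0 - alpha) * abs(y - theta)   # bet placed, outcome below threshold
    if x <= theta < y:
        return alpha * abs(y - theta)           # no bet, outcome above threshold
    return 0.0


# Unit cost-return ratio 0.5 (alpha = 0.5), threshold 2.0, forecast 3.0, outcome 1.5:
# we bet because x > theta, and the regret is 0.5 * |1.5 - 2.0|.
print(elementary_expectile_score(x=3.0, y=1.5, theta=2.0, alpha=0.5))
```

    Unlike the quantile case, the result is not a pure proportion: it carries the units of y, as noted above.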

    Finally, as the elementary scores are non‐zero only when the forecast yields a ‘false positive’ result (y≤θ<x or p>θ and y=0) or a ‘false negative’ result (x≤θ<y or p≤θ and y=1) we can distinguish the contributions from these two types of error on the Murphy diagrams. For example, Fig. 8 shows that the relatively poor performance of the probit forecasts is due largely to producing more false negative results than the Survey of Professional Forecasters forecasts.

    Fig. 8. Murphy diagram for the probability of recession: empirical scores for the probit and Survey of Professional Forecasters forecasts, and contributions from false positive results
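
    The split into false positive and false negative contributions can be sketched as follows for probability forecasts. This is an illustrative sketch, assuming the elementary score is θ for a false positive (p>θ, y=0) and 1−θ for a false negative (p≤θ, y=1); the data and the function name are hypothetical.

```python
def murphy_decomposition(probs, outcomes, theta):
    """Return (false positive part, false negative part, total) of the
    average elementary score at threshold theta."""
    n = len(outcomes)
    false_pos = sum(theta for p, y in zip(probs, outcomes)
                    if p > theta and y == 0) / n
    false_neg = sum(1.0 - theta for p, y in zip(probs, outcomes)
                    if p <= theta and y == 1) / n
    return false_pos, false_neg, false_pos + false_neg


# Hypothetical probability forecasts and binary outcomes.
probs = [0.9, 0.2, 0.6, 0.1]
outcomes = [1, 1, 0, 0]
print(murphy_decomposition(probs, outcomes, theta=0.5))
```

    Evaluating the function over a grid of θ values and plotting the two parts separately gives a decomposition of the kind shown in the figure.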

    I hope that these ideas add to the utility of Murphy diagrams and I gladly propose the vote of thanks.

    Philip Dawid (University of Cambridge)

    It might have been thought that there is little new to say about point estimation. This impressive paper demonstrates that this is not so. In particular, when a functional of a distribution has the special property of being elicitable, we can bring to bear a whole battery of useful techniques for motivating and evaluating its assessment.

    Here I concentrate on the more theoretical aspects of this work. The authors show that, for some special cases of elicitable functional—mean, percentile, expectile—the property of being a consistent scoring function can be fully characterized, and such a scoring function can be uniquely expressed as a mixture of a one‐dimensional family of extremal scoring functions. I shall show how these results may be extended to other cases.

    Consider a random variable Y taking values in some space urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1022, and a family urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1023 (I prefer to use P, rather than the authors’ F) of distributions over urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1024. Suppose given a real‐valued functional T on urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1025. (I shall use t, rather than x, for a value or estimate of the functional; I also suppose that T is single valued.) Suppose further that there is an identifying function V(t,y) such that
    1. V(t,y) is non‐decreasing in t and
    2. t=T(P) satisfies V(t,P)=0

    where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1026. This is the case for the examples that are treated in the paper, as follows: mean,

    V(t,y)=(t−y);

    α‐quantile,

    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1027;

    α‐expectile,

    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1028

    More generally, under differentiability conditions, if we have a consistent scoring function S(t,y) for T(P) we can take V(t,y)=∂S/∂t.

    Now, for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1029, define
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1030

    Proposition 3. urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1031 is a consistent and order sensitive scoring function for T.

    Proof. Denote urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1032 by S(t,P). Then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1033.

    1. Consistency: first note that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1034.
      1. If θ≤T(P), then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1035.
      2. If θ>T(P), then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1036. In either case, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1037. So urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1038 is consistent.
    2. Order sensitivity: if θ≤T(P), then V(θ,P)≤0, so urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1039 is non‐increasing in t. Hence urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1040. Also, if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1041, then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1042. If θ>T(P), then, for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1043, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1044. Also, since urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1045 is non‐decreasing in t, so, for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1046.
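
    The consistency argument can be illustrated numerically for the mean. The displayed definition of the elementary score is not reproduced in this copy, so the sketch below rests on an assumption: it takes the score to be V(θ,y) if θ<t and 0 otherwise, a form for which the steps of the proof go through. The identifying function V(t,y)=t−y is the one given above for the mean; the discrete distribution and all function names are hypothetical.

```python
def identifying_mean(t, y):
    """Identifying function for the mean, V(t, y) = t - y."""
    return t - y


def elementary_score(t, y, theta, V):
    """ASSUMED elementary score: V(theta, y) if theta < t, else 0."""
    return V(theta, y) if theta < t else 0.0


def expected_score(t, theta, V, support, probs):
    """Expected elementary score of the report t under a discrete distribution."""
    return sum(q * elementary_score(t, y, theta, V) for y, q in zip(support, probs))


# Hypothetical discrete distribution P with mean 1.25.
support = [0.0, 1.0, 4.0]
probs = [0.5, 0.25, 0.25]
true_mean = sum(q * y for y, q in zip(support, probs))

candidates = [0.25 * i for i in range(-4, 21)]       # reports t from -1 to 5
thetas = [0.125 + 0.25 * i for i in range(-4, 20)]   # thresholds between them


def mean_is_optimal(theta):
    """True if reporting the mean attains the smallest expected score on the grid."""
    at_mean = expected_score(true_mean, theta, identifying_mean, support, probs)
    best = min(expected_score(t, theta, identifying_mean, support, probs)
               for t in candidates)
    return at_mean <= best + 1e-12


print(all(mean_is_optimal(theta) for theta in thetas))
```

    Passing a different non‐decreasing identifying function as V would allow the same check for other functionals, again under the assumed form of the score.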

    Corollary 3. Let H be a measure on urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1047, and define

    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1048(31)
    Then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1049 is a consistent and order sensitive scoring function.

    Note that we can re‐express formula (31) as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1050(32)

    Now formula (32), with the absolutely continuous choice dH(t)=h(t) dt (h≥0), recovers (modulo an unimportant additive function of y) the classes of consistent scoring functions for the cases that are considered in the paper. For the α‐percentile, we obtain formula (5) of the paper on taking urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1051, whereas for the α‐expectile (the mean, when urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1052), formula (7) of the paper arises on taking urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1053 and integrating by parts. (I note, however, that, if H also has a discrete component, equation (32) yields further consistent scoring functions that are not of the given form.)

    On differentiating equation (32) with respect to t we see that H is uniquely determined by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1054. Reverting to the form (31), we deduce that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1055 is uniquely expressible as a mixture over the set of extremal consistent scoring functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1056.

    A remaining question is whether every consistent scoring function for T is equivalent to one of the form (31) or (32). This will be so under appropriate conditions (Osband, 1985; Osband and Reichelstein, 1985). When it is, we can apply the general methods of the paper (Murphy diagrams etc.) to more general elicitable functionals.

    I have one final query about the necessity of elicitability. It is easy to show that the functional urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1057 is not elicitable, and the same will hold for any reasonable measure of spread. Does that mean that we should not attempt to measure the spread of a distribution?

    It will be apparent that I have been greatly stimulated by this paper. I have much pleasure in seconding the vote of thanks.

    The vote of thanks was passed by acclamation.

    K. Mitchell and C. A. T. Ferro (University of Exeter)

    A notable aspect of the authors’ analysis is the expression of consistent scoring functions for quantiles and expectiles as Choquet representations: weighted combinations of certain extremal scoring functions. Such a characterization is distinct from (though related to) the more familiar Savage characterization of consistent scoring functions (see, for example, Savage (1971), Osband (1985), Gneiting and Raftery (2007), Gneiting (2011) and Frongillo and Kash (2014)).

    The authors query whether similar Choquet representations may be found for proper scoring rules. When the observation is a binary variable, such a characterization is well known having been derived by Shuford et al. (1966) and extended by Schervish (1989) (see also Lindley (1982)). Unfortunately, the authors anticipate a negative result for multivalued observations.

    Some insight may be gained by considering the elicitation of statistical properties for a binary observation X ∈ {0,1}. Consider the particular statistical property which takes as its value the interval [(k−1)/N,k/N] if and only if p ∈ [(k−1)/N,k/N] where p is the forecast probability that X=1. Lambert (2013) derived a representation for all consistent scoring functions for this property, which we write here as follows: S is a consistent scoring function for the interval property if and only if
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1058(33)
    where g is a non‐decreasing function. If, as N→∞, k/N→p, then (with appropriate restrictions on S and g),
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1059
    which is equivalent to the Choquet representation
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1060

    Obtaining a result similar to equation (33) for probability vectors would, therefore, allow for a Choquet representation of proper scoring rules for multivalued observations. But, key to the derivation of equation (33) is that the intervals [(k−1)/N,k/N] admit a total order ([(k−1)/N,k/N]≤[(j−1)/N,j/N]⇔k≤j) and under this total order S is order sensitive.

    Intervals of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1061 for n>1 admit only a partial order, frustrating attempts to obtain a result such as equation (33) for n>1. This result seems further to negate the possibility of a Choquet representation for all proper scoring rules of multivalued observations.

    The authors’ final equation, however, neatly exemplifies another approach: a forecast for a multivalued observation is replaced by a sequence of forecasts for a binary observation and the proper scoring rule is represented as an aggregation of the proper scoring rules of the separate binary observations; then, the proper scoring rule for each binary observation is replaced by its Choquet representation. If such a reduction is available in general, it seems that a proper scoring rule for a multivalued observation admits a Choquet representation with respect to a product measure. A formal treatment of this approach would be of much interest.

    Frank Critchley (The Open University, Milton Keynes), Paul Marriott (University of Waterloo) and Radka Sabolova and Germain Van Bever (The Open University, Milton Keynes)

    In welcoming tonight's paper, we would like to indicate potentially fruitful points of contact with information geometry and, in so doing, to highlight questions that arise naturally in this connection. Essentially, this involves working with pairs of distributions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1062, rather than scalars (x,y) (see Section 3), in particular, identifying a realization y with the cumulative distribution function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1063 degenerate at that value.

    Although applicable more generally, for brevity, we focus our remarks on the mean consistent scoring function S given by equation (2). When ϕ is strictly convex, S has the form of the type of divergence defined by Bregman (1967). Divergence measures such as these are a cornerstone of information geometry and have found wide application in optimization problems, image and signal processing, machine learning and statistical inference. Again, there are important dualities among divergence functions and their corresponding affine parameters, and strong links with convex analysis. For further details on information geometry and divergences see, for example, Kass and Vos (2011), Critchley and Marriott (2015), and the extensive references therein.
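
    As a small illustration of the Bregman form, the sketch below evaluates φ(y)−φ(x)−φ′(x)(y−x) for two convex choices of ϕ. It assumes that this standard Bregman expression is what equation (2) of the paper denotes; the function name and the example values are hypothetical.

```python
import math


def bregman_score(x, y, phi, dphi):
    """Bregman-type score phi(y) - phi(x) - phi'(x) * (y - x).

    phi is a convex function and dphi its derivative (or a subgradient);
    the result is nonnegative and vanishes when x == y.
    """
    return phi(y) - phi(x) - dphi(x) * (y - x)


# phi(t) = t^2 recovers squared error: 9 - 1 - 2 * (3 - 1) = 4 = (3 - 1)^2.
print(bregman_score(1.0, 3.0, lambda t: t * t, lambda t: 2.0 * t))

# phi(t) = exp(t) gives a different, asymmetric member of the same family.
print(bregman_score(1.0, 3.0, math.exp, math.exp))
print(bregman_score(3.0, 1.0, math.exp, math.exp))
```

    The two exponential scores differ, which shows that members of the family need not be symmetric in forecast and outcome.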

    As with the wide choice of consistent scoring functions so, also, there is a profusion of divergences that could be used. As in the paper, appropriate guidance is, then, of potentially great value in a wide variety of contexts. We therefore ask the following questions.
    1. Restricting attention to strictly convex ϕ, are there useful analogues for scoring functions of well‐known duality and convex analysis results for divergences?
    2. Conversely, do Bregman divergences enjoy mixture representations, analogous to those for scoring functions? If so,
      1. might the paper be developed to provide guidance in choosing between them, and
      2. can such extremal mixture representations be used to good effect in, for example, optimization contexts?; more particularly, in solving associated stationary equations?
    3. The elegant directness of the proof of theorem 1 prompts the natural questions: what is the most general result of this kind? Further, what other instances of it are of potential interest? In particular, just as mixture representations based on expectiles have direct analogues for quantiles, are there analogous representations for non‐Bregman divergences?

    In closing, it is a pleasure to thank the authors for their stimulating paper.

    Kent Osband (RiskTick, Birmingham)

    How can we incentivize diligent, truthful forecasts x of expected outcomes y when we cannot distinguish ex post discrepancies from random errors? Contracts that generate maximal expected pay‐offs for truthful reporting, regardless of the detailed beliefs, are known as consistent scoring rules. Remarkably, for most statistics of practical interest, every convex function ϕ can be mapped to a consistent scoring rule and vice versa. Specifically, the subgradient to ϕ(x) describes the pay‐offs from reporting x, the subgradient's value at y=X marks the expected pay‐off when the best forecast is X and the highest expected pay‐off is ϕ(X).

    Although this result has been known for decades, relatively little progress has been made in identifying appropriate ϕ or even systematically studying the trade‐offs. In search of insight, this paper takes a novel tack. After controlling for location and scale, it decomposes each ϕ into a mixture of primitives max(x−θ,0) over the various possible thresholds θ; the mixing weights are effectively urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1064. For each primitive, the corresponding penalty score works out to 0 when x and y are on the same side of θ and |y−θ| when they are on opposite sides of θ.
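
    The penalty attached to a single primitive can be verified directly. The sketch below assumes that the penalty is the gap between ϕ(y) and the tangent line of ϕ at the forecast x, with ϕ(x)=max(x−θ,0) and the subgradient taken to be 1 for x>θ and 0 otherwise. These conventions and the function names are assumptions made for illustration.

```python
def primitive(x, theta):
    """The primitive convex function max(x - theta, 0)."""
    return max(x - theta, 0.0)


def primitive_subgradient(x, theta):
    """A subgradient of the primitive at x (0 is chosen at the kink x == theta)."""
    return 1.0 if x > theta else 0.0


def primitive_penalty(x, y, theta):
    """phi(y) minus the tangent line of phi at the forecast x, evaluated at y."""
    return (primitive(y, theta) - primitive(x, theta)
            - primitive_subgradient(x, theta) * (y - x))


theta = 2.0
print(primitive_penalty(3.0, 1.0, theta))  # opposite sides: |1 - 2| = 1
print(primitive_penalty(1.0, 3.0, theta))  # opposite sides: |3 - 2| = 1
print(primitive_penalty(3.0, 4.0, theta))  # same side: 0
print(primitive_penalty(0.0, 1.0, theta))  # same side: 0
```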

    This is a neat way of thinking about scores. It links every urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1065 to the value of knowing whether the outcome is greater or less than θ. Given these values, it is straightforward to construct ϕ.

    Here are a few thoughts on how to value urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1066. First, note that the expected cost from a small forecast error δ around the true X is approximately urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1067. If forecaster j has predictive distribution urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1068 with mean urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1069 and variance urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1070 close to the lower bound given by Fisher information urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1071, then the expected cost of forecast imprecision is approximately urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1072. If forecaster j can take what amounts to urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1073 independent observations at cost urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1074, her chosen urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1075 will satisfy urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1076.

    Next, imagine that the people soliciting the forecasts have some prior beliefs about x that amalgamate the various forecaster beliefs on a broadly equal basis and, moreover, that they wish to encourage roughly the same diligence n by all forecasters. It then makes sense to equate each urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1077 to a common multiple of the aggregate Fisher information I(θ).

    Debashis Paul (University of California at Davis) and Anandamayee Majumdar (Soochow University, Suzhou)

    We congratulate the authors for their excellent characterization of quantiles and expectiles and for the lucid exposition of the ideas. Furthermore, they have developed an innovative tool for comparing point predictions by making use of these characterizations. The graphical method (Murphy diagram) proposed is attractive because of its simplicity and highly powerful because of its distribution‐free nature. Specifically, the authors propose to compare point predictions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1078 based on univariate data Y, by comparing expected elementary score functions, with respect to the joint distribution urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1079:
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1080(34)
    where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1081 denotes the elementary quantile score and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1082 denotes the elementary expectile score, defined in equations (10) and (12) respectively. In expression (34) we treat urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1083 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1084 as functions of (α,θ) to emphasize the point that it may be beneficial to categorize prediction methods by their relative performance with respect to one quantile (say, the median) versus another (say, the 95th percentile), with analogous comparisons in terms of expectiles. Determining the range of α over which a particular prediction method is dominant can be important especially when forecasting errors have asymmetric implications, e.g. when the variable of interest is the amount of precipitation, or the concentration of a pollutant.

    Comparisons in terms of α can also be helpful for narrowing down a class of statistical models designed to capture certain characteristics of the distribution of observations. As a specific example, we consider the class of double‐zero expectile normal processes that were introduced by Majumdar and Paul (2015). Such a process is a stationary process with sub‐Gaussian tail behaviour such that the pth expectile of marginal distributions is 0 for a given p ∈ (0,1). The parameter p controls the degree and direction of asymmetry of the marginal distributions. We formulated a Bayesian inference procedure based on Markov chain Monte Carlo sampling in a spatial regression setting where the residual process is a double‐zero expectile normal process and found that accurate estimation of p is challenging. Comparisons of predictions corresponding to different values of p, based on empirical versions of the expected score functions in expression (34), can facilitate determination of a range of plausible p (or an informative prior for p). This process can be enhanced further through specification of a range of α, depending on whether the ability to predict the central part of the data or the extremes is considered more important.

    Roberto Casarin (University Ca’ Foscari, Venice) and Francesco Ravazzolo (Free University of Bozen‐Bolzano)

    The authors are to be congratulated on their excellent intuition, which has culminated in the development of quite a general approach to consistent scoring of competing forecasting models. Their approach based on extremal scoring functions is in the spirit of the existing literature and in particular of the seminal papers of Gneiting and Ranjan (2011, 2013). We believe that the proposed consistent scoring functions can be applied to achieve new density calibration and density combination schemes.

    Consider the forecast distributions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1085 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1086 from two predictive models and F the distribution of Y. Following the notation that was used by the authors, one could consider the map
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1087(35)
    where X is a point forecast, α a quantile level, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1088 a threshold parameter and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1089 a combination or calibration parameter vector. If the parameter ξ is indexing a family of distributions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1090, with urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1091 a family of non‐decreasing functions with g(0)=0 and g(1)=1, then we have a calibration scheme. If ξ is indexing the family of distributions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1092 then we obtain a combination scheme. Optimal calibration and optimal combination can be defined as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1093(36)
    As a first example we consider the perfect (full curve) and sign‐reversed (chain curve) forecasters given in Table 2 of the paper. We assume a beta calibration scheme and let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1094 be the incomplete beta function with parameters ξ=(μϕ,(1−μ)ϕ), and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1095 the quantile of the calibrated sign‐reversed model where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1096 and ξ(θ) is a solution of expression (36). Then the Murphy diagram of the calibrated model is
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1097
    and is given in Fig. 9. As a second example we consider two biased forecasters urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1098 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1099 with distribution N(−1,1) and N(2,2) respectively and use the extremal scoring rule to find the optimal combination. The Murphy diagram of the optimal predictive pooling model is
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1100
    (Fig. 10), where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1101 is the α‐quantile of ν Φ(q+1)+(1−ν) Φ{(q−2)/√2}.
    Fig. 9. Murphy diagrams for (a) α=0.9 and (b) α=0.8, for the perfect and sign‐reversed forecasters given in Table 2 of the paper, and for the calibrated forecaster obtained by applying a beta calibration function to the sign‐reversed forecaster
    Fig. 10. Murphy diagrams for (a) α=0.9 and (b) α=0.8 for the perfect, biased and combined forecasters
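
    The α‐quantile of the pooled distribution ν Φ(q+1)+(1−ν) Φ{(q−2)/√2} has no closed form for 0<ν<1, but it is easy to compute numerically. The following is a minimal sketch using bisection; the function names, the search bracket and the tolerance are assumptions, and the optimization over ν in the second example is not attempted here.

```python
import math


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pooled_cdf(q, nu):
    """Linear pool of N(-1, 1) and N(2, 2), where 2 denotes the variance."""
    return nu * norm_cdf(q + 1.0) + (1.0 - nu) * norm_cdf((q - 2.0) / math.sqrt(2.0))


def pooled_quantile(alpha, nu, lo=-50.0, hi=50.0, tol=1e-10):
    """alpha-quantile of the pooled CDF by bisection (the CDF is increasing in q)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pooled_cdf(mid, nu) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# With nu = 1 the pool is N(-1, 1), so the median is -1; with nu = 0 it is 2.
print(pooled_quantile(0.5, 1.0), pooled_quantile(0.5, 0.0))
print(pooled_quantile(0.9, 0.3))
```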

    In their paper, the authors sketch some possible extensions. We recommend as a further exciting and stimulating research line the use of consistent scoring functions for model combination and/or model calibration with the aim of improving on Mitchell and Hall (2005), Hall and Mitchell (2007), Geweke and Amisano (2010, 2011), Billio et al. (2013) and Fawcett et al. (2015). We are therefore very pleased to thank the authors for their work.

    The following contributions were received in writing after the meeting.

    Miguel de Carvalho (Pontificia Universidad Católica de Chile, Santiago) and António Rua (Banco de Portugal, and NOVA School of Business and Economics, Lisbon)

    We congratulate the authors for this thought‐provoking lesson for forecasters. In the space that is available we focus on discussing the possibility of using summary measures based on Murphy diagrams for suggesting ‘optimal’ ways of combining forecasts. In principle one would expect that in many settings of applied interest the performance of competing forecasters would be more like the inflation example in Section 4.1, where the Survey of Professional Forecasters dominates for some values of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1102 but not for others. Typically in cases where there is no clear‐cut forecast dominance, one might wonder how the Murphy diagram of forecast combinations compares. For example, how does the Murphy diagram of the average of both forecasts, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1103, compare with the Survey of Professional Forecasters urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1104 and Michigan urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1105 forecasts? As can be seen in Fig. 11(a), the average of forecasts performs better on some regions of Θ but not on others. One could ask: ‘Is there any other convex combination performing “better”? How do we define “better” in terms of the Murphy diagram?’ To approach these questions consider the forecast combination urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1106, and, extending ideas from Section 3.3, define the area under the Murphy diagram and the maximum of the Murphy diagram respectively as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1107
    where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1108 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1109 respectively denote the minimum and maximum of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1110, and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1111 with w ∈ [0,1]. Smaller values of these summaries of the Murphy diagram are compatible with good forecast accuracy. Indeed, if there was a value of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1112 for which the combination of forecasts coincided with the data, then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1113 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1114. Thus, a natural way of defining the ‘best’ convex linear combination of forecasts by using Murphy diagrams is as urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1115, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1116, or through the minimax criterion urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1117. We call urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1118 a Murphy optimal combination forecast. For example, for the inflation forecasts urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1119 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1120; also, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1121, whereas A(1)=0.41 (Survey of Professional Forecasters), A(0)=0.49 (Michigan) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1122 (mean of forecasts); in addition, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1123, B(1)=0.163, B(0)=0.195 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1124. See Figs 11(b) and 11(d) for the plots of A(w) and B(w) over the interval [0,1].
    Fig. 11. (a) Murphy diagrams for the inflation example (Survey of Professional Forecasters; Michigan; average combination forecast; Murphy optimal combination forecast, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1125), (b) area under the Murphy diagram, (c) Murphy diagrams for the inflation example (Survey of Professional Forecasters; Michigan; Murphy optimal combination forecast, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1126) and (d) maximum of the Murphy diagram
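    The two summaries and the resulting weights can be sketched numerically. The minimal Python sketch below assumes mean forecasts, so that the paper's elementary score for the expectile at level 1/2 applies. It approximates the area by the trapezoidal rule on a user‐supplied grid of θ and searches over a grid of weights w. The function names, the grids and the toy data are illustrative assumptions, not the discussants' implementation or the inflation data.

```python
import numpy as np


def elementary_mean_score(x, y, theta):
    """Elementary score for the mean (expectile at level 1/2).

    Equals |y - theta| / 2 when min(x, y) <= theta < max(x, y), else 0.
    """
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    return 0.5 * np.abs(y - theta) * ((lo <= theta) & (theta < hi))


def murphy_curve(x, y, thetas):
    """Average elementary score at each theta: one curve of a Murphy diagram."""
    return np.array([elementary_mean_score(x, y, t).mean() for t in thetas])


def murphy_summaries(w, x1, x2, y, thetas):
    """Return (A(w), B(w)) for the combination w * x1 + (1 - w) * x2."""
    s = murphy_curve(w * x1 + (1.0 - w) * x2, y, thetas)
    # Trapezoidal rule on the supplied grid (an approximation of the area).
    area = float(np.sum(0.5 * (s[1:] + s[:-1]) * np.diff(thetas)))
    return area, float(s.max())


def murphy_optimal_weights(x1, x2, y, thetas, weights):
    """Grid search: weights minimizing the area and the maximum, respectively."""
    summaries = np.array([murphy_summaries(w, x1, x2, y, thetas) for w in weights])
    w_area = float(weights[summaries[:, 0].argmin()])
    w_max = float(weights[summaries[:, 1].argmin()])
    return w_area, w_max


# Toy data (hypothetical): forecast 1 is perfect, forecast 2 is biased by +1.
y = np.array([0.0, 1.0])
x1, x2 = y.copy(), y + 1.0
thetas = np.linspace(-1.0, 3.0, 4001)
weights = np.linspace(0.0, 1.0, 11)
w_area, w_max = murphy_optimal_weights(x1, x2, y, thetas, weights)
```

    In this toy example both criteria put all weight on the perfect forecast. A grid search is used only for transparency; any one‐dimensional optimizer over w ∈ [0,1] could be substituted.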

    Shih‐Kang Chao and Guang Cheng (Purdue University, West Lafayette)

    We congratulate the authors for an inspiring piece of work. The paper motivates several intriguing and practically important research directions. First, we point out an interesting connection between the Murphy diagram and the ‘backtest’ of Christoffersen (1998), which is a popular procedure for validating value at risk (VaR). We start from a simple simulation study that compares two VaR‐predictors (for τ=0.01) through the Murphy diagram. Let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1127 follow
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1128(37)
    Two ways of predicting VaR (inspired by Christoffersen (1998)) given knowledge of expression (37) are
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1129(38)
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1130(39)
    To evaluate VaR‐estimation, overall coverage and error independence are checked through urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1131 with 1(·) the indicator function, i.e.
    (a) urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1132 and
    (b) urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1133 is independent of the σ‐algebra generated by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1134.
    As checked by the tests proposed in Christoffersen (1998), expressions (38) and (39) satisfy (a), whereas they differ in (b). Interestingly, the elementary scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1135 used in the Murphy diagram chimes with these two VaR‐standards, where
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1136(40)
    By replacing x and y in equation (40) by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1137 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1138 respectively and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1139, we find that term (i) measures ‘coverage’, i.e. condition (a), and that term (ii) measures how well x mimics the dynamics of y, i.e. condition (b).

    The Murphy diagram (Fig. 12) demonstrates that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1140 ‘dominates’ urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1141. This dominance pattern can be linked with standards (a) and (b).

    Fig. 12. Murphy diagram urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1142 with the scores urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1143 defined in expression (40), showing curves for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1144 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1145 and a vertical line at θ=−5.337

    θ ∈ [−5.337,−2]

    The large score deviation in this region is mainly due to term (i) caused by the fixed value of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1146—for most t, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1147, whereas urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1148.

    θ ∈ (−2,6]

    The score deviation is much smaller in this region because terms (i) and (ii) are similar for the two VaRs: term (i) because of condition (a), and term (ii) because urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1149 for most t.
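    A comparison of this kind can be sketched in a few lines of Python. Expressions (37)–(40) are not reproduced above, so the sketch makes its own assumptions: a Gaussian AR(1) series with coefficient 0.5, a fixed predictor equal to the unconditional τ‐quantile, a dynamic predictor equal to the conditional τ‐quantile, and the elementary quantile score from the paper. All names and parameter values are hypothetical and are not the discussants' simulation design.

```python
import numpy as np
from statistics import NormalDist


def elementary_quantile_score(x, y, theta, alpha):
    """Elementary quantile score (1{y < x} - alpha) * (1{theta < x} - 1{theta < y})."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((y < x) - alpha) * ((theta < x).astype(float) - (theta < y))


def murphy_curve(x, y, thetas, alpha):
    """Average elementary score at each theta: one curve of a Murphy diagram."""
    return np.array([elementary_quantile_score(x, y, t, alpha).mean() for t in thetas])


# Assumed data-generating process (not expression (37)): Gaussian AR(1).
rng = np.random.default_rng(0)
alpha, phi, n = 0.01, 0.5, 2000
z = NormalDist().inv_cdf(alpha)                 # standard normal alpha-quantile
eps = rng.standard_normal(n + 1)
series = np.empty(n + 1)
series[0] = eps[0] / np.sqrt(1.0 - phi**2)      # start in the stationary distribution
for t in range(1, n + 1):
    series[t] = phi * series[t - 1] + eps[t]

y = series[1:]
var_fixed = np.full(n, z / np.sqrt(1.0 - phi**2))   # unconditional quantile, constant in t
var_dynamic = phi * series[:-1] + z                 # conditional quantile given the past

thetas = np.linspace(-6.0, 6.0, 241)
curve_fixed = murphy_curve(var_fixed, y, thetas, alpha)
curve_dynamic = murphy_curve(var_dynamic, y, thetas, alpha)
```

    Plotting `curve_fixed` and `curve_dynamic` against `thetas` gives a Murphy diagram for this assumed setting; whether it resembles Fig. 12 depends on how close the assumptions are to expressions (37)–(39).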

    Another direction is the detection of changes in the dominance pattern between two point forecasts. For example, the estimation quality for the two VaRs in expressions (38) and (39) may change when urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1150 no longer follow an AR(1) model. Suppose that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1151 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1152 are two estimates for a certain quantile of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1153, and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1154 dominates urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1155 at t=0. This is equivalent to urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1156 for all θ, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1157. To determine the critical point urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1158 after which urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1159 dominates urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1160, we may consider a cumulative sum statistic (Siegmund, 1985):
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1161(41)
    We conjecture that the dominance relationship changes if urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1162 for some number b. Of course, an appropriate choice of (b,T) requires further study.
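    To make the idea concrete, the following Python sketch monitors elementary score differences with a cumulative sum. Statistic (41) is not reproduced above, so the sketch substitutes a standard one‐sided CUSUM recursion (Page's scheme) at a single fixed θ; it is therefore only an assumed stand‐in, and the function names, the threshold b and the choice of θ are hypothetical.

```python
import numpy as np


def elementary_quantile_score(x, y, theta, alpha):
    """Elementary quantile score (1{y < x} - alpha) * (1{theta < x} - 1{theta < y})."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((y < x) - alpha) * ((theta < x).astype(float) - (theta < y))


def score_differences(x1, x2, y, theta, alpha):
    """d_t = S_theta(x1_t, y_t) - S_theta(x2_t, y_t); negative values favour x1."""
    return (elementary_quantile_score(x1, y, theta, alpha)
            - elementary_quantile_score(x2, y, theta, alpha))


def page_cusum_alarm(d, b):
    """First index t at which W_t = max(0, W_{t-1} + d_t) exceeds b, else None.

    An upward drift in d_t means that x1 has started to score worse than x2.
    """
    w = 0.0
    for t, d_t in enumerate(d):
        w = max(0.0, w + float(d_t))
        if w > b:
            return t
    return None
```

    For example, `page_cusum_alarm(score_differences(x1, x2, y, theta, alpha), b)` returns the first time at which the accumulated evidence against the first forecast exceeds b. Extending the check to all θ, for instance by taking a maximum over a grid, is one possibility but would need the further study that the discussants mention.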

    H. M. Christensen (University of Oxford)

    As a meteorologist, I found this a very interesting and stimulating paper, with important ramifications for the verification of weather forecasts. Apart from the obvious usefulness of the newly proposed Murphy diagrams, the paper clarifies the importance of specifying the user's functional of interest, and not simply the forecast scenario.

    Consider two forecasters competing for business with a wind energy company. The company requires warning if wind speeds will exceed 60 m.p.h., as in this case they must act if they want to prevent damage to the turbines. Their cost–loss ratio is 0.05.

    Forecaster A presents the company with the probability that the winds will exceed 60 m.p.h. Forecaster B presents the company with the 95th quantile of his forecast probability distribution function, which can be compared with the cut‐off wind speed. Both forecasts are tailored to the same scenario, but they fall into the two different classes of point forecasts outlined in the paper.

    The energy company can rank the two forecasters by comparing their expected profits when decisions are made by using each forecast in turn, given the stated cost–loss ratio and cut‐off wind speed. In addition, for forecaster A, they can use a Murphy diagram to consider how the probability forecast would perform at a variety of cost–loss ratios θ for the given cut‐off threshold. This could be of interest if the turbines became cheaper to replace, or the cost of electricity changed. For forecaster B, they can use a Murphy diagram to consider how the quantile forecast would perform at a variety of wind cut‐offs, θ—important if developments allowed safe use of turbines at higher wind speeds.
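    The ranking at the stated cost–loss ratio and cut‐off can be illustrated with a short Python sketch. It assumes the simplest cost–loss model: acting costs C, an unprotected exceedance costs L, and the company acts when forecaster A's probability exceeds C/L or when forecaster B's 95% quantile exceeds the cut‐off. The numbers and names are hypothetical toy values, not data from any forecaster.

```python
import numpy as np


def mean_expense(act, event, cost, loss):
    """Average expense: pay `cost` when acting, `loss` when an event is missed."""
    act, event = np.asarray(act, dtype=bool), np.asarray(event, dtype=bool)
    return float(np.mean(cost * act + loss * (event & ~act)))


cutoff, cost, loss = 60.0, 1.0, 20.0           # cost-loss ratio 1/20 = 0.05
wind = np.array([40.0, 61.0, 63.0, 80.0])      # hypothetical observed speeds (m.p.h.)
prob_a = np.array([0.01, 0.10, 0.02, 0.50])    # forecaster A: P(wind > cutoff)
q95_b = np.array([50.0, 65.0, 62.0, 70.0])     # forecaster B: 95% quantile (m.p.h.)

event = wind > cutoff
expense_a = mean_expense(prob_a > cost / loss, event, cost, loss)  # act if p > 0.05
expense_b = mean_expense(q95_b > cutoff, event, cost, loss)        # act if q95 > 60
```

    The two decision rules coincide when both forecasts come from the same continuous predictive distribution, since the exceedance probability is above 0.05 exactly when the 95% quantile is above the cut‐off. Lower average expense corresponds to higher expected profit.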

    However, it is not possible to compare the performance of forecasters A and B except at the original threshold and cost–loss ratio, unless the full predictive probability distribution function were available, in which case equation (16) could be invoked. The user must specify which type of point forecast is of more use to them, to allow for fair comparison and to test whether one point forecast dominates the other. It is likely that the dominance of one forecaster over the other depends on the point forecast requested. In explicitly stating which point forecast is required (probability or quantile), the competing forecasters are provided with an additional goal to use in improving their forecast model and calibrating the resultant forecasts.

    Matei Demetrescu (University of Kiel)

    I shall gladly elaborate on some ideas of this thought‐provoking contribution. First, an extension of theorem 1, parts (a) and (b), for what could be called the urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1163‐class, p=3,4,…, is sketched. Second, an economic interpretation complementing that given in Section 2.3 is provided for generic scoring functions. (Throughout, regularity conditions are assumed implicitly.)

    The class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1164 builds on the asymmetric power function (see for example Elliott et al. (2005))
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1165(42)
    the same way that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1166 relates to asymmetric linear (p=1) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1167 to asymmetric quadratic (p=2) scoring functions. The optimal point forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1168 under urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1169 is the unique root of
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1170(43)
    and parallels the usual expectile (p=2) for p>2. The class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1171 consists of scoring functions consistent with such ‘higher order expectiles’.
    To characterize urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1172, let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1173 have positive, non‐decreasing pth‐order derivative. Then,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1174(44)
    has the same solution as equation (43). Rewrite equation (44) as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1175
    with
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1176
    implying
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1177
    Concretely,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1178
    with
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1179(45)
    In the proof of theorem 1, use urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1180 from equation (45) instead of the Bregman‐type Φ (equation (29)), where
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1181
    with
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1182
    which suffices for the result; uniqueness follows on suitably restricting γ. Interpreting urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1183 for p=3,4,… is beyond the scope of this note.

    Instead, consider the following economic interpretation of some S(x,y), for which a Choquet representation need not exist explicitly. Let Y model the distribution of the net monetary pay‐off of an uncertain investment; invest when the predicted pay‐off T(F) is larger than the pay‐off of doing nothing, i.e. T(F)>0. This parallels the decision rules given in Section 2.3 but with different thresholds. Considering alternatively a utility‐based framework, one invests only when the expected utility E[U(Y)] of the investment is larger than the utility U(0) of the certain pay‐off, i.e. E[U(Y)−U(0)]>0.

    Now, the predicted pay‐off T(F) satisfies
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1184
    with
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1185
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1186
    Therefore, choosing
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1187(46)
    as utility function leads to the same decision under both rules. One might then extend the interpretation of the utility function (46) to S(x,y): for instance, a risk neutral, linear U(y) corresponds to urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1188 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1189, matching risk neutrality where only the expected pay‐off matters.

    Francis X. Diebold (University of Pennsylvania, Philadelphia)

    The paper is stimulating and welcome. It most definitely pushes us forward. Given the space constraints, I shall focus on just one of its themes: the ranking of imperfect forecasts. The authors emphasize that Choquet‐type mixture representations give rise to simple checks of whether one point forecast dominates another in the sense that it is preferable under any consistent loss function. Econometricians have grappled for some time with related dominance issues and have developed tests for whether one forecast ‘stochastically dominates’ another; see Corradi and Swanson (2013), Lee et al. (2014) and Jin et al. (2015), and the references therein.

    Unfortunately, however, the typical situation (indeed noted by the authors) is that no forecast dominates all others. Hence the issue of ‘loss function choice’ remains, and rankings of competing forecasts will in general depend on the loss function chosen. Typically, little thought is given to the loss function L(e). Instead, Gauss's centuries‐old quadratic loss, L(e)=e², remains routinely invoked, primarily for mathematical convenience.

    Against this background, Diebold and Shin (2015) approached the accuracy ranking problem directly, basing rankings on the entire distribution of e. Recognizing that any reasonable loss function must satisfy L(0)=0, they studied accuracy measures based on direct comparison of F(e) (the cumulative distribution function of e) and F*(e) (the unit step function at 0),
    $$F^{*}(e) = \begin{cases} 0, & e < 0, \\ 1, & e \ge 0. \end{cases}$$
    (F*(e) is the error cumulative distribution function that corresponds to perfect forecasts. If e is governed by F*(e), then e=0 with probability 1.) The idea is simply to compare F(e) with F*(e) and to favour the forecast whose F(e) is ‘closest’ to F*(e). Formally, we favour the forecast with smallest ‘stochastic error distance’ SED, given by
    $$\mathrm{SED}(F, F^{*}) = \int_{-\infty}^{0} F(e)\,\mathrm{d}e + \int_{0}^{\infty} \{1 - F(e)\}\,\mathrm{d}e.$$

    The SED‐approach yields useful insights with important practical implications. In particular, it directs attention away from squared error loss and towards absolute error loss. Indeed the key result of Diebold and Shin (2015) is that SED‐loss equals expected absolute error loss, i.e. SED(F,F*)=E(|e|). Among other things, this suggests shifting attention away from conditional mean forecasts and towards conditional median forecasts.
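    The equality between the stochastic error distance and expected absolute error can be checked numerically. The Python sketch below assumes that the distance is the area between the cumulative distribution function of the errors and the unit step function at 0, and evaluates that area exactly for an empirical distribution; the function name and the sample errors are illustrative assumptions.

```python
import numpy as np


def stochastic_error_distance(errors):
    """Area between the empirical cdf of the errors and the unit step at 0."""
    e = np.sort(np.asarray(errors, dtype=float))
    pts = np.unique(np.append(e, 0.0))           # breakpoints of both step functions
    left, width = pts[:-1], np.diff(pts)
    # Both functions are constant on [left, left + width), so the integral is a sum.
    f_emp = np.searchsorted(e, left, side="right") / e.size
    f_star = (left >= 0.0).astype(float)
    return float(np.sum(np.abs(f_emp - f_star) * width))


errors = np.array([-2.0, 1.0, 3.0])              # hypothetical forecast errors
sed = stochastic_error_distance(errors)
mae = float(np.mean(np.abs(errors)))
```

    For the empirical distribution of any error sample the two quantities agree up to rounding, which is the sample analogue of the result quoted above.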

    Dalia Ghanem (University of California at Davis)

    To choose between two point forecasts, typically presented as a quantile or expectile of a predictive distribution, a decision maker employs a scoring function. For practical decision making settings, the choice of scoring function may play a crucial role in ranking these two forecasts. The practical problem, however, is that there is usually no justification for choosing one consistent scoring function over many others. Ehm and his co‐authors propose a theoretically sound and empirically attractive method to choose between competing point forecasts. From a practical perspective, one of the key advantages of the method proposed is that it admits a graphical representation that is easily interpretable, specifically a Murphy diagram.

    There are two key results in this paper. The first is that any consistent scoring rule for a functional of interest can be represented as a mixture over a one‐parameter family of extremal scoring functions. The second key result states that comparing the sample analogue of the extremal scoring function at a finite number of points is sufficient to establish that an empirical forecast dominates another. These two results together bring forward an intuitive and tractable method to compare competing forecasts without relying on a particular choice of scoring function.

    In practice, one empirical forecast may dominate another over part of the support of the extremal scoring function, whereas the latter dominates over another part of the support. In these settings, valid inference plays a crucial role. The current paper uses pointwise confidence bands. The authors do not discuss why pointwise inference is appropriate. Naturally, it depends on the inference question that one seeks to answer. One argument in favour of pointwise inference is that each point in the support of the extremal scoring function corresponds to a particular scoring function, and the researcher seeks to test whether the two forecasts lead to equal values for each particular scoring function or not. If that is so, then one may use pointwise inference but must account for multiple testing. In contrast, uniform inference may be appropriate if one is interested in comparing the extremal scoring functions ‘as a whole’. This is especially important since the points at which the extremal scoring functions are evaluated are realizations of random variables and their number increases with sample size. A careful treatment of these inference issues is beyond the scope of the current paper but would be a valuable direction for future research.

    Wolfgang K. Härdle (Humboldt‐Universität zu Berlin and Singapore Management University) and Chen Huang (Humboldt‐Universität zu Berlin)

    When evaluating and ranking the performance of forecasters, the choice of the scoring function to measure loss properly is a crucial question, which has motivated the authors of this paper to establish general guidance on the issue. For statistical functionals, for example the mean functional, or quantiles and expectiles more generally, a representation of consistent scoring functions is provided as a weighted mixture over extremal scores. In other words, all the consistent scoring functions can be reduced in this way to a one‐dimensional family of extremal scores. The authors have also shown that such representations can be considered as Choquet representations. Consequently, one point forecast dominates another if it is preferred in terms of each elementary score. This leads to a simple and uniform standard for assessing different point forecasts, avoiding the influence that might be caused by the choice of scoring function. In this paper the authors focus on the cases of quantile and expectile functionals and give intuitive economic interpretations for them.

    For empirical measures, it suffices to compare forecasts in terms of the averages of the extremal scores. The results of the comparison between forecasts are clearly visible in the so‐called Murphy diagrams, which show the average scores over samples and also the score differences together with pointwise confidence bands. The paper illustrates the approach with three examples, in line with the meteorological and economic studies in three previous works, to assess whether the benchmarks consistently outperform the alternatives.

    We congratulate and thank the authors for their breakthrough on evaluating forecasting performance of quantiles and expectiles. It would be nice if their methodology could be applied to forecast tail event curves (conditional quantile or expectile curves in the context of functional data analysis). Studying extreme behaviour by tail event curves has attracted increasing attention recently in many applications, such as temperature and wind forecasting, derivative pricing and risk management. Taking information on tails into consideration can help to improve forecasting precision. It will definitely become a widely used tool in many areas of industry.

    Hajo Holzmann (Philipps‐Universität Marburg) and Bernhard Klar (Karlsruher Institut für Technologie, Karlsruhe)

    The authors have provided us with an interesting paper on the comparison of distinct forecasts of quantiles and expectiles. Recently, Gneiting (2011) made a strong point that such forecast comparisons should be based on consistent scoring functions. However, there are a variety of consistent scoring functions for both quantiles and expectiles, and empirical forecast rankings may depend on the particular choice.

    Therefore, the authors introduce the notion of forecast dominance for all consistent scoring functions and show that, in the case of quantiles or expectiles, this can be checked by considering appropriate one‐parameter families of elementary scoring functions.

    We feel that this notion of forecast dominance is very strong and can often not be expected to hold even in situations where one forecast should clearly be favoured over another. Consider the example given in Table 2 in the paper, and let us focus on forecasting the median. Introduce another forecaster, call her extreme, who issues 10μ. Murphy diagrams for the expected score as well as for empirical scores for samples of sizes n=30 are given in Fig. 13. The curves of the perfect and the extreme forecaster all touch each other at ϑ=0, and the empirical curves in Figs 13(b)–13(d) actually all intersect, even in the ‘reasonable’ subinterval [−1,1] of values for the parameter ϑ. We simulated the probability that the Murphy diagrams intersect (not only touch) in the interval [−1,1] for distinct sample sizes; the results are in Table 4. Whereas the probability converges to 0 for the climatological and the sign‐reversed forecasters, it actually comes close to 1 for the extreme forecaster. In contrast, the probability that the classical median score, the absolute loss, of the perfect forecast exceeds that of one of the other forecasters quickly tends to 0, as shown in Table 5.

    Fig. 13. Murphy diagram for the perfect, climatological, sign‐reversed and extreme forecasters: (a) theoretical scores; (b)–(d) scores for distinct samples of size n=30
    Table 4. Relative frequency (%) of an intersection of the Murphy curve with that of the perfect forecaster in the interval [−1,1] for various sample sizes in 10000 simulations
    n      Climatological    Sign‐reversed    Extreme
    10     63.5              21.5             85.2
    20     58.4              8.2              91.9
    30     51.2              3.0              94.4
    50     39.1              0                96.6
    100    20.8              0                98.1
    200    5.2               0                98.9
    Table 5. Relative frequency (%) of exceedance of the quantile score of the perfect forecaster over that of the respective forecaster for various sample sizes in 10000 simulations
    n      Climatological    Sign‐reversed    Extreme
    10     8.7               0.6              0
    20     2.9               0                0
    30     0.8               0                0
    50     0                 0                0

    Thus, although Murphy diagrams are useful descriptive tools for comparing forecasts, the formal notion of forecast dominance which involves a large family of consistent but not strictly consistent scoring functions should be used with care.
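    A simulation of this type can be sketched as follows. The Python code assumes the set‐up commonly used for this example, namely μ standard normal and Y given μ normal with mean μ and unit variance, so that the perfect median forecast is μ and a competitor issues a multiple of μ (10 for the extreme, 0 for the climatological and −1 for the sign‐reversed forecaster). It reads ‘intersect’ as the score difference taking both strictly positive and strictly negative values on [−1,1]. These readings, the function names and the number of replications are assumptions, so the output need not reproduce Table 4.

```python
import numpy as np


def murphy_curve(x, y, thetas, alpha=0.5):
    """Average elementary quantile score at each theta."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sign = (y < x) - alpha
    return np.array([np.mean(sign * ((t < x).astype(float) - (t < y))) for t in thetas])


def curves_cross(x_a, x_b, y, lo=-1.0, hi=1.0, alpha=0.5, tol=1e-12):
    """True if the two empirical Murphy curves strictly cross within [lo, hi]."""
    # The curves are right-continuous step functions that jump only at forecasts
    # and observations, so these points (plus lo) capture every value on [lo, hi].
    knots = np.concatenate([x_a, x_b, y, [lo]])
    knots = np.unique(knots[(knots >= lo) & (knots <= hi)])
    d = murphy_curve(x_a, y, knots, alpha) - murphy_curve(x_b, y, knots, alpha)
    return bool((d > tol).any() and (d < -tol).any())


def crossing_frequency(n, factor, reps=200, seed=0):
    """Share of samples in which the forecast factor * mu crosses the forecast mu."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        mu = rng.standard_normal(n)
        y = mu + rng.standard_normal(n)          # assumed: Y | mu ~ N(mu, 1)
        hits += curves_cross(mu, factor * mu, y)
    return hits / reps
```

    Calling, for instance, `crossing_frequency(30, 10.0)` estimates the crossing frequency for the extreme forecaster at n=30 under these assumptions. Curves that merely touch, with a difference of exactly zero, are not counted as crossing.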

    Ian Jolliffe (University of Exeter)

    I add my congratulations to the authors on a stimulating paper. It is fitting that they have named their key diagram the Murphy diagram. Before his untimely death in 1997, Allan Murphy not only did a tremendous amount of work in the field of what atmospheric scientists call forecast verification, but he was also a central figure in bringing together the statistical and climate communities in a series of 3‐yearly International Meetings on Statistical Climatology, the 13th of which is scheduled to take place in 2016.

    I have just two practical questions on the paper. I may be wrong but it seemed to me that a dominated forecast is somewhat like an inadmissible decision strategy in decision theory. Typically, consideration of admissibility eliminates some strategies but rarely leaves only one admissible strategy. Similarly, I would have expected dominance rarely to give an unequivocal best forecast. However, the paper has examples of single best forecasts and I was wondering whether the authors could give some idea of how often this happens.

    Different aspects of a forecast may be of practical interest, even to the same user, e.g. reliability and resolution in the decomposition of the Brier score for probability forecasts (Murphy, 1973) or mean‐square error and correlation for continuous forecasts (Murphy, 1988). One forecasting system may do better for one aspect, but a different forecasting system may be better on another. Do the authors have any thoughts on what should be done in such situations, if anything, beyond simply recording relative performance of the forecasting systems for each aspect?

    Victor Richmond R. Jose (Georgetown University, Washington DC)

    I congratulate the authors for an excellent, well‐written paper that contributes to both the theory and the practice of probabilistic forecast evaluation. I envision that the Murphy diagrams the authors introduced will be widely adopted as a diagnostic tool in practice. My two comments below explore some potential extensions and future work related to Section 3 of the paper.
    1. One of the topics that has been alluded to but perhaps not given much attention is that of calibration. For quantiles, calibration can be studied by matching the percentage of times that realizations fall above or below reported forecasts to α. For expectiles, the notion of calibration is not as direct but perhaps can be generated if one were to view expectiles as quantiles of a closely related function (Jones, 1994). In some earlier works on forecast dominance in the probability setting, the notion of calibration appears to be inseparable from dominance (e.g. Schervish (1989)). One then wonders whether the notion of calibration is relevant when dealing with quantiles and expectiles. If so, could calibration be meaningfully incorporated in this setting?
    2. Some generalizations and extensions come to mind when dealing with order sensitivity and dominance. For a scoring rule S, order sensitivity could be generalized by carefully defining a distance function d associated with S with urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1199. This may be a good starting exercise to think about how these concepts can be extended more generally as newer definitions of quantiles and expectiles appear, for example, in the multivariate setting.
    With respect to forecast dominance, alternative definitions and generalizations of dominance exist in the literature as the authors point out. A slightly different definition that I would like to bring to attention is related to the problem of informativeness. Consider two distributions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1200 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1201 about Y; what conditions are needed for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1202 to attain a lower minimum score under S compared with urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1203, i.e. when would
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1204
    where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1205 for k=1,2? Jose and Winkler (2009) showed that the necessary and sufficient condition for this is that the two distributions must be stochastically ordered in the dilation sense (Shaked and Shanthikumar, 2007). It would be interesting to see whether such a type of dominance could also be extended to the expectile scoring rules as well as the extremal versions of both rules.
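    Returning to the first comment, the basic calibration check for quantile forecasts is easy to state in code. The Python sketch below compares the share of realizations falling below the reported quantiles with the nominal level α; the function names and toy numbers are illustrative assumptions, and no analogous check for expectiles is attempted.

```python
import numpy as np


def quantile_coverage(q, y):
    """Share of realizations falling strictly below the reported quantile forecasts."""
    return float(np.mean(np.asarray(y, dtype=float) < np.asarray(q, dtype=float)))


def calibration_gap(q, y, alpha):
    """Empirical coverage minus the nominal level alpha (near 0 if calibrated)."""
    return quantile_coverage(q, y) - alpha


y_obs = np.array([1.0, 2.0, 3.0, 4.0])     # hypothetical realizations
q_med = np.full(4, 2.5)                    # hypothetical median forecasts
gap = calibration_gap(q_med, y_obs, 0.5)
```

    A gap close to zero is compatible with unconditional calibration at level α; it says nothing about dominance, which is the open question raised in the comment.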

    Alex Lenkoski and Thordis L. Thorarinsdottir (Norwegian Computing Center, Oslo)

    We congratulate the authors on an excellent paper which addresses several issues that we regularly encounter in our work. The Murphy diagram offers an informative way of assessing predictive performance and we are interested to hear the authors’ thoughts regarding its use to improve the forecast in settings where there is not a single dominating forecast. To illustrate this, we consider the daily return on IBM's stock from January 3rd, 1962, to December 24th, 2015: a time series of 13589 points.

    Let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1206 be the return on day t. We consider forecasting models of the form urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1207 where
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1208(47)
    This simplistic model is closely related to some more complicated methodologies that are frequently employed in time series forecasting, including dynamic linear models and exponential downscaling (West and Harrison, 1999). By setting ν<1, it provides a parsimonious model that incorporates time dynamics in the volatility process and serves as the starting point for more complicated stochastic volatility models.

    The challenge with model (47) is in choosing the tuning parameter ν. As an example, Fig. 14 shows the Murphy diagram for ν=0.96 and ν=0.99 for the 0.5%‐quantile of the daily returns where the scores are averaged over the final 10000 time points. For very low values of θ, the model with ν=0.96 performs better as the more rapidly decaying model performs better at times with high volatility. Similarly, as θ increases, the inclusion of more historic information in model (47) improves the performance. These results are significant at the 95% level (the results are not shown).

    Fig. 14. Murphy diagram comparing model (47) under ν=0.96 and ν=0.99 for 0.5%‐quantile predictions

    In this situation, it seems that either the parameter ν should be chosen dynamically or the forecasts could be combined, potentially outperforming each constituent method. Do you have any suggestions for how to use results such as that shown in Fig. 14 to create a new, improved forecast?

    Han Liu (Princeton University)

    I congratulate the authors for making this thought‐provoking contribution. In particular, I found the discussion of the Choquet‐type mixture representation in Section 1 and Section 2.2 very intriguing. Here I comment on possible implications of this representation to the field of statistical machine learning. Specifically, I first present a new supervised learning framework based on the mixture representation. I then raise several questions that remain to be answered.

    For simplicity, consider a regression setting, in which we observe independent and identically distributed samples urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1209 where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1210 is the response variable and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1211 is a d‐dimensional covariate. Denote (Y,X) to be the population quantities. Let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1212 be the squared error loss and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1213 be a measurable function; define
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1214
    to be the predictive risk of f. Let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1215 be a set of functions; define
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1216
    to be the oracle function with respect to urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1217. In the machine learning community, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1218 is in general deemed to be the optimal predictor within the model class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1219. An important machine learning task is to construct an estimator urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1220 based on the samples urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1221 which satisfies
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1222
    as n→∞.
    It is obvious that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1223 is optimal only with respect to the squared error loss. As has been insightfully pointed out by the authors, the squared error loss is just a special case of a larger family of scoring functions
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1224
    It is known that, under weak regularity conditions, all the scoring functions within urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1225 are consistent for the mean functional. In addition, the authors show that any scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1226 can be represented as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1227
    where H is a non‐negative measure. This representation motivates us to define a new prediction oracle within the class urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1228 that is simultaneously robust to a set of scoring functions:
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1229
    In this definition, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1230 is a set of non‐negative measures. An empirical risk minimizer is defined as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1231
    Several interesting questions need to be addressed.
    1. Under what conditions of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1232 is the empirical risk minimizer urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1233 computationally tractable?
    2. In which situations is urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1234 preferred to urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1235?
    3. What is the optimal rate of convergence for estimating urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1236? This rate may depend on the volumes of both urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1237 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1238.

    Jorge Mateu (University Jaume I, Castellón)

    The authors are to be congratulated on a valuable contribution and thought‐provoking paper on this timely topic of point prediction that forecasters have to face in a variety of transdisciplinary problems. As the authors state, there is an ample consensus that forecasts ought to be probabilistic in nature, taking the form of predictive probability distributions over future quantities or events.

    I shall focus on a particular methodological problem to add some food for thought. The problem relates to evaluating and comparing competing forecasts, and how a consistent scoring function can be developed and properly used. Motivated by the approach of Strähl and Ziegel (2015) for time series, consider the case where we do have spatial dependence.

    Analysis of environmental phenomena for risk assessment usually involves the construction of indicators that are related to structural characteristics of extremal events defined by exceedances over critical thresholds. Recurrence and persistence, among others, are examples of such characteristics which provide information about the distribution patterns of extremal events. Formally, these concepts are intimately related to the geometrical characteristics of the excursion sets defined by threshold exceedances over a given (bounded) domain (Angulo and Madrid, 2010). Given the fragmented nature of threshold exceedance sets, depending on the variation properties that are inherited by sample paths from the probabilistic structure of a random field and on the threshold considered, marked point processes provide a powerful framework for the analysis of their structural properties (Madrid et al., 2012). This approach can be exploited to help to establish the bridge between the construction and interpretation of risk indicators and the properties of the underlying random field generating critical events. Connected components of a threshold exceedance set can be treated as single isolated events, with some geometrical properties such as size and contour roughness being considered as possible marks of interest for complementary analysis of diverse forms of heterogeneity and anisotropy. Hence, a variety of marked point process characteristics can be used to describe some features of interest, in particular for risk assessment purposes.

    Marked point process models provide predictive probability distributions of these types of exceedances. When facing the problem of point prediction of such an exceedance in a region of interest, based on neighbouring spatial dependence, extremal scoring functions for quantiles or even expectiles are of key interest. Open questions, such as how to define appropriate scoring functions and how to select a scoring function that dominates its competitors, are then worth considering.

    Xiaochun Meng (University of Oxford)

    The authors are to be congratulated on an excellent paper. It is interesting to see that the Murphy diagram can be used to compare estimators for the full set of proper scoring functions. My research focuses on quantile estimation for time series, and in this context the authors note that the quantile regression check loss function is not the only proper scoring rule. Three issues concern me. Firstly, I am unsure how one should proceed if the Murphy diagram does not show clear dominance of one method over another. Perhaps dominance in one part of the diagram is preferable to dominance in another. Secondly, I am interested in the implications of the authors’ work for the empirical average scoring function, which is obtained by taking the sum of the scoring function for each day in a certain period. For a typical time series of daily financial returns, the distribution of the return on one day is unlikely to be exactly the same as the distribution of the return on a different day, and these distributions are likely to be notably different if there is a reasonable period of time between the two days. When evaluating a quantile forecasting method by using a post‐sample period of, say, 500 days, it is not clear to me whether the check loss function still serves as a proper scoring function. Thirdly, I am curious whether the authors’ work on evaluation measures has implications for the choice of loss function to use in model estimation.
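
    The second issue can be made concrete with a small simulation. The sketch below is illustrative only: it assumes the elementary quantile score in the form recalled from the paper (1−α if y ≤ θ < x, α if x ≤ θ < y, and 0 otherwise), and the synthetic returns, the volatility process and the two forecasts are hypothetical. It also uses the identity that the check loss is the integral of the elementary scores over θ with respect to Lebesgue measure.

```python
from statistics import NormalDist

import numpy as np


def elementary_quantile_score(x, y, theta, alpha):
    """Elementary score for the alpha-quantile at threshold theta.

    Equals 1 - alpha if y <= theta < x, alpha if x <= theta < y, else 0.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((y < x) - alpha) * (1.0 * (theta < x) - 1.0 * (theta < y))


def check_loss(x, y, alpha):
    """Asymmetric piecewise linear (check) loss for the alpha-quantile."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((y < x) - alpha) * (x - y)


def murphy_curve(x, y, thetas, alpha):
    """Average elementary score at each threshold in thetas."""
    return np.array(
        [elementary_quantile_score(x, y, t, alpha).mean() for t in thetas]
    )


# Hypothetical data: the return distribution changes from day to day
# through a time-varying scale sigma_t (an assumption for illustration).
rng = np.random.default_rng(0)
n, alpha = 500, 0.05
sigma = np.exp(0.5 * rng.normal(size=n))
y = sigma * rng.normal(size=n)

z = NormalDist().inv_cdf(alpha)
x_cond = sigma * z                            # tracks the daily distribution
x_static = np.full(n, np.quantile(y, alpha))  # one number for all days

thetas = np.linspace(-8.0, 2.0, 501)
curve_cond = murphy_curve(x_cond, y, thetas, alpha)
curve_static = murphy_curve(x_static, y, thetas, alpha)

print("mean check loss (conditional, static):",
      check_loss(x_cond, y, alpha).mean(),
      check_loss(x_static, y, alpha).mean())
print("share of grid where the conditional forecast scores no worse:",
      np.mean(curve_cond <= curve_static))
```

    Each day's score is evaluated against that day's own forecast, so the average does not require the daily distributions to coincide. A finite grid of thresholds only approximates a dominance check.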

    Edgar Merkle (University of Missouri, Columbia)

    I congratulate the authors on a clear paper that impacts a variety of disparate fields. I especially appreciate the notion of forecaster dominance and see many applications to the evaluation of both forecasters and general statistical models that make probabilistic predictions.

    It seems that further connections could be made between dominance and forecast aggregation or combination. If forecaster A dominates forecaster B (in the sense of definition 2), then does this imply that forecaster A is better than all convex combinations of forecasters A and B? Previous researchers (e.g. DeGroot and Fienberg (1983) and Schervish (1989)) hinted that we cannot immediately answer this question even when we know that one forecaster dominates the other.

    This and related questions can be empirically studied via the Merkle and Steyvers (2013) data that the authors discuss around Fig. 4. There, the authors find two dominance relationships:
    1. forecaster ID 3 empirically dominates forecasters ID 6 and ID 8, and
    2. forecaster ID 5 empirically dominates forecaster ID 10.

    For both these relationships, the mean Brier score of the dominating forecaster is better than the mean Brier score when we average the dominating forecaster with the corresponding dominated forecasters (for example, forecaster ID 3 has a better Brier score than the average of forecasters ID 3, ID 6 and ID 8). However, when we include other forecasters in the average, results can change. For example, the mean Brier score that is associated with the average of all 10 forecasters is better than the mean Brier score that is associated with the average of all forecasters except ID 6 and ID 8. So forecasters ID 6 and ID 8 can help the full group, even though they are dominated by another forecaster within the group. It seems that we would have to tell a forecast consumer that forecaster ID 3 is clearly better than forecasters ID 6 and ID 8, but we should still keep forecasters ID 6 and ID 8 around to counteract the others.

    In addition to helping us to discern dominance, perhaps the empirical scores could be used to develop some weighting of forecasters that leads to the lowest score, on average, as θ goes from 0 to 1. Such developments may prove useful for producing aggregate forecasts that are robust to choice of scoring rule, which is an interest of many forecast consumers.
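
    A minimal sketch of this kind of comparison follows. It assumes the elementary score for probability forecasts in the form recalled from the paper (a charge of θ when the event fails although the forecast exceeds θ, a charge of 1−θ when the event occurs although the forecast is at most θ). The forecasters and the data are synthetic and hypothetical, not the Merkle and Steyvers (2013) data. Under this form, integrating the elementary score uniformly over θ in (0,1) gives half the Brier score, so a search for weights that minimize the uniform average over θ reduces to a search for Brier-optimal weights.

```python
import numpy as np


def elementary_prob_score(p, y, theta):
    """Elementary score for a probability forecast p of a binary outcome y.

    Charges theta if y = 0 and p > theta, 1 - theta if y = 1 and
    p <= theta, and 0 otherwise.
    """
    p, y = np.asarray(p, dtype=float), np.asarray(y)
    miss_when_no_event = np.where((y == 0) & (p > theta), theta, 0.0)
    miss_when_event = np.where((y == 1) & (p <= theta), 1.0 - theta, 0.0)
    return miss_when_no_event + miss_when_event


def murphy_curve(p, y, thetas):
    """Average elementary score at each threshold in thetas."""
    return np.array([elementary_prob_score(p, y, t).mean() for t in thetas])


def brier(p, y):
    """Mean Brier score."""
    return np.mean((np.asarray(p, dtype=float) - np.asarray(y)) ** 2)


# Hypothetical forecasters: A is a mildly perturbed version of the true
# event probability q, B is a strongly perturbed version.
rng = np.random.default_rng(1)
n = 2000
q = rng.uniform(size=n)
y = (rng.uniform(size=n) < q).astype(int)
p_a = np.clip(q + 0.05 * rng.normal(size=n), 0.0, 1.0)
p_b = np.clip(q + 0.30 * rng.normal(size=n), 0.0, 1.0)
p_avg = 0.5 * (p_a + p_b)  # equally weighted combination

for name, p in [("A", p_a), ("B", p_b), ("average of A and B", p_avg)]:
    print(name, "mean Brier score:", brier(p, y))

thetas = np.linspace(0.005, 0.995, 199)
gap = murphy_curve(p_a, y, thetas) - murphy_curve(p_avg, y, thetas)
print("A no worse than the combination on the whole grid:",
      bool(np.all(gap <= 0)))
```

    The grid check is an approximation; an exact empirical check would evaluate the curves at the finitely many thresholds at which they can change.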

    Monica Musio (Università degli Studi di Cagliari)

    In the following comments I use the same notation as in the paper.

    Suppose that T(F) is an estimable functional, and S(t,Y) is a consistent scoring function for T. This is just a loss function for a decision problem, whose Bayes act (for simplicity here assumed unique), when Y ∼ F, is T(F), by consistency. By general theory (see Dawid (1986)) this implies that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1240 is a proper scoring rule, whose expectation urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1241 when Y ∼ G is minimized (though in general not uniquely) by the honest choice F=G.

    Now (see Dawid et al. (2016)), given any proper scoring rule urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1242, and a parametric family urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1243, we can construct an associated M‐estimator of θ. For independent identically distributed data urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1244, this is given by
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1245
    where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1246 is the empirical distribution function of the data. In the case at hand, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1247. By consistency, this is minimized for any θ such that
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1248
    i.e. the M‐estimation method associated with any consistent scoring function for T is simply to equate the population and the sample values of the functional T.
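
    As a worked special case, offered as a sketch in notation that is assumed here rather than taken from the displays above, let T be the mean functional and S(t,y) = (t − y)², and write S*(F,G) for the expected score of T(F) when Y ∼ G. With the empirical distribution in the role of G, the criterion decomposes as follows.

```latex
\begin{align*}
S^*(F_\theta, \hat F_n)
  &= \frac{1}{n} \sum_{i=1}^{n} \bigl( T(F_\theta) - y_i \bigr)^2 \\
  &= \bigl( T(F_\theta) - \bar y \bigr)^2
     + \frac{1}{n} \sum_{i=1}^{n} ( y_i - \bar y )^2 ,
  \qquad \bar y = \frac{1}{n} \sum_{i=1}^{n} y_i = T(\hat F_n).
\end{align*}
```

    The second term does not involve θ, so the criterion is minimized by any θ for which the mean of the model distribution equals the sample mean, provided that such a θ exists in the family.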

    Andrew Patton (New York University and Duke University, Durham)

    I congratulate the authors on an insightful and stimulating paper. Their ‘Choquet representation’ opens the way to new ways of thinking about forecast comparison, model estimation and more. By showing how a class of loss functions (scoring rules) can be represented as a mixture of ‘elementary scores’, where the measure governing the mixture is indexed by a scalar parameter, consumers (and producers) of forecasts may gain new insights into whether one forecast dominates another. I think that ‘Murphy diagrams’ will make a useful addition to the empirical researcher's toolkit.

    Like all estimated quantities, though, it is important to consider sampling variation. In some applications, this may be less important: in Fig. 6(d) we observe that the difference in average elementary scores crosses zero, indicating that the ranking of these two forecasts is sensitive to the choice of the elementary score parameter (θ), and the need for methods of inference is not so strongly felt. In Fig. 6(e), however, we observe that the difference in average elementary scores is non‐positive for all values of θ, hinting that one forecast (the Survey of Professional Forecasters forecast, in this case) may be preferred to the other (a forecast based on a probit model) for any consistent scoring rule. The pointwise confidence intervals that are given in Fig. 6(e) are clear and easily interpreted guidelines, but for a formal test of forecast dominance (i.e. dominance across the range of all elementary score parameters) we need a joint test. Considering such tests would be very difficult without the authors’ Choquet representation and, indeed, as the authors note, they remain challenging even with this new tool. But similar hypotheses arise in other applications, e.g. in testing for first‐ and second‐order stochastic dominance of asset returns (see Linton et al. (2010) for a recent contribution to this literature), and perhaps the methods that are used in those applications may be adapted for use here.

    As the authors note, it may not be common to observe forecast dominance in practice, and I agree with their conjecture. With the growing number of on‐line surveys, and as economic (and other) surveys grow and gather respondents, however, one could imagine that forecast dominance may become more relevant in the future. Implementing tests of forecast dominance may also improve decision making: combination forecasts have been found to work well in many economic applications (see Timmermann (2006) for a survey), and these combinations can be improved by excluding truly awful individual forecasts. Tests of forecast dominance, perhaps taking into account the multitude of individual forecasts in such an application, could yield great gains.

    Justinas Pelenis (Institute for Advanced Studies, Vienna)

    I congratulate the authors on this timely and interesting contribution to the forecast ranking literature. For the cases of expectiles and quantiles they present a novel and very important result demonstrating how any consistent scoring function of a quantile or expectile would admit a unique representation as a mixture of elementary elements. This important result limits the need to explore a variety of consistent scoring functions and allows us to focus simply on the elementary scoring functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1249 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1250.

    The usefulness of this mixture representation becomes apparent when we consider ranking competing forecasts of a target functional. If forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1251 is preferred over forecast urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1252 for all elementary scoring functions, then urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1253 is ranked above urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1254 for any consistent scoring function of that functional. The authors claim that

    ‘the weighting becomes irrelevant in the case that there are dominance relationships between competing forecasts, ...’

    suggesting that the choice of the mixing measure H(·) for forecast comparison is irrelevant when there is a dominance relationship between urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1255 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1256.

    Nonetheless, I suggest that the mixing measure might still be of relevance even when there is a dominance relationship between competing forecasts. The mixing measure might matter when we consider M‐estimation methods based on consistent scoring functions. Suppose that we are interested in the α‐expectile and that model‐based forecasts from two competing misspecified models A and B are to be considered. It is plausible that M‐estimation based on a given consistent scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1257 would lead to forecasts urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1258 dominating urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1259, whereas M‐estimation based on a given consistent scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1260 would lead to forecasts urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1261 dominating urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1262. It seems plausible that the dominance could be reversed depending on the consistent scoring function that is used for estimation. The finding that one set of forecasts dominates another might then be an artefact of the mixing measure that is used for estimation, and the weighting could be considered relevant even in cases when dominance is observed.

    It is true that the methods that are considered in this paper apply only to the comparison of forecasts and not to the comparison of models; nonetheless a practitioner might be interested in estimating a set of models and producing model‐based forecasts. Therefore, one should explore the consequences of the choice of the mixing measure H(·) and I second the authors’ suggestion that ‘... the question of the optimal choice of loss or scoring function to be used for estimation in regression problems’ is of considerable interest for future research.

    Pierre Pinson (Technical University of Denmark, Lyngby)

    The authors are to be congratulated for an intriguing and exciting contribution to the science of forecasting and more particularly to forecast verification. Some fundamental contributions to this field can be traced back to the 1950s. Even so, some of the key issues in forecasting still appear to be open problems today. I mainly think here of the link between forecast quality and value. Those were elegantly defined by Murphy (1993), in short, as the objective correspondence of the forecasts with the process observations for the former, and as the increased benefits from integrating such forecasts in decision processes for the latter. Linking these two is definitely not trivial in a general sense, even though for specific example problems and toy models one can nicely illustrate an existing (or non‐existing) connection. For forecasters having to deal with a wealth of decision makers with different decision problems and loss functions, it can never be possible to consider all potential problems and loss functions to assess the value of their forecast to these decision makers.

    Consequently, the intriguing nature of that contribution lies in the fact that the authors investigated and found another rather elegant path to make a better connection between forecast quality and value. Elementary scoring functions and Choquet‐type mixture representations allow defining a simple toolbox to assess whether a forecaster (or a set of forecasts) dominates another, under any consistent scoring function. Even though it cannot readily tell whether this forecaster will yield higher value in all potential decision processes, the strength of the authors’ result is something that brings us closer to ensuring that forecasts seen as having higher quality should eventually yield higher value, whatever their loss function.

    On the basis of this contribution, one is left wondering how practical this result and the so‐called Murphy diagrams may be in empirical forecast comparisons. The authors mention that empirical dominance corresponds to the case of one elementary score curve lying below another for all θ. How informative really is the case where the curves intersect? Besides, this result is given in a univariate set‐up only, as if, when making a decision at time t, the decision maker would consider a single variable at time t+k only. In many practical applications, decision makers are to account jointly for information from many variables, possibly at various lead times, locations, etc. I therefore wonder whether such results and diagnostic tools could generalize to the case of multivariate set‐ups.

    Barbara Rossi (Institució Catalana de Recerca i Estudis Avançats–Universitat Pompeu Fabra, Barcelona, Barcelona Graduate School of Economics, and Centre de Recerca en Economia Internacional, Barcelona)

    Previous literature has found that the ranking of competing forecasts may depend on the choice of the scoring function (e.g. Patton (2015)). This finding clearly has the problematic implication that, for a given loss function, a forecast may be superior to another, but, at the same time, the latter forecast may be superior to the former when a different scoring function is used.

    This paper makes a very useful point: it shows that every consistent scoring function can be written as a weighted average of elementary scoring functions, where the weights depend on a parameter θ. Thus, if the average elementary scoring functions of a forecast are lower than those of a competing forecast for every value of θ, then the former is superior, no matter what the choice of the loss function is. This can be easily analysed by depicting the differences of the scoring functions of competing forecasts as a function of θ, as shown for example in Fig. 6(d) in the paper. In fact, if, for example, the difference of the elementary scoring functions of the Survey of Professional Forecasters and the Michigan survey forecasts of inflation is negative for every value of θ, then the Survey of Professional Forecasters provides better forecasts, no matter which scoring function is used to evaluate the forecasts.

    I would like to complement the analysis made in this paper by observing that, empirically, the relative ranking of competing forecasts may also depend on the sample period. Rossi (2013) has shown that the relative ranking of competing forecasting models of US inflation, for example, may change over time, for a given scoring function. The very interesting results in this paper could therefore be extended one step further. In fact, one could plot the differences in the elementary scoring functions in rolling windows, as opposed to the full out‐of‐sample period, and then verify that the individual scoring function differences are negative at each point in time by using Giacomini and Rossi's (2010) fluctuation test. Such an analysis would be useful to determine at which points in time the ranking of competing forecasts is independent of the choice of the scoring function and when it is not. This may provide additional valuable information to study the reliability of a particular forecasting model no matter which scoring function and sample period the forecaster considers.
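
    The rolling-window idea can be sketched as follows. The code is illustrative only: it assumes the elementary score for the mean in the form ½|y − θ| when θ lies between forecast and observation (the positive constant is immaterial for rankings), the data and the two forecasts are synthetic and hypothetical, and the window length is arbitrary. It computes rolling averages of elementary score differences and does not implement the fluctuation test of Giacomini and Rossi (2010), which would additionally require variance estimates and critical values.

```python
import numpy as np


def elementary_mean_score(x, y, thetas):
    """Elementary scores for mean forecasts, one column per threshold.

    Entry (t, j) is 0.5 * |y_t - theta_j| if
    min(x_t, y_t) <= theta_j < max(x_t, y_t), and 0 otherwise.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]
    th = np.atleast_1d(np.asarray(thetas, dtype=float))[None, :]
    between = (np.minimum(x, y) <= th) & (th < np.maximum(x, y))
    return 0.5 * np.abs(y - th) * between


def rolling_mean(a, window):
    """Rolling mean along axis 0 of a 2-d array; (n - window + 1) rows."""
    c = np.cumsum(np.vstack([np.zeros((1, a.shape[1])), a]), axis=0)
    return (c[window:] - c[:-window]) / window


# Hypothetical setting: forecast A uses the signal mu only in the first
# half of the sample, forecast B only in the second half.
rng = np.random.default_rng(2)
n, window = 300, 60
mu = rng.normal(size=n)
y = mu + rng.normal(size=n)
first_half = np.arange(n) < n // 2
x_a = np.where(first_half, mu, 0.0)
x_b = np.where(first_half, 0.0, mu)

thetas = np.linspace(-4.0, 4.0, 81)
diff = (elementary_mean_score(x_a, y, thetas)
        - elementary_mean_score(x_b, y, thetas))
roll = rolling_mean(diff, window)

a_leads = np.all(roll <= 0, axis=1)  # A no worse at every grid threshold
b_leads = np.all(roll >= 0, axis=1)  # B no worse at every grid threshold
print("windows:", roll.shape[0])
print("windows in which A leads for all thresholds:", int(a_leads.sum()))
print("windows in which B leads for all thresholds:", int(b_leads.sum()))
```

    Windows in which neither forecast leads for all thresholds are those in which the ranking depends on the choice of scoring function.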

    Ville A. Satopää (University of Pennsylvania, Philadelphia)

    The scoring function is often selected without a justification. This is particularly unfortunate because, as the authors of the current paper point out, different choices can lead to different relative rankings of the competing forecasting methods. Motivated by this, the authors express a large class of scoring functions as different mixtures of simple, interpretable extremal scoring functions indexed by a parameter θ. Furthermore, by considering a subclass of scoring functions, they can establish a Choquet representation. Unfortunately, the motivation and real importance of this representation are not well described, especially given that Choquet is mentioned in the title of the paper. Nonetheless, the mixture representation is a remarkable contribution because it allows the practitioner to test whether some forecasting method simultaneously dominates its competitors over an entire class of scoring functions.

    Unfortunately, however, as the authors state in their discussion, ‘dominance may not be very commonly observed in practice’. Therefore the practitioner most probably observes multiple methods, each dominating over different values of θ. To proceed, the practitioner must then decide which values of θ are more important for the application at hand, i.e. the practitioner must choose a mixing measure, which is essentially equivalent to picking a scoring function. Therefore the initial problem of choosing a scoring function is typically not avoided.

    Choosing a mixing measure, of course, requires a clear understanding of what different values of θ represent. For this, the authors interpret the different extremal functions in terms of different betting games. Even though the discussion therein is very interesting, the interpretations are overall rather cumbersome for comparing forecasting methods across different values of θ. This questions whether a Murphy diagram for probability forecasts of binary events, such as Fig. 4(a), should be used over a regular reliability plot. After all, a reliability plot describes the forecasting method in terms of simple traits such as sharpness, overconfidence and underconfidence, which together are suggestive of the overall Brier score. This intuitive appeal makes it easier for the practitioner to evaluate any trade‐offs and to proceed with a final method.

    Milan Stehlík (Johannes Kepler University, Linz, and University of Valparaíso)

    I congratulate the authors on this paper, which introduces readers to a challenging world of consistent scoring functions for quantiles and expectiles! I would like to point out the intrinsic relationship between information divergences and scores. If the underlying distributions are from a regular exponential family (see Barndorff‐Nielsen (1978)), with the sufficient statistics t(y) for the canonical parameter γ, then we can directly relate the scoring function S(x,y) between the point forecast x and the realized observation y to the Kullback–Leibler (KL) divergence I(y,x) of the observed vector y in the sense of Pázman (1993).

    Then, the squared error scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1263 corresponds to I(y,x) for linear regression with normal errors (see Stehlík (2003), page 151).

    In contrast, scores for expectiles have different, more complicated representations. Consider the set‐up of Patton (2001). We have the family urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1264 (on page 511), which for a particular point b=0 has the form of KL divergence
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1265
    in the gamma density model
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1266
    for the choice of N=1 and covering parameters urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1267 and shape parameter urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1268. The exact distribution of KL divergence of the observed vector y has been studied in Stehlík (2003), and such studies can give optimal tests for forecast dominance, as a complement to the Murphy diagram. The application of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1269 with underlying distribution for methane emission can be found in Sabolová et al. (2015).

    Thus, aside from a useful study of score functions, it will be useful to mention their direct relationship to information theory, in particular to KL divergences and KL scores. The Bregman score has been extensively studied (Gneiting and Raftery, 2007; Murata et al., 2004; Hendrickson and Buehler, 1971); however, the KL score could also be integrated in the whole picture. Decompositions of information divergences from a broader perspective should be of interest (see for example Stehlík (2012)). Thus the relationship between ϕ‐divergence and statistical information in the sense of DeGroot (see DeGroot (1970)) can be obtained. In DeGroot and Fienberg (1983) or Weijs et al. (2010), a decomposition of the divergence score was presented. This was inspired by a decomposition of the Brier score (see Brier (1950)) into uncertainty, reliability and resolution.

    In general, aside from the consistency of scores, one should understand the different learning mechanisms hidden behind the different choices of scores, together with their distributional properties. This is needed for good Basel practices, which may require, for example, consistent and robust M‐estimation (as an example see Beran et al. (2014)).

    Ingo Steinwart (University of Stuttgart)

    I first thank the authors for an interesting and stimulating paper. As they briefly indicate in their discussion on best practices, scoring functions occur not only in the forecasting community but also in machine learning or statistical learning theory. Since scoring functions, or, in the terminology of that field, loss functions, have been investigated there independently for quite some time (see for example Bartlett et al. (2006), Tewari and Bartlett (2005), Zhang (2004), Steinwart (2007), Reid and Williamson (2011), Calauzènes et al. (2012, 2013), Scott (2012), Duchi et al. (2013), Menon and Williamson (2014), Agarwal and Agarwal (2015), Frongillo and Kash (2015) and Ramaswamy and Agarwal (2016) and the many further references therein), I would like to use this contribution to describe briefly the role of scoring functions in machine learning.

    Roughly speaking, the goal of supervised learning is to use observations urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1270 drawn from some unknown distribution P on urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1271 to find a function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1272 such that its average score, or risk,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1273
    is small. Here S is a scoring function defined by the application at hand and P(·|x) is a (regular) conditional probability of P. Clearly, if we have a function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1274 satisfying
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1275
    then we obtain a conditional version of the elicitation problem that was considered by Ehm and his colleagues. Let us assume for simplicity that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1276 is almost surely unique, i.e. urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1277 is almost surely a property of P(·|x) that can be elicited by S. Now, using data D, we can rarely hope to obtain urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1278, and thus we need a performance measure that quantifies how close urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1279 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1280 are. Here, three fundamentally different possibilities exist and are considered in machine learning.
    1. We are purely interested in S‐risk minimization, in which case the excess risk
      urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1281(48)
      with urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1282 is the sole quantity of interest.
    2. Our goal is to estimate urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1283, in which case we typically describe the quality of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1284 by some suitable norm, i.e. urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1285.
    3. We are actually interested in urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1286‐risk minimization for a different scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1287, but, for example, for algorithmic reasons we can only perform S‐risk minimization.

    Note that, in both cases 2 and 3, the scoring function S serves only as an instrument for solving the actual problem at hand. It is therefore natural to ask which S is the ‘best instrument’, how expression (48) relates to the actual problem for different choices of S, and whether suitable scoring functions S exist that have additional, algorithmically interesting properties such as convexity and/or differentiability. The Choquet‐type representation given by Ehm and his colleagues may make it possible to attack these questions with a new set of tools.

    Samuel L. Ventura and Rebecca Nugent (Carnegie Mellon University, Pittsburgh)

    The authors provide a detailed theoretical framework for evaluating, comparing and ranking forecasts with scoring functions. Highlighting that forecasters might be required to report a mean or quantile of a distribution of predictions, they assert that scoring functions must be used when comparing forecasts, and they provide several theoretical results relating to the consistency of quantile and expectile scoring functions and to the conditions under which these can be expressed in mixture (and Choquet) representations. We thank them for their efforts, as their results should be of interest to both industry and academic researchers.

    The authors’ efforts to make their results accessible and interpretable to a broad audience are evident in their two detailed profit–loss economic examples which, in conjunction with the easily understandable pay‐off structure overview in Table 1, allow readers to interpret properly the authors’ theoretical results in real world contexts. In future work, it would be interesting to see a similar discussion of the theoretical results on order sensitivity in a real world context. For example, one could use the set‐up of the authors’ first profit–loss example to demonstrate the results on order sensitivity for quantiles and expectiles in proposition 2.

    The theoretical results and empirical extensions relating to ranking forecasts are perhaps the most important work in this paper. Although (as the authors point out) some past work has been done in this area, the authors establish both theoretical and empirical rules for forecast dominance and discuss how visualization tools like the Murphy diagram can be used to help future researchers to compare forecasts. These findings should be highly valued by both the broader research community and in the industry, where ranking forecasts is of growing importance. That said, the focus on forecast dominance may come at the expense of ranking groups of non‐dominant forecasts. Equation (28) requires specific values of θ to rank forecasts. How do we rank forecasts when θ is unknown, or independently of θ? Is there some test statistic that can be derived by considering all possible values of θ when comparing forecasts? How can this information be incorporated in visualization tools like Murphy diagrams? Finally, can variability around the expected score be similarly incorporated (e.g. via confidence bands) to compare forecasters better visually?

    Given that dominance is an ‘all‐or‐nothing’ property, there are likely to be many practical applications where the better forecaster is obvious despite strict dominance not holding. How dominant is dominant enough?

    Johanna F. Ziegel (University of Bern)

    I congratulate the authors on a great piece of work. It provides genuinely new insights into the evaluation of forecasts for quantiles and expectiles, which may be relevant and influential well beyond the applications that were studied in Sections 3.3, 3.4 and 4. The authors provide many directions for future research in the discussion, of which the considerations concerning the choice of scoring functions in regression problems appear particularly interesting.

    The concept of forecast dominance that was introduced in definition 2 is intriguing. The authors give a collection of sufficient criteria for situations where forecast dominance occurs in Section 3.2. However, the question of how strong a requirement forecast dominance really is merits further study. For instance, can one construct an interesting example of two forecasters urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1288 with non‐nested information sets such that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1289 dominates urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1290? A second interesting example would be one of two forecasters, one dominating the other, possibly with nested information sets, but neither of them ideal.

    As a final comment, the mixture representations in theorem 1 can also be obtained by using Osband's principle (Osband, 1985; Gneiting, 2011) for any identifiable functional with oriented identification function V: urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1291; see Steinwart et al. (2014), section 3, for the definition of an oriented identification function. Any sufficiently smooth consistent scoring function can be written as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1292
    with h⩾0 by Osband's principle. It is not too difficult to show that the elementary (or extremal) scoring functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1293, parameterized by v, are consistent. In this context it is natural to ask whether there is a version of Osband's principle without strong differentiability assumptions on the scoring function. Possibly, using an approximation argument, one might be able to show that all consistent scoring functions with minimal smoothness assumptions can be written as
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1294
    for some measure H. We should obtain strict consistency if and only if the measure H has a strictly positive density with respect to Lebesgue measure (on the action domain).

    The authors replied later, in writing, as follows.

    We are most grateful to our colleagues, and in particular to the main discussants Chris Ferro and Philip Dawid, for a wealth of insightful thoughts and comments. Although the ideas and questions raised are too numerous for us to respond individually, we have identified several themes which our response will focus on, namely
    1. interpretation,
    2. forecast dominance, i.e. strength of the notion, sampling variability and inference,
    3. forecast combinations and dominance graphs,
    4. Choquet representations and
    5. other points.

    1. Interpretation

    Interpretation of Murphy diagrams for quantile and probability forecasts. The interpretation of extremal scoring functions and Murphy diagrams has been addressed in various facets by several colleagues, including Christensen, Ferro, Jose, Meng, Paul and Majumdar, and Satopää, among others. Ferro's interpretation of the Murphy diagram for quantile and probability forecasts is very appealing, and we review it in the notation and terminology of our paper.
    1. For quantile forecasts at level α ∈ (0,1), the Murphy diagram illustrates how much less a forecast makes on average, compared with an oracle, in a binary betting problem at the event threshold urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1295, in the monetary unit of the betting return, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1296.
    2. For probability forecasts of a binary event, the Murphy diagram illustrates how much less a forecast makes on average, compared with an oracle, in the respective betting problem with cost–loss ratio θ ∈  (0,1), in the monetary unit of the betting return, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1297.

    We emphasize that the interpretation is in monetary terms, regardless of the original unit of the predictand, y: multiplying the value on the diagram's vertical axis by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1298 yields the monetary regret. Consider the stock market example of Lenkoski and Thorarinsdottir, and suppose that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1299 equals $1000. The Murphy diagram shows that for a bet with threshold −0.03 the two forecasts earn about $6 less than an oracle on average. This seems small as a difference and can be explained by the fact that Lenkoski and Thorarinsdottir consider quantile forecasts at level α=0.005. As we assume that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1300 this is an unattractive lottery, which one should mostly reject. Hence the decision to make is not difficult, explaining why both forecast models incur only a small monetary regret compared with the oracle. In the wind speed example in our paper (the bottom of Figs 5 and 6), we have α=0.90, which corresponds to a much more attractive lottery. For a bet with threshold 12 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1301, the regime switching space–time forecast earns about $20 less on average, and the auto‐regressive forecast about $30 less, than an oracle.
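    As a minimal sketch of how such a curve can be computed for quantile forecasts, the Python code below averages an elementary quantile score over forecast and outcome pairs on a grid of thresholds θ and rescales the result to a monetary regret. The score is taken here to be 1−α if y ⩽ θ < x, α if x ⩽ θ < y and 0 otherwise, which is an assumption about the form of expression (10) and should be checked against the paper. The names `elementary_quantile_score`, `murphy_curve`, `monetary_regret` and `unit` are hypothetical and are not taken from the accompanying R package.

```python
import numpy as np


def elementary_quantile_score(x, y, theta, alpha):
    """Assumed elementary quantile score.

    Returns 1 - alpha if y <= theta < x, alpha if x <= theta < y, else 0.
    """
    x, y, theta = (np.asarray(a, dtype=float) for a in (x, y, theta))
    forecast_above = (y <= theta) & (theta < x)  # threshold between outcome and forecast
    forecast_below = (x <= theta) & (theta < y)  # threshold between forecast and outcome
    return (1.0 - alpha) * forecast_above + alpha * forecast_below


def murphy_curve(x, y, thetas, alpha):
    """Average elementary score at each threshold in `thetas`."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    thetas = np.asarray(thetas, dtype=float)[None, :]
    return elementary_quantile_score(x, y, thetas, alpha).mean(axis=0)


def monetary_regret(curve, unit):
    """Rescale a Murphy curve by a (hypothetical) monetary unit of the bet."""
    return unit * np.asarray(curve, dtype=float)
```

    For example, `monetary_regret(murphy_curve(x, y, thetas, 0.9), 1000.0)` would express the curve for 0.9‐quantile forecasts in dollars under a $1000 unit.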

    Mean and expectile forecasts. In the case of mean and expectile forecasts, Osband links economic cost to statistical efficiency considerations. However, the interpretation of the Murphy diagram is less straightforward, as the elementary score is now in the original unit of the predictand, y. When y is monetary, our interpretation in terms of an investment problem with differential tax rates applies. Otherwise, a detour via financial options may allow for such an interpretation. As any consistent scoring function admits a mixture representation in terms of elementary scoring functions, it admits an interpretation in terms of the original unit in the applied problem at hand. In particular, this holds for squared error, thereby raising the intriguing possibility of interpreting (mean‐) squared error directly in the original unit, rather than its square.
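    The remark on squared error can be checked numerically. The sketch below assumes that the elementary score for the mean equals |y−θ|/2 when θ lies between forecast and outcome (y ⩽ θ < x or x ⩽ θ < y) and 0 otherwise, which is an assumption about expression (12) at α=1/2. Under that assumption, integrating the score over θ with respect to Lebesgue measure gives (x−y)²/4, so squared error is four times the integral. The function names are illustrative only.

```python
import numpy as np


def elementary_mean_score(x, y, theta):
    """Assumed elementary score for the mean: |y - theta| / 2 if theta lies
    between forecast x and outcome y, and 0 otherwise."""
    x, y, theta = (np.asarray(a, dtype=float) for a in (x, y, theta))
    between = ((y <= theta) & (theta < x)) | ((x <= theta) & (theta < y))
    return 0.5 * np.abs(y - theta) * between


def integrated_elementary_score(x, y, n_grid=200_001):
    """Trapezoidal approximation of the integral of the score over theta."""
    lo, hi = min(x, y) - 1.0, max(x, y) + 1.0
    thetas = np.linspace(lo, hi, n_grid)
    values = elementary_mean_score(x, y, thetas)
    return float(np.sum((values[1:] + values[:-1]) * np.diff(thetas)) / 2.0)


def squared_error_from_mixture(x, y):
    """Squared error recovered as 4 times the integrated elementary score."""
    return 4.0 * integrated_elementary_score(x, y)
```

    The factor 4 is specific to the assumed normalization of the elementary score; a different normalization would change the constant but not the structure of the representation.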

    2. Forecast dominance: strength of the notion, sampling variability, and inference

    Theoretical example. In response to Ziegel's request, we give a simple example of two forecasts with non‐nested information sets such that one dominates the other. The general idea is that if one forecast has exclusive access to a highly informative explanatory variable, then it may dominate a competitor with exclusive access to a weakly informative variable, even though the information sets may not be nested. Specifically, let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1302, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1303 and ɛ are independent Gaussian variables with mean 0 and variances 2, 1 and 1 respectively. Suppose that Anne's forecast of the mean of Y is given by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1304, whereas Bob's forecast is given by urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1305. The expected elementary scores for the mean functional can be computed explicitly and yield the Murphy diagram in Fig. 15.

    Fig. 15. Murphy diagram in the theoretical example (curves: Anne, Bob and simple average)
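    A simulation along the following lines can approximate such a diagram. The displayed formulas of the example are not available here, so the sketch assumes Y = X1 + X2 + ɛ with independent mean 0 Gaussian variables of variances 2, 1 and 1, Anne forecasting X1 and Bob forecasting X2; this reading, the symbols X1 and X2, and the form of the elementary score (|y−θ|/2 when θ lies between forecast and outcome) are all assumptions. The curves therefore need not match Fig. 15 exactly.

```python
import numpy as np


def elementary_mean_score(x, y, theta):
    """Assumed elementary score for the mean: |y - theta| / 2 if theta lies
    between forecast x and outcome y, and 0 otherwise."""
    x, y, theta = (np.asarray(a, dtype=float) for a in (x, y, theta))
    between = ((y <= theta) & (theta < x)) | ((x <= theta) & (theta < y))
    return 0.5 * np.abs(y - theta) * between


def simulate_murphy_curves(n=50_000, thetas=None, seed=0):
    """Monte Carlo estimate of expected elementary scores on a theta grid."""
    rng = np.random.default_rng(seed)
    if thetas is None:
        thetas = np.linspace(-2.0, 2.0, 41)
    thetas = np.asarray(thetas, dtype=float)
    # ASSUMED data-generating process (the original display is not available).
    x1 = rng.normal(0.0, np.sqrt(2.0), n)  # highly informative variable
    x2 = rng.normal(0.0, 1.0, n)           # weakly informative variable
    eps = rng.normal(0.0, 1.0, n)
    y = x1 + x2 + eps
    forecasts = {"Anne": x1, "Bob": x2, "average": 0.5 * (x1 + x2)}
    curves = {
        name: elementary_mean_score(f[:, None], y[:, None], thetas[None, :]).mean(axis=0)
        for name, f in forecasts.items()
    }
    return thetas, curves
```

    Under these assumptions both forecasts are conditional means given a single variable, and the forecast with the larger variance attains the smaller expected elementary score at every θ, which is the mechanism behind dominance with non‐nested information sets.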

    Practical relevance, sampling variability and choice of sampling period. Diebold, Jolliffe, Patton and Pinson, among others, note that forecast dominance is a strong requirement that may be met only rarely in practice. Furthermore, sampling variability may obscure dominance relations, as noted by Ventura and Nugent and illustrated very nicely in the simulation example by Holzmann and Klar. Rossi points out that forecast rankings tend to be unstable over time, and she suggests accounting for this possibility when analysing Murphy diagrams. For this, we have implemented the fluctuation test of Giacomini and Rossi (2010), using elementary scores as loss functions, with values θ in the centre of the support of the predictand. In a nutshell, the fluctuation test concerns the null hypothesis that two forecasts perform equally well at all time points. Fig. 16 presents the results for mean forecasts of inflation and wind speed. The null hypothesis of equal performance is not rejected for the inflation data, whereas, for the wind data, the test statistic repeatedly falls below the critical value. Thus, the interpretations in our paper are qualitatively robust across different choices of the sampling period. A next step is to account for the possibility that forecast rankings depend on both the sampling period and the mixture parameter θ, and we encourage future work along these lines.

    Fig. 16. Fluctuation test applied to mean forecasts of (a) inflation (θ=3) and (b) wind speed (θ=12). Lower test statistics speak in favour of (a) the Survey of Professional Forecasters’ forecast and (b) the regime switching space–time forecast; for implementation details see the accompanying software (Jordan and Krüger, 2016). The curves show the test statistic defined in equation (1) of Giacomini and Rossi (2010) and the critical values from Giacomini and Rossi's (2010) Table 1 (two‐sided test; 5% level; μ=0.5)
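    The idea of a rolling comparison can be sketched as follows. This is a simplified illustration and not the implementation in the accompanying software: it scales rolling sums of elementary score differences by a plain sample standard deviation, whereas Giacomini and Rossi (2010) use a heteroscedasticity and auto‐correlation consistent variance estimator, and it does not reproduce their critical values. Reading μ as the ratio of the window length to the sample size is likewise an assumption, and the function name is hypothetical.

```python
import numpy as np


def rolling_fluctuation_statistic(loss_a, loss_b, window):
    """Rolling sums of loss differences (A minus B), scaled by
    sigma * sqrt(window). Negative values favour forecast A.

    Simplification: sigma is the ordinary sample standard deviation of the
    differences, not a long-run (HAC) estimate.
    """
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    if window < 1 or window > d.size:
        raise ValueError("window must be between 1 and the sample size")
    sigma = d.std(ddof=1)
    if sigma == 0.0:
        raise ValueError("loss differences have zero variance")
    csum = np.concatenate(([0.0], np.cumsum(d)))
    rolling_sum = csum[window:] - csum[:-window]  # one value per window position
    return rolling_sum / (sigma * np.sqrt(window))
```

    With elementary scores at a fixed θ as the two loss series and `window = int(0.5 * n)`, the output is a path that can be plotted against time and compared with tabulated critical values.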

    Formal inference. The above comment demonstrates the need for formal inference tools and tests for forecast dominance, as called for by Ghanem, Patton, and Ventura and Nugent, among others. The permutation test that was sketched at the end of our Section 3.4 provides such a tool, even though we concede that the null hypothesis formally tested, namely the invariance of the joint distribution of the vectors urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1306 under arbitrary changes of sign in the n components, is quite restrictive. Still, asymptotically as n grows large, the effective discrepancy with the quite liberal null hypothesis of mean 0 score differences may be much reduced.
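    A sign‐flipping permutation test of this kind can be sketched as follows. The null hypothesis matches the description above (invariance under arbitrary sign changes of the n score difference vectors), but the test statistic used here, the largest absolute mean score difference over a grid of θ values, is an assumption for illustration and may differ from the sketch in Section 3.4. The function name and arguments are hypothetical.

```python
import numpy as np


def sign_flip_test(score_diffs, n_perm=2000, seed=0):
    """Monte Carlo p-value of a sign-flipping test.

    score_diffs: array of shape (n, k) holding, for each of n forecast cases,
    the elementary score differences at k values of theta.
    """
    d = np.asarray(score_diffs, dtype=float)
    if d.ndim == 1:
        d = d[:, None]
    rng = np.random.default_rng(seed)
    observed = np.abs(d.mean(axis=0)).max()
    exceed = 0
    for _ in range(n_perm):
        # One random sign per forecast case, applied to the whole vector.
        signs = rng.choice([-1.0, 1.0], size=d.shape[0])
        stat = np.abs((signs[:, None] * d).mean(axis=0)).max()
        exceed += int(stat >= observed)
    return (exceed + 1) / (n_perm + 1)
```

    Flipping the sign of an entire row keeps the dependence across θ within each forecast case intact, which is why the statistic can be taken over the whole grid at once.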

    3. Forecast combinations and dominance graphs

    Optimal combination. Several colleagues propose the use of extremal scoring functions and Murphy diagrams in designing and evaluating forecast combinations. Generally, it can be shown that
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1307
    where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1308 is a convex combination of l individual forecasts. An analogous inequality holds in the case of quantiles. Therefore, a linearly combined forecast performs no worse than the worst individual forecast. Similar insurance‐type properties of forecast combinations have been noted in other contexts (e.g. Kascha and Ravazzolo (2010), page 237) and provide a partial explanation for the empirical success of linear combination formulae. The examples by de Carvalho and Rua nicely illustrate that weights for one value of θ may not perform well for other values. Hence their proposal to use some summary measure when estimating combination formulae is natural. Casarin and Ravazzolo take a different route, by making the combination weights depend on θ.
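    A pointwise version of this insurance property can be verified directly. For fixed y and θ, the elementary score for the mean (assumed here to equal |y−θ|/2 when θ lies between forecast and outcome, and 0 otherwise) is monotone in the forecast, so the score of a convex combination cannot exceed the largest individual score. The sketch below checks this case by case; whether it coincides with the displayed inequality cannot be confirmed from the text available here, and the function names are illustrative.

```python
import numpy as np


def elementary_mean_score(x, y, theta):
    """Assumed elementary score for the mean: |y - theta| / 2 if theta lies
    between forecast x and outcome y, and 0 otherwise."""
    x, y, theta = (np.asarray(a, dtype=float) for a in (x, y, theta))
    between = ((y <= theta) & (theta < x)) | ((x <= theta) & (theta < y))
    return 0.5 * np.abs(y - theta) * between


def combination_never_worst(forecasts, weights, y, theta):
    """True if the convex combination scores no worse than the worst
    individual forecast for this single outcome and threshold."""
    forecasts = np.asarray(forecasts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    combined = float(weights @ forecasts)
    # Guard against rounding pushing the combination outside the forecast range.
    combined = float(np.clip(combined, forecasts.min(), forecasts.max()))
    score_combined = elementary_mean_score(combined, y, theta)
    score_worst = elementary_mean_score(forecasts, y, theta).max()
    return bool(score_combined <= score_worst)
```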

    Elimination strategies: weeding out poor forecasters. Merkle and Patton ask whether Murphy diagrams can be used to identify subsets of forecasters which should enter a combination formula. In the simple example in Fig. 15, an equally weighted average of Anne's and Bob's forecasts dominates Bob, and it attains better scores than Anne for many values of θ. Hence, despite being dominated by Anne, Bob's forecasts can still be useful in a combination context. This suggests that there may be no simple analytical justification for eliminating dominated forecasters. Nevertheless, as mentioned by Patton, eliminating dominated forecasts may be a very useful strategy in empirical settings with many forecasters and overlapping information sets, such as in big on‐line surveys (Tetlock and Gardner, 2015).

    Dominance graphs. As noted in the paper, the dominance and empirical dominance relations induce partial orders among forecasts. The partial order can be represented in the form of a directed graph, which we call a dominance graph, as illustrated in Fig. 17 in the very simple, synthetic example of the 10 probability forecasts in Table A.1 of Merkle and Steyvers (2013). Forecast ID 3 dominates ID 6 and ID 8, and ID 5 dominates ID 10, so it is the forecast IDs at the top which are not dominated. In big surveys with possibly redundant forecasts, dominance graphs may become much more complex and much more informative. They give rise to simple pruning algorithms for the identification of forecasts that are to be combined or considered further. In the simplest case, one might restrict attention to the subset of the forecasts which are dominated by at most two other forecasters, say.

    Fig. 17. Empirical dominance graph in the synthetic example of Merkle and Steyvers (2013)
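    A dominance graph and the simple pruning rule can be sketched as follows, given average elementary scores for each forecast on a common grid of θ values. Checking dominance on a grid only approximates empirical dominance unless the grid contains all relevant breakpoints, and requiring strict improvement at some grid point (so that tied forecasts do not dominate each other) is a choice made here. The function names are hypothetical.

```python
import numpy as np


def dominance_edges(curves, tol=1e-12):
    """Edges (a, b) meaning forecast a dominates forecast b on the grid.

    curves: dict mapping a forecast label to its average elementary scores
    on a common theta grid (smaller is better).
    """
    edges = []
    for a, score_a in curves.items():
        for b, score_b in curves.items():
            if a == b:
                continue
            sa = np.asarray(score_a, dtype=float)
            sb = np.asarray(score_b, dtype=float)
            if np.all(sa <= sb + tol) and np.any(sa < sb - tol):
                edges.append((a, b))
    return edges


def prune(curves, max_dominators=2):
    """Keep forecasts dominated by at most `max_dominators` others."""
    counts = {name: 0 for name in curves}
    for _, dominated in dominance_edges(curves):
        counts[dominated] += 1
    return [name for name in curves if counts[name] <= max_dominators]
```

    Setting `max_dominators=0` returns the forecasts at the top of the graph, i.e. those that no other forecast dominates.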

    4. Choquet representations

    Generalizations. Several discussants ask for analogues of our mixture representations in various settings, with reference to other types of univariate functionals (Critchley and his colleagues, Dawid, Demetrescu and Ziegel), multivariate functionals (Mitchell and Ferro, and Pinson) and divergences (Critchley and his colleagues and Stehlík).

    Further types of elicitable functionals. Dawid and Ziegel consider other types of functionals in the general univariate case, and Demetrescu in the specific case built around the asymmetric power loss function. Starting from any single‐valued functional that admits an identification function V=V(t,y), which is non‐decreasing in t, Dawid's construction yields an explicit mixture representation in terms of a linear family urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1309 of elementary scoring functions that derive from V. Ziegel argues along the same lines, using the weaker condition on V of being oriented in the sense of equation (4) in Steinwart et al. (2014) which still implies consistency. She also notes the connection to Osband's principle and to the development in Steinwart et al. (2014), where the class of all elicitable real‐valued functionals is characterized, subject to quite stringent conditions. Steinwart et al. (2014) showed that the convexity of level sets, which was shown by Osband (1985) to be a necessary condition for elicitability, is sufficient for the existence of an oriented identification function. Once the latter has been given, Dawid's and Ziegel's reverse construction yields consistent scoring functions of virtually the same form as obtained by Steinwart et al. (2014); see their equations (8), (12) and (18). Whether or not the construction exhausts all consistent scoring functions for a given functional appears to depend on intricate regularity conditions.

    Multicomponent functionals. Mixture representations for consistent scoring functions in the case of urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1310‐valued elicitable functionals and related settings are highly desirable. However, extensions from the one‐ to higher dimensional cases are not possible in general. This is because proper scoring rules (and hence consistent scoring functions) arise as supergradients of concave functionals, and simple mixture representations of concave (or convex) functions as used in our paper are unavailable even on urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1311. A notable exception concerns interval forecasts in the form of the 1−2α central prediction interval, which corresponds to the urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1312‐valued functional with components given by the univariate quantile functionals at level α and 1−α respectively. By proposition 4.2 of Fissler and Ziegel (2015), any consistent scoring function in this setting is the sum of a consistent scoring function for the α‐quantile and a consistent scoring function for the (1−α)‐quantile, so our results and representations carry over immediately. Although this is a rather unique case from a theoretical perspective, it is one of strong applied relevance, as discussed in Section 6.2 of Gneiting and Raftery (2007).
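    The additive structure for central prediction intervals can be illustrated with a familiar special case. The sketch below assumes the interval score in the form (u−l) + (l−y)₊/α + (y−u)₊/α for the 1−2α central interval [l, u], and checks that it equals 1/α times the sum of the asymmetric piecewise linear (pinball) scores of the endpoints at levels α and 1−α. The function names are illustrative, and the scaling convention is an assumption that should be checked against Gneiting and Raftery (2007).

```python
def pinball(x, y, alpha):
    """Asymmetric piecewise linear score (1(y < x) - alpha) * (x - y),
    a consistent scoring function for the alpha-quantile."""
    return ((1.0 if y < x else 0.0) - alpha) * (x - y)


def interval_score(lower, upper, y, alpha):
    """Assumed interval score for the 1 - 2*alpha central interval."""
    return (upper - lower) + (max(lower - y, 0.0) + max(y - upper, 0.0)) / alpha


def interval_score_from_quantiles(lower, upper, y, alpha):
    """Same score, written as a sum of two quantile scores."""
    return (pinball(lower, y, alpha) + pinball(upper, y, 1.0 - alpha)) / alpha
```

    Because each summand is a consistent quantile score, each can in turn be written as a mixture of elementary quantile scores, which is how the representations carry over to this setting.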

    Divergences. Another setting arises when higher dimensional aspects are compressed into real‐valued functionals, such as in the case of ϕ‐divergences of probability measures P and Q. In a nutshell, a ϕ‐divergence is of the form
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1313
    where ϕ is a convex function on [0, ∞) such that ϕ(1)=0; see, for example, Stummer and Vajda (2012) for a recent treatment and a delineation from Bregman distances. Any function ϕ of this form admits a mixture representation in terms of elementary functions or atoms urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1314 for θ>0. An immediate consequence (for measures P absolutely continuous with respect to Q) is the representation
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1315
    of the divergence as a mixture over what might be called atomic divergences, namely urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1316.

    Mixture representations versus Choquet representations. For all practical purposes, it is the feasibility of a mixture representation in terms of parsimoniously parameterized elementary objects that matters in the above settings. From a more theoretical perspective, it is of interest to seek any additional assumptions under which the mixture representations in fact are Choquet representations in the sense of functional analysis (Phelps, 2001). In this context, it is not immediate to us whether the aforementioned approach via Osband's principle yields Choquet representations, though it clearly yields mixture representations of practical use and appeal. To qualify as a Choquet representation, Dawid's elementary scoring functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-1317 need to be the extreme points of suitable restricted convex sets of scoring functions, which is a condition that in general seems difficult to check.

    5. Other points

    Estimation. Several discussants point to the role of scoring functions in estimation problems from the perspectives of statistics (Meng and Musio), machine learning (Liu and Steinwart) and econometrics (Pelenis). For example, Liu's proposal aims at estimates that are robust with respect to the choice of the scoring function, and Pelenis addresses the issue of dominance relations among forecasts based on estimates from misspecified models. Clearly, the question of the optimal choice of the loss or scoring function to be employed for estimation in regression problems continues to offer substantial challenges.

    Miscellanea. Jolliffe, Jose and Satopää discuss the role of calibration or reliability. To assess these facets of forecast quality, reliability diagrams in the case of probability forecasts and exceedance tabulations in the case of quantile forecasts remain the methods of choice, particularly when the goal is to diagnose strengths and weaknesses of predictive models. In contrast, the key application of Murphy diagrams is in the comparison and ranking of competing forecasts. We second Mateu's call for theoretical and methodological tools that can cope with spatial dependences in the forecasts. Härdle and Huang hint at the challenging problem of forecast verification for tail events, which has recently been addressed by Lerch et al. (2015).

    Conclusion

    We are much obliged to our colleagues for an inspiring interdisciplinary discussion with contributions from the perspectives of statistics, machine learning, information theory, economics and meteorology. It is our hope that the discussion stimulates further interest in an area that is both classical and topical, for which Philip Dawid coined the very appropriate term ‘reverse decision theory’ at the Pre‐Ordinary‐Meeting in December 2015. We look forward to hearing about future methodological, theoretical and applied progress in this exciting field. An R package accompanying our paper (Jordan and Krüger, 2016) facilitates the implementation of Murphy diagrams in a wide range of applications.

    Appendix A: Proofs

    The specific structure of the scoring functions in expressions (5) and (7) permits us to focus on the case urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0408 in the subsequent proofs, with the general case α ∈ (0,1) then being immediate.

    A.1. Proof of theorem 1

    In the case of quantiles, the mixture representation (9), the fact that dH(θ)=dg(θ) for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0409, and the relationship H(x)−H(y)=S(x,y)/(1−α) for x>y are straightforward consequences of the fact that, for every urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0410 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0411,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0412
    As the increments of H are determined by S, the mixing measure is unique.
    Turning now to the case of expectiles, we associate with any function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0413 the Bregman‐type function of two variables
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0414(29)
    Then the mixture representation (11), the fact that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0415 for urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0416 and the relationship urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0417 for x>y are immediate consequences of the fact that, for all urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0418 and x<y,
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0419
    The case x>y is handled analogously, and the case x=y is trivial. Finally, as the increments of H are determined by S, the mixing measure is unique.

    A.2. Proof of proposition 1

    In the case of the elementary quantile scoring function (10), suppose that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0420, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0421 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0422 are of the form (5) with associated functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0423. Then
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0424
    As urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0425 we have urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0426 if yx, and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0427 if xy, where j=1,2. It follows that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0428 in the first case, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0429 in the second case and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0430 in the third case. This coincides with the value distribution of g(x)−g(y) when urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0431, whence indeed urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0432.
    In the case of the elementary expectile scoring function (12), suppose that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0433, where urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0434 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0435 are of the form (7) with associated functions urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0436. Let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0437 be defined as in expression (29). Then
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0438
    Taking left‐hand derivatives with respect to y, we obtain
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0439
    As urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0440, we may apply the same argument as in the quantile case to show that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0441, whence urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0442.

    A.3. Proof of proposition 2

    In the case of the elementary quantile scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0443 in expression (10) suppose first that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0444. Since
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0445
    we have
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0446
    The second factor on the right‐hand side vanishes unless urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0447, and under that condition we have F(θ)⩽α and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0448, whence the desired expectation inequality. The case urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0449 is handled analogously.
    In the case of the elementary expectile scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0450 in expression (12) we assume first that urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0451, where t denotes the α‐expectile of F. Since
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0452
    we obtain
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0453
    As the first term on the right‐hand side is strictly increasing in θ and has a unique zero at the α‐expectile of F, the proof can be completed in the same way as above.

    Appendix B: Details for the synthetic example

    Here we give details for the synthetic example that was introduced in Table 2 and discussed throughout Section 3. Table 3 shows analytic expressions for the expected score
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0454
    where F is either the perfect, the climatological, the unfocused or the sign‐reversed forecaster, and the functional T(F) is either the α‐quantile urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0455 or the mean urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0456 of the CDF‐valued random quantity F. The scoring function S is the elementary quantile scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0457 in expression (10) or the elementary scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0458 in expression (12). For example, if X is a quantile forecast for Y at level α ∈ (0,1) then
    urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0459(30)
    decomposes into three terms, the first depending on the outcome only, the second depending on the forecast only and the third accounting for the joint distribution. In view of the relationships (14) and (16), the foregoing discussion covers the case of the extremal scoring function urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0460 for event probabilities, also.
    Table 3. Expected extremal scores in the prediction space example of Table 2
    Forecast α‐quantile Mean
    F urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0461 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0462
    Perfect urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0463 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0464
    Climatological urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0465 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0466
    Unfocused urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0467 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0468
    Sign reversed urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0469 urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0470
    †For α ∈ (0,1) and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0471, we let urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0472, urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0473 and urn:x-wiley:13697412:media:rssb12154:rssb12154-math-0474, where Φ and φ denote the CDF and the probability density function of the standard normal distribution respectively.
