Inference for extreme values under threshold-based stopping rules

There is a propensity for an extreme value analyses to be conducted as a consequence of the occurrence of a large flooding event. This timing of the analysis introduces bias and poor coverage probabilities into the associated risk assessments and leads subsequently to inefficient flood protection schemes. We explore these problems through studying stochastic stopping criteria and propose new likelihood-based inferences that mitigate against these difficulties. Our methods are illustrated through the analysis of the river Lune, following it experiencing the UK's largest ever measured flow event in 2015. We show that without accounting for this stopping feature there would be substantial over-design in response to the event.


Introduction
The UK currently spends £400-500M per year on coastal and river flood defence infrastructure, with 2 million properties exposed to the risk of flooding. The agencies responsible for this spend monitor the effectiveness of their investment at giving the level of protection expected. After major flooding events renewed analysis is performed to assess both existing flood defences and the cost benefit of potential new schemes, proposed in response to the flooding.
Statistical extreme value methods, with likelihood-based inference, have proved a core component of the required analysis in terms of minimising the costs without jeopardising the level of accepted risk, and hence have financial and societal benefits. However, there is a problem with using these methods when the statistical analysis has been prompted by the occurrence of a recent large event, since in this case the data-set size itself is also random. This can lead to substantially biased inference and poor coverage properties and so result in inefficient flooddefence designs. Omitting the new extreme data value from the data set also seems unsuitable, as intuition suggests that flood risk will then be underestimated; moreover it would appear perverse to flood management agencies to ignore events of the type most relevant to the design specification.
This paper aims to identify the extent of the inference problems when an analysis has been triggered by a large event and to develop new conditional-likelihood methods which appear to overcome these problems. We do not suggest when the timing of the data analysis should take place but study the analysis given that its timing has been determined by a data-dependent decision making process. We consider modelling the extreme events of a time series of independent and identically distributed (iid) random variables X 1 , X 2 , . . .. The classical approach to do this is to split the time series into blocks of equal size (often a year) and to model the maxima of these blocks. Linear normalisation is necessary since as the block size tends to infinity the distribution of the maxima degenerates to a point mass at the upper end point of the distribution of X. The generalised extreme value (GEV) distribution (Coles, 2001) is the only non-degenerate limiting distribution of the normalised maxima as the block size tends to infinity. The GEV has distribution function: where µ, σ > 0 and ξ are the location, scale and shape parameters respectively and [z] + = max(z, 0). The shape parameter determines the behaviour of the upper tail of the distribution: for ξ < 0 the distribution has an upper end point, for ξ = 0 the tail is exponential and for ξ > 0 the distribution has a power-decaying tail. There is particular interest in the occurrence of extreme events and so an important part of the analysis is the estimation of return-levels (quantiles). Under stationarity, the y-year return-level, x y , is the value which is exceeded on average once every y years. For the GEV distribution this can be calculated as: One can also consider modelling daily observations above some high threshold (rather than just modelling the block maxima) by the asymptotically justified generalised Pareto distribution (GPD) (Davison and Smith, 1990). Threshold methods typically benefit from using more extreme-value data and hence are more efficient in their inferences than block maxima methods (Coles, 2001). We focus most of our analysis and developments on the GEV case, as similar benefits are found for both GEV and GPD inference, but with the GPD also sensitive to threshold choice. We illustrate some GPD results in §S.2 of the supplementary material.
We consider the analysis of annual maxima of daily peak river flow data obtained from CEH (2018) for the Lune at Caton, just outside Lancaster, from 1968 to 2015 (Figure 1,left panel) and illustrate the inference issues due to the timing of analysis being determined by the occurrence of a flood event. Under the assumption that the annual maxima are independent and identically distributed (i.i.d.) we can fit the GEV distribution to the annual maxima using likelihood-based inference (the likelihood is n i=1 g(x i ; (µ, σ, ξ)) where n is the sample size and g is the density, g = dG/dx), and estimate return-levels using (1.2). The estimated 10, 100 and 1000-year return levels are shown in Figure 1 (left panel) for the data up to 2014 (black) and including 2015 (red). The December 2015 floods resulted in the river Lune recording the highest peak river flow (1740 m 3 /s) of all UK rivers over all years of records. This value is higher than the 1000year return-level estimate based on the observations up to 2014. However, once the 2015 event is included in the analysis the return-level estimates become much higher. If we were to take these 2015 point estimates as the truth we would expect to observe an event as extreme as that in 2015 approximately once every 200 years. For design purposes this level of sensitivity is highly undesirable, as the costs for flood protection would change dramatically. At the beginning of the data collection the return-level estimates vary considerably, but they become more stable as the number of years increases, with the width of the confidence intervals generally decreasing over time. However, even after many years of data collection, the largest events can be seen to cause sharp increases in the estimates and their associated uncertainty. For example, the return-level estimate following the January 1995 floods and the 2015 floods are larger than those of previous years.
This illustrative example is typical of when an analysis is performed immediately after a large event. Unless further analysis is undertaken it is unclear whether by analysing the data with the final extreme event we are introducing a positive bias into the inference. For example, the lower bound of the 95% confidence interval of the 200 year return-level after the 1995 event is larger than almost all previous point estimates -directly after the event (without the knowledge of later years) this could have been seen as an indication of positive bias in the standard estimator. However, after the 1995 event the return-level estimates were fairly stable and larger than those before 1995, so it would seem the standard estimator for 1995 may not have been overestimating and before the event the shape parameter was estimated too low.
An alternative approach is to simply ignore the most recent year of data when an analysis has been requested because we have large data in that year, in which case the return-level estimate is lower and the confidence interval is narrower -in particular the upper bound is lower. However, we speculate (see also §2.3) that this estimator is now negatively biased due to the loss of information about the extreme event. Moreover the estimator is inefficient since the larger data values are the most informative about the upper tail (Davison and Smith, 1990). Finally, it would be hard to convince practitioners to exclude the largest events; for example, an event may be observed which is larger than the upper end point estimated from previous data, in which case it would be perverse not to make some update to the previously estimated return-levels.
The key issue that the Lune example illustrates is that when meeting the flood management agencies' needs, the time to undertake the extreme value analysis is stochastic and triggered by a large event. Thus, there is effectively some form of unwritten stopping rule, determined by the flood management agencies, which determines the timing of the analysis. In contrast with a standard iid sample of fixed size, when we use a stopping rule the time at which we stop (the sample size) is variable, we denote this by N .
One can attempt to mathematically formulate the characteristics of the stopping rule, though in reality a precise mathematical rule does not exist. The stopping decision may depend on (i) some absolute threshold, such as the height of existing flood defences or a critical level which when exceeded leads to severe flooding, or (ii) an assessment, based upon all observations to date, of what might constitute an 'exceptional' event. We consider two simple stopping rules based on a series of iid random variables, X 1 , X 2 , . . . which, in a sense, bracket this range of possibilities and we discuss other possibilities in §6.

Fixed-threshold stopping rule
Stop when an observation exceeds a specified value, c k , i.e.: where k is the true (but unknown) return period of c k .

Variable-threshold stopping rule
Stop when an observation exceeds the return-level estimate,x k , corresponding to the fixed return period of k years, calculated using previous observed values, i.e., when: We do not suggest the stopping rule to use but study the analysis given that its timing has been determined by stopping rule. As far as we are aware, there has been no study of stopping rules and their effects on likelihood estimation in the extreme-value setting.
Using a stopping rule to determine the sample size can lead to estimators, such as the maximum likelihood estimator (MLE), having different sampling properties to the fixed-sample case. To illustrate this feature consider an iid sequence of Bernoulli random variables, Y 1 , Y 2 , . . ., each with probability of success of θ. If one fixes the number of trials, n, the number of successes, R, in these trials is binomially distributed andθ = R/n =:θ 1 ; whereas if the number of successes is fixed as r, the number of trials, N , is negative-binomially distributed andθ = r/N =:θ 2 . In both cases the MLE of the probability of success is the proportion of successes, however, = θ by Jensen's inequality. The presence of a stopping rule affects the performance of the estimator which motivates an investigation into the performance of return-level estimation under stopping rules.
Testing the data against some 'stopping criterion' at regular intervals falls into the setting of sequential analysis, which has a rich literature covering applications from quality control (Wald, 2004), to clinical trials (Todd et al., 1996) and abundance modelling (Barry and Coggan, 2010). Many studies have considered the influence of such stopping rules on likelihood inference, e.g., Barndorff-Nielsen and Cox (1984) consider the distribution of the likelihood-ratio statistic under different stopping rules for systems with Brownian motion and Poisson processes, and in the clinical trial setting Whitehead (1986) derives an expression for the bias of the MLE of the treatment effect tested under a sequential probability ratio test. Some papers compare the bias under different experimental designs or stopping criteria (e.g., Bauer et al. (2010)). Cox (1952), Whitehead (1986) and Stallard and Todd (2005) propose bias-reduced estimators by approximating the bias and subtracting this from the usual estimate. One such approach uses an iterative method corresponding to a bootstrap bias correction (Efron, 1990). Kenward and Molenberghs (1998) consider iid sampling from a Normal distribution using a deterministic stopping rule and study the estimation of the mean parameter of this Normal distribution. Molenberghs et al. (2014) extend this setting to the use of a probabilistic stopping rule. They note that an unbiased estimator of the mean parameter can be obtained from the conditional likelihood (we derive such estimators in §3) however, at the cost of an increased mean squared error (MSE) in comparison to the MSE of the sample average (the standard estimator if the sample size was fixed). The increased variance of a bias-reduced estimator appears to be an issue for many of the proposed bias reduction methods. For example, bias reduction using Rao-Blackwellisation (Bowden and Glimm, 2008) and shrinkage estimators (Carreras and Brannath, 2013) often have a worse MSE than the standard MLEs.
In §2 we introduce the notation used throughout the paper and discuss likelihood inference under stopping rules. In §2.3 and §2.4 we discuss the bias under the fixed-threshold and variablethreshold stopping rules respectively and derive expressions for the bias when sampling from some simple distributions. We introduce two conditioning-based likelihood estimators in §3. In §4 we perform a simulation study for sampling from the GEV distribution using the two stopping rules and discuss the properties of the estimators in this setting. We apply our estimators to the Lune river flow data in §5 and discuss our conclusions, the practical usage of the methods and extensions in §6.
2 Inference under stopping rules

Introduction
Throughout this paper we restrict our attention to sequences of iid observations arising from some distribution with a density of f (x; θ), where θ is the parameter vector for the distribution. We sample consecutively until some stopping criterion is met and denote the (random) time at which the stopping criterion is met by N . We write x n for the nth observation and x 1:n for the vector of observations (x 1 , . . . , x n ). We define the log-likelihoods of the sample, both including ( std ) and excluding ( ex ) the final, large observation as follows: ex (θ; n, x 1:n ) = std (θ; n, x 1:n ) − log f (x n ; θ) .
(2.2) Given data (n, x 1:n ) an estimate of the parameter vector is obtained by maximising the log likelihood:θ(n, x 1:n ) = arg max θ (θ). When the nature of the data is clear we abbreviate this to θ, and depending on the likelihood used we have estimatorsθ std orθ ex .
In practice we would not consider estimating return-levels (particularly for large return periods) from a sample of only a very small number of observations. However, the fixed-threshold stopping rules can result in samples of size 1, and this can lead to parameter identifiability issues for data sets simulated from the hypothesised data-generating mechanism. In reality, if an analysis has been requested then sufficient information would be available to derive a meaningful estimate. This information could be historical information, hydrological knowledge, data from other sites, or data at the current site collected before the instigation of a stopping rule. We call this the historical data and, for simplicity in this article, code the historical data as some number, n 0 of data values collected before the stopping rule could be invoked. Real decisions will incorporate this information, and our analysis should allow for this.
We are interested in a set of y-year return-levels, x y (θ) (y ∈ Y), for some set Y, such as {50, 200, 1000}. In particular, we wish to understand the behaviour of the estimators x y ( θ(N, X 1:N )) (with x y given by expression (1.2) for GEV sampling) when the dataset arises from a stopping rule. In this section we focus on the relative bias, and in §4 we look at other properties including the relative root-mean-squared error, given respectively by: where x y (θ) is the true y-year return-level.
In §2.2 we detail a well-known result that the likelihood for the data (n, x 1:n ) with a random stopping time is the same as for data x 1:n with n fixed. However, the properties of the estimator, such as its bias and variance as well as the coverage of any confidence interval, may be influenced by the different data-generating mechanism.
The properties of likelihood-based estimators of tail quantiles under our stopping rules are intractable for data arising from the GEV or GPD distributions. However, for a particular special case of the GPD, the exponential distribution, certain properties are tractable and this provides insight into the behaviour observed in the simulation studies of §4 for the GEV. Specifically, in §2.3 we derive the bias in quantile estimates for exponential data under the fixed-threshold stopping rule, and in §2.4 show that, under the variable-threshold stopping rule, quantile estimates for gamma data (including the exponential as a special case) with a known shape parameter are unbiased.

Likelihood in presence of a stopping rule
In practice it is usual to assume the sample size, n, is fixed in which case the likelihood for data x 1:n is L f ixed (θ; x 1:n ) · · = n i=1 f (x i ; θ). Now, following Pawitan (2013), we derive the true likelihood for the data sampled using a general stopping rule which is a function of the data and not the unknown parameter vector. We define a stopping region S n = S n (x 1:n−1 ) such that we stop sampling if X n ∈ S n and continue to sample otherwise. We abbreviate P (X i ∈ S i ) by p i and we let f X i |S i and f X i |S c i denote the densities of X i conditional on X i ∈ S i and X i ∈ S c i . The likelihood for the full data is L std (θ; n, x 1:n ) = P (N = n, X 1:n = x 1:n |θ) = P (N = n) P (X 1:n = x 1:n |N = n, θ) The logic here is that to have a sample of size n the final observation must be in the stopping region and all other observations outside their respective stopping regions, hence P (N = n) includes indicator functions of the observations being in the correct sets. The last step follows since the indicator functions do not depend on the unknown parameter, θ, and so are absorbed into the proportionality constant. Thus inference purely from the likelihood leads to the same conclusions whether we have a random sample size according to some stopping rule or a fixed sample size. Thus, the MLE, θ, and the observed Fisher information are the same in both cases. However, the properties of the estimators are different since the distribution of {N, X 1 , . . . , X N } is different to the distribution of {X 1 , . . . , X n } for some fixed n. In particular, estimators obtained from L std can be biased even when estimators from L f ixed are unbiased, as seen in §1 for Bernoulli sampling.

Fixed-threshold stopping rule with exponential observations
Let X i have an exponential distribution with an unknown rate parameter of β, which is a special case of the GPD used to model the tails of a distribution and is given by expression (??) with ξ = 0 and σ = β −1 . The y-observation return-level is x y = (log y)/β and, since this is proportional to 1/β, the relative bias is β/ β − 1 whatever the value of y. The MLE of β −1 for a sample of size n, whether fixed or random is simply x, where x is the sample mean. When n is fixed, the MLE, ( β f ixed ) −1 = X n , is an unbiased estimator of 1/β; however with the fixed-threshold stopping rule N follows a geometric distribution where 1/k is the probability of a 'success' i.e., an exceedance. The geometric distribution is a special case of the negative-binomial distribution and so we know the estimator of the probability of exceedance of a fixed threshold is positively biased ( §1 under (1.4)). Now ( β std ) −1 = X N and, similarly, the MLE when excluding the final observation is ( . It is straightforward to show (see Appendix A) the following.
Proposition 1 Let X 1 , X 2 , . . . be a sequence of iid random variables with X i ∼ Exp(β). Let N arise from the fixed-threshold stopping rule (1.3) giving data (N ; X 1:N ). Letx std y = x y ( β std (N ; X 1:N )) be the estimator of the 1 − 1 y th quantile (equivalently the y-observation return-level) obtained from the MLE for β from the full likelihood and letx ex y = x y ( β ex (N ; X 1:N )) be the estimator from the likelihood excluding the final observation.
Then the relative biases of the return-level estimators are In Proposition 1, when excluding both the final observation (and the fact that it is the final observation), when N = 1 the MLE is undefined since there are no data; x 1 is unknown and the fact that N would be greater than zero was known before the data-collection process began; we therefore condition on N > 1.
Proposition 1 shows that the estimator of any return-level using the full likelihood is always positively biased, whereas if the final observation is omitted the estimator of any return-level is always negatively biased. The final data observation is the largest and has been shown by Davison and Smith (1990) to be the most influential on the MLE fit so when this value, together with the information that it exceeded the threshold, is omitted from the dataset this changes the bias and, potentially, also the variance of the return-level estimator and risks being inefficient. Nevertheless, for thresholds with only a small chance of exceedance, i.e., large values of βc k , RelBias(x std y ) ∼ (βc k ) 2 exp(−βc k ), whereas RelBias(x ex y ) ∼ −βc k exp(−βc k ), that is, the bias is a factor (βc k ) −1 smaller for estimates where the final observation is ignored. The higher the threshold, the larger the typical data set that is generated before the stopping criterion is met and the less biased the estimate of any return-level.
In Figure 2 we compare the relative bias of the estimates of β −1 (and hence also for the returnlevel estimates) both when including and excluding the final observation when varying k, the true return period of the stopping threshold, c k . The two additional curves correspond to estimators that will be introduced in §3. The maximum relative bias in the standard returnlevel estimator is 0.4; i.e., the estimator is around 1.4 times the true value. This occurs for a threshold corresponding to k ≈ 7, i.e., when we stop sampling if an observation exceeds the 7-observation return-level. Clearly this will generally result in a very small sample so we would expect return-level estimates to also be highly variable in this case.

Variable-threshold stopping rule with gamma observations
The positive bias in return-level estimates that arises from the fixed stopping rule is partly a result of the geometric distribution of N (see §2.3). For the variable-threshold rule N no longer has a geometric distribution and we find empirically for the GEV (see §4.3) that the bias is typically reduced; as we now show, at least for one parametric family of distributions, the bias disappears entirely.
As with the exponential distribution, the return-levels of the gamma distribution are proportional to β −1 , with the constant of proportionality depending on α. Furthermore the MLE from the full likelihood satisfies β −1 = x/α. Thus, for some constant of proportionality γ (depending on α and the return period, y), the variable-threshold stopping rule is equivalent to Theorem 1 With N , X 1:N and X 0 as defined in (2.5) and (2.6), for all n ∈ N: A proof for this theorem is provided in the Appendix A.2. From Theorem 1 we see that E X n |N = n = α/β, and hence: Corollary 1 For a sample obtained as in Theorem 1, the sample mean and y-year return-level estimate are unbiased: Contrasting Corollary 1 with Proposition 1, both of which apply to the exponential distribution, we see that the standard estimator can be unbiased for the variable-threshold stopping rule even though it is strongly positively biased for the fixed-threshold stopping rule.

New conditional likelihoods
Motivated by the lack of bias in the stopping rule of Molenberghs et al. (2014), we propose a similar estimator for our scenarios by conditioning on the fact that only the final observation met the stopping criterion. The likelihood, therefore, consists of the conditional densities of the data values given that each of the first n − 1 is outside its stopping region and the nth is inside its stopping region. The log likelihood, f c , is as given in (3.1) and the corresponding estimate is denotedθ f c . For the fixed-threshold stopping rule this effectively conditions out the geometric distribution for N ( §2.3); it might be hoped, therefore, that it might remove that part of the positive bias that is due to the randomness of N .
By conditioning on the final observation exceeding its stopping threshold and all other observations not exceeding theirs we are effectively losing all of this information which will lead to larger uncertainty in the estimates, e.g., giving wider confidence intervals. Hence, we consider a further likelihood which conditions only on the fact that the final observation exceeds its threshold. Like full conditioning, this results in the stochasticity of N being less influential. We refer to this method as partial conditioning with log likelihood, denoted by pc , given in (3.2). The corresponding estimate is denoted byθ pc .
In summary, the two new log likelihoods we consider are: where s k,i is the lower boundary of the stopping set for the ith observation; then for the variablethreshold stopping rule s k,i =x std k (x 1:i−1 ), i.e., it is the standard estimate of the k-year returnlevel using all the data up to and including the previous observation, and for the fixed-threshold stopping rule s k,i = c k for all i.
The examples of §2 have exponential tails with an unknown scale parameter. When data values are modelled using the GEV, uncertainty in the shape parameter, ξ, has a much larger impact on estimates of high quantiles than the uncertainty in the other two parameters, µ and σ (Coles, 2001). So we now consider estimation of the shape parameter and high quantiles using the standard and partial conditioning likelihoods. For simplicity we focus on an idealised scenario where we take µ = 0 and σ = 1 as known, so X has a distribution function of F (x; ξ) = exp −[1 + ξx] −1/ξ + and a survivor function ofF = 1 − F .
For quantile estimation the standard likelihood estimator of ξ, i.e.,ξ std , leads to a positive bias for high quantiles. This can be seen as follows. The y-year return level can be written where a y ≥ 0 provided y ≥ e/(e − 1) ≈ 1.6. Return levels as low as 1.6 years are of no practical interest in our setting. When a y > 0,F −1 is an increasing, convex function of ξ ∈ R, so, whatever the likelihood, Jensen's inequality gives E ξ F −1 (1/y; ξ) ≥F −1 (1/y; E ξ [ξ]). The monotonicity ofF −1 implies that even ifξ std is unbiased, we should expect a positive bias in all quantile estimates, and this will only be exaggerated if (as we find in our stopping-rule simulations) ξ is positively biased.
This bias in the estimator for high quantiles is guaranteed to be less positive when using the partial conditioning likelihood rather than the standard likelihood. To see this first note that where c, the stopping threshold, has been standardised. The resulting MLE,ξ pc , satisfies ⇒ξ pc <ξ std .
Thus, if the standard estimator is positively biased the partial conditioning method will be less positively biased. Given that this effect is magnified for return levels, as shown above, we should expect improvements in return level estimates using the partial conditioning likelihood. An analogous argument to the above also applies to data modelled using the GPD, except that there is no restriction on y, and a y = log y.

Application to exponential observations
Consider iid sampling from the exponential distribution with rate parameter, β, using the fixedthreshold stopping rule. The relative bias for return-level estimators using the log-likelihoods (2.1) and (2.2), are detailed in Proposition 1 and plotted in Figure 2. Figure 2 also plots the relative bias for the likelihoods in (3.1) and (3.2), the latter has the form Estimatorx pc y is negatively biased but the bias is smaller than that forx ex y . The bias ofx pc y tends to 0 as c k tends to infinity at the same fast rate as forx ex y ( §2.3). We were unable to obtain a tractable expression for the bias of the full-conditional estimator. In Figure 2 this bias was found using Monte Carlo methods. The bias is very low and tends towards 0 much faster than any of the other estimators considered. This finding is similar to that of Molenberghs et al. (2014) for the mean of normally distributed observations with a probabilistic stopping rule; however, the MSE of the unbiased estimator was found to be poor compared to that of the standard estimator. In §4 we show that in our 'extremes' setting, the full-conditional MSE for a return-level is often lower relative to the MSE of the standard estimator since the high variance of return-level estimators using the standard likelihood is in part due to the final observation being large. Furthermore in §4 we show that the partial conditioning approach results in estimators with much reduced variance and that this leads to lower MSE compared to the standard likelihood approach.

Simulation results
In this section we focus on the return-level inference when sampling from the GEV distribution with the two stopping rules of §1. In §4.2 we calculate the fixed stopping threshold, c k , for a range of return periods, k, using (1.2) and our knowledge of the true parameters µ, σ and ξ. In §4.3 we consider the variable threshold stopping rule over a range of k. Similar simulation results are given in the supplementary material for the GPD.

Simulation design
We investigate true return-periods, k, between 20 and 2000. When generating the data, for each k, for the fixed-threshold stopping rule, we set c k to be the true (1 − 1/k)th quantile of the data-generating distribution (i.e., the k-yr return-level) whereas for the variable-threshold rule the threshold is the estimated (1 − 1/k)th quantile; with both rules we stop at the first exceedance. In the simulation study, for each combination of θ, stopping rule and k, a large number of data sets were simulated to evaluate the RMSE, bias and variance of the estimators. Given the likelihood M for M ∈ {std, ex, fc, pc}, detailed in equations (2.1), (2.2), (3.1) and (3.2), profile likelihood confidence intervals for a return-level are studied in terms of their coverage and width.
One major issue with simulating data sets with stopping rules is parameter identifiability. For observation i, the stopping decision of the variable-threshold rule is based on the parameter MLEs using observations 1, . . . , i − 1. However, with N ≤ 2 observations contributing to a likelihood the GEV parameters are strictly not identifiable, and for larger but low values of N the parameters are still not practically estimable. As discussed in §2.2, in practice there is typically additional information which is incorporated into decisions, and our analysis should allow for this also. Such historical information is treated as fixed and introduces a fixed extra penalty term, P hist (θ), into the log-likelihood; in a Bayesian analysis it would constitute prior information about the parameter vector. As our simulation studies are conducted without such evidence we treat the first n 0 simulated values as providing historical information on θ; we call x := (x 1 , . . . , x n 0 ) the historical data. Thus each simulated data set has the penalty contribution to the likelihood: a contribution that does not depend on the stopping rule since we imagine that these data were available before decisions to stop and analyse the data were being made.
We fix the historical data,x, using an even spread of values: x j = G −1 (j/(n 0 + 1); θ) for j = 1, ..., n 0 (4.2)  where G is the distribution function of the data-generating GEV distribution. In addition to providing a natural spread of values and stabilising the likelihood, for the fixed-threshold rule, provided c k is greater than the 1/(n 0 + 1) return-level, no historical value exceeds the stopping threshold. The stopping threshold,x k (x 1:i−1 ) is now, implicitly, also a function ofx. We take n 0 = 10, the smallest value that gave reliable numerical estimates forx k (x 1:i ) with i ≥ n 0 , across the set of different true values for θ that were used in the simulation study.

Fixed threshold stopping rule
In the appendix §B we describe in detail the behaviour of the shape parameter estimates in our simulations. In particular we found the standard shape parameter estimator has both large positive bias and large variance. The formulae in §3.1 show that return level estimates are exponential in the shape parameter. For high return periods moderately large ξ estimates can lead to unrealistically high return-level estimates which exert unwarranted influence on statistics based on empirical averages, such as estimated bias. Hence, we use trimmed averages here. (µ, σ, ξ) = (0, 1, 0.2) using the variable-threshold stopping rule over a range of k. See Figure 3 for associated detail. Based on 10000 replicated samples with the historical data created using approach (4.2). Coverage is based on 3000 replicated samples.
due to the low variance of these estimates whereas x f c y has the lowest bias. Both conditioning estimators,x pc y andx f c y , improve upon thex std y especially when we are estimating very high return-levels (i.e., for larger y) and/or the underlying distribution is heavy tailed. Althougĥ x ex 200 has somewhat similar properties tox pc 200 for ξ = 0.2 it has larger RRMSE for ξ = −0.2. The fitted distribution using ex typically has a lighter tail and can even have an upper end point which is less than the excluded observation.
The coverage for all likelihoods gets closer to the correct value (here 95%) as k increases for any return period, y. For std we have overcoverage and the widest confidence intervals on average and using f c we have good coverage, particularly when the distribution is heavy tailed. For the other likelihoods there is mostly undercoverage (coverage ranging from 80-95%) due to upper bounds being too low. The exclusion of upper tail information results in relatively narrow confidence intervals from ex . In contrast, f c produces a higher upper confidence limit and hence a wider confidence interval than pc and ex because the likelihood essentially neglects the distribution of N , i.e., the threshold exceedance counts, which contain some information about the upper tail of the distribution. The confidence intervals produced using f c vary greatly in width across our simulations with a larger median width than those using std .

Variable threshold stopping rule
Within the samples simulated we find the stopping thresholds,x k (x 1:m ), over m < N are generally less than the true k-year return level, x k . As a result the samples are both smaller in size and consist of smaller values than when using the fixed-threshold stopping rule. So return level estimates calculated using std have a small positive bias and those calculated using the other three likelihoods have a larger negative bias than observed for the fixed-threshold stopping rule.
The properties of the 200 year return-level estimators for ξ = 0.2 are shown in Figure 4, the 50 and 1000 year return-levels and ξ = −0.2 are considered in the supplementary material ( Figures ??-??). For ξ = 0.2 the conditioning methods provide the best return-level estimators in terms of RMSE despite the estimators having a larger squared bias thanx std y . The reason for this is that for heavy tailed distributions the variances of return-level estimators are generally larger than the bias. However, for lighter tailed distributions the bias plays a larger role as the relative variances of the different estimators are much closer together. As a resultx pc 200 andx f c

200
can perform marginally worse thanx std 200 in terms of RRMSE when the distribution has a light tail.
The coverage for std is high (96-98%), decreasing only slightly as k increases but it has the widest confidence intervals generally. Using either pc or ex leads to undercoverage, as k increases ranging from approximately 95% to 83-87% for ex and from 93% to 78-85% for pc with coverage higher when the distribution is heavy tailed. The coverage for f c also reduces with increasing k from 99 − 100% for k = 20 to approximately 90% for larger k. On average the confidence intervals using f c are narrower than using std but generally wider than those using ex or pc .
Overall, f c provides the 'best' results when using the variable-threshold stopping rule. The RRMSE ofx f c 200 is generally lower than that ofx std 200 , coverage is above 90% and the confidence intervals are narrower on average than those using the std . Although pc provides estimators with a lower RRMSE than f c , particularly when the distribution is heavy tailed, it has more severe undercoverage.

Use in Practice
In practice, for the analysis of data that we believe has been obtained by the flood management agencies using a the fixed-threshold rule we must set a threshold, c, and if they use a variablethreshold rule we must set a return period, k, neither of which may be known. This is important since the behaviour of the estimators can vary depending on the return period, k, associated with the stopping threshold (as we have seen in §4.2 and 4.3). For the fixed-threshold rule, c should lie between max i<n x i and x n . For the variable-threshold rule k should be such that x i ≤x k (x 1:i−1 ) for all i < n, but x n >x k (x 1:n−1 ). To use the simulation study results to understand the properties of the estimators it is useful to narrow down a range of feasible k. For the variable rule, a range of possible k can be determined from the data. However, for the fixed-threshold stopping rule k is unknown. Nevertheless, we are likely to have some idea of the range of k which corresponds to c, i.e., we have a prior belief for k.
The 'historical data' also needs to be determined, maybe incorporating prior knowledge in some way. The simplest approach is to start using the stopping rule after the first n 0 observations of the data set and use these n 0 values as the historical data. The choice of n 0 only affects the point estimates and confidence intervals using f c . However n 0 and the historical data itself can have a large impact on the properties of the estimators, particularly when the sample size is small. In the simulation study, out of necessity, we have restricted ourselves to a particular fixed historical sample, so for low k the properties of the estimators will differ slightly in practice.

Case Study -Lune at Caton
We now consider the analysis of the 48 annual maximum river flow observations from the Lune at Caton introduced in §1. Figure 1, right panel, shows the inference for the 200 year return-level of the data, at yearly intervals as new data are observed, with the analysis not accounting for any stopping rule. We now estimate this return-level using the four inference methods (standard, exclude, and our full-and partial-conditional) for both fixed-and variable-threshold stopping rules for a range of levels (c and k respectively), where we drop the subscript of c as the return period of the stopping threshold is unknown. The following discussion assumes that the sampling procedure is well approximated by these respective stopping rules for the selected c and k. In all cases we take the historical data to be the first n 0 = 10 observations as in practice no estimates of long period return levels would be attempted from smaller samples. We also consider the implications if a trend in the annual maxima is also simultaneously estimated.

Fixed-threshold stopping rule
First we discuss the inference using the fixed-threshold stopping rule with c = 1568m 3 /s, where, for illustration purposes, c is taken to be the mid-point between the 1995 and 2015 levels and the realised value of N is 38, i.e., we stop after 2015. . The confidence interval for 2015 using std is wider than the intervals of the previous 15 years, especially the 2014 interval (i.e., using ex ), and both the lower and upper bounds are much larger. In this case study, the confidence interval of x 200 using pc is similar but slightly narrower than when using ex . However using f c the interval is wider (since the upper bound increases) than if we just ignored the 2015 event (using ex ) so we are capturing some of the increased uncertainty in the heaviness of the tail that this event has caused. Nevertheless, the upper confidence bound of x 200 is lower than that using std .
The behaviour of the confidence intervals of these methods appears to be in line with our coverage and width results in §4.2. Indeed, here the shape parameter estimates, (ξ std ,ξ ex ,ξ f c ,ξ pc ), are (0.04, −0.07, −0.04, −0.05) so we expect coverage to be between the coverage values found in the simulation study for ξ = 0.2 and ξ = −0.2. In the study we found that using std with the fixed-threshold stopping rule leads to overcoverage (95-98% for ξ = 0.2, 97.5-99.5% for ξ = −0.2) and the upper bound of the confidence interval found using std is lower than x 200 only 1-2% of the time, so it is likely that for the Lune data the upper bound of the confidence interval using std is too high. This is further emphasised for the Lune estimates by the upper bound for 2015 exceeding the associated values for the previous 30 years (Figure 1). In §4.2 we found thatx ex 200 andx pc 200 exhibited narrow confidence intervals which together with their negative bias led to undercoverage, with the upper bounds being too low, especially when ξ = −0.2 and k is low. For our chosen c = 1568 we can obtain estimates of the corresponding return period, k, of c; in particulark std = 90 andk ex = 550 and we expect k to lie between these two values. Thus, using the simulation study results, we expect that the coverage of the pc and ex confidence intervals to lie between 85 and 95%. However the lower bounds of these confidence intervals were found to be less than x 200 for almost 100% of simulated samples so it is highly likely that Caton with 95% profile likelihood-based confidence intervals: left with the fixed-threshold stopping rule over a range of c and right with the variable-threshold stopping rule over a range of k: standard likelihood (red), excluding the final observation (black), full conditioning (green) and partial conditioning (blue). Each group of 4 estimates applies for the same c/k as for the standard estimate in each group and have been horizontally shifted for clarity.
the true 200-year return level for the Lune data is above the lower bounds given by the pc and ex confidence intervals. For f c and 90 < k < 550, the coverage is 94-95% with the percentage of upper bounds too low being 3-6% suggesting that with the Lune data the upper bound of the f c confidence interval is likely to be higher than the true 200-year return level, x 200 . The above discussion assumed that c was known. In some cases this may be true as c could represent a known physical limit linked to flooding. This is not the case for the Lune at Caton, with our value chosen subjectively for illustrative purposes although it could be argued that lower c values in this range would be more reasonable since the 1995 river flow observation was considered high as it led to flooding. To assess the impact of c we consider a range of values for c between the 1995 and 2015 observations, with the inference for the four methods presented in Figure 5, left panel (when c = 1568 the estimates are those shown in Figure 1 right panel). Now, x std 200 andx ex 200 and the corresponding confidence intervals are invariant to c but as c increaseŝ x f c 200 andx pc 200 both decrease. As noted earlier,x f c 200 >x pc 200 but they become closer as c tends to the 2015 event level because the information that f c discards, i.e., the probability of the event that c was not exceeded on the first n − 1 observations, becomes less informative. The confidence intervals using the conditioning likelihoods notably narrow with increasing c; the lower bounds slightly decrease but the largest reduction is in the upper bounds. For lower c values the f c intervals are wider than for std , in contrast for the largest possible c values the interval is very narrow (a reduction in size of factor 14 over the range of c possible). For pc the upper bounds are smaller than those using the std for all values of c and are slightly larger than those for ex for low c. However, for large c the upper bounds of both conditioning confidence intervals are much lower than that using ex since the information that c was exceeded on this observation becomes more informative about the tail of the distribution as c approaches the 2015 observation. Thus if we stop after the first minor exceedance of c we can be reasonably sure the tail is short. This is an unexpected but helpful finding. Further investigation into the confidence intervals can be found in Barlow (2019).

Variable-threshold stopping rule
Now we consider the variable-threshold stopping rule and first determine a range of k from the data. In the Lune data the maximum river level in 2015 corresponds tok = 2561 given the data up to 2015 and tok = 188 using all the data. However, the river level in 1995 corresponds tô k = ∞ (i.e., it is larger than the point estimate of the upper end point of the GEV fitted to the data up to 1995) so the variable threshold rule as given in (1.4) cannot have been applied for any k < ∞. Furthermore, the river level in 1980 corresponds tok = 111. If the variable-threshold stopping rule had motivated a request for an analysis of the data up to and including 2015, the request must have been triggered by the second such exceedance. In our analysis we explore values of k between 200 and 2500 and simply amend f c slightly by replacing the f c contribution of the 1995 observation (i = 28), g(x 28 ; θ)/G(x std k (x 1:27 ); θ), by g(x 28 ; θ). The intervals using the conditioning likelihoods and variable-threshold stopping rule behave similarly to those using the fixed-threshold stopping rule. Again the f c intervals are highly influenced by the 'extremeness' of the stopping threshold. With the lowest possible k for this data set (ignoring the 1995 exceedance) the f c interval is almost double the width of the confidence interval using std whereas for a large k value it is less than half the width. The pc confidence intervals also reduce in width with increasing k but not as dramatically.

Non-stationarity
The implications of using stopping rules on the estimation of trends in extreme levels is also a concern, as stopping with the final observation being large is likely to have a similar biasing effect as found in §2 and §4 for return-levels. This is particularly important given the interest in whether trends in extreme values differ from trends in mean levels (Eastoe and Tawn, 2009;Hannaford and Marsh, 2008). In Figure 6 we illustrate the analysis of the Lune data with a GEV distribution including a linear trend µ t = α 0 + βt, showing both the resulting estimates of the 200 year return-level for 2015, i.e., the estimates of the 0.995 quantile of the annual maximum in 2015, and the associated trend estimateβ using progressively more data over time. With few data used the trend is estimated to be unrealistically large, with huge uncertainty, and this results in very different point estimates of return-levels relative to the analysis with no trend. As more data are observed we can see that the trend estimates generally decrease, with reduced uncertainty, with positive jumps inβ estimates after the large 1995 and 2015 events. Although the 2015 river flow is more extreme than that of 1995 its impact onβ is much less. Furthermore, we see thatβ is not statistically significantly different from β = 0 at the 2.5% level. Thus here the effect of including the estimated trend is small on the 200 year return-level estimate and the stopping rule seems to have almost no effect on the trend estimate. Left: 200yr return-level estimates for 2015 using progressively more data over the years with and without a trend in the location parameter (pink and black respectively). Each group of 2 estimates applies for the same year and have been horizontally shifted for clarity. Right: Slope parameter,β, and it's 95% confidence interval.

Discussion
In this paper and the associated supplementary material we have shown that return-level estimators based on the standard likelihood are positively biased when sampling from the GEV or GP distributions using certain stopping rules. The extent of the stopping bias is lower for lighter tailed distributions and when estimating low return-levels. We have proposed conditioning upon the stopping threshold in the likelihood. In most cases we have found that conditioning on the final observation exceeding the stopping threshold (partial conditioning) results in return-level estimates with the lowest RMSE despite the estimator being negatively biased.
A balance must be struck between low RMSE and good coverage, however. Partial conditioning results in undercoverage despite the low RMSE ofx pc y . The full-conditional likelihood, which also conditions on the non-exceedance of all previous observations, gives the closest to 95% coverage and though the intervals are wide, they are typically narrower than the confidence intervals obtained from the standard likelihood. The interval widths using the full and partial conditional likelihoods are smaller the closer the stopping threshold is to the final observation as the occurrence of the final exceedance becomes more informative on the tail of the distribution (see §5.1).
Overall, the conditioning estimators presented here outperform the standard estimator when the decision to analyse data at a particular time was triggered by what was perceived to be a large observation. For the fixed-threshold stopping rule, partial conditioning has the best combination of RMSE and coverage for a range of ξ with moderate k and particularly when the distribution is heavy tailed, as is the case for most UK rivers (Institute of Hydrology, 1999). For the variablethreshold stopping rule, full conditioning provides the best balance of coverage and low RMSE. To apply the conditioning estimators in practice if the rule of the flood management agency is unknown the statistician needs to choose a suitable stopping threshold, c, for the fixed-threshold stopping rule and a suitable stopping 'period', k, for the variable-threshold stopping rule if the values are unknown. A range of c and k can be considered provided that the observed data are below the resulting stopping threshold(s) up to the final observation.
The decision to analyse data will likely be based on a confluence of many factors. Our work attempts to simplify the true decision making procedure by using stopping rules based on the occurrence of a single large observation exceeding some threshold. An analysis may instead be prompted by a prolonged period of quite large (but not necessarily 'extreme') observations or the observation of large values at many locations simultaneously, requiring more complex multivariate analysis since the observations at nearby locations will be dependent in some way (Keef et al., 2009;Asadi et al., 2015).
In practice if the stopping rule is unknown and the analysis is triggered by a large event, we suggest using the full conditional return-level estimator. However if k is thought to be less than 50, or the full-conditional estimate and/or confidence interval are clearly too large then partial conditioning should be used instead. We argue that the decision to 'stop' and analyse data would in part be based on both past return-level estimates and thresholds set due to current infrastructure and so the 'true' stopping rule is a mixture of the two rules considered here. Hence the 'true' bias, RMSE and coverage of the estimators can be expected to lie between those which we found under the two stopping rules. It should be noted that this work does not address the question of when the data should be analysed, but rather how we can reduce the bias given the use of a particular stopping rule. Nevertheless, if we are at a point in time where a stopping criterion has been met and triggered an analysis, this study can give guidance on the behaviour of return-level estimators calculated at the current time whether based on the full likelihood, partial or full conditioning, or even excluding the most recent, 'triggering' event.
In our theoretical and simulation studies we have not accounted for the possibility of a trend in the data, such as river flows gradually increasing over the years. We saw in §5 that the Lune data has a slight positive trend in the location parameter and fitting such a model at an earlier point in time resulted in a very large positive trend. This could cause problems for the fixed-threshold stopping rule, in particular it might become necessary to change c after a certain number of years. Nonetheless, doing this is probably not too unrealistic since, for example, the height of a flood defence might be increased if there has been evidence of higher flow in recent years. On the other hand the variable-threshold stopping rule is more robust to data with an underlying trend as it is directly a function of the observed data.
where F (x) andF (x) = 1 − F (x) are the CDF and survival function of the distribution of X i , i = 1, . . . , n. Thus, Specifically for sampling from the exponential distribution: By the memoryless property of the exponential distribution: .
Therefore, for the exponential distribution, For the standard estimator based on the full sample we have 1/β std = X N , the sample mean. The first part of Proposition 1 then follows from (A.4).
If the final data point is excluded from the sample then all included samples are from the distribution truncated at c, so, from (A.3), leading to the expression in the second part of Proposition 1.

A.2 Proof of Theorem 1
We start by defining the following key quantities for each k ≥ 1, Marginally S k ∼ Gamma((n 0 + k)α, β) and V k ∼ Beta(α, (n 0 + k)α); we denote their marginal densities as: The stopping time, N , is n if X n > γX n−1 and X i < γX i−1 for 1 ≤ i < n. However, X n > γX n−1 ⇔ X n > γ n + n 0 − 1 (S n − X n ) So the stopping rule can be written purely as function of the V s. Explicitly, we stop at time n if V n > γ n+γ+n 0 −1 and V i < γ i+γ+n 0 −1 for 1 ≤ i < n. We define the statement A n · · = "V 1 , . . . , V n , S n are mutually independent". Below, we will show by induction that A n holds for all n ≥ 1. Thus X n ⊥ ⊥ V i ∀i ≤ n; the distribution of X n is independent of whether or not the stopping rule has been triggered. Therefore, X N conditioned on N = n is equivalent to the mean of n i.i.d. Gamma(α, β) random variables, as stated in the theorem.
A n−1 ⇒ A n : If A n−1 holds then the joint pdf of V 1 , . . . , V n−1 , S n−1 can be factorised: Consider the change of variables (V 1 , . . . , V n−1 , S n−1 , X n ) → (V 1 , . . . , V n−1 , V n , S n ), where X n = S n V n and S n−1 = S n (1 − V n ). The Jacobian for this transformation is: where I n−1 is the (n − 1) × (n − 1) identity matrix and So, since S n−1 and V 1 , . . . , V n−1 are independent of X n , So A n holds.
A 1 holds: We must show that V 1 and S 1 are independent. We do this by using the change of variables (X 0 , X 1 ) → (V 1 , S 1 ) to show that the joint pdf of V 1 and S 1 factorises.
We have and X 1 = S 1 V 1 and X 0 = 1 n 0 S 1 (1 − V 1 ). So Jacobian for the transformation is: Thus the joint pdf of V 1 , S 1 is: as required. using the fixed-threshold stopping rule with threshold c k and ξ = 0.2 (top) and ξ = −0.2 (bottom) both plotted against k. Left: relative bias, centre: relative variance, right: relative RMSE, using: standard likelihood (red), excluding the final observation (black), full conditioning (green) and partial conditioning (blue). Based on 10 5 replicated samples with the historical data created using approach (4.2).
B Properties of the GEV shape parameter B.1 Fixed-threshold stopping rule The shape parameter, ξ, is important in determining the tail behaviour. Figure A.1 shows the relative bias, variance and RMSE of each of the estimators when sampling using the fixedthreshold stopping rule for ξ = 0.2 and −0.2 (top and bottom rows respectively). Judged by RRMSE, we find that pc is generally best for moderate to large k, with clear benefits for ξ = −0.2; however f c has generally quite similar RRMSE and low bias. As one would expect the lighter the tail of the distribution, the smaller both the relative variance and, in most cases, the relative bias of the shape parameter estimators resulting in smaller RRMSE. To help understand why these RRMSE results arise we now look at more detail at the performance of the four estimators.
The standard MLE for the shape parameter,ξ std , is almost always positively biased whileξ ex leads to quite large negative bias (with E(ξ ex ) < 0.1 when ξ = 0.2 and k < 50 ( Figure A.1)) since we lose information about the upper tail of the underlying distribution. In particular, the fitted distribution typically has a lighter tail and can even have an upper end point which could be less than the excluded observation. Unlike all other estimators considered, the variance ofξ ex is not substantially lower when the tail is lighter and so has quite large RRMSE when ξ = −0.2.
The partial conditioning method generally hasξ pc lower than the truth however, for moderate k, they consistently have low variance relative to the other methods over a range of ξ. Therefore, partial conditioning provides ξ estimators with the lowest RRMSE for k > 100. In contrast, f c leads to very little bias in ξ estimates for k > 100 but the variance can be large, particularly when ξ = 0.2 with k < 100. This is in agreement with Molenberghs et al. (2014) findings that the full-conditional estimator has poor precision despite it's unbiasedness. However, unlike in Molenberghs et al. (2014), we find that, in our context, full conditioning can improve upon the standard estimator especially when the stopping threshold is high (i.e., for large k).

B.2 Variable-threshold stopping rule
Properties of the shape parameter estimators under the variable stopping rule are shown in the supplementary material, Figure ??, for ξ = 0.2 and −0.2. We find that in the variable threshold settingξ std has very low bias (similarly recall in §2.4 when sampling from the gamma distribution with this stopping rule we found the standard return-level estimator was unbiased) whereas all other ξ estimators are negatively biased, withξ ex having the largest negative bias out of all the estimators for both values of ξ considered. We find thatξ std also has the lowest RRMSE of the estimators. Despiteξ std performing well under the variable threshold stopping rule, this is not always the case for the std return-level estimators.

Supplementary material for 'Inference for extreme values under threshold-based stopping rules' Anna Maria Barlow, Chris Sherlock, Jonathan Tawn
Here we present extra results from the simulation study for sampling from the GEV distribution and in §S.2 discuss our four likelihoods when modelling threshold exceedances.

S.1 GEV
For the fixed-threshold stopping rule the bias, variance and RRMSE results are based on 10 5 replicated samples and the coverage results on 5000 replicated samples. The variable-threshold stopping rule requires much more computational effort than the fixed-threshold stopping rule since the parameters must be estimated after each observation in order to calculate the subsequent stopping threshold. The results shown here are based on 10,000 and 20,000 replicated samples for ξ = 0.2 and ξ = −0.2 respectively. There are more simulations for ξ = −0.2 as the sample sizes generated, N , using the variable-threshold stopping rule are smaller when the distribution has light tails.
We used the profile-likelihood method to create confidence intervals for the return-levels. Confidence intervals could also be calculated via the delta method or via the bootstrap. In the context of return-level estimation the delta method can produce contradictory confidence intervals (the lower bound for a particular return-level may be lower than that of a return-level with a smaller return period). Bootstrap methods were explored but these were found to have poor coverage and are much more computationally expensive (see Barlow, 2019).
The results for the 50 and 1000 year return level estimators follow the same pattern as the 200 year return level estimators shown in the paper but as noted in §4.2 of the paper the bias and variance worsen as the return level estimated becomes more extreme.

S.1.1 Other Methods
We also considered a range of alternatives to our two conditioning likelihoods. The two most effective are detailed here.
Firstly, we considered a truncated likelihood: replacing in std the final observation, x n , by the stopping threshold, c k . The bias of these truncation estimators will always lie between that of the standard estimator and the exclude estimator. When sampling from the exponential distribution with the fixed-threshold stopping rule the relative bias of the truncation estimator is the sum of the relative biases of the standard and exclude estimators, with the return-level estimators from this method essentially always improving upon the standard method in RMSE.
For the variable-threshold stopping rule with ξ = −0.2 this estimator performs well however it is outperformed by the conditioning estimators for most cases with positive ξ and the fixedthreshold stopping rule.  show various properties of the truncation estimator (in grey) compared to the those using std , ex , f c and pc . We did not explore this method further as we wanted to obtain an estimator that works well over both stopping rules and for a range of c k and k.
Bootstrap bias correction estimators (Efron, 1990) were assessed. This method is based on the assumption that the bias of the estimator is approximately the same when the same sampling procedure is used but with the true parameter replaced by an estimate. So one can estimate the bias and correct the standard estimate by taking away this approximated bias. This procedure can be computationally heavy because it involves generating many samples using the stopping rule and evaluating estimates from which to calculate the bias. We do not explore them further here as we found that the bootstrap bias corrected return-level estimates were generally no better than the partial conditional estimates but required considerable additional computation.

S.2 GPD
Consider modelling daily observations above some high threshold, v, rather than just modelling the block maxima (Coles, 2001). For suitably large v the exceedances of this threshold are typically assumed to be exactly modelled by their limiting distribution as the threshold tends to the upper end point of the distribution. This limiting distribution is the generalised Pareto distribution (GPD) (Davison and Smith, 1990) which has distribution function for x > 0: where ξ and σ v > 0 are the shape and scale parameter respectively. Note that if ξ is zero then the GPD is equivalent to the exponential distribution with rate parameter σ −1 v . The shape parameter ξ is the same as that under the GEV distribution whereas the scale parameter changes with threshold with σ v = σ + ξ(v − µ) where (µ, σ, ξ) are the associated GEV parameters. For modelling using the GPD we also need to model the rate at which the threshold v is exceeded. Now consider the estimation of return-levels from a data set of threshold exceedances where the sample size is determined by the fixed-threshold stopping rule. A threshold is chosen above which observations are considered to be extreme and the exceedances of this threshold are modelled using the generalised Pareto distribution (GPD). Thus in this setting we have two thresholds: the one above which we fit the GPD, which we denote by v, and the fixed stopping threshold c k which determines the sample size with c k > v.
For comparison with the GEV results, we set ξ = ±0.2 and σ v = 1 + ξv and have an average of 10 exceedances per year. We start using the stopping rule after 10 exceedances have been simulated (i.e., the first 10 exceedances are the historical data). When the stopping threshold is exceeded we continue sampling to the end of the current year and take this as the full sample. We omit the final year of observations to calculate the exclude estimator.
Let τ v denote the expected number of exceedances of v in a year, then the y-year return-level is: The y-year return-level estimate is calculated by substituting the parameter estimates, (ξ,σ v ), andτ v = n v /n y into (S.2), where n v is the total number of exceedances and n y is the number of years.  The results of our GPD simulation are similar to that for the GEV with fixed-threshold stopping rule ( §4.2 of the paper); the partial conditioning method performs best in terms of RRMSE for estimating return-levels. The main difference to the GEV setting is in estimating ξ; we find that Var(ξ ex ) > Var(ξ std ). In the GPD setting the standard likelihood includes multiple data points from the final year (both that which triggered the stopping rule and smaller points) which are informative about the shape parameter. The exclude likelihood does not include this information hence the larger variance of the shape parameter estimates compared to the standard estimates. using the fixed-threshold stopping rule with threshold c k with ξ = 0.2(top) and ξ = −0.2 (bottom) both plotted against k. Left: relative bias, centre: relative RMSE, right: relative variance, using: standard likelihood (red), excluding the final observation (black), full conditioning (green), partial conditioning (blue) and truncating (grey). Based on 10 5 replicated samples with the historical data created using approach (4.2) of the paper. (µ, σ, ξ) = (0, 1, 0.2) using the fixed-threshold stopping rule over a range of thresholds. From left to right. Top: relative bias, relative RMSE and relative variance. Bottom: coverage, average CI width, difference in RRMSE to RRMSE of standard estimator. Colour scheme is the same as in Figure S.2. Based on 10 5 replicated samples with the historical data created using approach (4.2) of the paper. Coverage is based on 5000 replicated samples. (µ, σ, ξ) = (0, 1, 0.2) using the fixed-threshold stopping rule over a range of thresholds. From left to right. Top: relative bias, relative RMSE and relative variance. Bottom: coverage, average CI width, difference in RRMSE to RRMSE of standard estimator. Colour scheme is the same as in Figure S.2. Based on 10 5 replicated samples with the historical data created using approach (4.2) of the paper. Coverage is based on 5000 replicated samples. (µ, σ, ξ) = (0, 1, −0.2) using the fixed-threshold stopping rule over a range of thresholds. From left to right. Top: relative bias, relative RMSE and relative variance. Bottom: coverage, average CI width, difference in RRMSE to RRMSE of standard estimator. Colour scheme is the same as in Figure S.2. Based on 10 5 replicated samples with the historical data created using approach (4.2) of the paper. Coverage is based on 5000 replicated samples. (µ, σ, ξ) = (0, 1, −0.2) using the fixed-threshold stopping rule over a range of thresholds. From left to right. Top: relative bias, relative RMSE and relative variance. Bottom: coverage, average CI width, difference in RRMSE to RRMSE of standard estimator. Colour scheme is the same as in Figure S.2. Based on 10 5 replicated samples with the historical data created using approach (4.2) of the paper. Coverage is based on 5000 replicated samples.  Colour scheme is the same as in Figure S.2. Based on 5000 replicated samples from the GEV with (µ, σ) = (0, 1)created using the fixed-threshold stopping rule and historical data created using approach (4.2) of the paper. (µ, σ, ξ) = (0, 1) using the variable-threshold stopping rule with ξ = 0.2(top) and ξ = −0.2 (bottom) both plotted against k. Left: relative bias, centre: relative RMSE, right: relative variance, using: standard likelihood (red), excluding the final observation (black), full conditioning (green), partial conditioning (blue) and truncating (grey). Based on 10000/20000 (top/bottom) replicated samples with the historical data created using approach (4.2) of the paper.