Volume 76, Issue 1
Original Article
Open Access

Improved estimation in cumulative link models

Ioannis Kosmidis

Corresponding Author

University College London, UK

Address for correspondence: Ioannis Kosmidis, Department of Statistical Science, University College London, Gower Street, London, WC1E 6BT, UK. E‐mail: i.kosmidis@ucl.ac.ukSearch for more papers by this author
First published: 03 July 2013
Citations: 16

This article was first published online on 3 July 2013, under a subscription licence. This article has since been made Open Access and the licence statement has been updated in this version in June 2014.

Summary

For the estimation of cumulative link models for ordinal data, the bias reducing adjusted score equations of Firth in 1993 are obtained, whose solution ensures an estimator with smaller asymptotic bias than the maximum likelihood estimator. Their form suggests a parameter‐dependent adjustment of the multinomial counts, which in turn suggests the solution of the adjusted score equations through iterated maximum likelihood fits on adjusted counts, greatly facilitating implementation. Like the maximum likelihood estimator, the reduced bias estimator is found to respect the invariance properties that make cumulative link models a good choice for the analysis of categorical data. Its additional finiteness and optimal frequentist properties, along with the adequate behaviour of related asymptotic inferential procedures, make the reduced bias estimator attractive as a default choice for practical applications. Furthermore, the estimator proposed enjoys certain shrinkage properties that are defensible from an experimental point of view relating to the nature of ordinal data.

1. Introduction

In many models with categorical responses the maximum likelihood estimates can be on the boundary of the parameter space with positive probability. For example, Albert and Anderson (1984) derived the conditions that describe when the maximum likelihood estimates are on the boundary in multinomial logistic regression models. Although there is no ambiguity in reporting an estimate on the boundary of the parameter space, as is for example an infinite estimate for the parameters of a logistic regression model, estimates on the boundary can
  1. cause numerical instabilities to fitting procedures,
  2. lead to misleading output when estimation is based on iterative procedures with a stopping criterion and, more importantly,
  3. cause havoc with asymptotic inferential procedures, and especially with those that depend on estimates of the standard error of the estimators (e.g. Wald tests and related confidence intervals).

The maximum likelihood estimator in cumulative link models for ordinal data (McCullagh, 1980) also has a positive probability of being on the boundary.

1.1. Example 1

As a demonstration consider the example in Christensen (2012a), section 7.. The data set in Table 1 comes from Randall (1989) and concerns a factorial experiment for investigating factors that affect the bitterness of white wine. There are two factors in the experiment: temperature at the time of crushing the grapes (with two levels, ‘cold’ and ‘warm’) and contact of the juice with the skin (with two levels ‘yes’ and ‘no’). For each combination of factors two bottles were rated on their bitterness by a panel of nine judges. The responses of the judges on the bitterness of the wine were taken on a continuous scale in the interval from 0 (‘none’) to 100 (‘intense’) and then they were grouped correspondingly into five ordered categories, 1, 2, 3, 4 and 5.

Table 1. Wine tasting data (Randall, 1989)
Temperature Contact Responses on the following
bitterness scale:
1 2 3 4 5
Cold No 4 9 5 0 0
Cold Yes 1 7 8 2 0
Warm No 0 5 8 3 2
Warm Yes 0 1 5 7 5
Consider the partial proportional odds model (Peterson and Harrell, 1990) with
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0001(1)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0002 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0003 are dummy variables representing the factors temperature and contact respectively, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0004, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0005 and θ are model parameters and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0006 is the cumulative probability for the sth category at the rth combination of levels for temperature and contact. The clm function of the R package ordinal (Christensen, 2012b) is used to maximize the multinomial likelihood that corresponds to model (1). The clm function finds the maximum likelihood estimates up to a specified accuracy, by using a Newton–Raphson iteration for finding the roots of the log‐likelihood derivatives. For the current example, the stopping criterion is set to that the log‐likelihood derivatives are less than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0007 in absolute value. The maximum likelihood estimates, estimated standard errors and the corresponding values for the Z‐statistics for the hypothesis that the respective parameter is 0 are extracted from the software output and shown in Table 2. At those values for the maximum likelihood estimator the maximum absolute log‐likelihood derivative is less than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0008 and the software correctly reports convergence. Nevertheless, an immediate observation is that the absolute values of the estimates and estimated standard errors for the parameters urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0009, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0010 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0011 are very large. Actually, these would diverge to ∞ as the stopping criteria of the iterative fitting procedure used become stricter and the number of iterations allowed increases.
Table 2. Maximum likelihood estimates for the parameters of model (1), the corresponding es timated standard errors (in parentheses) and the values of the Z‐statistic for the hypothesis that the corresponding parameter is 0
Parameter Maximum likelihood Z‐statistic
estimate
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0012 −1.27 (0.51) −2.46
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0013 1.10 (0.44) 2.52
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0014 3.77 (0.80) 4.68
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0015 28.90 (193125.63) 0.00
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0016 25.10 (112072.69) 0.00
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0017 2.15 (0.59) 3.65
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0018 2.87 (0.82) 3.52
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0019 26.55 (193125.63) 0.00
θ 1.47 (0.47) 3.13

For model (1) interest usually is on testing departures from the assumption of proportional odds via the hypothesis urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0020. Using a Wald‐type statistic would be adventurous here because such a statistic explicitly depends on the estimates of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0021, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0022, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0023 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0024. Of course, given that the likelihood is close to its maximal value at the estimates in Table 2, a likelihood ratio test can be used instead; the likelihood ratio test for the particular example has been carried out in Christensen (2012a), section 7..

Furthermore, the current example demonstrates some of the potential dangers that are involved in the application of cumulative link models in general; the behaviour of the individual Z‐statistics—being essentially 0 for the parameters urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0025 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0026 in this example—is quite typical of what happens when estimates diverge to ∞. The values of the Z‐statistics converge to 0 because the estimated standard errors diverge much faster than the estimates, irrespective of whether or not there is evidence against the individual hypotheses. This behaviour is also true for individual hypotheses at values other than 0 and can lead to invalid conclusions if the output is interpreted naively. More importantly, the presence of three infinite standard errors in a non‐orthogonal (in the sense of Cox and Reid (1987)) setting like the current setting may affect the estimates of the standard errors for other parameters in ways that are difficult to predict.

An apparent solution to the issues that are mentioned in example 1 is to use a different estimator that has probability 0 of resulting in estimates on the boundary of the parameter space. For example, for the estimation of the common difference in cumulative logits from ordinal data arranged in a 2×k contingency table with fixed row totals, McCullagh (1980) described the generalized empirical logistic transform. The generalized empirical logistic transform has smaller asymptotic bias than the maximum likelihood estimator and is also guaranteed to give finite estimates of the difference in cumulative logits because it adjusts all cumulative counts by urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0027. However, the applicability of this estimator is limited to the analysis of 2×k tables, and particularly in estimating differences in cumulative logits, with no obvious extension to more general cumulative link models, such as that in example 1.

A family of estimators that can be used for arbitrary cumulative link models and are guaranteed to be finite are ridge estimators. As one of the referees highlighted, if we extend the work in le Cessie and van Houwelingen (1992) for logistic regression to cumulative link models, then the shrinkage properties of the ridge estimator can guarantee its finiteness. Nevertheless, ridge estimators have a series of shortcomings. Firstly, in contrast with the maximum likelihood estimator, the ridge estimators are not generally equivariant under general linear transformations of the parameters for cumulative link models. A reparameterization of the model by rescaling the parameters or taking contrasts of those—which are often interesting transformations in cumulative link models—and a corresponding transformation of the ridge estimator will not generally result in the estimator that the same ridge penalty would produce for the reparameterized model, unless the penalty is also appropriately adjusted. For example, if we wish to test the hypothesis in example 1 by using a Wald test, then an appropriate ridge estimator would be one that penalizes the size of the contrasts of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0028, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0029, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0030 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0031 instead of the size of those parameters themselves. Secondly, the choice of the tuning parameter is usually performed through the use of an optimality criterion and cross‐validation. Hence, the properties of the resultant estimators are sensitive to the choice of the criterion. For example, criteria like the mean‐squared error of predictions, classification error and log‐likelihood that have been discussed in le Cessie and van Houwelingen (1992) will each produce different results, as was also shown in le Cessie and van Houwelingen (1992). Furthermore, the resultant ridge estimator is sensitive to the type of cross‐validation that is used. For example, k‐fold cross‐validation will produce different results for different choices of k. Lastly, standard asymptotic inferential procedures for performing hypothesis tests and constructing confidence intervals cannot be used by simply replacing the maximum likelihood estimator with the ridge estimator in the associated pivots. For these reasons, ridge estimators can only offer an ad hoc solution to the problem.

Several simulation studies on well‐used models for discrete responses have demonstrated that bias reduction via the adjustment of the log‐likelihood derivatives (Firth, 1993) offers a solution to the problems relating to boundary estimates; see, for example, Mehrabi and Matthews (1995) for the estimation of simple complementary log–log‐models, Heinze and Schemper (2002) and Bull et al. (2002), for binomial logistic regression, Kosmidis and Firth (2011) for multinomial logistic regression and Kosmidis (2009) for binomial response generalized linear models.

In the current paper the aforementioned adjustment is derived and evaluated for the estimation of cumulative link models for ordinal responses. It is shown that reduction of bias is equivalent to a parameter‐dependent additive adjustment of the multinomial counts and that such adjustment generalizes well‐known constant adjustments in cases like the estimation of cumulative logits. Then, the reduced bias estimates can be obtained through iterative maximum likelihood fits to the adjusted counts. The form of the parameter‐dependent adjustment is also used to show that, like the maximum likelihood estimator, the reduced bias estimator is invariant to the level of sample aggregation in the data.

Furthermore, it is shown that the reduced bias estimator respects the invariance properties that make cumulative link models an attractive choice for the analysis of ordinal data. The finiteness and shrinkage properties of the estimator proposed are illustrated via detailed complete enumeration and an extensive simulation exercise. In particular, the reduced bias estimator is found to be always finite, and also the reduction of bias in cumulative link models results in the shrinkage of the multinomial model towards a binomial model for the end categories. A thorough discussion on the desirable frequentist properties of the estimator is provided along with an investigation of the performance of associated inferential procedures.

The finiteness of the reduced bias estimator, its optimal frequentist properties and the adequate performance of the associated inferential procedures lead to its proposal for routine use in fitting cumulative link models.

The exposition of the methodology is accompanied by a parallel discussion of the corresponding implications in the application of the models through examples with artificial and real data.

2. Cumulative link models

Suppose observations of n k‐vectors of counts urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0032, on mutually independent multinomial random vectors urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0033, where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0034 and the k multinomial categories are ordered. The multinomial totals for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0035 are urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0036 and the probability for the sth category of the rth multinomial vector is urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0037, with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0038 (r=1, …,n). The cumulative link model links the cumulative probability urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0039 to a p‐vector of covariates urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0040 via the relationship
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0041(2)
where q=k−1 denotes the number of the non‐redundant components of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0042, and where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0043 is a (p+q)‐vector of real‐valued model parameters, with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0044. The function G(·) is a monotone increasing function mapping (−∞,∞) to (0,1), usually chosen to be a known distribution function (like, for example, the logistic, extreme value or standard normal distribution function). Then, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0045 can be considered as cut points on the latent scale that is implied by G.

Special important cases of cumulative link models are the proportional odds model with G(η)= exp (η)/{1+ exp (η)} and the proportional hazards model in discrete time with G(η)=1− exp {− exp (η)} (see McCullagh (1980) for the introduction of and a thorough discussion on cumulative link models).

The cumulative link model can be written in the usual multivariate generalized linear models form by writing the relationship that links the cumulative probability urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0046 to δ as
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0047(3)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0048 is the (s,t)th component of the q×(p+q) matrix
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0049
To be able to identify δ, the matrix Z with row blocks urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0050 is assumed to be of full rank.
Direct differentiation of the multinomial log‐likelihood l(δ) gives that the tth component of the vector of score functions has the form
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0051(4)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0052, with g(η)=dG(η)/dη. If g(·) is log‐concave then urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0053 (t=1, …,p+q) has unique solution the maximum likelihood estimate urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0054 (see Pratt (1981), where it is shown that the log‐concavity of g(·) implies the concavity of l(δ)).

All generalized linear models for binomial responses that include an intercept parameter in the linear predictor are special cases of model (2).

3. Maximum likelihood estimates on the boundary

The maximum likelihood estimates of the parameters of the cumulative link model can be on the boundary of the parameter space with positive probability. Under the log‐concavity of g(·), Haberman (1980) gave conditions that guarantee that the maximum likelihood estimates are not on the boundary (‘exist’ in an alternative terminology). Boundary estimates for these models are estimates of the regression parameters urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0055 with an infinite value, and/or estimates of the cut points urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0056 for which at least a pair of consecutive cut points have equal estimated value.

As far as the regression parameters β are concerned, Agresti (2010), section 3.4.5, gave an accessible account on what data settings result in infinite estimates for the regression parameters, how a fitted model with such estimates can be used for inference and how such problems can be identified from the output of standard statistical software.

As far as boundary values of the cut points α are concerned, Pratt (1981) showed that, with a log‐concave g(·), the cut points urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0057 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0058 have equal estimates if and only if the observed counts for the sth category are 0 (s=1, …,k) for all r ∈ {1, …,n}. If the first or the last category has zero counts then the respective estimates for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0059 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0060 will be −∞ and ∞ respectively, and, if this happens for category s for some s ∈ {2, …,q}, then the estimates for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0061 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0062 will have the same finite value.

4. Bias correction and bias reduction

4.1. Adjusted score functions and first‐order bias

Denote by b(δ) the first term in the asymptotic expansion of the bias of the maximum likelihood estimator in decreasing orders of information, usually sample size. Call b(δ) the first‐order bias term, and let F(δ) denote the expected information matrix for δ. Firth (1993) showed that, if A(δ)=−F(δb(δ), then the solution of the adjusted score equations
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0063(5)
results in an estimator that is free from the first‐order term in the asymptotic expansion of its bias.

4.2. Reduced bias estimator

Kosmidis and Firth (2009) exploited the structure of the bias reducing adjusted score functions in expression (5) in the case of exponential family non‐linear models. Using Kosmidis and Firth (2009), expression (9), for the adjusted score functions in the case of multivariate generalized linear models, and temporarily omitting the argument δ from the quantities that depend on it, the adjustment functions urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0064 in expression (5) have the form
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0065(6)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0066 is the asymptotic variance–covariance matrix of the estimator for the vector of predictor functions urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0067 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0068. Furthermore, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0069 is the urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0070 matrix with sth block the Hessian of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0071 with respect to urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0072 (s=1, …,q), urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0073 is the q×q identity matrix and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0074 is the q×q Jacobian of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0075 with respect to urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0076. A straightforward calculation shows that
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0077
The matrix urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0078 is the incomplete q×q variance–covariance matrix of the multinomial vector urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0079 with (s,u)th component
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0080
Substituting in equation 6, some tedious calculation gives that the adjustment functions urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0081 have the form
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0082(7)
where
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0083(8)
with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0084, and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0085. The quantity urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0086 is the sth diagonal component of the matrix urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0087 (s=1, …,q;r=1, …,n).
Substituting expressions (4) and (7) in expression (5) gives that the tth component of the bias reducing adjusted score vector (t=1, …,q+p) has the form
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0088(9)

The reduced bias estimates urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0089 are such that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0090 for every t ∈ {1=1, …,q+p}. Kosmidis (2007a), chapter 6, shows that, if the maximum likelihood is consistent, then the reduced bias estimator is also consistent. Furthermore, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0091 has the same asymptotic distribution as urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0092, namely a multivariate normal distribution with mean δ and variance–covariance matrix urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0093. Hence, estimated standard errors for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0094 can be obtained as usual by using the square roots of the diagonal elements of the inverse of the Fisher information at urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0095. All inferential procedures that rely on the asymptotic normality of the estimator can directly be adapted to the reduced bias estimator.

4.3. Bias‐corrected estimator

Expression (7) can also be used to evaluate the first‐order bias term as urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0096, where
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0097
If urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0098 is the maximum likelihood estimator then
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0099(10)
is the bias‐corrected estimator which has been studied in Cordeiro and McCullagh (1991) for univariate generalized linear models. The estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0100 can be shown to have no first‐order term in the expansion of its bias (see Efron (1975) for analytic derivation of this result).

4.4. Models for binomial responses

For k=2, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0101 has a binomial distribution with index m and probability urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0102, and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0103. Then model (2) reduces to the univariate generalized linear model
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0104
From equation 9, the adjusted score functions take the form
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0105
Omitting the category index for notational simplicity, a re‐expression of the above equality gives that the adjusted score functions for binomial generalized linear models have the form
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0106(11)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0107 are the working weights and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0108 is the rth diagonal component of the ‘hat’ matrix urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0109, with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0110 and
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0111
This expression agrees with the results in Kosmidis and Firth (2009), section 4.3., where it is shown that for generalized linear models reduction of bias via adjusted score functions is equivalent to replacing the actual count urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0112 with the parameter‐dependent adjusted count urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0113 (r=1, …,n).

5. Implementation

5.1. Maximum likelihood fits on iteratively adjusted counts

When expression (9) is compared with expression (4), it is directly apparent that bias reduction is equivalent to the additive adjustment of the multinomial count urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0114 by the quantity urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0115 (s=1, …,k;r=1, …,n). Noting that these quantities depend on the model parameters in general, this interpretation of bias reduction can be exploited to set up an iterative scheme with a stationary point at the reduced bias estimates: at each step,
  1. evaluate urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0116 at the current value of δ (s=1, …,q;r=1, …,n), and
  2. fit the original model to the adjusted counts by using some standard maximum likelihood routine.

However, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0117 can take negative values which in turn may result in fitting the model on negative counts in step (b) above. In principle this is possible but then the log‐concavity of g(·) does not necessarily imply concavity of the log‐likelihood function and problems may arise when performing the maximization in step (b) (see, for example, Pratt (1981), where the transition from the log‐concavity of g(·) to the concavity of the likelihood requires that the latter is a weighted sum with non‐negative weights). That is the reason why many published maximum likelihood fitting routines will fail if supplied with negative counts.

The issue can be remedied through a simple calculation. Temporarily omitting the index r, let urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0118 (s=1, …,k). Then the kernel urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0119 in expression (9) can be re‐expressed as
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0120
where I(E)=1 if E holds and I(E)=0 otherwise. Note that
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0121
uniformly in δ. Hence, if step (a) in the above procedure adjusts urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0122 by urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0123 evaluated at the current value of δ, then the possibility of issues relating to negative adjusted counts in step (b) is eliminated, and the resultant iterative procedure still has a stationary point at the reduced bias estimates.

5.2. Iterative bias correction

Another way to obtain the reduced bias estimates is via the iterative bias correction procedure of Kosmidis and Firth (2010); if the current value of the estimates is urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0124 then the next candidate value is calculated as
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0125(12)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0126, i.e. urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0127 is the next candidate value for the maximum likelihood estimator obtained through a single Fisher scoring step, starting from urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0128.

Iteration (12) generally requires more effort in implementation than the iteration that was described in the Section 5.1.. Nevertheless, if the starting value urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0129 is chosen to be the maximum likelihood estimates then the first step of the procedure in expression (12) will result in the bias‐corrected estimates defined in expression (10).

6. Additive adjustment of the multinomial counts

6.1. Estimation of cumulative logits

For the estimation of the cumulative logits urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0130 (s=1, …,q) from a single multinomial observation urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0131 the maximum likelihood estimator of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0132 (s=1, …,q) is urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0133, where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0134 is the sth cumulative count. The Fisher information for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0135 is the matrix of quadratic weights urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0136. The matrix W is symmetric and tridiagonal with non‐zero components
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0137
with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0138 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0139. By use of the recursion formulae in Usmani (1994) for the inversion of a tridiagonal matrix, the sth diagonal component of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0140 is urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0141. Hence, using expression (8) and noting that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0142 for the logistic link, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0143 (s=1, …,q). Substituting in equation 9 yields that reduction of bias is equivalent to adding urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0144 to the counts for the first and the last category and leaving the rest of the counts unchanged.

The above adjustment scheme reproduces the empirical logistic transforms urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0145, which are always finite and have smaller asymptotic bias than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0146 (see Cox and Snell (1989), section 2.1.6, under the fact that the marginal distribution of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0147 given urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0148 is binomial with index m and probability urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0149 for any s ∈ {1, …,q}).

6.2. A note of caution for constant adjustments in general settings

Since the works of Haldane (1955) and Anscombe (1956) concerning the additive modification of the binomial count by urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0150 for reducing the bias and guaranteeing finiteness in the problem of log‐odds estimation, the addition of small constants to counts when the data are sparse has become a standard practice for avoiding estimates on the boundary of the parameter space of categorical response models (see, for example, Hitchcock (1962), Gart and Zweifel (1967), Gart et al. (1985) and Clogg et al. (1991)). Especially in cumulative link models where g(·) is log‐concave, if all the counts are positive then the maximum likelihood estimates cannot be on the boundary of the parameter space (Haberman, 1980).

Despite their simplicity, constant adjustment schemes are not recommended for general use for two reasons.
  1. Because the adjustments are constants, the resultant estimators are generally not invariant to different representations of the data (e.g. aggregated and disaggregated view), which is a desirable invariance property that the maximum likelihood estimator has, and which allows the practitioner not to be concerned with whether the data at hand are fully aggregated or not.

    For example, consider the two representations of the same data in Table 3. Interest is in estimating the difference β between logits of cumulative probabilities of the samples with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0151 from the samples with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0152.

    The maximum likelihood estimate of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0153 is ∞. Irrespective of the data representation the maximum likelihood estimate of β is finite and has value −1.944 with estimated standard error 0.895. Now suppose that the same small constant, say urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0154, is added to each of the counts in the rows of the alternatives in Table 3. The adjustment ensures that the parameter estimates are finite for both representations. Nevertheless, a common constant added to both alternatives causes—in some cases large—differences in the resultant inferences for β. For alternative 1 the maximum likelihood estimate of β based on the adjusted data is −1.097 with estimated standard error 0.678, and for alternative 2 the estimate is −1.485 with estimated standard error 0.741. If Wald‐type procedures were used for inferences on β with a normal approximation for the distribution of the approximate pivot urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0155, where S(β) is the asymptotic standard error at β based on the Fisher information, then the p‐value of the test β=0 would be 0.106 if alternative 1 was used and 0.045 if alternative 2 was used.

  2. Furthermore, the moments of the maximum likelihood estimator generally depend on the parameter values (see, for example, Cordeiro and McCullagh (1991) for explicit expressions of the first‐order bias term in the special case of binomial regression models) and thus, as is also amply evident from the studies in Hitchcock (1962) and Gart et al. (1985), there cannot be a universal constant which yields estimates which are optimal according to some frequentist criterion.
Table 3. Two alternative representations of the same artificial data set
x Y
1 2 3 4
Alternative 1
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0156 8 6 1 0
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0157 10 0 1 0
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0158 8 1 0 0
Alternative 2
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0159 8 6 1 0
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0160 18 1 1 0

Both of the above concerns with constant adjustment schemes are dealt with by using the additive adjustment scheme in Section 5.1.. Firstly, by construction, the iteration of Section 5.1. yields estimates which have bias of second order. Secondly, because the adjustments depend on the parameters only through the linear predictors which, in turn, do not depend on the way that the data are represented, the adjustment scheme leads to estimators that are invariant to the data representation. For both representations of the data in Table 3 the bias‐reduced estimate of β is −1.761 with estimated standard error 0.850.

7. Invariance properties of the reduced bias estimator

7.1. Equivariance under linear transformations

The maximum likelihood estimator is exactly equivariant under one‐to‐one transformations ϕ(·) of the parameter δ, i.e. if urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0161 is the maximum likelihood estimator of δ then the maximum likelihood estimator of ϕ(δ) is simply urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0162. In contrast with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0163, the reduced bias estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0164 is not equivariant for all ϕ; bias is a parameterization‐specific quantity and hence any attempt to improve it can violate exact equivariance. Nevertheless, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0165 is equivariant under linear transformations ϕ(δ)=Lδ, where L is a (p+q)×(p+q) matrix of constants such that ZL is of full rank and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0166 has urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0167.

To see that, assume that we fit the multinomial model with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0168 where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0169 (r=1, …,n;s=1, …,q). Because urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0170, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0171 is a linear combination of δ. Using expression (9), the tth component of the adjusted score function for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0172 is
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0173(13)
for t ∈ {1, …,p+q}, where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0174, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0175 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0176 are evaluated at urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0177. Note that all quantities in equation (13) depend on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0178 only through the linear combinations urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0179. Thus, comparing equation 9 with equation 13, if urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0180 is a solution of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0181 (t=1, …,p+q), then urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0182 must be a solution of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0183 (t=1, …,p+q).

The bias‐corrected estimator defined in equation 10 can be shown also to be equivariant under linear transformations, using the equivariance of the maximum likelihood estimator and the fact that b(δ) depends on δ only through the linear predictors.

7.2. Invariance under reversal of the order of categories

One of the properties of proportional odds models, and generally of cumulative link models with a symmetric latent distribution G(·), is their invariance under the reversal of the order of categories; a reversal of the categories along with a simultaneous change of the sign of β and change of sign—and hence order—to urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0184 in model (2) results in the same category probabilities. Given the usual arbitrariness in the definition of ordinal scales in applications this is a desirable invariance property for the analysis of ordinal data.

The maximum likelihood estimator respects this invariance property, i.e. if the categories are reversed then the new fit can be obtained by merely using urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0185 for the regression parameters and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0186 for the cut points.

The reduced bias estimator respects the same invariance property, also. To see this, assume that we fit the multinomial model with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0187 (r=1, …,n;s=1, …,q) with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0188. Because g(·) is symmetric about zero, G(η)=1−G(−η), and so urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0189. This is a reparameterization of model (3) to urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0190 where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0191. Hence, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0192 with
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0193
and, using the results of Section 7.1., urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0194 (and also urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0195).

8. Properties of the reduced bias estimator and associated inferential procedures: a complete‐enumeration study

8.1. Study design

The frequentist properties of the reduced bias estimator are investigated through a complete‐enumeration study of 2×k contingency tables with fixed row totals. The rows of the tables correspond to a two‐level covariate x with values urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0196 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0197, and the columns to the levels of an ordinal response Y with categories 1, …,k. The row totals are fixed to urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0198 for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0199 and to urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0200 for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0201. Alternative 2 in Table 3 is a special case with k=4, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0202, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0203, and row totals urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0204 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0205. The present complete enumeration involves urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0206 tables. We consider a multinomial model with
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0207(14)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0208 are regarded as nuisance parameters but are essential to be estimated from the data, because they allow flexibility in the probability configurations within each of the rows of the table.

For the estimation of β we consider the maximum likelihood estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0209, the bias‐corrected estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0210, the reduced bias estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0211 and the generalized empirical logistic transform urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0212 which is defined in McCullagh (1980), section 2.3, and is an alternative estimator with smaller asymptotic bias than the maximum likelihood estimator specifically engineered for the estimation of β in 2×k tables with fixed row totals. The estimators urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0213, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0214 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0215 are the β‐components of the vectors of estimators urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0216, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0217 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0218 respectively, where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0219 is the vector of all parameters. The estimators are compared in terms of bias, mean‐squared error and coverage probability of the respective Wald‐type asymptotic confidence intervals. The following theorem is specific to 2×k and cumulative link models and can be used to reduce the parameter settings that need to be considered in the current study for evaluating the performance of the estimators.

Theorem 1.Consider a 2×k contingency table T with fixed row totals urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0220 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0221, and the multinomial model that satisfies expression (14). Furthermore, consider an estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0222 of δ, which is equivariant under linear transformations. Then, if urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0223, the bias function and the mean‐squared error of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0224 satisfy

urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0225
and
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0226
respectively.

Proof.Define an operator R which when applied to T results in a new contingency table by reversing the order of the rows of T. Hence, R{R(T)}=T.

Because urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0227 is equivariant under linear transformations, it suffices to study the behaviour of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0228 when urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0229 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0230. Then, any combination of values for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0231 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0232 results by an affine transformation of the vector urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0233, and equivariance gives that a corresponding translation of the vector urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0234 and change of scaling of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0235 results in exactly the same fit. Hence, the shape properties of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0236 remain invariant to the choice of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0237.

Denote with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0238 the set of all possible 2×k tables with fixed row totals urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0239 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0240. By the definition of the model, P(T;β,α)=P{R(T);−β,α} for every urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0241. Because urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0242 there is a subset urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0243 of tables with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0244. The complement of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0245 can be partitioned into the sets urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0246 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0247 which have the same cardinality, and where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0248 if and only if urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0249. For urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0250 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0251, equivariance under the linear transformation ϕ(β)=−β gives that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0252. Then, for any urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0253, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0254. Hence,
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0255(15)
Adding −β to both sides of this equality gives the identity on the bias. For the identity on the mean‐squared error one merely needs to repeat a corresponding calculation to equation 15 starting from
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0256

A similar line of proof can be used to show that if urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0257 the coverage probability of Wald‐type asymptotic confidence intervals for β is symmetric about β=0, provided that the estimator S(T) of the standard error of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0258 satisfies S(T)=S{R(T)}.

8.2. Special case: proportional odds model

For demonstration purposes, the values of the competing estimators are obtained for a proportional odds model (G(η)= exp (η)/{1+ exp (η)}) with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0259 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0260 and k=4, for each of the 400, 3136 and 81796 possible tables with row totals urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0261, for m=3, m=5 and m=10 respectively. All estimators considered are equivariant under linear transformations and hence, according to the proof of theorem 1, the outcome of the complete enumeration for the comparative performance of the estimators generalizes to any choice of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0262.

The estimators urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0263 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0264 are not available in closed form and we need to rely on iterative procedures for finding the roots of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0265 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0266 respectively, for every t ∈ {1,2,3,4}. Fisher scoring is used to obtain urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0267 and the iterative maximum likelihood approach of Section 5.1. is used for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0268. The maximum likelihood estimate is judged satisfactory if the current value urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0269 of the iterative algorithm satisfies urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0270 for every t ∈ {1,2,3,4}. For urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0271, the latter criterion is used with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0272 in place of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0273.

For evaluating the performance of the estimators, the probability of each of the tables has been calculated under model (14), for parameter values that are fixed according to the following scheme. The parameter β takes values on some sufficiently fine equispaced grid in the interval [−6,0]. For β in the interval (0,6] the results can be predicted by the symmetry relations of theorem 1. For each value of β, the nuisance parameters take values urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0274 for e ∈ {1,2,3,5,7}. Fig. 1 is a pictorial representation of the probability settings for the two multinomial vectors in the 2×4 contingency table with fixed row totals, at each combination of values for β and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0275. (The left‐hand side of each plot depicts the multinomial probabilities for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0276 and the right‐hand side the multinomial probabilities for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0277. The eight probabilities (four for each x‐value) for each particular combination of values for β and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0278 are connected with line segments. Hence each piecewise linear function on each plot corresponds to a specific probability setting for the 2×4 contingency table with fixed row totals. The plots correspond to particular settings for the nuisance parameters urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0279 determined by e(−1,0,1), and each plot contains all possible piecewise linear functions for the values of β on an equispaced grid of size 50 in the interval [−6,6].) Under the above scheme for fixing parameter values, the probability of the end categories tends to 0 as e increases, and hence more extreme probability settings are being considered as e grows.

image
Pictorial representation of the probability settings considered in the calculation of expectations from the complete‐enumeration study: (a) e=1; (b) e=2; (c) e=3; (d) e=5; (e) e=7

The findings of the current complete‐enumeration exercise are outlined in the following subsection. The same complete‐enumeration design has been applied to various settings, with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0280, with different link functions, with different numbers of categories and/or for different non‐symmetric specifications for the nuisance parameters (the results are not shown here) yielding qualitatively the same conclusions; the current set‐up merely allows a clear pictorial representation of the findings on the behaviour of the reduced bias estimator. An R script that can produce the results of the current complete enumeration for any number of categories, any link function, any configuration of totals and any combination of parameter settings in 2×k contingency tables is available in the on‐line supplementary material.

8.3. Remarks on the results

Remark 1. (on the estimates of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0281 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0282.)According to Section 3., for data sets where a specific category s ∈ {1,2,3,4} is observed for neither urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0283 nor urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0284, the maximum likelihood estimate of α is on the boundary of the parameter space as follows:

urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0285

At least for log‐concave g(·), according to the results in Pratt (1981), these equations extend directly to the case of any number of categories and number of covariate settings and can directly be used to check what happens when two or more categories are unobserved.

Nevertheless, the maximum likelihood estimator of β is invariant to merging a non‐observed category with either the previous or next category and can be finite even if some of the α‐parameters are on the boundary of the parameter space. Hence, maximum likelihood inferences on β are possible even if a category is not observed. The same behaviour is observed for the reduced bias estimators of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0286 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0287 with the difference that, if the non‐observed category is s=1 and/or s=4, then urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0288 and/or urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0289 are finite. A special case of this observation has been encountered in Section 6.1. where reduction of the bias corresponds to adding urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0290 to the end categories, guaranteeing the finiteness of the cumulative logits. Hence, there is no need for non‐observed end categories to be merged with the neighbouring categories when the reduced bias estimator is used. If any of the other categories is empty, then the reduced bias estimator of β is invariant to merging those with any of the neighbouring categories.

It should be mentioned here that if both the second and the third category are empty then the reduced bias estimate of β and the generalized empirical logistic transform are identical. To see that, note that, in the special case of logistic regression, the adjusted scores in Section 4.4. suggest adding half a leverage to each of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0291 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0292 (this result for logistic regressions was obtained in Firth (1993)). Furthermore, the model with q=1 is saturated and hence both leverages are 1. Hence the reduced bias estimate of β coincides with the generalized empirical logistic transform, which for k=2 is urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0293.

Remark 2.(on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0294 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0295). As is expected from the discussion in Section 3., the maximum likelihood estimator of β is infinite for certain configurations of 0s in the table, and for such configurations the bias‐corrected estimator is also undefined owing to its explicit dependence on the maximum likelihood estimator. Hence, for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0296 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0297, the bias function is undefined and the mean‐squared error is infinite. A possible comparison of the performance of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0298 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0299 is in terms of conditional bias and conditional mean‐squared error where the conditioning event is that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0300 has a finite value.

For detecting parameters with infinite values the diagnostics in Lesaffre and Albert (1989), section 4., for multinomial logistic regressions are adapted to the current setting. Data sets that result in infinite estimates for β have been detected by observation of the size of the corresponding estimated standard error based on the inverse of the Fisher information, and by observation of the absolute value of the estimates when the convergence criteria were satisfied. If the standard error was greater than 200 and the estimate was greater than 100, then the estimate was labelled infinite. A second pass through the data sets has been performed, making the convergence criterion for the Fisher scoring stricter than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0301. The estimates that were labelled infinite by using the aforementioned diagnostics further diverged towards ∞ whereas the rest of the estimates remained unchanged to high accuracy.

The probability of encountering an infinite urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0302 for the different possible parameter settings is shown in Fig. 2(a). For β ∈ (0,6) the probability of encountering an infinite value is simply a reflection of the probability in (−6,0) across β=0. As is apparent the probability of infinite estimates increases as e increases and for each value of e it increases as |β| increases. As is natural as m increases, the probability of encountering infinite estimates is reduced but is always positive.

image
(a) Probability of infinite estimates, (b) conditional biases and (c) conditional mean‐squared errors of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0303 (image) and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0304 (image) for the parameter settings that were considered in the complete‐enumeration study

Of course, the findings from the current comparison of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0305 with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0306 should be interpreted critically, bearing in mind the conditioning on the finiteness of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0307; the comparison suffers from the fact that the first‐order bias term that is required for the calculation of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0308 is calculated unconditionally. The comparison is fairer when the probability of infinite estimates is small; this happens on a region around zero whose size also increases as m increases.

The conditional bias and conditional mean‐squared error of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0309 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0310 are shown respectively in Fig. 2(b) and Fig. 2(c). The identities in theorem 1 apply also to the conditional and conditional mean‐squared error; to see this set P to be the conditional probability of each table in the proof of theorem 1. Hence, for β ∈ (0,6), the conditional bias is simply a reflection of the conditional bias for β ∈ (−6,0) across the urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0311 line, and the conditional mean‐squared error is a reflection of the conditional mean‐squared error for β ∈ (−6,0) across β=0.

The behaviour of the estimators in terms of conditional bias is similar, with the maximum likelihood estimator performing slightly better than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0312 for small m. As m increases the bias‐corrected estimator starts to perform better in terms of bias in a region around zero, where the probability of infinite estimates is smallest. The same is noted for the conditional mean‐squared error. The estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0313 performs better than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0314 in a region around zero, whose size increases as m increases. The same behaviour as for e=7 persists for larger values of e (the figures are not shown here).

Remark 3.(on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0315 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0316). The estimators urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0317 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0318 always have finite value irrespective of the configuration of 0s in the table. Hence, in contrast with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0319 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0320, a comparison in terms of their unconditional bias and unconditional mean‐squared error is possible. Fig. 3(a) shows the bias function of the estimator for the parameter settings that were considered in the complete‐enumeration study. For β ∈ (0,6), the bias function is simply a reflection of the bias for β ∈ (−6,0) across the urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0321 line, and the mean‐squared error is a reflection of the mean‐squared error for β ∈ (−6,0) across β=0.

image
(a) Biases and (b) mean‐squared errors of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0322 (image) and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0323 (image) for the parameter settings that were considered in the complete‐enumeration study

The reduced bias estimator performs better than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0324 in terms of bias for small values of e and the differences in the bias functions diminish as e increases. A similar limiting behaviour holds for their mean‐squared errors, though, for small values of e, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0325 performs slightly better than urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0326 in terms of mean‐squared error in the range (−4,4) and worse outside that range. The mean‐squared error of both estimators converges to 0 as m increases, which is what is expected from consistent estimators (see Kosmidis (2007a), section 6.3, for a proof of the consistency of the reduced bias estimator).

Remark 4. (on the coverage of 95% Wald confidence intervals.)For a table T and an estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0327, consider the nominally 100(1−a)% Wald‐type confidence interval for β

urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0328
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0329 is the 100ath quantile of a standard normal distribution and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0330 is the estimator of the standard error of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0331. For urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0332, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0333 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0334, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0335 is taken to be the square root of the diagonal element of the inverse of the Fisher information corresponding to β, evaluated at urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0336, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0337 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0338 respectively. For the estimation of the standard error for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0339, the variance formula that was given in McCullagh (1980), section 2.3, is used. If the maximum likelihood estimate is infinite then we make the convention that the confidence intervals based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0340 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0341 are (−∞,∞). Fig. 4 shows the coverage probabilities of the four competing intervals for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0342 with e ∈ {1,2,3,5,7}, and for β ∈ [−10,0]. The coverage probability for β ∈ (0,10) is simply a reflection of the coverage probability for β ∈ (−10,0) across β=0.

image
Coverage probabilities of nominally 95% asymptotic Wald‐type confidence intervals for β based on (a) urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0343 (image) and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0344 (image) and (b) urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0345 (image) and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0346 (image) and the respective standard errors, for β ∈ [−10,0) and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0347 for e ∈ {1,2,3,5,7}

Wald‐type confidence intervals based on the maximum likelihood estimator demonstrate a conservative behaviour in terms of coverage, and the coverage probability converges to 1 as |β|→∞. Furthermore, the coverage probability seems to approach uniformly the nominal level as m increases. The intervals based on the bias‐corrected estimator also demonstrate conservative behaviour in a neighbourhood around β=0, then tend to undercover for an interval of large |β|‐values and, as for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0348, when |β|→∞ the coverage probability tends to 1.

A more dramatic undercoverage is present for confidence intervals that are based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0349 when |β| is large. Actually after some value of |β| the confidence intervals that are based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0350 completely lose coverage (the full range of the coverage probability is not shown here). In contrast, those intervals behave satisfactorily around β=0. This behaviour relates to the fact that the variance estimator for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0351 is obtained under the assumption that β=0 and can seriously underestimate the variance of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0352 when |β| is larger than about 1 (the same observation was also made in McCullagh (1980), section 2.3). Furthermore, it is worth noting that the point where coverage is lost completely moves closer to zero as m increases. Hence, use of Wald‐type confidence intervals that are based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0353 is not recommended in practical applications.

Apart from being conservative, confidence intervals based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0354 seem to behave better for a wider range of β around zero, but also completely lose coverage after some value of |β|. The complete loss of coverage for large effects is due to an interplay of discreteness of the response and the fact that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0355 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0356 take always finite values. Specifically, for any finite m there is only a finite number of possible Wald‐type confidence intervals because the response is multinomially distributed, and any of those confidence intervals has finite end points. Therefore, there will always be a sufficiently large value of |β| which is not contained in any of the confidence intervals, resulting in a complete loss of coverage. Nevertheless, in contrast with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0357, the coverage properties of the Wald‐type confidence intervals based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0358 improve quickly and the value where coverage is lost moves quickly away from 0 as m increases. This is because the cardinality of the set of the possible confidence intervals increases and the approximation of the necessarily discrete distribution of the reduced bias estimator by a normal distribution with variance the inverse of the Fisher information becomes more accurate. This results in the increasing accuracy of the approximation of the distribution of the Wald pivot by a normal distribution.

As the current study demonstrates, the Wald‐type confidence intervals based on any of the estimators do not behave satisfactorily for the whole range of β and for small sample sizes. For this reason current research focuses on alternative confidence intervals that can have one infinite end point (see Section 12.). Until conclusive results are produced, Wald‐type confidence intervals based on the reduced bias estimator can still be used in practice as asymptotically correct, bearing in mind that they will be generally slightly conservative for moderate effects (like those based on the maximum likelihood estimator) especially in small samples, and also that their coverage properties will deteriorate for extremely large effects.

9. Shrinkage towards a binomial model for the end categories

Table 4 shows the maximum likelihood estimates, the reduced bias estimates and the corresponding estimated standard errors from fitting a proportional odds model and a proportional hazards model of the form (14) to the artificial data that were considered in the example in Section 6.2..

Table 4. Parameter estimates and corresponding estimated standard errors (in parentheses) from fitting a proportional odds model and a proportional hazards model of the form (14) to the artificial data considered in Table 3 in Section 6.2., using maximum likelihood and bias reduction
Model Parameter Maximum Bias reduction
likelihood
Proportional odds β −1.944 (0.895) −1.761 (0.850)
(G(η)= exp (η)/{1+ exp (η)}) urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0359 1.187 (0.449) 1.084 (0.428)
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0360 3.096 (0.787) 2.781 (0.701)
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0361 ∞ (∞) 4.457 (1.440)
Proportional hazards β −0.689 (0.401) −0.635 (0.389)
(G(η)=1− exp {− exp (η)}) urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0362 0.313 (0.220) 0.297 (0.219)
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0363 1.097 (0.260) 1.013 (0.246)
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0364 ∞ (∞) 1.518 (0.357)

There is apparent shrinkage of the reduced bias estimates towards 0, which implies a shrinkage of the cumulative probabilities towards G(0). This implies a shrinkage of the probabilities for the first and the last category of the ordinal scale towards G(0) and 1−G(0) respectively, and a corresponding shrinkage of the probabilities of the intermediate categories towards 0.

To investigate further the apparent shrinkage effect, the maximum likelihood and reduced bias estimates of proportional odds and proportional hazards models of the form (14) are obtained for every possible 2×6 table with row totals urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0365. This setting is chosen because it is one that results in sparse tables, allowing the construction of plots of fitted probabilities that are not massively overcrowded (under this setting there are 3136 tables to be estimated).

For each category of the ordinal response, Fig. 5 shows the fitted probabilities based on the reduced bias estimator against the fitted probabilities based on the maximum likelihood estimator. The grey areas are where the points would all be expected to lie if the shrinkage relationships were strictly satisfied for each pair of fitted probabilities. Clearly this is not so.

image
Fitted probabilities based on the reduced bias estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0366 against the fitted probabilities based on the maximum likelihood estimator urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0367, for each category of the response ((a)–(f) correspond to the proportional odds model and (g)–(l) correspond to the proportional hazards model;  image, areas where the points would be expected to lie if shrinkage was strict): (a), (g) category 1; (b), (h) category 2; (c), (i) category 3; (d), (j) category 4; (e), (k) category 5; (f), (l) category 6

The points on the plots for the first category roughly lie slightly above the urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0368 line for fitted values less than G(0), and slightly below it for fitted values greater than G(0). The points for the last category exhibit similar behaviour but with G(0) replaced by 1−G(0). The shrinkage effect appears to be stronger the further the probability is from the shrinkage points G(0) and 1−G(0).

The points on the plots for the intermediate categories lie mostly under the urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0369 line, except in cases where the maximum‐likelihood‐fitted probability is very close to 0. Hence, the fitted probabilities for the intermediate categories based on the reduced bias estimator tend to shrink towards 0. The plots also suggest that the further the probability is from 0 the stronger is the shrinkage effect.

The shrinkage properties that are observed here are a direct generalization of the shrinkage that is implied by improving bias in the estimation of binomial logistic regression models (Copas, 1988; Cordeiro and McCullagh, 1991; Firth, 1992) to links other than the logistic and to models with ordinal responses.

Corresponding empirical investigations of shrinkage based on both complete enumerations and simulations under models fitted to real data have also been performed but are not shown here. The results are qualitatively the same: reduction of bias in cumulative link models shrinks the multinomial model towards a binomial model that has probability G(0) for the first category and probability 1−G(0) for the last category.

10. Simulation study

To illustrate further the properties of the reduced bias estimator in more complex scenarios than that in the complete‐enumeration study of Section 8., a simulation study was set up based on part of the data that have been analysed in Jackman (2004). The data are publicly available through the R package pscl (Jackman, 2012) and seem to agree with the data that are available for rater F1 in the analysis in Jackman (2004). The data contain the score of rater F1 for 106 applications to the political science doctoral programme at Stanford University along with corresponding applicant‐specific observations. The rater's score is on a five‐point integer‐valued ordinal scale from 1 to 5, with 1 indicating the lowest rating and 5 indicating the highest rating. Consider that the cumulative log‐odds for rating s for the rth candidate is modelled as
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0370(16)
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0371 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0372 are the rth applicant's scores on the quantitative and verbal section of the graduate record examinations respectively (after subtracting the respective mean and dividing by the respective standard deviation), urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0373 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0374 are dummy variables indicating whether the rth applicant has an interest in American politics and political theory respectively (with 1 representing a positive and 0 a negative reply), and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0375 is the gender of the rth applicant (r=1, …,106). The parameters urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0376 are the cut points and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0377 describe the effect of the corresponding applicant‐specific covariates on the cumulative log‐odds.

Model (16) was fitted by using maximum likelihood and the maximum likelihood estimates of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0378 are 1.993, 0.892, 2.816, 0.009 and 1.215 respectively, indicating that an increase in the value of any of the covariates is associated with higher probability for high ratings holding all else in the model fixed. Then an extensive simulation under the maximum likelihood fit is performed for estimating the biases, mean‐squared errors and coverage probabilities of Wald‐type 95% confidence intervals for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0379 when maximum likelihood, bias correction and bias reduction are used. There have been instances of simulated data sets where one or more rating categories were empty. In those cases, empty categories were merged with neighbouring categories according to the discussion in remark 1 of Section 8.. The results are shown in Table 5. There was only one data set for which the maximum likelihood estimate of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0380 was ∞. This data set was excluded when estimating the bias, mean‐squared error and coverage probability for the maximum likelihood and the bias‐corrected estimator and hence the corresponding figures in Table 5 estimate the conditional respective quantities (i.e. given that the maximum likelihood estimator has finite value). In contrast, the reduced bias estimates were finite for all data sets and hence the corresponding figures are estimates of the targeted unconditional quantities. In this particular setting, the probability of the conditioning event is small and a direct comparison of the estimated conditional and unconditional quantities can be informative.

Table 5. Estimated biases, mean‐squared errors MSE and coverage probabilities of 95% Wald‐type confidence intervals from a simulation of size urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0381 under the maximum likelihood fit of model (16) †The last column shows the estimated relative increase in the mean‐squared error from its absolute minimum (the variance) due to bias. The relative increase of the mean‐squared error is the square of the bias divided by the variance. The estimated simulation error is less than 0.004 for the bias and the MSE‐estimates and less than 0.001 for the coverage estimates.
Method Parameter Bias MSE Coverage biasurn:x-wiley:13697412:media:rssb12025:rssb12025-math-0382/variance
(%)
Maximum urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0383 0.132 0.142 0.943 13.928
likelihood urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0384 0.055 0.062 0.943 5.203
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0385 0.208 0.722 0.947 6.347
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0386 0.004 0.630 0.944 0.003
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0387 0.077 0.238 0.944 2.569
Bias correction urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0388 −0.001 0.106 0.948 0.002
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0389 0.001 0.051 0.953 0.001
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0390 −0.004 0.577 0.954 0.002
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0391 0.003 0.551 0.956 0.001
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0392 0.001 0.205 0.954 0.000
Bias reduction urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0393 0.002 0.107 0.949 0.002
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0394 0.002 0.051 0.953 0.007
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0395 0.002 0.579 0.954 0.001
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0396 0.004 0.553 0.956 0.003
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0397 0.003 0.205 0.954 0.003
  • †The last column shows the estimated relative increase in the mean‐squared error from its absolute minimum (the variance) due to bias. The relative increase of the mean‐squared error is the square of the bias divided by the variance. The estimated simulation error is less than 0.004 for the bias and the MSE‐estimates and less than 0.001 for the coverage estimates.

Temporarily ignoring the fact that the maximum likelihood estimator can be infinite, both the bias‐corrected and the reduced bias estimators perform equally well in the current study. Furthermore, the figures in Table 5 demonstrate a significant reduction in terms of both bias and mean‐squared error when either bias correction or bias reduction is used. In the current study the effect of estimation bias is quite significant; the mean‐squared errors of the components of the maximum likelihood estimator are inflated by as much as 13.9% due to bias from their minimum values (the variances). The corresponding inflation factors for the bias‐corrected and reduced bias estimators are quite close to 0, which when combined with the observed reduction in mean‐squared error illustrates the benefits that the reduction of bias can have in the estimation of such models. Lastly, a slight improvement in the coverage properties of Wald‐type confidence intervals is observed when the bias is corrected.

Overall and taking into account that the reduced bias estimator is always finite the current study illustrates its superior frequentist properties from the alternatives.

11. Wine tasting data

The partial proportional odds model of example 1 is here refitted by using the reduced bias estimator. The result is shown in Table 6. All estimates and estimated standard errors are finite. A Wald statistic for testing departures from the assumption of proportional odds via departures from the hypothesis urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0398 is
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0399
where
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0400
is a matrix of contrasts of β. The matrix
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0401
is the inverse of the variance–covariance matrix of the asymptotic distribution of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0402, where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0403 is the β‐block of the inverse of the Fisher information. By the asymptotic normality of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0404, W has an asymptotic urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0405‐distribution with 3 degrees of freedom. The value of W for the data in Table 1 is 0.7502, leading to a p‐value of 0.861, which provides no evidence against the proportional odds assumption. This is also apparent from Table 6 where the reduced bias estimates of urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0406 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0407 are comparable in value.
Table 6. Reduced bias estimates for the parameters of model (1), corresponding estimated standard errors (in parentheses) and the values of the Z‐statistic for the hypothesis that the corresponding parameter is 0
Parameter Reduced bias Z‐statistic
estimate
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0408 −1.19 (0.50) −2.40
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0409 1.06 (0.44) 2.42
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0410 3.50 (0.74) 4.73
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0411 5.20 (1.47) 3.52
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0412 2.62 (1.52) 1.72
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0413 2.05 (0.58) 3.54
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0414 2.65 (0.75) 3.51
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0415 2.96 (1.50) 1.98
θ 1.40 (0.46) 3.02

It is worth noting that, in contrast with the output reported in example 1, the values of the Z‐statistics for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0416, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0417 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0418 are far from being exactly 0.

12. Concluding remarks and further work

On the basis of the results of the complete‐enumeration study, urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0419 appears to be always finite in contrast with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0420 and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0421, and also to have comparable behaviour with urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0422 in terms of bias and mean‐squared error. Furthermore, Wald‐type asymptotic confidence intervals based on urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0423 behave satisfactorily, maintaining good coverage properties for a wide range of β‐values. A complete loss of coverage is still present but the point where this happens is far from zero and diverges as the number of observations increases. The application of the current complete‐enumeration set‐up for complementary log–log‐ and probit link functions, for varying values of the row totals, and for different numbers of categories, resulted in qualitatively the same conclusions.

In remark 1 of the complete‐enumeration study, the finiteness of the reduced bias estimates for urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0424 and/or urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0425 was noted even in cases where the first and/or last category of the ordinal variable is not observed. This behaviour was encountered also in the simulation study of Section 10. and in all of the many settings where the reduced bias estimator has been applied and is defensible from an experimental point of view. When the experimenter sets an ordinal scale, the end categories of that scale largely determine the possible responses. Hence, one might argue that the end categories should play a bigger role than the intermediate categories in the analysis, and a good estimation method should not be as democratic as maximum likelihood is in this respect; accepting that the ordinal scale is well defined, if an end category is not observed then it seems more appropriate to inflate its probability of occurrence slightly, instead of setting it to 0 as the maximum likelihood estimator would do.

The latter point of view does not only apply to non‐observed end categories. It applies to all analyses of ordinal data through cumulative link models and is reinforced by the fact that an improvement in the frequentist properties of the maximum likelihood estimator resulted in the shrinkage of the cumulative link model towards a binomial model for the end categories.

These observations, along with the fact that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0426 respects the invariance properties of the cumulative link model and can be easily obtained by using the procedures in Section 5., provide a strong case for its routine use in the estimation of cumulative link models.

Laara and Matthews (1985) demonstrated the equivalence of continuation ratio models with complementary log–log‐link and proportional hazards models in discrete time. Hence, the reduced bias estimates for the regression parameters of the former can be obtained by using the results in the current paper for the latter.

The investigation of confidence intervals that maintain good properties without suffering from a complete loss of coverage for extreme effects is the subject of future work. Current research focuses on the use of profiles of the asymptotic pivot urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0427 which can be shown to have an asymptotic urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0428‐distribution, and the combination of the resultant intervals with the profile likelihood intervals. In this way confidence intervals with one infinite end point are possible and are suggested to accompany the reduced bias estimator which appears always to take a finite value. Such intervals seem to reflect uncertainty better when extreme settings are considered and lead to improved coverage properties without loss of coverage.

As is done in Section 11., a comparison of nested models can be performed by using an asymptotic Wald‐type test based on the reduced bias estimator. Another option is the use of the adjusted score statistic
urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0429
where urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0430 are the estimates under the hypothesis that results in the smaller model, and urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0431 and F(δ) are the vector of adjusted score functions and the Fisher information of the larger model respectively. The fact that urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0432 where A(δ)=O(1) guarantees an asymptotic urn:x-wiley:13697412:media:rssb12025:rssb12025-math-0433‐distribution for that statistic. For the example in Section 11. the value of the adjusted score statistic is 0.9357 on 3 degrees of freedom, giving a p‐value of 0.8168 which leads to qualitatively the same conclusion as that of the Wald test. When testing departures from the proportional odds assumption in general, the adjusted score statistic has the same disadvantage as the ordinary score statistic; the Fisher information matrix for the partial proportional odds model can be non‐invertible when evaluated at the estimates of the corresponding proportional odds model.

13. Supplementary material

The accompanying on‐line supplementary material includes an R script (R Development Core Team, 2012) that can be used to produce the results of the complete‐enumeration study for any number of categories, any link function and any configuration of row totals in contingency tables. The current version of the R function bpolr is also included. The bpolr function fits cumulative link models and their extensions with dispersion effects either by maximum likelihood, or bias reduction or bias correction. An updated version of the function will be part of the next major release of the R package brglm (Kosmidis, 2007b). Scripts that reproduce the data analyses that were undertaken in the paper are also available in the on‐line supplementary material.

Acknowledgements

The author is grateful to two referees and the Associate Editor whose comments have significantly improved the current work. Furthermore, the author is grateful to David Firth for the helpful and stimulating discussions on this work, and to Cristiano Varin and Thomas W. Yee for their comments on this paper.

Part of this work was completed between September 2007 and September 2010 when the author was a Research Fellow at the Centre for Research in Statistical Methodology at the University of Warwick, and between January 2012 and July 2012 when the author was Senior Research Fellow at the Department of Statistics of the University of Warwick. The support of the Engineering and Physical Sciences Research Council is gratefully acknowledged for funding both positions.

      Number of times cited according to CrossRef: 16

      • Use of otolith elemental signatures to unravel lifetime movement patterns of Atlantic spadefish, Chaetodipterus faber, in the Southwest Atlantic Ocean, Journal of Sea Research, 10.1016/j.seares.2020.101873, 158, (101873), (2020).
      • Factors Associated With Surgical Mortality and Complications Among Patients With and Without Coronavirus Disease 2019 (COVID-19) in Italy, JAMA Surgery, 10.1001/jamasurg.2020.2713, (2020).
      • One country two systems: a comparative study of national identity between Honk Kong and Macau, Chinese Sociological Review, 10.1080/21620555.2020.1815004, (1-24), (2020).
      • Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models, Biometrika, 10.1093/biomet/asaa052, (2020).
      • Location-adjusted Wald statistics for scalar parameters, Computational Statistics & Data Analysis, 10.1016/j.csda.2019.04.004, (2019).
      • Estimation of log‐odds ratio from group testing data using Firth correction, Biometrical Journal, 10.1002/bimj.201800125, 61, 3, (714-728), (2019).
      • Mean and median bias reduction in generalized linear models, Statistics and Computing, 10.1007/s11222-019-09860-6, (2019).
      • On bias reduction and incidental parameters, Biometrika, 10.1093/biomet/asx079, 105, 1, (233-238), (2018).
      • Age, growth, and reproductive biology of the Atlantic spadefish Chaetodipterus faber in southern Brazil, Ichthyological Research, 10.1007/s10228-018-0663-2, (2018).
      • A Generalized Logistic Link Function for Cumulative Link Models in Ordinal Regression, Neural Processing Letters, 10.1007/s11063-017-9589-3, 46, 1, (251-269), (2017).
      • Inferential tools in penalized logistic regression for small and sparse data: A comparative study, Statistical Methods in Medical Research, 10.1177/0962280216661213, 27, 5, (1365-1375), (2016).
      • Semi-parametric bivariate polychotomous ordinal regression, Statistics and Computing, 10.1007/s11222-015-9622-1, 27, 1, (283-299), (2015).
      • Tree-based varying coefficient regression for longitudinal ordinal responses, Computational Statistics & Data Analysis, 10.1016/j.csda.2015.01.003, 86, (65-80), (2015).
      • VGAMs, Vector Generalized Linear and Additive Models, 10.1007/978-1-4939-2818-7_4, (127-166), (2015).
      • Categorical Data Analysis, Vector Generalized Linear and Additive Models, 10.1007/978-1-4939-2818-7_14, (385-414), (2015).
      • Maximum Likelihood Difference Scaling versus Ordinal Difference Scaling of emotion intensity: a comparison, Quality & Quantity, 10.1007/s11135-014-0100-1, 49, 5, (2169-2185), (2014).