Specification and testing of hierarchical ordered response models with anchoring vignettes

Collection and analysis of self‐reported information on an ordered Likert scale is ubiquitous across the social sciences. Inference from such analyses is valid only if the response scale employed means the same thing to all individuals, that is, if there is no differential item functioning (DIF) present in the data. A priori this is unlikely to hold across all individuals and cohorts in any sample of data. For this reason, anchoring vignettes have been proposed as a way to correct for DIF when individuals self‐assess their health (or well‐being, or satisfaction levels, or disability levels, etc.) on an ordered categorical scale. Using an example of self‐assessed pain, we illustrate the use of vignettes to adjust for DIF using the compound hierarchical ordered probit model (CHOPIT). The validity of this approach relies on the two underlying assumptions of response consistency (RC) and vignette equivalence (VE). Using a minor amendment to the specification of the standard CHOPIT model, we develop easy‐to‐implement score tests of the null hypotheses of RC and VE, both separately and jointly. Monte Carlo simulations show that the tests have good size and power properties in finite samples. We illustrate the use of the tests by applying them to our empirical example. The tests should aid more robust analyses of self‐reported survey outcomes collected alongside anchoring vignettes.


| INTRODUCTION
It is common in social surveys to use subjective categorical scales to elicit information in the form of self-reports; for example, levels of health, work disability or subjective well-being. Responses to such questions are often used to study differences across countries or social or demographic groups. A problem with relying on subjective responses is that individuals are likely to place different interpretations on the response scale. Information on health status might, for example, be obtained using the question: Overall, how would you rate your health? Respondents are asked to tick one of (typically) five boxes ranging from very bad through good to excellent. Variation in responses will be due, in part, to genuine health differences, but may also be due to respondents applying different meanings to the available response categories. This type of reporting behaviour is commonly referred to as differential item functioning, or DIF (Holland & Weiner, 1993; Murray et al., 2002). Figure 1 illustrates DIF using an example of a self-reported question about pain. Assume we have two respondents who are asked the question 'Overall in the last 30 days, how much of bodily aches or pains did you have?' and are instructed to respond by selecting one of the following: 'None', 'Mild', 'Moderate', 'Severe' or 'Extreme'. In the diagram, the vertical line represents the underlying latent scale for pain. DIF is depicted by the different locations of the individual-specific boundary parameters along the latent scale, μ_0 to μ_3. Although respondents have identical levels of latent pain (indicated by the bold arrows), respondent B reports mild pain, while respondent A reports no pain. Without knowing the locations of the boundary parameters, researchers would typically conclude that B has worse pain than A.
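The reporting mechanism in Figure 1 can be sketched in a few lines of code; the boundary locations below are hypothetical values chosen only to reproduce the A-versus-B pattern described above.

```python
# Two respondents with identical latent pain but different boundary locations
# (mu_0..mu_3) report different categories.
CATEGORIES = ["None", "Mild", "Moderate", "Severe", "Extreme"]

def report(latent_pain, boundaries):
    """Map a latent pain level to an observed category, given ordered boundaries."""
    for j, mu in enumerate(boundaries):
        if latent_pain <= mu:
            return CATEGORIES[j]
    return CATEGORIES[-1]

boundaries_A = [1.0, 2.0, 3.0, 4.0]   # respondent A: higher thresholds (hypothetical)
boundaries_B = [0.2, 1.2, 2.5, 3.8]   # respondent B: lower thresholds (hypothetical)

latent_pain = 0.8                      # identical latent pain for A and B
print(report(latent_pain, boundaries_A))  # None
print(report(latent_pain, boundaries_B))  # Mild
```

With the same latent pain, A reports 'None' and B reports 'Mild': comparing their raw responses alone would wrongly suggest B is in worse health.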
A number of approaches have been proposed to test for DIF. In the educational literature, where DIF refers to test questions (items) on which individuals with the same underlying ability have differing probabilities of answering correctly, popular approaches include the Mantel-Haenszel procedure, item response theory and logistic regression-based methods (Holland & Thayer, 1988; Shepard et al., 1981; Swaminathan & Rogers, 1990). The basic idea is to compare the probability of answering a question correctly across different groups of individuals, while conditioning on underlying ability. A well-known issue is the difficulty in measuring the underlying ability of interest; for example, ability is commonly measured in terms of other test items (e.g. overall test scores), which may themselves be subject to DIF. We also compare our score tests with the minimum distance estimator developed by Peracchi and Rossetti (2013) and show that the score test appears to have greater power in detecting departures from the null of RC and VE.
The paper is organised as follows. Section 2 introduces the SHARE data, the self-reported health variable and corresponding vignettes, and illustrates the presence of DIF. Section 3 sets out the modelling approaches for ordered categorical outcomes in the absence of DIF, and Section 4 in the presence of DIF. The latter relies on information contained within the responses to the vignettes. Our contributions are developed in Section 5, where we propose an amendment to the usual statistical approach to account for DIF which lends itself to simple score tests of both VE and RC, individually and jointly. The tests are also derived in this section. We apply the amended specification to SHARE data in Section 6 and implement the score tests. Section 7 sets out the finite sample properties of the amended specification and test procedures. Section 8 provides concluding remarks.

| PAIN FROM SHARE
This section introduces the SHARE data, including the categorical health outcomes used to measure pain. Using the set of corresponding vignettes for pain, we show prima facie evidence of DIF in these data. SHARE is a multidisciplinary and cross-national panel dataset of individuals aged 50 or above and over time has expanded to cover 28 countries. The survey collects information on health, socioeconomic status, and social and family networks. A particular virtue of SHARE is that information on self-reported health together with vignettes is included within the survey.
In the context of a diverse continent like Europe, differences in language and in cultural and social norms are likely to lead to differences in the way individuals respond to survey instruments. The application of anchoring vignettes is, therefore, important for enhancing cross-country comparability. Together with self-assessments, vignettes on health were collected from subsamples of respondents in the first two waves of SHARE. The first wave contained three vignette questions for each domain of health and the second wave contained a single vignette only. Due to the increased number of vignettes available, which is helpful to illustrate how the score test might be applied in practical applications, we use data only from the first wave. Data from Belgium, France, Germany, Greece, Italy, the Netherlands, Spain and Sweden were included in the subsample responding to the self-assessed health questions and associated vignettes. SHARE data has been popular for studies investigating differences in reporting behaviour and, more generally, the method of anchoring vignettes (see, e.g. Bago d'Uva et al. (2008), Angelini et al. (2012), Paccagnella (2013), Peracchi and Rossetti (2013), Van Soest and Vonkova (2014), and Jones et al. (2018)).
We consider data for the health domain representing pain and restrict our analysis to respondents aged 50-80 years. In addition to a self-assessment component, respondents were also asked to rate three vignettes for pain, representing different levels of severity, using the same response categories ('None', 'Mild', 'Moderate', 'Severe' and 'Extreme'). Appendix A contains the self-assessment question together with the vignettes, and Table 1 reports the frequencies for the responses observed in the data. The level of pain described in each vignette is increasing from vignette 1 (least pain) to vignette 3 (most pain). Due to the low prevalence of responses in the 'Extreme' category for the self-assessment and the first vignette, the responses for 'Severe' and 'Extreme' have been collapsed.
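The collapsing of sparse response categories can be sketched as a simple recode; the response vector below is illustrative rather than taken from SHARE.

```python
# Collapsing sparse categories: 'Severe' and 'Extreme' are merged into a
# single top category, as done for the self-assessment and the first vignette.
COLLAPSE = {"None": 0, "Mild": 1, "Moderate": 2, "Severe": 3, "Extreme": 3}

responses = ["None", "Extreme", "Severe", "Mild"]  # illustrative responses
coded = [COLLAPSE[r] for r in responses]
print(coded)  # [0, 3, 3, 1]
```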
For modelling the self-assessed pain outcome, we employ the set of covariates presented in Table 2. These represent plausible determinants and indicators of pain and also feature in Peracchi and Rossetti (2013), who also used the SHARE data to illustrate their minimum distance test of the underlying assumptions of the CHOPIT model. As the data serve to illustrate modelling self-reported outcomes in the presence of DIF, and the proposed score tests for RC and VE in the CHOPIT model, rather than being the substantive focus of the paper, we choose to keep the model parsimonious. The CHOPIT approach (see Sections 3 and 4 for more details) requires two sets of covariates: those which affect the underlying latent scale of the construct of interest, x, and those which shift the inherent boundary parameters of the model, z. In the absence of persuasive information on appropriate exclusion restrictions, we set x = z (this is commonplace in the literature). Thus, the specification includes binary variables for males (Male: 47% of our sample); respondents aged 66-75 years (Age 66-75: 28%) and aged 76 and over (Age > 75: 8%); post-school education (EducPS: 21%); and the presence of health conditions (AnyCond: 71%). An indicator variable representing below-average hand grip strength is also included (Grip35: 53%), which is based on up to four measurements conducted by a trained interviewer. Our working sample is 3802 individuals.
As the responses to the survey self-reports of pain are ordinal, they can be modelled as a function of covariates using an ordered probit (OP) model, as set out in Section 3. This approach assumes that individuals use a given fixed reporting scale that does not differ across respondents, that is, that DIF does not exist in the data. We can illustrate the likely extent of DIF in the self-reports by simply considering responses to the set of vignettes. Since the vignettes describe fixed levels of a given domain that are presented to all respondents, variation in reporting on the vignettes by characteristics of individuals is indicative of systematic reporting behaviour. Table 3 shows reporting differences by covariates for each of the three vignettes. For each characteristic, the table reports the proportion of respondents classifying the vignette as either no or mild pain. For gender, Pearson chi-squared statistics and associated p-values are provided, while the corresponding chi-squared statistics from Kendall's τ and associated p-values are provided for the remaining characteristics.
The results in Table 3 indicate the likely presence of DIF in the levels of all covariates considered, in response to at least one of the three vignettes. For example, women are more likely to rate vignette 2 as no or mild pain compared to men; individuals reporting no health conditions are more likely to rate vignette 1 (the least severe vignette) as no or mild pain than counterparts with health conditions; the more educated are less likely than the less educated to rate vignette 3 (the most severe vignette) as no or mild pain. Younger respondents are more likely than older respondents to report vignette 1 as no or mild pain and less likely to rate vignette 2 as no or mild pain. These results provide prima facie evidence of the use of different reporting scales, or DIF, in respondents' assessments, which is likely to also exist in the self-assessments of the same health construct.

| MODELLING ORDERED OUTCOMES IN THE ABSENCE OF DIF
Our measures of pain are responses on a categorical (Likert) scale which can be modelled using ordered (probit or logit) response models (Greene & Hensher, 2010). Underlying the OP model (indeed, both) is a latent variable, y*, which is a linear (in unknown parameters, β̃) function of observed characteristics x with no constant term (throughout we denote a no-constant subvector/matrix by use of '∼', and denote a subvector/matrix containing a constant by the absence of '∼'). The term ε_y represents a standard normal disturbance term, such that

y* = x′β̃ + ε_y, (1)

where y* is mapped into the observed j = 0, …, J−1 outcomes via the usual mapping

y = j if μ_{j−1} < y* ≤ μ_j, (2)

where μ_{−1} = −∞ and μ_{J−1} = +∞, and where, to ensure well-defined probabilities, μ_{j−1} < μ_j, ∀j. The expressions for the resulting probabilities and likelihood functions are well known (e.g. see Greene and Hensher (2010)). Applying the OP model to the self-reported outcomes for pain yields the set of estimates presented in column (1) of Table 4. In general, levels of pain are lower for males compared to females, and for respondents who have a post-school qualification. Respondents reporting the presence of health conditions experience greater levels of pain, as do those with below-average grip strength. Pain also increases with age (test of joint significance: χ²(2) = 6.74; p = 0.034). However, for the OP coefficients to be unbiased, we need to assume that all respondents use the same reporting scales, such that the boundary parameters, μ_j, are common to all respondents. This implies an absence of DIF. As we have seen in Table 3, this is unlikely to be the case.
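The OP likelihood just described can be sketched directly from the cell probabilities Φ(μ_j − x′β̃) − Φ(μ_{j−1} − x′β̃); the covariate and parameter values in the example call are illustrative only.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def op_loglik(y, X, beta, mu):
    """Ordered probit log-likelihood.
    y: observed categories 0..J-1; X: rows of covariates (no constant);
    beta: slope vector; mu: boundary parameters mu_0 < ... < mu_{J-2}."""
    bounds = [-math.inf] + list(mu) + [math.inf]
    ll = 0.0
    for yi, xi in zip(y, X):
        index = sum(b * x for b, x in zip(beta, xi))
        p = Phi(bounds[yi + 1] - index) - Phi(bounds[yi] - index)
        ll += math.log(p)
    return ll

# Toy check: one covariate, three categories, illustrative parameter values.
print(op_loglik([0, 1, 2], [[0.0], [0.5], [1.5]], [1.0], [-0.5, 0.5]))
```

In practice the parameters β̃ and μ would be found by maximising this function; the sketch only evaluates it at assumed values.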

| MODELLING ORDERED OUTCOMES IN THE PRESENCE OF DIF
We now consider extensions to the OP model in the presence of DIF. We first describe an approach that does not rely on the use of vignettes, but which imposes strong assumptions. We then consider an approach that incorporates information from vignette responses to identify the model. We conclude the section with a discussion of approaches used in the literature to investigate the identifying assumptions of the vignette approach.

| Hierarchical ordered probit model (HOPIT)
Differences in reporting scales across individuals can be accommodated by specifying individual-specific boundary parameters, μ_{i,j} (see, e.g. Terza (1985), Pudney and Shields (2000), Boes and Winkelmann (2006), Greene and Hensher (2010), Greene et al. (2014)). This can be achieved by allowing the boundaries to depend on a set of observed characteristics z_i such that μ_{i,j} = z′_i γ_j. Note, however, that to secure identification the approach imposes the restriction that z_i ∉ x_i. To ensure coherent probabilities, most authors (see, e.g. Greene and Hensher (2010)) adopt the hierarchical ordered probit (HOPIT) approach by specifying the boundaries as

μ_{i,0} = z′_i γ_0;  μ_{i,j} = μ_{i,j−1} + exp(z′_i γ_j), j = 1, …, J − 2. (3)

This model can be estimated by maximum likelihood techniques, where the μ_j in Equation (2) are simply replaced by those of Equation (3).
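The exponential boundary specification in Equation (3) can be sketched as follows; the covariate and coefficient values are hypothetical, and the point is only that the exponential increments guarantee μ_{i,0} < μ_{i,1} < … for any z.

```python
import math

def hopit_boundaries(z, gammas):
    """Boundary parameters mu_{i,j} under the exponential HOPIT specification:
    mu_0 = z'gamma_0;  mu_j = mu_{j-1} + exp(z'gamma_j), j >= 1,
    which guarantees mu_0 < mu_1 < ... regardless of z."""
    dot = lambda g: sum(gi * zi for gi, zi in zip(g, z))
    mus = [dot(gammas[0])]
    for g in gammas[1:]:
        mus.append(mus[-1] + math.exp(dot(g)))
    return mus

# Hypothetical covariates (leading constant plus one shifter) and coefficients:
z = [1.0, 0.5]
gammas = [[-0.3, 0.2], [0.1, -0.4], [0.6, 0.3]]
mus = hopit_boundaries(z, gammas)
print(mus)
print(all(a < b for a, b in zip(mus, mus[1:])))  # ordering always holds: True
```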

| The compound hierarchical ordered probit model
Empirically, it is often difficult to justify exclusion restrictions between x and z. This can be seen in the above example where, from Table 3, we infer that, for example, experiencing health conditions is associated with DIF, but also, from Table 4, that health conditions are a significant predictor of pain. However, for any variable that appears in both x and z, since the first threshold in Equation (3) is specified linearly, the corresponding elements of γ_0 and β̃ are not separately identified in the absence of further information. Identification can be resolved by the availability of (anchoring) vignettes, which are used in conjunction with the main self-report of interest. The following is an example of a vignette for pain taken from SHARE (vignette m1 in Appendix A.1): "Karen has a headache once a month that is relieved after taking a pill. During the headache she can carry on with her day-to-day affairs. Overall in the last 30 days, how much of bodily aches or pains did Karen have?" The categories (and scale) available to respondents are the same as those used to self-assess levels of pain, namely, in our example: None, Mild, Moderate, Severe and Extreme.
Assume that for a randomly chosen individual, the response to the self-report on the latent scale, y*, is given by model (1) and the corresponding response to the kth vignette, v*_k, as

v*_k = α_k + ε_k, (4)

where ε_k ∼ N(0, σ²_k). Note that the number of vignettes available (K) will vary across surveys, but in general it is likely to be small (typically K ≤ 3). When more than one is available (K > 1), there is a trade-off between improved model identification due to using more vignettes and potentially increased bias due to the heightened probability that one may violate the requisite assumptions (described below). Indeed, the testing procedures developed in this paper would appear to be fundamental to the choice of vignettes used, and hence of K, where multiple vignettes are available in a given dataset. The observed response to the self-report, y, and to each vignette, v_k, is determined as in Equation (2), by considering their relationship with the boundary equations. Heterogeneity across these response scales is once more accommodated by specifying the boundaries as a function of variables, z (see Equation (3)). In this set-up, we do not need to impose exclusion restrictions between x and z, and it is common to assume x = z. However, to aid exposition, we retain the labelling x and z throughout.
We refer to the HOPIT model with vignettes as the CHOPIT model (see, e.g. Vonkova and Hullegie (2011), Paccagnella (2013), Van Soest and Vonkova (2014)). Identification of the model follows from the assumptions of RC and VE (King et al., 2004). In practice, RC implies that the boundary parameters are the same across the self-report of interest and all K vignettes. Formally, RC imposes the restriction

γ_{j,k} = γ_{j,0}, j = 0, …, J − 2; k = 1, …, K,

where k = 0 indexes the boundary equations for the self-report of interest (μ_{0,0}, …, μ_{J−2,0}) and k = 1, …, K the corresponding boundary equations for the vignettes (μ_{0,k}, …, μ_{J−2,k}). Note that this equivalence of boundary parameters across the self-report of interest and the vignette equations necessitates that all are measured on the same scale (they all have the same set of possible responses and use the same response categories). VE, in contrast, implies that the underlying level of the construct of interest described by a vignette is perceived by all respondents in the same way and on the same unidimensional scale, except for random error (Equation (4)). The alternative is to consider the more general specification in which the latent response is a function of respondent characteristics, such that

v*_k = α_k + x′β̃_k + ε_k. (5)

VE therefore imposes the linear restriction(s) that β̃_k = 0, ∀k. In practice, therefore, the usual CHOPIT approach simply omits the term x′β̃_k in estimation. With all these elements in place, the log-likelihood function for the CHOPIT model consists of two distinct parts: one relating to the self-report of interest (lnL_HOPIT) and the other to the vignette component of the model (lnL_V). When there are several vignettes, lnL_V is the sum over the K of these. The first term, lnL_HOPIT, is a function of β̃ and μ_{j,k=0} (γ_{j,k=0}), and the second term(s), lnL_V, is a function of α_k, σ_k and μ_{j,k} (γ_{j,k}), where k > 0. These two components of the likelihood are then linked by the common boundary parameters.
The log-likelihood can therefore be written as

lnL = lnL_HOPIT + Σ_{k=1}^{K} lnL_{V,k}.

Column (2) of Table 4 presents CHOPIT estimates using the vignette m1 for pain described above. Assuming RC and VE hold, the use of vignette responses should adjust for DIF to produce unbiased estimates of the parameters in the outcome equation. The scaling of the primary equation of the CHOPIT model is the same as the OP (σ²_y = 1) and hence the parameter estimates are directly comparable. While the broad effect of covariates on outcomes remains the same across the two models (for example, levels of pain are generally lower for males compared to females, and for respondents who have a post-school qualification), the coefficients change notably. The coefficient on male is −0.163 in the OP results and −0.244 in the CHOPIT results. In absolute terms, this represents an increase of approximately 1.6 standard errors on the OP estimate. The estimated effects of any condition and grip strength on pain reduce by approximately 1 and 1.3 standard errors, respectively. Clearly, controlling for DIF appears to be important in these data. Note that the set of covariates used in the boundary equations of the CHOPIT model, z, is the same as the set of covariates in the mean function, x. The sets of boundary coefficients are presented in Appendix C. As noted above, however, the validity of the CHOPIT approach rests on the assumptions of RC and VE.
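A minimal sketch of this two-part likelihood, with boundaries held constant across individuals for brevity (the full model makes them functions of z via Equation (3)), might look like the following; all parameter values in the example call are illustrative.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def cell_logprob(j, index, bounds, sigma=1.0):
    """log P(category j) given a latent index, common boundaries and scale."""
    lo, hi = bounds[j], bounds[j + 1]
    return math.log(Phi((hi - index) / sigma) - Phi((lo - index) / sigma))

def chopit_loglik(y, X, beta, v, alphas, sigmas, mu):
    """Sketch: lnL = lnL_HOPIT + sum over vignettes of lnL_V. The self-report
    and vignette components share the boundaries mu (RC), and each vignette's
    latent level is a constant alpha_k (VE)."""
    bounds = [-math.inf] + list(mu) + [math.inf]
    # self-report component
    ll = sum(cell_logprob(yi, sum(b * xj for b, xj in zip(beta, xi)), bounds)
             for yi, xi in zip(y, X))
    # vignette components: v[k] holds the responses to vignette k
    for vk, a, s in zip(v, alphas, sigmas):
        ll += sum(cell_logprob(vi, a, bounds, s) for vi in vk)
    return ll

# Tiny illustrative call: one respondent, one vignette, three categories.
print(chopit_loglik([0], [[0.0]], [1.0], [[1]], [0.3], [1.0], [-0.5, 0.5]))
```

The shared `bounds` argument is what links the two components, exactly as the common boundary parameters link lnL_HOPIT and lnL_V in the text.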

| Investigating the identifying assumptions of RC and VE
The empirical literature has attempted to investigate the assumptions of RC and VE in applications of the CHOPIT model. However, much of this literature is based on exploratory tests of the assumptions rather than a direct parametric test. For example, tests for VE have largely relied on indirect methods based on the relative rankings of vignettes by respondents to inform whether they are perceived in a consistent way across all survey participants, and results have tended to be ambiguous. Van Soest and Vonkova (2014) illustrate how RC and VE can be tested in the absence of objective measures. Using data from SHARE, they consider the ranking of a respondent's self-evaluation among the respondent's evaluations of vignettes and how these vary across socio-economic groups. These are then compared to the rankings obtained following an application of the CHOPIT approach. This leads to a test of the parametric assumptions inherent in the CHOPIT model when compared to a non-parametric alternative.
Of particular relevance to the current paper, Peracchi and Rossetti (2013) provide a direct test of the assumptions of RC and VE by exploiting the fact that, under the two assumptions, the CHOPIT model is over-identified. The test, applied to health domains in SHARE, rejects the joint assumptions of RC and VE. They show that in the absence of the restrictions implied by the joint test for VE and RC, only reduced-form parameters can be estimated. These are obtained from a set of hierarchical ordered response models estimated in the spirit of Pudney and Shields (2000); see Section 4.1.
Applying the restrictions imposed by RC and VE together with the reduced-form estimates, a minimum distance estimator is used to recover the underlying parameters. For example, for a model with a dependent variable containing J ordered outcomes, l regressors and K vignettes, imposing the assumptions of RC and VE, together with the usual location and scale normalisation restrictions imposed in OP models, leads to s = {J(l + 1) + 1}(K + 1) parameters to be estimated. Note that we adopt a different notation to Peracchi and Rossetti (2013) to be consistent with the exposition set out in Section 3 (Peracchi and Rossetti (2013) assume R + 1 ordered outcomes (J = R + 1 in the above), J vignettes (K = J in the above) and k regressors (l = k in the above)). Fitting K + 1 (K vignettes plus the self-assessment) generalised ordered probit models leads to q = (J − 1)(l + 1)(K + 1) reduced-form parameters. These are composite parameters, since the coefficients in the thresholds and the mean function are not separately identifiable (Peracchi and Rossetti (2013) assume linear specifications of the boundary equations). Assuming RC and VE imposes {(J − 1)(l + 1) + l}K + 2 restrictions, implying there are p = l + (J − 1)(l + 1) + 2K free parameters that can be recovered through a minimum distance approach. With one or more vignettes, the CHOPIT model is over-identified such that, under the null hypothesis that RC and VE hold, nQ_n(θ̂) ⇒ χ²_{q−p} as n → ∞, where Q_n(θ̂) is the minimum distance criterion evaluated at the solution θ̂ and q − p is the number of over-identifying restrictions; see Peracchi and Rossetti (2013) for further details.
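The parameter counting can be checked mechanically. In the sketch below, J = 4 (our collapsed categories), l = 6 (the covariates of Table 2) and K = 1 or 3 are taken from our illustration; these inputs are assumptions for the example rather than part of the formulas.

```python
def md_counts(J, l, K):
    """Counts for the Peracchi-Rossetti over-identification test: q reduced-form
    parameters, p free parameters, and chi-squared degrees of freedom q - p."""
    q = (J - 1) * (l + 1) * (K + 1)    # reduced-form parameters
    p = l + (J - 1) * (l + 1) + 2 * K  # free parameters under RC and VE
    return q, p, q - p

# Our illustration: J = 4 collapsed categories, l = 6 regressors (Table 2),
# and K = 1 (the CHOPIT of Table 4) or K = 3 (all wave-1 pain vignettes).
print(md_counts(4, 6, 1))  # (42, 29, 13)
print(md_counts(4, 6, 3))  # (84, 33, 51)
```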
The mixed findings in support, or otherwise, of RC and VE clearly indicate that whether these two assumptions hold will vary across surveys, the subgroups under comparison, the instruments of interest and the particular vignettes (wording and meaning) used. We set out below a simple-to-implement test of the assumptions of the CHOPIT model that can be readily used in applications of the approach.

| SCORE TESTS OF THE ASSUMPTIONS OF THE CHOPIT MODEL
This section develops score tests of the assumptions of RC and VE, both jointly and individually. The score test is appealing since only the model under the null requires estimation (i.e. the CHOPIT model). For such an approach to be valid, though, the model under the alternative must be theoretically identified, which is not the case for the standard CHOPIT model. However, we can achieve identification with two amendments to the model. First, we restrict the variances σ²_k in Equation (4) to be unity, and second, we re-specify the first boundary equation (see Equation (3)) to be an exponential function of the boundary covariates. The approach of re-parameterising a model to facilitate a score test has precedents in the literature; see, for example, Greene and McKenzie (2015) with regard to testing for a zero variance in nonlinear panel data models. The amendments, why they are required and their implications are described in more detail below.

| Restriction on the variance of the vignette equations
It is common in the literature to allow σ²_k to be unrestricted or to be equivalent across all K vignettes; for example, see King et al. (2004). We adopt the normalisation σ²_k = 1, ∀k. Variance parameters are generally not identified in ordered choice models (Greene (2018), pp. 730-731). Indeed, these parameters are numerically unidentified under the alternative hypothesis (i.e. failure of RC and VE) in the CHOPIT model. A scale parameter in each vignette of Equation (4) becomes identified under the null hypothesis through information about the cell probabilities and the externally imposed thresholds μ_{j,0} in Equation (3); see Kapteyn et al. (2011), footnote 7, for discussion on this point.
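The non-identification of the scale parameter is easy to demonstrate numerically: rescaling the index, the boundaries and σ by the same constant leaves every observable cell probability unchanged. A sketch with illustrative values:

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def cell_probs(index, mu, sigma):
    """Cell probabilities of an ordered model with latent scale sigma."""
    bounds = [-math.inf] + list(mu) + [math.inf]
    return [Phi((hi - index) / sigma) - Phi((lo - index) / sigma)
            for lo, hi in zip(bounds, bounds[1:])]

# Rescaling everything by c > 0 leaves the observable cell probabilities
# unchanged, so sigma is not separately identified from the other parameters.
c = 2.5
base = cell_probs(0.4, [-0.5, 0.5, 1.2], 1.0)
scaled = cell_probs(0.4 * c, [-0.5 * c, 0.5 * c, 1.2 * c], c)
print(all(abs(a - b) < 1e-12 for a, b in zip(base, scaled)))  # True
```

Extraneous information, here supplied by the vignettes, is what breaks this invariance under the null.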
The suggested score test (to be described in detail below) will essentially consist of an alternative 'model' comprising a series of independent HOPIT models for all of the k = 0, …, K constructs of interest: that is, the self-assessment under scrutiny (k = 0) as well as the available vignette outcomes (k > 0). As noted, the scale of each of these models cannot be separately identified from the structural parameters of the model (without the use of extraneous information, as afforded by the vignettes in the usual CHOPIT set-up). As the score test requires the alternative model to be numerically identified, the variances in the separate vignette equation(s) must all be restricted to unity.
There are two points of note here. First, in Section 7.2 we consider relaxing this restriction in a Monte Carlo experiment to assess the size of the score test; the results suggest that the test(s) perform well regardless of whether this restriction is imposed or not. Second, testing the assumption of RC requires that the boundary parameters are equivalent across the self-assessment and vignettes. Taking the first boundary equation and a single vignette as an example, then from Equations (4) and (5), imposing RC (due to the different treatment of the scale effects in the two constructs) would require that

γ_{0,0}/σ_0 = γ_{0,1}/σ_1. (7)

There are two obvious implications of Equation (7): the asymmetric treatment of the scale parameters across constructs appears somewhat arbitrary; and what we actually estimate in practice is γ_{0,0}/σ_0, where one simply, again arbitrarily, sets σ_0 = 1. For these reasons, and also to facilitate explicit testing of the RC assumption, we simply set all scale parameters throughout equal to unity (although, notwithstanding the identification issues raised above, this is likely to be inconsequential: implicitly, any scale effects where σ is not directly estimated will be absorbed into the estimation of the relevant boundary parameters).

| Specification of the first boundary equations
The exponential form for the boundaries in model (3) is useful as it ensures the necessary ordering of the resulting boundary parameters. However, the implementation of this approach treats the first boundary parameter (μ_{0,k}) asymmetrically with respect to the other boundary parameters (which enter linearly and non-linearly, respectively). Moreover, as with the treatment of the scale effects of the vignettes described above, our alternative/generalised model (required for the score test, see below) requires that the separate HOPIT models for all constructs be numerically identified.
Clearly, with x ≡ z and the first boundary equation specified as μ_{i,0} = z′_i γ_0, this will not be the case. To yield a model that is numerically identified under the alternative (where RC and/or VE do not hold), we suggest the following modification to the specification of the first boundary parameter:

μ_{i,0,k} = γ_{0,k} + exp(z̃′_i γ̃_{0,k}), (8)

where, again, k = 0 indexes boundary equations for the self-report of interest (μ_{0,0}, …, μ_{J−2,0}) and k = 1, …, K the corresponding boundary equations for the vignettes (μ_{0,k}, …, μ_{J−2,k}).
Due to the presence of the leading term, γ_{0,k}, the first boundary μ_{i,0,k} is free to lie anywhere on the real number line (that is, there is no restriction that μ_{i,0,k} > 0). The remaining (J − 2) boundaries follow an analogous specification to that set out in Equation (3).
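A sketch of the amended first boundary equation, under our reading of Equation (8); the leading constant and coefficient values below are hypothetical, chosen only to show that the first boundary can be negative while the ordering is preserved.

```python
import math

def amended_boundaries(z_tilde, lead_const, gammas):
    """First boundary under the amended specification:
    mu_0 = lead_const + exp(z~'gamma_0), so mu_0 may take any sign;
    higher boundaries add further exp terms, preserving the ordering."""
    dot = lambda g: sum(gi * zi for gi, zi in zip(g, z_tilde))
    mus = [lead_const + math.exp(dot(gammas[0]))]
    for g in gammas[1:]:
        mus.append(mus[-1] + math.exp(dot(g)))
    return mus

mus = amended_boundaries([0.5], lead_const=-2.0, gammas=[[0.4], [-0.2], [0.1]])
print(mus[0] < 0, all(a < b for a, b in zip(mus, mus[1:])))  # True True
```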
The simple non-linear transformation of the first boundary equation (along with the scale restrictions described above) therefore numerically identifies a HOPIT model of the form described in Section 4.1 without the need for exclusion restrictions for all of the models/constructs in the system (that is, the self-report of interest as well as all vignettes).
Note that we parameterise the model such that the linear constant term, γ_{0,0}, enters in the main-effects equation for y*, and not in this first boundary equation. This follows from location normalisations in OP-type models, which typically restrict the constant in the main equation to zero. Alternatively, one can leave this parameter unconstrained in the main equation and instead restrict the constant in the first boundary equation to zero. These approaches are numerically identical (Greene & Hensher, 2010).

| A generalised alternative model and score test
While leaving the underlying model essentially unchanged, the amended specification described above both improves model identification (by removing the linearity in the first boundary equation and restricting scale effects) and lends itself to a score test of the explicit assumptions of RC and VE in the usual CHOPIT set-up. That is, it allows a generalised model to be considered (one that is numerically identified), consisting of a system of independent HOPIT models for all constructs, which collapses to the usual CHOPIT model under the set of parameter restrictions implied by both RC and VE.
More formally, following the amended specification we have the usual underlying index function for the self-report of interest,

y* = x′β + ε_y,  ε_y ∼ N(0, 1), (9)

together with the generalised form for the vignette equation(s) given by Equation (6). Under VE, that is, β̃_k = 0, ∀k, this form collapses to Equation (4), with the exception that now ε_k ∼ N(0, 1), as described above.
To allow us to test for RC in this modified set-up, we have, for all of the k = 0, 1, …, K constructs, boundary equations of the form

μ_{i,0,k} = exp(z̃′_i γ̃_{0,k});  μ_{i,j,k} = μ_{i,j−1,k} + exp(z̃′_i γ̃_{j,k}), j = 1, …, J − 2. (10)

Note that in Equation (10) the treatment of μ_{0,k} differs from that in Equation (8) in that the constant term has (equivalently) been moved into the mean Equation (9). Finally, as noted above, RC implies equivalence of the parameters γ̃_{0,k} and γ̃_{j,k} across all boundary equations for k = 0, 1, …, K. This provides us with a simple parameter restriction test of RC. So here, under RC, Equation (10) collapses to simply

μ_{i,0} = exp(z̃′_i γ̃_0);  μ_{i,j} = μ_{i,j−1} + exp(z̃′_i γ̃_j), j = 1, …, J − 2. (11)
This amended specification of the standard CHOPIT model identifies separate HOPIT models for all of the k = 0, 1, …, K constructs, defined by Equations (6), (9) and (10). Under the null of RC and VE, the set of generalised HOPIT models collapses to the (boundary-amended) CHOPIT model. As all of these restrictions have been shown to be simple linear ones, they can be tested both individually and jointly using standard score tests based on the likelihoods of the respective unrestricted models evaluated at parameter values under the null; that is, that the restricted CHOPIT model is correctly specified (Greene, 2018). Not only does the score test lend itself to separate and joint tests of the assumptions of RC and VE, it does not require estimation of the more complex alternative models. Full analytical derivatives of the appropriate score vector(s), the formal null and alternative hypotheses, and the corresponding form of the score test are all presented in Appendix B (although one could also use numerical derivatives). Gauss code to undertake the tests is available at http://github.com/aptech/chopitlib.
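The logic of the score test, estimating only the restricted model and evaluating the unrestricted score there, can be illustrated in a deliberately simple toy model (a normal regression with known unit variance, not the CHOPIT likelihood itself):

```python
import math
import random
from statistics import NormalDist

def lm_test_zero_slope(y, x):
    """Toy score (LM) test of H0: beta = 0 in y = mu + beta*x + e, e ~ N(0, 1).
    Mirrors the logic of the CHOPIT score tests: only the restricted model is
    estimated, and the unrestricted score is evaluated at the restricted MLE."""
    n = len(y)
    mu_hat = sum(y) / n                                   # restricted MLE under H0
    g = sum(xi * (yi - mu_hat) for xi, yi in zip(x, y))   # score for beta at H0
    info = sum(xi * xi for xi in x) - sum(x) ** 2 / n     # efficient information
    lm = g * g / info                                     # LM statistic ~ chi2(1)
    p_value = 2.0 * (1.0 - NormalDist().cdf(math.sqrt(lm)))  # chi2(1) tail
    return lm, p_value

random.seed(1)
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.5 + random.gauss(0, 1) for _ in x]   # H0 true: zero slope
lm, p = lm_test_zero_slope(y, x)
print(round(lm, 3), round(p, 3))
```

The CHOPIT tests follow the same recipe with a vector of restrictions (β̃_k = 0 for VE; equality of the γ̃_{j,k} for RC) and the Appendix B score vector in place of the scalar g.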

| APPLIED TO SELF-REPORTS OF PAIN
In this section, we consider the practical implications of the suggested amendments to the boundary equation(s) by comparing them with commonly used specifications. We then apply our suggested score tests to our empirical example of modelling pain in the SHARE data.

| Amended specification
Before applying our score test to the modelling of self-reported pain, we investigate the implications of our suggested amendments for model estimates. Table 5 reports the results of applying the amended CHOPIT model to the self-reported data on pain from SHARE (column 1). The Table also compares these results to those obtained by the standard CHOPIT model (column 2) and a model where the boundary equations are all specified as linear functions of the covariates (column 3). While a linear specification fails to ensure the correct ordering of the boundaries, μ_{i,j}, j = 0, 1, …, J − 2, it has often been applied in empirical applications (see, for example, Bago d'Uva et al., 2008). We include this specification for completeness. The results illustrate the difference in model estimates from changing the specification of the boundaries. To ensure that the estimated coefficients are comparable, all models restrict the variance of the error term in the vignette equation, σ²_k, to unity, as per the amended CHOPIT model. Accordingly, the scale of the estimates differs from the standard CHOPIT model (for which σ²_k is freely estimated) and hence they are not directly comparable to the estimates provided in Table 4. However, a comparison of the relative effects of coefficients (to remove the scaling of parameter estimates) reveals very similar results. For example, the estimated coefficient for male relative to any conditions (AnyCond) in the CHOPIT model is −0.244/0.588 = −0.41 (Table 4). For the amended specification in Table 5, the corresponding relativity is −0.271/0.657 = −0.41. Similar relative estimates are apparent for the other covariates. While restricting the vignette variance to unity changes the scaling of the estimates, their relative interpretation remains the same as in the standard CHOPIT model. Accordingly, marginal effects will be unaffected by the scaling.
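As a quick check of this scale-invariance argument, the two relative effects quoted above can be reproduced directly from the reported coefficients (the dictionaries are merely convenient containers; the numbers are those quoted from Tables 4 and 5):

```python
# Ratios of latent-index coefficients are invariant to the normalisation of
# the error variance, so the two parameterisations give the same relativity.
beta_chopit = {"male": -0.244, "anycond": 0.588}    # Table 4 scaling
beta_amended = {"male": -0.271, "anycond": 0.657}   # Table 5 scaling

r_chopit = beta_chopit["male"] / beta_chopit["anycond"]
r_amended = beta_amended["male"] / beta_amended["anycond"]
# both ratios round to -0.41
```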
Note that full results for the three specifications, including the boundary equations, are reported in Appendix C, Table C.
To further investigate the model implications of the amended specification, Table C3 of the Appendix presents averaged estimated boundaries for the three approaches. The standard exponential and linear specifications provide similar estimates of μ_0, μ_1 and μ_2. Those of the amended exponential approach are substantially larger, but by a constant amount relative to those of the standard (0.999) and linear (approximately 1.008) approaches. The following two panels of the table consider the location of the boundaries with respect to the estimated linear index, x′β̂, and separately the estimated vignette constant term (α_1 in Equation (4)). As with standard OP-type models, it is not just the value of the index function defining y* that is of relevance; it is the position of this index in relation to the boundaries that is essential for generating predictions from the model. Across the three specifications, we see that these quantities are essentially identical, indicating (at least approximate) equivalence of the three approaches. Further evidence of these findings is reported in Tables C4-C6. Table C4 contains the sample correlations of estimated boundary values and y* values. These are clearly all very highly correlated, in all cases close to one. Table C5 considers estimated probabilities. The averages of these are identical across specifications, and the correlations across the individual estimates are 1, or very close to 1, in all cases. Finally, Table C6 contains the implied partial effects for each specification. Again these are essentially equivalent across model specifications.
In summary, while individual parameter estimates may vary across the different boundary specifications, each yields essentially the same model. Importantly, the amended specification of the boundaries does not unduly enforce any implicit or explicit restriction(s) on the model that might adversely affect results and test statistics.

| Tests of RC and VE for self-reports of pain
An application of the score tests to SHARE data is presented in Table 6. The data and specification follow those used in column (1) of Table 5. However, we make use of all possible permutations of the three available vignettes. The joint test of the null of both RC and VE is rejected at conventional levels for all vignettes, used singly or in combination. The test for VE alone (assuming RC holds) fails to reject the null when vignettes V1 or V3 are used singly and when vignettes V2 and V3 are used in combination; however, VE is rejected in all other combinations. When we consider only RC (assuming VE holds), the score test rejects the null for all vignettes and their combinations, as does the joint test. The results emphasise the importance of testing for the identifying assumptions of RC and VE in applications of the CHOPIT model when attempting to correct for DIF.

| MONTE CARLO EVIDENCE
To fully explore both the general implications of the proposed change in boundary specification and the score tests, we consider a series of Monte Carlo experiments. Throughout we simulate data by drawing from SHARE data the set of covariates used in the empirical example described in Section 6.1.
The Monte Carlo experiment simulates data as follows: (i) use all N = 3802 observations and their corresponding covariates x_i from the SHARE data; (ii) construct the latent outcome y* = x′β̃ + ε_y using the parameter estimates from the empirical example presented in Section 6.2 as column (2) of Table 5, together with a randomly generated standard normal error, ε_y ∼ N(0, 1); (iii) construct the latent vignette outcome, v*_{i,1}, by random normal draws from the distribution N(α, 1), with α set to the value obtained by estimation of the model in column (2) of Table 5 (full model estimates, including boundary parameters, are provided in Table C); (iv) the corresponding observed outcomes, y_i, v_{i,1}, are then constructed from their latent counterparts together with knowledge of the boundary parameters (μ̃_{i,0}, …, μ̃_{i,2}) estimated from the model reported in column (2), Table 5. CHOPIT estimation of the simulated y_i and v_{i,1} on the set of covariates x_i is then undertaken. This is repeated for M = 2000 simulations (as are all subsequent Monte Carlo experiments) and results for models for which convergence was achieved (S) are summarised in Table C8 (convergence was deemed to have failed after 500 maximum likelihood iterations).
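A stripped-down sketch of steps (i)-(iv) is given below. All parameter values here are invented placeholders rather than the Table 5 estimates, and random covariates stand in for the SHARE data; the point is the mechanics of the hierarchical boundaries and the interval censoring:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3802
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # stand-in covariates
beta = np.array([0.0, 0.5, -0.3])        # hypothetical index coefficients
gamma = np.array([[-1.0, 0.1, 0.0],      # boundary-equation coefficients,
                  [0.2, 0.0, 0.1],       # one row per boundary (three here)
                  [0.3, 0.1, 0.0]])
alpha = 0.4                               # vignette mean (constant only: VE holds)

# (ii), (iii): latent self-report and latent vignette outcome
y_star = x @ beta + rng.normal(size=n)
v_star = alpha + rng.normal(size=n)

# hierarchical boundaries keep the correct ordering by construction:
# mu_0 = x'g_0, mu_j = mu_{j-1} + exp(x'g_j)
mu = np.empty((n, 3))
mu[:, 0] = x @ gamma[0]
for j in (1, 2):
    mu[:, j] = mu[:, j - 1] + np.exp(x @ gamma[j])

# (iv): observed categories by interval censoring of the latent outcomes
y = (y_star[:, None] > mu).sum(axis=1)    # categories 0, 1, 2, 3
v = (v_star[:, None] > mu).sum(axis=1)
```

CHOPIT estimation on the simulated (y, v, x) would then complete one Monte Carlo replication.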

| Boundary specification
We first illustrate the difference that the amended specification makes to the estimated vector of coefficients, β̃, when compared to the standard exponential or linear specifications. Typically, these are the parameters of most interest in empirical applications. Table C8 presents the results. Data are generated assuming the standard exponential specification of the boundaries and estimated separately assuming the amended exponential, standard exponential and linear specifications.
Monte Carlo coefficients are close to their 'true' values across the different specifications of the boundaries, as can be seen from the small values reported for mean bias. The 5% coverage rate is also within the expected range across all parameter estimates. However, while the standard exponential and amended exponential specifications display high convergence rates, with S/M = 0.998 for both, the convergence rate for the linear specification of the boundaries is low (S/M = 0.289), illustrating the fragility of that specification. This reflects the weaker identification that results from not imposing non-linearity in the boundaries.

| Finite sample performance of the score tests
We evaluate the performance of the score tests by generating data under the null in a similar way to that described above, again based on the estimated coefficients from the empirical models presented in Table 5. We consider three sets of test size experiments, where we generate under the null hypothesis with linear boundaries (column (3), Table 5); with standard exponential boundaries (column (2)); and with amended exponential boundaries (column (1)). We then conduct the tests as if the boundaries were of the amended exponential form.
When the data generating process (DGP) is as the test assumes (amended exponential boundaries), the tests are correctly sized for the joint and RC score variants (Table 7). The VE score variant appears to be marginally undersized (at 4.10% for a nominal 5%). When the true DGP consists of linear boundaries or standard exponentials, the joint and RC score tests appear to be marginally oversized; however, overall the tests remain within an acceptable range. Note that relaxing the assumption of σ²_k = 1, ∀k, does not materially affect the size results, as evidenced in Appendix C, Table C7.
We next consider power experiments using a similar Monte Carlo set-up to that above, but where the assumptions of RC and VE are violated. In the experiments for departures from RC, we perturb the parameter vector corresponding to the boundary equations for the vignettes (i.e. γ_{j1} in Equation (5)) at increasing values away from zero. These experiments are undertaken for a model generated assuming amended exponential boundaries. This is achieved by first generating a vector of standard normal random variates of the same dimension as z(= x). These draws are held fixed.
We then move away from the null of RC by perturbing γ_{j1} in the vignette equation only, adding successively larger quantities to the value under the null. These quantities are dictated by the set of (fixed) random normal variates, with increases achieved by multiplying by a scalar, s_rc, in the range 0.0 ≤ s_rc ≤ 0.20. This ensures greater departures from the null for increasing values of s_rc. For violations of VE, a vector of random variates of the dimension of x is first drawn. We then perturb the corresponding implicit vector of zero coefficients, γ̃ (under the null), on the covariates x (Equation (6)) by multiplying the random draws by a scalar, s_ve, and substituting these as parameters for γ̃. This process is repeated for successively larger values of s_ve such that 0.0 ≤ s_ve ≤ 0.50. For the joint experiments, we employ both approaches simultaneously.
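The perturbation scheme itself is straightforward; a minimal sketch follows (the dimension and grid sizes are our choices, only the scale ranges follow the text):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5                            # hypothetical dimension of z (= x)
delta = rng.normal(size=dim)       # fixed standard-normal draws, held constant

s_rc_grid = np.linspace(0.0, 0.20, 5)
s_ve_grid = np.linspace(0.0, 0.50, 6)

# RC violations: shift the vignette boundary coefficients away from the null.
gamma_pert = [s * delta for s in s_rc_grid]
# VE violations: give the vignette equation nonzero covariate coefficients.
gamma_tilde_pert = [s * delta for s in s_ve_grid]

# each entry would feed one set of Monte Carlo runs of the score tests;
# the departure from the null grows monotonically with the scale factor
norms = [float(np.linalg.norm(g)) for g in gamma_pert]
```

Holding the draws fixed across scales means each grid traces departures in a single fixed direction, so rejection rates can be read as power against increasing violation size.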
The results are then summarised as power curves, plotting rejection probabilities against the size of the perturbation from the null of zero. Three curves are shown: a joint test for RC and VE; and separate ones for RC and VE.
The left-hand side of Figure 2 displays the power curves for all three tests when we violate RC only. The curves are well behaved. Departures from RC result in S-shaped power curves for the test of RC alone and for the joint test (RC and VE). As expected, the test for RC uniformly dominates the joint test; this is because it maximises power in the single direction of the violation, while the joint test is also testing for VE. In comparison, the test for VE remains fairly flat over the range of values for which RC is violated, although it is encouraging that it exhibits some power as a general specification test when VE is not failing but RC is. The right-hand side of Figure 2 presents the power curves when we violate VE alone. While the power curve for the test of VE adopts an approximate S-shape, it appears relatively sensitive (that is, powerful) to small departures from VE and increases fairly rapidly across relatively small increments. Moreover, departures from VE are also reflected in the test for RC. The joint test also adopts an S-shaped curve, but rejects less than the test for VE alone, again because the latter only tests for departures from the null in that particular direction.
A priori, one would not expect increasing departures from the null with respect to VE (RC) to affect the power properties when testing for RC (VE). However, it is well known that it is possible to reject a false model against an alternative model, even if that alternative model is itself incorrect (Davidson & MacKinnon, 1987). Tests that tend to reject a false model in favour of a similarly false alternative are often referred to as general specification tests. In this sense, the test for RC can be considered a useful general specification test, as it also tends to pick up departures in the direction of VE. The same cannot be said of the test for VE, whose power only marginally increases with departures from the null with respect to RC. It is unclear what the specific reasons for these results are, and also whether they will hold more generally.
A probable reason for the strong performance of the RC test in identifying departures from the null of VE is that misspecifying the assumed outcome function in the vignette equation(s) may force the boundary parameters to adjust in order to maintain their relationship with this outcome function. As such, imposing non-VE may also manifest itself as a form of RC violation. In contrast, there appears to be a less persuasive argument for the reverse situation: by moving the boundary parameters further away from the null, the outcome function in the vignette equation has much more limited ability to adjust so as to maintain the relationship with the boundary parameters implied by the null model. Three-dimensional planes of rejection rates against simultaneous departures from both RC and VE are shown in Figure 3. The joint test and the test for RC perform in a similar fashion and, reflecting the results of the power curves, the joint test appears to be dominated by the test for RC. While the test for VE appears to respond somewhat to departures from the null of response consistency, it remains fairly flat over the range of values for which RC is violated.
In summary, the experiments show that: (i) each individual test has greatest power in its own direction; (ii) both individual tests have increasing power in distance from the null with respect to the alternative violation; (iii) power increases with distance from the null in all cases; and (iv) the joint test has increasing power in all directions, maximised by simultaneous deviations from the null in both directions. Note that allowing σ²_k to be unrestricted does not substantially impact the power of the tests, as shown in Appendix C, Figures C1 and C2.

| Comparison of the proposed Score test with the minimum distance estimator approach of Peracchi and Rossetti (2013)
Section 3 of Peracchi and Rossetti (2013) investigates the finite sample performance of their minimum distance estimator using a Monte Carlo experiment. We undertake the same Monte Carlo exercise to compare their results to our score tests (full details of the experiment can be found on p. 712 of Peracchi and Rossetti (2013)). Table 8 presents the results for the situation where there are J = 2 threshold boundaries, k = 1, 2 vignettes and a single covariate: x = z (note that Peracchi and Rossetti (2013) use a different notation, indexing boundaries as r = 1, …, R, vignettes as j = 1, …, J and exogenous regressors as k = 1, …, K). The sample size for the draws is N = 250, and each Monte Carlo exercise consists of M = 1,000 runs (as per Peracchi and Rossetti (2013)).
The first column of results presents rejection rates at a nominal 5% level for the minimum distance estimator. These are followed by the joint score test and the separate score tests for VE and RC. The row labelled H_0 reports the observed size of the various tests at the 5% level; all rejection frequencies are close to the nominal level under the null. Rejection rates are then reported for increasing departures from the null, for VE and RC separately and for both simultaneously. The joint score test displays greater power than the test of Peracchi and Rossetti. This is the case for departures from the null for VE and RC separately and for joint departures. While the test of Peracchi and Rossetti lacks power when only a single vignette is used (but not with multiple vignettes), this is not the case for the score test, which generally increases in power with increasing departure from the null even with a single vignette. The power of the score test, however, also generally increases when a second vignette is included. When comparing the rejection rates across the three score tests for the different departures from the null, the results closely reflect those of the Monte Carlo exercise reported and summarised in Section 7.2 above. Again, the tests have greatest power in their particular direction; power generally increases with increasing departure from the null; and the score test for VE has the lowest power amongst the three tests when each is considered against violations in its own direction.

| CONCLUSIONS
Inter-individual comparison of phenomena such as health status or life satisfaction, which are typically self-reported on an ordered categorical scale, is often subject to differential item functioning due to survey respondents adopting different response scales. Vignettes are increasingly being collected alongside self-reports to anchor such scales and provide greater comparability across individuals. This is particularly relevant when undertaking cross-country comparisons, where differences in cultural norms may lead to the use of very different response scales. We illustrate the effects of correcting for DIF using information on vignettes in an example on self-assessments of levels of pain. We stress that the legitimate use of the vignette approach relies on the two assumptions of RC and VE. In light of this, the paper develops single and joint tests of these assumptions based on a score approach.
Implementation of the test is within the parametric CHOPIT model that imposes a hierarchical structure on the boundary equations to preserve coherency of the model's probabilities. The score approach requires the model to be identified under the alternative hypothesis; the null being that RC and VE hold. This is achieved by augmenting the specification of the boundary equations by including an exponential function in the first boundary equation, together with restricting the variance of the vignette equations to unity. Such changes are innocuous in terms of parameter estimates (and marginal effects) of the coefficients in the mean function, which are typically the focus of empirical work.
An advantage of the test is its ease of implementation, requiring estimation of the restricted model under the null only; this is undertaken using the CHOPIT model. The test may be seen as a complement, or alternative, to Peracchi and Rossetti (2013), who also develop a joint test of RC and VE. However, advantages of the current approach are that separate tests of both RC and VE are available, and that there was evidence of the current test(s) being more powerful. We find that for the empirical example of self-assessed pain the joint null of RC and VE is rejected, such that adjustments using vignettes may be unreliable. Monte Carlo simulations drawn from these data show that the tests have good size and power properties in finite samples, particularly the joint test and the individual test for RC. Our results suggest that the assumption of VE may be more problematic in empirical applications than RC, a finding that mirrors Peracchi and Rossetti (2013). In particular, failure of VE may also be picked up through rejection of RC. Careful design of vignette questions, to aid respondents' common understanding of the descriptions of the hypothetical individuals, may be the best route to improving vignette equivalence. The majority of applications of the vignette approach rely on cross-sectional data. Future research might consider extensions to panel data to control more fully for individual unobserved heterogeneity in reporting behaviour; an extension of such an approach to the modelling of ordered self-assessed health outcomes is provided by Bartolucci and Bacci (2014). In the context of the vignette approach, where feasible, a flexible treatment of reporting heterogeneity may make reliance on the assumptions of RC and VE more plausible.

APPENDIX A
A.1. Vignette descriptions
The three vignettes available for pain within SHARE are:
Vignette (m1): "Karen has a headache once a month that is relieved after taking a pill. During the headache she can carry on with her day-to-day affairs. Overall in the last 30 days, how much of bodily aches or pains did Karen have?"
Vignette (m2): "Maria has pain that radiates down her right arm and wrist during her day at work. This is slightly relieved in the evenings when she is no longer working on her computer. Overall in the last 30 days, how much of bodily aches or pains did Maria have?"
Vignette (m3): "Alice has pain in her knees, elbows, wrists and fingers, and the pain is present almost all the time. Although medication helps, she feels uncomfortable when moving around, holding and lifting things. Overall in the last 30 days, how much of bodily aches or pains did Alice have?"

A.2. CHOPIT model probabilities
Expressions for the probabilities derived for the self-report from the CHOPIT model are
Pr(y_i = j) = Φ(μ_{i,j} − x_i′β) − Φ(μ_{i,j−1} − x_i′β), j = 0, 1, …, J − 1,
with μ_{i,−1} ≡ −∞ and μ_{i,J−1} ≡ ∞.
Corresponding probabilities for the vignette outcome(s), for k = 1, …, K, are
Pr(v_{i,k} = j) = Φ((μ_{i,j} − α_k)∕σ_k) − Φ((μ_{i,j−1} − α_k)∕σ_k), j = 0, 1, …, J − 1.
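A minimal numerical sketch of ordered-probit cell probabilities of this general form (all values below are illustrative placeholders; σ is set to 1 as in the amended model):

```python
import numpy as np
from scipy.stats import norm

# Cell probability of category j is the standard-normal mass between
# adjacent boundaries, evaluated at the latent index (sigma = 1 here).
mu = np.array([-0.8, 0.4, 1.5])   # boundaries mu_0 < mu_1 < mu_2 (J = 4)
xb = 0.3                          # latent index x'beta for one respondent

edges = np.concatenate([[-np.inf], mu, [np.inf]])
probs = norm.cdf(edges[1:] - xb) - norm.cdf(edges[:-1] - xb)
# probs[j] = Pr(y = j); for a vignette, replace xb with the constant alpha_k
```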

APPENDIX B SCORE VECTORS FOR THE TESTS OF RESPONSE CONSISTENCY AND VIGNETTE EQUIVALENCE
In this Appendix, we set out formally the various score vectors required for the score tests described in Section 5 and how these are combined for the separate tests of RC and VE, together with the joint test of both RC and VE. Note that although analytical expressions are given, one could also use numerical derivatives.
First, we derive the score vector for the restricted CHOPIT model (Equations (4) and (11)), and show how particular elements can be adapted to derive the proposed score statistic(s). That is, the score is derived for the boundary-amended CHOPIT model under the assumption, or restrictions, that both VE and RC hold. Testing for VE, on the assumption that RC holds, in this setting requires replacing the appropriate elements of the score corresponding to the derivatives with respect to the vignette mean parameters, α_k, with those of the more general model (Equation (6)), which does not impose VE. As usual, the score test evaluates the score for the more general model (here the model allowing for violations of VE), but at parameter values under the null hypothesis (imposing VE).
Similarly, testing for RC, on the assumption that VE holds, is based on the full score of this restricted boundary-amended CHOPIT model, but now replacing the appropriate elements corresponding to the derivatives with respect to the boundary parameters implied by the restricted specification of Equation (11), with those implied by the generalised version of Equation (10). Again, the test is based on evaluating the generalised score at parameter values under the null hypothesis.
Finally, the joint test of both VE and RC involves replacing the elements of the restricted CHOPIT score with respect to both the vignette mean parameters, α_k, and the boundary parameters, with those implied by the more general specifications. Again, the test involves evaluating the score of the generalised model at parameter values under the null hypothesis.
The score vector for this model consists of a series of partitions, the first of which corresponds to β, (∇β).
Notes to the Appendix C tables: x′β̂ is the estimated linear index including the constant. Boundary equations are estimated hierarchically such that μ_j = μ_{j−1} + exp(x′γ̂). Accordingly, subtracting the linear index, x′β̂, from the first boundary, μ_0, affects all subsequent boundaries; this is denoted 'μ̂_1 with (μ̂_0 − x′β̂)' and 'μ̂_2 with (μ̂_0 − x′β̂)', respectively, and similarly for subtracting the vignette constant, α̂_1, from each boundary. Simulations are generated assuming the set of covariates x and parameters from column (2) of Table C, that is, assuming standard exponential boundaries. Models are estimated assuming (i) amended exponential boundaries, (ii) standard exponential boundaries, and (iii) linear boundaries. S represents the number of model repetitions that converged; MB is mean bias; Cov is the 5% coverage rate.