Estimating the binary endogenous effect of insurance on doctor visits by copula‐based regression additive models

The paper estimates the causal effect of having health insurance on healthcare utilization, while accounting for potential endogeneity bias. The topic has important policy implications, because health insurance reforms implemented in the USA in recent decades have focused on extending coverage to the previously uninsured. Consequently, understanding the effects of those reforms requires an accurate estimate of the causal effect of insurance on utilization. However, obtaining such an estimate is complicated by the discreteness inherent in common measures of healthcare usage. The paper presents a flexible estimation approach, based on copula functions, that consistently estimates the coefficient of a binary endogenous regressor in count data settings. The relevant numerical computations can be easily carried out by using the freely available GJRM R package. The empirical results find significant evidence of favourable selection into insurance. Ignoring such selection, insurance appears to increase doctor visit usage by 62% but, adjusting for it, the effect increases to 134%.


Introduction
The Affordable Care Act, which was passed by the US Congress and signed into law by President Obama in 2010, represented one of the largest expansions of the US social safety net since the 1960s. Through a combination of mandates, subsidies and regulations, the law's primary goal was to extend health insurance coverage to the approximately 44 million Americans who, before the law's passage, lacked coverage. Such a large expected increase in insurance coverage, in turn, raised questions about whether the newly insured would respond by increasing their usage of medical services, possibly straining existing health services infrastructure. Concerns about medical infrastructure have recently reached new heights in the midst of the 2020 coronavirus pandemic.
Estimating the effect of insurance on healthcare usage is surprisingly difficult, because of which can be regarded as variants of the modelling framework that is discussed in Section 2 (e.g. Goldman (1995), Cardon and Hendel (2001), Mello et al. (2002), Deb and Trivedi (2006) and Zheng and Zimmer (2008)). For the most part, existing studies find that insurance increases healthcare usage, but there seems to be widespread disagreement about the extent to which the simultaneity of insurance and usage affects those findings. When unobserved information-in this case, knowledge of future health needssimultaneously affects both the treatment (insurance) and the outcome (healthcare usage), statisticians refer to the treatment as 'endogenous' (Cameron and Trivedi (2013), pages 386-388). Ignoring endogeneity might produce incorrect estimates of the causal effect of insurance. For example, if relatively unhealthy people, knowing that they will need to consume healthcare services, seek insurance coverage to help to pay for those services, then insurance will appear positively linked to healthcare usage. But such relationships cannot be interpreted as causal, because (at least part of) the positive link owes to unhealthy people selecting into insurance coverage. Because healthcare policy in the USA in recent decades has focused on extending coverage to the previously uninsured, obtaining an accurate estimate of the causal effect of insurance on usage is crucial to determine whether such reforms will stretch existing healthcare infrastructure.
From a more general statistical modelling perspective, many microeconometric models, especially those that rely on observational data, are plagued by unobserved heterogeneity that simultaneously correlates with the outcome variable and a right-hand-side explanatory variable of interest. When the outcome variable is continuous with a distributional shape that lends itself to linear regression modelling, the standard approach involves finding 'instruments' that correlate with the endogenous right-hand-side regressor, but not with the outcome variable. In the healthcare usage case-study, a possible instrument might be firm size, which is likely to correlate with a person's probability of having private insurance, but should not (directly) alter healthcare usage. When such instruments are available, the method of instrumental variables, which is also known as two-stage least squares, can be employed; see, for example, Greene (2008), chapter 12, for a textbook treatment of instrumental variables methods. But when the outcome variable has discreteness or shows distributional patterns that call for non-linear models (such as generalized linear and additive models), the two-stage least squares method no longer yields consistent estimates. A number of studies that have sought to quantity the effect of insurance on healthcare usage, some of which are mentioned below, have introduced methods that attempt to import the logic of two-stage least squares to more general settings. Unfortunately, as discussed in the following section, those approaches either lack general applicability and/or may require substantial computational resources to implement. This paper emphasizes cross-sectional settings, where the main focus centres on consistent estimation of the coefficient of a binary endogenous explanatory variable. Throughout, we assume that the researcher has access to valid instruments. As such, the paper does not consider panel data methods for addressing endogeneity, such as fixed effects or difference in differences; nor does it explore instrument-free methods, such as matching (Rosenbaum and Rubin, 1983), synthetic control (Kreif et al., 2016) or approaches that exploit quirks in the higher order moments of the distribution of the outcome variable (Lewbel, 2012).
The primary goal of the paper is to investigate the aforementioned case-study by using a general and flexible estimation approach, based on copula functions, for consistently estimating the coefficient of a binary endogenous regressor in count data settings. Some existing studies have attempted different versions of the method that we discuss (e.g. Han and Vytlacil (2017), Park and Gupta (2012), Radice et al. (2016), Tran and Tsionas (2015), Winkelmann (2012) and Zimmer (2018)). The work by Zimmer (2018) is the closest to ours since it introduces a copula model with binary and Poisson margins, and with an endogenous binary variable. However, our proposal is far more general in that it allows for a wider range of discrete distributions for the count variable, for several link functions for the binary margin, for the specification of flexible covariate effects, for the use of a wider set of copula functions and for the dependence parameter to be specified as a function of covariates. Using our method, we find statistically significant evidence that insurance is endogenous with respect to usage of doctors' services, and that, when endogeneity is taken into account, the effect of insurance is larger than when endogeneity is ignored. Health economists refer to such a pattern as 'favourable selection', with relatively light users predisposed towards being insured.
The paper also highlights the newly revised software package GJRM (Marra and Radice, 2020), written for the programming language R (R Core Team, 2020), which greatly eases the implementation of our model. To the best of our knowledge, there are no alternative copula regression models, nor respective software implementations, of the type that is discussed in this paper. Although the construction and estimation of the model proposed rely on many of the modular functions and routines that are already available in GJRM, extending the software to accommodate the model, the functional forms and nuances that are needed to address the aforementioned case-study required a large amount of programming work. Those developments made it possible to estimate the flexible class of models that are discussed in this paper and hence allow for more flexible specifications than are possible by using extant methods.

Existing methods
Instrument-based approaches for addressing endogeneity in non-standard settings fall into two main categories: two-stage techniques and simultaneous estimation methods. The simplest twostage procedure, which also most closely resembles linear two-stage least squares, is the 'control function' approach (Heckman and Robb, 1985;Terza et al., 2008). The first stage involves regressing the endogenous variable on all explanatory variables, including instruments. This regression is then used to calculate residuals, which, in the second stage, appear alongside the endogenous variable as an additional regressor in the main regression of interest. The control function method is simple and quite general, but it encounters problems when the endogenous variable is not continuous (Wooldridge (2010), page 746). In our case-study the endogenous variable is binary and this muddles what exactly constitutes a 'residual' from the first stage, with different types of definition potentially leading to conflicting results.
A second category of approaches, labelled 'simultaneous estimation methods', attempt to assemble the full joint distribution of the outcome variable and the endogenous regressor. The joint distribution is then typically used for likelihood-based estimation; see, for example, Terza (1998) and Cameron and Trivedi (2013), pages 385-412. Such approaches usually decompose the (unknown) joint distribution into the marginal distribution of the outcome conditionally on the endogenous regressor and the marginal distribution of the endogenous regressor. These methods, however, can be quite computationally intensive if, as is often assumed, the two marginals share an unobserved random effect, which then must be integrated out. For example, Deb and Trivedi (2006) reported that one variant of this approach, based on simulated maximum likelihood, shows 'quite slow' convergence, which led them to suggest several simulation acceleration tricks, including alternative methods for drawing random numbers. Indeed, the simulated maximum likelihood based code that was generously provided by Deb and Trivedi (2006)-complete with their 'accelerators'-applied to the case-study that is explored in this paper required almost 30 min to converge on a desktop computer, compared with mere seconds for our proposed copula-based method.
The approach that is proposed in this paper falls into the 'simultaneous estimation methods' category, but it side-steps the numerical integration obstacle by joining (via copulas) the two marginal distributions, yielding a closed form expression for the likelihood function. Consequently, parameter estimation and inference are far less computationally taxing.
The data that are analysed in the paper and the programs that were used to analyse them can be obtained from https://rss.onlinelibrary.wiley.com/hub/journal/14679876/seriesc-datasets.

Methodology
This section discusses a recursive copula additive model to estimate the effect of a binary endogenous variable on a count outcome. Details on identification, parameter estimation and software implementation are also provided.

The model
To simplify the notation, and without loss of generality, we drop observation index i, while noting that n observations are available for modelling. We begin by assuming that the joint cumulative distribution function (CDF) of a binary (endogenous) variable and a discrete outcome variable, Y 1 ∈ {0, 1} and Y 2 ∈ N 0 respectively can be expressed as F 12 .y 1 , y 2 |ϑ/ = C{F 1 .y 1 |π/, F 2 .y 2 |μ, σ/; θ}, .1/ where ϑ = .π, μ, σ, θ/ . Terms F 1 .y 1 |π/ and F 2 .y 2 |μ, σ/ denote marginal CDFs of Y 1 and Y 2 taking values in .0, 1/, whereas the symbols π, μ and σ represent marginal distributional parameters. Function C : .0, 1/ 2 → .0, 1/ is a two-place copula function which does not depend on the marginals, and θ is an association copula parameter measuring the dependence between the two random variables. A substantial advantage of the copula approach is that a joint CDF can be conveniently formed by utilizing two (in this case) arbitrary univariate marginal CDFs and a function C that binds them together. As opposed to what is found in classical copula regression settings, in this work the binary variable y 1 appears as an explanatory variable inside μ in the marginal F 2 .y 2 |μ, σ/. Thus, the copula has a recursive structure. This recursive structure implies that y 1 is endogenous with respect to y 2 if dependence between the two marginals, as captured by θ, is statistically significant. See Han and Vytlacil (2017) and references therein for some works which have adopted the same logic in a copula regression context.
The copulas that were implemented in GJRM are reported in Table 1. Table 1 also shows the relationship between θ and Kendall's τ -coefficient, which is a measure of association that lies in the customary range τ ∈ [−1, 1]. Counterclockwise rotated versions of the Clayton, Gumbel and Joe copulas are obtained by using the formulae in Brechmann and Schepsmeier (2013). For more details on copulas see, for example, Nelsen (2006). In the current setting, the result of Sklar (1973) can only guarantee that the copula is unique over the range of the outcomes. In a regression context, however, this potential issue is less likely to be a concern as noted by several researchers including Joe (2014), Nikoloulopoulos and Karlis (2010) and Trivedi and Zimmer (2017), primarily because regression structures in the marginals generate additional variation in the outcomes and thus more completely cover the outcome domains.
In equation (1), the marginals for Y 1 and Y 2 are assumed to be specified via one-and two- ·, ·; θ/ denotes the CDF of a standard bivariate normal distribution with correlation coefficient θ, and Φ.·/ the CDF of a univariate standard normal distribution. t 2,ζ .·, ·; ζ, θ/ indicates the CDF of a standard bivariate Student t-distribution with correlation θ and fixed ζ ∈ .2, ∞/ degrees of freedom, and t ζ .·/ denotes the CDF of a univariate Student t-distribution with ζ degrees of freedom.
Quantities Q and R are given by 1 + .θ − 1/.p 1 + p 2 / and Q 2 − 4θ.θ − 1/p 1 p 2 respectively. Kendall's τ for "PL" is computed numerically as no analytical expression is available. The argument BivD of gjrm() in GJRM enables the user to employ the desired copula function and can be set to any of the values within parentheses next to the copula names in the first column; for example, BivD = "J0". parameter distributions respectively: hence the notation that is adopted. However, the computational framework can be conceptually easily extended to distributions with more parameters. For the binary endogenous variable Y 1 , we have considered the Bernoulli distribution with parameter π ∈ [0, 1] (representing the probability of 'success'). For the outcome variable, Y 2 , GJRM has been extended to include four possible choices for discrete distributions; see Table 2 for the expressions of their probability mass functions (PMFs) f , expectations and variances.
Finally, each model's parameter can be related to covariates and regression coefficients via an additive predictor η ∈ R (defined in generic terms in the next paragraph) and a known monotonic one-to-one transformation function (which ensures that the restriction on the respective parameter space is not violated). For example, if we wish to employ a Gumbel copula model with Bernoulli and Poisson margins and would like to express π, μ and θ as functions of additive predictors then g π .π/ = η π , g μ .μ/ = η μ and g θ .θ/ = η θ , where g π .·/ = logit.·/, g μ .·/ = log.·/ and g θ .·/ = log.· − 1/. For the binary margin, in this example, we have assumed a logit link function; however the probit and cloglog-functions have also been implemented.
Note that specifying the dependence parameter as a function of covariates allows for the strength of the (upper tail, in the above example) dependence between the marginals to vary across observations (e.g. , and references therein). Furthermore, because the dependence parameter in our proposed set-up captures the magnitude (and direction) of endogeneity, specifying the dependence term as a function of covariates allows endogeneity to vary across observations, which is something that has not been previously explored. In Table 2. Definition and some properties of the discrete distributions implemented in GJRM † Poisson-inverse Gaussian ("PIG") 2α π 0:5 μ y exp.1=σ/K y−0:5 .α/ .ασ/ y y! μ μ+ σμ 2 †These have been parameterized according to Rigby and Stasinopoulos (2005) and are defined in terms of μ and σ. In all cases, y ∈ N 0 and μ, σ ∈ .0, ∞/. Since the distributional parameters can take only positive values, the trans- x + x −1 /}dx is the modified Bessel function of the third kind. Argument margins of gjrm()in GJRM enables the user to employ the desired binary and discrete marginals. For the discrete margin, the possible values are indicated within parentheses next to the names of the distributions.
our case-study-which seeks to estimate the effect of insurance on healthcare use-we present evidence that the magnitude of endogeneity is larger among females than among males.
Predictor η i (where the subscript denoting which parameter the predictor belongs to has been dropped for simplicity) takes the additive form where β 0 ∈ R is an overall intercept, z ki denotes the kth subvector of the complete covariate vector z i (containing, for example binary, categorical, continuous and spatial variables) and the K functions s k .z ki / represent generic effects which are chosen according to the type of covariate(s) that is considered. Each s k .z ki / can be approximated as a linear combination of J k basis functions b kj k .z ki / and regression coefficients β kj k ∈ R, i.e. (e.g. Wood (2017)) This formulation implies that the vector of evaluations {s k .z k1 /, : : : , s k .z kn /} can be written as Z k β k with β k = .β k1 , : : : , β kJ k / and design matrix Z k .i, j k / = b kj k .z ki /. This enables the predictor in equation (2) to be written as η = β 0 1 n + Z 1 β 1 +: : : where 1 n is an n-dimensional vector made up of 1s. Equation (4) can also be written in a more compact way as η = Zβ, where Z = .1 n , Z 1 , : : : , Z K / and β = .β 0 , β 1 , : : : , β K / . Each β k has an associated quadratic penalty λ k β k D k β k whose role is to enforce specific properties on the kth function, such as smoothness. Note that D k depends only on the choice of basis functions, but not on β k . Smoothing parameter λ k ∈ [0, ∞/ controls the trade-off between fit and smoothness, and plays a crucial role in determining the shape ofŝ k .z ki /. The overall penalty can be defined as β D λ β, where D λ = diag.0, λ 1 D 1 , : : : , λ K D K /. Finally, the smooth functions are subject to centring (identifiability) constraints (Wood, 2017). The above smooth function representation enables us to specify a rich variety of covariate effects (such as linear, non-linear and geographic effects) and we refer the reader to (Wood, 2017) for full details.
For the previous example (of a Gumbel copula model with Bernoulli and Poisson margins), the predictors for π i , μ i and θ i can be written in a compact way as η π = β 10 1 n + Z 11 β 11 +: : : + Z 1K β 1K , η μ = β 20 1 n + β end y 1 + Z 21 β 21 +: : : + Z 2K β 2K , η θ = β 30 1 n + Z 31 β 31 +: : : where y 1 appears as an (endogenous) predictor in η μ , thus giving the set-up a recursive structure. Our main interest is the parameter β end , which represents the effect of the binary endogenous regressor on the predictor of the outcome of interest. In the particular context of a recursive model with an endogenous regressor the set of covariates might be common across the predictors, except for the endogenous equation which must include at least one instrument that is not included in the other predictors (e.g. Han and Vytlacil (2017) and Meango and Mourifie (2014)).

Identification
Han and Vytlacil (2017) considered a version of this model, but where the outcome variable is binary instead of a non-negative integer count. Borrowing from the familiar binary probit set-up, they provided proofs that establish situations where recursive copula constructions have a full rank Jacobian. Their result requires two conditions. The first is that the copula must show first-order stochastic dominance with respect to θ, which says that, as F 1 increases, so also does F 2 , and that such a relationship becomes stronger as θ increases. The second condition is the presence of an instrument that affects the endogenous regressor but not the outcome variable. Without such an instrument, it remains possible to write down copula expressions with recursive structures, but there is no guarantee that specific parameter values yield a unique maximum to the likelihood function that is formed from the copula expression.
Zimmer (2018) extended Han and Vytlacil's argument to settings in which the outcome variable follows a Poisson distribution. The extension hinges on the famous 'law of rare events', which states that a Poisson outcome may be viewed as the sum of 'successes' from many independent Bernoulli trials, so long as the number of trials is large, and the probability of a success in any individual trial is small (Cameron and Trivedi (2013), page 5). And, because Han and Vytlacil's result holds for each individual 'trial', their result should also hold for the sum of many trials, so long as those trials remain independent.
More generally, copula specifications should support recursive structures beyond just binary and Poisson settings. The reason is that parametric copula functions can be generated by 'mixing' marginals distributions that share a common random effect (Trivedi and Zimmer (2007), pages 36-38), and such a mixing method has long been recognized as a way to combine marginals of disparate forms, beyond just binary and Poisson (Fridman and Harris, 1998). For example, in trying to form the (unknown) joint distribution for the pair .Y 1 , Y 2 /, we can start by decomposing the joint distribution into a product of marginals: where y 1 appears as a conditioning variable in the marginal for y 2 . Then, by assuming the presence of a shared random effect u, with an assumed probability density function f u .u/, and then numerically integrating out the random effect, f 2 .y 2 |y 1 , u/ f 1 .y 1 |u/ f u .u/du, we arrive at a valid joint distribution, albeit one without an analytical expression.
However, as argued in Section 1.1, the numerical integration step can be very computationally taxing, especially for applications with large estimation samples and many explanatory variables. To side-step such a computational headache, the recursive copula approach replaces the assumption of a particular distribution for the random effect, f u .u/, with an assumption about the final form of the joint distribution itself, which, of course, implies some (unknown) distribution for the random effect.
It is not obvious, neither a priori nor ex post, which assumption is stronger: assuming a distribution for the random effect, or assuming an analytical form for the final joint distribution. But the latter offers two advantages. First, it makes estimation more manageable. Second, the availability of many off-the-shelf copula functions enables a researcher to test the robustness of the distributional assumption by easily changing the form of the copula. Such checks of robustness are not as straightforward in shared random-effects settings, because the distribution of the random effect, f u .u/, must be symmetric about zero for the signs of the coefficient estimates to have natural interpretations, which strongly restricts available choices for the form of f u .u/.
Assuming that a random sample {.y 1i , y 2i , z i /} n i=1 is available, the log-likelihood function can be written as where π i = g −1 π .η π i /, μ i = g −1 μ .η μ i /, σ i = g −1 σ .η σ i / and θ i = g −1 θ .η θ i /. Parameter vector δ is given as .β T π , β μ , β σ , β θ / which contains the coefficient vectors of additive predictors η π i , η μ i , η σ i and η θ i . Because of the flexible predictors' structures that are employed here, the use of a classic (unpenalized) optimization algorithm is likely to result in smooth function estimates which may not reflect the true underlying trends in the data (e.g. Wood (2017)). Therefore, we maximize l p .δ/ = l.δ/ − 1 2 δ S λ δ, .6/ where S λ = diag.λ π D π , λ μ D μ , λ σ D σ , λ θ D θ /, with each smoothing parameter vector containing all the smoothing parameters that are related to the corresponding D-component, and the overall λ is defined as .λ π , λ μ , λ σ , λ θ / . To estimate δ and λ, we have extended the efficient and stable trust region algorithm with integrated automatic multiple-smoothing-parameter selection that was proposed by  to the context of copula models with binary and discrete margins. For all the copulas and margins in Tables 1 and 2, the analytical score and Hessian of l p .δ/ that is required for estimation have been tediously derived and verified by using numerical derivatives. It is worth noting that these quantities have been implemented in a modular fashion; hence it will be in principle easy to extend our algorithm to other parametric copulas and marginal distributions that are not considered in this work. As stressed in Section 1, although the implementation of the models proposed exploited many of the functions and routines that are already available in GJRM, extending the software to accommodate the new developments required a large amount of programming work, which resulted in a set of newly introduced functions. At convergence, confidence intervals for any linear and non-linear function of δ can be reliably obtained by using the Bayesian large sample approximation δ ∼ a N{δ, − H p .δ/ −1 }, where H p .δ/ −1 is the model's penalized Hessian and superscript 'a' denotes 'asymptotically distributed as'. Furthermore, it can be proved thatδ − δ 0 = O P .n −1=2 / as n → ∞, where δ 0 is the 'true' parameter vector. Appendices A and B in the on-line supplementary material provide details on the estimation algorithm and confidence intervals, as well as some asymptotic results.

The R GJRM package
The models can be employed via the gjrm() function in the R package GJRM. An example of the syntax is fl <-list(y1˜z1 + s(z2) + s(z3), y2˜y1 + z1 + s(z2), z1 + s(z3), z1 + s(z2)) md <-gjrm(fl, margins = c("probit", "NBI"), BivD = "PL", Model = "B") where fl is a list containing four equations. The first equation is for parameter π of the Bernoulli distribution of the binary endogenous regressor y1 with probit link function (logit and cloglog are also allowed for). The second and third equations are for parameters μ and σ of the discrete distribution of the outcome of interest y2, which in this example is negative binomial (NB) type I ("NBI"). Finally, the fourth equation is for the copula dependence parameter θ. Argument BivD specifies the copula function and Model="B" means that a bivariate model is employed. Symbol s() refers to the smooth function that was mentioned in Section 2.1. The default is bs="tp" (penalized low rank thin plate spline) with k=10 (the number of basis functions) and m=2 (the order of derivatives). However, argument bs can also be set to, for example, cr (penalized cubic regression spline), ps (P-spline) and mrf (Markov random field), to name but a few. Functions such as AIC(), summary() and predict() can be employed in the usual manner. Function post.check() will produce, for the discrete margin, a histogram and normal Q-Q-plot of randomized normalized quantile residuals (Dunn and Smyth, 1996). Building on the results of Kalliovirta (2008), we have been looking into implementing diagnostics based on bivariate randomized normalized quantile residuals; more theoretical work is required here and these may be available in future releases of GJRM. Appendix C in the on-line supplementary material presents the results of a simulation study and supports the empirical effectiveness of the approach proposed.

Case-study
The Affordable Care Act, which aimed to extend health insurance coverage to the previously uninsured, might have led to increased usage. For this, assessing the effect of the law requires an accurate estimate of the effect of insurance on the usage of medical services.
This section applies the proposed method to estimate the effect of insurance status (a binary variable) on doctor visits (a count variable). The method finds statistically significant evidence that insurance is endogenous with respect to usage of doctor services. When endogeneity is taken into account, the effect of insurance is larger than when endogeneity is ignored.

Data
The estimation sample, drawn from the 2010 wave of the Medical Expenditure Panel Survey, was originally used by Zimmer (2018). The sample considers all respondents in the 2-64 years age range who are not covered by any form of federal or state public health insurance programme. The outcome variable records a person's number of visits in January 2010 to a family or general practice physician. (Most private insurance plans in the USA reset deductibles at the start of the calendar year, so focusing on January usage should capture a stronger insurance effect.) Following standard practice in health economics research, we opt to study discrete count measures of usage, rather than total spending, for three reasons. First, count measures of usage link better to economic theory, where 'demand' reflects the number of units that are consumed, not the total spending on those units. Second, discrete measures of usage avoid confusion about 'charges' versus 'spending', which usually differ. Third, with nearly 90% of medical spending in the USA channelled through third-party payers, discrete count measures are likely to suffer from less recollection error. The final estimation sample contains n = 13137 unique individuals.
Sample means appear in Table 3. Insurance appears to correlate with a person's number of visits to a doctor, but that relationship cannot be interpreted as causal, because insured and uninsured people appear to differ along several other dimensions. Specifically, insured subjects are older, on average, than their uninsured counterparts and are more likely to be female, employed and healthy.
As discussed above, the model requires that at least one variable appears in the insurance marginal, but not in the utilization equation. The bottom two rows of Table 3 show two candidates: (a) firm size and (b) an indicator of whether the firm has multiple locations.
(These variables equal 0 for non-employed subjects.) Firm size should influence insurance status because of economies of scale that make it cheaper for larger firms to offer health coverage. Literature on human resource information systems (Chae et al., 2011) argued that larger firms with dedicated human resource staffs should also have more developed infrastructures for enrolling employees into insurance plans (Hendrickson, 2003). Furthermore, because of differences in state level mandates, firms with multiple locations are likely to harmonize their insurance offerings in accordance with the strictest jurisdiction in which they operate. However, these two variables are unlikely to affect usage (directly), especially after controlling for employment status. It also is worth noting that health economics research offers many examples of employer-specific information serving as identifying instruments in contexts that are similar to the one that is explored in this case-study (Dowd et al., 1991;Meer and Rosen, 2004;Deb and Trivedi, 2006;Selden and Hudson, 2006).

Model specification
Each marginal includes as control variables age, gender, health status (fair or poor health) and a proxy for income (employment status). The presence of these socio-economic measures-called 'demand shifters' in economics-in models of healthcare and insurance demand follows from economic theory (Cameron et al., 1988), and, as a consequence, they have become standard in such settings. US law precludes insurance companies from using other potential demand shifters in setting premiums, implying that, economically, such information remains 'unobserved' when people purchase insurance. Consequently, we do not include other demand shifters in either marginal. Rather, the copula dependence parameter will pick up the influence of those factors. The marginal distribution for visits to a doctor and the copula are chosen by using the Akaike information criterion AIC evaluated for all possible combinations of discrete distributions and copulas that are available in GJRM. For the insurance status variable, all the three link functions (probit, logit and cloglog) led to the same conclusions in our case-study; hence we present only the results that were obtained by using the classical probit link. Note that we used only the rotated 90 • and 270 • versions of the Clayton, Gumbel and Joe copulas because the data support the presence of negative dependence between the responses, and therefore it would not make sense to consider rotations allowing for positive dependence. From Table 4, we can see that the Plackett copula with NB II margin appears to offer the best fit. This corroborates other evidence that this distribution often outperforms alternative count data distributions, especially in settings, such as that considered here, where the count variable shows large probability mass at zero (Cameron and Trivedi (2013), pages 84-85). The fact that the Plackett, Gaussian, Frank, Farlie-Gumbel-Morgenstern and Ali-Mikhail-Haq copulas produce similar AIC-values suggests a lack of asymmetric dependence since these copulas' shapes show relatively symmetric dependence patterns in each tail (as opposed to the Clayton, Gumbel and Joe copulas, which exhibit asymmetric dependence).

Results
The main results appear in Table 5. The left-hand panel shows estimates from a simple NB II specification, with no attempt to address the endogeneity of insurance. Focusing first on the control variables, usage appears to increase with age and health problems. Females have more visits to a doctor than their male counterparts, and employed subjects report fewer visits to a doctor than those not working. Finally, turning to the main result of interest, the simple NB II specification shows that insurance correlates with an approximate 62% increase in visits to a doctor. Attempting to correct for possible endogeneity bias, the second panel shows the results for the preferred Plackett copula specification. Most of the control variables exhibit similar estimates to those of the simple NB II set-up, although the effect of being employed becomes larger (more negative). As for insurance, the coefficient estimate suggests that, after correcting for endogeneity, insurance leads to an approximate 134% increase in visits to a doctor, which is substantially larger than the effect that was reported by the simple NB II model. Results for the dependence parameter (converted to Kendall's τ ) appear at the bottom of Table 5. The finding of negative dependence suggests that unobserved traits that induce certain people to enrol in insurance also tend to reduce the usage of doctor services. Such a pattern, which is often called 'favourable selection' by health economists, is similar to what appears in other studies of insurance and healthcare usage (e.g. Finkelstein and McGarry (2005), Pauly (2005) and Cameron and Trivedi (2013)).

Spline estimate
The model proposed uses a smooth function for the age variable in the insurance equation. The graph of the estimated effect is shown in Fig. 1. There appears to be some non-linearity in the estimated shape, although it could be reasonably argued that the functional form looks roughly linear and hence that a parametric effect would adequately describe the effect of age. We would indeed agree with this course of action. Generally, the main point about using flexible specifications is to avoid making a priori, potentially questionable, assumptions. In this case, the relationship turned out to be roughly linear, suggesting that a simpler model specification is acceptable. The plot suggests that, as subjects age, their probabilities of having insurance also increase, and that such an increase seems less marked for individuals in the 39-48 years age range. This finding, which has been long recognized in health economics, is likely to reflect that medical risks increase with age, leading to increases in demand for insurance to mitigate financial uncertainty that is associated with those risks (Arrow, 1963).

Alternative specifications
For comparison, Table 5 reports estimates from a simple control function approach; the first stage uses a linear probability model for insurance; then residuals from the first stage are included in the second stage NB II specification for visits to a doctor. Following other applications of the control function method for binary endogenous variables (Terza et al., 2008), we use the simplest definition of a residual (i.e. the difference between the observed binary outcome and the predicted probability of the outcome) despite aforementioned potential problems with such practice (Wooldridge (2010), page 746). The coefficient of insurance is 1.56, which is slightly larger than, but nonetheless comparable with, the estimate from the copula approach. Further, the coefficient of the first-stage residuals is negative and statistically significant, which confirms the finding of favourable selection. Table 6 also shows estimates from a specification in which the copula dependence parameter is given a regression structure. Such a set-up makes sense if endogeneity, as captured by the dependence term, differs with respect to observable characteristics. This specification does not alter the main substantive conclusions about the link between insurance and physician usage, but, as shown in the rightmost column of Table 6, gender appears to correlate significantly with the magnitude of dependence. (Other explanatory variables did not appear to alter dependence significantly.) Recalling that overall dependence, as reported at the bottom of Table 5, is −0:19, the numbers in Table 6 imply that, once converted to Kendall's τ , dependence among females is −0:24, with 95% interval .−0:34, − 0:14/, whereas dependence among males is −0:14, with 95% interval .−0:25, − 0:02/. Thus, despite the slight overlap in confidence intervals, the results suggest that females exhibit stronger favourable selection than do their male counterparts. This is likely to reflect higher levels of risk aversion among females: a finding that is widely reported in both economics and psychology research (Hartog et al., 2002;Agnew et al., 2008;Eckel and Grossman, 2008;Borghans et al., 2009). Medical research has attempted to link such gender disparities in risk aversion to differences in levels of testosterone (Sapienza et al., 2009). Finally, along the lines of Zimmer (2018), Table 7 shows results from a model in which the marginal for visits to a doctor is specified as Poisson, which Table 4 suggests offers the worst fit of the count distributions under consideration. Estimates from this set-up find a larger insurance effect and a larger (negative) dependence estimate, suggesting that some of the unaccounted-for overdispersion in visits to a doctor is infecting those parameter estimates.

Conclusions
In trying to estimate the causal effect of insurance on healthcare utilization, researchers must confront the likelihood that people seek insurance in anticipation of future healthcare needs.
Ignoring such endogeneity bias might produce misleading estimates. The topic has important policy implications, because health insurance reforms that have been implemented in the USA in recent decades have focused on extending coverage to the previously uninsured. Consequently, understanding the effects of those reforms requires an accurate estimate of the causal effect of insurance on utilization. However, obtaining such an estimate is complicated by the discreteness that is inherent in common measures of healthcare usage.
This paper presents an estimation approach, based on copula functions, that consistently estimates the coefficient of an endogenous regressor in count data settings. The method is general in the types of non-linear data patterns that it can accommodate. Moreover, the statistical significance (or lack thereof) of the copula dependence parameter provides a convenient means to assess the presence of endogeneity. The results of our case-study point to evidence of favourable selection into insurance and, once favourable selection has been taken into account, the effect of insurance is larger than when it is ignored. Specifically, ignoring favourable selection, insurance appears to increase doctor visit usage by 62% but, adjusting for favourable selection, the effect increases to 134%. The results also suggest that females exhibit larger favourable selection than do males.
When health insurance is not required, health economists often worry about adverse selection, which is characterized by relatively unhealthy people being disproportionately drawn towards insurance coverage. The sicker risk pool might require insurance companies to increase premiums, potentially driving out the least sick members of the risk pool, resulting in an even sicker pool. In the extreme, this cycle might keep repeating, resulting in a 'death spiral' in which the insurance market ceases to exist. However, the finding of favourable selection suggests that concerns about death spirals might be misplaced. Many commentators warned that the Affordable Care Act, with its relatively weak mandate that everyone obtain coverage, might encourage death spirals in private insurance markets. The finding of favourable selection into such markets might partially explain the evident lack of such death spirals to date.
As presented in this paper, the marginals and copula are parametrically specified with several choices possible. Furthermore, all the parameters of the joint distribution assumed can be specified as flexible functions of covariates which can help to uncover interesting patterns in the data as highlighted in our empirical application. The numerical computations can be easily and efficiently carried out by using the newly revised R package GJRM. The estimation framework proposed does not require computationally taxing simulation-based estimators as is the case with other simultaneous estimation approaches. Specifically, as elaborated in Section 1.1, the copula approach is computationally more efficient than an approach that is based on shared unobserved random effects since it side-steps the problem of integrating out such effects. However, it could be argued that, because equation (5) requires the evaluation of CDFs, when using elliptical copulas (i.e. the Gaussian and Student t-distributions) integration is still required. We found that this was not problematic for this paper since we deal with the bivariate case, and because there are not redundant calculations. Increasing the number of margins will surely increase the computational cost of the approach based on elliptical copulas. In such a case, a proper exploration of the practical advantages and disadvantages of various alternatives is warranted.
It is worth noting that the methodology that was developed in this paper, although flexible, is fundamentally parametric, and as such it may suffer from the usual potential drawbacks resulting from model misspecifications. Developments where the margins and/or copula function are estimated by using non-parametric techniques (e.g. Kauermann et al. (2013)) were explored. However, we found that these were limited with respect to the inclusion of covariate effects and required large sample sizes to produce reliable results in a regression context. Despite its parametric flavour, the approach proposed enables a large amount of model exploration via the many functional forms that have been included in the newly revised R package GJRM. Specifically, the GJRM package offers a wide menu of marginal distributions and copula functions-far more than are emphasized in this paper. Moreover, the spline capabilities permit a large degree of flexibility in how regressors relate to outcome variables and model parameters. Thus, a researcher can explore a vast number of permutations of functional forms and regressor effects using nothing more than simple changes in computer syntax. Such exploration, although certainly not non-parametric, captures the spirit of non-parametric approaches in that it enables the data to point to meaningful model structures. Moreover, such ease of exploration has the potential to reveal interesting economic patterns that might otherwise remain hidden. As a key example presented here, the apparent difference in selection effects between males and females would have remained undetected if we had not had access to such an easy-to-employ framework for model assembly.