A computationally efficient correlated mixed Probit for credit risk modelling

Mixed Probit models are widely applied in many fields where prediction of a binary response is of interest. Typically, the random effects are assumed to be independent, but this is seldom the case in many real applications. In the credit risk application considered in this paper, random effects are present at the level of industrial sectors and they are expected to be correlated due to inter-firm credit links inducing dependencies in the firms' risk of default. Unfortunately, existing inferential procedures for correlated mixed Probit models are computationally very intensive even for a moderate number of random effects. Borrowing from the literature on large network inference, we propose an efficient Expectation-Maximization algorithm for unconstrained and penalised likelihood estimation and derive the asymptotic standard errors of the estimates. An extensive simulation study shows that the proposed approach enjoys substantial computational gains relative to standard Monte Carlo approaches, while still providing accurate parameter estimates. Using data on nearly 64,000 accounts for small and medium-sized enterprises in the United Kingdom in 2013 across 14 industrial sectors, we find that accounting for network effects via a correlated mixed Probit model significantly increases the default prediction power of the model compared with conventional default prediction models, making efficient inferential procedures for these models particularly useful in this field.


Introduction
Discrete choice models with correlated group-specific random effects have wide applicability and practical importance in economics and the social sciences, as they can accommodate unobserved heterogeneity, overdispersion and intracluster as well as intercluster correlation across binary outcomes. In this paper, we consider the prediction of a firm's risk of default, whereby group random effects at the level of industrial sectors are to be expected, and, at the same time, dependences between and within the industrial sectors are also to be expected because of interfirm credit links.
Unfortunately, the presence of correlated random effects in mixed models poses substantial computational challenges, with maximum likelihood (ML) estimation typically requiring the evaluation of a high dimensional integral. To overcome these numerical difficulties, various methods have been proposed in the literature that approximate the likelihood by Gauss-Hermite quadrature or Monte Carlo integration and then maximize it by either Newton-Raphson or expectation-maximization (EM) algorithms (Breslow and Clayton, 1993; Schilling and Bock, 2005). Despite achieving a computational gain, these methods can still be applied only in the presence of a limited number of groups, because the number of evaluation points in the Gauss-Hermite quadrature increases exponentially with the number of random effects. In addition, these approximate ML estimates have been shown to be inconsistent under various conditions, with an asymptotic bias that can be severe if the variance components are not small (Breslow and Lin, 1995).
An alternative, widely used, approach for estimating mixed models for binary variables combines Monte Carlo integration with various EM algorithms, leading to the so-called Monte Carlo EM algorithm (see, among others, Ashford and Sowden (1970), Chib and Greenberg (1998), McCulloch (1997) and Gueorguieva and Agresti (2001)). For the case of a mixed probit model with independent random effects, McCulloch (1994) proposed Monte Carlo versions of the EM algorithm for ML estimation based on Gibbs sampling. This approach has been extended by McCulloch (1997) to the more general case of generalized linear mixed models, by considering a Metropolis-Hastings algorithm at each E-step of the ML estimation. Similarly, for the case of a mixed probit model with correlated random effects, Chan and Kuk (1997) proposed an EM algorithm where the E-step is made feasible by Gibbs sampling. The approach proposed, however, is computationally very intensive, as it requires sampling from a multivariate truncated normal distribution. To deal with this problem, Tan et al. (2007) proposed a non-iterative importance sampling approach to evaluate the first- and second-order moments of a truncated multivariate normal distribution associated with the Monte Carlo EM algorithm. An alternative, direct sampling-based, EM algorithm was advanced by An and Bentler (2012), who proposed to draw random samples from the prior Gaussian distribution of the random effects. This is computationally easier than sampling from the posterior distribution, but at the expense of a higher Monte Carlo error. One limitation of the above Monte Carlo EM algorithms is that, by combining Monte Carlo simulation with iterative procedures, they are still computationally very expensive. The estimation involved in the E-step of the Monte Carlo EM algorithm can require a prohibitively large amount of time for a large number of statistical units and even a moderate number of random effects.
In this paper, motivated by a large credit risk application, we propose an EM algorithm for estimation of a mixed probit model with correlated random effects that can be adopted for estimation and prediction from very large data sets and a large number of random effects. The algorithm relies on efficient approximations of conditional expectations that simplify the calculation of the moments of a truncated normal distribution and avoid computationally demanding sampling methods. Similar approximations have been adopted in the context of graphical models for ordinal (Guo et al., 2015; Behrouzi and Wit, 2019) and censored (Augugliaro et al., 2018) data, but they have not been used in a regression context before. Similarly to those approaches, we also propose a penalized version of the likelihood estimator, by applying the graphical lasso approach (Friedman et al., 2008) within the proposed EM algorithm. Beyond point estimation, standard errors of ML estimates in the context of mixed effects models are also typically obtained by time-consuming resampling approaches. In this paper we exploit the work of Louis (1982) to derive the observed Fisher information matrix of our proposed mixed probit model and thus to obtain the asymptotic standard errors of the estimates. In doing this, we adopt results by Horrace (2015) to calculate the third and fourth moments of univariate truncated normal distributions which appear in the observed Fisher information matrix. Our paper provides several contributions to the existing literature. First, we propose an extension of the literature on non-linear mixed models to the case of correlated random effects, offering an inferential procedure that enables estimation of the unknown parameters and associated standard errors even in the presence of very large samples. In doing this, we investigate ways of overcoming serious computational difficulties that often arise in regression models with correlated binary responses.
We also show how penalized inferential procedures can be applied under this framework, enabling us to cover the case where the number of random effects exceeds the number of observations. An extensive simulation study assessing the properties and computational efficiency of our inferential procedure shows good performance of the proposed approach compared with existing approaches. Using data on around 64,000 accounts of unlisted small and medium-sized enterprises (SMEs) based in the UK and observed in the year 2013, we find that incorporating interfirm network dependences in the form of correlated random effects increases the default prediction power of the credit risk model compared with conventional models. The remainder of the paper is structured as follows. Section 2 describes the empirical application on credit risk which motivates this study. Section 3 introduces our mixed graphical probit model and describes the EM algorithm for parameter estimation, with the proposed efficient approximations of the conditional expectations, the inference under penalized likelihood and the derivation of asymptotic standard errors. Section 4 carries out an extensive simulation study on the method, and Section 5 describes the results of the proposed approach on real data. Finally, Section 6 provides some concluding remarks.

Motivating example: credit risk modelling of small and medium-sized enterprises
There is nowadays considerable interest in creating default prediction models for SMEs. Academic research into failure prediction has focused almost exclusively on large companies, i.e. those which are listed on, and priced by, the market, proposing a wide range of models and methods to assess and quantify their risk of default. In contrast, there has been a relatively small number of prior academic studies examining default prediction and credit scoring models with reference to small, private businesses, mostly because of the difficulty in obtaining sufficient and good quality data in these contexts. These models are likely to be different from those used for large corporates. For this reason, the recent Basel Accords are now directing the international credit system to pay closer attention to measuring and managing credit risk of SMEs (Sabato, 2010).
When modelling credit risk for SMEs, an important feature to be considered is the fact that companies are not simply independent agents competing for customers on markets. They are linked by supply-customer relationships. Firms interact with each other because they exchange items of value, such as information, goods, services and money. For example, the outputs of some firms (subcontractors) are input for some other firms. In addition, some firms may extend trade credit to other firms, thus creating some sort of interfirm credit links (Battiston et al., 2007). Interdependence between firms' default can also arise because they share part of the management team and hence are subject to similar investment decisions, or because firms react similarly to external shocks such as a rise in the interest rate (Andrews, 2005). Under this framework, the failure of a firm is likely to increase the probability of failure of connected firms, giving rise to clustered fluctuations in the number of failed firms.
Despite the importance of interfirm links in determining firms' performances, only a few studies have looked at the role of interaction in determining firms' default and clusters of default, with the majority of these studies focusing on identifying the conditions under which local failures can result in bankruptcies across the network (Delli Gatti et al., 2006), or exploring whether firms that issue more trade credit are more likely to experience debtor failure (Jacobson et al., 2013). Yet fewer studies have considered incorporating information on firms' interdependences in a default prediction model. Among these, Barro and Basso (2010) have proposed a model of contagion that associates the economic relationship of sectors of the economy and the geographical proximity of each pair of firms in a network of firms, whereas Barreto and Artes (2013) have developed a measure of local risk of default using ordinary kriging from data on 9 million Brazilian SMEs observed in 2010. After including this measure as an additional explanatory variable in a logistic credit scoring model, they showed that the performance of the model improved considerably.
It is well known that the financial performance of companies is in part driven by sector- and area-specific attributes, linked for example to heterogeneity across industries in accounting policies or local trends in demand (see, for example, Kukuk and Ronnberg (2013)). For this reason, mixed discrete choice models have been widely adopted to predict firm financial distress for large corporations (see, among others, Jones and Hensher (2004) and Kukuk and Ronnberg (2013)), with few studies also specific to SMEs (see, for example, Alfò et al. (2005)). Differently from this literature, and considering the importance of interfirm dependences discussed above, in this paper we allow group effects to be correlated by assigning them a non-diagonal covariance matrix. Under this framework, the dependence relationship of the binary outcomes (default) is induced by the underlying Gaussian graphical model on the random effects. In particular, we assume that the risk of default for one company follows a probit regression specification with correlated group random effects, where groups are given by all companies operating in the same sector of economic activity and located in the same region.
We exploit a rich data set from a large financial institution covering around 64,000 accounts of unlisted firms based in the UK and observed in the year 2013. These are companies that have no more than 250 employees, a turnover of less than £25.9 million and a balance sheet total of no more than £12.9 million. In line with other studies, we define failure as entry into liquidation, administration or receivership. The accounts analysed for failed companies are the last set of accounts filed in the year before insolvency. The companies are spread over a total of 59 geographical areas, defined using the 'nomenclature des unités territoriales statistiques', level 3, classification, and across 13 broad sectors (divisions) of economic activity. In our model, the sectors will appear as random effects, whereas the geographical areas act as the sampling units.
The data set contains a set of financial variables extracted from the accounts of firms, as well as non-financial information, that are often included in conventional default prediction models (see, among others, Altman and Sabato (2007), Altman et al. (2010), Carling et al. (2007), Campbell et al. (2008) and Jacobson et al. (2013)). In terms of firm-specific financial variables, we include a set of financial ratios that cover the areas of profitability, liquidity, leverage, coverage and activity (Altman and Sabato, 2007). Profitability is the ability of the firm to generate sufficient profits or returns, liquidity measures the ability of the firm to meet its short-term obligations, leverage refers to the relative amount of debt and other obligations of the firm, coverage is the risk that is inherent in lending to the business in the long term and activity is the level of efficiency of a business. As for the non-financial indicators, we consider variables that are linked to the age and size of the companies. We expect a higher risk of default for newly formed companies that decreases with the age of the company, and that is particularly high in the years immediately after an initial 'honeymoon period' of around 2 years. Finally, we have matched information on the postal district of the trading address with data on latitude and longitude and other geographical information extracted from the UK Office for National Statistics, to calculate covariates at the aggregated level and to account for systematic risk. In particular, we include the 'nomenclature des unités territoriales statistiques', level 3, gross domestic product, as a proxy for the economic conditions of the area where the company operates. Table 1 lists the financial ratios that are included in our analysis grouped according to the financial and the non-financial indicators, including company characteristics and aggregate variables. 
Table 2 provides a set of descriptive statistics for the variables that are included in our model, for failed and non-failed companies. As expected, companies that failed have on average worse leverage and liquidity indicators than firms that did not fail; they are smaller in size and younger and more frequently fall in the age risk group. It is interesting to observe that both trade debt and trade credit ratios have higher values for defaulted companies. This result is supported by the literature on trade credit which shows evidence that financially distressed small companies not only have higher levels of trade debt supplied to customers but also of trade credit obtained from suppliers (Carbó-Valverde et al., 2016).
In the next section, we formalize the proposed mixed probit model with correlated random effects and describe an inferential procedure that is computationally efficient for data such as those described in this section, for which existing mixed probit models are prohibitively slow.

Efficient mixed probit model with correlated random effects
The model

Consider a sample of $N_r$ companies in region $r$, with $r = 1, 2, \ldots, R$. Let $y_{ir}$ be the dichotomous variable equal to 1 when company $i$ in region $r$ defaults. Let $G$ be the number of industrial sectors. Using the latent response formulation, we assume that $y_{ir} = 1$ if and only if the latent variable $y^*_{ir} > 0$, where $y^*_{ir}$ follows the Gaussian mixed model
$$y^*_{ir} = x_{ir}' \beta + z_{ir}' u_r + \varepsilon_{ir}, \qquad (3.1)$$
where $x_{ir}$ is a $K$-dimensional vector of explanatory variables, $\beta$ is a $K$-dimensional vector of unknown parameters, $u_r = (u_{1r}, u_{2r}, \ldots, u_{Gr})'$ is a $G$-dimensional vector of Gaussian random effects, $z_{ir}$ is a $G$-dimensional vector of (known) loadings, with all entries equal to 0 except for a 1 in the entry corresponding to the sector associated with observation $i$, and $\varepsilon_{ir}$ are Gaussian random errors. We assume that $u_r$ and $\varepsilon_{ir}$ satisfy the following conditions for all $r$:
$$E(\varepsilon_{ir}) = 0, \quad E(\varepsilon_{ir}^2) = 1, \quad i = 1, 2, \ldots, N_r,$$
$$E(\varepsilon_{ir}\varepsilon_{js}) = 0 \quad \text{unless } i = j \text{ and } r = s, \quad r, s = 1, 2, \ldots, R,$$
$$u_r \sim N(0, \Sigma_G), \quad E(u_r \varepsilon_{ir}) = 0,$$
where $\Sigma_G$ is a positive definite matrix with $\sigma_{gh}$ the $(g, h)$ off-diagonal element and $\sigma^2_g$ the $g$th diagonal element. In stacked form, model (3.1) can be written as
$$y^*_r = X_r \beta + Z_r u_r + \varepsilon_r,$$
where $y^*_r = (y^*_{1r}, y^*_{2r}, \ldots, y^*_{N_r,r})'$, $X_r = (x_{1r}, x_{2r}, \ldots, x_{N_r,r})'$, $\varepsilon_r = (\varepsilon_{1r}, \varepsilon_{2r}, \ldots, \varepsilon_{N_r,r})'$ and $Z_r$ is the $N_r \times G$ matrix with rows $z_{ir}'$. In addition, $y^*_r$ has covariance
$$\Sigma_r = Z_r \Sigma_G Z_r' + I_{N_r}. \qquad (3.2)$$
The model above allows for group effects that vary across $R$ and $G$, although dependences are allowed only across the $G$ dimension.
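To fix ideas, the data-generating process in (3.1)-(3.2) can be sketched as a minimal simulation. This is an illustrative sketch only: the dimensions, seed and parameter values are hypothetical, and numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
R, N, G, K = 4, 50, 3, 2                      # regions, firms per region, sectors, covariates
beta = np.array([0.5, -1.0])
Sigma_G = np.array([[1.0, 0.4, 0.0],          # non-diagonal: correlated sector effects
                    [0.4, 1.0, 0.3],
                    [0.0, 0.3, 1.0]])

y, X, Z = [], [], []
for r in range(R):
    Xr = rng.normal(size=(N, K))
    sectors = rng.integers(0, G, size=N)
    Zr = np.eye(G)[sectors]                   # loading z_ir: one-hot sector indicator
    ur = rng.multivariate_normal(np.zeros(G), Sigma_G)
    ystar = Xr @ beta + Zr @ ur + rng.normal(size=N)  # latent y*_ir with Var(eps) = 1
    y.append((ystar > 0).astype(int))         # default indicator y_ir = 1{y*_ir > 0}
    X.append(Xr)
    Z.append(Zr)

# Latent covariance implied by (3.2): Sigma_r = Z_r Sigma_G Z_r' + I_{N_r}
Sigma_r = Z[0] @ Sigma_G @ Z[0].T + np.eye(N)
```

Note that the binary outcomes within a region are correlated only through the shared sector effects $u_r$, which is what the covariance $\Sigma_r$ encodes.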

Inference
The interest is in estimating the regression parameters $\beta$, as well as the dependence structure among the $G$ groups, given by the elements of the precision matrix $\Phi_G = \Sigma_G^{-1}$. As also remarked in the graphical modelling literature, estimating the elements of the precision matrix enables us to assess whether any two units are conditionally independent given all other units (Lauritzen, 1996), thus providing a network of dependences at the level of the random effects. Accordingly, let $\vartheta = (\beta', \mathrm{vech}(\Phi_G)')'$ be the vector of unknown parameters in the above model, and note that the observed data $y = (y_1', y_2', \ldots, y_R')'$ are a function of the unobserved variables $y^* = (y_1^{*\prime}, y_2^{*\prime}, \ldots, y_R^{*\prime})'$ and $u = (u_1', u_2', \ldots, u_R')'$. The log-likelihood of the observed data is given by
$$l(\vartheta) = \log \int\!\!\int f_{y,y^*,u}(y, y^*, u \mid \vartheta)\, dy^*\, du. \qquad (3.3)$$
The integral in equation (3.3) makes it difficult to maximize $l(\vartheta)$ directly, but an EM algorithm for computing ML estimates can be adopted, by maximizing the conditional expectation of the log-likelihood function for the complete data given the observed data $y$. Treating $(y, y^*, u)$ as the complete data, and $y$ as the incomplete data, we have
$$l(\vartheta) = \log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\} - \log\{f_{y^*,u\mid y}(y^*, u \mid y, \vartheta)\}, \qquad (3.4)$$
where $\log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\}$ is the log-likelihood function for the complete data, namely
$$\log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\} = \sum_{r=1}^R \Big[ \log f(y_r \mid y^*_r) - \tfrac{1}{2}\,\|y^*_r - X_r\beta - Z_r u_r\|^2 + \tfrac{1}{2}\log|\Phi_G| - \tfrac{1}{2}\, u_r'\Phi_G u_r \Big] + \mathrm{const},$$
where $f(y_r \mid y^*_r)$ is degenerate, equal to 1 if each $y_{ir}$ is consistent with the sign of $y^*_{ir}$ and 0 otherwise. Taking conditional expectations given $y$ on both sides of equation (3.4) yields
$$l(\vartheta) = Q(\vartheta \mid \vartheta') - H(\vartheta \mid \vartheta'), \qquad (3.5)$$
where $Q(\vartheta \mid \vartheta') = E[\log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\} \mid y, \vartheta']$ and $H(\vartheta \mid \vartheta') = E[\log\{f_{y^*,u\mid y}(y^*, u \mid y, \vartheta)\} \mid y, \vartheta']$. The $Q$-function is the main ingredient of the EM algorithm. Let $\hat\vartheta^{(m)}$ denote the estimate of $\vartheta$ after the $m$th iteration. Then the E- and M-steps of the $(m+1)$th iteration are respectively given by the computation of $Q(\vartheta \mid \hat\vartheta^{(m)})$ and its maximization,
$$\hat\vartheta^{(m+1)} = \arg\max_\vartheta Q(\vartheta \mid \hat\vartheta^{(m)}). \qquad (3.6)$$
These steps are iterated until convergence is achieved. Looking further at the optimization in the M-step, the first-order conditions for $\beta$ and $\Phi_G$ lead to the updates
$$\hat\beta = \Big(\sum_{r=1}^R X_r' X_r\Big)^{-1} \sum_{r=1}^R X_r' \{E(y^*_r \mid y_r) - Z_r E(u_r \mid y_r)\}, \qquad (3.7)$$
$$\hat\Phi_G = \Big\{\frac{1}{R} \sum_{r=1}^R E(u_r u_r' \mid y_r)\Big\}^{-1}. \qquad (3.8)$$
Hence, the M-step alternates between estimation of $\beta$ by using equation (3.7) and estimation of $\Phi_G$ by using equation (3.8).
At each step, the new estimate of $\Phi_G$ uses the previous value of $\hat\beta$, and the new value of $\hat\Phi_G$ is in turn used to update $\hat\beta$. Meng and Rubin (1993) showed that iterating between these two equations within the EM algorithm yields convergence to the ML estimates. However, the above expressions depend on the unknown quantities $E(u_r \mid y_r)$ and $E(u_r u_r' \mid y_r)$. In what follows, we propose an approximation of the conditional expectations $E(u_r \mid y_r)$ and $E(u_r u_r' \mid y_r)$ and show how it can be adopted to simplify the EM algorithm.
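As a hedged sketch of the alternating M-step updates (3.7) and (3.8): given the conditional moments produced by the E-step (here replaced by arbitrary placeholder values, since the E-step itself is described later), $\beta$ is updated by regressing the latent residuals on the covariates and $\Phi_G$ by inverting the average conditional second moment of the random effects. All names and dimensions below are hypothetical; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
R, N, G, K = 5, 30, 2, 3
X = [rng.normal(size=(N, K)) for _ in range(R)]
Z = [np.eye(G)[rng.integers(0, G, N)] for _ in range(R)]
# Placeholder stand-ins for the E-step output, NOT actual conditional moments:
Ey = [rng.normal(size=N) for _ in range(R)]                      # E(y*_r | y_r)
Eu = [rng.normal(size=G) for _ in range(R)]                      # E(u_r | y_r)
Euu = [np.eye(G) + 0.05 * abs(rng.normal()) * np.ones((G, G))    # E(u_r u_r' | y_r)
       for _ in range(R)]                                        # (kept positive definite)

# beta update (3.7): regress E(y*_r|y_r) - Z_r E(u_r|y_r) on X_r, pooled over regions
XtX = sum(Xr.T @ Xr for Xr in X)
Xty = sum(Xr.T @ (ey - Zr @ eu) for Xr, Zr, ey, eu in zip(X, Z, Ey, Eu))
beta_new = np.linalg.solve(XtX, Xty)

# Phi_G update (3.8): inverse of the average conditional second moment
Phi_new = np.linalg.inv(sum(Euu) / R)
```

With real E-step moments plugged in, these two closed-form updates are exactly what the M-step alternates between.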

Approximating conditional expectations
Using the law of iterated expectations and the theorem on conditional normal distributions, $E(u_r \mid y_r)$ and $E(u_r u_r' \mid y_r)$ are typically calculated as
$$E(u_r \mid y_r) = \Sigma_G Z_r' \Sigma_r^{-1}\{E(y^*_r \mid y_r) - X_r\beta\}, \qquad (3.9)$$
$$E(u_r u_r' \mid y_r) = \Sigma_G - \Sigma_G Z_r' \Sigma_r^{-1} Z_r \Sigma_G + \Sigma_G Z_r' \Sigma_r^{-1}\, E\{(y^*_r - X_r\beta)(y^*_r - X_r\beta)' \mid y_r\}\, \Sigma_r^{-1} Z_r \Sigma_G, \qquad (3.10)$$
following Appendix B and Chan and Kuk (1997). From these expressions it is clear that $E(u_r \mid y_r)$ and $E(u_r u_r' \mid y_r)$ depend on the first two moments of a multivariate truncated normal distribution, namely $E(y^*_r \mid y_r)$ and $E\{(y^*_r - X_r\beta)(y^*_r - X_r\beta)' \mid y_r\}$. Some researchers have proposed algorithms for direct estimation or approximation of moments of multivariate truncated normal distributions (see, among others, Tallis (1961), Lee (1979) and Leppard and Tallis (1989)). Others have proposed a Markov chain Monte Carlo approach that consists of randomly generating a sequence of samples from the multivariate truncated normal distribution and then approximating the first two moments by the empirical conditional moments from these samples (Kotecha and Djuric, 1999; Chan and Kuk, 1997; Chib and Greenberg, 1998; Abegaz and Wit, 2015). Although this method is faster than direct estimation of the moments, it is still computationally very demanding for large-scale problems. A recent strand of literature has proposed approximating the first and second moments of a multivariate truncated normal distribution through an iterative procedure within the M-step (Guo et al., 2015; Behrouzi and Wit, 2019; Augugliaro et al., 2018), leading to a computationally much faster approach than any previous method. Exploiting this literature, we consider a mean field approximation of the second moments, namely, for $i \neq j$ and for all $r = 1, 2, \ldots, R$,
$$E\{(y^*_{ir} - \beta' x_{ir})(y^*_{jr} - \beta' x_{jr}) \mid y_r\} \approx E\{(y^*_{ir} - \beta' x_{ir}) \mid y_r\}\, E\{(y^*_{jr} - \beta' x_{jr}) \mid y_r\}. \qquad (3.11)$$
Hence, conditional on the observed values in $y_r$ and the regressors $X_r$, $y^*_{ir}$ and $y^*_{jr}$ become decoupled.
In Section 4 we shall compare the good properties of our proposed estimator with those of estimators based on the slower Monte Carlo EM procedures, which do not make the above approximation. Under approximation (3.11), to compute equations (3.9) and (3.10) we only need to find $E(y^*_{ir} \mid y_r)$ and $E(y^{*2}_{ir} \mid y_r)$. For this, first write the first and second conditional moments as
$$E(y^*_{ir} \mid y_r) = E\{E(y^*_{ir} \mid y^*_{-i,r}, y_{ir}) \mid y_r\}, \qquad (3.12)$$
$$E(y^{*2}_{ir} \mid y_r) = E\{E(y^{*2}_{ir} \mid y^*_{-i,r}, y_{ir}) \mid y_r\}, \qquad (3.13)$$
where $y^*_{-i,r} = (y^*_{1r}, \ldots, y^*_{i-1,r}, y^*_{i+1,r}, \ldots, y^*_{N_r,r})'$. Noting that $y^*_r - X_r\beta$ is a vector of jointly normal variables with mean 0 and covariance $\Sigma_r$, and exploiting the theorem on conditional normal distributions, we obtain that the conditional distribution of $y^*_{ir} - \beta' x_{ir}$ given $y^*_{-i,r}$ has mean and variance respectively given by
$$\mu_{ir} = \Sigma_{r,i,-i}\, \Sigma_{r,-i,-i}^{-1} (y^*_{-i,r} - X_{-i,r}\beta), \qquad \sigma^2_{ir} = \Sigma_{r,i,i} - \Sigma_{r,i,-i}\, \Sigma_{r,-i,-i}^{-1}\, \Sigma_{r,-i,i},$$
where $\Sigma_{r,i,-i}$ is the $i$th row of $\Sigma_r$ with its $i$th element removed, $\Sigma_{r,-i,-i}$ is $\Sigma_r$ with its $i$th row and column removed and $\Sigma_{r,i,i}$ is the $(i, i)$th element of $\Sigma_r$. Replacing these expressions in the equations for the mean and second moment of truncated normal distributions (see Appendix A), we obtain expressions (3.14) and (3.15) for the first conditional moment (3.12) and the second conditional moment (3.13) respectively, where $\rho_{1,ir}$ and $\rho_{2,ir}$ are defined in Appendix A. These equations show that there is a recursive relationship between the elements of $E(y^*_{ir} - \beta' x_{ir} \mid y_r)$ and $E\{(y^*_r - X_r\beta)(y^*_r - X_r\beta)' \mid y_r\}$, and they suggest an iterative procedure for estimating these quantities. More specifically, let $E(y^*_{jr} - \beta' x_{jr} \mid y_r)^{(h)}$ and $E\{(y^*_{jr} - \beta' x_{jr})^2 \mid y_r\}^{(h)}$, for all $j$, be the estimates of $E(y^*_{jr} - \beta' x_{jr} \mid y_r)$ and $E\{(y^*_{jr} - \beta' x_{jr})^2 \mid y_r\}$ respectively, at the $h$th stage in the M-step. We plug these into the right-hand side of equations (3.14) and (3.15) to compute new values of $E(y^*_{ir} - \beta' x_{ir} \mid y_r)$ and $E\{(y^*_{ir} - \beta' x_{ir})^2 \mid y_r\}$ (inner iterations). After convergence has been reached, let $E(y^*_{ir} - \beta' x_{ir} \mid y_r)^{*}$ and $E\{(y^*_{ir} - \beta' x_{ir})^2 \mid y_r\}^{*}$ be the final estimates.
We plug these into equation (3.7) to obtain a new estimate of $\beta$, and into equation (3.10), which enters equation (3.8), for the estimation of $\Phi_G$ (outer iterations). With the new $\beta$ and $\Phi_G$, we recompute $E(y^*_{ir} - \beta' x_{ir} \mid y_r)$ and $E\{(y^*_{ir} - \beta' x_{ir})^2 \mid y_r\}$, ready for another round of inner iterations. Note, however, that full convergence of the inner iterations is not necessary; in fact, they can be reduced to a single round of computation.
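The building block of these inner iterations is the pair of one-sided truncated-normal moments: given $y_{ir}$, the latent $y^*_{ir}$ is a univariate normal truncated either above or below 0. The sketch below uses the standard closed forms for the mean and second moment of a one-sided truncated normal (not the paper's Appendix A expressions, which are not reproduced here); the helper name `trunc_moments` is hypothetical, and only the standard library is used.

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def trunc_moments(mu, s, y):
    """First and second moments of N(mu, s^2) truncated to (0, inf) when y = 1,
    or to (-inf, 0] when y = 0; lam is the (signed) inverse Mills ratio."""
    a = -mu / s                                   # standardized truncation point 0
    lam = phi(a) / (1.0 - Phi(a)) if y == 1 else -phi(a) / Phi(a)
    m1 = mu + s * lam
    m2 = mu ** 2 + s ** 2 + s * lam * mu          # second moment, truncation at 0
    return m1, m2

m1, m2 = trunc_moments(0.0, 1.0, 1)   # standard normal given y* > 0
# m1 = sqrt(2/pi) ≈ 0.7979, m2 = 1.0
```

Each inner iteration evaluates these two formulas once per unit, with $\mu_{ir}$ and $\sigma_{ir}$ taken from the current conditional-normal decomposition, which is why no sampling is needed.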
According to the iterative procedure just described, the matrix inverse $\Sigma_{r,-i,-i}^{-1}$, for $i = 1, 2, \ldots, N_r$, needs to be computed at each iteration of the EM procedure. Although this matrix can be rather large, given that it has size $(N_r - 1) \times (N_r - 1)$, a simplified expression can be obtained by noting that $\Sigma_{r,-i,-i} = Z_{r,-i} \Sigma_G Z_{r,-i}' + I_{N_r - 1}$ and, using the matrix inversion lemma,
$$\Sigma_{r,-i,-i}^{-1} = I_{N_r - 1} - Z_{r,-i}\,(\Sigma_G^{-1} + Z_{r,-i}' Z_{r,-i})^{-1}\, Z_{r,-i}'.$$
Hence, computing $\Sigma_{r,-i,-i}^{-1}$ involves inverting only $G$-dimensional matrices. This shows the power of using a mixed model approach, whereby dependences are captured in the lower dimensional space of the random effects.
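The matrix-inversion-lemma shortcut can be verified numerically: the large $(N_r - 1)$-dimensional inverse coincides with the expression that only inverts $G \times G$ matrices. The dimensions below are hypothetical and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, G = 40, 3                                  # N_r - 1 units, G sectors
Z = np.eye(G)[rng.integers(0, G, N1)]          # sector-indicator loadings
A = rng.normal(size=(G, G))
Sigma_G = A @ A.T + np.eye(G)                  # arbitrary positive definite Sigma_G

# Direct (N_r - 1)-dimensional inverse
direct = np.linalg.inv(Z @ Sigma_G @ Z.T + np.eye(N1))

# Matrix inversion lemma: only G x G inverses are needed
woodbury = np.eye(N1) - Z @ np.linalg.inv(np.linalg.inv(Sigma_G) + Z.T @ Z) @ Z.T

assert np.allclose(direct, woodbury)
```

For $G \ll N_r$ the second route reduces the per-unit cost of the inner iterations from $O(N_r^3)$ to operations involving only $G$-dimensional inverses.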
In addition, when $N_r$ is particularly large, as in our real application, we found it computationally beneficial, and not detrimental to the resulting estimators, to replace the expectations (3.9)-(3.10) with group averages of the expectations of the latent variables, namely
$$\hat{E}(u_{gr} \mid y_r) = \frac{1}{m_{gr}} \sum_{i \in g} E(y^*_{ir} - \beta' x_{ir} \mid y_r), \qquad (3.16)$$
and analogously for the second moments (3.17), where $m_{gr}$ is the number of units belonging to group $g$ and located in region $r$ and $\sum_{i \in g}$ indicates the sum over all units belonging to group $g$ and located in region $r$. This estimator is widely adopted to proxy random effects (Hsiao, 2003), including in the context of cross-sectionally dependent panels (Moscone et al., 2017).
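The group-average proxy described above amounts to a simple per-sector mean of the latent residual expectations. A minimal sketch, with a hypothetical helper name and placeholder inputs standing in for the E-step output:

```python
def group_average_effects(resid_exp, sectors, G):
    """Average latent residual expectations over the units of each sector.

    resid_exp[i] stands in for E(y*_ir - beta'x_ir | y_r); sectors[i] in 0..G-1.
    Returns one proxy value per sector (0.0 for empty sectors)."""
    sums, counts = [0.0] * G, [0] * G
    for e, g in zip(resid_exp, sectors):
        sums[g] += e
        counts[g] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# three units in sector 0, one in sector 1
out = group_average_effects([0.2, 0.4, 0.6, 1.0], [0, 0, 0, 1], 2)
# out ≈ [0.4, 1.0]
```

The computational gain comes from replacing the $N_r$-dimensional manipulations in (3.9)-(3.10) with one pass over the units per group.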

Finally, further computational efficiency can be achieved by applying penalized ML estimation, as described in the next subsection.

Penalized maximum likelihood estimation
ML estimation of $\Phi_G$ is feasible for $R \gg G + K$, though it can become unstable as $R$ approaches $G + K$. For $R < G + K$, ML estimation is not feasible, and further computational challenges arise in the high dimensional case of $R \ll G$. In addition, the network of dependences is often expected to be sparse, and the recovery of its structure is of interest. In our particular application, although the primary objective is the prediction of default, it is also of interest, and possibly motivating of further decision making, to identify which sectors of the economy are most strongly connected with each other in their risk of default.
To address these challenges, we add an $L_1$-norm penalty term to the log-likelihood of the model and optimize the penalized likelihood
$$l(\vartheta) - \rho_G \|\Phi_G\|_1,$$
where $\rho_G$ is a tuning parameter controlling the degree of sparsity of the underlying network and $\|\cdot\|_1$ is the $L_1$-norm on the off-diagonal entries of the precision matrix. When $\rho_G$ is sufficiently large, some coefficients in $\Phi_G$ are shrunk to 0, resulting in the removal of the corresponding links in the underlying network. Noting that the part of $\log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\}$ that depends on $\Sigma_G^{-1}$ is the log-likelihood of a multivariate normal distribution with empirical covariance $R^{-1}\sum_{r=1}^R E(u_r u_r' \mid y_r)$, and following the same line of reasoning as in Section 3.2, we consider the penalized estimation problem for $\Phi_G$ within the M-step by optimizing
$$\log|\Phi_G| - \mathrm{tr}\Big\{\Phi_G\, \frac{1}{R}\sum_{r=1}^R E(u_r u_r' \mid y_r)\Big\} - \rho_G \|\Phi_G\|_1. \qquad (3.18)$$
Hence, we alternate between estimation of $\beta$ by using equation (3.7) and estimation of $\Phi_G$ by using equation (3.18), for which efficient graphical lasso implementations can be used (Friedman et al., 2008). The regularization parameter $\rho_G$ defines the level of sparsity of the associated network $\hat\Phi_G$. By tuning this parameter, we can explore the full path of solutions, from a disconnected network (large $\rho_G$, corresponding to a mixed model with uncorrelated random effects) to a fully connected network ($\rho_G = 0$, corresponding to the ML estimates). Various information criteria are available in the penalized likelihood literature for the selection of this parameter. These methods are based on the likelihood function of the observed data, which, for our model, is given by equation (3.5). Ibrahim et al. (2008), however, suggested using only the Q-function in equation (3.6) for the calculation of the likelihood. This is more efficient, as the Q-function is a direct output of the EM algorithm, whereas the H-function would need to be calculated separately. Augugliaro et al. (2018) showed how, for a similar model, this approximate information criterion behaves well when compared with that based on the full likelihood. Alternatively, considering that prediction of default is a classification problem, other criteria can be used that are more in line with this objective of the study. In the real application, we shall experiment both with the selection of the $\rho_G$ that minimizes the extended Bayesian information criterion eBIC (Foygel and Drton, 2010) and with the $\rho_G$ that maximizes the area under the receiver operating characteristic (ROC) curve, AUC, on a test set.
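The AUC criterion used above for tuning $\rho_G$ can be computed without any plotting, via the rank-sum (Mann-Whitney) identity: the AUC is the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A self-contained stdlib sketch (the function name and toy scores are hypothetical):

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    scores: predicted default probabilities or risk scores.
    labels: 1 for defaulted, 0 for non-defaulted accounts.
    Ties between a positive and a negative score count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # perfect ranking -> 1.0
```

In the tuning loop, one would evaluate `auc` on held-out test accounts for each $\rho_G$ on the regularization path and keep the value of $\rho_G$ with the largest AUC.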

Standard errors approximation
Calculating standard errors of the estimates requires knowledge of the information matrix associated with the log-likelihood function of the observed data, known as the observed information matrix. However, this also involves computation of the H-function in equation (3.5), which is not a direct output of the EM iterations. Following Louis (1982), it is possible to compute the observed information matrix by exploiting the complete-data gradient and curvature. In particular, let $B(y \mid \vartheta) = \partial^2 l(\vartheta)/\partial\vartheta\,\partial\vartheta'$ be the matrix of second partial derivatives of the observed-data log-likelihood, and let
$$S(y, y^*, u \mid \vartheta) = \frac{\partial \log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\}}{\partial \vartheta}, \qquad B(y, y^*, u \mid \vartheta) = \frac{\partial^2 \log\{f_{y,y^*,u}(y, y^*, u \mid \vartheta)\}}{\partial \vartheta\, \partial \vartheta'}$$
be the gradient and second derivative matrix of the complete-data log-likelihood respectively. It is possible to show that
$$B(y \mid \vartheta) = E\{B(y, y^*, u \mid \vartheta) \mid y\} + E\{S(y, y^*, u \mid \vartheta)\, S(y, y^*, u \mid \vartheta)' \mid y\} - E\{S(y, y^*, u \mid \vartheta) \mid y\}\, E\{S(y, y^*, u \mid \vartheta) \mid y\}'. \qquad (3.19)$$
Hence, by exploiting the law of iterated expectations as well as approximation (3.11), all terms appearing on the right-hand side of expression (3.19) can be computed efficiently.
In the on-line supplementary material, we provide finite expressions for the elements of B.y|ϑ/.

Estimation of regression coefficients
In a first set of experiments, we assess the performance of our proposed method in estimating the regression coefficients and their standard errors. For this, we replicate 50 times the simulation described above and report the bias and root-mean-squared error (RMSE) of the slope parameter $\beta$, given by $(1/50)\sum_{s=1}^{50}(\hat\beta_s - \beta)$ and $\surd\{(1/50)\sum_{s=1}^{50}(\hat\beta_s - \beta)^2\}$ respectively. We carry out a comparison of the estimator obtained from the approximate EM algorithm proposed in this paper with the Monte Carlo EM estimator of Chan and Kuk (1997). Because of the high computational cost of the Monte Carlo EM approach, we select small values of $G$ for the comparison ($G \leq 25$). We set $R = 200$ and conduct the comparison under ML estimation (i.e. setting $\rho_G = 0$) for varying values of $N_r = N$, the number of observations per region. This comparison is important because the Monte Carlo EM estimator of Chan and Kuk (1997) does not rely on the conditional approximation (3.11). For the same combinations of $N$ and $R$, we also compare the properties and computational time of the mixed graphical probit estimator using equations (3.9) and (3.10) with those of the same estimator based on their approximations (3.16) and (3.17).
Table 3 (simulated data from a mixed graphical probit: bias and RMSE of regression coefficients when estimated by (I) a standard probit with no random effects, (II) a mixed graphical probit using approximation (3.11), (III) a mixed graphical probit using approximation (3.11) and equations (3.16) and (3.17), and (IV) the Monte Carlo EM estimator) reports the results. These show that the three mixed graphical estimators have a small bias and RMSE, and that these decrease as $N$ rises, whereas their performance slightly deteriorates as the number of groups, $G$, increases. In all cases, the estimators from the mixed graphical probit approaches are superior to a conventional probit model with no random effects (column (I)).
Comparing the results in columns (II) and (III), the computational time of the estimator based on equations (3.16) and (3.17) is significantly smaller than that of the estimator based on equations (3.9) and (3.10), thus supporting the use of group averages of conditional expectations to proxy the random effects. The fact that the bias and RMSE of the estimators in columns (II) and (III) are of comparable size to those in column (IV) indicates that approximation (3.11), adopted in both columns (II) and (III), does not significantly affect the properties of our estimators. The same was found for the approximate standard errors calculated by using expression (3.19), whose averages across the 50 replications were overall of comparable size to the RMSE, albeit with some discrepancy at small N. Whereas Table 3 shows little difference between the estimators in terms of bias and RMSE, the difference in computational time between the mixed graphical probit estimators in columns (II) and (III) and the full Monte Carlo EM estimator in column (IV) is striking: the mixed graphical probit carries out one estimation in a few seconds across all experiments, against a computational time that can be as long as a few minutes in the case of the Monte Carlo EM algorithm.

Recovery of the network of dependences under L1-penalization
In a second set of experiments, we assess the performance of our proposed method in recovering the underlying network of dependences. This network is constructed by drawing an edge for each non-zero element of the estimated precision matrix Φ̂_G. For a range of values of N, G and R, we construct the ROC curve across the path of regularization parameters under L1-penalization, i.e. across different levels of sparsity. Denoting by Φ̂_G^ρ the estimate of the precision matrix under tuning parameter ρ, the curve plots the true positive rate, i.e. the percentage of true edges (non-zeros in the true Φ_G) correctly estimated as non-zeros in Φ̂_G^ρ, against the false positive rate, i.e. the percentage of true missing edges (zeros in Φ_G) incorrectly estimated as non-zeros in Φ̂_G^ρ. Fig. 1 plots these curves across the path of regularization parameters, averaged over 100 replications, for different configurations of N, G and R. As in the previous simulation, the performance of the mixed graphical probit estimator improves as N increases for fixed R and G (Fig. 1(a)), whereas it deteriorates as G rises, holding N and R constant (Fig. 1(b)). This result can be explained by the main features of our model: as N increases there are increasingly more observations with which to estimate the unknown parameters β and Φ_G, whereas as G increases there are increasingly more parameters to estimate.
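One point of the edge-recovery ROC curve described above can be computed as in the following sketch (`edge_recovery_rates` is an illustrative helper, not the authors' code; the toy matrices stand in for the true and estimated precision matrices):

```python
import numpy as np

def edge_recovery_rates(theta_true, theta_hat, tol=1e-8):
    """True and false positive rates for off-diagonal support recovery of an
    estimated precision matrix: one point of the ROC curve at a given rho."""
    p = theta_true.shape[0]
    off = ~np.eye(p, dtype=bool)                    # ignore the diagonal
    true_edge = np.abs(theta_true[off]) > tol       # edges of the true graph
    est_edge = np.abs(theta_hat[off]) > tol         # edges of the estimated graph
    tpr = (true_edge & est_edge).sum() / max(true_edge.sum(), 1)
    fpr = (~true_edge & est_edge).sum() / max((~true_edge).sum(), 1)
    return tpr, fpr

# toy 3x3 example: the estimate recovers the one true edge but adds a spurious one
truth = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
est = np.array([[1.0, 0.4, 0.1],
                [0.4, 1.0, 0.0],
                [0.1, 0.0, 1.0]])
tpr, fpr = edge_recovery_rates(truth, est)
```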

Prediction accuracy
In a final set of experiments, we assess the prediction accuracy of the proposed mixed graphical probit as a classification model. For this, we generate a testing sample with the same Monte Carlo design as above and employ the parameters estimated on the training sample to calculate the predicted probabilities P(Y_ir = 1|x_ir) = Φ(β'x_ir + z'_ir û_r), with Φ the standard normal cumulative distribution function and û_r the estimated group random effects calculated by using equation (3.9) (or, for larger problems, its approximation (3.16)). We carry out two sets of experiments: one with R = 200 (the case of large R), where we compute the proposed ML estimator (ρ_G = 0), and one with R = 50, where we compute a penalized version of the estimator. In the latter case, we select the regularization parameter ρ_G as the value for which the estimated sparsity level is closest to the true sparsity level. This is possible only in a simulation setting and ensures that our results do not depend on the specific choice of model selection criterion. Fig. 2 shows the average ROC curves across 50 replications for varying configurations of N and G, under the large and small sample size cases respectively. Each curve plots the percentage of non-zero outcomes correctly predicted as non-zero against the percentage of zeros incorrectly predicted as non-zero, as the classification threshold on the predicted probabilities varies between 0 and 1. Similarly to before, and as expected, we note an improvement in the performance of the mixed graphical probit as N increases and G decreases. The comparison also shows that the use of a densely estimated precision matrix (ρ_G = 0) under a setting of sparsity (P(θ_jk = 0) = 3/G) does not hinder the performance of the classifier in terms of prediction accuracy.
Indeed, the estimation of the regression parameters β is found to be quite stable across different levels of sparsity of the precision matrix, as noted also in the literature on correlated multivariate probit models (Chib and Greenberg, 1998). In contrast, in all cases considered, the performance of the mixed graphical probit is far superior to that of a conventional probit model with no random effects, as shown in Fig. 3 for two representative cases. In the next section, we show how the mixed graphical model can be useful for credit risk prediction and discuss more closely the aspect of model selection (the choice of ρ_G) within that context, where recovering the underlying network of interfirm dependences is of particular interest.
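The predicted probabilities P(Y_ir = 1|x_ir) = Φ(β'x_ir + z'_ir û_r) used in these experiments can be sketched as follows (a minimal Python illustration with toy design matrices and parameter values; `predict_default_prob` and `norm_cdf` are hypothetical helpers, not the authors' implementation):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function Phi."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def predict_default_prob(X, Z, beta, u_hat):
    """P(Y_ir = 1 | x_ir) = Phi(x_ir' beta + z_ir' u_hat) for a mixed probit,
    with u_hat the estimated group random effects."""
    eta = X @ beta + Z @ u_hat                      # linear predictor, one entry per account
    return np.array([norm_cdf(e) for e in eta])

# toy data: two accounts, two covariates, two group random effects
X = np.array([[1.0, 0.5],
              [1.0, -0.5]])
Z = np.array([[1.0, 0.0],                           # first account loads on effect 1
              [0.0, 1.0]])                          # second account loads on effect 2
beta = np.array([-1.0, 2.0])
u_hat = np.array([0.3, -0.3])
p = predict_default_prob(X, Z, beta, u_hat)
```

Thresholding p between 0 and 1 then traces out the classification ROC curve plotted in Fig. 2.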

Credit risk probit model with correlated effects
We employ the proposed approach to estimate a default prediction model for SMEs based on the data described in Section 2. To assess the performance of the classifier, we randomly split the sample into two groups: 40000 companies are used for estimation (the training sample) and the remaining companies are used for testing the prediction accuracy of the model (the hold-out sample). We use the mixed graphical probit model in equation (3.1) and include network dependences at the level of the G = 13 industrial sectors, exploiting the grouping of the data into R = 59 geographical regions. Thus, the model contains 59 × 13 = 767 correlated random effects from a total of 40000 observations in the training data. This means that ML estimation is feasible but prohibitively slow with standard implementations of mixed models; for example, the R function glmer (Bates et al., 2015) with uncorrelated random effects failed to converge. Thanks to the efficient implementation proposed in this paper, we can explore the full path of solutions from a fully connected network (ρ_G = 0; ML estimation) to a disconnected network (large ρ_G, corresponding to a mixed model with uncorrelated random effects with unequal variances). Fig. 4 shows two model selection criteria evaluated across the full path of solutions. In Fig. 4(a), we plot the AUC of classification prediction on the test data for models fitted on the training data under various levels of sparsity. In Fig. 4(b), we plot the eBIC of the models across the path of solutions. Both plots show an optimal point somewhere between a fully connected network (the rightmost value, with 169 non-zeros in the precision matrix) and a disconnected network (the leftmost value, with 13 non-zeros in the diagonal of the precision matrix).
The eBIC appears to favour a sparser model (ρ_G = 0.005, 41 non-zeros in the precision matrix), whereas the AUC, although achieving its maximum at the fully connected network, appears to decline significantly beyond a model with 103 non-zeros in the precision matrix (ρ_G = 0.0006). We take the latter as the optimal model for subsequent analyses. Firstly, we explore the estimated network, which gives an indication of the more connected sectors in the economy. This is plotted in Fig. 5, where a link between two sectors appears when there is a non-zero precision between them. It is interesting to see that the sectors that are more central to the network are real estate, manufacturing and the activities of households as employers, whereas mostly service sectors, in particular 'arts, entertainment and recreation' and 'transportation and storage', are not highly connected.
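The AUC-based selection of ρ_G along the path of solutions can be illustrated as follows (a sketch under assumed inputs: `path` maps each candidate ρ to hold-out predicted probabilities, the AUC is computed via the rank (Mann-Whitney) formula, and all names and numbers are hypothetical):

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formula."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # each positive "beats" a negative if scored higher; ties count a half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_rho(path, test_labels):
    """Pick the tuning parameter with the highest hold-out AUC.
    `path` maps each candidate rho to its test-set predicted probabilities."""
    return max(path, key=lambda rho: auc(path[rho], test_labels))

# toy path of three candidate rho values over four hold-out accounts;
# the middle value separates the two classes perfectly
labels = [0, 0, 1, 1]
path = {0.0:    [0.4, 0.6, 0.5, 0.7],
        0.0006: [0.1, 0.2, 0.8, 0.9],
        0.005:  [0.3, 0.6, 0.4, 0.8]}
best = select_rho(path, labels)
```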
Secondly, we consider the estimated regression coefficients from the fitted model. These are reported in Table 4 (column (I)), together with their standard errors, calculated by using the observed information matrix as described in Section 3.5 and further expanded on in the on-line supplementary material. In the remaining columns, we compare the estimates with those of a simpler mixed probit with uncorrelated random effects for sectors only (column (II), fitted with glmer) and of a conventional probit model without random effects, which ignores unobserved heterogeneity and is often used in credit risk modelling (column (III)). Focusing on the estimates from the mixed graphical model (column (I)), the coefficient attached to cash over total assets is statistically significant with a negative sign, indicating that companies with higher cash reserves relative to total assets are less likely to default. The results also show a negative and statistically significant effect for the variable 'retained profits on total assets': the higher the net profits with respect to the investments made, the lower the probability that the firm will go bankrupt. The variable trade debt has a negative and significant coefficient, meaning that the more money a company is expected to receive from other companies as a result of trade, the less likely the company is to default. Looking at the non-financial variables, the coefficients attached to size and age are significant and indicate that, as expected, larger and older companies have lower probabilities of default. Comparing with the results reported in columns (II) and (III), the inclusion of network effects in the probit model does not seem to change significantly the estimated coefficients, although the standard errors are slightly smaller for the proposed method. This has been observed also in related literature on multivariate probit models (Chib and Greenberg, 1998).
Finally, we compare the classification performance of the mixed graphical model with the same two models reported in Table 4. Table 5 reports the classification accuracy statistics on the hold-out sample. When adopting the mixed graphical probit, the overall classification accuracy is significantly improved. Given the high number of non-failed companies in the data, the mixed graphical probit is particularly good at correctly identifying companies that did not fail. This is confirmed also by the ROC curve in Fig. 6, where the ROC of the mixed graphical probit always lies above those of the simpler mixed probit and of the conventional probit.
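Hold-out classification statistics of the kind reported in Table 5 can be computed as in this minimal sketch (`classification_stats`, the threshold and the toy data are all illustrative):

```python
def classification_stats(probs, labels, threshold=0.5):
    """Hold-out accuracy, sensitivity (defaults correctly flagged) and
    specificity (non-defaults correctly classified) at a given threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / max(tp + fn, 1)
    specificity = tn / max(tn + fp, 1)
    return accuracy, sensitivity, specificity

# toy hold-out sample: one defaulted account (label 1), three non-defaults
acc, sens, spec = classification_stats(
    probs=[0.9, 0.2, 0.6, 0.1], labels=[1, 0, 0, 0])
```

Varying the threshold between 0 and 1 yields the ROC curves compared in Fig. 6.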

Concluding remarks
In this paper we have proposed a computationally efficient EM algorithm for estimation of a mixed probit model with correlated group-specific effects and have shown its use in a credit risk application for which existing approaches were prohibitively slow. We have proposed unconstrained and penalized likelihood estimation approaches for inference and have derived the observed information matrix and asymptotic standard errors of the estimates. The L1-penalized approach is suitable when the number of groups is large relative to the number of observations, for which ML fails, and/or when the recovery of the underlying network is of interest. If network recovery is not of interest but high dimensionality is present, other regularization methods can be used in the M-step of the proposed EM algorithm, for example a ridge penalty (Schäfer and Strimmer, 2005). An extensive simulation study showed that our estimator has good finite sample properties and can be adopted for estimation and prediction on very large data sets, given its moderate computational cost. A large-scale credit risk application on a unique data set on SMEs, a setting in which credit risk modelling is currently underdeveloped, showed that accounting for network effects makes a significant contribution to increasing the default prediction power of risk models and therefore that efficient inferential procedures for these models are particularly useful in this field.
The R code to fit the mixed graphical probit that is described in this paper is available on GitHub (https://github.com/veronicavinciotti/correlatedmixedprobit).
Acknowledgements
We thank George Foy for assisting with data retrieval and Francesco Moscone, Sergio di Cesare and Mark Lycett for helpful comments on the manuscript.