Circular Regression Trees and Forests with an Application to Probabilistic Wind Direction Forecasting

While circular data occur in a wide range of scientific fields, the methodology for distributional modeling and probabilistic forecasting of circular response variables is rather limited. Most of the existing methods are built on the framework of generalized linear and additive models, which are often challenging to optimize and to interpret. Therefore, we suggest circular regression trees and random forests as an intuitive alternative approach that is relatively easy to fit. Building on previous ideas for trees modeling circular means, we suggest a distributional approach for both trees and forests yielding probabilistic forecasts based on the von Mises distribution. The resulting tree-based models simplify the estimation process by using the available covariates for partitioning the data into sufficiently homogeneous subgroups so that a simple von Mises distribution without further covariates can be fitted to the circular response in each subgroup. These circular regression trees are straightforward to interpret, can capture nonlinear effects and interactions, and automatically select the relevant covariates that are associated with either location and/or scale changes in the von Mises distribution. Combining an ensemble of circular regression trees to a circular regression forest yields a local adaptive likelihood estimator for the von Mises distribution that can regularize and smooth the covariate effects. The new methods are evaluated in a case study on probabilistic wind direction forecasting at two Austrian airports, considering other common approaches as a benchmark.


Introduction
Circular data can be found in a variety of applications and subject areas, e.g., hourly crime rate in socio-economics, animal movement direction or gene-structure in biology, and wind direction as one of the most important weather variables in meteorology. Fitting a statistical model to this type of data requires the incorporation of its specific feature of periodicity. For example, angular data are restricted to an interval such as [0, 2π) with 0 being equivalent to 2π.

Circular regression: Conditional mean vs. distributional models
Many approaches to model circular data assume that the circular variable of interest follows a circular distribution, in particular the von Mises distribution which is also known as the "circular normal distribution". One of the first regression models with a circular response variable and linear covariates was presented by Gould (1969) where the circular mean is predicted by a linear combination of covariates. Johnson and Wehrly (1978) refined this idea by plugging in a link function transforming the linear predictor to a restricted interval of length 2π. This generalized linear model (GLM) type approach was further extended by Fisher and Lee (1992) and subsequently by Fisher (1993) introducing independent GLMs for either the location or scale parameter with appropriate link functions while keeping the other parameter constant. Additionally, they developed a combined heteroscedastic or distributional version with alternating reestimation of the location and scale parameters conditional on the respective sets of covariates until convergence. While all of these models are built on well-elaborated theory, their application in practice remains very challenging, mainly due to the complexity encountered in optimizing the corresponding log-likelihood function which is not globally concave. Therefore, highly-informative starting values are crucial for such circular GLMs to converge (Pewsey, Neuhäuser, and Ruxton 2013;Gill and Hangartner 2010). In order to avoid this strong dependence on appropriate initial values, Mulder and Klugkist (2017) present a Bayesian alternative of a homoscedastic GLM for circular data. However, apart from potential difficulties in the optimization procedure of circular GLMs, the interpretation of the underlying additive effects is often challenging as well because the link function is highly nonlinear and the representation of smooth transitions on the unit circle is not straightforward. For example, the same rotation can be obtained in either positive or negative direction on the circle leading to an ambiguous interpretation.
As a very intuitive and data-driven alternative, we propose a flexible tree-based regression approach for modeling circular data by applying the von Mises distribution within the methodology of distributional trees and forests (Schlosser, Hothorn, Stauffer, and Zeileis 2019). The resulting circular regression trees and forests avoid the discussed difficulties of circular GLMs by using the available covariates for partitioning the data into sufficiently homogeneous subgroups so that a simple von Mises distribution without further covariates can be fitted to the circular response in each of these subgroups. This obviates the need for a link function or for iterating between models for the separate distribution parameters. By leveraging the distributional modeling approach, the trees can automatically detect and capture differences in both distribution parameters, providing a fully specified circular response distribution in each terminal node offering a wide range of statistical inference. In addition, the employed tree structure allows to capture non-additive effects while forests enable the modeling of smooth changes. Furthermore, covariates and their possible interactions do not need to be specified in advance as they are selected automatically in the recursive partitioning algorithm.
This novel approach to circular regression trees and forests complements the literature on treebased circular modeling. Lund (2002) already introduced a circular regression tree algorithm where binary splits are made based on an angular distance measure capturing node homogeneity. However, this only models changes in the conditional mean but not the conditional variance or full probabilistic distribution which would allow for also considering uncertainty in forecasts (Gneiting 2008). In contrast, we introduce a distributional tree approach that considers splits in all distribution parameters and yields a full probabilistic predictive model. In addition, as a natural extension, ensembles or forests of circular regression trees are presented in this study, showing that these can further improve predictive power.

Motivating example
To provide a first impression of the presented methodology, a circular regression tree is employed for probabilistic wind direction forecasting. Wind direction is a classical circular quantity and accurate forecasts are of great importance for decision-making processes, e.g., in air traffic management as considered in this study. Figure 1 shows an estimated tree for 1-hourly forecasts at Innsbruck Airport, located at the bottom of a narrow valley within the European Alps. Topography channels the wind along the west-east valley axis or along a tributary valley intersecting from the south. Hence, pressure gradients to which valley wind regimes react are considered as covariates along with other meteorological measurements (lagged by one hour) and their derivatives, such as wind direction and wind speed at the airport itself as well as spatial and temporal differences. Figure 1 illustrates the resulting tree along with the empirical (gray) and fitted von Mises (red) wind direction distribution in each terminal node. Based on the fitted location parametersμ, the subgroups can be distinguished into the following wind regimes: (1) Up-valley winds blowing from the valley mouth towards the upper valley (from east to west, nodes 4 and 5); (2) Downslope winds blowing across the Alpine crest along the intersecting valley towards Innsbruck (from south-east to north-west, node 8); (3) Down-valley winds blowing in the direction of the valley mouth (from west to east, nodes 10, 12 and 13). Node 7 captures observations with rather low wind speeds that cannot be clearly distinguished into specific wind regimes and are consequently associated with a very low estimated concentration parameterκ, i.e., a high estimated variance. In terms of covariates, the lagged wind direction ("persistence") is mostly responsible for distinguishing the broad range of wind regimes listed above while the pressure gradients and wind speed separate the data into subgroups with high vs. low precision. A more extensive case study of circular regression trees and forests applied to probabilistic wind direction forecasting at Innsbruck Airport and Vienna International Airport is presented in Section 4, along with a benchmark against commonly-used alternative approaches.
The remainder of the paper is structured as follows: The theory on probabilistic circular modeling introducing the von Mises distribution, and circular regression models are discussed in Section 2. The methodology of circular regression trees and forests and their features are introduced in Section 3. After the case study presented in Section 4, a comprehensive summary and conclusions are given in Section 5.

Probabilistic circular modeling
Probabilistic modeling of circular data requires the selection of a probability distribution which accounts for the periodicity of circular data. Generally, this feature can be obtained by "wrapping" the probability density function of any continuous distribution around the unit circle (Mardia and Jupp 1999). In that way, the wrapped Cauchy distribution or the wrapped normal distribution can be employed to model symmetric unimodal circular data. A close approximation to the wrapped normal distribution that is mathematically simpler and hence easier to use is provided by the von Mises distribution (Fisher 1993), a purely circular distribution which is also known as "the circular normal distribution" and is a common choice  Figure 1: Fitted tree based on the von Mises distribution for 1-hourly wind direction forecasts at the airport of Innsbruck. In each terminal node the empirical histogram (gray) and fitted density (red line) are depicted along with the estimated location parameter (red hand). The covariates selected for splitting are wind direction (meteorological degree), wind speed (ms −1 ), and pressure gradients (dpressure; hPa) west, east and south of the airport, all lagged by one hour. Note that in the meteorological context wind direction is defined on the scale [0, 360] degree and increases clockwise from North (0 degree). for probabilistic modeling of circular data. Based on a location parameter µ ∈ [0, 2π) and a concentration parameter κ > 0 the density of the von Mises distribution for an observation y ∈ [0, 2π) is given by where I 0 (κ) is the modified Bessel function of the first kind and order 0 (see, e.g., Zemroch 1975, Jammalamadaka andSengupta 2001, or Ley and Verdebout 2017 for a more detailed overview).
The corresponding log-likelihood function is defined as (µ, κ; y) = log(f vM (y; µ, κ)) = − log(2πI 0 (κ)) + κ cos(y − µ). (2.2) To fit a probabilistic model vM(y; µ, κ) to a circular response y, the distribution parameters µ and κ need to be estimated. This can be done by maximizing the log-likelihood function (µ, κ; y) yielding maximum likelihood estimatorsμ andκ such that a fully specified distributional model is fitted to the learning data The score function provides a way to obtain a measure of goodness of fit of the model for each observation and fitted parameter. Then, the optimization problem in Equation 2.3 can alternatively be specified as n i=1 s(μ,κ, y i ) = 0.
(2.5) Figure 2 depicts a von Mises model for circular data in [0, 2π) fitted by maximum likelihood, either using a linearized (left) or circular (right) scale. In both cases, the empirical histogram (gray bars) is shown along with the fitted density (red line) and estimated location parameter (red hand). However, this distributional model only considers the circular response variable but no covariate. Of course, including covariates is of interest in a regression setup for forecasting.
In most generalized linear model (GLM) or generalized additive model (GAM) approaches to circular regression the location parameter µ depends on covariates z through a link function g(·), circular intercept µ 0 and coefficient vector β: (2.6) The link function transforms the additive predictor to an interval of length 2π. Typically g(x) = 2 · arctan(x) is employed, as suggested by Fisher and Lee (1992). They also developed a heteroscedastic version by combining two individual GLMs, each for one of the parameters µ and κ. This provides a first approach to a fully probabilistic regression model for circular data, albeit the parameters are not regressed simultaneously on covariates as in the more general framework provided by generalized additive models for location, scale, and shape (GAMLSS, Rigby and Stasinopoulos 2005). Nevertheless, other circular additive models share the previously discussed difficulties induced by the characteristics of the log-likelihood function and the strongly nonlinear link function. Referring to additive models in general, it has to be considered that a proper model specification can be very challenging, particularly for a high number of covariates and no information on possible interactions. Moreover, the additive structure might impose a smooth effect even if the true underlying effect is an abrupt shift, which occurs, e.g., in atmospheric wind fields.
In contrast, the tree-based circular regression models proposed in the next sections largely avoid the problems above by employing recursive partitioning in combination with local adaptive likelihood estimation.

Circular regression trees and forests
Starting out from the ideas of Lund (2002), we introduce circular regression trees and forests considering splits in all distribution parameters of the von Mises distribution and providing a full probabilistic model. Moreover, the resulting tree-based models provide a very intuitive and data-driven alternative to commonly used GLMs for circular data.

Circular regression trees
Fitting a global model to a full data set can be very challenging, particularly for complex data with substantial variations. Therefore, separating the data set into more homogeneous subgroups based on covariates before fitting a local model in each of these subgroups allows to capture (potential) group-specific effects more precisely and hence can result in an overall better-fitting model. This is the general idea of regression trees which are combined with distributional modeling in Schlosser et al. (2019). Specifying a full distributional model in each node of the tree yields a distributional regression tree, where selecting the von Mises distribution enables an application to circular data. The crucial step of how and where to split the data can be accomplished with the unbiased recursive partitioning algorithms MOB (Zeileis, Hothorn, and Hornik 2008) or CTree (Hothorn, Hornik, and Zeileis 2006). For this purpose, model scores are obtained by evaluating the score function s(·) for each individual observation at the parameter estimates (Equation 2.4). For the von Mises distribution with its two distribution parameters (µ and κ) and a data set of n observations, this yields an n × 2 matrix that can be employed as a discrepancy measure, capturing how well each given observation conforms with the estimated locationμ and precisionκ, respectively. To capture dependence on covariates, the association between the model's scores and each available covariate is assessed using either a parameter instability test (MOB) or a permutation test (CTree). By doing so in each partitioning step, the covariate with the highest significant association (i.e., lowest significant p-value, if any) is selected for splitting the data. The corresponding split point is chosen either by optimizing the log-likelihood (MOB) or a two-sample test statistic (CTree) over all possible partitions. This procedure is repeated recursively until there are no significant parameter instabilities or until another stopping criterion is met (e.g., subgroup size or tree depth). A more detailed description of the applied tree-building algorithm can be found in Appendix A.
Once a distributional tree model is fitted it can be employed to obtain probabilistic predictions for a possibly new set of observed covariates z = (z 1 , . . . , z m ). Starting at the root node, the tree structure leads the observation to a terminal node where the parameter pair (μ,κ) is estimated for the corresponding subset of learning observations. This can also be expressed by employing weights which indicate whether the i-th learning observation and the observation z belong to the same terminal node: Here, 1(·) is the indicator function and B b is the b-th out of B segments partitioning the covariate space in disjoint subsets. Then the estimated parameter pair (μ,κ)(z) specifying the predicted von Mises distribution for a given z is obtained by a weighted maximum likelihood estimator: Therefore, the same parameter pair is estimated for all observations belonging to the same terminal node, which speeds up computation since the parameter estimates do not need to be recalculated for each (new) observation via maximum likelihood but can be extracted directly from the learning sample and the fitted model.
While tree models can capture non-additive effects, their structure and the consequential strict separation of data into subgroups hinders an adequate depiction of smooth effects. They can be included by combining an ensemble of trees in order to obtain a regression forest, which also stabilizes the model.

Circular regression forests
A natural extension of (circular) regression trees are ensembles or forests that can improve forecasts by regularizing and stabilizing the model. Random forests introduced by Breiman (2001) average the predictions of an ensemble of trees, each built on a subsample or bootstrap of the original data. A generalization of this strategy is to obtain weighted predictions by adaptive local likelihood estimation of the distributional parameters (Section 2.3. of Schlosser et al. 2019;Hothorn and Zeileis 2017). More specifically, for each (possibly new) observation z a set of averaged "nearest neighbor" weights w forest i (z) is obtained that is based on the number of trees in which z is assigned to the same terminal node as each learning observation y i , i ∈ {1, . . . , n}. Hence, for a forest of T trees, the weights are calculated via where |B t b | denotes the number of observations in the b-th segment of the t-th tree. Therefore, similar observations ending up more often in the same terminal node have higher weights and thus a stronger influence in the weighted maximum likelihood estimation.
In that way a specific set of weights can be calculated for each observation yielding its specific parameter estimates for the von Mises distribution ( 3.4) Therefore, the resulting parameter estimates can smoothly adapt to the given covariates z whereas w i (z) = 1 would correspond to the unweighted full-sample estimates and w i (z) ∈ {0, 1} corresponds to the subgroup selection from a tree. Thus, circular regression forests can capture both smooth and abrupt changes, while covariates and possible interactions are selected automatically and do not explicitly need to be specified beforehand.

Case study: Probabilistic wind direction forecasting
As motivated in Section 1, accurate forecasts of wind directions are of great importance for risk management in various fields such as agriculture, energy production, or aviation. For example, in order to direct airplanes to a safe landing, precise knowledge of wind direction for the next hour(s) at the respective airport is highly desirable and adequate prediction methods are required. This section exemplifies the use of circular regression trees and forests with wind direction forecasts for two Austrian airports -one in flat terrain, the other one in mountainous terrain. The results are benchmarked against alternative probabilistic forecasting methods. The study is based on +1 h and +3 h forecasts employing lagged observations in the vicinity of the airports as possible predictor variables.

Data
The circular response variable considered in this case study is a 10 min-average of wind direction measurements at Innsbruck Airport (INN) and Vienna International Airport (VIE) on an hourly temporal resolution. Temporal information and 1-hourly resolved 10 min mean observations of various meteorological quantities are used as predictor variables, including wind direction, wind speed, temperature, air pressure and humidity, all lagged by one or three hours according to the respective forecasting step. The meteorological variables are measured either directly at the airports or within their vicinities. For Innsbruck, measurements at the airport  and along the intersecting valleys are used, whereas, for Vienna, measurements at the airport and within its vicinity of approximately 30 km are used. Figure 3 provides a topographical overview of the airports and their surrounding areas with the station sites employed in this study. In addition, we use derived quantities such as 3-hourly means, minima and maxima, as well as 1-and 3-hourly temporal changes and spatial differences towards the airport of the respective quantities. An overview of the employed data sets can be found in Table 1.
The data used in this study consists of five subsequent years from January 2014 to December 2018. After first eliminating predictor variables with more than 5% missing values and then time points with any missing observations, the data set consists of 41 979 time points and 260 covariates for Innsbruck, and of 38 985 time points and 494 covariates for Vienna, respectively.

Models and evaluation
For a fair evaluation of circular regression trees and forests, and to investigate whether they can be applied as a reasonable alternative to already existing approaches, three additional statistical models are employed in this study for probabilistic forecasting of wind directions. Two of them are based on existing approaches used in the meteorological field, while the third is a state-of-the-art GLM-type model to forecast circular response variables.
• Climatological model: Accurate knowledge of weather quantities' climatologies can be important for a wide range of applications. While forecasts based on climatologies, by construction, do not adapt to the current weather situation they are still a useful baseline for the validation of newly developed forecasting systems (Simon, Umlauf, Zeileis, Mayr, Schulz, and Diendorfer 2017;Stauffer, Mayr, Messner, Umlauf, and Zeileis 2017). Specifically, the climatology employed in the following uses all observations at the same time (to adapt to daily cycles) in a window of 31 days centered around the day of interest (to adapt to seasonal cycles) in all available years in the sample. Based on these observations a probabilistic model is obtained by maximum likelihood estimation as described in Section 2. This approach follows Vogel, Knippertz, Fink, Schlüter, and Gneiting (2018) and is discussed in a comprehensive summary on different time-adaptive training schemes in Lang, Lerch, Mayr, Simon, Stauffer, and Zeileis (2019a).
• Persistency model: The persistence describes the previous value of a single weather quantity in a time series. Like the climatology it is a very basic prediction model that is often applied as a baseline reference in weather forecasts (NOAA's National Weather Service 2019). Especially in nowcasting tasks with very short forecasting steps, the persistence can provide very good estimates.
To gain a full probabilistic persistency model, we proceed similarly as for the climatological model by using maximum likelihood estimation and fitting the distribution parameters of the von Mises distribution conditional on lagged response values according to the description in Section 2. We fit one model for every hour throughout the validation data set employing the previous six lagged response values as training data. In order to allow for a stronger influence of observations closer to the time of interest, exponential smoothing is employed with a smoothing factor of 0.5; accordingly, for every prediction an equal influence rate of 50 percent is assigned both to the current observation and to the previous five observations together. Observations with longer time lags have exponential weights below 0.01 and are therefore omitted from the training data.
• Generalized linear model: Traditional approaches to forecast circular response variables are often based on circular GLM-type models (Fisher 1993). As discussed in Section 1, circular regression models often experience the problem that the likelihood function can be strongly irregular which makes optimization rather difficult. Hence, they often do not converge if no appropriate initial values are provided (Pewsey et al. 2013;Gill and Hangartner 2010). In this study, to be able to employ a GLM out of the box as a reference, we use the Bayesian implementation of Mulder and Klugkist (2017) which depends less on initial values due to an MCMC sampling algorithm using weakly informative priors. Following Mulder and Klugkist (2017) the model uses a link function g(·) to keep the response values within an interval of length 2π. As the implementation cannot handle circular covariates, we use the components of the lagged 2-dimensional wind vector (u, v) and the lagged wind speed spd as predictor variables. The model formula for the location parameter µ of the von Mises distribution can be written as: with β 0 being a circular intercept, β • the regression coefficients and the link function g(x) = 2 · arctan(x). In addition, a constant concentration parameter κ is fitted to the full learning sample. To allow for seasonally varying error characteristics in both bias and slope coefficients, and to allow for seasonal heteroscedasticity captured by the concentration parameter, we use the same time-adaptive training approach as for the climatological model; hence, separate models are estimated over all observation dates, using the same time of 31 days centered around the day of interest over all available years in the training data (Lang et al. 2019a).
• Circular regression tree: For the circular regression tree introduced in Section 3.1, all covariates provided in the learning data can be considered due to an intrinsic automatic variable selection performed in the tree estimation. The tree is built with the newly developed R package circtree employing the CTree algorithm (Hothorn et al. 2006) using a minimal number of 2000 observations in each terminal node (argument minbucket).
• Circular regression forest: Following the description in Section 3.2, the circular regression forest used in this study is constructed based on 100 individual trees employing the R package circtree. Each of these trees is again built by the CTree algorithm on a subsample containing 30 percent of the original learning data. All covariates are included for building each tree which ensures that the lagged response variable is always considered for splitting. This bagging approach can be applied in circtree by setting the argument mtry to the total number of covariates. Since a high number of possible split points leads to high computational costs, the covariates are binned in a maximum of 50 classes (argument nmax = c(yx = Inf, z = 50)). Contrary to a single-tree model, forests usually consist of very large trees as they are not prone to overfitting the data due to the stabilization obtained by combining the individual trees. Therefore, we use the following control arguments to build rather large trees: The minimal number of observations to perform a split is set to 20 (argument minsplit), the minimal number of observations in each terminal node is set to 7 (argument minbucket), and the significance level for variable selection is kept at its maximum value of 1 (argument alpha).
To compare the predictive performance of all proposed models, a circular analogue of the continuous ranked probability score (CRPS) as introduced by Grimit, Gneiting, Berrocal, and Johnson (2006) is computed. Just as the linear version of the CRPS (for more details see Hersbach 2000) it is a proper scoring rule (Gneiting and Raftery 2007) and measures the difference between an observation and the corresponding predicted distribution function in order to assess the probabilistic goodness of fit for the estimated model. Hence, the lower the CRPS value the better the predictive performance. Contrary to the linear version, the circular CRPS reduces not to the absolute error but to the angular distance when the forecast is deterministic.
In addition to the raw CRPS, corresponding skill scores are computed to assess differences in the improvement of the various statistical models over the climatological model used as a reference: All scores presented in the next section are computed out-of-sample based on five years of data. For the persistency model only dates prior to the time of interest are used and the validation is performed rolling over all observations. For all other models a five-fold cross-validation is employed using up to four calendar years for model training and the remaining single calendar year for validation. Due to the large sample size of 24 hourly values per day over five years, some kind of temporal aggregation is needed to ensure a correct visual comparison of the individual methods. The analyses performed have shown that for the employed models the variability of the predictive performance over the five years is lower than over a single day or over a single year. Hence, CRPS and CRPS skill scores are aggregated over the respective five validation years which yields 24 hourly scores per month averaged over the five validation years.

Results
This section provides a detailed analysis on the predictive performance of the different proposed statistical models applied to probabilistic wind direction forecasting. To ensure a comprehensive comparison of the models, wind direction forecasts are evaluated for two different lead times at two airports with different climatological site characteristics. Figure 4 shows the CRPS values of the employed models at forecast steps +1 h and +3 h for Innsbruck (Panels a, c) and Vienna (Panels b, d). The scores are aggregated over the five validation years, yielding yearly mean values for every hour per calendar month, with a lower score indicating better performance. The circular regression forest overall provides the best predictive performance, followed by the circular regression tree and the persistency model for both stations at both forecast steps; except for the 3 h forecasts at Innsbruck where the persistency model is outperformed by all others. In comparison to the circular regression tree and forest, for both stations and forecast steps, the climatological model and the linear model show clearly higher CRPS values and hence a lower predictive performance. The different site characteristics of the airports Innsbruck (Figure 4 a, c) and Vienna (Figure 4 b, d) seem to have an effect on the absolute level of the model performances and on their respective predictive performance variances. At Innsbruck, due to the surrounding mountains only a limited number of possible wind directions exists, namely the three wind regimes discussed for Figure 1 in Section 1. Therefore, for Innsbruck the wind direction remains rather constant in one of these possible states, but once a change takes place it is mostly a major wind direction shift, e.g., from q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q (a) q q q (c) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q (b) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q up-valley to down-valley. Due to the few wind regimes the rather inflexible climatological and linear models score relatively well with similar CRPS values as the other models ( Figure 4 a, c). In addition, at Innsbruck the potential high prediction errors in case of a change of the wind regime seem to lead to a higher variation in the predictive performance for all models in comparison to Vienna; this variation is especially high for the persistency model due to its strong vulnerability to abrupt wind shifts. On the contrary, at Vienna smaller and less abrupt changes in the wind direction as well as less pronounced wind regimes are observed due to the less mountainous surrounding. This seems to weaken the predictive performance of the climatological and linear models, and to reduce the performance variability for all models (Figure 4 b, d).
The different forecast steps have apparently only a minor effect on the predictive performance of the climatological model and the linear model at both stations. As expected, for the persistency model, at both stations, higher scores for the 3 h forecast (Figure 4 c, d) reveal a lower performance for longer lead times; this is due to the lower information content of 3-hourly instead of 1-hourly lagged response values employed as covariates in the persistency model. The circular regression tree and forest seem to partially compensate for the lower skill of the lagged response values by other covariates, hence their predictive performance only slightly decreases for the longer lead time. This compensation is especially evident for Innsbruck, where the performance difference between the persistency model and the treebased methods significantly increases from the 1-hourly to the 3-hourly forecast.
In addition to the raw CRPS (Figure 4), CRPS skill scores with the climatological model as a reference are provided in Figure 5. Skill scores are in percent, where positive values indicate an improvement in the predictive performance over the reference. For all setups, the circular regression forest has the highest skill scores with a mean performance gain of 13-25% and 58-71% for Innsbruck and Vienna, respectively. As discussed for Figure 4, this improvement over the climatological model is lower for Innsbruck due to the low number of predominant wind regimes and hence a relatively good performance of the climatological model. Additionally, Figure 5 shows that while the persistency model's performance is lower than the reference (Panel c) the tree-based models can compensate for the low skill of the lagged response values employed as covariates and, hence, are still significantly superior to the reference.

Summary and conclusion
Extending the toolbox for modeling circular data, circular regression trees and forests are established by coupling model-based recursive partitioning with the von Mises distribution. By separating the data into more homogeneous subgroups, possible difficulties in circular regression are bypassed as covariates are solely considered for splitting and group-specific models are fitted without further covariates. In addition, by specifying the von Mises distribution for each node and allowing for splits in both distribution parameters µ and κ, fully probabilistic forecasts are provided.
The performance of the novel circular regression trees and forests is assessed in a case study for shortterm probabilistic wind direction forecasting at two airports with different site characteristics. As benchmark models, probabilistic climatology and persistency models, as well as a state-of-the-art circular GLM-type model are evaluated based on proper scoring rules. In summary, the circular regression trees and forests have the highest predictive performance in this setting. For cases without changes in the wind regime, lagged response values provide already highly skillful estimates leading to a good performance of the persistency model as observed for shortterm wind direction forecasts in this study. While in these cases the trees and forests also benefit from the highly informative lagged response, they can compensate for a lower information of this covariate by incorporating other quantities and possible interactions of these, contrary to the persistency model (see Figure 5). Hence, the tree-based models provide reliable forecasts in all tested meteorological settings. For operational use, a possible extension could be the incorporation of numerical weather predictions as (additional) covariates. While this probably only slightly improves the predictive skill for short leadtimes, it possibly extends the potential forecast range of the different methods.
For the specific task of wind forecasting, the wind direction is often only relevant if the wind speed is sufficiently high. Hence, it is of interest to account for both quantities simultaneously, e.g., by considering a bivariate normal distribution for wind vectors (from which wind speed and wind direction can be obtained). The parameters of this bivariate normal distribution could then be linked to available covariates using an additive regression framework (as proposed by Lang, Mayr, Stauffer, and Zeileis 2019b) or using a tree-based approach, similar to the one proposed in this paper. Moreover, a rather different approach for a combined response of wind speed and wind direction would be a two-step or hurdle model: In the first step this could build on the truncated normal model of Thorarinsdottir and Gneiting (2010) to capture wind speed; in the second step a circular wind direction model is leveraged given that a certain hurdle for the wind speed is crossed.
Another possible improvement for obtaining more parsimonious circular regression trees is to consider splitting circular covariates into two circle segments by searching two split points simultaneously rather than sequentially at different depths. While this might slightly improve the predictive performance of circular regression trees, this should not affect the performance of the forests, as they consist of very large trees with many different splits.
To conclude, in general the tree structure can capture nonlinear changes, shifts, and potential interactions in covariates without prespecification of such effects. As supported by the presented case study, this can be particularly useful for modeling a highly fluctuating response, such as typically observed for wind direction, or/and in case of a large number of possible covariates. Moreover, the case study shows that building ensembles of circular regression trees can even improve the forecasting performance, as the resulting forests allow for modeling smooth effects and stabilize the model.
For the circular GLM considered as reference model the corresponding implementation is provided in the R package circglmbayes by Mulder and Klugkist (2017). In particular the function circGLM is applied to estimates the intercept and regression coefficient along with the concentration parameter.

A. Tree algorithm
This section provides a more detailed overview on the permutation-test-based CTree algorithm (Hothorn et al. 2006), specifically for circular data as applied for building circular regression trees and forests presented in this case study. An alternative tree-building framework is provided by the MOB algorithm, which is based on M-fluctuation tests (see Zeileis et al. 2008, for more details).
In the following, the testing and splitting strategy is described for the root node of the tree which contains the entire learning sample. For a complete tree model, the same procedure is applied iteratively to all resulting child nodes with the corresponding subsamples.
First, employing the von Mises distribution, a distributional model vM(y; µ, κ) is fitted to the learning sample of circular observations {y i } i=1,...,n as explained in Section 2. In a next step, a goodness-of-fit measurement is obtained for each parameter and each observation by evaluating the score function s(µ, κ, y) at the estimated location and concentration parameterμ andκ. To detect dependencies between the resulting scoring matrix with s(μ,κ, y i ) ∈ R 1×2 . For a numeric split variable z l the transformation function v l is simply the identity function v l (z li ) = z li such that t l ∈ R 2 as the "vec" operator converts the matrix of dimension 1 × 2 into a 2 column vector. If z l is a categorical variable with h categories then v l (z li ) = (I(z li = 1), . . . , I(z li = h)), hence, v l returns a unit vector of dimension h where the entry 1 indicates the category of z li . In this case the "vec" operator converts the h × 2 matrix into a h · 2 column vector by column-wise combination such that t l ∈ R h·2 . If there are any observations with missing values these are not included in the calculation of t l .
To map the multivariate linear statistic t l onto the real line a univariate test statistic c is employed, for example in a quadratic form where µ l and Σ l are the conditional expectation and the covariance of t l , as derived by Strasser and Weber (1999) and used for standardization, and Σ + l is the Moore-Penrose inverse of Σ l . As an alternative, also a maximum form (c max ) can be considered such that the maximum of the absolute values of the standardized linear statistic is returned.
The asymptotic conditional distribution of c(t l , µ l , Σ l ) is either normal (for c max ) or χ 2 (for c quad ) owing to the asymptotic conditional distribution of the linear statistic t l being a multivariate normal with parameters µ l and Σ l (Strasser and Weber 1999). With this knowledge at hand, the corresponding p-values can be calculated and used to select the best splitting