A non-parametric Hawkes process model of primary and secondary accidents on a UK smart motorway

A self-exciting spatio-temporal point process is fitted to incident data from the UK National Traffic Information Service to model the rates of primary and secondary accidents on the M25 motorway in a 12-month period during 2017-18. This process uses a background component to represent primary accidents, and a self-exciting component to represent secondary accidents. The background consists of periodic daily and weekly components, a spatial component and a long-term trend. The self-exciting components are decaying, unidirectional functions of space and time. These components are determined via kernel smoothing and likelihood estimation. Temporally, the background is stable across seasons with a daily double peak structure reflecting commuting patterns. Spatially, there are two peaks in intensity, one of which becomes more pronounced during the study period. Self-excitation accounts for 6-7% of the data with associated time and length scales around 100 minutes and 1 kilometre respectively. In-sample and out-of-sample validation are performed to assess the model fit. When we restrict the data to incidents that resulted in large speed drops on the network, the results remain coherent.


Introduction
The United Kingdom has one of the lowest per-capita death rates from traffic accidents in the world, estimated by the World Health Organization (2018) at 3.1 per 100,000 of population in 2016. Nevertheless 1782 deaths and 25,484 serious injuries resulted from accidents on UK roads in 2018 Department for Transport (2019). Aside from the direct human cost of serious accidents, indirect economic costs result even from relatively minor incidents. This is because crashes, collisions and breakdowns can cause severe congestion leading to significant drops in the efficiency of the road transport network. For these reasons, there is a strong imperative to further reduce the accident rate on UK roads.
However traffic accidents are rare in absolute terms and are not distributed uniformly across the network. Further rate reductions are therefore likely to require targeted interventions. Targeted interventions might try to to improve safety at specific locations where the accident risk is known to be high compared to the baseline or might try to mitigate against particular mechanisms that are known to account for a significant proportion of accidents. Infrastructure modifications to improve the safety of accidentprone junctions is an example of the first type of intervention. The deployment of the Motorway Incident Detection and Automatic Signalling (MIDAS) system to reduce the number of secondary accidents on motorways is an example of the second. Secondary accidents occur when when a driver fails to react appropriately to the disruption caused by an existing accident leading to a subsequent accident upstream of the first one.
MIDAS uses a network of induction sensors, known as loops, embedded in the road surface to automatically detect queues and then warn upstream drivers of the danger ahead via roadside signage. MIDAS is one component of the UK 'smart motorways' infrastructure, see Highways England (2020). Smart motorways use modern sensing and communications technology to allow transport system operators to display messages to drivers, dynamically control speed limits and close lanes when needed to aid the control of traffic. Most of the data, including data on accidents and congestion, is publicly available via the National Traffic Information Service (NTIS). In this paper, we use data from NTIS to model the distribution of motorway incidents as a spatio-temporal Hawkes process, a type of self-exciting point process. We focus on the M25 London Orbital, one of the busiest motorways in the UK. The objectives of the study are two-fold. The first is to quantify how accident risk on the M25 varies in space and time relative to the baseline. The second is to use the self-excitation component of a Hawkes process to quantify the likely contribution of secondary incidents to the observed totals. We would like this model to be helpful in addressing the question of how best to target interventions when the baseline accident rate is low in absolute terms. We therefore perform extensive in-sample and out-of-sample validation to verify the models performance on seen and unseen data.

Literature Review
There is much work focusing on spatial and temporal analysis of traffic incidents, with a discussion of the evidence for spatial auto-correlation and incorporation of it into accident modelling given in Aguero-Valverde and Jovanis (2008). In the work, the authors discussed previously used descriptive analysis methods, including K-function analysis and comparison to complete spatially random patterns, and how they showed clear evidence of spatial correlation among event locations. They then furthered this by incorporating spatial components into an auto-regressive model, finding it improved upon models disregarding the spatial correlation in the data. Additional analysis is provided in Fan et al. (2018), where the authors focused on an urban network in China and analysed the evolution of event hot-spots through time. They developed various network spatial analysis methods, extending kernel density estimation on networks, Moran's I and Local Indicators of Mobility Associations (LIMA) and offer various exploratory analysis of the dataset with these methods.
Further spatial-temporal analysis was completed in Song et al. (2018), where the authors used Kulldorffs space-time scan statistics to determine statistically significant clusters of traffic accidents across the entire U.K. in 2016. They found two significant clusters, both in the north of the country, but conceded that they do not explicitly account for the network structure in their analysis. Whilst we also consider data from the U.K. we use a different time-period and focus on a much smaller region, a single motorway rather than the entire country. We also aim to model the dynamics as a point-process, not just discover locations of statistically significant clusters. Of-course, from our modelling, we can infer spatially and temporally where events are most common, but fundamentally our approach is different from this work.
A point of particular interest to our work arises from the modelling of traffic incidents in Acker and Yuan (2019). Here, logistic regression and random forest models were used to predict the likelihood of event occurrences. In-particular, the models incorporated a significant variable defined as 'cascading effect' in which the presence of an event showed significant influence on the likelihood of another nearby in space and time. Although a different type of model to the one we use, and different formulation, there is clearly a sense that cascading effects are a real component in some traffic data that one may want to incorporate into a model. It is unclear what time and length scales are associated with this effect however, and if smart motorway features may remove this effect from the data.
Based on this analysis, it is natural to consider applying spatio-temporal point-process models to traffic events. Recent work considering these models on linear networks is given in Moradi and Mateu (2019), where some specific comments were made in regards to applying them to traffic event modelling. In-particular, the authors considered roadnetworks in Huston, Medellin and Eastbourne and considered what features the data showed. They found statically significant evidence that events on the network did not follow a uniform spatial-temporal Poisson process and that tests indicated favoring clustering of data in space and time. However they focused on urban road networks, where as we focus on highways, in-particular smart motorways. As we consider one continuous ring-road, not an urban network, we can also use a standard distance measure, rather than a graph distance based one as in the cited work. Our choice of dataset is not to simplify the fitting procedure onto continuous space rather than a network, but instead it offers some insight into a much discussed topic of smart motorway safety.
From our review of the literature it is clear that one could ask if point-process models incorporating a self excitation component capture this previously discussed cascading effect. Such models have been applied to a wide range of real-world problems, with a recent review of these models given in Reinhart (2018). One application discussed in this review is earthquake modelling, as in Zhuang et al. (2004), Zhuang (2011) and Kumazawa and Ogata (2014). This is a natural application for such models, as there is strong evidence that initial large earthquakes lead to aftershocks, and hence there is a clearly interpretable 'self-excitation' component to the application. Another application discussed is crime forecasting, as in Mohler et al. (2011) and Mohler (2014), where self-excitation can be seen in physical terms as retaliation crimes, among other things.
Alternative applications discussed include epidemic forecasting in Meyer and Held (2014) and Schoenberg et al. (2019), and modelling social-networks on events as in Fox et al. (2016). Very recently, similar models have been applied to modelling the spread of COVID-19 Lesage (2020), although this is still in its infancy.
In each of these works, there is some sense of a 'background' component, that models the typical behaviour, and a 'triggering' component that allows for self-excitation. There is much discussion as to what functional form the components should take, in-particular for the triggering components. Typically, one supposes some reasonable parametric forms, then determines which is most appropriate through inspection of the log-likelihood value or information criteria. Standard choices include some form of exponential, Gaussian or power-law decay of triggering in time and space. However, recent work in Zhuang and Mateu (2019) showed how one could determine both the background and triggering components in a non-parametric way, through kernel-smoothing of data. In-particular, the authors proposed to model crime data using a background comprised of periodic daily, periodic weekly, long-term trend and spatial components, as-well as triggering in space and time. However, every component is determined without assuming any functional form, instead the authors show how when basing their methods on work in Zhuang (2006), one can determine which events in the data appear to be background and which appear to be triggered, and then smooth data based on this to reconstruct the desired components. It is on this we base most of our work in this paper, and in section 4 we give an overview of the methodology and how we adapt it for our use.
Clearly, there is significant work on applying self-exciting point-process models to problems in crime, earthquakes and epidemics, but there is very little on applying them to traffic events. One paper that does look at this is Li et al. (2018), where the authors proposed a self-exciting point-process model that could theoretically be fit to real traffic data. However, only simulated data is used, generated from the proposed model, to then show how the fitting and evaluation would work. Additionally, a somewhat similar idea was considered in Lim et al. (2016), however importantly here the authors considered traffic flow data to be 'events' and tried to use self-excitation to model the idea that often traffic flow occurs in clusters. They then applied the methods to model traffic flow in Sydney, however there was no clear conclusion as to if the model was statistically defensible and captured all features of the data, or if alternative traffic flow forecasting methods were preferable. There is still an enormous amount of work to be done applying this methodology to real-data, and understanding what components of it are important, and the amount of self excitation present in traffic data, along with appropriate time and length scales it occurs on. Throughout our modelling, we are able to offer some discussions of these questions.

Collection & Selection
The data for this study is taken from the National Traffic Information Service (NTIS) †.
NTIS provides both historic and real-time traffic data for all roads in the U.K. that lie on the 'Strategic Road Network.' This network includes all motorways and major A-roads in the U.K. but we choose to focus our analysis on one the countries busiest motorways, the M25 London Orbital, pictured in Fig. 1. Inside NTIS, roads are represented by a directed graph, with edges (referred to as links from now on) being segments of road that have a constant number of lanes and no slip roads joining or leaving. We extract all links that lie on the M25 in the clockwise direction, yielding a subset of the road network to collect data for. Placed along links are physical sensors called 'loops', which record passing vehicles and report averaged quantities each minute.
The most relevant components of NTIS to our work are event flags that are manually entered by traffic operators. These flags specify an event type, for example accident or †Technical details of the NTIS data feeds are available at http://www.trafficengland.com/ services-info obstruction, as-well as the start time, end time and the link the event occurred on. We extract all accidents and obstructions that occurred on the chosen links between September 1st 2017 and September 30th 2018. However, we require more fine-grained location data for events than just the link it occurred on, so we perform further localization by inspecting adjacent loop sensors. To understand why this is needed, one should note that links vary in size from around 500 to 10,000 meters. As a result, the data is not initially appropriate to model with point-processes, however one can use the time-series provided by individual loop sensors positioned along links to determine higher resolution location data for events. The average distance between successive loop sensors is roughly 500 meters, vastly reducing location uncertainty on longer links. Full details of the localization preformed are provided in the supplementary material.

Model formulation
Our objective is to model the number of events (incidents) observed between any two positions x 1 and x 2 in any time interval between t 1 and t 2 . We denote this quantity by N [x1,x2), [t1,t2) . Since we know that accidents cluster in both space and time, the simplest model is a non-homogeneous Poisson point process. This model is specified by an underlying intensity function, λ(x, t), which is the local event probability per unit length per unit time. It is then assumed that for any intervals, [x 1 , x 2 ) and [t 1 , t 2 ), N [x1,x2),[t1,t2) has a Poisson distribution: where It is natural to assume that the intensity is multiplicatively decomposable: Here µ 0 is a uniform background intensity which has units of events per unit time per unit length. This uniform background is then modulated by the functions µ d (t), µ w (t), µ t (t) and µ s (x) to capture spatio-temporal variation. The spatial modulation, µ s (x), accounts for the fact that different locations will naturally have differing rates of accidents. For example junctions with low visibility having higher rates than straight, simple sections of road. The temporal modulation, µ d (t) µ w (t) µ t (t) consists of three components: µ d (t) represents daily variation, µ w (t) represents weekly variation and µ t (t) represents any long term trend that may be present. Daily and weekly seasonality is a ubiquitous feature of traffic data reflecting daily 'rush-hour' commuting patterns, and weekly differences between the 5-day working week to the 2-day weekend. Annual seasonality may also be present but since our data spans one year, any such variation is captured by the trend.
Inhomogeneous point processes describe clustering solely in terms of variations in the intensity function. A more sophisticated model is a self-exciting point process, known as a Hawkes process, to capture the distinction between primary and secondary incidents. Self-excitation means that when an event occurs, the probability of observing a subsequent event nearby increases. A second term depending on the previous events is added Eq.
(3) to give what is called the conditional intensity function: where A is the triggering rate, g and h are triggering functions that describe how the triggering mechanism decays in time and space respectively and (t i , x i ) are the times and locations of the observed incidents. The word 'conditional' here reflects the dependence of the intensity on the realisation of the process.
Our methodology is based on the work of Zhuang and Mateu (2019), where it is shown how to construct the conditional intensity of a Hawkes process in a non-parametric way by applying kernel smoothing to the observed data on crime. We modify this model to make it applicable to traffic data. The supplementary material, contains the model derivation, estimators of each model component and details of the computational fitting algorithm, following closely Zhuang and Mateu (2019). There are three main changes.
Firstly, the spatial triggering mechanism is one-dimensional and unidirectional: secondary incidents cannot occur downstream of the primary incident. Secondly we enforce monotonicity of the triggering functions, g and h, in Eq. (4) as a constraint to help with identifiability. Thirdly, we apply boundary correction to the kernel density estimates to reduce bias. The first of these is straightforward and detailed in the supplementary material. In the remainder of this section, some further detail is provided on the latter two.

Constraining Triggering Functions
We note that this model aims to explain any residuals from the background process by using the triggering component of the model. However, in-practice, we want to ensure for our investigation that this component truly reflects increased rates of events on a short-time scale (compared to the periodic components) in the wake of a particular event.
As such, it is a natural extension of the original methodology to enforce the triggering functions to be monotonic. Existing work details how to convert a smoothing estimate of data into a monotonic estimate, in-particular we use the methods described in Hall and Huang (2001). In standard kernel smoothing, one writes a smoothed estimate of some dataset (X,Y) as:ν Here, we have some kernel K, and some bandwidth h, and smooth it to return a result ν(x). The work in Hall and Huang (2001) introduces a generalized definition, incorporating a weight p i to each data-point used in the smoothing to 'adjust' the initial smoothed fit to be monotonic. One then writes this adjusted fit as: Our goal is now to choose a weight p i for each data-point i used in the construction of the function, whilst altering the original estimate as little as possible. There are an infinite number of sets of {p 1 , p 2 , . . . , p N } one could choose to enforce a monotonic function, and to identify a unique solution, we choose the set that is as close to the uniform distribution 1 N , . . . , 1 N as possible. One distance measure used in the work to compare the p i 's to a uniform distribution is: Using this, we can then introduce a step in our model fitting where we solve a further optimization problem, determining each p i value to produce a monotonic triggering function. The optimization problem is specified as: Since the temporal triggering decays with increasing time, we would choose to constrain the gradient to be negative. For spatial triggering, we want to enforce decay as the input becomes increasingly negative, meaning we would alter the first condition in the optimization problem. It should be noted that in practice, constraining the triggering function only marginally alters the resulting functions in section 5.5, where we fit the model to small subsets of data. When fit to the entire year, the resulting functions are already monotonic.

Boundary Correction
It is well known that, on a truncated domain, kernel density estimates are biased near the boundary. A simple way to adjust near the boundary is to 'mirror' the data, supposing we have extra data-points. If we have our 'real' data-points X 1 , X 2 , . . . X N and wish to truncate our domain at 0, we introduce extra points −X 1 , −X 2 , · · · − X N , however still evaluate the function and normalize appropriately over the true range of interest. We apply this mirrored correction to the temporal and spatial triggering functions, as-well as the background trend. To avoid edge effects on our spatial domain, we assume that the spatial background is periodic, as the M25 is an almost continuous ring around London.
Recall back to Fig. 1, we have a small section that we do not collect data on and is not a motorway, but this is negligible in comparison to the wider motorway, so assuming a spatial background on a ring is reasonable. We detail explicitly why boundary correction is not needed for the periodic components in the supplementary material, in the context of the daily background, but it extends to all other periodic components.

Results
There are a wide range of real-world issues one could investigate with such a model.
Here we discuss some of the most relevant questions after applying the model and fitting procedure to our data. Initially, we apply the methodology to our entire dataset, 1-year of traffic data on the M25, but later consider if results vary by season. To select the background smoothing bandwidths, we consider natural choices, with daily bandwidth being 60 minutes, weekly bandwidth being 5 × 60 minutes, trend bandwidth being 60 × 24 × 14 minutes and spatial bandwidth being 5500 meters.

Model Selection & Prevalence of Triggering
Given our setup, we first consider some measure of goodness of fit for models containing variations of the discussed components. The first is a simple, homogeneous Poisson process, used as the simplest reference model one could construct. We them compare models with: daily and weekly background components, daily, weekly and trend background components, daily, weekly background and triggering, and daily, weekly, trend back-ground components and triggering. To compare these, we consider the log-likelihood, given by: with a larger value suggesting a better model. Note that in using log-likelihood to judge the goodness of fit, we ignore model complexity, the idea that we could continuously add components to any model and see increasingly marginal improvements as we attain a Including a trend and triggering component further improve the model log-likelihood.
Using this methodology, the parameter A can be interpreted as the proportion of the impact of the triggering function on the total intensity. For our optimal model incorporating all components, this means about 6.65% of events appear to be the result of triggering, in practical terms about 101 events. One may want to consider this as an upper bound, as discussed in section 5.3, along with an appropriate time-scale.

Background Analysis
The background component of the model is clearly strong, as seen when inspecting the changes in log-likelihood from table 2. In our model, this constitutes a daily, weekly, spatial and trend component. We visualize the temporal background components in Fig. 3.  Inspecting Fig. 3a, we see that the daily background increases to an initial peak during the morning rush hour, then remains roughly constant, before rising again to a peak at around 4pm, and decaying after this. From Fig. 3b, there appears to be much less variation in the intensity across the week compared to all other identified components, but a slightly higher intensity on Thursdays and Fridays, and the lowest on Tuesdays and Saturdays. Finally, we see a small increase in the trend during the first 7 weeks of the data, then it remains reasonably flat until week 28, where it begins to rise again. Around week 40, it stabilizes again. This could be due to an increase in actual event intensity, or more comprehensive reporting after a certain point, more operators and so forth, however no changes in reporting are known to us.
As-well as the temporal background, where events are most common around the M25 is of interest. We show the spatial background in Fig. 4, broken into two interpretations.
There is clearly spatial structure in the events as evident in Fig. 4. We see two distinct peaks in Fig. 4a, and a smaller spread out peak in-between these two. The largest peak, around 140 kilometers along the motorway, is located near the 'Potters Bar' junction.
The second largest, around 25 kilometers into the motorway, is located between where the M25 meets the M26, and where the M25 meets the M23. This background intensity, imposed onto a map schematic of the real M25 is plotted in Fig. 4b.

Triggering Analysis
Triggering does appear to improve the log-likelihood of our model, and as we saw in table 2 it explains around 6.65% of events in the data. We visualize the resulting functions in result of this, one should take 6.65% as an upper bounds of sorts, and understand that this, combined with the identified time and length scales is an informative conclusion to draw.

Model Validation
To validate if our model captures the relevant features of a process, one often follows the work of Ogata (1988), Brown et al. (2002) and Zhuang (2006). Specifically, we use the transformed time-sequence given by: then the resulting sequence of values Λ i ∀ i ∈ {1, 2, . . . N } will follow a unit rate Poisson process if the model is correctly specified. To prove this, one can derive that the sequence of values Λ i − Λ i−1 are i.i.d. random variables that follow an exponential distribution with parameter 1, which is done in Brown et al. (2002). Given that we know the expected distribution of Λ i − Λ i−1 , we can then transform this to follow a standard uniform distribution by computing: Now, the computed z i values should follow the simplest known distribution, which we can evaluate by comparing the measured and theoretical cumulative distribution functions (CDF), as-well as the measured and expected quantiles. One can generate confidence bounds for the comparison of two CDFs by inversion of the Kolmogorov-Smirnov statistic, and for quantile-quantile (QQ) plots by using the fact that the order statistics of a uniform distribution follow a beta distribution. The distribution of the k-th order statistic has parameters k and n + 1 − k, with n being the number of sample points.
Using this, we then generate CDF and QQ plots for our different modelling scenarios, given in Fig. 6a and 6b.
It is clear from Fig. 6 that the model incorporating periodic background, trend and triggering components is statistically defensible when inspecting both the CDF and QQ plots of the results. Some of the quantiles in Fig. 6b are just on the edge of acceptable, but do not deviate outside of the confidence bands. These results show that our model is well specified, however extra components could continue to be added to improve the fit.
One may speculate as to what these are, for example the presence of particular weather conditions or similar physical factors.

Do Components Change with Season?
Given our results for the entire year of data, we can question how resilient the background and triggering components are by inspecting subsets of the data. A natural way to separate traffic data is into seasonal periods, being 3-month subsets of data. To test this, we partition our data into 3-month subsets, and fit the model to each subset independently. We then overlay the components in Fig. 7.
It is clear from our results in Fig. 7  respectively. As we do not have access to another year of data, we cannot say if these values vary throughout the year, or if they are increasing over time. We are also not aware of any changes in the reporting of events, however one could imagine that more operators were available to report events in some time periods or similar physical considerations.
We should note also that the trend component is far less impactful on these shorter time-scales, showing little change in the log-likelihood of the model for 3-month subsets.
This is expected, as the time-scale of our trend appears much longer than this. As a result of this however, we can actually perform out-of-sample model validation, something we are not aware of being done previously in the literature for this type of model. Since the trend is omitted, and we see that the triggering components are generally consistent through time, we can train the model on some 3-month subset of data, then perform our validation on unseen data. We show two examples of this in Fig. 8a and 8b.  It is clear from Fig. 8 that on short time-scales, one can use the model for acceptable performance out-of-sample. This is not possible over longer time-scales however, due to the clear trend and varying spatial background evident in the data.

Do Components Change for Significant Events?
Whilst all of the events marked in the system should correspond to actual traffic incidents, many of them may not actually correspond to ones that had a significant impact on the traffic state. Consider the case where two vehicles have a minor collision, with no debris created, and the drivers pull into the closed hard-shoulder to exchange insurance information. In such a case, there should be very little impact on flow, travel time, speed and so on. Additionally, if a vehicle breaks down and pulls into the hard-shoulder, and the road is nowhere near capacity, we should again see little drop in speed or rise in occupancy. To consider only the behaviour of events that have some significant impact, we inspect the link-level data, which contains significantly less noise than the loop-level.
For a given event window, we consider the largest percentage drop in speed between a simple historical median segmentation profile and measured values across the entire window. If this percentage drop is above some threshold, then we say the event caused a significant impact on the traffic state. Further discussion of this simple seasonal segmentation model is given in the supplementary material. Of course, as we raise this threshold we isolate more extreme events but quickly discard so much data that it is no longer reasonable to fit a model. We find that using a threshold above 50% speed drop, we start to discard over 70% of the events data, so consider thresholds between 0 and 50% for analysis. We split our dataset into subsets containing only events that lead to a speed decrease of at-least 0%, 10%, . . . , 50%, then re-run our model fitting on these subsets. The resulting background components are visualized in Fig. 9. Inspecting Fig. 9a, we see the daily background is reasonably stable across different thresholds of significance. The main difference is that as we require a more extreme speed drop, the morning and evening peak structure becomes more clear, and the periods very early in the morning and late in the day are lowered. From Fig. 9b, we observe that the weekend has a lower intensity than weekdays as we raise the threshold for significance. This is likely due to the fact that demand will be significantly lower on weekends compared to weekdays, and hence when an event does occur, there is less chance of an enormous queue forming simply due to the road being further from capacity than on a weekday. Finally, the spatial background in Fig. 9c clearly shows that the second peak, identified around 140 kilometers along the M25, appears to experience more significantly impactful events than the first peak, observed around 25 kilometers along the M25. This suggests that not only is Potters Bar a 'hot-spot' for events, but also that the events here are some of the more extreme on the network.
Analysis of triggering during different significance thresholds shows that the timescales and A values remain consistent throughout. The only visible difference in the triggering functions is that for larger thresholds, the functions decay more quickly.

Conclusion
We have analysed the spatio-temporal variation in the incident rate on London's M25 motorway over a period of one year that distinguishes between primary and secondary accidents. This variation is found to be strongly inhomogeneous. The temporal variation shows a strong daily double peak structure reflecting commuting patterns superimposed in a weaker weekly variation with a peak on Fridays and a trough on Saturdays. This pattern of temporal variation remains stable over the data period. The spatial variation shows two primary peaks in intensity. The first and largest is in the vicinity of the Potters Bar Interchange. The other is in the vicinity of Junctions 5 and 6 where the M26 and M23 join the M25. The peak at Potters Bar appears to increase in intensity during the data period, and is more pronounced when we condition on the most significant events in terms of impact on traffic speed. We find that 6-7% of the observed incidents are most probably secondary incidents under the assumptions of our model. Plausible time and length scales emerge for the range of the triggering effects : 100 minutes in the temporal triggering, and 2 kilometres for spatial triggering. From these figures we conclude that the effects of secondary incidents is a small but detectable feature of the M25 incident data set. Our analysis suggests that, on the M25 at least, the scope to further reduce accident rates by reducing the number of secondary incidents is limited compared to what could be achieved by reducing the peaks at specific times or 'hot-spot' locations.
In terms of further work, it would be very informative to repeat our analysis for a major road without the MIDAS system or other smart motorway features since we expect this infrastructure to reduce the risk of secondary incidents. General comments about clustering of events on urban roads, made in Acker and Yuan (2019), suggest that this behaviour is not exclusive to smart motorways, but also occurs on inner-city roads. We also believe it is important to further validate our methodology for spatial localisation of incidents, ideally against a dataset of true incident locations. One could also consider how to extend our out-of-sample validation methodology to other uses, for example including a simple model that would allow one to extrapolate the trend to allow for incorporation of this component.

Acknowledgments
We thank Dr. Steve Hilditch, Thales UK for sharing expertise on NTIS and UK transportation systems. This work was supported by the EPSRC (grant number EP/L015374/1).

A. Problems With Link Specification of Events
To understand why we can localize events, we detail the two scales of data being provided by NTIS. The first is at the loop-level, with time-series of speed, flow and occupancy being reported each minute for each sensor in the network. The second is at the linklevel, where aggregated data from all sensors that lie on a link is provided each minute, specifically speed, flow, occupancy and travel time. We extract both of these sets of data across our studied domain, and apply a 5-minute rolling average to smooth out the time-series.
When an event is recorded on our network, we are told which link it has occurred on, however NTIS links vary vastly in size, with the distribution of link lengths given in  on the link before we can apply point-process methodology. However, there is hope that the loop sensors are fine-grained enough in space to provide an estimate of this, with Fig. 10b showing that any point in the network should have a loop sensor less than 400 meters away. This is a far higher level of detail than initially provided by NTIS events, so we decide to refine our data further by using the time-series provided by the loop sensors. This is not a trivial task however as we do not have labeled data on which we can train a model, and instead must define sensible criteria that indicate the location of an incident. This is an extensive task, and detailed in section A.1

A.1. Event Localization Methodology
We note from the outset that detection of traffic events in both space and time is an active research field, and various approaches are being considered using an increasing variety of data sources. We are in an atypical situation in that we know a temporal and spatial window in which an event occurred, and are only searching for a more finegrained spatial location within this window. As such, we do not aim to completely solve the event detection problem using inductive loop data, rather we try and develop an effective methodology to take a given window with a known event in and argue what single set of sensors the event may lie between. From discussions with industry experts, we consider the following properties to be clear signatures of a significant traffic event: • upstream of an event location, speed will be decreased and occupancy will be increased compared to the seasonal values • downstream of an event location, speed may still be decreased, but less so than upstream of the accident Doing so, we have a model for each weekday, at the minute level, fit to each loops data separately. We produce one such seasonal model for each traffic variable on a loop.
We then consider what is reasonable to develop with our available data. Ideally, we would design and validate some localization methodology incorporating spatial-temporal information from the loop data, inferring a location from the behaviour of all loops.
However, we have no data with the ground truth locations, so we cannot reasonably develop such a model. Instead, we can use a simple 'rule of thumb' approach based on existing methodology. Numerous historic methodologies in traffic theory Payne et al. (1976), Gall and Hall (1989) compare adjacent sensors to determine two points, one where the data appears to show an incident, and another where the data does not.
We can do the same, and given we know there must be an event in the given window, we can simply ask at what point do we see the largest discrepancy for two adjacent sensors. First, we consider any two sensors in our network, i − 1 and i ∀ i ∈ {2, · · · , N }.
We denote the residual series for speed and occupancy on a sensor i as RS i and RO i respectively. We then compute the spatially differenced values: We then define an 'event impact score' between the two loops as: Since ∆RS i will likely be positive and ∆RO i will likely be negative if an event occurs between i − 1 and i, then we look for the pair of loops at which the value of EIS i is largest, and place our localized event halfway between these two loops. There is of course huge scope to improve upon such a model, but without data to validate more advanced approaches it is sufficient to provide a method that agrees with existing literature and common sense rules. In Fig. 11 we plot an example of localizing an event based on our simple methodology applied to loop sensor data.

B. Model Derivation
Before detailing how one derives the model, we recall from the main text that the conditional intensity function takes the form: and one can enforce monotonicity be solving: min p1,...,pN D 0 (p 1 , . . . , p N ) Exact details on these are discussed in the main text. These give a sense of how much variation there is in the data. This is the first link in our network, so we show the next loop sensor along for a sense of scale. We see that sensors 1 and 2 have large drops in speed and increases in occupancy, however sensor 3 appears reasonably seasonal. Our methodology has then placed the event between sensors 2 and 3.
We now discuss the fitting procedure given in Zhuang and Mateu (2019) with adaption made to uni-variate and uni-directional spatial triggering. Throughout this section, we will denote a kernel function with bandwidth ω as z ω (x), and let m d , m w be the number of data-points (in our case minutes) in a day and week respectively. Then, as discussed in Zhuang and Mateu (2019) and Zhuang (2006), for a spatio-temporal point process N p with conditional intensity function λ(t, x) and a predictable process f (t, x) ≥ 0, over a time domain [T 1 , T 2 ] and space domain S, we can write: Using this, we then outline the methodology provided in Zhuang and Mateu (2019) to construct first the background components of our desired process then the triggering components.

B.1. Background Components
To construct each component our our model, we have a dataset of events, each with a time of occurrence t i and location x i , with i ∈ {1, 2, ..., N }, and our data collected

B.1.1. Spatial Background
The background spatial component, µ s (x) can be constructed as follows. Denote: and then substitute this into equation 16, giving: where ∆x is a small positive value and 1 x is the indicator function, giving one if its input is true and zero otherwise. This suggests that we can write: In equation 19, we essentially have a histogram estimate of µ s (x), and it is natural to considering smoothing such an estimate as in the original work. A common way to do this is through kernel density smoothing, essentially smoothing a discrete data-point in time or space over a wider range of values through some kernel function, k ω (x). Here, we denote ω as the bandwidth of the kernel, and as commonly done specify k to be Gaussian: Applying this smoothing, we then attain: where ω s is a bandwidth specific to the spatial background component, and we have prevented the 'leaking of mass' problem by ensuring that each kernel function is normalized such that its integral is 1 over the domain in question.

B.1.2. Temporal Background
In addition to the spatial background, we also model two periodic background components, daily and weekly. These are constructed very similarly, so we only detail one.
To construct a periodic daily component, we take . However, since the domain is periodic, we account for the fact that some event t i may also have non-zero contribution from its location on the previous and following day. In the context of a periodic temporal function, we would consider an event before our domain as t i − m d ti md − m d and after our domain as t i − m d ti md + m d . We denote the contributions of these components as k lower daily = k ωd t − t i − m d ti md − m d and k upper daily = k ωd t − t i − m d ti md + m d . We can then incorporate these into the estimate as: The same sense of a periodic function having no boundary can be accounted for in the spatial background in the same way, altering equation 22.
Finally, one can derive the same formulation for the background trend component as:

0, otherwise
If we apply this in equation 16, letting f (τ, χ) = ρ(t i , x i , τ, χ)1 τ −ti∈[t−∆t,t+∆t] , then we attain: If we then let s = τ − t i we have: As a result: and hence:ĝ with: As before, the estimator in equation 33 can be smoothed to give: ρi,jkω g (t−tj+ti) In equation 35, we have normalized each kernel density estimate by the remaining time, and then finally dividing by the term N i=1 1 ti+t≤T , which corrects for repetitions as detailed in Zhuang and Mateu (2019).
In the same way, we can write the spatial triggering function as: where the scaling integral has been adapted to account for our one-directional triggering in space.

B.3. Determining Coefficients A, µ 0
In the model specification, we have two coefficients A and µ 0 that specify the how much or little of the triggering and background component respectively enters the conditional intensity function. In-particular, it should be noted that A can be interpreted as the proportion of the impact of the triggering function on the total intensity, so a high A value suggests data is dominated by triggering and the converse for a small value.
These parameters can be determined through maximum likelihood estimation. The loglikelihood function for a spatial-temporal point process model with triggering is given by: Using equation 14, we then see: We then denote: and attain the partial derivatives with respect to A and µ 0 as: Setting these equal to 0 and solving, one attains the following iterative system of equations:

B.4. Fitting Algorithm
Our final reference back to the original work Zhuang and Mateu (2019)  Output: Optimized components µ d , µ w , µ t , µ s , g, h, A, µ 0 For computational speed, we assume triggering after 12 hours is 0, which is informed by considering time-scales similar to the worst recorded traffic jams in the U.K, detailed in INRIX (2019).