Sparse graphs using exchangeable random measures

Summary

Statistical network modelling has focused on representing the graph as a discrete structure, namely the adjacency matrix. When assuming exchangeability of this array—which can aid in modelling, computations and theoretical analysis—the Aldous–Hoover theorem informs us that the graph is necessarily either dense or empty. We instead consider representing the graph as an exchangeable random measure and appeal to the Kallenberg representation theorem for this object. We explore using completely random measures (CRMs) to define the exchangeable random measure, and we show how our CRM construction enables us to achieve sparse graphs while maintaining the attractive properties of exchangeability. We relate the sparsity of the graph to the Lévy measure defining the CRM. For a specific choice of CRM, our graphs can be tuned from dense to sparse on the basis of a single parameter. We present a scalable Hamiltonian Monte Carlo algorithm for posterior inference, which we use to analyse network properties in a range of real data sets, including networks with hundreds of thousands of nodes and millions of edges.


Introduction
The rapid increase in the availability and importance of network data has been a driving force behind the significant recent attention on random-graph models. In devising such models, there are several competing forces: (a) flexibility to capture network features like sparsity of connections between nodes, heavy-tailed node degree distributions, dense spots or block structure; (b) interpretability of the network model and associated parameters; (c) theoretical tractability of analysis of network properties; (d) computational tractability of inference with the ability to scale analyses to large collections of nodes.
A plethora of network models have been proposed in recent decades, each with different trade-offs made between flexibility, interpretability and theoretical and computational tractability; we refer the interested reader to overviews of such models provided by Newman (2003), Bollobás (2001), Durrett (2007) and Fienberg (2012). In this paper, our focus is on providing a new framework in which to make these trade-offs. We demonstrate the ability to make gains in multiple directions using this framework through a specific example where the goal is to capture (i) sparsity, tunable from sparse to dense via interpretable parameters, (ii) heavy-tailed degree distributions, again controlled via interpretable parameters, and (iii) computational tractability of Bayesian inference, scaling to networks with hundreds of thousands of nodes and millions of edges.
Classically, the graph being modelled has been represented by a discrete structure, the adjacency matrix Z, where Z_ij is a binary variable with Z_ij = 1 indicating an edge from node i to node j. In the case of undirected graphs, we furthermore restrict Z_ij = Z_ji. The statistical network model is then devised with this structure representing the observable quantity.
From a modelling, computational and theoretical standpoint, making an assumption of exchangeability is attractive. Under the adjacency matrix representation, such a statement informally equates with an invariance in distribution to permutations of node orderings. One reason why this assumption is attractive can be seen from applying the celebrated Aldous–Hoover theorem to the adjacency matrix: infinite exchangeability of the binary matrix implies a mixture model representation involving transformations of uniform random variables (see theorem 7 in Appendix A.1). For undirected graphs, this transformation is specified by the graphon (Borgs et al., 2008; Lovász, 2013), which was originally studied as the limit object of dense graph sequences (Lovász and Szegedy, 2006; Borgs et al., 2010). The connection with the Aldous–Hoover theorem was made by Diaconis and Janson (2008). The graphon provides an object by which to study the theoretical properties of the statistical network process and to devise new estimators, as has been studied extensively in recent years (Bickel et al., 2011; Rohe et al., 2011; Zhao et al., 2012; Airoldi et al., 2013; Choi and Wolfe, 2014). Furthermore, the mixture model is a cornerstone of Bayesian modelling and provides a framework in which computational strategies are straightforwardly devised. Indeed, the Aldous–Hoover constructive definition has motivated new models (Lloyd et al., 2012), and many popular existing models can be recast in this framework, including the stochastic block model (Nowicki and Snijders, 2001) and the latent space model. One consequence of the Aldous–Hoover theorem is that graphs represented by an exchangeable random array are either empty or dense, i.e. the number of edges grows quadratically with the number of nodes n (see Lovász (2013)). However, empirical analyses suggest that many real world networks are sparse. Formally, sparsity is an asymptotic property of a graph.
We refer to graphs with Θ(n²) edges as dense and graphs with o(n²) edges as sparse (for notation, see Appendix C). The conclusion appears to be that we cannot have both exchangeability, with the associated benefits described above, and sparse graphs. Although network models can often adapt parameters to fit finite graphs, it is appealing to have a modelling framework with theoretically provable properties that are consistent with observed network attributes.
There are a couple of approaches to handling this apparent issue. One is to give up on exchangeability to obtain sparse graphs, as in the popular preferential attachment model (Berger et al., 2014) or configuration model (Bollobás, 1980). Indeed, the networks literature is dominated by sparse non-exchangeable models. Alternatively, there is a body of literature that examines rescaling graph properties with network size n, leading to sparse graph sequences where each graph is finitely exchangeable (Bollobás et al., 2007; Olhede and Wolfe, 2012; Borgs et al., 2014a, b). Convergence of sparse graph sequences, analogous to the study of limiting objects for dense graph sequences, has also been studied (e.g. Borgs et al. (2017)). However, any method building on a rescaling approach provides a graph distribution π_n that lacks projectivity: marginalizing node n does not yield π_{n−1}, the distribution on graphs of size n − 1.

Fig. 1. Point process representation of a random graph: each node i is embedded in R_+ at some location θ_i and is associated with a sociability parameter w_i; an edge between nodes θ_i and θ_j is represented by points at locations (θ_i, θ_j) and (θ_j, θ_i) in R_+²
We instead propose to set aside the discrete structure of the adjacency matrix and examine a different notion of exchangeability for a continuous space representation of networks. In particular, we consider a point process on R_+²,

Z = Σ_{i,j} z_ij δ_{(θ_i, θ_j)},    (1)

where z_ij is 1 if there is a link between node i and node j, and 0 otherwise, and θ_i and θ_j are in R_+ = [0, ∞) (Fig. 1). We can think of θ_i as a time index for node i. Exchangeability, as defined in Section 2, then equates with invariance to the time of arrival of the nodes. See Section 3.5 for a further interpretation of θ_i. We note that exchangeability of the point process representation does not imply exchangeability of the associated adjacency matrix; however, the same modelling, computational and theoretical advantages remain. Interestingly, we arrive at a direct analogue to the constructive representation of the Aldous–Hoover theorem for exchangeable arrays and the associated graphon. Appealing to Kallenberg (2005), chapter 9, a point process on R_+² is exchangeable if and only if it can be represented as a transformation of unit rate Poisson processes and uniform random variables (see theorem 1 in Section 2).
As a case-study in how this exchangeable random-measure framework can enable statistical network models with properties that are different from what can be achieved in the exchangeable array framework, we consider the following specification. To induce node heterogeneity in the link probabilities, we endow each node with a scalar sociability parameter w_i > 0. We then consider a straightforward link probability model: for any i ≠ j,

Pr(z_ij = 1 | w_i, w_j, θ_i, θ_j) = 1 − exp(−2 w_i w_j).    (2)
This link function has been used previously by several others to build network models (Aldous, 1997; Norros and Reittu, 2006; van der Hofstad, 2014). Note that, under this specification, the 'time indices' θ_i and θ_j of nodes i and j do not influence the probability that these two nodes form a link. This is in contrast with, for example, standard latent space models. See Section 4 for further discussion.
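In code, drawing a graph from this link function is immediate; the sketch below uses hypothetical sociability values (the weights and seed are illustrative, not part of the model specification):

```python
import numpy as np

rng = np.random.default_rng(0)

def link_prob(w_i, w_j):
    # Pr(z_ij = 1 | w_i, w_j) = 1 - exp(-2 * w_i * w_j); note that the
    # node locations theta_i, theta_j play no role in this probability.
    return 1.0 - np.exp(-2.0 * w_i * w_j)

w = np.array([0.1, 1.0, 5.0])            # hypothetical sociability weights
p = link_prob(w[:, None], w[None, :])    # matrix of pairwise link probabilities
z = (rng.random(p.shape) < p)            # one Bernoulli draw of the graph
z = np.triu(z, 1)
z = z | z.T                              # symmetric undirected adjacency
```

Larger weights push the link probability towards 1, which is how node heterogeneity enters the model.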
To define the set of (w_i, θ_i)_{i=1,2,...} underlying this statistical network model, we explore the use of completely random measures (CRMs) (Kingman, 1967). The (w_i)_{i=1,2,...} are the jumps of the CRM and the (θ_i)_{i=1,2,...} the locations of the atoms. We show that, by carefully choosing the Lévy measure characterizing this CRM, we can construct graphs ranging from sparse to dense. In particular, any Lévy measure yielding an infinite activity CRM leads to sparse graphs; alternatively, finite activity CRMs yield dense graphs. For the class of infinite activity regularly varying CRMs, we can sharpen the results to obtain graphs where the number of edges increases at a rate below n^a, where 1 < a < 2 depending on the Lévy measure. We focus on the flexible generalized gamma process CRM and show that one can tune the graph from dense to sparse via a single parameter.
Building on the framework of CRMs leads to other desirable properties as well. One is that our CRM-based exchangeable point process leads to an analytic representation for the graphon analogue in the Kallenberg framework (see Section 5.1). Another is that, by drawing on the considerable theory of CRMs that has been well studied in the Bayesian non-parametric community, we can derive network simulation techniques and develop a principled statistical estimation procedure. For the latter, in Section 7 we devise a scalable Hamiltonian Monte Carlo (HMC) sampler that can automatically handle a range of graphs from dense to sparse. We empirically show in Section 8 that our methods scale to graphs with hundreds of thousands of nodes and millions of edges. Importantly, exchangeability of the random measure underlies the efficiency of the sampler.
In summary, the CRM-based formulation combined with the specific link model of equation (2) serves as a proof of concept that moving to the point process representation of equation (1) can yield models with desirable attributes that are different from what can be obtained by using the discrete adjacency matrix representation. More generally, the notion of modelling the graph as an exchangeable random measure and appealing to a Kallenberg representation for such exchangeable random measures serves as an important new framework for devising and studying random-graph models, just as the original graphon concept stimulated considerable work in the network community in the past decade.
Our paper is organized as follows. In Section 2, we provide background on exchangeability and CRMs. Our statistical network models for directed multigraphs, undirected graphs and bipartite graphs are in Section 3. A discussion of our framework compared with related network models is provided in Section 4. Properties, such as exchangeability and sparsity, and methods for simulation are presented in Section 5. Specific cases of our formulation leading to dense and sparse graphs are considered in Section 6, including an empirical analysis of network properties of our proposed formulation. Our Markov chain Monte Carlo (MCMC) posterior computations are in Section 7. Finally, Section 8 provides a simulation study and an analysis of a variety of large, real world graphs.
The programs that were used to analyse the data can be obtained from http://wileyonlinelibrary.com/journal/rss-datasets

Background on exchangeability
Our focus is on exchangeable random structures that can represent networks. We first briefly review exchangeability for random sequences, continuous time processes and discrete network arrays. Thorough and accessible overviews of exchangeability of random structures are presented in the survey of Aldous (1985). Here, we simply abstract away the notions that are relevant to placing our network formulation in context, as summarized in Table 1. The classical representation theorem arising from a notion of exchangeability for discrete sequences of random variables is due to de Finetti (1931). The theorem states that a sequence Z_1, Z_2, ... with Z_i ∈ Z is exchangeable if and only if there is a random probability measure P on Z with law ν such that the Z_i are conditionally independent and identically distributed (IID) given P; i.e. all exchangeable infinite sequences can be represented as a mixture with mixing measure ν. For continuous time processes instead of sequences, the representation associated with exchangeable increments was given by Bühlmann (1960) (see also Freedman (1996)) in terms of mixtures of Lévy processes.
The focus of our work, however, is on graph structures. For generic matrices Z in some space Z, an (infinite) exchangeable random array (Diaconis and Janson, 2008; Lauritzen, 2008) is one such that

(Z_ij) =_d (Z_{π(i)π′(j)}) for (i, j) ∈ N²,    (3)

for any permutations π, π′ of N (separate exchangeability), or for any permutation π = π′ of N (joint exchangeability), where the notation '=_d' stands for 'equal in distribution'. A representation theorem for exchangeability of the classical discrete adjacency matrix Z arises by specializing the Aldous–Hoover theorem to 2-arrays.
For undirected graphs where Z is a binary, symmetric adjacency matrix, the Aldous–Hoover representation can be expressed as the existence of a graphon. For completeness, the Aldous–Hoover theorem (specialized to 2-arrays under joint exchangeability) and further details on the graphon are provided in Appendix A.1. Throughout this paper, we instead consider representing a graph as a point process Z = Σ_{i,j} z_ij δ_{(θ_i, θ_j)} with nodes θ_i embedded in R_+, as in equation (1), and then examine notions of exchangeability in this context. Paralleling result (3), the point process Z on R_+² is exchangeable if and only if (Kallenberg, 2005, chapter 9)

(Z(A_i × A_j)) =_d (Z(A_{π(i)} × A_{π′(j)})) for (i, j) ∈ N²,    (4)

for any permutations π, π′ of N, with π = π′ in the jointly exchangeable case, and any intervals A_i = [h(i − 1), hi] with i ∈ N and h > 0. In words, result (4) states that the point process Z is exchangeable if, for any arbitrary regular square grid on the plane, the associated infinite array of increments (edge counts between nodes in a square) is exchangeable (Fig. 2). This provides a notion of exchangeability akin to that of the Aldous–Hoover theorem, but fundamentally different, as the array being considered here is formed on the basis of an underlying continuous process. This array is not equivalent to an adjacency matrix, regardless of how fine a grid is considered. Kallenberg derived de-Finetti-style representation theorems for separately and jointly exchangeable random measures on R_+², which we present for the jointly exchangeable case in theorem 1. In what follows, λ denotes the Lebesgue measure on R_+, λ_D the Lebesgue measure on the diagonal D = {(s, t) ∈ R_+² | s = t} and Ñ² = {{i, j} | (i, j) ∈ N²}. We also define a U-array to be an array of independent uniform random variables.
We can think of the θ_i as random time indices, the sets {θ_i} × R_+ and R_+ × {θ_j} forming Poisson processes of vertical and horizontal lines. The representation in theorem 1 is slightly more involved than the representation theorem for exchangeable arrays (see Appendix A.1). The first component of ξ is, however, similar to the representation for exchangeable arrays, the sequence of fixed indices i = 1, 2, ... and uniform random variables (U_i)_{i=1,2,...} in equation (46) in Appendix A.1 being replaced by a unit rate Poisson process {(θ_i, ϑ_i)} on R_+². We place our proposed network model of Section 3 within this Kallenberg representation in Section 5.1, yielding direct analogues to the classical graphon representation of graphs based on exchangeability of the adjacency matrix.

Proposed statistical network model
Recall that we represent an undirected graph using an atomic measure with the convention z_ij = z_ji ∈ {0, 1}. Here, z_ij = z_ji = 1 indicates an undirected edge between nodes θ_i and θ_j. See Section 3.5 for the interpretation of θ_i.
There are many options for defining a statistical model for the point process graph representation Z. We consider one in particular in this paper based on a specific choice of (a) link probability model and (b) a prior on the model parameters.
Expanding on the discussion of Section 1, we introduce a collection of per-node sociability parameters w = {w_i} and specify link probabilities via

Pr(z_ij = 1 | w) = 1 − exp(−2 w_i w_j) for i ≠ j,  Pr(z_ii = 1 | w) = 1 − exp(−w_i²).    (6)

As mentioned in Section 1, this link probability model is not new to the statistical networks community and is a straightforward method for achieving node heterogeneity (see Aldous (1997) and Norros and Reittu (2006)).

Defining node parameters by using completely random measures
The model parameters consist of a collection of node-specific sociability parameters w_i > 0 and continuous-valued node indices θ_i ∈ R_+.
Our generative model jointly specifies (w_i, θ_i)_{i=1,2,...} by first defining an atomic random measure

W = Σ_{i=1}^∞ w_i δ_{θ_i},    (7)

and then taking W to be distributed according to a homogeneous CRM (Kingman, 1967). CRMs have been used extensively in the Bayesian non-parametric literature for proposing flexible classes of priors over function spaces (Regazzini et al., 2003; Lijoi and Prünster, 2010). We briefly review a few important properties of CRMs that are relevant to our construction; the reader can refer to Kingman (1993) or Daley and Vere-Jones (2008) for an exhaustive coverage.
A CRM W on R_+ is a random measure such that, for all finite families of disjoint, bounded measurable sets (A_1, ..., A_n) of R_+, the random variables W(A_1), ..., W(A_n) are mutually independent.
We shall focus here on CRMs with no deterministic component and stationary increments (i.e. the distribution of W([s, t]) depends only on t − s). In this case, the CRM takes the form (7), with (w_i, θ_i)_{i∈N} the points of a Poisson point process on (0, ∞) × R_+ defined by a mean measure ν(dw, dθ) = ρ(dw) λ(dθ), where λ is the Lebesgue measure and ρ is a Lévy measure on (0, ∞).
We denote this by

W ∼ CRM(ρ, λ).    (8)

Note that W([0, T]) < ∞ almost surely for any T < ∞, whereas W(R_+) = ∞ almost surely whenever ρ is non-trivial, i.e. ∫_0^∞ ρ(dw) > 0. The jump part ρ of the mean measure, which characterizes the increments of W, is of particular interest in our graph construction, as explored in Section 5. If ρ satisfies the condition

∫_0^∞ ρ(dw) = ∞,    (9)

then there will be an infinite number of jumps in any interval [0, T], and we refer to the CRM as infinite activity. Otherwise, the number of jumps is finite almost surely. In our model, the jumps correspond to potentially connected nodes, i.e. these nodes need not be connected to any other node within a bounded interval and instead represent an upper bound on the set of connected nodes. See Section 3.5 for further discussion.
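As an illustration of the finite activity case, consider the Lévy measure ρ(dw) = c e^{−w} dw, whose total mass c is finite; a draw of W restricted to [0, T] is then a marked Poisson process with a Poisson(cT) number of atoms. The choice of c, T and the exponential jump law below are illustrative, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_finite_activity_crm(c=2.0, T=10.0):
    """Draw the atoms of W on [0, T] for the finite activity Levy measure
    rho(dw) = c * exp(-w) dw (an illustrative choice)."""
    # Total mass of rho is c < infinity, so the number of atoms falling
    # in [0, T] is Poisson(c * T) and finite almost surely.
    K = rng.poisson(c * T)
    theta = rng.uniform(0.0, T, size=K)   # atom locations, from lambda
    w = rng.exponential(1.0, size=K)      # jump sizes, from normalized rho
    return w, theta

w, theta = sample_finite_activity_crm()
W_mass = w.sum()                          # W([0, T]) < infinity a.s.
```

Infinite activity CRMs require a truncation or an adaptive thinning scheme instead, since the number of jumps in any interval is infinite.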
In Section 6, we consider special cases including the (compound) Poisson process and the generalized gamma process (GGP) (Brix, 1999).

Directed multigraphs
Formally, our undirected graph model is viewed as a transformation of a directed integer-weighted graph, or multigraph, as detailed in Section 3.3. We now specify this directed multigraph. Although our primary focus is on undirected network models, in some applications the directed multigraph might actually be the quantity of direct interest. For example, in social networks, interactions are often not only directed ('person i messages person j') but also have an associated count. Additionally, interactions might be typed ('message', 'SMS', 'like', 'tag'). Our proposed framework could be directly extended to model such data.
Let V = (θ_1, θ_2, ...) be a countably infinite set of node indices with θ_i ∈ R_+. We represent the directed multigraph of interest with an atomic measure on R_+²,

D = Σ_{i,j} n_ij δ_{(θ_i, θ_j)},    (10)

where n_ij counts the number of directed edges from node i to node j, with time indices θ_i and θ_j. See Fig. 3 for an illustration. Given W as defined in expressions (7) and (8), D is simply generated from a Poisson process with intensity given by the product measure W̃ = W × W on R_+²: i.e., informally, the individual counts n_ij are generated as Poisson(w_i w_j). (We consider a generalized definition of a Poisson process, where the mean measure is allowed to have atoms (Daley and Vere-Jones (2003), section 2.4).) By construction, for any A, B ⊂ R_+, we have W̃(A × B) = W(A) W(B). On any bounded interval A of R_+, W(A) < ∞, implying that W̃(A × A) has finite mass.

Undirected graphs via transformations of directed graphs
We arrive at the undirected graph via a simple transformation of the directed graph: set z_ij = z_ji = 1 if n_ij + n_ji > 0 and z_ij = z_ji = 0 otherwise, i.e. place an undirected edge between nodes θ_i and θ_j if and only if there is at least one directed interaction between the nodes. In this definition of an undirected graph, we allow self-edges. These could represent, for example, a person posting a message on his or her own profile page. The resulting hierarchical model is

W ∼ CRM(ρ, λ),
D | W ∼ PP(W × W),
Z = Σ_{i,j} min(n_ij + n_ji, 1) δ_{(θ_i, θ_j)}.

This process is depicted graphically in Fig. 4. To see the equivalence between this formulation and that specified in equation (6), note that, for i ≠ j, Pr(z_ij = 1 | w) = Pr(n_ij + n_ji > 0 | w). By properties of the Poisson process, n_ij and n_ji are independent random variables conditioned on W. The sum of two Poisson random variables, each with rate w_i w_j, is again Poisson with rate 2 w_i w_j. Result (6) then arises from the fact that Pr(n_ij + n_ji > 0 | w) = 1 − Pr(n_ij + n_ji = 0 | w). Likewise, the i = j case arises by similar reasoning applied to Pr(z_ii = 1 | w) = Pr(n_ii > 0 | w).
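Conditionally on the weights, this construction can be sketched directly: the directed counts are independent Poisson draws, and the undirected graph is their thresholded symmetrization. The weights below are illustrative constants, standing in for jumps that would come from a CRM:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sociability weights (in the model these are the jumps of W).
w = np.array([0.2, 0.8, 1.5, 3.0])

# Directed multigraph: n_ij | w ~ Poisson(w_i * w_j), independently.
n = rng.poisson(np.outer(w, w))

# Undirected graph: z_ij = min(n_ij + n_ji, 1).  For i = j this reduces to
# z_ii = min(n_ii, 1), since min(2 * n_ii, 1) = min(n_ii, 1) for integers.
z = np.minimum(n + n.T, 1)
```

The thresholding is exactly the transformation used later for inference, where the counts n_ij are reintroduced as latent variables.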
We note that our computational strategy of Section 7 relies on this interpretation of our model for undirected graphs as a transformation of a directed multigraph. In particular, we introduce the directed edge counts as latent variables and do inference over these counts.

Fig. 4. Example of (a) the product measure W̃ = W × W for CRM W, (b) a draw of the directed multigraph measure D | W ∼ PP(W × W) and (c) the corresponding undirected measure Z = Σ_{i=1}^∞ Σ_{j=1}^∞ min(n_ij + n_ji, 1) δ_{(θ_i, θ_j)}

Bipartite graphs
The above construction can also be extended to bipartite graphs. Let V = (θ_1, θ_2, ...) and V′ = (θ′_1, θ′_2, ...) be two countably infinite sets of nodes with θ_i, θ′_i ∈ R_+. We assume that only connections between nodes of different sets are allowed.
We represent the directed bipartite multigraph of interest by an atomic measure on R_+²,

D = Σ_{i,j} n_ij δ_{(θ_i, θ′_j)},

where n_ij counts the number of directed edges from node θ_i to node θ′_j. Similarly, the bipartite graph is represented by an atomic measure Z = Σ_{i,j} min(n_ij, 1) δ_{(θ_i, θ′_j)}. Our bipartite graph formulation introduces two independent CRMs, W ∼ CRM(ρ, λ) and W′ ∼ CRM(ρ′, λ), whose jumps correspond to sociability parameters for nodes in sets V and V′ respectively. The generative model for the bipartite graph mimics that of the non-bipartite graph. Model (14) was proposed by Caron (2012) in a slightly different formulation. In this paper, we recast this model within our general framework, enabling new theoretical and practical insights.
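A sketch of the bipartite analogue follows; conditionally on the two weight sequences (illustrative constants here, standing in for the jumps of the two CRMs), the counts between sets are independent Poisson draws:

```python
import numpy as np

rng = np.random.default_rng(3)

w1 = np.array([0.5, 1.2, 2.0])        # illustrative weights for node set V
w2 = np.array([0.3, 0.9, 1.5, 4.0])   # illustrative weights for node set V'

# Directed bipartite counts: n_ij | w1, w2 ~ Poisson(w1_i * w2_j).
n = rng.poisson(np.outer(w1, w2))

# Bipartite graph: an edge whenever at least one directed interaction
# occurred, so Pr(z_ij = 1 | w1, w2) = 1 - exp(-w1_i * w2_j).
z = np.minimum(n, 1)
```

Only the cross-set block is generated, reflecting the restriction that connections within a set are not allowed.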

Interpretation of θ i
We can think of the positive, continuous-valued node index θ_i as representing the time at which a potential node enters the network and has the opportunity to link with other existing nodes θ_j < θ_i. We use the terminology 'potential node' here to clarify that this node need not form any observed connections with other nodes existing before time θ_i. We emphasize that an observed link between θ_i and some other node θ_k > θ_i will eventually occur almost surely as time progresses. This could represent, for example, signing onto a social networking service before your friends do and only forming a link once they join. On the basis of our CRM specification, we have almost surely an infinite number of potential nodes as time goes to ∞. For infinite activity CRMs, we have almost surely an infinite set of potential nodes even at any finite time.
In Section 5, we examine properties of the network process across time, and we describe methods for simulating networks at any finite time. There, our focus is on the observed link process from this set of potential nodes. For example, sparsity is examined with respect to the set of nodes with degree at least 1, not with respect to the set of potential nodes. Since we need not think of θ_i as a time index, but rather just a general construct of our formulation, we also generically refer to θ_i as the node location in the remainder of the paper.

Related work
There has been extensive work over recent years on flexible Bayesian non-parametric models for networks, allowing complex latent structures of unknown dimension to be uncovered from real world networks (Kemp et al., 2006; Miller et al., 2009; Lloyd et al., 2012; Palla et al., 2012; Herlau et al., 2014). However, as noted in unifying overviews of this literature, these methods all fit in the Aldous–Hoover framework and as such produce dense graphs. Norros and Reittu (2006) proposed a conditionally Poissonian multigraph process that shares similarities with our multigraph process. In their formulation, each node has a given sociability parameter, and the number of edges between two nodes i and j is drawn from a Poisson distribution whose rate is the product of the sociability parameters, normalized by the sum of the sociability parameters of all the nodes. The normalization makes this model similar to models based on rescaling of the graphon and, as such, it does not define a projective model, as explained in Section 1. See van der Hofstad (2014) for a review of this model and Britton et al. (2006) for a similar model.
As pointed out by Jacobs and Clauset (2014) in their discussion of an earlier version of this paper, another related model is the degree-corrected random-graph model, where edges of the multigraph are drawn from a Poisson distribution whose rate is the product of node-specific sociability parameters and a parameter tuning the interaction between the latent communities to which these nodes belong. When the sociability parameters are assumed to be IID from some distribution, this model yields an exchangeable adjacency matrix and thus a dense graph.
Additionally, there are similarities to be drawn with the extensive literature on latent space modelling (e.g. Penrose (2003) and Hoff (2009)). In such models, nodes are embedded in a low dimensional, continuous latent space and the probability of an edge is determined by a distance or similarity metric applied to the node-specific latent factors. In our case, the continuous node index θ_i plays no role in forming edge probabilities. It would, however, be possible to extend our approach to time- or location-dependent connections by considering inhomogeneous CRMs.
Finally, as we shall detail in Section 5.5, our model admits a construction with connections to the configuration model (Bollobás, 1980), a popular model for generating simple graphs with a given degree sequence. The connections with this broad set of past work place our proposed network model within the context of the existing literature. Importantly, however, to the best of our knowledge this work represents the first fully generative and projective approach to sparse graph modelling (see Section 5), with a notion of exchangeability that is essential for devising our scalable statistical estimation procedure, as shown in Section 7.

General properties and simulation
We provide general properties of our network model depending on the properties of the Lévy measure ρ.
The proof is given in Appendix B. In the adjacency matrix representation, we think of exchangeability as invariance to node orderings. Here, we have invariance to the time of arrival of the nodes, thinking of θ_i as a time index.
We now reformulate our network process in the Kallenberg representation of theorem 1. Because of exchangeability, we know that such a representation exists. What we show here is that our CRM-based formulation has an analytic and interpretable representation. In particular, the CRM W can be constructed from a two-dimensional unit rate Poisson process on R_+² by using the inverse Lévy method (Khintchine, 1937; Ferguson and Klass, 1972). Let (θ_i, ϑ_i) be a unit rate Poisson process on R_+². Let ρ̄(x) be the tail Lévy intensity

ρ̄(x) = ∫_x^∞ ρ(dw).    (15)

Then the CRM W = Σ_i w_i δ_{θ_i} with Lévy measure ρ(dw) dθ can be constructed from the bidimensional point process by taking w_i = ρ̄^{−1}(ϑ_i). Note that the inverse Lévy intensity ρ̄^{−1} is a monotone function. It follows that our undirected graph model can be formulated under the Kallenberg representation by selecting any α_0, with β_0 = γ_0 = 0, g = g′ = 0, h = h′ = l = l′ = 0 and f as in expression (16).

Fig. 5. Illustration of the network construction from the Kallenberg representation; the function M is indicated by the blue shading (darker shading indicates higher value) for a stable process (GGP with τ = 0); in this case there is an analytic expression for ρ̄^{−1} and therefore M

In Section 6, we provide explicit forms for ρ̄ depending on our choice of Lévy measure ρ. Expression (16) represents a direct analogue to that arising from the Aldous–Hoover framework. In particular, M here is akin to the graphon ω of expression (47) in Appendix A.1, and thus allows us to connect our CRM-based formulation with the extensive literature on graphons. An illustration of the network construction from the Kallenberg representation, including the function M, is in Fig. 5. Note that, if we had started from the Kallenberg representation and selected an f (or M) arbitrarily, we would probably not have obtained a network model with the normalized CRM interpretation that enables both interpretability and analysis of network properties.
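As a concrete instance of the inverse Lévy construction, the stable process (GGP with τ = 0) admits a closed-form tail inverse: with ρ(dw) = w^{−1−σ} dw / Γ(1−σ), the tail intensity is ρ̄(x) = x^{−σ} / {σ Γ(1−σ)}. The sketch below simulates a truncated draw; σ, α and the truncation level K are illustrative settings, and the parametrization is our reading of the Brix (1999) form rather than a quotation from this excerpt:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(4)

def stable_crm_truncated(sigma=0.5, alpha=10.0, K=1000):
    """Simulate the K largest jumps of a stable CRM on [0, alpha]
    via the inverse Levy method (sigma, alpha, K are illustrative)."""
    # Restricted to locations in [0, alpha], the marks vartheta_1 <
    # vartheta_2 < ... of the unit rate Poisson process on R_+^2 form
    # a rate-alpha Poisson process on R_+: cumulative Exp(alpha) gaps.
    vartheta = np.cumsum(rng.exponential(1.0 / alpha, size=K))
    theta = rng.uniform(0.0, alpha, size=K)
    # Tail Levy intensity rho_bar(x) = x^(-sigma) / (sigma * Gamma(1 - sigma))
    # inverts to w = (sigma * Gamma(1 - sigma) * vartheta)^(-1 / sigma).
    w = (sigma * gamma(1.0 - sigma) * vartheta) ** (-1.0 / sigma)
    return w, theta

w, theta = stable_crm_truncated()
```

Since ρ̄^{−1} is decreasing, the jumps come out in decreasing order; truncating at K discards only the smallest jumps.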
For the bipartite graph, Kallenberg's representation theorem for separate exchangeability (Kallenberg, 2005, theorem 9.23) can likewise be applied.

Interactions between groups
For any disjoint sets of nodes A, B ⊂ R_+ with A ∩ B = ∅, the probability that there is at least one connection between a node in A and a node in B is given by

Pr(Z(A × B) > 0 | W) = 1 − exp{−2 W(A) W(B)},

i.e. the probability of a between-group edge depends on the sums of the sociabilities in each group, W(A) and W(B).
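This mirrors the per-pair link probability, with group masses replacing individual weights: the pairwise Poisson counts between the two groups superpose to a single Poisson variable with rate 2 W(A) W(B). A sketch with illustrative weights:

```python
import numpy as np

w = np.array([0.2, 0.8, 1.5, 3.0])           # illustrative weights
in_A = np.array([True, True, False, False])  # group A
in_B = ~in_A                                 # disjoint group B

W_A = w[in_A].sum()                          # W(A) = sum of weights in A
W_B = w[in_B].sum()                          # W(B)

# Probability of at least one edge between A and B: the directed counts
# across the two groups superpose to a Poisson(2 * W(A) * W(B)) total.
p_between = 1.0 - np.exp(-2.0 * W_A * W_B)
```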

Graph restrictions
Let us consider the restriction of our process to the square [0, α]². For finite activity CRMs, there will be a finite number of potential nodes (jumps) in the interval [0, α]. For infinite activity CRMs, we shall have an infinite number of potential nodes. We are interested in the properties of the process as α grows, where we can think of α as representing time and observing the process as new potential nodes and any resulting edges enter the network. We note that, in the limit of α → ∞, the number of edges approaches ∞ since W(R_+) = ∞ almost surely.
Let D_α and Z_α be the restrictions of D and Z respectively to the square [0, α]². Then (D_α)_{α≥0} and (Z_α)_{α≥0} are measure-valued stochastic processes, indexed by α. We also denote by W_α and λ_α the corresponding CRM and Lebesgue measure on [0, α]. In what follows, our interest is in studying how the following quantities vary with α: (a) N_α, the number of nodes with degree at least 1 in the network, and (b) N^(e)_α, the number of edges in the undirected network. We refer to N_α as the number of observed nodes. In our construction, recall that (N_α)_{α≥0} and (N^(e)_α)_{α≥0} are non-decreasing, integer-valued stochastic processes corresponding to the number of nodes with at least one connection in Z_α and the number of edges in Z_α respectively. The two processes have the same jump times, which correspond to the addition of one or more new nodes with at least one connection in the graph. An example of these processes is represented in Fig. 6. In later sections we use Z*_α = Z_α([0, α]²) to denote the total mass on [0, α]², and similarly for D*_α and W*_α.
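Given a draw of the undirected graph on [0, α]², the two statistics can be read off directly. The sketch below reuses the Poisson construction with illustrative exponential weights standing in for the jumps of W_α:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative weights standing in for the jumps of W on [0, alpha].
w = rng.exponential(1.0, size=50)

n = rng.poisson(np.outer(w, w))     # directed counts n_ij
z = np.minimum(n + n.T, 1)          # undirected adjacency, self-edges kept

degree = z.sum(axis=1)
N_alpha = int((degree > 0).sum())   # N_alpha: nodes with at least one link
N_edges = int(np.triu(z).sum())     # N_alpha^(e): undirected edges, each
                                    # pair counted once (self-edges included)
```

Repeating this for growing α (i.e. growing weight sets) traces out the two non-decreasing processes described above.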

Sparsity
In this section we state the sparsity properties of our graph model, which relate to the properties of the Lévy measure ρ. In particular, we are interested in the asymptotic behaviour of the number of edges N_α^(e) relative to the number of observed nodes N_α as α → ∞. Henceforth, we consider ∫_0^∞ ρ(dw) > 0, since the case ∫_0^∞ ρ(dw) = 0 trivially gives N_α^(e) = N_α = 0 almost surely.
In theorem 2 we characterize the sparsity of the graph with respect to the properties of its Lévy measure: graphs obtained from infinite activity CRMs are sparse, whereas graphs obtained from finite activity CRMs are dense. The rate of growth can be further specified when ρ is a regularly varying Lévy measure (Karlin, 1967; Feller, 1971; Gnedin et al., 2006, 2007), as defined in Appendix A.2. We follow the notation of Janson (2011) for probability asymptotics (see Appendix C.1 for details).
Theorem 2. Consider a point process Z representing an undirected graph. Let N_α^(e) be the number of edges and N_α be the number of observed nodes in the point process restriction Z_α (see equations (17) and (18)). Assume that the defining Lévy measure is such that ∫_0^∞ w ρ(dw) < ∞. If the CRM W is finite activity, i.e. ∫_0^∞ ρ(dw) < ∞, then the graph is almost surely dense:

N_α^(e) = Θ(N_α²) almost surely as α → ∞.    (19)

If W is infinite activity, i.e. ∫_0^∞ ρ(dw) = ∞, then the graph is almost surely sparse:

N_α^(e) = o(N_α²) almost surely as α → ∞.    (20)

Furthermore, if ρ is regularly varying with exponent σ ∈ (0, 1), then

N_α^(e) = O(N_α^{2/(1+σ)}) almost surely as α → ∞.    (21)

Theorem 2 is a direct consequence of two theorems that we state now and prove in Appendix C. The first states that the number of edges grows quadratically with α, whereas the second states that the number of nodes scales superlinearly with α for infinite activity CRMs, and linearly otherwise.
Theorem 3. The number of edges satisfies N_α^(e) = Θ(α²) almost surely as α → ∞.    (22)

Theorem 4. The number of observed nodes satisfies N_α = Θ(α) if W is a finite activity CRM, and N_α = ω(α) if W is an infinite activity CRM,    (23)

almost surely as α → ∞. In words, the number of nodes with degree at least 1 in Z_α scales linearly with α for finite activity CRMs and superlinearly with α for infinite activity CRMs. Furthermore, for a regularly varying Lévy measure with slowly varying function ℓ such that lim inf_{t→∞} ℓ(t) > 0, we have

N_α = Ω(α^{σ+1}) almost surely as α → ∞.    (24)

We finally give the expressions for the expected numbers of edges and nodes in the model. The proof is given in Appendix C.4. (Equations (26) and (27) could alternatively be derived as particular cases of existing results on CRMs.)

Theorem 5. The expected number of edges D*_α in the multigraph, of edges N_α^(e) in the undirected graph and of observed nodes N_α are given as follows:

E[D*_α] = α ∫_0^∞ w² ρ(dw) + α² {∫_0^∞ w ρ(dw)}²,    (25)
E[N_α^(e)] = (α²/2) ∫_0^∞ ∫_0^∞ {1 − exp(−2ww′)} ρ(dw) ρ(dw′) + α ∫_0^∞ {1 − exp(−w²)} ρ(dw),    (26)
E[N_α] = α ∫_0^∞ {1 − exp(−w²) exp(−α ψ(2w))} ρ(dw),    (27)

where ψ(t) = ∫_0^∞ {1 − exp(−wt)} ρ(dw) is the Laplace exponent. Additionally, if ρ is a regularly varying Lévy measure with exponent σ ∈ [0, 1) and slowly varying function ℓ, then

E[N_α] = Θ{α^{1+σ} ℓ(α)} as α → ∞.    (28)

Direct simulation of graph restrictions
By definition, the directed multigraph restriction D_α is drawn from a Poisson process with finite mean measure W_α × W_α, where W_α ∼ CRM(ρ, λ_α). Leveraging standard properties of the CRM and Poisson process, we can first simulate the total number of directed edges D*_α given the total mass W*_α:

D*_α | W_α ∼ Poisson{(W*_α)²}.    (29)

For k = 1, …, D*_α, a particular edge is drawn by sampling a pair of nodes

U_k1, U_k2 | W_α ∼ iid W_α / W*_α,    (30)

where W_α/W*_α is called a normalized CRM. We form directed edges (U_k1, U_k2), resulting in

D_α = Σ_{k=1}^{D*_α} δ_{(U_k1, U_k2)}.    (31)

Because of the discreteness of W_α, there will be ties between the (U_k1, U_k2), and the number of such ties corresponds to the multiplicity of that edge. In particular, a total of 2D*_α nodes U_kj are drawn but result in some N_α ≤ 2D*_α distinct values. We overload the notation N_α here because this quantity also corresponds to the number of nodes with degree at least 1 in the resulting undirected network. Recall that the undirected network construction simply forms an undirected edge between a pair of nodes if there is at least one directed edge between them. If we consider unordered pairs {U_k1, U_k2}, the number of distinct values is some N_α^(e) ≤ D*_α, where N_α^(e) corresponds to the number of edges in the undirected network. The construction above enables us to re-express our Cox process model in terms of normalized CRMs (Regazzini et al., 2003). This is very attractive both practically and theoretically. As we show in Section 6 for special cases of CRMs, one can use the results surrounding normalized CRMs to derive an exact simulation technique for our directed and undirected graphs.
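The Cox process sampling steps (29)-(31) reduce to a few lines once a finite vector of weights is available (e.g. from a finite activity or truncated CRM). The following Python sketch is ours, not the authors' code; the function name is hypothetical.

```python
import numpy as np

def sample_graph_given_weights(w, rng=None):
    """Cox process sampling of the directed multigraph and its undirected
    version, given a finite vector of sociability weights w."""
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    W_star = w.sum()
    # total number of directed edges: D* | W ~ Poisson((W*)^2), eq. (29)
    D_star = rng.poisson(W_star ** 2)
    # each endpoint drawn iid from the normalized CRM w / W*, eq. (30)
    ends = rng.choice(len(w), size=(D_star, 2), p=w / W_star)
    # undirected simple graph: edge if at least one directed edge, eq. (31)
    undirected = {tuple(sorted(e)) for e in map(tuple, ends)}
    return ends, undirected

directed, undirected = sample_graph_given_weights([2.0, 1.0, 0.5, 0.1], rng=0)
```

Ties among the sampled endpoint pairs give the edge multiplicities of the multigraph, while the set of distinct unordered pairs gives the undirected graph.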
Remark 1. The construction above enables us to draw connections with the configuration model (Bollobás, 1980), which proceeds as follows. First, the degree k_i of each node i = 1, …, n is specified such that the sum of the k_i is an even number. Each node i is given a total of k_i stubs, or demi-edges. Then, we repeatedly choose pairs of stubs uniformly at random, without replacement, and connect the selected pairs to form edges. The simple graph is obtained either by discarding the multiple edges and self-loops (an erased configuration model), or by repeating the above sampling until a simple graph is obtained. In our case, we have an infinite set of (potential) nodes and do not prespecify the node degrees. Furthermore, each node in the pair (U_k1, U_k2) is drawn from a normalized CRM rather than the pair being selected uniformly at random. However, at a high level, there is a similar flavour to our construction.
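For comparison, the erased configuration model described in the remark can be sketched as follows; this is a minimal illustration (function name ours), assuming the prescribed degrees sum to an even number.

```python
import numpy as np

def erased_configuration_model(degrees, rng=None):
    """Erased configuration model: pair stubs uniformly at random,
    then discard self-loops and multiple edges."""
    rng = np.random.default_rng(rng)
    assert sum(degrees) % 2 == 0, "the stub count must be even"
    # one stub per unit of degree
    stubs = np.repeat(np.arange(len(degrees)), degrees)
    rng.shuffle(stubs)                      # uniform random pairing of stubs
    pairs = stubs.reshape(-1, 2)
    # erase self-loops and collapse multiple edges
    return {tuple(sorted(p)) for p in map(tuple, pairs) if p[0] != p[1]}

edges = erased_configuration_model([3, 2, 2, 1], rng=1)
```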

Urn-based simulation of graph restrictions
We now describe an urn formulation that allows us to obtain a finite dimensional generative process. Recall that, in practice, we cannot sample W α ∼ CRM.ρ, λ α / if the CRM is infinite activity since there will be an infinite number of jumps.
Let (U_1, …, U_{2D*_α}) = (U_11, U_12, …, U_{D*_α 1}, U_{D*_α 2}). For some classes of Lévy measure ρ, it is possible to integrate out the normalized CRM μ_α = W_α/W*_α in expression (30) and to derive the conditional distribution of U_{n+1} given (W*_α, U_1, …, U_n). We first recall some background on random partitions. As μ_α is discrete with probability 1, the variables U_1, …, U_n take k_n distinct values θ̃_j, with multiplicities m_j, 1 ≤ m_j ≤ n. The distribution of the underlying partition is usually defined in terms of an exchangeable partition probability function (EPPF) (Pitman, 1995), Π_n^(k)(m_1, …, m_k | W*_α), which is symmetric in its arguments. The predictive distribution of U_{n+1} given (W*_α, U_1, …, U_n) is then given in terms of the EPPF:

Pr(U_{n+1} ∈ · | W*_α, U_1, …, U_n) ∝ Π_{n+1}^{(k_n+1)}(m_1, …, m_{k_n}, 1 | W*_α) Unif([0, α]) + Σ_{j=1}^{k_n} Π_{n+1}^{(k_n)}(m_1, …, m_j + 1, …, m_{k_n} | W*_α) δ_{θ̃_j}.    (32)

Using this urn representation, we can rewrite our generative process as

W*_α ∼ P_{W*_α},  D*_α | W*_α ∼ Poisson{(W*_α)²},  (U_kj)_{k=1,…,D*_α; j=1,2} | W*_α ∼ urn process (32),    (33)

where P_{W*_α} is the distribution of the CRM total mass W*_α. Representation (33) can be used to sample exactly from our graph model, assuming that we can sample from P_{W*_α} and evaluate the EPPF. In Section 6 we show that this is indeed possible for specific CRMs of interest.

Approximate simulation of graph restrictions
If we cannot sample from P W Å α in expression (33) and evaluate the EPPF in expression (32), we resort to approximate simulation methods. In particular, we harness the directed multigraph representation and approximate the draw of W α . For our undirected graphs, we simply transform the (approximate) draw of a directed multigraph as described in Section 3.3.
One approach to approximate simulation of W_α, which is possible for some Lévy measures ρ, is adaptive thinning (Lewis and Shedler, 1979; Ogata, 1981; Favaro and Teh, 2013). A related alternative, applicable to any Lévy measure ρ satisfying condition (9), is the inverse Lévy method. This method first defines a threshold ε and then samples the weights Ω = {w_i | w_i > ε} by using a Poisson measure on (ε, ∞). One then simulates D_α using these truncated weights Ω.
A naive application of this truncation that samples directed or undirected edges as in expression (12) or expression (6) respectively can prove computationally problematic, since a large number of possible edges must be considered (one Poisson or Bernoulli draw for each pair (θ_i, θ_j) in the directed or undirected case). Instead, we can harness the Cox process representation and resulting sampling procedure of expressions (29)-(30) to sample first the total number of directed edges and then their specific instantiations. More specifically, to simulate approximately a point process on [0, α]², we use the inverse Lévy method to sample

Π_{α,ε} = {(w, θ) ∈ Π : 0 < θ ≤ α, w > ε}.    (34)

Let W_{α,ε} = Σ_{i=1}^K w_i δ_{θ_i} be the associated truncated CRM and W*_{α,ε} = W_{α,ε}([0, α]) its total mass. We then sample D*_{α,ε} and the U_kj as in expressions (29)-(30), and set D_{α,ε} = Σ_{k=1}^{D*_{α,ε}} δ_{(U_k1, U_k2)}. The undirected graph measure Z_{α,ε} is obtained from D_{α,ε} as in expression (12).
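The inverse Lévy method with threshold ε can be sketched as follows, using numerical tail integration and root finding for a generic Lévy density; the GGP density of Section 6.2 serves as an example. This is a sketch under our own naming, not the authors' implementation, and `w_max` is an arbitrary numerical upper bound for the integrals.

```python
import numpy as np
from math import gamma
from scipy.integrate import quad
from scipy.optimize import brentq

def ggp_levy_density(w, sigma=0.5, tau=1.0):
    # GGP Levy density: rho(w) = w^{-1-sigma} exp(-tau*w) / Gamma(1-sigma)
    return w ** (-1.0 - sigma) * np.exp(-tau * w) / gamma(1.0 - sigma)

def sample_trunc_crm(alpha, eps, levy, w_max=1e3, rng=None):
    """Inverse Levy method: ranked jumps w_i = tail^{-1}(Gamma_i / alpha),
    with Gamma_i unit-rate Poisson arrivals, kept while w_i > eps;
    jump locations are uniform on [0, alpha]."""
    rng = np.random.default_rng(rng)
    tail = lambda x: quad(levy, x, w_max)[0]      # numerical tail intensity
    tail_eps = tail(eps)
    weights, g = [], 0.0
    while True:
        g += rng.exponential()
        if g / alpha >= tail_eps:                 # next jump would be <= eps
            break
        # invert the decreasing tail: solve tail(w) = g / alpha on (eps, w_max)
        weights.append(brentq(lambda x: tail(x) - g / alpha, eps, w_max))
    thetas = rng.uniform(0.0, alpha, size=len(weights))
    return np.array(weights), thetas

weights, thetas = sample_trunc_crm(5.0, 0.05, ggp_levy_density, rng=0)
```

The truncated weights can then be fed to the Cox process sampling of expressions (29)-(30) to obtain the approximate graph.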

Special cases
In this section, we examine the properties of various models and their links to classical random-graph models, depending on the Lévy measure ρ. We show that, in the GGP case, the resulting graph can be either dense or sparse, with the sparsity tuned by a single hyperparameter. Furthermore, exact simulation is possible via expression (33). We focus on the undirected graph case, but similar results can be obtained for directed multigraphs and bipartite graphs.

Compound Poisson process
A compound Poisson process is a process where ρ(dw) = h(w) dw, and h : ℝ₊ → ℝ₊ is such that ∫_0^∞ h(w) dw = 1, defining a finite activity CRM. In this case, we have ρ̄(x) = 1 − H(x), where H is the distribution function associated with h. Here, we arrive at a framework that is similar to the standard graphon. Leveraging the Kallenberg representation (16), we first sample n ∼ Poisson(α). Then, for i, j = 1, …, n we set z_ij = z_ji = 1 with probability M(U_i, U_j), where the U_i are uniform [0, 1] variables and M is defined by

M(u, v) = 1 − exp{−2 H⁻¹(u) H⁻¹(v)}.

This representation is the same as with the Aldous-Hoover theorem, except that the number of nodes is random and follows a Poisson distribution. As such, the resulting random graph is either trivially empty or dense, again agreeing with our theory.
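The finite activity construction above can be sketched directly: draw a Poisson number of nodes, iid sociabilities from h, then Bernoulli edges. A minimal illustration (function names ours), with the self-loop probability 1 − exp(−w_i²) as an assumption consistent with the model's edge probabilities:

```python
import numpy as np

def sample_cpp_graph(alpha, h_sampler, rng=None):
    """Finite activity case: n ~ Poisson(alpha) nodes with iid weights
    w_i ~ h; undirected edge (i, j) with probability 1 - exp(-2 w_i w_j)
    (1 - exp(-w_i^2) for self-loops)."""
    rng = np.random.default_rng(rng)
    n = rng.poisson(alpha)
    w = h_sampler(n, rng)                       # iid sociabilities from h
    z = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i, n):
            rate = w[i] ** 2 if i == j else 2 * w[i] * w[j]
            z[i, j] = z[j, i] = rng.random() < 1 - np.exp(-rate)
    return w, z

# e.g. exponential(1) sociabilities
w, z = sample_cpp_graph(20, lambda n, rng: rng.exponential(1.0, n), rng=0)
```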

Generalized gamma process
The GGP (Hougaard, 1986; Aalen, 1992; Lee and Whitmore, 1993; Brix, 1999) is a flexible two-parameter CRM with interpretable parameters and remarkable conjugacy properties (Lijoi and Prünster, 2003). The process is also known as the Hougaard process (Hougaard, 1986) when λ is the Lebesgue measure, as in this paper, but we shall use the more standard term GGP in the rest of this paper. The Lévy measure of the GGP is given by

ρ(dw) = {1/Γ(1 − σ)} w^{−1−σ} exp(−τw) dw,    (35)

where the two parameters (σ, τ) satisfy σ ∈ (−∞, 1), with τ > 0 if σ ≤ 0 and τ ≥ 0 if σ ∈ (0, 1). The GGP has different properties depending on whether σ ≥ 0 or σ < 0. When σ < 0, the GGP is a finite activity CRM (i.e. a compound Poisson process); more precisely, the number of jumps in [0, α] is finite with probability 1 and drawn from a Poisson distribution with rate −(α/σ)τ^σ, whereas the jumps w_i are IID gamma(−σ, τ). When σ ≥ 0, the GGP has an infinite number of jumps over any interval [s, t]. It includes as special cases the gamma process (σ = 0, τ > 0), the stable process (σ ∈ (0, 1), τ = 0) and the inverse Gaussian process (σ = 1/2, τ > 0). The tail Lévy intensity of the GGP is given by

ρ̄(x) = {τ^σ/Γ(1 − σ)} Γ(−σ, τx),    (36)

where Γ(a, x) = ∫_x^∞ t^{a−1} e^{−t} dt is the incomplete gamma function. Example realizations of the process for various values of σ ≥ 0 are displayed in Fig. 7 alongside a realization of an Erdős-Rényi graph.
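For σ < 0, the finite activity description above translates directly into an exact sampler: a Poisson number of jumps with rate −(α/σ)τ^σ and iid gamma(−σ, τ) weights. A minimal sketch (function name ours; note that τ enters the gamma distribution as a rate, hence `scale=1/tau`):

```python
import numpy as np

def sample_ggp_finite(alpha, sigma, tau, rng=None):
    """For sigma < 0 the GGP is finite activity: the number of jumps on
    [0, alpha] is Poisson with rate -(alpha/sigma) * tau**sigma, and the
    jumps are iid gamma(-sigma, tau)."""
    assert sigma < 0 and tau > 0
    rng = np.random.default_rng(rng)
    K = rng.poisson(-(alpha / sigma) * tau ** sigma)
    w = rng.gamma(shape=-sigma, scale=1.0 / tau, size=K)   # jump sizes
    theta = rng.uniform(0.0, alpha, size=K)                # jump locations
    return w, theta

w, theta = sample_ggp_finite(alpha=50.0, sigma=-1.0, tau=2.0, rng=0)
```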

Exact sampling via an urn approach
In the case σ > 0, W*_α is an exponentially tilted stable random variable, for which exact samplers exist (Devroye, 2009). The EPPF conditional on the total mass W*_α = t depends only on the parameter σ (and not on τ and α), and is given by equation (37), expressed in terms of g_σ, the probability density function of the positive stable distribution. Plugging the EPPF (37) into expression (32) yields the urn process for sampling in the GGP case. In particular, we can use the generative process (33) to sample exactly from the model. In the special case of the gamma process (σ = 0), W*_α is a gamma(α, τ) random variable and the resulting urn process is the Blackwell-MacQueen urn (Blackwell and MacQueen, 1973; Pitman, 1996):

U_{n+1} | W*_α, U_1, …, U_n ∼ {α Ū_α + Σ_{j=1}^{k_n} m_j δ_{θ̃_j}}/(α + n),    (38)

where Ū_α denotes the uniform distribution on [0, α]. When σ < 0, the GGP is a compound Poisson process and can thus be sampled exactly.
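The gamma process urn (38) is easy to simulate: a new node appears with probability α/(α + n), and an existing node is revisited with probability proportional to its multiplicity. A minimal sketch (function name ours) that records node labels rather than locations, since the locations are uniform and carry no extra information:

```python
import numpy as np

def gamma_process_urn(alpha, n_draws, rng=None):
    """Blackwell-MacQueen urn, eq. (38), for the gamma process (sigma = 0):
    at step n, draw a new node with probability alpha/(alpha + n), or an
    existing node j with probability m_j/(alpha + n)."""
    rng = np.random.default_rng(rng)
    counts = []                                  # multiplicities m_j
    labels = []
    for n in range(n_draws):
        if rng.random() < alpha / (alpha + n):
            counts.append(1)                     # new node
            labels.append(len(counts) - 1)
        else:
            j = rng.choice(len(counts), p=np.array(counts) / n)
            counts[j] += 1
            labels.append(j)
    return labels, counts

labels, counts = gamma_process_urn(alpha=3.0, n_draws=100, rng=0)
```

In the generative process (33), consecutive pairs of such labels would form the directed edges of the multigraph.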

Sparsity
Appealing to theorem 2, we use the following facts about the GGP to characterize the sparsity properties of this special case.
(a) For σ < 0, the CRM is finite activity with ∫_0^∞ w ρ(dw) < ∞; thus theorem 2 implies that the graph is dense. (b) When σ ≥ 0, the CRM is infinite activity; moreover, for τ > 0, ∫_0^∞ w ρ(dw) < ∞, and thus theorem 2 implies that the graph is sparse. (c) For σ > 0, the tail Lévy intensity has the asymptotic behaviour ρ̄(x) ∼ x^{−σ}/{σ Γ(1 − σ)} as x → 0 and, as such, is regularly varying with exponent σ and constant slowly varying function.
We thus conclude that

N_α^(e) = Θ(N_α²) if σ < 0,  N_α^(e) = o(N_α²) if σ ∈ [0, 1),    (39)

almost surely as α → ∞, i.e. the GGP parameter σ tunes the sparsity of the graph. The underlying graph is sparse if σ ≥ 0 and dense otherwise.

Empirical analysis of graph properties
For the GGP-based formulation, we provide an empirical analysis of our network properties in Fig. 8, simulating undirected graphs by using the approach that was described in Section 5.5 for various values of σ and τ. We compare with an Erdős-Rényi random graph, a preferential attachment model and the Bayesian non-parametric network model of Lloyd et al. (2012). The particular features that we explore are as follows.
(a) Degree distribution: Fig. 8(a) suggests empirically that the model can exhibit power law behaviour, providing a heavy-tailed degree distribution. As shown in Fig. 8(b), which varies τ for fixed σ = 0.5, the model can also handle an exponential cut-off in the tails of the degree distribution, which is an attractive property (Olhede and Wolfe, 2012). (b) Number of degree 1 nodes: Fig. 8(c) examines the number of nodes with degree 1 versus the number of nodes on a log-log scale, for σ ∈ {0.2, 0.5, 0.8} (note that the Lloyd method leads to dense graphs such that no node has degree 1). (c) Sparsity: Fig. 8(d) plots the number of edges versus the number of nodes, again for σ ∈ {0.2, 0.5, 0.8}. The larger σ, the sparser the graph. In particular, for the GGP random-graph model, we have network growth at a rate O(n^a) for 1 < a < 2, i.e. o(n²), whereas the Erdős-Rényi and Lloyd models (dense graphs) grow as Θ(n²).

Interpretation of hyperparameters
On the basis of the properties derived and illustrated empirically in this section, we see that our hyperparameters have the following interpretations.
(a) σ: from Figs 8(a) and 8(d), σ relates to the slope of the degree distribution in its power law regime and to the overall network sparsity; increasing σ leads to a higher power law exponent and sparser networks. (b) α: from theorem 5, α provides an overall scale that affects the number of nodes and directed interactions, with larger α leading to larger networks. (c) τ: from Fig. 8(b), τ determines the exponential decay of the tails of the degree distribution, with small τ yielding behaviour close to a pure power law. This is intuitive from the form of ρ(dw) in equation (35), where we see that τ affects large weights more than small weights.

Posterior characterization and inference
In this section, we consider the posterior characterization and MCMC inference of parameters and hyperparameters in our statistical network models. Assume that we have observed a set of undirected connections (z_ij)_{1≤i,j≤N_α} or directed connections (n_ij)_{1≤i,j≤N_α}, where N_α is the observed number of nodes with at least one connection. Without loss of generality, we assume that the locations of these nodes, 0 < θ_1 < ⋯ < θ_{N_α} < α, are ordered, and we write w_i = W({θ_i}) for their associated sociability parameters. For simplicity, we are overloading notation here with the unordered nodes in W = Σ_i w_i δ_{θ_i} of equation (7).
We aim to infer the sociability parameters w_i, i = 1, …, N_α, for each of the observed nodes. We also aim to infer the sociability parameters of the nodes with no connections (the difference between the set of potential nodes and those with observed interactions); we refer to these as unobserved nodes. Under our framework, the number of such nodes is either finite but unknown or infinite. The observed connections, however, provide information only about the sum of their sociabilities, denoted w*. The node locations θ_i of both observed and unobserved nodes are also not likelihood identifiable and are thus ignored. We additionally aim to estimate α and the hyperparameters of the Lévy intensity ρ of the CRM; we write φ for the set of hyperparameters. We therefore aim to approximate the posterior p{w_1, …, w_{N_α}, w*, φ | (z_ij)_{1≤i,j≤N_α}} for an observed undirected graph and p{w_1, …, w_{N_α}, w*, φ | (n_ij)_{1≤i,j≤N_α}} for an observed directed graph. (Formally, this density is with respect to a product measure that has a Dirac mass at 0 for w*, as detailed in Appendix F.)

Directed multigraph posterior
In theorem 6, we characterize the posterior in the directed multigraph case. This plays a key role in the undirected case that is explored in Section 7.2 as well.
Theorem 6. For N_α ≥ 1, let θ_1 < ⋯ < θ_{N_α} be the support points of the measure D_α, with m_i = Σ_{j=1}^{N_α}(n_ij + n_ji) > 0, for i = 1, …, N_α, the node degrees of the multigraph. Assume that ρ(dw) = ρ(w) dw. The conditional distribution of (w_1, …, w_{N_α}, w*) given (n_ij)_{1≤i,j≤N_α} has density proportional to

{∏_{i=1}^{N_α} w_i^{m_i} ρ(w_i)} exp[−{w* + Σ_{i=1}^{N_α} w_i}²] G*_α(dw*),    (40)

where G*_α is the probability distribution of the random variable W*_α, with Laplace transform

E[exp(−t W*_α)] = exp{−α ψ(t)}.    (41)

Additionally, conditionally on observing an empty graph, i.e. N_α = 0, we have

p(dw*) ∝ exp{−(w*)²} G*_α(dw*).    (42)

The proof builds on posterior characterizations for normalized CRMs (James, 2005; Prünster, 2002) using the hierarchical construction of expressions (29)-(30). See Appendix E.
The conditional distribution of (w_1, …, w_{N_α}, w*) given (n_ij)_{1≤i,j≤N_α} does not depend on the locations (θ_1, …, θ_{N_α}) because we considered a homogeneous CRM. This fact is important since the locations (θ_1, …, θ_{N_α}) are typically not observed, and our algorithm outlined below does not consider these terms in the inference.

Markov chain Monte Carlo sampling for generalized gamma process based directed and undirected graphs
We now specialize to the case of the GGP, for which we derive an MCMC sampler for posterior inference. Let φ = (α, σ, τ) be the set of hyperparameters that we also want to estimate. We assume improper priors on the hyperparameters:

p(α, σ, τ) ∝ 1/(ατ),  α, τ > 0,  σ ∈ (−∞, 1).    (43)

To emphasize the dependence on the hyperparameters of the Lévy measure and the distribution of the total mass w*, we write ρ(w | σ, τ) and G*_{α,σ,τ}(dw*). In the case of an undirected graph, we simply impute the missing directed edges in the graph: for each i ≤ j such that z_ij = 1, we introduce latent variables ñ_ij = n_ij + n_ji with a zero-truncated Poisson conditional distribution,

Pr(ñ_ij = k | w, z_ij = 1) = r_ij^k / [k! {exp(r_ij) − 1}]  for k = 1, 2, …,    (44)

where r_ij = 2 w_i w_j for i < j and r_ii = w_i². By convention, we set ñ_ij = ñ_ji for j < i and m_i = Σ_{j=1}^{N_α} ñ_ij. For scalable exploration of the target posterior, we propose to use HMC (Duane et al., 1987) within Gibbs sampling to update the weights (w_1, …, w_{N_α}). The HMC step requires computation of the gradient of the log-posterior, which in our case, letting ω_i = log(w_i), is given by

∂ log p/∂ω_i = m_i − σ − w_i {τ + 2 w* + 2 Σ_{j=1}^{N_α} w_j}.    (45)

For the update of the total mass w* and hyperparameters φ, we use a Metropolis-Hastings step. Unless σ = 0 or σ = 1/2, G*_{α,σ,τ}(dw*) does not admit a tractable analytical expression. We therefore use a specific proposal for w* based on exponential tilting of G*_{α,σ,τ} that alleviates the need to evaluate this probability density function in the Metropolis-Hastings ratio (see the details in Appendix F). To summarize, the MCMC sampler is defined as follows.
Step 1: update the weights .w 1 , : : : , w N α / given the rest by using an HMC update.
Step 2: update the total mass w Å and hyperparameters φ = .α, σ, τ / given the rest by using a Metropolis-Hastings update.
Step 3: (undirected graph) update the latent counts .n ij / given the rest by using the conditional distribution (44) or a Metropolis-Hastings update.
The computational bottlenecks lie in steps 1 and 3, which scale roughly linearly in the number of nodes and edges respectively, although one can parallelize step 3 over edges. If L is the number of leapfrog steps in the HMC algorithm and n_iter the number of MCMC iterations, the overall complexity is O{n_iter(L N_α + N_α^(e))}. We show in Section 8 that the algorithm scales well to large networks with hundreds of thousands of nodes and edges. To scale the HMC algorithm to even larger collections of nodes or edges, one could explore the methods of Chen et al. (2014).
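Steps 1 and 3 can be sketched as follows for the GGP case, using the gradient (45) and the zero-truncated Poisson conditional (44). This is an illustrative sketch under our own naming, not the authors' MATLAB implementation; the log-posterior is written up to an additive constant.

```python
import numpy as np

def log_post(omega, m, sigma, tau, w_star):
    # log-posterior of omega_i = log(w_i), up to a constant (GGP case):
    # sum_i (m_i - sigma) omega_i - tau sum_i w_i - (w* + sum_i w_i)^2
    w = np.exp(omega)
    return np.dot(m - sigma, omega) - tau * w.sum() - (w_star + w.sum()) ** 2

def log_post_grad(omega, m, sigma, tau, w_star):
    # gradient (45): m_i - sigma - w_i (tau + 2 w* + 2 sum_j w_j)
    w = np.exp(omega)
    return m - sigma - w * (tau + 2.0 * w_star + 2.0 * w.sum())

def hmc_step(omega, m, sigma, tau, w_star, step=1e-3, n_leap=10, rng=None):
    # step 1: one leapfrog HMC update of the log-weights
    rng = np.random.default_rng(rng)
    p0 = rng.normal(size=omega.size)
    q, p = omega.copy(), p0.copy()
    p += 0.5 * step * log_post_grad(q, m, sigma, tau, w_star)
    for _ in range(n_leap):
        q += step * p
        p += step * log_post_grad(q, m, sigma, tau, w_star)
    p -= 0.5 * step * log_post_grad(q, m, sigma, tau, w_star)
    log_ratio = (log_post(q, m, sigma, tau, w_star) - 0.5 * p @ p
                 - log_post(omega, m, sigma, tau, w_star) + 0.5 * p0 @ p0)
    return q if np.log(rng.random()) < log_ratio else omega

def sample_trunc_poisson(rate, rng):
    # step 3: zero-truncated Poisson draw for a latent count (44), by rejection
    while True:
        k = rng.poisson(rate)
        if k > 0:
            return k
```

The rejection sampler for the latent counts is adequate here because z_ij = 1 makes k = 0 the only rejected outcome.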

Simulated data
We first study the convergence of the MCMC algorithm on simulated data where the graph is simulated from our model. We simulated a GGP undirected graph with parameters α = 300, σ = 0.5 and τ = 1, which places us in the sparse regime. The sampled graph had 13 995 nodes and 76 605 edges. We ran three MCMC chains, each with 40 000 iterations, different initial values and L = 10 leapfrog steps; the step size of the leapfrog algorithm was adapted during the first 10 000 iterations to obtain an acceptance rate of 0.6. Standard deviations of the random-walk Metropolis-Hastings steps for log(τ) and log(1 − σ) were set to 0.02. The computing time for running the three chains successively was 10 min using a MATLAB implementation on a standard computer (central processor unit at 3.10 GHz; four cores). Trace plots of the parameters α, σ, τ and w* are given in Fig. 9. We computed the potential scale reduction factor (Brooks and Gelman, 1998; Gelman et al., 2014) for all 13 999 parameters (w_{1:N_α}, w*, α, σ and τ) and found a maximum value of 1.01, suggesting convergence of the algorithm. This is quite remarkable as the MCMC sampler actually samples from a target distribution of dimension 13 995 + 76 605 + 4 = 90 604. Posterior credible intervals of the sociability parameters w_i of the nodes with the highest degrees and of the log-sociability parameters log(w_i) of the nodes with the lowest degrees are displayed in Figs 10(a) and 10(b) respectively, showing the ability of the method to recover sociability parameters of both low and high degree nodes accurately.
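The potential scale reduction factor used here can be computed from multiple chains as follows; this is a minimal sketch of the Gelman-Rubin diagnostic for a scalar parameter (without the split-chain refinement that some versions use), with names of our own choosing.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for one scalar parameter, given an array of
    shape (n_chains, n_iterations) of post burn-in MCMC draws."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
draws = rng.normal(size=(3, 1000))            # three well mixed chains
r_hat = potential_scale_reduction(draws)      # close to 1 when chains agree
```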
To show the versatility of the GGP graph model, we now examine our approach when the observed graph is actually generated from an Erdős-Rényi model with n = 1000 and p = 0.01. The generated graph had 1000 nodes and 5058 edges. We ran three MCMC chains with the same specifications as above. In this dense graph regime, the following transformation of our parameters α, σ and τ is more informative: ς₁ = −(α/σ)τ^σ, ς₂ = −σ/τ and ς₃ = −σ/τ². When σ < 0, ς₁ corresponds to the expected number of nodes, ς₂ to the mean of the sociability parameters and ς₃ to their variance (see Section 6.3). In contrast, the parameters σ and τ are only weakly identifiable in this case. The potential scale reduction factor was computed on (w_{1:N_α}, w*, ς₁, ς₂, ς₃); its maximum value was 1.01, suggesting convergence.
The value of ς₁ converges around the true number of nodes and ς₂ to the true sociability parameter √{−(1/2) log(1 − p)} (constant across nodes for the Erdős-Rényi model), whereas ς₃ is close to 0, as the variance of the sociability parameters is very small. The total mass w* is also very close to 0, indicating that there are no nodes with degree 0.
Posterior credible intervals for the nodes with highest and lowest degrees are in Fig. 11, showing that the model can accurately recover sociability parameters of both low and high degree nodes in the dense regime as well.

Assessing properties of real world graphs
We now turn to using our methods to assess properties of a collection of real world graphs, including their degree distributions and aspects of sparsity. For the latter, evaluation based on a single finite graph is notoriously challenging, as sparsity relates to the asymptotic behaviour of the graph. Measures of sparsity from finite graphs exist but can be costly to compute (Nešetřil and Ossona de Mendez, 2012). On the basis of our GGP-based formulation and the associated theoretical results described in Section 6, we consider Pr(σ ≥ 0 | z) as informative of the connectivity structure of the graph, since the GGP graph model yields dense graphs for σ < 0 and sparse graphs for σ ∈ [0, 1) (see equation (39)). For our analyses, we consider improper priors on the unknown parameters (α, σ, τ). We report Pr(σ ≥ 0 | z) based on a set of observed connections (z_ij)_{1≤i,j≤N_α}, which can be directly approximated from the MCMC output. We consider 12 different data sets: (Bu et al., 2003); (f) USpower, the network of the high-voltage power grid in the western states of the USA (https://snap.stanford.edu/data/email-Enron.html); (g) IMDB, an actor collaboration network based on acting in the same movie (http://www.cise.ufl.edu/research/sparse/matrices/Pajek/IMDB.html); (h) cond-mat1, a co-authorship network (https://snap.stanford.edu/data/email-Enron.html) (Newman, 2001) based on preprints posted to the condensed matter section of arXiv between 1995 and 1999, obtained from the bipartite preprints-authors network using a one-mode projection; (i) cond-mat2, as in cond-mat1, but using Newman's projection method; (j) Enron, the Enron collaboration network from a multigraph e-mail network (https://snap.stanford.edu/data/email-Enron.html); (k) internet, connectivity of Internet routers (http://www.cise.ufl.edu/research/sparse/matrices/Pajek/internet.html); (l) www, linked World Wide Web pages in the nd.edu domain (http://lisgi1.engr.ccny.cuny.edu/∼makse/soft data.html).
The sizes of the various data sets are given in Table 2 and range from a few hundred nodes or edges to a million. The adjacency matrices for these networks are plotted in Fig. 12 and the empirical degree distributions in Fig. 13 (red). We ran three MCMC chains for 40 000 iterations with the same specifications as above and report the estimate of Pr(σ ≥ 0 | z) and 99% posterior credible intervals of σ in Table 2; we additionally provide run times. MCMC trace plots suggested rapid convergence of the sampler. Since sparsity is an asymptotic property of a graph, and we are analysing finite graphs, our inference of σ here simply provides insight into some structure of the graph and is not formally a test of sparsity. From Table 2, we note that we infer negative σ-values for many of the smaller networks. This might indicate that these graphs have dense connectivity; for example, our facebook107 data set represents a small social circle that is probably highly interconnected, and the polblogs data set represents two tightly connected political parties. We infer positive σ-values for three of the data sets (USairport, Enron and www); note that two of these data sets are among the top three largest networks considered, where sparse connectivity is more commonplace. In the remaining large network, internet, a question is why the inferred σ is negative. This may be due to dense subgraphs or spots (for example, spatially proximate routers may be highly interconnected, but sparsely connected outside the group) (Borgs et al., 2014b). This relates to the idea of community structure, though not every node need be associated with a community. As in many sparse network models that assume no dense spots, our approach does not explicitly model such effects. Capturing such structure remains a direction of future research that is likely to be feasible within our generative framework. However, our current method has the benefit of simplicity, with three hyperparameters tuning the network properties.
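Given posterior draws of σ from the MCMC output, the reported quantities are simple Monte Carlo summaries. A minimal sketch (function name ours):

```python
import numpy as np

def prob_sparse(sigma_samples):
    """Monte Carlo estimate of Pr(sigma >= 0 | z) from posterior draws of
    sigma, together with a 99% posterior credible interval."""
    s = np.asarray(sigma_samples, dtype=float)
    p_sparse = (s >= 0).mean()
    lo, hi = np.quantile(s, [0.005, 0.995])
    return p_sparse, (lo, hi)

# toy posterior draws, for illustration only
p, ci = prob_sparse([-0.2, 0.1, 0.3, -0.05, 0.2, 0.15, 0.25, -0.1])
```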
Finally, we note in Table 2 that our analyses finish in a remarkably short time, although the code base was implemented in MATLAB on a standard desktop machine, without leveraging possible opportunities for parallelization and other mechanisms for scaling the sampler (see Section 7 for a discussion). To assess our fit to the empirical degree distributions, we used the methods that were described in Section 5.5 to simulate 5000 graphs from the posterior predictive distribution in each scenario. Fig. 13 provides a comparison between the empirical degree distributions and those based on the simulated graphs. In all cases, we see a reasonably good fit. For the largest networks, Figs 13(j)-13(l), we see a slight underestimate of the tail of the distribution, i.e. we do not capture as many high degree nodes as are truly present. This may be because these graphs exhibit power law behaviour only after a certain node degree, which is not an effect that is explicitly modelled by our framework. Instead, our model averages the error over the low and high degree nodes. Another reason for underestimating the tails might be dense spots, which we also do not explicitly model. However, our model does capture power law behaviour with a possible exponential cut-off in the tail. We see a similar trend for cond-mat1, but not cond-mat2. Based on the bipartite articles-authors graph, cond-mat1 uses the standard one-mode projection and sets a connection between two authors who have co-authored a paper; this projection clearly creates dense spots in the graph. In contrast, cond-mat2 uses Newman's projection method. This method constructs a weighted undirected graph by counting the number of papers that were co-authored by two scientists, where each count is normalized by the number of authors on the paper.
To construct the undirected graph, an edge is created if the weight is equal to or greater than 1; cond-mat1 and cond-mat2 thus have different numbers of edges and nodes, as only nodes with at least one connection are considered. It is interesting that the projection method that was used for the cond-mat data set has a clear influence on the sparsity of the resulting graph, cond-mat2 being less dense than cond-mat1 (see Figs 13(h) and 13(i)). The degree distribution for cond-mat1 is similar to that of internet, thus inheriting the same issues as previously discussed. Overall, it appears that our model better captures homogeneous power law behaviour with possible exponential cut-off in the tails than it does a graph with structured dense spots or power-law-after-a-point behaviour.

Conclusion
We proposed a class of statistical network models built on exchangeable random measures. Using this representation, we showed how it is possible to specify models with properties that differ from those of models based on exchangeable adjacency matrices. As an example, we considered a model built on the framework of CRMs that can yield sparse graphs while maintaining attractive exchangeability properties. For a specific choice of CRM, our fully generative formulation can yield networks ranging from dense to sparse, as tuned by a single hyperparameter.
In this paper, exchangeability is in the context of random measures for which we appealed to the Kallenberg representation in place of the Aldous-Hoover theorem for exchangeable arrays. Using this framework, we arrived at a structure that is analogous to the graphon, which opens up new modelling and theoretical analysis possibilities beyond those of the special case that is considered herein. Importantly, through the exchangeability of the underlying random measures and leveraging HMC sampling, we devised a scalable algorithm for posterior computations. This scheme enables inference of the graph parameters, including the parameter determining the sparsity of the graph. We examined our methods on a range of real world networks, demonstrating that our model yields a practically useful statistical tool for network analysis.
We believe that the foundational modelling tools and theoretical results that we presented represent an important building block for future developments. Such developments can be divided along two dimensions: (a) modelling advances, such as incorporating notions of community structure and node attributes, within this framework and (b) theoretical analyses looking at the properties of the corresponding class of networks.
For the latter, we considered just one simplified version of the Kallenberg representation; examining a more general form could yield graphs with additional structure. Building on an initial version of this paper (Caron and Fox, 2014), initial forays into advances on the modelling side can be found in  and  and theoretical analyses in  and .
This work was supported in part by grant FA9550-12-1-0406, negotiated by the Air Force Office of Scientific Research, and by Air Force Office of Scientific Research grant FA9550-12-1-0453.

A.1. Aldous-Hoover theorem and graphons
In theorem 7, we present the Aldous-Hoover theorem for the case of joint exchangeability, i.e. simultaneous permutations of rows and columns, which is applicable to matrices Z where both rows and columns index the same set of nodes. Separate exchangeability allows for different row and column permutations, making it applicable to scenarios where one has distinct node identities on rows and columns, such as in the bipartite graphs that we considered in Section 3.4. Extensions to higher dimensional arrays are likewise straightforward. For a more general statement of the Aldous-Hoover theorem, which holds for separate exchangeability and higher dimensional arrays, see .
Theorem 7 (Aldous-Hoover representation of jointly exchangeable matrices). A random matrix (Z_ij)_{i,j∈N} is jointly exchangeable if and only if there is a random measurable function F such that (Z_ij)_{i,j∈N} =_d (F(U_i, U_j, U_ij))_{i,j∈N}, where (U_i)_{i∈N} and (U_ij)_{i,j>i∈N} with U_ij = U_ji are a sequence and matrix, respectively, of IID uniform[0, 1] random variables.
For undirected graphs where Z is a binary, symmetric adjacency matrix, the Aldous-Hoover representation can be expressed as the existence of a random symmetric measurable function (graphon) M: [0, 1]^2 → [0, 1] such that, given (U_i)_{i∈N}, the entries Z_ij for i < j are independent Bernoulli{M(U_i, U_j)} random variables.

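As a concrete illustration of this representation, one can sample a finite graph from a given graphon by simulating the uniform variables directly. The particular graphon M(u, v) = uv below is an arbitrary example of ours, not one from the paper:

```python
import numpy as np

def sample_from_graphon(M, n, rng):
    """Aldous-Hoover sampling: draw U_i ~ Uniform[0, 1] IID, then set
    Z_ij = 1 with probability M(U_i, U_j), independently for i < j."""
    u = rng.random(n)
    probs = M(u[:, None], u[None, :])
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return upper | upper.T

rng = np.random.default_rng(1)
M = lambda u, v: u * v              # an arbitrary illustrative graphon
Z = sample_from_graphon(M, 1000, rng)
# Dense regime: the number of edges grows like n^2 (here roughly n^2 / 8)
print(Z.sum() // 2, "edges among", Z.shape[0], "nodes")
```

The quadratic growth of the edge count with n is exactly the denseness that the Aldous-Hoover representation forces, and which the random measure framework of the paper avoids.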
A.2. Regularly varying Lévy measures
Here we provide a formal definition of the regularly varying Lévy measure ρ that is referred to in theorems 2 and 5. The Lévy measure ρ is said to be regularly varying at 0 with exponent σ ∈ [0, 1) if its tail ρ̄(x) = ∫_x^∞ ρ(dw) satisfies ρ̄(x) ~ x^{-σ} ℓ(1/x) as x → 0, where ℓ is a slowly varying function, i.e. lim_{t→∞} ℓ(ct)/ℓ(t) = 1 for all c > 0.

Appendix B: Proof of proposition 1
The proof of proposition 1 follows from the properties of W ∼ CRM(ρ, λ): since λ is homogeneous, the restrictions of W to disjoint intervals of equal length are independent and identically distributed, so that (W(A_i))_{i∈N} =_d (W(A_{π(i)}))_{i∈N} for any permutation π of N. Joint exchangeability of Z follows directly.

C.1. Probability asymptotics notation
We first describe the asymptotic notation that is used in the remainder of this section, which follows the notation of Janson (2011). All unspecified limits are as α → ∞.

C.3. Proof of theorem 4
C.3.1. Finite activity case
We first consider the case of a finite activity CRM. In this case, the number of nodes is bounded below by the square root of the number of edges and bounded above by the (finite) number of jumps of the CRM. Let T = ∫_0^∞ ρ(dw), with 0 < T < ∞. Let J_α denote the number of points (w_i, θ_i) such that θ_i ≤ α. (J_α)_{α≥0} is a homogeneous Poisson process of rate T, and thus J_α/α → T almost surely. Since √(N_α^(e)) ≤ N_α ≤ J_α almost surely, it follows from result (56) that N_α = Θ(α) almost surely as α → ∞.
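This argument is easy to check by simulation. The sketch below uses an assumed finite-activity surrogate of ours (Poisson(Tα) jumps with, purely illustratively, T = 2 and Exp(1) jump sizes, not the paper's specification) and shows the ratio N_α/α stabilising:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 2.0   # total mass of the assumed finite-activity Levy measure

def n_nodes(alpha):
    """Simulate the restriction to [0, alpha] and count non-isolated nodes."""
    J = rng.poisson(T * alpha)          # number of jumps: J_alpha ~ Poisson(T alpha)
    w = rng.exponential(1.0, size=J)    # jump sizes, IID (illustrative choice)
    p = 1.0 - np.exp(-2.0 * np.outer(w, w))
    upper = np.triu(rng.random((J, J)) < p, k=1)
    adj = upper | upper.T
    return int(adj.any(axis=1).sum())

for alpha in [50, 100, 200, 400]:
    # the ratio stabilises, consistent with N_alpha = Theta(alpha)
    print(alpha, round(n_nodes(alpha) / alpha, 2))
```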

C.3.2. Infinite activity case
We now consider the infinite activity case where ∫_0^∞ ρ(dw) = ∞. First note that, by monotone convergence, lim_{t→∞} ψ(t) = ∞. Consider the sets A_k = [(k − 1)/2, k/2) for k = 1, 2, ..., and let S_n^(1) = ∪_{k=1}^n A_{2k−1} and S_n^(2) = ∪_{k=1}^n A_{2k}; S_n^(1) and S_n^(2) define a partition of [0, n]. For n ∈ N, define the random variable X_n as the number of points of A_{2n−1} that are connected to points of S_n^(2); see Fig. 14 for an illustration (in that figure X_1 = 0, X_2 = 2 and X_3 = 1). Clearly, Ñ_α, the number of points in S^(1) ∩ [0, α] that are connected to points in S^(2) ∩ [0, α], is a lower bound for the number of nodes: Ñ_α ≤ N_α almost surely. (58) Using the notation S^(1) = ∪_{k=1}^∞ A_{2k−1} and S^(2) = ∪_{k=1}^∞ A_{2k}, let W^(1) and W^(2) be respectively the restriction of W to the set S^(1) and S^(2). As S^(1) and S^(2) are non-overlapping and W is a CRM, W^(1) and W^(2) are independent. Integrating over W^(1) and using the marking theorem for Poisson processes (see below for more details), we obtain, for n ≥ 1, a lower bound which, combined with result (60), implies that (1/n) Σ_{k=1}^n X_k → ∞ almost surely. We conclude that Ñ_α/α → ∞ almost surely and, using inequality (58), that N_α/α → ∞ almost surely.

C.4. Proof of theorem 5
where the last line follows from Campbell's theorem. The remaining expectation is handled by using the Palm formula for Poisson point processes (Bertoin, 2006; Daley and Vere-Jones, 2008). The Laplace exponent ψ is a strictly increasing function with ψ(0) = 0 and lim_{t→∞} ψ(t) = ∞, and therefore admits a well-defined inverse ψ^{-1}: [0, ∞) → [0, ∞). Using the change of variable u = ψ(2w), we obtain the integral in a form amenable to Tauberian analysis. If ρ is a regularly varying Lévy measure with exponent σ ∈ [0, 1) and slowly varying function ℓ, then it follows from lemma 3 in Appendix D and ψ^{-1}(0) = 0 that the integrand is dominated by a monotone decreasing function g. Applying the Tauberian theorem of proposition 2 in Appendix D, and combining the resulting asymptotics with equation (63), completes the proof.

Appendix D: Technical lemmas
The following lemma is a corollary of theorem 3, page 239, in Feller (1971).

Appendix E: Proof of theorem 6
The proof of theorem 6 relies on results on posterior characterization with models involving normalized CRMs. We first state a corollary of lemma 5 by  and theorem 8.1 by James (2002). Similar results appear in Prünster (2002), James (2005) and . The corollary involves the introduction of a discrete random variable R, conditional on which the CRM has strictly positive mass.

Appendix F: Details on the Markov chain Monte Carlo algorithms
The undirected graph sampler that was outlined in Section 7.2 iterates as follows.
Step 1: update w_{1:N_α} given the rest with HMC sampling.
Step 2: update the hyperparameters given the rest with a Metropolis-Hastings step.
Step 3: update the latent counts ñ_ij given the rest by using either the full conditional or a Metropolis-Hastings step.

F.1. Step 2: update of the hyperparameters
Under the improper priors (43), the acceptance probability reduces to a simpler closed form.

F.2. Step 3: update of the latent variables ñ_ij
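A compact sketch of the kind of HMC move used in step 1 is given below. The leapfrog integrator and accept/reject step are standard; the target conditional (Poisson 'degrees' m_i with a Gamma(1, 1) prior on the weights, updated on the log-scale) is a toy stand-in of ours, not the paper's exact full conditional:

```python
import numpy as np

def hmc_step(logp_and_grad, x, rng, eps=0.01, n_leapfrog=20):
    """One HMC update: leapfrog integration of Hamiltonian dynamics,
    then a Metropolis accept/reject step."""
    p0 = rng.standard_normal(x.shape)
    logp, grad = logp_and_grad(x)
    x_new, p = x.copy(), p0 + 0.5 * eps * grad      # initial half-step
    for i in range(n_leapfrog):
        x_new = x_new + eps * p                     # full position step
        logp_new, grad = logp_and_grad(x_new)
        if i < n_leapfrog - 1:
            p = p + eps * grad                      # full momentum step
    p = p + 0.5 * eps * grad                        # final half-step
    log_accept = (logp_new - 0.5 * p @ p) - (logp - 0.5 * p0 @ p0)
    return (x_new, logp_new) if np.log(rng.random()) < log_accept else (x, logp)

# Toy conditional (our stand-in): m_i ~ Poisson(w_i), w_i ~ Gamma(1, 1),
# sampled on the log-scale eta_i = log w_i.
m = np.array([3.0, 0.0, 7.0, 1.0])

def logp_and_grad(eta):
    w = np.exp(eta)
    logp = np.sum(m * eta - 2.0 * w + eta)   # likelihood + prior + log-Jacobian
    grad = m - 2.0 * w + 1.0
    return logp, grad

rng = np.random.default_rng(3)
eta = np.zeros(4)
for _ in range(500):
    eta, _ = hmc_step(logp_and_grad, eta, rng)
print(np.exp(eta))   # draws fluctuate around the posterior means (m_i + 1) / 2
```

Working on the log-scale keeps the weights positive without boundary handling, which is also why a Jacobian term appears in the toy log-density.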

Discussion on the paper by Caron and Fox
Ginestra Bianconi (Queen Mary University of London) I am delighted and honoured to open the discussion on this paper, which provides an ideal starting point to reflect on the benefits provided by the interdisciplinary character of network science. From brain research to social networks, 'big data' in the form of networks are permeating the sciences and our everyday life. There is therefore an urgent need to develop new methods and techniques to extract information from network data. Network science has developed as the fast-growing interdisciplinary field that addresses this problem, with the aim of obtaining predictive models for social, physical and biological phenomena occurring in networks.
The success of network science (Barabási, 2016;Dorogovtsev and Mendes, 2002) is rooted in the following two characteristics of the field: (a) the ubiquitous presence of networks describing complex interacting systems in social, technological and biological contexts; (b) the ability of the field to adopt methods and techniques coming from different theoretical disciplines such as statistical mechanics, graph theory and statistical network modelling.
Although the first aspect was shown to be essential for the characterization of the universal properties of networks already at the beginning of the field (Barabási and Albert, 2009), more recently it has become clear that network science can lead to a comprehensive understanding of network phenomena only if the methods and techniques that are used to study networks reflect different theoretical perspectives. Initially, statistical mechanics was the most prosperous and successful approach to network modelling. In this framework we distinguish between a non-equilibrium approach of growing network models evolving through the preferential attachment rule (Dorogovtsev and Mendes, 2002; Barabási and Albert, 2009) and equilibrium approaches characterizing network ensembles enforcing either hard constraints (such as the configuration model) or soft constraints (such as exponential random graphs) (Park and Newman, 2004; Bianconi, 2009, 2010).
Most statistical mechanics models describe the regime in which the average degree ⟨k⟩ does not depend on the number of nodes N, i.e. ⟨k⟩ = O(1). Particular focus has been placed on scale-free networks in this regime, having degree distribution P(k) ∝ k^{-γ} for k ≫ 1 and a power law exponent in the range γ ∈ (2, 3]. The particular emphasis given to the regime in which the average degree is constant with increasing network size is justified by the fact that the vast majority of network data, from the Internet to actor collaboration networks, belong to this class of networks (Barabási, 2016).
However, there is evidence that several network data sets, from social on-line networks (Seyed-Allaei et al., 2006) to neuroscience (Bonifazi et al., 2009), have a power law degree distribution P(k) ∝ k^{-γ} with γ ∈ (1, 2], implying very heterogeneous degrees, very significant hubs and a diverging average degree. Despite the recent interest in these classes of networks (Seyed-Allaei et al., 2006; Lambiotte et al., 2016; Timár et al., 2016), historically these networks have been disregarded or neglected in the statistical mechanics community (Barabási, 2016; Del Genio et al., 2011).
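The distinction between the two regimes can be illustrated numerically: sampling degrees from a truncated discrete power law shows the empirical average degree remaining moderate for γ ∈ (2, 3] but blowing up with the truncation level for γ ∈ (1, 2]. The truncation point 10^6 and sample sizes below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(6)

def mean_degree(gamma, n, k_max=10**6):
    """Empirical average of n degrees drawn from P(k) ~ k^(-gamma),
    truncated at k_max (truncation and sample size are arbitrary)."""
    k = np.arange(1, k_max + 1, dtype=float)
    p = k ** -gamma
    p /= p.sum()
    return rng.choice(k, size=n, p=p).mean()

print(mean_degree(2.5, 100_000))  # gamma in (2, 3]: finite average degree
print(mean_degree(1.5, 100_000))  # gamma in (1, 2]: average diverges with k_max
```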
From the statistical perspective, we have recently been witnessing a renaissance of network modelling emphasizing the relevance of having exchangeable and projective network models. Exchangeable models guarantee that the order in which nodes are sampled is irrelevant. Projective models guarantee that a network inference performed on a sample of N nodes can be used to infer properties of a larger sample including more nodes. Interestingly, the statistical mechanics models, including non-equilibrium growing network models and network ensembles, have neither of these two properties.
Exchangeable projective models generated by a jointly exchangeable adjacency matrix are described by the graphon and generate dense networks where the average degree increases linearly with the network size, i.e. ⟨k⟩ = O(N): given a jointly exchangeable adjacency matrix, the Aldous-Hoover representation can be expressed as the existence of a graphon, implying that the network is dense. Since the vast majority of real network data sets are sparse, with an average degree increasing sublinearly with the number of nodes, one of the major problems of the field was to overcome this limitation of the graphon.
In their paper, Caron and Fox formulate for the first time a generative, exchangeable and projective model for sparse networks.
The major step in overcoming the limitation of the graphon has been to consider a point process on the plane instead of an exchangeable matrix. Thanks to the Kallenberg representation theorem for exchangeable random measures, this model admits a representation as a mixture of random functions that naturally extends the graphon to the sparse regime.
Caron and Fox's model, based on the assumption of the existence of a latent space (the sociability of a node), achieves the following three major results. (a) It provides a generative, exchangeable and projective model for sparse networks. (b) It naturally generates networks with heavy-tailed degree distributions, including power law exponents γ ∈ (1, 2]. Therefore the model shows that in the framework of statistical network modelling these networks emerge naturally, enabling us to describe networks that have been historically neglected in the statistical mechanics approach to networks. (c) The model enables an efficient method for inference of network data. Analysed network data include network data sets of sizes up to N = 300 000 nodes. It should be noted that the code is freely available from the authors' Web page.
In conclusion, Caron and Fox's paper opens a new scenario in statistical network modelling, enabling the treatment of exchangeable and projective sparse network models. Additionally, this work is a beautiful example of the benefits that can be obtained from an interdisciplinary approach to network science.
Many prospects can be envisaged for future research. These include the extension to more general latent spaces, the proposal of new exchangeable network models to preserve privacy and the investigation of the regime ⟨k⟩ = O(1) for exchangeable network models, which would finally clarify whether it is possible to have a global framework to study both networks with constant average degree and networks with diverging average degree.
It therefore gives me great pleasure to propose the vote of thanks.

Karthik Bharath (University of Nottingham)
I applaud Caron and Fox's efforts in the paper for developing a novel approach to represent and model networks based on point processes, which since its first version in 2014 has inspired several new forays into the literature on graph limits. In a certain sense the authors' approach is reminiscent of the objective method (Aldous and Steele, 2004) for combinatorial objects and of continuum limits of discrete objects, wherein the chief interest is in local properties of the graphs rather than global properties such as sparsity. However, the utility of such probabilistically interesting continuous objects as statistical models for network or graph data is suspect; this was recently explored in Bharath et al. (2017) in the context of tree-structured data (acyclic graphs) using the continuum random tree (Aldous, 1993). My own view is that statistical models that aim to capture and quantify variability in characteristics of real world networks must necessarily be compatible with their inherently discrete nature, and with the fundamental statistical notion of sampling from a population network (e.g. adding or deleting vertices). I am hence sceptical of the conceptual underpinning of their approach from a purely modelling perspective: the appropriateness of the notion of exchangeability, which is the leitmotiv of the paper, of point processes (within the Kallenberg framework) for network models is unclear. Depending on the question of interest, the basic statistical datum on a network is identified (vertex, edge, triangle or other motifs), and exchangeability is then defined as (probabilistic) invariance to their labelling, which usually is arbitrary. The authors' continuous space representation of a network as a point process Z = Σ_{i,j} z_ij δ_(θ_i, θ_j) on the plane implies that the distribution of the number of points that fall within rectangles A_i × A_j, where A_i := [h(i − 1), hi], h > 0, is invariant to relabelling of the rectangles formed by intervals.
This begs the question: what exactly is exchangeable on an observed network with vertices and edges represented by the measure-valued process {Z_α, α > 0}? The answer becomes trickier when isolated vertices are deleted, as done in the paper. In fact the sequence of networks {Z_α, α > 0} is unlabelled for the observed atoms! A similar issue also arises for the desirable notion of projectivity under subsampling: for α_2 < α_1, projectivity of the restriction of the law of Z_{α_1} to [0, α_2]^2 is not compatible with operations such as deletion and addition of vertices or edges, and I am unsure whether a sensible analogous interpretation is available within the point process setting. In this context, recently developed notions of edge and relational exchangeability appear particularly cogent (Crane and Dempsey, 2015, 2016).
The flexibility in their approach to generate dense and sparse graphs governed by the choice of the Lévy measure is attractive and points towards the development of interesting inferential tools for networks. The data sets that are considered in the paper are pregnant with interesting questions on functional relationships between vertices and on transport phenomena in networks that are indicative of their functionality. Disappointingly, the authors restrict their attention only to assessing whether a network is sparse. The apparent utility (or lack thereof) of the point process representation for inference on real networks is not examined at all. For example, extreme events on the USairport or www data sets could be formulated by using random walks on the network. A vertex could be classified as experiencing an extreme event if the average number of walks traversing it at a given time is greater than some threshold (Kishore et al., 2011). The average could then be computed with respect to the stationary distribution of a Markov chain with an n-step transition probability of a walk from vertex i to vertex j given by P_ij(n + 1) = Σ_r {1 − exp(−2 w_r w_j)} P_ir(n). Posterior samples of w_i could conceivably be used in estimating the stationary distribution, which then allows for probabilistic statements about such extreme events. Assessing power law behaviour through the double-logarithm plot has been shown to be dubious; surprisingly, the authors use this method of assessment despite referring to that work. Can the overall model fit be assessed through a posterior predictive check?
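The random walk suggested here is straightforward to implement from posterior weight samples. In the sketch below the row normalisation of the transition probabilities is our assumption (the displayed recursion leaves it implicit), and the weight vector is illustrative rather than an actual posterior draw:

```python
import numpy as np

def stationary_distribution(w, tol=1e-12):
    """Stationary law of the walk with transitions proportional to the
    edge probabilities 1 - exp(-2 w_i w_j) (row-normalised), found by
    power iteration from the uniform distribution."""
    q = 1.0 - np.exp(-2.0 * np.outer(w, w))
    np.fill_diagonal(q, 0.0)
    q /= q.sum(axis=1, keepdims=True)
    pi = np.full(len(w), 1.0 / len(w))
    while True:
        nxt = pi @ q
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt

w = np.array([0.1, 0.2, 0.5, 2.0, 5.0])   # illustrative sociabilities
pi = stationary_distribution(w)
print(pi)                                 # mass concentrates on high-w vertices
print(pi > 2.0 / len(w))                  # flag candidate 'extreme' vertices
```

Averaging such stationary distributions over posterior draws of w would give the probabilistic statements about extreme events that the comment envisages.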
The interpretation (and identifiability) of α in Z_α within the context of an observed network is not clear, since α is the upper bound for the θ_i; this is exacerbated in the sparse regime when α → ∞. The interpretation of the credible intervals for the σ-parameter (and hence the question of sparsity) is also unclear, since the parameter α is unobserved and is estimated. It might be interesting to explore the structural properties of the graphs if the completely random measure W is allowed to be inhomogeneous such that ν(dw, dθ) = ρ(dw, θ) λ(dθ); indeed, this brings about additional computational and interpretability issues. Despite the elegant theoretical foundation, the point process model for a finite network is perhaps 'too rich': whereas the Lévy measure ρ and the jumps w_i are profitably used, the latent parameters θ_i (whose natural ordering on R_+ is discarded in the interests of exchangeability) and α throw up some uncomfortable questions on model interpretability and identifiability.
The vote of thanks was passed by acclamation.

Wilfrid S. Kendall (University of Warwick, Coventry)
Caron and Fox are to be congratulated on producing an inspiring and elegant blend of the creative probabilistic theory of exchangeability and the practical demands of statistical network analysis. I am impressed by the way in which practical considerations of tunability guide them towards exchangeable and completely random measures and theorems of Kallenberg and Aldous-Hoover.
Here I share recent work on rather different random networks, having a pronounced spatial element. Aldous (Aldous and Ganesan, 2013; Aldous, 2014) proposed an axiomatization of what one might call a scale invariant random spatial network (SIRSN). Motivated by on-line maps, one postulates a random entity producing a unique set of connections N(x_1, ..., x_n) between any prescribed set of points x_1, ..., x_n on the plane: N(x_1, ..., x_n) is composed of routes r(x_i, x_j) between each pair of points x_i and x_j. It is required that the network be statistically invariant under translation, rotation and scaling, that the mean distance of a route between two specified points be finite and, finally, that, if one considers a network connecting all points of an (independent) Poisson point process of intensity λ, then the 'long distance' parts of the routes (say, r(x_i, x_j) \ ball(x_i, 1) \ ball(x_j, 1)) should form a fibre process of finite mean length per unit area. (This is the 'weak SIRSN property'; the 'strong SIRSN property' requires the bound on mean length per unit area to hold uniformly in λ.) In effect, long distance routes are reused by many different point pairs. It is a highly non-trivial matter to prove that SIRSNs can exist, but we now have two different examples. Aldous (2014) produced a construction based on randomized grids. Kahn (2016) and Kendall (2017) put together arguments which showed that an SIRSN can be constructed using Poisson line processes: improper scale invariant line processes marked by speeds, with routes formed using the fastest possible paths in the network. Remarkably, the Poisson line process construction also works in 3-space and beyond.
It is salutary to compare this with the work of Caron and Fox. SIRSNs provide toy models for real world spatial transportation networks, and a theoretical justification for the helpful notion of 'transit nodes' (Bast et al., 2007). But, although there is typically a single scalar parameter γ expressing the scale invariance, current SIRSN models cannot be said to be very tunable, nor to accommodate possibilities for data modelling; a clear challenge for future work.

Benjamin Bloem-Reddy (Columbia University, New York)
Research in this area is progressing rapidly, and there are connections to some concurrent work; see also Janson (2017a), section 5.1, for related ideas. For every model based on an exchangeable random measure (ERM) (see also Roy (2015, 2016) and Janson (2016)), we can construct a so-called edge exchangeable model (Crane and Dempsey, 2015, 2016; Williamson, 2016; Cai et al., 2016; Janson, 2017a) that coincides with a finite restriction of the ERM model. The converse is true only in some cases (see Cai et al. (2016) for an example). I consider the completely random measure (CRM) case for concreteness.
Let X_1, X_2, ... be an exchangeable sequence of edges, and let E_n be the directed multigraph composed of the first n edges, labelled in order of appearance. If the edge(s) connecting a pair of vertices are given a unique 'colour', Kingman's paintbox theorem shows that every such graph can be generated by sampling from a random discrete probability measure Φ drawn from a mixing measure μ (Crane and Dempsey, 2016). Denote such a graph by E_n^Φ. Consider the case where Φ is the normalized CRM product measure (see section 5.5.1) Φ_α := W_α × W_α / W_α^{*2}. Then the graph D_α, with D_α^* edges, and E^{Φ_α}_{D_α^*} have the same conditional law. If μ places mass only on normalized CRMs with mean measure ρ(dw) λ_α(dθ), then equality in distribution holds unconditionally. However, an ERM model and its counterpart edge exchangeable model coincide only for a particular α. D_α and E^{Φ_α}_{D_α^*} grow differently: fix ε > 0, and let D_α grow to D_{α+ε}, denoting by D^*_{α,ε} := D^*_{α+ε} − D^*_α the number of additional edges. The resulting inequality between the conditional laws reflects a fundamental difference in how the two model classes encode the notion of growth and offers guidance for choosing an appropriate model class. Edge exchangeable models posit that graphs grow one edge at a time; growth in ERM models is by a random number of edges as α increases. As a consequence, in an edge exchangeable graph growing to arbitrary size, an edge may occur between two vertices that were previously not connected. This is not true for ERM models, which require that new edges form only between new vertices, inducing a growing population of edges from which to sample. Conversely, edge exchangeable models sample from a fixed (possibly infinite) population.
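A finite-dimensional sketch may make the edge-by-edge sampling scheme concrete: each edge's endpoints are drawn IID from a fixed normalised weight vector, a stand-in for the normalised CRM product measure; the heavy-tailed weights below are an arbitrary choice of ours:

```python
import numpy as np
from collections import Counter

def edge_exchangeable_graph(weights, n_edges, rng):
    """Grow a directed multigraph one edge at a time: both endpoints of
    each edge are drawn IID from the normalised weights."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    n = len(probs)
    src = rng.choice(n, size=n_edges, p=probs)
    dst = rng.choice(n, size=n_edges, p=probs)
    return Counter(zip(src.tolist(), dst.tolist()))  # edge -> multiplicity

rng = np.random.default_rng(4)
w = rng.pareto(1.5, size=1000) + 1e-3    # heavy-tailed weights (illustrative)
g = edge_exchangeable_graph(w, n_edges=5000, rng=rng)
vertices = {v for edge in g for v in edge}
print(len(vertices), "distinct vertices across", sum(g.values()), "edge draws")
```

Because vertices appear only when an edge selects them, the vertex count grows sublinearly in the number of edges, which is the edge exchangeable route to sparsity described above.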

Patrick Rubin-Delanchy (University of Bristol)
This is a very creative and thought-provoking paper on network modelling, opening many avenues for future research. With this in mind, it would be useful to set down formally what this new definition of graph exchangeability should be in general: is an arbitrary model for an infinite graph (z_ij)_{i,j=1,2,...} exchangeable in this sense if there is a coupling ((z_ij), (θ_i)_{i=1,2,...}) so that, marginally, z follows the specified model and (θ_i) is a unit rate Poisson process, and, jointly, the point process Σ z_ij δ_(θ_i, θ_j) is exchangeable? Some popular approaches to network modelling rely on latent positions (X_i)_{i=1,2,...} for each of the nodes, where, conditionally on X_i, we have z_ij ~ind Bernoulli{f(X_i, X_j)} for each i and j, for some function f. Under the usual infinite exchangeability assumption, the Aldous-Hoover theorem suggests a Bayesian treatment of f, with the X_i independent uniform positions on [0, 1]. However, the latent space model assumes that f(X_i, X_j) is logistic in the distance between X_i and X_j (ignoring covariates), with the X_i independent and identically distributed (IID) on R^d (this being part of the prior rather than the model), and the random dot product graph (Nickel, 2006; Young and Scheinerman, 2007; Athreya et al., 2016) assumes that f(X_i, X_j) = X_i^T X_j, with the X_i IID on a convex set X ⊂ R^d. The stochastic block model (Holland et al., 1983) and the mixed membership stochastic block model can also be written as latent position models. In the present paper, such models were mentioned as possible extensions where, if I understand correctly, the θ_i could represent these latent positions. Instead, I think of the w_i as latent positions, and indeed in a follow-up paper these parameters live in higher dimensions.
What I expect in general is that the assumption of IID latent positions, present in each of the earlier models, must be dispensed with, and instead the X i must now form a more complicated point process on R d (or whichever space is appropriate), just as the w i are not IID but instead form a more complex point process on the positive line.
It is then interesting to ask how estimates of these positions are affected, starting with the simplest question of whether even the w_i in this paper are estimated consistently. To give a more interesting example, for the random dot product graph, Athreya et al. (2016) have shown that the estimation error under adjacency spectral embedding goes as the usual rate n^{-1/2} and is asymptotically Gaussian, but here the IID assumption on the X_i is important. If the graph is 'sparsified' by using the ideas of this paper, how are these spectral estimates affected?
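The adjacency spectral embedding that Athreya et al. (2016) analyse can be sketched in a few lines for the rank 1 case; the latent positions, graph size and uniform prior below are illustrative choices of ours:

```python
import numpy as np

def adjacency_spectral_embedding(adj, d):
    """Estimate latent positions from the top-d eigenpairs (by magnitude)
    of the adjacency matrix, scaled by the root of the eigenvalues."""
    vals, vecs = np.linalg.eigh(adj)
    top = np.argsort(np.abs(vals))[-d:]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

rng = np.random.default_rng(7)
n, d = 400, 1
X = rng.uniform(0.2, 0.8, size=(n, d))       # illustrative IID latent positions
P = X @ X.T                                  # edge probabilities X_i . X_j
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)
Xhat = adjacency_spectral_embedding(A, d)
Xhat *= np.sign(np.sum(Xhat * X))            # resolve the sign ambiguity (d = 1)
print(np.abs(Xhat - X).mean())               # small, and shrinks as n grows
```

Rerunning with edge probabilities scaled down, in the spirit of the sparsification question above, degrades this error and illustrates why the sparse regime is delicate for spectral estimates.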
The following contributions were received in writing after the meeting.

Julyan Arbel (Inria Grenoble Rhône-Alpes)
The paper by Caron and Fox is a very fine methodological work which illustrates once again the huge modelling flexibility and versatility of discrete Bayesian non-parametric priors. They target here sparsity in graphs, the level of which can be neatly assessed according to the stability parameter σ of the discrete process under consideration.
The posterior distribution of σ is notoriously highly concentrated in the context of Bayesian non-parametric inference for species sampling problems. The narrowness of the credible intervals obtained for the real world graphs suggests that the same holds here. Caron and Fox validate their methodology via posterior predictive checks such as the fit to the empirical degree distribution. Another type of validation, theoretical though, which is not considered by them is posterior consistency. In the present setting, the graph is given and interest is in assessing graph properties such as sparsity. Posterior consistency here amounts to asking whether the model is capable of recovering a sparsity index σ; in other words, if the true graph-generative process is assumed to have a sparsity index σ_0, does the posterior of σ contract to a point mass at σ_0 when the size of the graph increases to ∞? The sparsity index σ_0 can be defined in the spirit of equation (21) by the asymptotic relationship between the number of nodes N_α and the number of edges N_α^(e) as the graph size grows to ∞. According to the definitions given in equations (17) and (18), the graph size can be equivalently measured by α, N_α or N_α^(e). The true graph-generative process could be the generalized gamma process itself (well-specified setting) or any other graph process satisfying condition (77) for some σ_0 (misspecified setting).
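One crude empirical counterpart of such a sparsity index is the slope of log N_α^(e) against log N_α over growing restrictions of a graph. The sketch below applies this to an assumed finite-activity (hence dense) surrogate model of ours, for which the slope should be close to 2:

```python
import numpy as np

rng = np.random.default_rng(5)

def graph_counts(alpha):
    """Nodes and edges of the graph restricted to [0, alpha] under a
    finite-activity surrogate (rate-1 Poisson jumps, Exp(1) weights)."""
    J = rng.poisson(alpha)
    w = rng.exponential(1.0, size=J)
    p = 1.0 - np.exp(-2.0 * np.outer(w, w))
    upper = np.triu(rng.random((J, J)) < p, k=1)
    adj = upper | upper.T
    return int(adj.any(axis=1).sum()), int(upper.sum())

logs = np.array([np.log(graph_counts(a)) for a in [50, 100, 200, 400]])
slope = np.polyfit(logs[:, 0], logs[:, 1], 1)[0]   # log-edges vs log-nodes
print(round(slope, 2))   # close to 2 for this dense surrogate
```

A sparse generative process would yield a slope below 2, which is the empirical signature that a consistency result for σ would need to pin down.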
In recent research (Arbel et al., 2017), we introduce conditions on the Lévy intensity of the completely random measure to ensure consistent estimation in species sampling problems. Though the setting is quite different, our conditions are similar to the tail assumptions made by Caron and Fox in the derivation of the asymptotic behaviours of N_α and N_α^(e). Admittedly, the consistency assumption of a true generative model with a given fixed level of sparsity is an idealized assumption which cannot account for real world graph oddities such as the local effects underlined by the authors: dense subgraphs (spots) and community structure. However, we believe that consistency properties could shed some light on why σ was estimated to be negative for most of the real world applications in the paper, thus concluding in favour of dense graphs.

Sayantan Banerjee (Indian Institute of Management Indore) and Subhashis Ghosal (North Carolina State University, Raleigh)
Caron and Fox are to be congratulated for an interesting construction of random graphs through completely random measures on the real line, and also for extending the ideas to directed multigraphs through Poisson processes with intensity measures given by the product of completely random measures on the real line. Sociability parameters, themselves generated through completely random measures, are used to generate edges. The most interesting aspect of such constructions is the dichotomy of density or sparsity, depending on whether the intensity measure of the underlying completely random measure has finite or infinite mass. In particular, the example of the generalized gamma process is fascinating, since the whole spectrum of activity can be captured through a single tuning parameter. In this context, it will be interesting to characterize the normalized graph Laplacian and its eigenvalues, which give valuable information on the number of connected components. Using an appropriate graph partitioning algorithm (see, for example, Von Luxburg (2007)), it is possible to detect different communities in the graph. It will be interesting to study the probabilistic behaviour of the communities thus obtained from a random graph constructed by the authors. This may facilitate scale-dependent community detection in graphs, where the different scales refer to the local or global neighbourhood of the individual nodes of the graph. For example, in the case of a Swiss roll manifold generated from Gaussian mixtures, some scales can recover the underlying Gaussian components and some scales can detect communities which respect the underlying geometry of the manifold.
From a more general perspective, the model proposed by the authors seems to be an interesting way of selecting a particular graphon model P(z_ij = 1) = g(U_i, U_j) for the probability of an edge through latent variables U_1, U_2, .... Setting a particular point as a reference point, tying the graphon to zero at it and making the latent variables concentrate near this point should induce sparsity. Can this general graphon model be obtained through a transformation on the plane from the model proposed by the authors? If so, can this connection be used to characterize the sparsity-generating mechanism in the general graphon model?
Another question potentially of interest and topic of further study is the limiting distribution of normalized numbers of edges. A normal approximation in the dense case and a Poisson approximation in the sparse case are expected.

Marco Battiston (Oxford University) and Stefano Favaro (University of Turin and Collegio Carlo Alberto, Turin)
We congratulate Caron and Fox for their interesting contribution, which has already attracted much interest in the statistical community. Here we would like to point to new developments related to privacy issues in network modelling. Network data usually contain sensitive information about individuals, e.g. medical status, wages, friendships or sexual or political preferences. A noteworthy example of privacy disclosure is in Narayanan and Shmatikov (2009), who showed how to identify users in the Netflix data set, which can be modelled by a weighted bipartite graph, even after user and movie labels had been removed. Privacy problems are concerned with providing mechanisms to transform raw data into a privatized data set to be released. A popular measure to check whether a mechanism can privatize a data set is differential privacy, which was initially proposed in Dwork et al. (2006) and has recently been considered in graph theory for network data. A mechanism A is said to be ε-node private if an intruder looking at the output released by the mechanism cannot correctly guess with high probability whether a node (individual) is in the data set or not, nor figure out which are his edges (links). A formal definition of ε-node privacy is that, for all subsets S of the output space, the mechanism A must satisfy Pr{A(G) ∈ S} ≤ exp(ε) Pr{A(G′) ∈ S} (78) for all graphs G and G′ which can be obtained one from the other by removing a vertex and its adjacent edges.
To our knowledge, the only attempt to study an ε-node private mechanism is in Chayes et al. (2015). They considered sparse graphs obtained by rescaling a dense graphon with the network size, and they proposed a mechanism that releases as output a step graphon satisfying condition (78). We believe that an interesting line of research would be to study how privacy constraints could be handled within the sparse graphs setting proposed by Caron and Fox. Specifically, is ε-node privacy a good measure of disclosure for graphs, or are better notions needed in the sparse regime? How do we construct mechanisms satisfying these privacy notions? Will the privatized network data set obtained by such a mechanism preserve enough statistical utility? As pointed out in Narayanan and Shmatikov (2009), sparsity facilitates disclosure and at the same time makes statistical inference more difficult. Therefore, on the one hand, we might need quite a stringent notion of privacy for sparse graphs, but on the other hand this requirement may drastically affect the statistical utility of the released network data set. As a consequence, a clear trade-off between privacy guarantees and statistical utility arises, particularly in sparse settings. Natural questions are how to formalize this trade-off mathematically and then how to solve it.
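As a toy illustration of the differential privacy inequality in condition (78), the sketch below implements edge-level randomized response, a strictly weaker notion than ε-node privacy; the mechanism and the parameter values are illustrative assumptions, not taken from Chayes et al. (2015):

```python
import math
import random

def randomized_response_graph(edges, n, eps, seed=0):
    """Release a perturbed graph: flip each potential edge independently
    with probability 1 / (1 + e^eps).  This gives edge-level differential
    privacy, a weaker guarantee than the node privacy of condition (78)."""
    rng = random.Random(seed)
    p_flip = 1.0 / (1.0 + math.exp(eps))
    out = set()
    for i in range(n):
        for j in range(i + 1, n):
            present = (i, j) in edges
            if rng.random() < p_flip:
                present = not present
            if present:
                out.add((i, j))
    return out

# For any single potential edge, the likelihood ratio of the released
# value under "edge present" versus "edge absent" is exactly e^eps,
# mirroring the bound in condition (78).
eps = 1.0
p_flip = 1.0 / (1.0 + math.exp(eps))
ratio = (1.0 - p_flip) / p_flip
released = randomized_response_graph({(0, 1)}, 4, eps)
print(ratio, math.exp(eps))
```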

C. Borgs and J. T. Chayes (Microsoft Research)
This paper is an important contribution to statistical network modelling, giving a framework to formulate exchangeability for sparse graphs by embedding them in random measures on R_+^2. It has already inspired several other works, including ours with H. Cohn and N. Holden.
We put the Caron-Fox model into a broader context of graphon processes as follows. First, rather than viewing w_i as a 'sociability' parameter of node i, we think of it as a feature in some σ-finite space Ω (e.g. R^d with a locally finite measure, a discrete but not necessarily finite space, or a combination of the two). Second, rather than a product function, we consider an arbitrary two-variable function (graphon) W : Ω × Ω → [0, 1] over the feature space.
We construct a time-dependent family (G_t)_{t≥0} of graphs by first generating a Poisson point process with intensity tρ, where ρ is an arbitrary σ-finite measure on the feature space, then connecting two points with features w_i and w_j with probability W(w_i, w_j) and finally deleting points unconnected up to time t.
If we choose the feature space to be R_+ and W(w, w′) = 1 − exp(−2ww′), we obtain the Caron-Fox model. If we take the space to be R_+ × [k], and W the product of the Caron-Fox graphon with a k × k matrix, we obtain a sparse stochastic block model version. Taking it to be R_+ times the k-simplex, we obtain a sparse, degree-corrected mixed membership model, etc. Hence, our work represents a substantial extension of the modelling aspect of the authors' work.
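A minimal simulation of the graphon process described above, specialized to the Caron-Fox graphon W(w, w′) = 1 − exp(−2ww′); since a full σ-finite feature measure cannot be sampled, an Exp(1) feature distribution with a finite rate stands in for a truncated ρ (an assumption for illustration only):

```python
import math
import random

def poisson(rng, lam):
    """Poisson sampler (Knuth's method; adequate for moderate lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def graphon_process_graph(t, rho_rate, sample_feature, W, seed=0):
    """Generate G_t: Poisson(t * rho_rate) points with i.i.d. features,
    connect pairs with probability W(w_i, w_j), delete isolated points.
    rho_rate is the total mass of the (truncated) feature measure."""
    rng = random.Random(seed)
    n = poisson(rng, t * rho_rate)
    w = [sample_feature(rng) for _ in range(n)]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < W(w[i], w[j])]
    keep = sorted({v for e in edges for v in e})
    return [w[i] for i in keep], edges

# Caron-Fox graphon; Exp(1) features stand in for a truncated rho.
W_cf = lambda w, wp: 1.0 - math.exp(-2.0 * w * wp)
weights, edges = graphon_process_graph(50.0, 2.0,
                                       lambda r: r.expovariate(1.0), W_cf)
print(len(weights), len(edges))
```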
Our work also sheds light on the theoretical questions alluded to by the authors. First, it generalizes both their model and the standard, dense exchangeable model. Second, it completes the Caron-Fox picture by giving an if-and-only-if characterization of exchangeability, whereas previous work (except for the simultaneous independent work of Veitch and Roy (2015)) gave only an 'if' statement. Explicitly, labelling edges by the birth times of their end points, we obtain an exchangeable random measure on R_+^2 from an arbitrary graphon process. More significantly, we prove the 'only if' statement: under mild decay conditions, all such measures can be obtained from graphon processes, extending the classic Aldous-Hoover theory for dense graphs to graphon processes.

Alexandre Bouchard-Côté and Creagh Briercliffe (University of British Columbia, Vancouver)
First, we congratulate Caron and Fox for this impressive contribution, which is already starting to have significant influence in the field. In what follows, we discuss some computational aspects of posterior inference.
In the general case, the authors describe that the computational bottlenecks of posterior inference lie in updating the weights w_i and the latent counts n_ij. Although there are more latent count variables than weight variables, sampling the former can be trivially parallelized, so, for sufficiently sparse graphs, we expect the weight updates to dominate.
The authors used Hamiltonian Monte Carlo within Gibbs sampling to perform the joint sampling of w_{1:N_α} given the rest of the variables, which seems a reasonable approach validated by solid experimental results. At the same time, given the prevalence of graph data of increasing size, we believe it is productive also to keep the door open to alternatives. In particular, the target log-density has a simple structure in the generalized gamma process case, so it is tempting to ask whether this structure can be leveraged. We investigate one potential direction by using two variants of the bouncy particle sampler (BPS) algorithm (Bouchard-Côté et al., 2017). Since the local variant of the BPS has been shown to outperform state-of-the-art Hamiltonian Monte Carlo algorithms on sparse factor graphs, this appears a priori an appropriate tool for inference in a sparse network model. Surprisingly, however, the factor graph induced by sampling the w_i s is in fact fully connected. We therefore resort to the global variant of the BPS algorithm.
We use the same reparameterization as the authors and compute collision times via superposition and adaptive thinning. We run the global BPS algorithm and Stan (Stan Development Team, 2016), each on a simulated graph of N_α = 2558 nodes, with the other parameters fixed to the values used to generate the data (w_* = 6.5427, τ = 1 and σ = 0.5). We observe that the latter outperformed the former (3.2 and 5.8 effective samples per second respectively). This is consistent with previous observations that BPS algorithms are most attractive when they can exploit sparsity (Bouchard-Côté et al., 2017).
Perhaps sparsity of the inference problem might reappear in more complex models that build on the work of Caron and Fox. For example, it might be interesting to modify follow-up work to allow some community-specific weights to be 0. Alternative constructions designed to accommodate covariates or node attributes might also potentially create additional sparsity structure.

Trevor Campbell and Tamara Broderick (Massachusetts Institute of Technology, Cambridge)
We congratulate Caron and Fox on their paper, which has already inspired further theory (Palla et al., 2016), an alternative class of models using independent and identically distributed sampling of edges from a fixed random measure (Cai et al., 2016; Crane and Dempsey, 2015, 2016; Campbell et al., 2016a; Janson, 2017a), and models for dynamic networks (Palla et al., 2016), link prediction (Williamson, 2016) and block structure learning.
The authors focus on the case where 'vertex sociabilities' are generated by the jumps of a Poisson process on R_+. They marginalize the latent Poisson process to devise an urn scheme and thereby a tractable Markov chain Monte Carlo sampler. However, in previous work, practitioners have often found it useful to represent the latent process explicitly: it is required by modern methods such as variational inference and Hamiltonian Monte Carlo sampling (Blei and Jordan, 2006); it can make inference simpler or tractable for hierarchical models; and it often facilitates parallel computing. The authors' results, though, imply that there is a challenge to instantiating the Poisson process in the case of sparse graphs. In particular, they show that, if a graph sequence is sparse, the Poisson process must have a countably infinite number of atoms in any restriction window [0, α] (theorem 2). But we cannot store infinitely many values in memory, or update infinitely many values in finite time, and therefore we must use an approximation. One approximation involves replacing the full Poisson process with only finitely many jumps. The authors suggest weight-based thresholding (Muliere and Tardella, 1998; Argiento et al., 2016), where jumps x ∈ [0, ε) are removed for some truncation level ε, but this requires dealing with non-standard truncated probability distributions on R_+. We highlight another option: to truncate a sequential representation of the Poisson process (Campbell et al., 2016b). This technique has the advantage that it typically involves only well-known exponential family distributions and thereby allows variational algorithms with simple closed-form updates (Blei and Jordan, 2006; Doshi-Velez et al., 2009). The approximation error of truncated sequential representations has recently been thoroughly characterized (Campbell et al., 2016b). However, these results do not immediately extend to the present network model, as it involves the product of a Poisson process with itself.
Nonetheless, we conjecture that variational inference based on a truncated sequential representation would enjoy similar benefits for the network model as for previous applications in Bayesian non-parametrics.
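As an illustration of truncating a sequential representation, the sketch below uses Bondesson's representation of the gamma process, one of the sequential representations covered by Campbell et al. (2016b); the shape parameter and truncation level are arbitrary choices:

```python
import math
import random

def truncated_gamma_process_jumps(theta, K, rng):
    """First K jumps of Bondesson's sequential representation of a
    gamma-process CRM (shape theta, rate 1):
        J_k = E_k * exp(-Gamma_k / theta),
    with E_k ~ Exp(1) i.i.d. and Gamma_k the arrival times of a
    unit-rate Poisson process.  The untruncated total mass is
    Gamma(theta, 1)-distributed."""
    gamma_k, jumps = 0.0, []
    for _ in range(K):
        gamma_k += rng.expovariate(1.0)   # next Poisson arrival
        jumps.append(rng.expovariate(1.0) * math.exp(-gamma_k / theta))
    return jumps

rng = random.Random(1)
theta, K = 2.0, 200
# With K = 200 the truncation error is negligible, so the average total
# mass over replicates should be close to theta.
masses = [sum(truncated_gamma_process_jumps(theta, K, rng))
          for _ in range(2000)]
mean_mass = sum(masses) / len(masses)
print(mean_mass)
```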

Roberto Casarin (University Ca' Foscari of Venice), Matteo Iacopini (University Ca' Foscari of Venice and Université Paris 1-Panthéon-Sorbonne) and Luca Rossini (University Ca' Foscari of Venice and Free University of Bozen-Bolzano)
Caron and Fox are to be congratulated on their excellent research, which has culminated in the development of a new class of random-graph models. The node degree and the degree distribution fail to give a unique characterization of network complexity (Estrada, 2010). For this reason global connectivity measures, such as communicability (Estrada and Hatano, 2008, 2009) and centrality (Borgatti and Everett, 2006), are used to analyse a graph. In this discussion we contribute to the analysis of the generalized gamma process (GGP) model compared with the Erdős-Rényi and the preferential attachment (Barabási and Albert, 1999) models. Our analysis is far from being exhaustive but shows that more theoretical aspects of the GGP model are to be investigated.

Fig. 15. Network statistics versus number of nodes for the GGP undirected network (σ = 0, σ = 0.5, σ = 0.8), the Erdős-Rényi and the preferential attachment models: (a) number of connected components; (b) clustering coefficient; (c) assortativity coefficient; (d) share of nodes in the core
A connected component of the n-node graph G = (V, E) is a subgraph in which any two vertices v_i and v_j are connected by paths. The number of connected components equals the multiplicity of the null eigenvalue of the graph Laplacian L, whose (i, j) entry is the degree of v_i if i = j, −1 if {v_i, v_j} ∈ E, and 0 otherwise. The global clustering coefficient measures the tendency of nodes to cluster and is defined as C = (number of triangle loops) / (number of connected triples of vertices). The assortativity coefficient between pairs of linked nodes is given by Σ_{jk} jk(e_jk − q_j q_k) / σ_q², where q_k and e_jk are the distribution and the joint excess degree probability of the remaining degrees respectively, for the two vertices v_j and v_k, and σ_q is the standard deviation of q_k. Finally, given the partition of the network into two non-overlapping subgraphs (core and periphery) that maximizes the number or weight of within-core-group edges, we compute the share of nodes in the core.
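The component and clustering statistics defined above can be computed directly; a minimal pure-Python sketch (the toy graph at the end is only for checking the definitions):

```python
from collections import defaultdict

def adjacency(edges):
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj

def components(n, edges):
    """Number of connected components (the multiplicity of the null
    eigenvalue of the graph Laplacian)."""
    adj = adjacency(edges)
    seen, count = set(), 0
    for v in range(n):
        if v not in seen:
            count += 1
            stack = [v]
            while stack:
                u = stack.pop()
                if u not in seen:
                    seen.add(u)
                    stack.extend(adj[u] - seen)
    return count

def global_clustering(edges):
    """C = 3 * (number of triangles) / (number of connected triples)."""
    adj = adjacency(edges)
    triangles = sum(len(adj[i] & adj[j]) for i, j in edges) // 3
    triples = sum(d * (d - 1) // 2 for d in (len(adj[v]) for v in adj))
    return 3.0 * triangles / triples if triples else 0.0

# Toy check: a triangle 0-1-2 with pendant edge 2-3 and isolated node 4.
E = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(components(5, E), global_clustering(E))
```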
According to Fig. 15(a), the GGP matches the preferential attachment model and performs slightly worse than the Erdős-Rényi random graph in terms of the number of connected components. Figs 15(b) and 15(c) highlight that the clustering structure of the GGP does not vary much with σ. The clustering coefficient is in line with the two benchmarks, whereas the assortativity of the Erdős-Rényi model is not attained. For σ = 0.5, 0.8, the GGP exhibits a lower share of nodes in the core (Fig. 15(d)) than the benchmarks, and it mimics the preferential attachment model for σ = 0.
Overall, the GGP can replicate typical behaviours of real world sparse networks and some fundamental features of random graphs generated from the preferential attachment model, making it suitable for a variety of applications in different fields.
We are very pleased to thank the authors for their work.

I. Castillo and T. Rebafka (Université Pierre et Marie Curie-Paris 6)
The random-graph model proposed by Caron and Fox has some remarkable properties, among others the possibility to combine sparsity and exchangeability and the scalability of computations within the associated Bayesian framework. In our view, their work raises some interesting questions on statistical inference. Adopting a frequentist approach, what can be said about posterior convergence assuming the data have been generated under a 'true' parameter?
Consider the behaviour of Π[·|Z] under two frequentist settings: (a) well specified, where Z ∼ P_0 = P_{w_0} for an unknown fixed w_0 ∈ W, and (b) possibly misspecified, where Z ∼ P_0 = Q for an arbitrary graph distribution Q.
In a simulation study, we considered estimation of two simple functionals: the edge density ψ_1 and the density of triangles ψ_2. We used the default code under the generalized gamma process with improper priors on hyperparameters. One reason to consider setting (b) is the specific exponential form of the link function considered in the paper, which may not hold for the data. Suppose for instance that Z has actually been generated from a stochastic block model with two groups, equal proportions and connectivity matrix α = (0.8, 0.1; 0.1, 0.8). Simulations suggest that the posterior is consistent for ψ_1, but inconsistent for ψ_2.
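A sketch of the misspecified data-generating process in setting (b): simulate the two-group stochastic block model with the connectivity matrix above and compute the empirical edge density ψ_1 (node assignment by alternation is an illustrative stand-in for equal proportions):

```python
import random

def sample_sbm(n, P, seed=0):
    """Two-group stochastic block model; node i is assigned to group
    i % 2 (an illustrative stand-in for equal proportions)."""
    rng = random.Random(seed)
    z = [i % 2 for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < P[z[i]][z[j]]:
                edges.add((i, j))
    return edges

n = 200
P = [[0.8, 0.1], [0.1, 0.8]]           # the connectivity matrix above
edges = sample_sbm(n, P)
psi1 = 2 * len(edges) / (n * (n - 1))  # empirical edge density
print(psi1)  # concentrates near (0.8 + 0.1) / 2 = 0.45
```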
In setting (a), we considered two cases. Case 1 is an equal-proportions stochastic block model with connectivity matrix compatible with the exponential link function. Bayesian and frequentist behaviours of ψ_1 and ψ_2 are remarkably close and rapidly converging (Table 3), suggesting that a Bernstein-von Mises theorem holds. Case 2 is a graph with 'correctly specified' link function and w a sample from the Cauchy distribution. The posterior still estimates the functionals ψ_1 and ψ_2 well but seems to underestimate large values of w_{0,i} (Fig. 16). To study sparsity, we repeated the two previous experiments but replaced w_0 by ρ_n w_0, where ρ_n → 0. We noted that the posterior on σ was concentrated on negative values, which suggests that σ may not universally quantify sparsity.
It would be interesting to determine which aspects of w (e.g. real-valued functionals, or the complete vector w) can be estimated at minimax rate by using priors as in the paper. Another question would be to extend the model and priors to cover arbitrary link functions, and possibly more general forms of the law of w.

Suman Chakraborty (University of North Carolina at Chapel Hill)
The model proposed in this paper is aimed at providing a unified framework to incorporate desirable properties observed in real world networks, such as exchangeability and sparsity. The model is defined for a graph with countably infinitely many vertices. A typical node i is codified via a point θ_i, where i is a natural number and θ_i lies on the non-negative half of the real line and can be seen as the time that node i entered the network. Finally the network is represented as the point process with z_ij = 1 indicating that there is an edge between nodes i and j. Interestingly, Kallenberg's theory of exchangeable completely random measures is devised on the appearance times of the nodes; formally, the following distributional equality holds: (Z(A_i × A_j))_{i,j} =_d (Z(A_{π(i)} × A_{π̃(j)}))_{i,j} for all natural numbers i and j, where π and π̃ are arbitrary permutations of the natural numbers (π = π̃ for joint exchangeability) and A_i = [h(i − 1), hi] for any h > 0. Finally, to introduce node inhomogeneity, a set of parameters {w_i} is introduced and particular link probabilities are assumed.
It seems that theorems 2 and 3 provide the main advantage of this modelling approach: roughly, they enable us to produce graphs with the desired level of sparsity by tuning a single parameter. This property can be appealing for applications. The sparsity parameter might also be useful for obtaining scalings of important functions of the graph, such as the number of triangles. The projectivity is also a salient feature of the model, as it can be used to draw inference on the unobserved part of the population. It would be interesting to analyse how much information is necessary to make valid inference about the whole network.
The form of exchangeability of the model is cleverly devised and enabled the authors to use known representations of exchangeable random measures, which is a crucial component of the paper. It would be helpful for the reader to have a few more comments from the authors on the statistical interpretation and scope of the notion of exchangeability used here. Finally, as suggested by Fig. 8(a), the power-law-like tail behaviour of the empirical degree distribution, the projectivity and the tunable sparsity parameter make the model attractive for both theoretical and practical applications.

Harry Crane (Rutgers University, Piscataway)
Caron and Fox tout their proposal as 'the first fully generative and projective approach to sparse graph modelling [...] with a notion of exchangeability that is essential for devising our scalable statistical estimation procedure' (at the end of Section 4; emphasis added). In calling theirs the first such approach, they brush aside the prior work of Barabási and Albert (1999), whose model is also generative and projective, and produces sparse graphs. The Barabási-Albert model is not exchangeable, but nor is the authors'. And, although the Barabási-Albert model is inadequate for most statistical purposes, the model proposed is not obviously superior, especially with respect to the criteria highlighted above.

Generative
Though the model is amenable to simulation, the obscure connection between Kallenberg's theory of exchangeable completely random measures and the manner in which real networks form makes it difficult to glean much practical insight from it. At least Barabási and Albert's preferential attachment mechanism offers a clear explanation for how sparsity and power laws might arise in nature. I elicit no such clarity from the Caron-Fox model.

Projective
Projectivity is important for relating the observed network to the unobserved population and is therefore crucial in applications for which inferences extend beyond the sample. Without a credible sampling interpretation, however, the statistical salience of projectivity is moot. Here projectivity involves restricting point processes in R_+^2 to bounded rectangles [0, α]², whose best-known interpretation via p-sampling (Veitch and Roy, 2016) seems unnatural for most conceivable network applications, including those in Section 8.

Exchangeability and sparsity
A certain nonchalance about whether and how this model actually models real networks betrays an attitude that sees sparsity as an end in itself and exchangeability as a means to that end. Even the authors acknowledge that 'exchangeability of the point process [...] does not imply exchangeability of the associated adjacency matrix' (on the third page). So why all the fuss about exchangeability if its primary role here is purely utilitarian? To me, the pervasiveness of 'exchangeability' throughout the paper is but a head fake for unsuspecting statisticians who, unlike many modern machine learners, understand that exchangeability is far more than a computational contrivance.

Final comment
The Society's Research Section is all too familiar with the Crane-Dempsey edge exchangeable framework, which meets the above four criteria while staying true to its intended application of interaction networks. For brevity I refer the reader to Crane and Dempsey (2015, 2016) for further discussion.

Daniele Durante (University of Padova)
There is increasing interest in modelling network data under a Bayesian non-parametric perspective, and the paper from Caron and Fox provides an appealing contribution in this direction. In fact, as they effectively demonstrate, many Bayesian non-parametric representations not explicitly motivated by network data can indeed provide powerful methods for studying relevant patterns underlying real world connectivity structures.
Consistent with this focus, the growing demand for more flexible and realistic statistical models for networks is a main motivation underlying the stochastic point process representation of Caron and Fox, implying less restrictive exchangeability assumptions. Their model is carefully parameterized to provide flexible inference on sparsity and the degree distribution, which represent the quantities of main interest. However, a natural question when combining Bayesian non-parametric representations and complex data is whether the resulting Bayesian formulation induces large support priors for a wider set of expected network features-e.g. clustering coefficient and average shortest path length. These are relevant measures in characterizing other realistic properties, such as small world architectures, and an improved characterization of prior support may provide insights on which properties the model can flexibly incorporate.
Looking at this problem from a broader perspective, Caron and Fox consider settings in which a single network is observed. In such cases the information is clearly not sufficient to learn highly flexible representations, and although restricting the focus to a specific class of distributions-by incorporating reasonable structure-is a clever approach, it does not protect from misspecification issues. For instance the novel exchangeability assumption proposed by the authors effectively improves flexibility in characterizing sparsity but could substantially restrict the support on other relevant classes of distributions. Therefore, looking at the model from an applied perspective, it is fundamental to understand the extent to which their Bayesian non-parametric formulation can be viewed as truly non-parametric. Indeed, without prior support about the true distribution of the random graph, the posterior can never concentrate around the truth, thereby affecting quality of inference.
A possible answer to this issue is to generalize the model from Caron and Fox to obtain a fully flexible characterization of the entire distribution for random graphs. Motivated by the availability of replicated networks, Durante et al. (2017) addressed this goal with a focus on latent space models, and I am excited to see how the inspiring contribution from Caron and Fox can open new avenues also in this direction.

Chao Gao (University of Chicago)
I congratulate Caron and Fox for their results that give a solid foundation for exchangeable network analysis. This paper has already inspired some important subsequent works (Veitch and Roy, 2015; Crane and Dempsey, 2016; Broderick and Cai, 2015). My discussion will focus on the potential influence on the development of methods and algorithms under this new framework.

Within-sample inference
Given a network of finite size D_α, the locations are labelled θ_{[1:n]} = (θ_1, ..., θ_n), with n ∼ Poisson(α). Consider the problem of estimating w_{[1:n]} = (w_1, ..., w_n) in a frequentist way. The conditional likelihood function p({n_ij}_{1≤i,j≤n} | θ_{[1:n]}, w_{[1:n]}, n) is the same likelihood function as under the node exchangeable framework (or its non-exchangeable sparse version with a scaling parameter). Another important problem is community detection. An analogous block model was proposed under the framework of exchangeable random measures. Again, given (θ_{[1:n]}, w_{[1:n]}, n), the conditional likelihood is the same as that for degree-corrected block models. The two examples show that, for within-sample inference, existing methods, algorithms and even theorems developed for degree-corrected block models can generally be applied to the new framework of Caron and Fox without change, at least from a frequentist perspective.

Out-of-sample inference
Given the discussion in the previous paragraph, I believe that the development of new methods and algorithms should be emphasized on out-of-sample inference. The exchangeability framework of Caron and Fox makes rigorous inference on the population or future observations possible for the first time in sparse networks, for which they should be congratulated again. After all, the models of Caron and Fox and later extensions are defined on the entire population, whereas the degree-corrected block model is defined only on a finite set of nodes.

Subgraph densities
Recently, it has been shown by Gao and Lafferty (2017) that statistics based on small subgraph frequencies can lead to optimal testing procedures between Erdős-Rényi models and stochastic block models.
Empirical studies also illustrate that structures of social networks can be reflected by statistics of small subgraph frequencies (Ugander et al., 2013). It is not very clear to me how analogous ideas can be developed in the framework of exchangeable random measures. For example, let T be the total number of triangles in a given network Z_α. To define the empirical triangle frequency, should T be normalized by (n choose 3), by (N_α choose 3) or by α³? What is a proper definition of the population triangle density? Inference for population subgraph densities is out of sample and thus deserves future investigation under the framework of Caron and Fox.
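The normalization question can be made concrete on a toy graph; in the sketch below, N_α and α are hypothetical values chosen only to show the three candidate normalizations side by side:

```python
from itertools import combinations

def triangle_count(nodes, edges):
    """Total number of triangles T in the observed graph."""
    eset = {frozenset(e) for e in edges}
    return sum(1 for tri in combinations(nodes, 3)
               if all(frozenset(p) in eset for p in combinations(tri, 2)))

def comb3(m):
    """The binomial coefficient 'm choose 3'."""
    return m * (m - 1) * (m - 2) / 6

# Toy graph: one triangle among n = 4 non-isolated nodes; hypothetical
# N_alpha = 6 atoms of the point process and window size alpha = 2.0.
nodes, edges = [0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)]
alpha, n_atoms = 2.0, 6
T = triangle_count(nodes, edges)
candidates = {"n choose 3": T / comb3(len(nodes)),
              "N_alpha choose 3": T / comb3(n_atoms),
              "alpha^3": T / alpha ** 3}
print(T, candidates)
```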

Jim E. Griffin and Fabrizio Leisen (University of Kent, Canterbury)
We congratulate Caron and Fox for this stimulating paper which proposes a new perspective on network models. The use of completely random measures (CRMs) provides a tractable and rich framework for modelling graphs. CRMs have been intensively studied in Bayesian non-parametrics, and this provides a ready-made toolkit for the development of novel and tractable network models. One example is the development of vectors of CRMs; see Leisen and Lijoi (2011), Griffin et al. (2013), Griffin and Leisen (2017) and the references therein. Section 3.2 of the paper illustrates the use of the methodology for modelling directed multigraphs. We believe that vectors of CRMs can be employed in this setting for modelling multiple types of interactions (such as 'message', 'SMS', 'like' and 'tag' as in the paper). This can be achieved by considering different but related multigraphs for each interaction. In the spirit of compound random measures (Griffin and Leisen, 2017), we model the kth interaction (from p possible interactions) through a CRM whose jumps rescale a common set of weights: the w_i s and the m_ki s are drawn from the directing Lévy measure and the score distribution h respectively. Extending the interpretation in the paper, the w_i s represent an overall sociability and the m_ki s adjust for differences in the levels of different types of interaction. This approach has been employed for modelling networks with overlapping communities. An attractive feature of this framework is that the parametric score distribution h can easily be extended, in standard ways, to include covariates or time variation. We believe that this paper will become a cornerstone of network modelling as well as a natural application of Bayesian non-parametric techniques.
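A minimal sketch of the compound-CRM idea for multiple interaction types, with Poisson multigraph counts by analogy with the paper's directed multigraph construction; the gamma score distribution and all parameter values are illustrative assumptions:

```python
import math
import random

def poisson(rng, lam):
    """Poisson sampler (Knuth's method; adequate for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def compound_multigraphs(w, p, h_sampler, seed=0):
    """For p interaction types: node i has overall sociability w_i;
    type-k scores m_ki ~ h rescale it, and the type-k directed
    multigraph has counts n_k[i][j] ~ Poisson(m_ki w_i * m_kj w_j)."""
    rng = random.Random(seed)
    n_nodes = len(w)
    m = [[h_sampler(rng) for _ in range(n_nodes)] for _ in range(p)]
    graphs = []
    for k in range(p):
        wk = [m[k][i] * w[i] for i in range(n_nodes)]
        graphs.append([[poisson(rng, wk[i] * wk[j])
                        for j in range(n_nodes)] for i in range(n_nodes)])
    return m, graphs

# Shared sociabilities; gamma-distributed scores for p = 3 types.
w = [0.5, 1.0, 2.0, 0.2]
m, graphs = compound_multigraphs(w, 3, lambda r: r.gammavariate(2.0, 0.5))
print(len(graphs), len(graphs[0]))
```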

Lancelot F. James (Hong Kong University of Science and Technology) and Creighton K. Heaukulani (Hong Kong)
We congratulate Caron and Fox on a paper that has generated considerable interest. This work has influenced our thought process on some projects we are working on (Bloem-Reddy et al., 2017), and this discussion reflects some of those elements. We ask one question and offer some comments. Generalizing Section 5.5 (see equation (31)), we can construct processes D*_{α,q}, for q = 1, 2, 3, ..., where (b_k)|(p_k) ∼ ind Bernoulli(p_k) and (p_k) are points of a Poisson process with intensity τ satisfying ∫_0^1 v τ(v) dv = 1. This implies that D*_{α,q} | W*_α is Poisson((W*_α)^q). Hence, for q = 1, we might associate D_{α,1} with a hierarchical Indian buffet process, and D_{α,2} equates with the marginal structure considered by the authors. In this regard, we refer to these processes as hierarchical q-latent feature models. A question to the authors is how might we interpret D_{α,q} for q ≥ 3? Now, by the Bayes rule, there is a conditional density of W*_α | D*_{α,q} = l, given by equation (79). When l > 0, this density also corresponds to that of W*_α conditioned on G_l^{1/q}/W*_α = 1, where G_l is a gamma(l, 1) variable. So, for q = 2, another contribution of the paper of Caron and Fox is to identify classes of unexplored Poisson-Kingman distributions that naturally appear in an interesting context. The appearance of PK(ρ, f_α^{(l,2)}) is a little more obvious in an earlier draft of the authors' work. A subtle point about this sampling scheme, inherent to latent feature models, is that, unlike the Bayesian non-parametric latent class setting where one samples n points, the sample size ql, dictated by D*_{α,q} = l, always agrees with the superscripts in PK(ρ, f_α^{(l,q)}). This explains the cancellations in equation (71). See also James (2002), which exhibits all the pertinent structures for q = 1.

Svante Janson (Uppsala University)
I find the random-graph model introduced here by Caron and Fox very interesting. Apart from its potential use in applications, it has novel and interesting mathematical properties. Moreover, it has been an inspiration for important generalizations developed after the first version of the present paper.
The relationship with Kallenberg's characterization of exchangeable random measures is interesting, and presumably useful in further developments of the theory, but I would like to stress that, for the content of the present paper, Kallenberg's highly technical theorem may serve as a (possibly important) inspiration for the model, but it is not needed for the formal construction of the model and the study of its properties.
Furthermore, the basic construction can be stated in several equivalent ways. I prefer to see the basic construction in the paper as follows, including later generalizations. Let (S, μ) be a σ-finite measure space, and let F(x, y) be a fixed symmetric measurable function S × S → [0, 1]. Generate a random countable set (w_i, θ_i)_{i=1}^∞ of points in S × R_+ by taking a Poisson point process in S × R_+ with intensity μ × λ. Regard the θ_i as (labels of) vertices, and add an edge θ_i θ_j with probability F(w_i, w_j), independently for all pairs (θ_i, θ_j) with i ≤ j. (Finally, eliminate isolated vertices.) The version in the present paper constructs (w_i, θ_i)_{i=1}^∞ by a completely random measure, which is equivalent to choosing S = R_+ with μ the Lévy measure; furthermore, F is chosen as F(x, y) = 1 − exp(−2xy) (for x ≠ y). Kallenberg's theorem yields the same random graphs by a canonical choice (S, μ) = (R_+, λ), but a different F; see Section 5.1. Other choices of F yield generalizations of the model. Other choices of (S, μ) yield the same random graphs but are sometimes useful, so it seems convenient to allow an arbitrary choice and not to fix it in advance. Finally, in connection with theorems 3 and 5, note that, if ∫_0^∞ w ρ(dw) < ∞, then N_α^(e)/E[N_α^(e)] → 1 almost surely as α → ∞. This follows easily because the loops can be ignored and, if Ñ_α^(e) denotes the number of non-loop edges and the edges are defined by the events U_ij ≤ F(w_i, w_j) for an independent and identically distributed array (U_ij)_{i≤j}, then Ñ_α^(e)/α² is a reverse martingale with respect to the σ-fields F_t generated by (w_i)_{i=1}^∞ ∪ (U_ij)_{i≤j} ∪ (θ_i 1_{θ_i > t})_{i=1}^∞.

Kuldeep Kumar (Bond University, Gold Coast)
One of the important contributions of this paper is the development of a scalable algorithm which enables inference of the graph parameters determining the sparsity of the graph. For a specific choice of completely random measure, the graphs can be tuned from dense to sparse on the basis of a single parameter. A problem which is inherent in these kinds of algorithms is the choice of the parameters (α, δ, γ). There have been many developments in machine learning algorithms, specifically in the area of tree decomposition, such as random forests and stochastic gradient boosting (TreeNet). Have Caron and Fox compared their algorithm with approaches that rely on tree decomposition, or with the pebble game algorithm as developed by Lee and Streinu (2008)?
Antonio Lijoi (Bocconi University, Milan), Ramsés H. Mena (Universidad Nacional Autónoma de México, Mexico City) and Igor Prünster (Bocconi University, Milan)
We congratulate Caron and Fox for proposing a clever construction of random graphs for networks, which enables us to achieve sparsity in an effective way. This represents another successful instance of Bayesian non-parametric modelling based on completely random measures (CRMs). Among several potential developments, of particular interest is the extension to a multisample context with data recorded from two or more networks. Assuming that the data are still conditionally independent but not identically distributed, a natural problem is the derivation of testing procedures to verify whether the probability distributions, or some of their features, are shared across samples. In the generalized gamma process (GGP) case, the parameter σ plays a pivotal role: it not only controls the sparsity of the graph, but it also influences posterior clustering of the data. Hence, homogeneity between two networks, directed by independent GGPs, may be assessed by testing for equality of their respective σ-parameters. Along these lines, a Bayesian test has been defined on the discount parameter of the Pitman–Yor process, which plays the same role as σ for the GGP.
Alternatively, network comparisons can be faced by assuming a richer model with dependent CRMs (W_{1,α}, W_{2,α}) accommodating a wide spectrum of dependence structures across networks, ranging from exchangeability (i.e. distributional homogeneity) to unconditional independence. In this framework, we may test whether the two distributions themselves are equal. Recently, unrelated to network applications, Bacallado et al. (2015) have addressed the issue within a parametric model and provided an insightful discussion on the notion of approximate exchangeability. A natural model that may serve the purpose is based on ideas in Rodríguez et al. (2008), where the nested Dirichlet process is introduced for clustering probability curves. Similarly, this approach may be useful for clustering networks on the basis of distributional similarity. Here, however, we need a model that can handle CRMs, and a potentially fruitful approach is proposed in Camerlenghi (2015). Let q̃ be a discrete random probability measure on the space M_{R_+} of boundedly finite measures on R_+, with q_0 the probability distribution of a CRM on R_+, and let W*_1 and W*_2 be drawn conditionally independently from q̃. Discreteness of q̃ implies that with positive probability W*_1 = W*_2, which in turn yields W_1 = W_2. This corresponds to similarity of the networks, as they have the same distribution.
Jorge Mateu (University Jaume I, Castellón) and Matthias Eckardt (Humboldt-Universität zu Berlin)
Caron and Fox are to be congratulated on a valuable contribution and thought-provoking paper on sparse random graphs which examines exchangeability for a continuous space representation of networks. In particular, they propose a new framework in which to study random-graph models, based on the notion of modelling the graph as an exchangeable random measure and appealing to a Kallenberg representation for such exchangeable random measures in connection with planar point processes. For a regularly shaped square lattice, the authors define a point process to be exchangeable if the number of edges counted per grid site is exchangeable for any arbitrary square lattice. This is a very timely topic, bridging random-graph models to the field of spatial point processes. Our discussion focuses on the linkage of random-graph structures to the spatial domain.
For spatial point processes, we consider a realization of a random sequence of locations on a complete separable metric space equipped with a suitable Borel σ-algebra, such that the number of locations that fall in a given area contained in the observation window can be expressed by means of a random measure. This counting measure can refer to a purely spatial point process or, in a more sophisticated setting, to a marked spatial point process, e.g. a multivariate spatial point pattern. For such data, one is interested in the structural exploration and the detection or extraction of the characteristics and features within and between distinct sets of events. Although a large body of literature on the analysis of spatial point patterns exists, applications of random-graph models remain very limited. One example of such random graphs for spatial point patterns is given by neighbouring networks, which include random structures in the calculation of planar point pattern characteristics (see, for example, Marchette (2004)). See also some recent approaches in Eckardt and Mateu (2017a, b, c).
We note that the concept of exchangeability is quite close to quadrat counting and the definition of complete spatial randomness in point processes. The authors comment on counting the number of edges per grid site, whether they are directed or not (see Section 3.2). We have several versions of counting or sampling in point processes that do have a clear counterpart in the corresponding edge counting process in graphs. Network data are becoming increasingly available and connections between graph models and planar point processes are a welcome line of research, together with extensions to spatiotemporal planar processes.
Matthew Moores and David Firth (University of Warwick, Coventry, and Alan Turing Institute, London)
Caron and Fox have achieved a major breakthrough in computationally tractable inference for random graphs with hundreds of thousands of nodes. This will enable simulation and Bayesian analysis of data sets that were previously infeasible. The scalability of their approach is due to the representation of the discrete graph structure using a latent continuous model, the generalized gamma process. The sociability parameters w_i of each node can be estimated by using Hamiltonian Monte Carlo sampling, since the gradient of the conditional log-posterior is available in closed form. Further improvements in scalability might be achieved by taking advantage of parallelism, as well as by rewriting portions of the code in a compiled language.
The pair potentials Pr(z_{ij} = 1 | w_i, w_j) = Pr(n_{ij} + n_{ji} > 0 | w_i, w_j) assume independence between in-degree and out-degree, even in the case of directed multigraphs. This implies a very different generative process from models of adversarial networks, where nodes compete with each other for edges. Examples include the Bradley–Terry model of citation networks (Varin et al., 2016) and the Plackett–Luce model of ranking data (Gormley and Murphy, 2006), of which a non-parametric version has recently been proposed. The contour plot in Fig. 17 compares the distribution of w_i for the World Wide Web data set with the PageRank of each node (Brin and Page, 1998). PageRank was computed by using the igraph package (Csárdi and Nepusz, 2006; Kolaczyk and Csárdi, 2014) for R (R Core Team, 2017). The difference in emphasis between these two approaches might explain why the tail of the degree distribution was underestimated for this data set, owing to violation of the model assumptions.
Nevertheless, the authors have demonstrated close agreement between the posterior predictive and the empirical degree distributions of a variety of real world graphs, both dense and sparse. A curious aspect of the generalized gamma process is the parameter w_* = W*_α − Σ_{i=1}^{N_α} w_i, which represents the left-over sociability once all of the nodes with degree 1 or greater have been accounted for. This parameter is only partially identified, since it conflates degree 0 nodes with the potentially infinite number of nodes that might enter the network at a future time. Nodes with degree 0 do not contribute to the likelihood in any way, but it seems feasible to sample sociability parameters for these nodes from the prior, as with finite network models (Caron and Doucet, 2012). The parameter w_* could be viewed as an upper bound on the sociability of these nodes, particularly in the case where the network is assumed to be a closed system.

Yang Ni and Peter Müller (University of Texas at Austin)
We congratulate Caron and Fox on a very interesting paper. Our discussion highlights a particular use of the proposed models that we felt was missing in the paper. Implicit in the paper is an assumption that (part of) the random graph D (or a derived undirected graph Z) is observable. Although this is common for social network data, it is less common in biomedical inference where the goal is often to infer an unknown latent network structure.
The typical inference is set up under a hierarchical model where p(β|D) maps the graph to the parameters β (Caron and Fox have already used up all other Greek letters) of the top level sampling model for the observed data y. This could be, for example, a Gaussian graphical model for protein activation y_i; we discuss another example below. We suggest the use of the proposed novel models for p(D|φ). Good prior regularization is more important in this context than in applications where the network is observed. We illustrate our suggestion with a small simulation study and an application. Both are based on directed cyclical graphs (Ni et al., 2016), a special case of reciprocal graphical models (Koster, 1996). The directed cyclic graph allows inference on a directed graph G, possibly including cycles, by setting up a simultaneous equation model and interpreting a directed edge (l, i) in the graph G as an indicator for a non-zero coefficient of y_l in the equation for y_i. In this context we explore the use of a generalized gamma process (GGP) prior p(D|φ), including a mapping of a multigraph D to a directed graph G by mapping n_{ij} → I(n_{ij} > 0). Table 4 reports summaries for a simulation study with four alternative priors p(D|φ). The GGP wins. Fig. 18 shows the estimated graphs in inference for a gene network reported in Ni et al. (2016) under the sparsity prior used there (thresholded prior) versus the new GGP prior. Also under the realistic conditions of this data analysis the choice of prior matters. Importantly, implementation of posterior inference was quite straightforward, as described in the paper.

Peter Orbanz (Columbia University, New York)
Fig. 18. Estimated gene networks for TCGA colon cancer data under (a) the GGP prior and (b) the thresholded prior (Ni et al., 2016): the posterior expected false discovery rate is controlled to be less than 10% for both estimates.

Caron and Fox have set out to address misspecification problems of graphon models. Are they solved? Exchangeable graphs are dense and 'amorphous' collections of edges. Caron and Fox's non-compact modification makes them sparse, but still amorphous, and I would hence argue that it addresses one symptom of a deeper problem. Nonetheless, a complete solution would be much to ask for, and I believe that Caron and Fox have made a very valuable contribution. The vague notion that these graphs lack structure can be made more precise (Orbanz, 2017). Start with a fixed, finite graph g_k of size k, invent a randomized algorithm that generates a random subgraph S_n(k) of size n and take the distributional limit S_n in input size k → ∞. If the algorithm selects vertices independently and reports the induced subgraph, S_n is exchangeable, and the resulting random graphs are those represented by graphons. More generally, consider a randomized subsampling algorithm, and ask what set of transformations leaves the output distribution invariant. This relates the distributional invariance to experimental design for networks (e.g. Kolaczyk (2009)). Invariance, in turn, explains inference from a single realization: if only one network is observed, and modelled as a random graph, it constitutes a sample of size 1. Inference is hence generally ill posed, but invariance can explain averaging within a single graph. One can obtain laws of large numbers (Orbanz, 2017). Roughly: if the subsampling algorithm generates output invariant under a nice transformation group G, there is an enumeration φ_1, φ_2, … of the elements of G such that the averages n^{-1} Σ_{i=1}^n f(φ_i x) converge almost surely for any L_1-function f.
From this perspective, misspecification of exchangeable graphs is a consequence of selecting vertices independently of the edge set. To obtain more structure, sampling can 'follow edges' (e.g. extract neighbourhoods), but finding concisely represented models then becomes much harder. The subsampling implicit in Caron and Fox's model is site percolation (Veitch and Roy, 2016): each vertex is selected or deselected independently with fixed probability, which can yield sparsity, but still selects vertices ignoring edges.
Though it remains to be seen how widely applicable the proposed models are in actual network analysis problems, I see the merit of this work elsewhere: Veitch and Roy (2016) is part of a beautiful body of work (see also Janson (2016)) expanding Caron and Fox's non-compact modification into a fully fledged generalization of graphon theory. Caron and Fox have introduced an idea that has enriched the landscape of network theory, and I wholeheartedly congratulate them.

Konstantina Palla and Xenia Miscouridou (University of Oxford)
We congratulate Caron and Fox for this excellent work on sparse graphs. It is of great interest to the whole community in statistical machine learning (and beyond)!
The authors have proposed a novel statistical model for networks that builds on exchangeable random measures. Their construction accounts for sparsity, an extremely important property in real world scenarios. In a graph context, a graph is defined as dense if the number of edges grows quadratically with the number of nodes, and sparse if it scales subquadratically. Then, in the sparse regime, two nodes chosen at random are unlikely to be connected. Whereas many real world networks are believed to be sparse, most of the popular Bayesian models used in network analysis account for dense graphs, i.e. the number of edges grows quadratically in the number of nodes. The fundamental reason for this misspecification is the classical representation of the graph as a random exchangeable array, i.e. the adjacency matrix. Exchangeability in the graphs domain has historically been defined as invariance of the distribution under permutation of the order in which the nodes appear, i.e. relabelling the nodes does not change the distribution of the graph, and is known as vertex exchangeability. However, as a corollary of the Aldous–Hoover theorem, exchangeable random arrays are dense or empty and thus not appropriate for most real applications. In an attempt to account for sparse graphs, several models have been proposed but with undesirable properties. These are models that give up either exchangeability or projectivity (Bollobás and O'Riordan, 2009; Borgs et al., 2014). In this excellent work, the authors propose a model that represents graphs as infinite point processes on R_+^2, giving rise to a class of random graphs ranging from dense to sparse, as tuned by a single parameter. For the associated notion of exchangeability of point processes, Kallenberg provides, in chapter 9 of his monograph, a representation theorem as the continuous space counterpart of the Aldous–Hoover theorem, enabling a strong theoretical foundation.
The work by François Caron and Emily Fox has stimulated exciting research: the framework is being used as the building block for a series of interesting extensions. One example is incorporating structure in the network in terms of community affiliations of the nodes, or accounting for node attributes. Also of interest is the dynamic version of the model, where the node parameters might evolve over time (Palla et al., 2016).

Daniel M. Roy and Victor Veitch (University of Toronto)
Caron and Fox have made several fundamental contributions, leading to significant progress in our understanding of statistical network modelling. In particular, their work has inspired the development of a complete generalization of the dense exchangeable (graphon) framework to a framework that can also capture sparse graphs (Veitch and Roy, 2016; Janson, 2017).
The new framework makes use of Caron and Fox's identification of random graphs and random measures, and the associated notion of exchangeability. An application of Kallenberg's representation theorem for exchangeable measures on the plane implies that the distribution of every extremal sparse exchangeable graph admits a simple representation in terms of a structure we call a graphex, by analogy with the graphon underlying dense exchangeable graphs.
The essential character of the dense exchangeable model is preserved: edges are formed independently of one another, conditionally on latent variables associated with each vertex. Whereas graphons are associated with probability spaces (and independent and identically distributed sequences of latent representations), graphexes are associated with σ-finite measure spaces (and Poisson point processes of latent representations). Dense (graphon) models can be seen to correspond to graphexes where the measure space is finite.
Another key difference is the appropriate notion of size: nested dense graph sequences can be indexed by the integer number of vertices, whereas the number of vertices in nested sparse graph sequences grows by random increments.
The analysis by Caron and Fox and in follow-up work by other researchers shows that the sparse exchangeable model can capture a wide range of the empirical phenomena observed in real world networks but outside the purview of the dense exchangeable model. As dense graphon models are in routine use in graph limit theory, statistical network analysis and machine learning, sparse exchangeable models yield novel extensions and generalizations of practical interest. Further, the completely random-measure approach introduced by Caron and Fox gives a practical method for defining sparse exchangeable models, which has already been used to extend important model classes.

Victor Veitch and Daniel M. Roy (University of Toronto)
Caron and Fox identify random graphs with random measures and use this construction to define sparse exchangeability. This notion of exchangeability is central both to their model and to the generalizations it has inspired. A different perspective on the random-measure construction and the associated notion of sparse exchangeability helps to clarify the strengths and limitations of the new modelling approach, and its relationship to exchangeable dense graphs.
Probabilistic symmetries are closely connected to sampling design (Orbanz, 2017). Sparse exchangeability can be understood in terms of p-sampling, a subgraph sampling scheme whereby one produces a subgraph from a larger graph by including each vertex independently with probability p > 0, and dropping any isolated vertices in the induced subgraph (Veitch and Roy, 2016). For every β > α > 0, if ξ_β is a size-β graph generated by a graphex W, then an (α/β)-sampling of ξ_β is equal in distribution to a size-α graph ξ_α generated by W. By contrast, for dense exchangeable graphs, if G_n is an n-vertex random graph generated by a graphon W, then a random subgraph induced by sampling k vertices at random from G_n is equal in distribution to a k-vertex graph generated by W.
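The p-sampling scheme described above is simple enough to state as code. The following is a minimal sketch, representing a graph by its edge list so that dropping isolated vertices happens implicitly; the function name p_sample and the seeded generator are illustrative choices, not part of Veitch and Roy's formalism.

```python
import random

def p_sample(edges, p, seed=0):
    """p-sampling: keep each vertex of the graph independently with
    probability p, take the induced subgraph, and drop any vertex left
    isolated (returning only the surviving edges does this implicitly)."""
    rng = random.Random(seed)
    vertices = {v for e in edges for v in e}
    kept = {v for v in sorted(vertices) if rng.random() < p}
    return [(u, v) for (u, v) in edges if u in kept and v in kept]

# A small 5-cycle with a chord; a p-sampling is an induced subgraph of it.
g = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
sub = p_sample(g, p=0.6)
```

The two boundary cases behave as expected: p = 1 returns the graph itself, and p = 0 returns the empty graph.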
The sampling perspective highlights a key distinction between the sparse and dense exchangeable models: the former excludes isolated vertices. This is often not a problem in practice, so the flexibility of the Caron-Fox model and its descendants seems to come at little cost.
Exchangeability has considerable utility: p-sampling yields a graph analogue of the empirical measure, which has been shown to be a consistent non-parametric estimator for the graphex parameter underlying the sparse exchangeable graphs. This estimator is a powerful tool, and its construction relies critically on sparse exchangeability.
The sampling perspective also highlights a weakness of the sparse exchangeable approach. Independent vertex sampling schemes, including p-sampling, are rarely good models for the data collection process underlying real world network data. The sparse exchangeable models thus retain a key pathology of the dense exchangeable models and, accordingly, seem unlikely to resolve fully the practical problems that motivated Caron and Fox. However, the dense exchangeable models have proven to be very useful in practice, so there is reason to believe that the applicability of the sparse exchangeable model goes beyond its literal remit. The sparse exchangeable model constitutes a significant advancement, and we expect that its development will point the way towards further novel attacks on the core problems.

Priyantha Wijayatunga (Umeå University)
Sometimes we need to construct network or graph models for systems under study, e.g. for network representations of stock markets (Boginski et al., 2005) and for functional regionwise connectivity of the human brain (Simpson et al., 2013). Then edges of the graph among vertices that are subcomponents of the system, such as returns of stocks and regions of the brain, are inferred by using statistical dependence measures: Pearson's correlation, which is applicable to linear associations, mutual information, which can be used for some non-linear associations, and so on. Each edge can be attached a numerical weight that is determined by the strength of association between the two vertices. For subsequent inferential tasks the use of accurate dependence measures can be important. Therefore, some researchers propose to use regression functions in place of dependence measures, mainly because of difficulties in measuring non-linear relationships accurately (Song et al., 2012). For measuring any dependence between discrete random variables, say X and Y, the so-called Cramér's V or Tchuprow's T, or the generalized correlation coefficient proposed in Wijayatunga (2016), can be used. Wijayatunga's coefficient is defined (following Pearson's correlation coefficient) as a metric-M-based measure of dependence, where p is the joint probability distribution of X and Y and p_I is that obtained when their independence is assumed, P_X denotes the set of all joint probability distributions, each representing a maximal dependence while preserving the marginal distribution of X (and similarly for P_Y), |A| is the cardinality of set A and M(p, q) is a distance metric between two probability distributions p and q, e.g. the Hellinger distance M(p_I, p) = (½ Σ_{x,y} [√p(x, y) − √{p(x)p(y)}]²)^{1/2} or similar. Note that this is a normalized metric-M-based distance between the dependence and the independence, represented by p(x, y) and p(x)p(y) respectively.
If there are many maximal dependences (when the respective marginal is fixed), i.e. the cardinalities of P_X or P_Y are large, then one can eliminate some of them subjectively. Ideally these cardinalities are 1. The measure can be generalized to continuous variables; with the Hellinger distance the required normalizing constant is then 1, so one can use the measure ρ_H(X, Y), similar to that proposed in Granger et al. (2004), where f represents the respective probability density functions. In these cases the aim is to measure the strengths of dependence accurately; it is therefore interesting to see the differences between network models constructed from such measures and those obtained from traditional linear measures of dependence.
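The Hellinger distance between the joint distribution and the product of its marginals, the core ingredient of the measure above, is easy to compute for discrete variables. The sketch below implements only that unnormalized distance (the normalization over maximal dependences in P_X and P_Y is omitted); the function name is illustrative.

```python
import numpy as np

def hellinger_dependence(p_xy):
    """Hellinger distance between a discrete joint pmf p(x, y), given as
    a 2-D array, and the product of its marginals p(x)p(y):
    ( (1/2) * sum_{x,y} [sqrt(p(x,y)) - sqrt(p(x)p(y))]^2 )^{1/2}."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_ind = p_xy.sum(axis=1, keepdims=True) * p_xy.sum(axis=0, keepdims=True)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p_xy) - np.sqrt(p_ind)) ** 2)))

# Independence gives distance 0; a perfectly dependent joint gives a
# strictly positive value bounded by 1.
p_indep = np.outer([0.3, 0.7], [0.5, 0.5])   # p(x, y) = p(x) p(y)
p_diag = np.array([[0.5, 0.0], [0.0, 0.5]])  # X = Y with probability 1
```

Replacing the Hellinger distance with another metric M(p, q) between distributions recovers the general metric-M-based construction.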

Mingyuan Zhou (University of Texas at Austin)
I congratulate Professor Caron and Dr Fox for a well-written paper that establishes a novel statistical network modelling framework, which uses completely random measures to model graphs with various levels of sparsity. Although it has been made clear in the paper that the characteristics of the underlying completely random measure, e.g. the discount parameter σ < 1 of the generalized gamma process, play a crucial role in determining the levels of sparsity of the generated graphs, I should like to call attention to the potentially important role played by the link function f(x) = 1 − exp(−x) in generating unweighted undirected sparse graphs with z_{ij} | w_i, w_j ∼ Bernoulli{f(2 w_i w_j)} for i ≠ j. With this link function, the contribution of z_{ij} to the negative log-likelihood of the model can be expressed as −z_{ij} ln{1 − exp(−2 w_i w_j)} + 2(1 − z_{ij}) w_i w_j, which quickly explodes towards ∞ as the product w_i w_j approaches 0 when z_{ij} = 1. Thus the choice of this link function implies an inductive bias towards fitting non-zero edges z_{ij} = 1, while not strongly penalizing zero edges z_{ij} = 0 even if their corresponding products w_i w_j are large. The same link function, which is referred to as the Bernoulli–Poisson link, has also been used in Zhou (2015), which constructs non-parametric Bayesian network models for overlapping community detection and missing link prediction, allowing the computation to scale linearly with the number of edges, rather than quadratically with the number of nodes. It would be of interest to articulate the role of this specific link function in supporting sparse graphs under the proposed framework.
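The asymmetry that Zhou describes can be checked directly from the two branches of the negative log-likelihood term. The sketch below evaluates that single-edge contribution; the helper name bp_negloglik is illustrative.

```python
import numpy as np

def bp_negloglik(z, w_i, w_j):
    """Contribution of one edge variable z_ij to the negative
    log-likelihood under z_ij ~ Bernoulli{1 - exp(-2 w_i w_j)}:
    -z ln{1 - exp(-2 w_i w_j)} + 2 (1 - z) w_i w_j."""
    x = 2.0 * w_i * w_j
    if z == 1:
        return -np.log1p(-np.exp(-x))  # explodes as w_i w_j -> 0
    return x                            # only a linear penalty in w_i w_j

# A present edge between low-sociability nodes is penalized far more
# heavily than one between high-sociability nodes...
assert bp_negloglik(1, 1e-4, 1e-4) > bp_negloglik(1, 1.0, 1.0)
# ...whereas an absent edge costs just 2 w_i w_j, however large the product.
assert bp_negloglik(0, 2.0, 2.0) == 8.0
```

The first branch, −ln{1 − exp(−x)}, diverges as x → 0, while the second grows only linearly in x, which is exactly the inductive bias towards fitting non-zero edges that the comment highlights.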
In addition to controlling for sparse graphs how the number of edges increases as a function of the number of nodes, another topic worth further investigation is how to introduce structured sparsity patterns to the graph adjacency matrices, including modelling dense on-diagonal but sparse off-diagonal blocks, sparse on-diagonal but dense off-diagonal blocks or a mixture of both. It would be interesting to find out whether the new notion of exchangeability could be maintained while achieving these network modelling goals.
Besides the above work on more structured models for simple graphs, a few discussants mentioned other possible extensions. Durante, and Lijoi, Mena and Prünster suggest models for collections of networks, building on dependent completely random measures. James and Heaukulani suggest an extension of the model to deal with hypergraphs, building on the Indian buffet process. Palla and Miscouridou consider extensions to dynamic networks. Finally, Bharath and Rubin-Delanchy suggest the use of inhomogeneous completely random measures to deal with covariates. These ideas represent interesting directions to explore and demonstrate how our framework provides a foundation on which it is possible to expand in many ways.

Sampling and projectivity
Many discussants commented on the projectivity properties and the sampling mechanism associated with our network process. In our setting, each node i is associated with an arrival time θ_i, and the model is invariant to any measure preserving transformation of time. Our construction is therefore exchangeable in this sense. For some α > 0, we consider only interactions between nodes that arrived before time α; therefore, α tunes the sample size of the graph.
The notion of sampling that is naturally associated with this class of graph processes has been defined in the follow-up work of Veitch and Roy (2016). As recalled in the comments of Veitch and Roy, and Orbanz: for a given graph, sample a subgraph by choosing each vertex with probability p ∈ (0, 1); then discard the isolated nodes. If we consider an exchangeable graph of size α, then p-sampling gives a subgraph with the same distribution, but of size pα. As mentioned by Bharath, Veitch and Roy, Orbanz and Crane, this sampling scheme highlights a structural weakness of the exchangeable graph process approach: independently sampling vertices may appear unrealistic as a network sampling strategy in some situations.
Nonetheless, as recalled by Veitch and Roy, the (dense) graphon models share the same limitations but have proven to be extremely useful in a large number of applications. An important advantage of the approach, which we emphasize in this paper, is that the exchangeability property enables the design of scalable inference algorithms. We believe that this is a key aspect for the method to be widely applied. (a) Although models that capture community structure exist, they are not straightforward to define and do not fit in a unified framework. Therefore, as mentioned by Crane, this has limited their use as statistical models. (b) In the (non-exchangeable) BA model, one needs to know the ordering of the nodes to make inference on the parameters of the model. If this ordering is not known, it needs to be imputed as a latent variable, severely limiting the scalability of the resulting algorithm. Because of the exchangeability property of our approach, we avoid this issue. The labelling of the nodes is irrelevant in our approach, and we can additionally make use of the hierarchical structure to design scalable algorithms, as demonstrated in Section 7.
Another related line of work that was mentioned by several discussants (Bharath, Crane, Campbell and Broderick, and Bloem-Reddy) is that of edge exchangeable models (Crane and Dempsey, 2015, 2016; Cai et al., 2016; Williamson, 2016; Janson, 2017a). This line of work, which first appeared on arXiv (Crane and Dempsey, 2015) nearly 2 years after our initial arXiv posting, considers a different notion of exchangeability for 'interaction data' defining multi(hyper)graphs (Crane and Dempsey, 2016); as with our method, this approach can also produce sparse graphs.
We believe that both approaches have merit, and that neither subsumes nor replaces the other. Both serve as important building blocks for statistical network models. As has been noted by several researchers (Cai et al., 2016; Janson, 2017a; Williamson, 2016) and in the discussion of Bloem-Reddy, there are some similarities between the two approaches, especially as they relate to our construction of directed multigraphs in Section 3.2. Although we present a construction for multigraphs, the focus of our paper is on simple graphs. However, the general framework still provides a notion of exchangeability that applies to both multigraphs and simple graphs. Interestingly, this is not so for edge exchangeable models: simple graphs are obtained from multigraphs by discarding multiple edges (Crane and Dempsey (2016), section 5.4), and through this process the edge exchangeability property is lost.
Alternative ideas for constructing non-projective sparse multigraphs are presented by Kumar. Kendall, and Mateu and Eckardt discuss other related approaches based on stochastic geometry. Mateu and Eckardt mention the similarities with geometric or neighbour graphs, which also use (spatial) point processes for random graphs. Finally, Kendall discusses the class of scale invariant random spatial networks, which has a different notion of (spatial) invariance and uses (improper) Poisson line processes. Though these constructions are quite different from our own, they offer another illustration of the usefulness of stochastic processes for building network models.

Further properties of the class of random graphs
In this paper, we focused on the sparsity of the resulting networks defined by equations (6)-(8) when ρ has finite mean. The sparsity properties are not tied to the specific choice of the link function (see the comment by Zhou), and follow-up work has studied the sparsity properties of general Kallenberg graphs. Veitch and Roy (2016) studied the expected number of edges and nodes for the general model, showing that sparsity is achieved whenever the function M is integrable with unbounded support; Caron and Rousseau (2017) provide strong laws for the asymptotic number of edges and nodes and degree distributions when M is integrable and satisfies some regular variation assumptions. Bianconi asks whether it is possible to achieve the extremely sparse regime in this framework, N_α^(e) = O(N_α) almost surely. Caron and Rousseau (2017) showed that it is possible to obtain rates of N_α^(e) = Θ{N_α ℓ*(N_α)}, where ℓ* is a slowly varying function with ℓ*(t) → ∞. If ρ has finite mean, or more generally if M is integrable, E[N_α] = o(α²); hence one cannot achieve the extremely sparse regime. It would be of interest to characterize other properties of the special class of graphs defined by equations (6)-(8), and more generally of graphs based on Kallenberg's representation. Casarin, Iacopini and Rossini investigate empirically the distribution of the assortativity coefficient and clustering coefficient (also discussed by Durante and Chakraborty) for the generalized gamma process model. Banerjee and Ghosal suggest investigating the characterization of the graph Laplacian and the limit distribution of the normalized number of edges. These represent some important possible directions of exploration.

Consistency and rates of convergence
Questions regarding the convergence of posterior distributions of the parameters, hyperparameters or functionals have been highlighted by a few discussants. Arbel suggests an approach to show posterior consistency for the sparsity parameter σ. Such results, in both well-specified and misspecified settings, would be useful, as they would justify the use of the method described in Section 8.1 as a sparsity test. Castillo and Rebafka also study empirically the asymptotic behaviour of the posterior distribution on σ in a misspecified scenario, when simulating from a sparse, but non-projective, model. They also investigate empirically the convergence of the posterior of the edge density and the density of triangles.
Gao (see also Rubin-Delanchy and Chakraborty) draws connections with the rich literature on frequentist estimation of the parameters of network models, such as the stochastic block model. It is not obvious to us, however, how these tools could be applied in the sparse regime: in this case, the number of isolated nodes with sociability w_i > 0 is infinite for any given α, which differs from the framework described by Gao, corresponding closely to the dense case. Let (w_i) be the sociabilities of the nodes such that θ_i < α and N_α the number of nodes with at least one edge. Our likelihood, for the directed multigraph, is given by

∏_{i,j} (w_i w_j)^{n_ij} exp(−w_i w_j) / n_ij! .

In contrast, the likelihood defined by Gao differs in the first term and corresponds to a model with a finite number of nodes. This also clarifies a question from Moores and Firth: the weights of isolated nodes with arrival time θ_i < α appear in the likelihood, and their sum is identifiable (but not the individual weights).
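The likelihood above is straightforward to evaluate numerically. The following sketch (the function name and vectorized form are ours, not from the paper) computes its logarithm given sociabilities and directed edge counts:

```python
import numpy as np
from scipy.special import gammaln

def directed_multigraph_loglik(w, n):
    """Log of prod_{i,j} (w_i w_j)^{n_ij} exp(-w_i w_j) / n_ij!.

    w : (N,) positive sociabilities, including isolated nodes with
        theta_i < alpha.
    n : (N, N) directed edge counts, n[i, j] ~ Poisson(w[i] * w[j]).
    """
    rate = np.outer(w, w)                   # w_i * w_j for each ordered pair
    return np.sum(n * np.log(rate) - rate - gammaln(n + 1))
```

Note that isolated nodes contribute only through the exp(−w_i w_j) terms, whose product involves (Σ_i w_i)²; this is why only the total weight of the isolated nodes is identifiable, not the individual weights.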
Although we provided some important analyses of the properties of our modelling framework, there are many other interesting aspects to study relating to the asymptotic properties of inference and estimation.

Computational aspects
This paper presented an MCMC algorithm with Hamiltonian Monte Carlo (HMC) and Metropolis-Hastings updates. The key step underlying our scheme is a data augmentation that allows a very simple expression for the gradient in the HMC update and tailored proposals for the hyperparameters. There is, however, plenty of room for improvement in the efficiency of our MCMC algorithm. Our HMC implementation is very simple, with a fixed number of leapfrog steps and a step size adapted during part of the burn-in period. More elaborate HMC strategies could be used here. As mentioned by Moores and Firth, and briefly discussed in Section 7.2 of our paper, the algorithm is very amenable to parallelism. Alternatives to HMC sampling have been implemented by Bouchard-Côté and Briercliffe, showing that additional gains could be obtained through such modifications as well. Alternatively, Campbell and Broderick suggest using variational Bayes methods instead of MCMC, applying a truncation scheme for the completely random measure. This also represents a promising alternative to our MCMC algorithm.
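To make the fixed-leapfrog scheme concrete, here is a generic HMC update in the style described above. This is a textbook sketch, not our exact augmented sampler: `log_post` and `grad_log_post` are stand-ins for the conditional log-posterior (e.g. of the log-sociabilities) and its gradient, and the step size `eps` and number of leapfrog steps `L` are held fixed.

```python
import numpy as np

def hmc_step(x, log_post, grad_log_post, eps, L, rng):
    """One HMC update with L leapfrog steps of size eps.

    x is the current state; log_post and grad_log_post are the target
    log-density and its gradient. Returns (new state, accepted flag)."""
    p = rng.standard_normal(x.shape)            # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_post(x_new)   # initial half step for momentum
    for i in range(L):
        x_new += eps * p_new                    # full step for position
        g = grad_log_post(x_new)
        p_new += (eps * g if i < L - 1          # full momentum step, except
                  else 0.5 * eps * g)           # a half step at the end
    # Metropolis accept/reject for the leapfrog proposal
    log_accept = (log_post(x_new) - log_post(x)
                  - 0.5 * (p_new @ p_new - p @ p))
    if np.log(rng.uniform()) < log_accept:
        return x_new, True
    return x, False
```

In practice, adapting `eps` during burn-in (as we do) or using more elaborate strategies such as varying the trajectory length would plug in at the `eps` and `L` arguments.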