MODELING SOCIAL NETWORKS FROM SAMPLED DATA 1

Network models are widely used to represent relational information among interacting units and the structural implications of these relations. Recently, social network studies have focused a great deal of attention on random graph models of networks whose nodes represent individual social actors and whose edges represent a specified relationship between the actors.

Most inference for social network models assumes that the presence or absence of all possible links is observed, that the information is completely reliable, and that there are no measurement (e.g., recording) errors. This is clearly not true in practice, as much network data is collected though sample surveys. In addition even if a census of a population is attempted, individuals and links between individuals are missed (i.e., do not appear in the recorded data).

In this paper we develop the conceptual and computational theory for inference based on sampled network information. We first review forms of network sampling designs used in practice. We consider inference from the likelihood framework, and develop a typology of network data that reflects their treatment within this frame. We then develop inference for social network models based on information from adaptive network designs.

We motivate and illustrate these ideas by analyzing the effect of link-tracing sampling designs on a collaboration network.

Key words and phrases: Exponential family random graph model, p* model, Markov chain Monte Carlo, design-based inference

1. Introduction

Networks are a useful device to represent “relational data,” that is, data with properties beyond the attributes of the individuals (nodes) involved. Relational data arise in many fields and network models are a natural approach to representing the patterns of the relations between nodes. Networks can be used to describe such diverse ideas as the behaviour of epidemics, the interconnectedness of corporate boards, and networks of genetic regulatory interactions. In social network applications, the nodes in a graph typically represent individuals, and the ties (edges) represent a specified relationship between individuals. Nodes can also be used to represent larger social units (groups, families, organizations), objects (airports, servers, locations) or abstract entities (concepts, texts, tasks, random variables). We consider here stochastic models for such graphs. These models attempt to represent the stochastic mechanisms that produce relational ties, and the complex dependencies thus induced.

Social network data typically consist of a set of n actors and a relational tie random variable, Y_ij, measured on each possible ordered pair of actors, (i, j), i, j = 1, …, n, i ≠ j. In the most simple cases, Y_ij is a dichotomous variable, indicating the presence or absence of some relation of interest, such as friendship, collaboration, transmission of information or disease, etc. The data are often represented by an n × n sociomatrix Y, with diagonal elements, representing self-ties, treated as structural zeros. In the case of binary relations, the data can also be thought of as a graph in which the nodes are actors and the edge set is <(i, j) :Y_ij = 1>. For many networks the relations are undirected in the sense that Y_ij = Y_ji, i, j = 1, …, n.

In the application in this paper we consider a network formed from the collaborative working relations between n = 36 partners in a New England law firm [Lazega (2001)]. We focus on the undirected relation where a tie is said to exist between two partners if and only if both indicate that they collaborate with the other. The scientific objective is to explain the observed structural pattern of collaborative ties as a function of nodal and relational attributes. The relational data is supplemented by four actor attributes: seniority (the rank number of chronological entry into the firm divided by 36), practice (there are two possible values, litigation = 0 and corporate law = 1), gender (3 of the 36 lawyers are female) and office (there are three different offices in three different cities each of different size).

For large or hard-to-find populations of actors it is difficult to obtain information on all actors and all relational ties. As a result, various survey sampling strategies and methods are applied. Some of these methods make use of network information revealed by earlier stages of sampling to guide later sampling. These adaptive designs allow for more efficient sampling than conventional sampling designs. We consider such designs in Section 2.

Most of the work presented here considers the network over the set of actors to be the realization of a stochastic process. We seek to model that process. An alternative is to view the network as a fixed structure about which we wish to make inference based on partial observation.

In this paper we develop a theoretical framework for inference from network data that are partially-observed due to sampling. This work extends the fundamental work of Thompson and Frank (2000). For purposes of presentation, we focus on the relational data itself and suppress reference to covariates of the nodes. This more general situation is dealt with in Handcock and Gile (2007).

In Section 2 we present a conceptual framework for network sampling. We extend this framework in Section 3 to focus on inference from sampled network data. We first consider the limitations of design-based inference in this setting, then focus on likelihood-based inference. Section 4 presents the rich Exponential Family Random Graph Model (ERGM) family of models that has been applied to complete network data. Section 5 presents a study of the effect of sampling from a known complete network of law firm collaborations. Finally, in Section 6, we discuss the overall ramifications for the modeling of social networks with sampled data and note some extensions.

2. Network sampling design

In this section we consider the conceptual and computational theory of network sampling.

There is a substantial literature on network sampling designs. Our development here follows Thompson and Seber (1996) and Thompson and Frank (2000). Let 𝒴 denote the set of possible networks on the n actors. Note that in most network samples, the unit of sampling is the actor or node, while the unit of analysis is typically the dyad. Let D be the n × n random binary matrix indicating if the corresponding element of Y was sampled or not. The value of the i, jth element is 0 if the (i, j) ordered pair was not sampled and 1 if the element was sampled. Denote the sample space of D by 𝒟. We shall refer to the probability distribution of D as the sampling design. The sampling design is often related to the structure of the graph and a parameter ψ ∈ Ψ, so we posit a model for it. Specifically, let P(D = d|Y = y;ψ) denote the probability of selecting sample d given a network y and parameter ψ.

Under many sampling designs the set of sampled dyads is determined by the set of sampled nodes. Let S represent a binary random n-vector indicating a subset of the nodes, where the ith element is 1 if the ith node is part of the set, and is 0 otherwise. We often consider situations where D is determined by some S which is itself a result of a sample design denoted by P(S|Y, ψ). For example, consider an undirected network where the set of observed dyads are those that are incident on at least one of the sampled nodes. In this case D = S ◦ 1 + 1 ◦ S − S ◦ S, where 1 is the binary n-vector of 1s. A primary example of this is where people are sampled and surveyed to determine all their edges.

We introduce further notation to allow us to refer to the observed and unobserved portions of the relational structures. Denote the observed part of the complete graph Y by Y_obs = Y_ij : D_ij = 1> and the unobserved part by Y_mis = Y_ij : D_ij = 0>. The full observed data is then Y_obs, D>, in contrast to the complete data: Y_obs, Y_mis, D>. We will write the complete graph Y = Y_obs, Y_mis>. In addition, we make the convention that undefined numbers act as identity elements in addition and multiplication. So a number x plus or multiplied by an undefined number y is x, and hence Y = Y_obs + Y_mis. For a given network y ∈ 𝒴, denote the corresponding data as y_obs, d> and the other elements by their lower-case versions y = y_obs + y_mis. Finally denote 𝒴(y_obs) = y_obs + υ ∈ 𝒴>, that is the set of possible unobserved elements which together with y_obs result in valid network. The set y_obs + 𝒴(y_obs) is then the restriction of 𝒴 to y_obs.

A sampling design is conventional if it does not use information collected during the survey to direct subsequent sampling of individuals (e.g., network census and ego-centric designs). Specifically, a design is conventional if P(D = d|Y = y;ψ) = P(D = d|ψ) ∀y ∈𝒴. A simple example of a conventional sampling design for networks is an ego-centric design, consisting of a simple random sampling of a subset of the actors, followed by complete observation of the dyads originating from those actors. A complete census of the network is another. More complex examples include designs using probability sampling of pairs and auxiliary variables. Alternatively, we call a sampling design adaptive if it uses information collected during the survey to direct subsequent sampling, but the sampling design depends only on the observed data. Specifically, a design is adaptive if: P(D = d|Y = y;ψ) = P(D = d|Y_obs = y_obs, ψ) ∀y ∈ y_obs + 𝒴(y_obs). Hence a design can be adaptive for a given y_obs (rather than all possible observed data), although most common such designs are adaptive for all possible data observed under them. Conventional designs can be considered to be special cases of adaptive designs.

Note that adaptive sampling designs satisfy

P(D = d|Y_obs, Y_mis, ψ) = P(D = d|Y_obs, ψ),

a condition called “missing at random” by Rubin (1976) in the context of missing data. Note that this is a bit misleading—it does not say that the propensity to be observed is unrelated to the unobserved portions of the network, but that this relationship can be explained by the data that are observed. The observed part of the data are often vital to equality (2.1). Hence adaptive designs are essentially those for which the unobserved dyads are missing at random.

Denote by [a] the vector-valued function that is 1 if the corresponding element of the vector a is logically true, and 0 otherwise. Let a × b be the elementwise product of the column vector a and the column vector b and a · b be the scalar product ∑_j a_jb_j. Let a ◦ b be the outer product matrix with ijth element a_ib_j. If y is a matrix and b a vector let y · b be the column vector with ith element ∑_j y_jib_j.

2.1. Some adaptive designs for undirected networks

We now consider several examples of adaptive designs for undirected networks.

2.1.1. Example: Ego-centric design

Consider a simple ego-centric design:

Select individuals at random, each with probability ψ.

Observe all dyads involving the selected individuals (i.e., dyads with at least one of the selected individuals as one of the pair of actors).

The sampling design can be determined for this case. First note that

P(D_ij = 1|Y, ψ) = 1 − (1−ψ) 2 ∀ i ≠ j.

This, however, does not give the joint distribution of D. Let S be the binary n-vector where 1 and 0 indicate that the corresponding individual has been selected, or not, respectively. Within this design, S is determined by D (i.e., S = [D1 = (n − 1)1]). Then P(S = s|Y, ψ) = ψ 1·s (1 − ψ) n−1·s , s ∈ n . If the i th element of S is 1 then all elements in the i th row and column of D are 1. D_ij = 0 if and only if both the i th and j th elements of S are both 0. Hence the probability distribution of D is

P(D = d|Y, ψ) = ψ 1·s (1−ψ) n−1·s d = 1◦s + s◦1 − s◦s, s ∈ n .

Note that the distribution does not depend on Y, and is therefore conventional.

2.1.2. Example: One-wave link-tracing design

We refer to any sample in which subsequent nodes are enrolled based on their observed relations with other sampled nodes as a link-tracing design. Consider the one-wave link-tracing design specified as follows:

Select individuals at random, each with probability ψ. Observe all dyads involving the selected individuals.

Identify all individuals reported to have at least one relation with the initial sample, and select them with probability 1.

Observe all dyads involving the newly selected individuals.

Let S₀ denote the indicator vector for the initial sample and S₁ the indicator for the added individuals not in the initial sample. Then the whole sample of individuals is S = S₀ + S₁. As in the undirected ego-centric design, D = 1 ◦ S + S ◦ 1 − S ◦ S. Note that S₁ = [Y S₀ × (1 −S₀) > 0] is derivable from S₀ and Y. Hence

P ( D = d | Y , ψ ) = ∑ s 0 : s 0 + [ Y s 0 × ( 1 − s 0 ) > 0 ] = s ψ 1 · s 0 ( 1 − ψ ) n − 1 · s 0 d = 1◦s + s◦1 − s◦s, s ∈ n .

2.1.3. Example: Multi-wave link-tracing design

Consider a multi-wave link-tracing design in which the complete set of partners of the kth wave are enrolled, that is, the link-tracing process described above is carried out k times. If k is fixed in advance this is called k-wave link-tracing.

Let S₀ denote the indicator for the initial sample, S₁ the indicator for the added individuals in the first wave not in the initial sample, …, S_k the indicator for the added individuals in wave k not in the prior samples. Then the whole sample of individuals is S = S₀ + S₁ + ⋯ +S_k. As in the ego-centric design D = 1 ◦ S + S ◦ 1 − S ◦ S. Note that S m = [ Y S m − 1 × ( 1 − ∑ t = 0 m − 1 S t ) > 0 ] , m = 1, …, k is derivable from S₀ and Y. Then

P ( D = d | Y , ψ ) = ∑ s 0 : s 0 + s 1 + ⋯ + s k = s ψ 1 · s 0 ( 1 − ψ ) n − 1 · s 0

for d = 1 ◦ s + s ◦ 1 − s ◦ s, s ∈ n . Here S m = [ Y S m − 1 × ( 1 − ∑ t = 0 m − 1 S t ) > 0 ] = [ Y obs S m − 1 × ( 1 − ∑ t = 0 m − 1 S t ) > 0 ] , m = 1, …, k so that the individuals selected in the successive waves only depend on the observed part of the graph, and not on the unobserved portions of the graph. Clearly, this is also true for one-wave link-tracing as a simple case of k-wave link-tracing. Note that it may be possible that S_m = ∅ for some m < k, so that subsequent waves do not increase the sample size (i.e., S_k = ∅). A variant of the k-wave link-tracing design is the saturated link-tracing design, in which sampling continues until wave m, such that S_m = ∅. We interpret k as the bound on the number of waves sampled imposed by the sampling design. Since saturated link-tracing does not restrict the number of waves sampled, we represent it by setting k =∞.

2.2. Some adaptive designs for directed networks

We can also consider variants of these adaptive designs for directed networks.

2.2.1. Example: Ego-centric design

Consider a simple ego-centric design:

Select individuals at random, each with probability ψ. Observe all directed dyads originating at the selected individuals.

As before, the sampling design can be determined for this case. Since a directed dyad is observed only if its tail node is sampled,

P(D_ij = 1|Y, ψ) = ψ ∀ i ≠ j

and D = S₀ ◦ 1. Hence the probability distribution of D is

P(D = d|Y, ψ) = ψ 1·s (1−ψ) n−1·s

for d = s ◦ 1, s ∈ n and the distribution does not depend on Y. As in the undirected case, this design is therefore conventional.

2.2.2. Example: One-wave link-tracing design

Consider a one-wave link-tracing design on a directed network specified as follows:

Select individuals at random, each with probability ψ. Observe all directed dyads originating at the selected individuals.

Identify all individuals receiving an arc from a member of the initial sample, and select them with probability 1.

Observe all directed dyads originating at the newly selected individuals.

P ( D = d | Y , ψ ) = ∑ s 0 : s 0 + [ Y s 0 × ( 1 − s 0 ) > 0 ] = s ψ 1 · s 0 ( 1 − ψ ) n − 1 · s 0

2.2.3. Example: Multi-wave link-tracing design

Consider a directed version of the multi-wave link-tracing design in which the complete set of out-partners of the kth wave are enrolled. The whole sample of individuals is S = S₀ + S₁ + ⋯ + S_k. And S m = [ Y · S m − 1 × ( 1 − ∑ t = 0 m − 1 S t ) > 0 ] , m = 1, …, k is derivable from S₀ and Y. Then

P ( D = d | Y , ψ ) = ∑ s 0 : s 0 + s 1 + ⋯ + s k = s ψ 1 · s 0 ( 1 − ψ ) n − 1 · s 0

for d = s ◦ 1, s ∈ n , where we note that S m = [ Y · S m − 1 × ( 1 − ∑ t = 0 m − 1 S t ) > 0 ] = [ Y obs · S m − 1 × ( 1 − ∑ t = 0 m − 1 S t ) > 0 ] , m = 1, …, k so that the individuals selected in successive waves of depend only on the previously observed part of the graph, and not on the unobserved portions. The saturated link-tracing design is represented by k = ∞.

3. Inferential frameworks

In this section we consider two frameworks for inference based on sampled data. In the design-based framework y represents the fixed population and interest focuses on characterizing y based on partial observation. The random variation considered is due to the sampling design alone. A key advantage of this approach is that it does not require a model for the data themselves, although a model may also be used to guide design-based inference [Särndal, Swensson and Wretman (1992)]. Under the model-based framework, Y is stochastic and is a realization from a stochastic process depending on a parameter η. Here interest focuses on η which characterizes the mechanism that produced the complete network Y. We find severe limitations of the design-based framework for data from link-tracing samples, and focus on likelihood inference within the model-based framework.

3.1. Design-based inference for the network

In the design-based frame, the unobserved data values, or some functions thereof, are analogous to the parameters of interest in likelihood inference. The population of data values is treated as fixed, and all uncertainty in the estimates is due to the sampling design, which is typically assumed to be fully known (not just up to the parameter ψ).

Inference typically focuses on identifying design-unbiased estimators for quantities of interest measured on the complete network. In an undirected network analysis setting, for example, we can consider estimating τ =∑_ij y_ij, the number of edges in the network. Note that y is a partially-observed matrix of constants in this setting. Then τ ̂ is design-unbiased for τ if

𝔼 D [ τ ^ | ψ , y ] = τ ,

where the expectation is taken over realizations of the sampling process. Specifically,

𝔼 D [ τ ^ ( Y obs , D ) | ψ , y ] = ∑ d ∈ 𝒟 τ ^ ( y obs ( d ) , d ) P ( D = d | ψ , y ) ,

where τ ̂ (y_obs(d), d) is the estimator expressed as a function of the observed network information. Similarly, the variance of the estimator is computed with respect to the variation induced by the sampling procedure

𝕍 D [ τ ^ ( Y obs , D ) | ψ , y ] = ∑ d ∈ 𝒟 ( τ ^ ( y obs ( d ) , d ) − τ ) 2 P ( D = d | ψ , y ) .

The Horvitz–Thompson estimator is a classic tool of design-based inference, and is based on inverse-probability weighting the sample. In our example, it is

τ ^ ( Y obs , D ) = ∑ i < j : D i j = 1 y i j π i j ,

where the dyadic sampling probability π_ij = P(D_ij = 1| ψ, y) is the probability of observing dyad (i, j).

Consider an estimator of τ based on relations observed through the egocentric design of Section 2.1.1. Then

π_ij = 1 − (1−ψ) 2 ∀ i, j.

The classic Horvitz–Thompson estimator τ ̂ of τ then weights each observation by the inverse of its sampling probability

τ ^ = ∑ i < j : D i j = 1 y i j π i j = 1 1 − ( 1 − ψ ) 2 ∑ i < j : D i j = 1 y i j . 𝕍 ( τ ^ ) = ∑ i < j ∑ k < l < [ 1 − ( 1 − ψ ) 2 ] − 2 π i j , k l − 1 >y i j y k l , π i j , k l = < π i j , i = k , j = l , π i j π k l , i ∉ < k , l >and j ∉ < k , l >, ψ 3 − 3 ψ 2 , otherwise .

Among the many available estimators for the variance of the Horvitz–Thompson estimator is the Horvitz–Thompson variance estimator:

𝕍 ^ ( τ ^ ) = ∑ i < j : D i j = 1 ∑ k < l : D k l = 1 1 π i j , k l < [ 1 − ( 1 − ψ ) 2 ] − 2 π i j , k l − 1 >y i j y k l .

Note the importance of the unit sampling probabilities in these estimators. This is a hallmark of design-based inference: inference relies on full knowledge of the sampling procedure in order to make unbiased inference without making assumptions about the distribution of the unobserved data. This typically requires knowledge of the sampling probability of each unit in the sample. This procedure is complicated in the network context, in that we require the sampling probabilities of the units of analysis, dyads, which are different from the units of sampling, nodes. In fact, for even single-wave link-tracing samples, the dyadic sampling probabilities are not observable.

For the one-wave link-tracing design of Section 2.1.2, N(i, j) = k> : y_ik = 1 or y_jk = 1 or k ∈ i, j>. Then if the initial sample S₀ is drawn according to the design in Section 2.1.2, π_ij = 1 − (1 − ψ) ‖N(i,j)‖ . Suppose S_0i = 1, and S_0j = 0. Then dyad (i, j) is observed, but ‖N(i, j)‖ is unknown because it is unknown which k satisfy y_jk = 1. The link-tracing sampling structures for which nodal and dyadic sampling probabilities are observable are summarized in Table 1 . For directed networks, we assume sampled nodes provide information on their out-arcs only, so that D is not symmetric and D_ij = 1 ⇔ S_i = 1.

Table 1

Observable sampling probabilities under various sampling schemes for directed and undirected networks. Nodal and dyadic sampling probabilities are considered separately. “X” indicates observable sampling probabilities, while a blank indicates unobservable sampling probabilities

Sampling scheme	Nodal probabilities π_i		Dyadic probabilities π_ij
	Undirected	Directed	Undirected	Directed
	Ego-centric	X	X	X	X
One-wave	X
k-wave, 1 < k < ∞ saturated	X

Of the designs considered here, dyadic sampling probabilities are observable only for ego-centric samples, and never for link-tracing designs. Nodal sampling probabilities are also observable for ego-centric sampling, as well as for one-wave and saturated link-tracing designs in undirected networks. Overall, this table presents strong limitations to the applicability of design-based methods requiring the knowledge of sampling probabilities to link-tracing designs. Note that this limitation is not specific to dyad-based network statistics. Estimation of triad-based network statistics such as a triad census would be subject to similar limitations. A Horvitz–Thompson style estimator would rely on a weighted sum of observed triads, weighted according to sampling probabilities. Sampling probabilities for triads would be even more complex, as they would typically require sampling of two of the three nodes involved in an undirected case, and at least two of the three nodes in an directed case, depending on the triad census. Both of these sampling probabilities would not be possible to compute for link-tracing samples in which the degrees or in-degrees of some involved nodes are unobserved.

Not surprisingly, most of the work on design-based estimators for link-tracing samples has focused on the cases where sampling probabilities are observable: typically for one-wave or saturated samples used to estimate population means of nodal covariates. Frank (2005) presents a good overview and extensive citations to this literature. See also Thompson and Collins (2002); Snijders (1992). Although examples tend to focus on instances where sampling probabilities are observable, the limited applicability of classical design-based methods in estimating structural network features based on link-tracing samples has not been emphasized in the literature.

In the absence of observable sampling probabilities, design-based inference requires a mechanism for estimating sampling probabilities. This is most often necessary in the context of out-of-design missing data, and addressed with approaches such as propensity scoring [Rosenbaum and Rubin (1983)], which rely on auxiliary information available for the full sampling frame to estimate unknown sampling probabilities. Link-tracing differs from the traditional context of such methods in that the sampling probabilities are unobserved even when the design is executed faithfully, and in that the unknown sampling probabilities result directly from the unobserved variable of interest. In particular, estimating unknown sampling probabilities is equivalent to estimating unobserved relations based on the observed relations. One approach is to augment the sample with sufficient information to allow for determination of the sampling probabilities. However in most cases, this requires a substantial expansion of the sampling design. Therefore, in practice we must rely on a model relating the observed portions of the network structure to the unobserved portions. Lack of reliance on an assumed outcome model is a great advantage of the design-based framework over the model-based framework. By introducing a model to estimate sampling probabilities based on the outcome of interest, we reintroduce this reliance on model form, negating much of the advantage of the design-based framework. Furthermore, note that the naive use of this approach has an ad-hoc flavor, while still requiring complex observation weights and variance estimators.

In the next section, we describe an alternative more flexible likelihood approach to network inference based on link-tracing samples.

3.2. Likelihood-based inference

Consider a parametric model for the random behavior of Y depending on a parameter p-vector η: