A meritocratic network formation model for the rise of social media influencers

The majority of today’s online social networks offer the users the possibility to actively contribute to the platform’s growth by sharing different forms of UGC, according to their interests, competencies, and willingness. Looking from a different perspective, users are also exposed to the content generated by others on the platform. Thanks to the integrated search engines and the use of hashtags they can explore this content, discover users with similar interests, and ultimately decide to become followers. Given the limited time, they can spend on social media platforms, intuitively users seek to optimize the list of followees so as to receive high-quality content.

Meritocratic principle

In order to support our intuition, we collected a longitudinal Twitter data set30 on a network composed of >6000 scientists working in the area of complex social networks. Compared with other data sets, one of the advantages is that we can expect most of the complex network scientists to be active on Twitter, given that they consistently study social networks effects. Moreover, the most popular nodes can be easily associated with renowned researchers in the field. Arguably, the number of followers can be considered as a proxy for the quality of the content generated by a user. We support this hypothesis by manually inspecting and labeling the top in-degree nodes.

Then, we enumerate the agents in decreasing in-degree order, so that agent 1 (presumably providing the best UGC) is the most followed, agent 2 is the second most, and so on. For each agent i, we reconstruct the temporal sequence of the outgoing connections (({i}_{1},{i}_{2},ldots ,{i}_{{d}_{i}^{{{{{{{{rm{out}}}}}}}}}})), where ik is the index (rank) of the destination of the k-th outgoing connection of agent i, and ({d}_{i}^{{{{{{{{rm{out}}}}}}}}}) is her final out-degree. Then, for each agent i we compute the probability

$${p}_{i}=frac{left|{jin left{2,ldots ,{d}_{i}^{{{{{{{{rm{out}}}}}}}}}right},,{{mbox{s.t.}}}{i}_{j} , > , {{mbox{median}}},{{i}_{1},ldots ,{i}_{j-1}}}right|}{{d}_{i}^{{{{{{{{rm{out}}}}}}}}}-1},$$

(1)

that estimates the likelihood that a new connection of agent i is higher (in ranking) than the median of the previous connections. At one extreme, for pi = 0 the sequence (({i}_{1},{i}_{2},ldots ,{i}_{{d}_{i}^{{{{{{{{rm{out}}}}}}}}}})) is such that every new followee has a rank smaller than (or equal to) the median of the current list of followees’ ranks, i.e., ({i}_{j}le ,{{mbox{median}}},{{i}_{1},ldots ,{i}_{j-1}}). As such, the rolling median, i.e., the median computed among the first k elements, is always non-increasing in k, and the user is continuously seeking better UGC. Conversely, for pi = 1 the rolling median is always increasing.

In Fig. 1 we compare the histogram of the empirical distribution of pi (in purple) with the null hypothesis, plotted in blue, in which we remove the temporal-ordered pattern of the sequence. The elements in agent i’s followees set ({{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{out}}}}}}}}}:= {{i}_{1},{i}_{2},ldots ,{i}_{{d}_{i}^{{{{{{{{rm{out}}}}}}}}}}}) are randomly re-ordered: in the new sequence (({bar{i}}_{1},{bar{i}}_{2},ldots ,{bar{i}}_{{d}_{i}^{{{{{{{{rm{out}}}}}}}}}})), the k-th element of the list is drawn uniformly at random from ({{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{out}}}}}}}}}setminus left({bar{i}}_{1},ldots ,{bar{i}}_{k-1}right)). The null hypothesis has a median value of ~0.5, which is easy to interpret: if the sequence is completely random, adding an extra element to the partial sequence has a 50% probability of being above the median, and 50% probability of being below it. Comparing the two distributions, we notice that empirical data tend to have a decreasing quality-ranking sequence of followees. As the difference is statistically significant, we can reject the hypothesis that the temporal sequence is random.

Fig. 1: Median-rule violation on the Twitter data set.
figure1

In purple, we plot the histogram of the probability pi as defined in eq. (1). The data refer to N = 6474 agents out of the original 6757 by considering those with an out-degree of at least two. The median of the distribution is 0.436 (mean and std: 0.450, 0.189). In light blue, we compute the same distribution upon reshuffling the temporal sequences of the connections (null hypothesis). The median of this distribution is 0.5 (mean and std: 0.489, 0.173). The two distributions are statistically significantly different (p value of Kolmogorov–Smirnov test  10−8).

Ultimately, this empirical evidence confirms our intuition, i.e., users tend to continuously increase the quality threshold of the new followees. This characteristic, though, is missing in the network formation literature, as there is typically no quality associated with the users. For instance, in the preferential attachment model27, each user selects m followees proportional to their in-degree. If the network is large enough, the probability of selecting a node k in the d-th draw does not depend on d. Therefore, the temporal sequence of connections of the preferential attachment model is similar to the null hypothesis. Moreover, even in the fitness model16 where the quality (fitness) is considered, users tend to connect to high-quality nodes first, rather than later.

Quality-based model

To formalize our quality-based model, we consider the unweighted directed network among N ≥ 2 agents whose UGC revolves around a specific common interest, e.g., a particular traveling destination. We denote the directed tie from i to j with ({a}_{ij}in left{0,1right}), where aij = 1 means i follows j. Then, we assume there are no self-loops and that each agent i can only control her followees aij but not her followers aji. Similarly to the approach in the fitness model16, we endow each actor i with an attribute qi, drawn from a probability distribution, e.g., uniform, normal, exponential distribution, that describes the average quality of i’s content32, e.g., a picture taken at that traveling destination. As will be manifested later, our model predictions are independent of the numerical representation of these qualities, which could be somehow subjective and arbitrary. Instead, in our model, only the ordering of the individual qualities matters. Therefore, the choice of the underlying probability distribution does not affect any of the following results, contrary to the fitness model16.

The quality qi can be seen as the expectation of a Bernoulli random variable Qi describing the probability of followers liking agent i’s content. Higher values of qi are then associated with better UGC. A value of zero, instead, can be used to model users that do not produce any UGC. With this setup, the model can be directly applied to the platforms, e.g., YouTube or Twitch, in which users can be partitioned into two classes, i.e., the content creators and their followers (or viewers, see Supplementary Note 1).

We then consider a sequential dynamical process starting from the empty network, where at each time-step (tin left{1,2,ldots right}) each actor i picks another distinct actor j chosen randomly from a probability distribution on (left{1,ldots ,i-1,i+1,ldots ,Nright}). In the following theoretical analysis, we consider the uniform distribution. However, we also integrate into our discussion the numerical comparison between uniform distribution and the in-degree-based preferential attachment meeting process.

To reflect the meritocratic principle, we base the tie formation decision on the comparison between i’s current followees’ and j’s qualities. Let the payoff function of agent i measure the maximum quality received by i, i.e.,

$${V}_{i}
(2)

where ({{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{out}}}}}}}}}
(3)

meaning that i will add j in her followees’ set if j provides better quality content compared to i’s current followees. Note that, if i finds a node that already belongs to her set of followees, the connection will not be re-discussed. While, intuitively, this may lead to a large out-degree, we will show that this is not the case because the cost of good-quality connections is infinitely low, but the cost of poor-quality ones is infinitely high.

We emphasize that the choice of the payoff function eq. (2) reflects the natural continuous chase for the maximum quality37 (exploitation) while minimizing the effort owing to non-improving connections. Alternatively, one could consider smoother payoffs such as the followees’ average quality

$${V}_{i}

Theorem 1

(Convergence) For any set of qualities (left{{q}_{1},ldots ,{q}_{N}right}), the network reaches an equilibrium almost surely. Moreover, the probability of reaching equilibrium within t time-steps reads as follows:

$${mathsf{P}} [,{{mbox{An equilibrium is reached within}}};t;{{mbox{time steps}}},]\ ={mathsf{P}}[{a}_{12}

From the above theoretical cumulative distribution function, we derive the expected number of time-steps to reach equilibrium. The result, shown in Fig. 2 (in blue), indicates that the number of time-steps grows almost exponentially in the network size N.

Fig. 2: Numerical analysis of the number of time-steps until convergence.
figure2

For each value of N, we run 1000 simulations, in three different settings for the meeting process probability distribution: in blue, we use a uniform distribution (the dashed line also shows the theoretical results), in orange, we use a preferential attachment mechanism, and, in purple, we use a mixed distribution (50% chance of uniform distribution and 50% chance of preferential attachment, for each meeting). The data points indicate the median value. The shaded area indicates the first and third quantile.

For comparison, we consider two alternative scenarios for the meeting probability distribution. In orange, we use a preferential attachment approach: at each time-step, each agent i meets a distinct agent j according to a probability that is proportional to the current in-degree of the agent j. By doing so, the quality-based dynamics (in the follow/not follow decision) are combined with a meeting process that coarsely resembles the effect of recommendation systems (increasing the visibility of high in-degree nodes). In purple, the meeting agent is chosen with 50% probability, from the uniform distribution, and with the remaining 50% probability, according to the preferential attachment mechanism. As the numerical results indicate, the preferential attachment-based meeting process significantly reduces the number of required time-steps, even when mixed with a uniform distribution.

In-degree distribution

The network structure at equilibrium is a consequence of the probability distribution ruling the meeting process, and of the meeting order itself. Even though there can be multiple equilibria, some network macroscopic properties can be statistically described. A preliminary observation is that, at equilibrium, every other node follows node 1, and node 1 follows node 2. Before studying the stationary state, we statistically describe the transient in-degree distribution as a function of the quality ranking i. In particular, we provide an analytical formula for the expected in-degree of each node as a function of i, as stated in the following theorem.

Theorem 2

(In-degree distribution) The probability that node i is followed by node j ≠ i after t > 0 time-steps is:

$${mathsf{P}}[{a}_{ji}
(4)

Moreover, the probability of node i having in-degree ({d}_{i}^{{{{{{{{rm{in}}}}}}}}}
(5)

where we omitted the time-step dependency on ({bar{p}}_{i}) and ({underline{p}}_{i}). Finally, the expected in-degree of node i after t time-steps reads as:

$${mathbb{E}}left[{d}_{i}^{{{{{{{{rm{in}}}}}}}}}
(6)

The proof of the theorem is provided in Supplementary Note 1, whereas in Fig. 3a we show the probability density function, together with its expectation, for a network of 1000 agents, after t = 200 time-steps (non-stationary state). The following corollary, instead, studies the probability density functions upon reaching the network dynamics equilibrium.

Fig. 3: In-degree, theoretical analysis.
figure3

Given a network of N = 1000 agents, the plots show the probability density functions of the in-degree (derived through eq. (5) from eq. (4) and eq. (7)) as a function of the rank i. In orange, we also plot their expected value (equations eq. (6) and eq. (8)). The plot in a refers to the theoretical results at t = 200, in b to the equilibrium. The plots are shown in log–log scale, to emphasize Zipf’s law: ({mathbb{E}}[{d}_{i}^{{{{{{{{rm{in}}}}}}}},star }]=N/i).

Corollary 1

At equilibrium, the probability that node i is followed by node j ≠ i is:

$${mathsf{P}}[{a}_{ji}=1]:= mathop{lim }limits_{tto infty }{mathsf{P}}[{a}_{ji}
(7)

and the expected in-degree of node i reads as:

$${mathbb{E}}left[{d}_{i}^{{{{{{{{rm{in}}}}}}}},star }right]=left{begin{array}{ll}N-1,quad &, {{mbox{if}}},i=1,\ frac{N}{i},quad &,{{mbox{otherwise}}}.end{array}right.$$

(8)

Proof

First, we derive the probability of node i being followed by node j at equilibrium by taking the limit t →  in eq. (4):

$$mathop{lim }limits_{tto infty }{mathsf{P}}[{a}_{ji}

Similarly, taking the limit for t →  of eq. (6) yields exactly to

$${mathbb{E}}left[{d}_{i}^{{{{{{{{rm{in}}}}}}}},star }right]=left{begin{array}{ll}N-1,quad &, {{mbox{if}}},i=1,\ frac{N}{i},quad &,{{mbox{otherwise}}},.end{array}right.$$

According to the above result, at equilibrium the best content provider, node 1, receives N − 1 connections, node 2 has N/2 expected followers, node 3 has N/3, and so on. The result can be intuitively reached with the following plausible reasoning: any user that has not yet found node 1 nor node 2, has the same probability of finding any of the two in the coming time-step. In expectation, in half of the cases, the user will become a follower of node 2 before finding and following (necessarily) node 1. In the other half of the case, she will find node 1 before having seen node 2. Thus, the expected number of followers of node 2 is half of the expected number of followers of node 1. The reasoning can be extended for any node j > 2.

Such a regular scaling property is called Zipf’s law34 and it is illustrated in Fig. 3b, where we plot the expected in-degree of each node as a function of its quality ranking, together with the probability density functions. In the log–log scale, the expected in-degree perfectly matches a line with coefficient −1. Real-world evidence of Zipf’s law has been reported in many systems, including firm sizes38, city sizes33, connections between web-pages39.

To empirically validate our results in the context of online social networks, we collected and analyzed data from Twitch, an online social media platform focusing on video streaming, including broadcasts of gameplay, e-sports competitions, and real-life content. Over the past decade, Twitch gradually became one of the most popular social media platforms, reaching up to four million unique creators streaming each month and 17 million average daily visitors (https://www.twitch.tv/p/press-center/), and serving as a virtual third place for online entertainment and social interactions40. Our data sets consist of the followership networks in three different categories, i.e., poker, chess, and art, where users can live-watch broadcasters playing or discussing these arguments. Among them, we only retain the first two categories, as the third one did not satisfy the criterion of having a baseline community of interested users. The description of our crawling method and of the collected data are provided in the Methods section and in Supplementary Note 3. In Fig. 4, we show the in-degree of the 15 most followed users in chess and poker as a function of the rank. The empirical data are fitted via linear regression (in the log–log plot) and look strikingly similar to Zipf’s law.

Fig. 4: Zipf’s law on empirical Twitch data.
figure4

The log–log plot shows empirical evidence of Zipf’s law in our Twitch data sets. In blue, we report the data on the number of followers of the top 15 users (broadcasters) in the chess category, as well as their linear regression fit of coefficient −1.04 (Pearson coefficient, −0.98, R2 = 0.96, RMSE = 0.16). In purple, the poker data set shows a similar fitting coefficient of −0.98 (Pearson coefficient: −0.97, R2 = 0.95, RMSE = 0.17). Both fits are very close to the theoretically expected Zipf’s law (of coefficient −1) shown in orange.

Zipf’s law is more than a power-law

The peculiarity and apparent ubiquity of Zipf’s law have triggered numerous efforts to explain its origins33. Despite being a discrete distribution, Zipf’s law is often associated with the continuous Pareto distribution, better known as power-law41. For this reason, it is frequently seen as the result of a linear preferential attachment process27 based on Gibrat’s rule of proportional growth42 which leads to Yule-Simon (power-law) distributions43. However, as noticed in35, there is more than a power-law in Zipf: although a power-law distribution is certainly necessary to reproduce the asymptotic behavior of Zipf’s law at large values of rank i, any random sampling of data does not lead to Zipf’s law and the deviations are dramatic for the largest objects. In particular, Zipf’s law emphasizes the relation among the top-ranking elements, which essentially correspond to the most important nodes, i.e., the network influencers. The typical Zipf’s sequence N, N/2, N/3, … can be observed only if data constitutes a coherent set35. Thus, Zipf’s law is more insightful than just a power-law.

The difference with a Pareto distribution becomes evident when considering our in-degree probability density function, which can be analytically derived by using the result of the previous theorem and computing the average of each user’s probability density function

$${mathsf{P}}[{d}^{{{{{{{{rm{in}}}}}}}},star }=d]=frac{1}{N}mathop{sum }limits_{i=1}^{N}{mathsf{P}}[{d}_{i}^{{{{{{{{rm{in}}}}}}}},star }=d].$$

As shown in Fig. 5, the theoretical in-degree probability density function follows a power-law of coefficient (hat{alpha }=-2.06pm 0.003) (p value < 10−8), which is not surprising since the expected in-degree is distributed according to a Zipf’s law (see again the discussion in ref. 41). However, notice that the power-law fit does not hold in the region of low-in-degree nodes (dmin = 7, according to the Clauset algorithm44). In this region, the log-normal distribution is a better fit compared to the power-law45. On the other hand, the fit with a log-normal distribution is worse (compared to the power-law) in the heavy tail of our distribution, i.e., in the region of high-quality nodes. However, the most noticeable difference is that, contrary to both the power-law and the log-normal curve, our theoretical distribution is not monotonically decreasing, especially in its right tail. The probability of having a node of in-degree N/2 is slightly larger than the probability of having a node of in-degree N/2 ± α, for some range of α ≥ 1. Moreover, the probability of having a node of in-degree in a neighborhood of N/2 is larger compared with the power-law fit (chosen as baseline). Conversely, our theoretical probability of having a node of in-degree slightly smaller than N − 1 is negligible (<10−8), stating that the second-best node cannot be too close to the best node. This is in sharp contrast with the monotonic Pareto and log-normal distributions.

Fig. 5: In-degree probability density function.
figure5

In gray, the theoretical result for N = 1000 agents is shown. In orange, we plot the numerical distribution resulting from 1000 simulations upon reaching equilibrium. In black, the numerical distribution is shown after using the standard logarithmic binning of data. Finally, we fit the numerical data with a power-law using the algorithm in Clauset et al.44 (blue) and with a log-normal distribution (purple).

In synthesis, our probability distribution differs from Pareto and log-normal distributions when focusing on the network influencers. In this range, i.e., in the extreme right tail, our theoretical distribution is similar to its power-law fit, only after logarithmically (and thus coarsely) binning the data (as common practice when empirical data are scattered and the distributions are continuous). Yet, without such arguably coarse binning of the data, our theoretical distribution can predict more accurately the Zipf’s regularity found on the top influencers of real-world networks, as in the Twitch data set shown in Fig. 4. We finally refer to Supplementary Note 4 (see in particular, Supplementary Figs. 12, 13) for a detailed analysis of the empirical in-degree distribution.

Preferential attachment meeting process

The theoretical results on the nodes’ in-degree distributions are derived based on a uniform distribution meeting process. In this scenario, every user has the same probability of being found. However, most social media platforms personalize the content users are exposed to, thus it is fundamental to understand what can be the impact if some users have more visibility than others. In order to do so, we numerically studied the results of a preferential attachment in-degree-based meeting process, which mimics the idea that popular users might get promoted (in terms of visibility) by the platform’s recommendation system. Compared with the preferential attachment model27, though, here we only alter the probability of being found, but the connection will still depend on the meritocratic principle of eq. (3). In Fig. 6, we report the results of a mixed scenario in which the potential followee is chosen with 50% probability from the uniform distribution, and with the remaining 50% probability from an in-degree based distribution. We refer to Supplementary Note 2 (see in particular Supplementary Fig. 1) for more details on the numerical comparison between uniform distribution, mixed preferential attachment, and pure preferential attachment. Compared to the uniform distribution scenario in Figs. 3 and 5, the introduction of a mixed preferential attachment-based meeting process slightly increases the variance of the in-degree probability distribution of each agent. In this scenario, it becomes possible that some low-quality agents get an initial (i.e., in the early stage of the network formation process) advantage (purely by chance), which gets reinforced by the (mixed) preferential attachment mechanism. Yet, this effect quickly fades away, because potential followers still undergo the quality threshold rule of eq. (3). On the other hand, the (mixed) preferential attachment may penalize some other agents, which receive fewer followers than what they would expect with the uniform distribution process. It is remarkable that, even after introducing the mixed preferential attachment process, the correlation between quality and followers persists (on average): the higher the quality, the higher the average number of followers. Even more importantly, Zipf’s relation is robust under mixed preferential attachment-based meeting processes (e.g., recommendation systems). Similarly, on the right column of Fig. 6, the average in-degree probability distribution function is strikingly similar to the one obtained with the uniform distribution scenario.

Fig. 6: Numerical analysis of the preferential attachment effect on the in-degree distribution.
figure6

Numerical results of 1000 simulations with 1000 nodes for a mixed process (with 50% probability, the potential followee is chosen from a uniform distribution, and with the remaining 50% from a preferential attachment mechanism). a The color map shows the in-degree probability density function, as a function of the quality rank. In orange, the average in-degree. In blue, Zipf’s law equals the expected in-degree in the uniform distribution scenario. b The numerical probability density functions (orange) and their power-laws (blue) and log-normal (purple) fits are shown.

Out-degree distribution

Similarly to the in-degree, we can study the statistical distribution of the nodes’ out-degree at equilibrium. First, notice that, contrary to the in-degree, the out-degree distribution does not depend on the rank, but is uniform for all the nodes in the network. In fact, according to the dynamics, each node creates connections to increasing quality nodes, until reaching its own equilibrium. Let ({d}_{N}^{{{{{{{{rm{out}}}}}}}},star }) be the random variable describing the out-degree (at equilibrium) of a general node i in a network of N agents. First, note that the probability of node i > 1 finding node 1 (or node 1 finding node 2) at her first choice is simply:

$${mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=1right]=frac{1}{N-1}.$$

Conversely, the probability of having maximum out-degree corresponds to the probability of meeting all the other nodes in increasing quality ordering, i.e.,

$${mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=N-1right]=frac{1}{(N-1)!}.$$

Obviously, a node cannot have an out-degree larger than N−1 in a network of N agents, thus ({mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=dright]=0), for all d ≥ N. Similarly, ({mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=dright]=0), for all d < 1, every node must have out-degree of at least 1. Then, we denote as C1 the random variable describing the first meeting of i. C1 can take any value (jin left{1,ldots ,i-1,i+1,Nright}), with uniform probability 1/(N − 1). Then, the probability ({mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=dright]), for N ≥ 2 and 1 ≤ d ≤ N − 1, can be described recursively as follows:

$${mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=dright]= mathop{sum }limits_{j=1}^{i-1}{mathsf{P}}[{C}_{1}=j]{mathsf{P}}left[{d}_{j}^{{{{{{{{rm{out}}}}}}}},star }=d-1right] \ +mathop{sum }limits_{j=i+1}^{N}{mathsf{P}}[{C}_{1}=j]{mathsf{P}}left[{d}_{j-1}^{{{{{{{{rm{out}}}}}}}},star }=d-1right] \ = frac{1}{N-1}mathop{sum }limits_{j=1}^{N-1}{mathsf{P}}left[{d}_{j}^{{{{{{{{rm{out}}}}}}}},star }=d-1right].$$

(9)

In other words, ({mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=dright]) is equal to the sum of the probability of agent i choosing (jin left{1,ldots ,i-1,i+1,ldots ,Nright}) as first choice times the probability of having out-degree d − 1 in a network with j remaining nodes (the node itself and the j − 1 nodes to which she can still connect to).

Thanks to eq. (9), we can describe the out-degree probability functions by means of recursion on the network size N. The result for different values of N is pictured in Fig. 7a. The out-degree distribution exhibits non-monotonic properties (first increasing, then decreasing) and quickly vanishing tail: extremely large values are particularly rare. Even though the theoretical out-degree distribution cannot be directly associated with any known distribution, it is very close to a gamma distribution (or to a Poisson distribution), as shown in Supplementary Fig. 2 (see also the discussion in Supplementary Note 2). Thanks to its non-monotonic nature and its fast decay, there are also some similarities with the log-normal distributions (whose decrease is more than linear in the log–log plot). On the other hand, it differs more significantly from a power-law distribution. Despite being observed in the in-degree distributions of many real-world social networks, power-laws do not always properly describe their out-degree distributions. Empirical evidence on the asymmetry between the in- and out-degree distributions is found on Twitter46 and YouTube47. In particular, several empirical studies on the out-degree distributions of online social networks highlight a much faster decrease in the probability of seeing very high out-degree nodes48,49,50. Albeit some influencers on YouTube, Instagram, or Twitch, might have one million followers, it is hard to imagine that some users follow several thousand or even a million of other users, since every single action of the following someone involves a decision-making process and at least a click on the Follow button. As a result, the out-degree distributions of these platforms feature a clear cut-off (sometimes even artificially imposed to prevent fake-user accounts) on the order of roughly a thousand connections (for network samples, this clearly reduces even further), instead of a heavy tail as featured by the power-law.

Fig. 7: Theoretical analysis of the out-degree distribution.
figure7

a Shows the out-degree complementary cumulative distribution function for different size N of the network. Note that the support of the distributions is [1, N−1] but with very quickly vanishing right tails. b Shows, in bold blue, the expected out-degree as a function of the number of agents N in the network. In gray, the function ({{{{{{{mathrm{log}}}}}}},}_{2}(N)) shows a similar growth. The shaded area represents the confidence interval obtained with one standard deviation computed from the out-degree distribution using the definition, (Varleft[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }right]=mathop{sum }nolimits_{d = 0}^{N-1}{left(d-{mathbb{E}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }right]right)}^{2}{mathsf{P}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }=dright]), and the recursive formula eq. (9).

In our Twitch data sets, we find evidence of this fast drop in the frequency of the high out-degree nodes. As shown in Fig. 8, the out-degree probability density function is concentrated in the range d [1, 10], with the 99-percentile being at d = 15 (d = 19) in the chess (poker) data set. As a comparison, our theoretical distribution would predict the 99-percentile to be at d ~ 10, while a power-law of classical exponent −2 predicts it at d = 100. The maximum out-degrees are found to be, respectively, d = 151 and d = 142, which are higher than the maximum out-degree of the theoretical distribution d ~ 20, but at least three orders of magnitude smaller than the maximum predicted by a power-law of exponent −2, i.e., from the maximum in-degree of the empirical distribution. At first glance, the theoretical and empirical results in Figs. 7a and 8 show some differences. In particular, compared with the data the empirical distribution exhibits a larger frequency of low out-degree nodes. The difference can be due to the fact that the network is still in the formation process, and many users only joined it recently. Therefore these most recent users have just started their search and there is an abundance of low out-degree nodes. We conjecture that another possible reason may be the recommendation systems behind the Twitch platform (we explored this effect by studying the numerical out-degree distribution under preferential attachment-based meeting process, see Supplementary Fig. 3). To better understand this inconsistency, in Supplementary Fig. 15, we studied the stacked frequencies of the out-degree of the followers of the 15 most followed nodes in each data set.

Fig. 8: Empirical analysis of the out-degree distribution.
figure8

The plot shows the complementary cumulative distribution function of the out-degree of the Twitch data sets concerning the users in the categories chess (blue) and poker (purple). For comparison, we plot the results of our theoretical distribution with N = 350 (gray), and of a power-law of exponent −2 (orange).

From the recursive description of the probability density function in eq. (9), it is possible to compute the expected nodes’ out-degree, as stated in the following theorem.

Theorem 3

(Out-degree distribution) At equilibrium, the nodes’ expected out-degree ({mathbb{E}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }right]) in a network of N ≥ 2 agents equals the (N − 1)-th harmonic number:

$${mathbb{E}}left[{d}_{N}^{{{{{{{{rm{out}}}}}}}},star }right]=mathop{sum }limits_{k=1}^{N-1}frac{1}{k}.$$

According to the theorem, the expected out-degrees as a function of N are given by the harmonic sequence: 1, 3/2, 11/6, 25/12, …. The proof provided in Supplementary Note 1 is built on the derivation of the out-degree probability distribution, however, the result can be intuitively derived from the following observations: the expected out-degree is uniform across all the nodes, and the sum of the expected out-degree equals the sum of the expected in-degree of all the nodes. From the result on the in-degree distribution (see eq. (8)), the sum of the expected in-degree is equal to (N − 1) + N/2 + N/3 + … + 1. Thus, the expected out-degree is equal to (N − 1)/N + N/(2N) + N/(3N) + … + 1/N, which in fact corresponds to the N − 1-th harmonic number. As shown in Fig. 7b, the expected out-degree has a similar growth compared with the base-2 logarithm of the network size N.

Diameter and clustering

Since each node is connected, on average, to roughly ({{{{{{{mathrm{log}}}}}}},}_{2}(N)) other nodes, intuitively, the network should feature the small-world property. In Fig. 9a we provide some numerical results on the average network diameter and average nodes’ distance for different network sizes. According to the results, also the network diameter has a growth rate similar to the base-2 logarithm of the network size. Moreover, the average node’s distance has even slower growth, with a value of 5 for networks of 104 nodes. These results match the empirical observations on the small-world property of real-world networks, according to which the distance between two randomly chosen nodes grows proportionally to the logarithm of network size51. Moreover, they are quantitatively similar to previous findings on, e.g., an Instagram network sample52 (of size equal to 44 thousand nodes) where the diameter is found to be 11, and the average distance 3.16.

Fig. 9: Diameter and clustering analysis.
figure9

Numerical analysis of a the average diameter and average nodes’ distance, and b average clustering coefficient for different network sizes. The shaded area indicates one standard deviation from the average value. For the plot on the left, we run 100 simulations for each value of N (except for the last data-point for which we run 25 simulations). For the plot on the right, we run 100 simulations if N ≤ 104, and one simulation otherwise.

As such, our theoretical model already captures some of the most widely observed features of real-world networks, i.e., the in-degree scaling and the small-world properties. A third common feature is the high clustering coefficient. Even though it has been shown that the average value on directed social networks such as Instagram or Twitter, is smaller than on other undirected social networks, e.g., Facebook, it remains on the order of 10%22. To compute the clustering coefficient for directed networks, we adopted the class out in the taxonomy of53 in which a triad around node i is closed when i follows two distinct agents j and k, and there exists a tie from j to k or from k to j. Accordingly, the clustering coefficient of i reads as

$${c}_{i}=frac{{sum }_{jne i}{sum }_{kne i,j}{a}_{ij}{a}_{ik}{a}_{jk}}{{sum }_{jne i}{sum }_{kne i,j}{a}_{ij}{a}_{ik}}.$$

The numerical results pictured in Fig. 9b indicate that the average clustering coefficient is monotonically decreasing in the network size N, yet it remains above 10% on networks composed of 106 nodes. Only a marginal increase is observed when introducing the preferential attachment on the meeting process (see the comparison in Supplementary Fig. 4).

Audience overlap

The previous analysis on the clustering coefficient studied the probability that two user’s followees are friends with each other. Similarly, it is interesting to analyze the probability that two (highly followed) users are followed by the same third user. In other words, we aim at studying the similarity between the followers’ sets of the different agents. The similarity between agents reveals the existence of common interest, and it can be used for link prediction54 or to improve the recommendation systems55. Inspired by the Jaccard index introduced for species similarities56, and already used to measure followers’ overlap57, we propose the following real-valued matrix O to measure the overlap between the audiences of two agents:

$$O(i,j):= frac{left|{{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}cap {{{{{{{{mathcal{F}}}}}}}}}_{j}^{{{{{{{{rm{in}}}}}}}}}right|}{left|{{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}right|}in left[0,1right],$$

(10)

if (left|{{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}right| , > , 0), and 0 otherwise, where ({{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}) denotes the set of followers of agent i. In other words, this coefficient measures the number of common followers of i and j, normalized by the number of followers of i. Note that, when agent i is lower in the ranking list with respect to agent j, i.e., i > j, we typically have (| {{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}| < | {{{{{{{{mathcal{F}}}}}}}}}_{j}^{{{{{{{{rm{in}}}}}}}}}|), and eq. (10) corresponds to the Szymkiewicz-Simpson coefficient, also known as the overlap coefficient58, where the denominator is replaced by (min{| {{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}| ,| {{{{{{{{mathcal{F}}}}}}}}}_{j}^{{{{{{{{rm{in}}}}}}}}}| }). Compared with it and to the Jaccard index56, whose denominator is (|{{{{{{{{mathcal{F}}}}}}}}}_{i}^{{{{{{{{rm{in}}}}}}}}}cup {{{{{{{{mathcal{F}}}}}}}}}_{j}^{{{{{{{{rm{in}}}}}}}}}|), our measure leads to a non-symmetric matrix.

In Fig. 10a, we plot the numerical results on the overlap index for networks of 105 nodes, upon reaching the equilibrium. The results are averaged over 10 simulations. According to the previous results, all the nodes should follow node 1 at equilibrium, thus any follower of a node i should also be a follower of node 1 and the overlap in the first column is simply (O(i, 1) = 1). Moreover, consistent with our result in eq. (8), we observe the Zipf’s sequence 1, 1/2, 1/3, …, in the first row (O(1, j) = 1/j). Perhaps surprisingly, a similar pattern appears in all the rows, upon averaging on a sufficiently large number of simulations. We emphasize that the results are independent of the number of nodes N. An intuition for these observations is provided in Supplementary Fig. 5.

Fig. 10: Followers’ overlap analysis.
figure10

Followers’ overlap results among the top 15 nodes. a The average numerical results were obtained from 10 simulations with 105 agents, upon reaching equilibrium. b The results from the Twitch data set related to the chess category. An equivalent result for the poker data set is reported in Supplementary Fig. 14.

To further validate our theoretical model, we perform the same analysis on the overlap among the 15 most followed nodes of our Twitch data sets, reported in Fig. 10b (for the chess data set) and in Supplementary Fig. 14 (for the poker data set). Our numerical results are qualitatively well aligned with the real-world data with respect to the horizontal decrease of the overlap index. Yet, the Twitch data sets show that the overlap index is not always uniform across the different rows. For instance, in the chess data set, low-ranking nodes exhibit a slightly higher overlap index with respect to the numerical results. Conversely, high-ranking nodes show the opposite behavior. This phenomenon might be related to the statistical dependency of individuals’ followee sets on their out-degrees, as discussed in Supplementary Note 4.

Leave a reply:

Your email address will not be published.