unit 5 (1)
Problem definition
● Here, we represent each user of a social network as a dot, called a node or vertex, and
the connections between users as links (edges). In computer science and applied
mathematics, this whole structure is called a graph.
follower recommendation.
● Given the graph, we make use of the already existing links. Here, u5 is a friend
friend of u1?
Or
Mathematically, a graph is written as:
G = <V, E>, where V is the set of vertices and E is the set of edges.
Directed graphs are typically used to model networks such as Instagram and Twitter, where one person follows
another person and the other person can also follow back. So, there can be two
The dataset for our problem statement is also a directed graph.
● For example, Instagram has a follow and follow-back system.
Follow means you’re opting to follow the person. Follow back means someone
followed you and you’re being offered the option to follow them back.
● A path is a sequence of valid edges that leads from one vertex to another, and
the length of a path is the number of edges it contains.
a → f → c → b → h, where length = 4
or
a → f → c → d → e → b → h, where length = 6
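The path definition can be checked with a short sketch. The edge set below is an assumption, reconstructed only from the two example paths (the original figure is not available here), and `path_length` is a hypothetical helper name.

```python
# Directed edges assumed from the two example paths; the real figure may contain more.
EDGES = {
    ("a", "f"), ("f", "c"), ("c", "b"), ("b", "h"),
    ("c", "d"), ("d", "e"), ("e", "b"),
}


def path_length(path, edges):
    """Return the number of edges in `path`; raise if any hop is not a valid edge."""
    hops = list(zip(path, path[1:]))
    for u, v in hops:
        if (u, v) not in edges:
            raise ValueError(f"{u} -> {v} is not an edge")
    return len(hops)


short_path = ["a", "f", "c", "b", "h"]
long_path = ["a", "f", "c", "d", "e", "b", "h"]
print(path_length(short_path, EDGES))  # 4
print(path_length(long_path, EDGES))   # 6
```

Because the edges are stored as ordered pairs, a hop in the reverse direction (for example f → a) is rejected, which is the behaviour we want for a directed graph.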
When we are dealing with travel along a path and the cost of moving along the
same path changes due to some factor, we have to consider the directions
train.csv
● The dataset was provided by Facebook, but it has since been removed. I’ll provide
edges/links.
- Source node
- Destination node
● The first datapoint in the above diagram states that there is an edge from U1 to
U690569. Here, U1 is the source node and U690569 is the destination node.
● U1 follows U690569, U315892, and U189226. All the edges originating
from “1” are listed below. Similarly, there are edges from 2, 3, etc. Remember,
which college they studied or did they study in the same college or some
● But as the network grows, people keep following new people. The network is
very dynamic in the real world because today I may have discovered my old friend
Facebook gave us a snapshot of the graph at a single point in time. So, there are some
So, we are mapping this to a binary classification task with “0” implying the
→ feature :
So, in the below figure, we are trying to predict whether the edge between U1 and
U2 follows → {U3,U4,U6}
Here, U1 and U2 each have a set of nodes they follow. As the two sets share
many vertices, that is, the sets overlap heavily, there is a high
So, there is a high chance that U1 may want to follow U6 and U2 may want to
follow U5.
Similarly, there is a high chance that U1 could follow U2 and U2 could follow
U1.
The fact that U1 is following U2 signifies that there is a high chance that U2 will follow
U1 back.
● So, these are called graph features.
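A minimal sketch of these graph features, using plain Python sets. U2's follow set is taken from the text; U1's set is an assumption chosen to be consistent with the suggestions described (U6 for U1, U5 for U2). The function names are hypothetical.

```python
# Who follows whom. U2's set comes from the text; U1's set is assumed.
follows = {
    "U1": {"U3", "U4", "U5"},
    "U2": {"U3", "U4", "U6"},
}


def common_followees(follows, a, b):
    """Nodes that both a and b follow (the overlap of the two sets)."""
    return follows.get(a, set()) & follows.get(b, set())


def suggestions(follows, a, b):
    """Nodes that b follows but a does not (excluding a itself)."""
    return follows.get(b, set()) - follows.get(a, set()) - {a}


def follows_back(follows, a, b):
    """1 if b already follows a, else 0."""
    return int(a in follows.get(b, set()))


print(sorted(common_followees(follows, "U1", "U2")))  # ['U3', 'U4']
print(sorted(suggestions(follows, "U1", "U2")))       # ['U6']
print(sorted(suggestions(follows, "U2", "U1")))       # ['U5']
print(follows_back(follows, "U1", "U2"))              # 0
```

Each of these values can be computed for a candidate pair (Ui, Uj) and used as an input feature for the classifier.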
● No low-latency requirements
You can precompute the top 5 or 10 friends Ui should follow every night, every 2 days,
or weekly, store them in a hash-table-like structure, and show it
requirement
I could have 100 such users whom Ui could follow, each with a
probability value. I might have 5 slots or 10 slots, where I want to show the
recommending Uj to Ui.
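The idea of filling a fixed number of slots from precomputed probabilities can be sketched as follows. The probability values are made up for illustration, and `top_k` is a hypothetical helper; a plain dict stands in for the hash-table-like store.

```python
import heapq


def top_k(scores, k):
    """Return the k (user, probability) pairs with the highest probability."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up probabilities that Ui would follow each candidate Uj.
candidate_scores = {"U7": 0.91, "U3": 0.40, "U9": 0.75, "U2": 0.12, "U5": 0.66}

# Precomputed offline (for example nightly) and looked up when the page is shown.
recommendations = {"Ui": top_k(candidate_scores, k=3)}
print(recommendations["Ui"])  # [('U7', 0.91), ('U9', 0.75), ('U5', 0.66)]
```

Since the lookup at display time is a single dictionary access, the expensive scoring step does not have to meet any latency target.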
● Both precision and recall are important, hence F1 score is a good choice here.
Let’s say Ui = {U1, U2, U3, …, U10}, where these are the top 10 probable
As in most social networks, you don’t get to show all the users whom Ui may want
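For reference, the F1 score mentioned above is the harmonic mean of precision and recall. The sketch below computes it from scratch on made-up labels, where 1 means "edge exists" and 0 means "no edge"; it is an illustration, not the evaluation code used for the dataset.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(f1_score(y_true, y_pred))  # 0.75
```

F1 is high only when both precision and recall are high, which is why it suits a problem where both matter.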
● Here, we are using the networkx library, which is one of the most popular libraries
● We will create a new file without the headers “source_node” and
● Once we have created that, we will build the graph using the networkx library.
Here, we will provide the file without headers and ask the networkx library to
(nx.DiGraph()).
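One way to do this step is `nx.read_edgelist` with `create_using=nx.DiGraph()`. The sketch below is an assumption about how the graph could be built, not necessarily the original notebook code; it writes a tiny header-less CSV (three edges taken from the text) to a temporary file that stands in for the real header-less copy of train.csv.

```python
import os
import tempfile

import networkx as nx

# Three edges from the text (U1 follows these users); no header row.
rows = "1,690569\n1,315892\n1,189226\n"

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(rows)
    path = f.name  # stand-in for the header-less edge file

try:
    # Each line "source,destination" becomes one directed edge.
    g = nx.read_edgelist(path, delimiter=",", create_using=nx.DiGraph(), nodetype=int)
finally:
    os.remove(path)

print(g.number_of_nodes(), g.number_of_edges())  # 4 3
print(sorted(g.successors(1)))                   # users that node 1 follows
```

The header row has to be removed first because `read_edgelist` would otherwise treat the column names as a pair of node labels.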
It says:
- Type: DiGraph
out degree. So for every vertex, it measures the in-degree and out-degree and
finds the average. Our graph information says that on average there are 5
- We use the file without headers and state that we need the top 50 rows.
- We can easily draw it using the “nx.draw()” function and save the figure.
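- In the below subgraph information, the average in-degree and out-degree are both 0.75. The average is the
number of edges divided by the number of nodes, so it is less than 1 because this subgraph has fewer edges
than nodes; many nodes have no incoming edges or no outgoing edges.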
- In the above case, we can assume that those with no edges may be new users or
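The average in-degree and the average out-degree of a directed graph are always equal, because each edge adds exactly one to each total; both equal the number of edges divided by the number of nodes. A small sketch with a toy edge list (made up, not the actual 50 rows):

```python
from collections import Counter

# Toy directed edges (source, destination): 3 edges over 4 nodes.
edges = [(1, 2), (1, 3), (4, 3)]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)
nodes = {node for edge in edges for node in edge}

# A Counter returns 0 for nodes that never appear as a source or destination.
avg_in = sum(in_degree[n] for n in nodes) / len(nodes)
avg_out = sum(out_degree[n] for n in nodes) / len(nodes)

print(avg_in, avg_out, len(edges) / len(nodes))  # 0.75 0.75 0.75
```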
The code below tells us how many people there are in the graph.
dict in_degree → keys are unique nodes in the graph and corresponding values are number of
In Degree Analysis
— There are some users with more than 0 followers, but you can notice a sharp
line, which indicates that at least one user has more than 500
— As we move from the 0th user to the 1.86 millionth user, sorted by number of followers, I
— I’m zooming in on the graph from the 0th user to the 1.5 millionth user.
— So, in the below figure, we can observe that around 1.5 million users have
— A few people have many followers, but most people have only a few
followers.
— Here the mean, median (50th percentile), 25th percentile, and 75th percentile are very
— We can observe that 90% of the users have fewer than 12 followers.
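Statements such as "90% of the users have fewer than 12 followers" come from percentiles of the in-degree values. The sketch below builds a follower count per node from a made-up edge list and reads off percentiles with a simple nearest-rank rule. `nearest_rank_percentile` is a hypothetical helper; library functions such as `numpy.percentile` interpolate by default and may give slightly different values.

```python
import math
from collections import Counter


def nearest_rank_percentile(values, q):
    """Smallest value such that at least q% of the values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up directed edges (source follows destination).
edges = [(1, 5), (2, 5), (3, 5), (4, 5), (1, 2), (3, 2), (4, 1)]
nodes = sorted({node for edge in edges for node in edge})

# in_degree: key = node, value = number of followers (0 if never a destination).
in_degree = Counter(dst for _, dst in edges)
followers = [in_degree[n] for n in nodes]

for q in (50, 90, 100):
    print(q, nearest_rank_percentile(followers, q))
```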
● Let’s zoom in on the graph from the 0th user to the 1.5 millionth user.
— We can observe from the graph that around 1.5 million users follow fewer than
7 people.
● Let’s draw a box plot for analysis.
— Reading other parameters like mean, median, percentiles from the boxplot
● The number of persons who are not following anyone and also have no
followers is 0.
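A count of zero is what we would expect when the node set is derived from the edge list itself, since every node then appears in at least one edge. A minimal sketch of the check, with a made-up edge list and a hypothetical `count_isolated` helper:

```python
def count_isolated(users, edges):
    """Count users with no outgoing edges (following nobody) and no incoming edges (no followers)."""
    has_out = {src for src, _ in edges}
    has_in = {dst for _, dst in edges}
    return sum(1 for u in users if u not in has_out and u not in has_in)


edges = [(1, 2), (1, 3), (4, 3)]

# Users taken from the edge list: nobody can be isolated.
users_from_edges = {node for edge in edges for node in edge}
print(count_isolated(users_from_edges, edges))        # 0

# Adding a user (5) that appears in no edge gives one isolated user.
print(count_isolated(users_from_edges | {5}, edges))  # 1
```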
In Di-Graph:
— If all the nodes in a graph are connected by some path ignoring direction
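In networkx, connectivity that ignores edge direction is called weak connectivity. A minimal sketch on a made-up toy graph, using `nx.is_weakly_connected` and `nx.weakly_connected_components`:

```python
import networkx as nx

g = nx.DiGraph()
# 1 -> 2 <- 3 forms one group once direction is ignored; 4 -> 5 is a separate group.
g.add_edges_from([(1, 2), (3, 2), (4, 5)])

print(nx.is_weakly_connected(g))  # False: there are two separate groups

components = sorted(sorted(c) for c in nx.weakly_connected_components(g))
print(components)  # [[1, 2, 3], [4, 5]]
```

Note that there is no directed path from 1 to 3 here, yet they fall in the same component because direction is ignored.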