
unit 5 (1)

Uploaded by

Pandu snigdha
Copyright
© All Rights Reserved

UNIT 5

Problem definition

● Here, we represent each user of a social network as a dot called a node/vertex, and we connect people who are friends/followers/connections with lines called edges/links. In computer science and applied mathematics, this whole structure is called a graph.

● On Facebook, you must have come across friend recommendations; on Instagram, the same idea is known as a follower recommendation.

● Given the graph, we make use of the links that already exist. Here, u5 is a friend of u3 and u3 is a friend of u1. Is it sensible to assume that u5 could be a friend of u1?

In other words, is the u1-u5 link/edge a valid link or not?


A graph consists of nodes (vertices) and edges.

Mathematically, it is written as:

G = <V, E>

where V is the set of vertices and E is the set of edges. Each edge is a pair of vertices:

edge = (ui, uj)

Directed graphs are used for networks like Instagram and Twitter, where one person follows another and the other person may or may not follow back. So, there can be two edges between the same two vertices, one in each direction.

The dataset for our problem statement is also a directed graph.
● For example, Instagram has a follow and follow-back system.

Follow means you’re opting to follow the person. Follow back means someone followed you and you’re being offered the option to follow them back.
● A path is a sequence of valid edges that takes us from one vertex to another, and the length of a path is the number of edges on it.

Here, valid paths between a and h are:

a → f → c → b → h, where length = 4

or

a → f → c → d → e → b → h, where length = 6

● Why directed graphs?

When the cost of travelling along a path differs depending on the direction of travel, or when a relationship such as following holds in only one direction, we have to take direction into account. A directed graph is used in such situations.
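The definitions of a valid path and its length can be checked with a few lines of Python. This is only a sketch: the edge set below is assumed from the two example paths in the text, since the full figure is not available, and the helper names are made up for illustration.

```python
# Directed edges assumed from the two example paths only;
# the complete edge set of the original figure is not known.
path_short = ["a", "f", "c", "b", "h"]
path_long = ["a", "f", "c", "d", "e", "b", "h"]


def path_edges(path):
    """Return the consecutive (source, destination) pairs along a path."""
    return list(zip(path, path[1:]))


edges = set(path_edges(path_short)) | set(path_edges(path_long))


def is_valid_path(path, edges):
    """A path is valid if every consecutive pair is a directed edge."""
    return all(edge in edges for edge in path_edges(path))


def path_length(path):
    """Length of a path = number of edges = number of vertices - 1."""
    return len(path) - 1


print(is_valid_path(path_short, edges), path_length(path_short))  # True 4
print(is_valid_path(path_long, edges), path_length(path_long))    # True 6
print(is_valid_path(["h", "b", "c"], edges))                      # False
```

The last call fails because the edges are directed: b → h exists in this edge set, but h → b does not.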


Data Format & Limitations

train.csv

● The dataset was provided by Facebook, but it has since been removed. I’ll provide the data in my GitHub repository.


● In a nutshell, the dataset consists of a simple file with two columns.

● Each datapoint is a pair of vertices.

● In the dataset, we are given pairs of vertices/nodes, and each pair is a directed edge/link.

● Data columns ( 2 columns )

- Source node

- Destination node

● The first datapoint in the above diagram states that there is an edge from U1 to U690569. Here, U1 is the source node and U690569 is the destination node.

● Here, U1 follows U690569, U315892 and U189226, so all the edges originating from “1” are listed together. Similarly, there are edges from 2, 3, etc. Remember, these are directed edges.

● In total, we have 1862220 (1.86 million) vertices/nodes and 9437519 (about 9.44 million) edges/links in our directed graph.


● We could have had metadata such as the city a person lives in, the college they studied at, or whether two people studied at the same college. But, as far as this task is concerned, we only have the nodes and edges of a directed graph and have to work with them.

● So, this is a purely graph-based link prediction problem.

● But, as the network grows, people follow new people. A real-world network is very dynamic: today I may discover an old friend on Facebook and start following them. For this problem, Facebook gave us a snapshot of the graph at a single point in time. This is a constraint, as we cannot study how the graph evolves.
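The two-column edge-list format described in this section can be parsed with the standard library alone. The sketch below uses an in-memory stand-in for train.csv that contains only the three rows quoted above; the column names are the ones mentioned later in these notes, and `read_edges` and `following` are illustrative names, not part of the original code.

```python
import csv
import io
from collections import defaultdict

# Stand-in for train.csv: same two columns, only the three rows quoted in the notes.
sample_csv = """source_node,destination_node
1,690569
1,315892
1,189226
"""


def read_edges(handle):
    """Yield (source, destination) integer pairs, skipping the header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    for source, destination in reader:
        yield int(source), int(destination)


following = defaultdict(set)  # node -> set of nodes it follows
for source, destination in read_edges(io.StringIO(sample_csv)):
    following[source].add(destination)

print(sorted(following[1]))  # [189226, 315892, 690569]
```

For the real file, the same function could be given `open("train.csv", newline="")` instead of the `io.StringIO` object.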

Mapping to a Supervised Classification Problem

Let’s map our data to a supervised classification problem.

● Let’s say we have vertices Ui and Uj.

If Ui is following Uj, i.e. there is a directed edge from Ui to Uj:

→ Then we will label it as “1”.

If Ui is not following Uj, i.e. there is no edge from Ui to Uj:

→ Then we will label it as “0”.

So, we are mapping this to a binary classification task with “0” implying the absence of an edge and “1” implying the presence of a directed edge.

● Now, how do we featurize our data?

→ feature:

In the below figure, we are trying to predict whether the edge between U1 and U2 should be present or not.


U1 follows → {U3,U4,U5}

U2 follows → {U3,U4,U6}

Here, U1 and U2 each have a set of nodes they follow. As the two sets overlap heavily (U3 and U4 are common to both), there is a high chance that U1 and U2 have common interests.

So, there is a high chance that U1 may want to follow U6 and U2 may want to follow U5.

Similarly, there is a high chance that U1 could follow U2 and U2 could follow U1.

Also, if U1 is already following U2, there is a high chance that U2 will follow U1 back.
● So, these are called graph features.
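To make the labelling and the overlap idea concrete, here is a minimal sketch using the U1/U2 example above. The notes do not name a specific overlap measure, so the Jaccard index (size of the intersection divided by size of the union) is used here as one common choice; `label` and `followee_overlap` are hypothetical helper names.

```python
# Who follows whom, taken from the U1/U2 example in the text.
following = {
    "U1": {"U3", "U4", "U5"},
    "U2": {"U3", "U4", "U6"},
}


def label(source, destination, following):
    """Class label: 1 if the directed edge source -> destination exists, else 0."""
    return 1 if destination in following.get(source, set()) else 0


def followee_overlap(a, b, following):
    """Jaccard index of the two followee sets (an assumed choice of measure)."""
    set_a = following.get(a, set())
    set_b = following.get(b, set())
    union = set_a | set_b
    if not union:
        return 0.0  # neither user follows anyone
    return len(set_a & set_b) / len(union)


print(label("U1", "U3", following))             # 1
print(label("U1", "U2", following))             # 0
print(followee_overlap("U1", "U2", following))  # 0.5
```

U1 and U2 share 2 of the 4 distinct users they follow, which gives the overlap of 0.5. A value like this would be one column in the feature vector for the pair (U1, U2).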

Business Constraints & Metrics

● No low-latency requirements

You can precompute the top 5 or 10 users Ui should follow every night, every 2 days or weekly, store them in a hash-table-like structure and show them whenever Ui logs in. As we can precompute, there’s no strong latency requirement.

● We will recommend the highest-probability links to a user, so we need to predict the probability of each candidate link.

There could be 100 users whom Ui could follow, each with a probability value. With only 5 or 10 slots available, I want to show the 5 or 10 users Ui is most likely to follow.


● Ideally, we want high precision and high recall when we are recommending Uj to Ui.
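The precompute-and-store idea can be sketched as follows. The probabilities are invented for illustration (no model is trained here), and a plain Python dict plays the role of the hash-table-like structure.

```python
import heapq


def top_k(scores, k):
    """Return the k candidates with the highest predicted probability."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical model output: probability that Ui will follow each candidate.
predicted = {
    "Ui": {"U7": 0.91, "U2": 0.15, "U9": 0.64, "U4": 0.88, "U5": 0.30},
}

# Precomputed offline (e.g. nightly) and looked up when the user logs in.
recommendations = {user: top_k(scores, 3) for user, scores in predicted.items()}

print(recommendations["Ui"])  # ['U7', 'U4', 'U9']
```

Because the lookup at login time is a single dictionary access, the expensive scoring step can run on whatever schedule is convenient.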

Performance Metric for Supervised Learning

● Both precision and recall are important, hence the F1 score is a good choice here.

● We will also look at the confusion matrix.

● Another reasonable metric is Precision@topK.

Let’s say our K = 10.

Let’s say the recommendations for Ui are {U1, U2, U3, …, U10}, the 10 vertices Ui is most likely to want to follow.

Now, Precision@top10 is the fraction of these 10 recommendations that are actually correct.

Most social networks cannot show all the users whom Ui may want to follow, as space is limited. So, this metric is sensible.
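Precision@topK can be computed in a few lines. In this sketch the set of users Ui actually went on to follow is made up, and the function divides by K even if fewer than K recommendations are supplied, which is one common convention rather than the only one.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for candidate in ranked[:k] if candidate in relevant)
    return hits / k


# Top 10 recommendations for Ui, most probable first.
ranked = ["U1", "U2", "U3", "U4", "U5", "U6", "U7", "U8", "U9", "U10"]

# Hypothetical ground truth: users Ui actually followed afterwards.
actually_followed = {"U2", "U5", "U9", "U42"}

print(precision_at_k(ranked, actually_followed, 10))  # 0.3
```

Three of the ten recommendations are correct, so Precision@top10 is 0.3. U42 does not count against precision because it was never recommended; it would lower recall instead.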

Exploratory Data Analysis(EDA)- basic stats

● Here, we are using the networkx library, one of the most popular libraries for performing computation on a network or a graph.


● We will use networkx extensively.

● There is a file called train.csv which is the raw data.

● We will create a new file without the header row (“source_node” and “destination_node”) present in train.csv.

● Once we have created that, we will build the graph using the networkx library. Here, we provide the file without headers and ask networkx to read the edge list and build the graph.

When creating the graph, we ask it to build a directed graph (nx.DiGraph()).

● There is a very simple function “nx.info(graph)”, which gives us information

about the graph.

It says :

- Type : DiGraph

- Number of nodes : 1862220

- Number of edges : 9437519

- Average in degree : 5.0679

- Average out degree : 5.0679

● Outdegree: number of edges originating from U1 and going to other vertices.

Indegree: number of edges coming into U1 from other vertices.


● Here, in the graph information, we got the average in degree and the average out degree. For every vertex, the in degree and out degree are measured and then averaged. Our graph information says that, on average, about 5 edges go into a vertex and about 5 edges come out of a vertex.

● - A subgraph is basically a subset of the vertices together with the edges between those vertices. To visualize a small subgraph, we can read only part of the edge list with the function “nx.read_edgelist()”.

- We use the file without headers and state that we need the top 50 rows.

- We can easily draw it using the “nx.draw()” function and save the figure.

- In the below subgraph information, the average in degree and out degree are 0.75. Each average equals the number of edges divided by the number of nodes, so it is less than 1 because the 50 sampled edges touch more than 50 distinct nodes.

- Within this sample, a node with in degree 0 has no followers among the 50 rows. In the full graph, users with no followers may be new users, temporarily banned users, or users who simply have not been followed yet.


This is a small subgraph.

● So, networkx is a very important toolbox.
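The average in degree and out degree reported in this section can be reproduced without any library, which also shows why both averages are always equal: each is the number of edges divided by the number of nodes. The edge list below is a made-up toy sample, not the real data. Note that nx.info() was removed in networkx 3.0, so on newer versions the same numbers have to be obtained another way, for example from G.number_of_nodes() and G.number_of_edges().

```python
from collections import Counter

# Toy stand-in for "the first few rows" of the edge list (values are made up).
edges = [(1, 2), (1, 3), (4, 3), (5, 6)]

out_degree = Counter(source for source, _ in edges)
in_degree = Counter(destination for _, destination in edges)
nodes = {node for edge in edges for node in edge}

# A Counter returns 0 for nodes that never appear as a source/destination.
avg_in = sum(in_degree[n] for n in nodes) / len(nodes)
avg_out = sum(out_degree[n] for n in nodes) / len(nodes)

print(len(nodes), len(edges))                # 6 4
print(round(avg_in, 4), round(avg_out, 4))   # 0.6667 0.6667
print(min(in_degree[n] + out_degree[n] for n in nodes))  # 1
```

Here 4 edges touch 6 distinct nodes, so both averages are 4/6 and fall below 1 even though every node has at least one edge.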

Exploratory Data Analysis(EDA)- Follower and Following stats

The below code tells us how many people there are in the graph.

dict in_degree → keys are the unique nodes in the graph and the corresponding values are the number of incoming edges to that node.


dict out_degree → keys are the unique nodes in the graph and the corresponding values are the number of outgoing edges from that node.
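A library-free sketch of these two dictionaries is shown here; it is not the original code, and the four edges are invented. With networkx, the equivalent dictionaries can be built from a DiGraph G with dict(G.in_degree()) and dict(G.out_degree()).

```python
# Toy directed edge list (source follows destination); values are made up.
edges = [("U1", "U2"), ("U1", "U3"), ("U2", "U3"), ("U4", "U1")]

# Every node that appears in any edge becomes a key, starting at 0.
nodes = sorted({node for edge in edges for node in edge})
in_degree = dict.fromkeys(nodes, 0)    # number of followers
out_degree = dict.fromkeys(nodes, 0)   # number of people followed

for source, destination in edges:
    out_degree[source] += 1
    in_degree[destination] += 1

print(len(nodes))    # 4
print(in_degree)     # {'U1': 1, 'U2': 1, 'U3': 2, 'U4': 0}
print(out_degree)    # {'U1': 2, 'U2': 1, 'U3': 0, 'U4': 1}
```

Sorting the values of `in_degree` gives the curve of followers per user that the in degree analysis plots.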

In Degree Analysis

● How many followers does each person have?

The number of followers is exactly equal to the in degree.

In the below graph, we can observe that:

- There are many people with 0 followers.

- Some users have more than 0 followers, and the sharp line at the end of the curve shows that at least one user has more than 500 followers in our dataset.

- But, most of the users have fewer than 40 followers.


● As there is a scale difference in the above graph, let’s zoom in.

- As we move from the 0th user to the 1.86 millionth user, sorted by number of followers, we get this steep curve.

- I’m zooming in on the graph from the 0th user to the 1.5 millionth user.

- In the below figure, we can observe that around 1.5 million users have fewer than 7 followers.

● Let’s plot a boxplot.

- Here, there are a bunch of outliers.

- A few people have many followers, but most people have very few.

- The mean, median (50th percentile), 25th percentile and 75th percentile are all very small, so the plot is quite difficult to read.


● Let’s look at the 90th to 100th percentiles.

- We can observe that 90% of the users have fewer than 12 followers.

- 99% of the users have 40 or fewer followers.

- There is one user with 552 followers.

● Let’s zoom in from the 99th percentile to the 100th percentile.

- Only 0.1% of users have more than 112 followers.


● Of course, we can also plot the pdf, and we observe the same pattern.
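The percentile readings in this analysis can be reproduced on any list of follower counts. The sketch below uses the nearest-rank definition of a percentile and ten invented values; libraries such as NumPy interpolate between values by default, so their results can differ slightly on small samples.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Made-up follower counts (in degrees) for ten users, sorted ascending.
followers = sorted([0, 0, 1, 1, 2, 2, 3, 5, 8, 40])

for p in (50, 90, 100):
    print(p, percentile(followers, p))
# 50 2
# 90 8
# 100 40
```

Even in this tiny sample the pattern from the notes is visible: the median is small, and a single heavily followed user stretches the top of the distribution.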

Out Degree Analysis


● How many people is each person following?

- There is one person who is following more than 1500 people.

- Most people follow only a few people.

● Let’s zoom in on the graph from the 0th user to the 1.5 millionth user.

- We can observe from the graph that around 1.5 million users follow fewer than 7 people.
● Let’s draw a box plot for analysis.

- There are a bunch of outliers.

- Reading other statistics like the mean, median and percentiles from the box plot is difficult, as they are very small.


● There is one person who is following 1566 people, and 99% of people are following fewer than 40 people.

● 99.9% of people are following fewer than 123 people.

● pdf of out degree analysis


● No of persons not following anyone is 274512 (14.74%).

● No of persons having zero followers is 188043 (10.10%).

● No of persons who are neither following anyone nor have any followers is 0.
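These three counts come down to simple set operations on the edge list. The sketch below uses four invented edges; it also shows why the last count is 0 for any graph built purely from an edge list, since every node in it appears in at least one edge.

```python
# Toy directed edge list (source follows destination); values are made up.
edges = [("U1", "U2"), ("U1", "U3"), ("U2", "U3"), ("U4", "U1")]

nodes = {node for edge in edges for node in edge}
follows_someone = {source for source, _ in edges}          # out degree > 0
has_followers = {destination for _, destination in edges}  # in degree > 0

following_nobody = nodes - follows_someone   # out degree == 0
no_followers = nodes - has_followers         # in degree == 0
isolated = following_nobody & no_followers   # neither followers nor following

print(sorted(following_nobody), 100 * len(following_nobody) / len(nodes))  # ['U3'] 25.0
print(sorted(no_followers), 100 * len(no_followers) / len(nodes))          # ['U4'] 25.0
print(len(isolated))                                                       # 0
```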

Both ( followers + following )

In Di-Graph:

degree(Ui) = In_degree(Ui) + Out_degree(Ui)


● Min of no of (followers + following) is 1.

334291 people have this minimum no of (followers + following).

● Max of no of (followers + following) is 1579.

Only 1 person has this maximum no of (followers + following).

● No of persons having (followers + following) less than 10 is 1320326.

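● No of weakly connected components: 45558

No of weakly connected components with 2 nodes: 32195

- A weakly connected component is a maximal set of nodes in which every pair of nodes is connected by some path when edge directions are ignored. If all the nodes of a graph are connected in this way, the entire graph is a single weakly connected component.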
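Weakly connected components can be found by treating every edge as undirected and merging the two endpoints. The sketch below does this with a small union-find structure on invented edges; it is an illustration of the definition, not the original code. With networkx, nx.weakly_connected_components(G) returns the same groups for a DiGraph G.

```python
def weakly_connected_components(edges):
    """Group nodes that are connected when edge direction is ignored (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for source, destination in edges:
        root_s, root_d = find(source), find(destination)
        if root_s != root_d:
            parent[root_s] = root_d  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Toy directed edges; note that c -> b still joins c to a's component.
edges = [("a", "b"), ("c", "b"), ("d", "e"), ("f", "g"), ("g", "h")]
components = weakly_connected_components(edges)

print(len(components))                           # 3
print(sorted(len(c) for c in components))        # [2, 3, 3]
print(sum(1 for c in components if len(c) == 2)) # 1
```

The last line counts components with exactly 2 nodes, the same statistic reported above for the full graph.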
