unit 5 (1)

Uploaded by

Pandu snigdha

UNIT 5

Problem definition

● Here, we represent each user of a social network as a dot called a node (or vertex), and we connect people who are friends, followers, or connections with edges (also called links). In computer science and applied mathematics, this whole structure is called a graph.

● On Facebook, you have probably encountered friend recommendations; the equivalent on Instagram is follower recommendations.

● Given the graph, we make use of the links that already exist. Here, u5 is a friend of u3 and u3 is a friend of u1. Is it sensible to assume that u5 could be a friend of u1?

In other words, is the u1-u5 link/edge a valid link or not?

A graph consists of nodes (vertices) and edges.

Mathematically, it is written as:

G = <V, E>

where V is the set of vertices and E is the set of edges. Each edge is a pair of vertices:

edge = (ui, uj)

Directed graphs are typically used for Instagram and Twitter, where one person follows another person and the other person may also follow back. So there can be two edges between the same two vertices, one in each direction.

The dataset for our problem statement is also a directed graph.
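
As a minimal sketch of this idea, a directed graph can be held as a set of ordered pairs, so that a follow and a follow-back are two distinct edges. The user names and the helper functions `follows` and `is_mutual` are hypothetical and only serve as an illustration.

```python
# Hypothetical users; a directed edge (u, v) means "u follows v".
edges = {
    ("u1", "u2"),  # u1 follows u2
    ("u2", "u1"),  # u2 follows u1 back: a second, distinct edge
    ("u3", "u1"),  # u3 follows u1, with no follow-back
}


def follows(edges, u, v):
    """Return True if the directed edge (u, v) is present."""
    return (u, v) in edges


def is_mutual(edges, u, v):
    """Return True if u and v follow each other (edges in both directions)."""
    return follows(edges, u, v) and follows(edges, v, u)


print(is_mutual(edges, "u1", "u2"))  # True: edges exist in both directions
print(is_mutual(edges, "u3", "u1"))  # False: only u3 -> u1 exists
```
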
● For example, Instagram has a follow and follow-back system.

Follow means you are opting to follow a person. Follow back means someone has followed you and you are being offered the option to follow them in return.
● A path is a sequence of valid edges leading from one vertex to another vertex, and the length of a path is the number of edges along it.

Here, two valid paths between a and h are:

a → f → c → b → h, where length = 4

a → f → c → d → e → b → h, where length = 6

● Why directed graphs?

When we travel across a path and the cost of moving along that path changes with some factor, such as the direction of travel, we have to take direction into account. Directed graphs are used in such situations.
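
The path definition given earlier in this section can be sketched in a few lines. The edge set below is an assumption reconstructed from the two example paths (the original figure is not available here), and `path_length` is a hypothetical helper name.

```python
# Directed edges assumed from the two example paths between a and h.
edges = {
    ("a", "f"), ("f", "c"), ("c", "b"), ("b", "h"),
    ("c", "d"), ("d", "e"), ("e", "b"),
}


def path_length(edges, path):
    """Return the number of edges in `path`, or None if a step is not an edge."""
    steps = list(zip(path, path[1:]))  # consecutive (from, to) pairs
    if not all(step in edges for step in steps):
        return None
    return len(steps)


print(path_length(edges, ["a", "f", "c", "b", "h"]))            # 4
print(path_length(edges, ["a", "f", "c", "d", "e", "b", "h"]))  # 6
print(path_length(edges, ["h", "b"]))  # None: the edge h -> b does not exist
```

The last call shows why direction matters: b → h is an edge, but h → b is not, so the reversed walk is not a valid path.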

Data Format & Limitations

train.csv

● The dataset was originally provided by Facebook but is no longer available from the original source. I’ll provide the data in my GitHub repository.

● In a nutshell, the dataset consists of a single file with two columns.

● Each data point is a pair of vertices.

● Each pair of vertices/nodes in the dataset represents one directed edge/link.

● Data columns ( 2 columns )

- Source node

- Destination node

● The first data point in the above diagram states that there is an edge from U1 to U690569. Here, U1 is the source node and U690569 is the destination node.

● Here, U1 follows U690569, U315892, and U189226, so all the edges originating from node 1 are listed together. Similarly, there are edges from nodes 2, 3, etc. Remember, these are directed edges.

● In total, we have 1862220 (1.86 million) vertices/nodes and 9437520 (9.43 million) edges/links in our directed graph.

● We could have had metadata such as the city a person lives in, the college they attended, whether two people studied at the same college, or other additional information. But as far as this task is concerned, we only have the nodes and edges of a directed graph and have to work with those.

● So, this is a purely graph-based link prediction problem.

● But as the network grows, people follow new people. The network is very dynamic in the real world: today I may discover an old friend on Facebook and start following them. As far as this problem is concerned, Facebook gave us a snapshot of the graph at a single point in time. This is a constraint, because we cannot observe how the graph evolves.
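
A minimal sketch of reading this two-column format with the standard library is shown below. The three sample rows reuse the edges quoted above; the header names `source_node` and `destination_node` are the ones mentioned for train.csv, while the helper `read_edges` and the in-memory sample are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-in for train.csv, using the three edges quoted in the text.
SAMPLE = """source_node,destination_node
1,690569
1,315892
1,189226
"""


def read_edges(handle):
    """Parse (source, destination) integer pairs from a CSV file object."""
    reader = csv.DictReader(handle)
    return [(int(row["source_node"]), int(row["destination_node"]))
            for row in reader]


edges = read_edges(io.StringIO(SAMPLE))
nodes = {node for edge in edges for node in edge}

print(edges[0])                 # (1, 690569): node 1 follows node 690569
print(len(nodes), len(edges))   # 4 nodes, 3 directed edges
```

For the real file, `open("train.csv", newline="")` would replace the `io.StringIO` object.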

Mapping to a Supervised Classification Problem

Let’s map our data to a supervised classification problem.

● Let’s say we have vertices Ui and Uj.

If Ui is following Uj, i.e. there is a directed edge from Ui to Uj:

→ Then we label the pair as “1”.

If Ui is not following Uj, i.e. there is no edge from Ui to Uj:

→ Then we label the pair as “0”.

So, we are mapping this to a binary classification task, with “0” implying the absence of an edge and “1” implying the presence of a directed edge.

● Now, how do we featurize our data?

→ Feature:

In the figure below, we are trying to predict whether the edge between U1 and U2 should be present or not.

U1 follows → {U3,U4,U5}

U2 follows → {U3,U4,U6}

Here, U1 and U2 each have a set of nodes they follow. Because the two sets overlap heavily (they share U3 and U4), there is a high chance that U1 and U2 have common interests.

So, there is a high chance that U1 may want to follow U6 and U2 may want to follow U5.

Similarly, there is a high chance that U1 could follow U2 and U2 could follow U1.

Likewise, if U1 is already following U2, there is a high chance that U2 will follow U1 back.
● So, these are called graph features.
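
The labelling rule and the overlap idea above can be sketched as follows, using the U1 and U2 sets from the example. The Jaccard ratio is one common way to turn "how much two sets overlap" into a number; it is an assumption here, not necessarily the feature the author uses, and all function names are hypothetical.

```python
# Who follows whom, taken from the example above.
following = {
    "U1": {"U3", "U4", "U5"},
    "U2": {"U3", "U4", "U6"},
}


def label(following, ui, uj):
    """Class label: 1 if the directed edge ui -> uj exists, else 0."""
    return 1 if uj in following.get(ui, set()) else 0


def common_followees(following, ui, uj):
    """Nodes that both ui and uj follow."""
    return following.get(ui, set()) & following.get(uj, set())


def jaccard_followees(following, ui, uj):
    """Overlap of the two followee sets: |A ∩ B| / |A ∪ B| (0.0 if both empty)."""
    a = following.get(ui, set())
    b = following.get(uj, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def follows_back(following, ui, uj):
    """1 if uj already follows ui, which hints that ui may follow uj."""
    return 1 if ui in following.get(uj, set()) else 0


print(label(following, "U1", "U3"))               # 1: U1 follows U3
print(sorted(common_followees(following, "U1", "U2")))  # ['U3', 'U4']
print(jaccard_followees(following, "U1", "U2"))   # 0.5: 2 shared out of 4
```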

Business Constraints & Metrics

● No low-latency requirements

You can precompute the top 5 or 10 friends Ui should follow once every two days, weekly, or every night, store them in a hash-table-like structure, and show them whenever Ui logs in. Because we can precompute, there is no strong latency requirement.

● We will recommend the highest-probability links to a user, so we need to predict the probability of each candidate link.

There could be 100 users whom Ui could follow, each with a probability value. I might have only 5 or 10 slots in which to show the users Ui is most likely to want to follow.

● Ideally, we want high precision and high recall when we are recommending Uj to Ui.
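
The precompute-and-store idea above might look like the sketch below. The probabilities are made-up stand-ins for a model's output, and `top_k` and `recommendations` are hypothetical names; a Python dict plays the role of the hash-table-like structure.

```python
import heapq


def top_k(scores, k):
    """Return the k candidates with the highest predicted probability."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [user for user, _ in best]


# Hypothetical model output: probability that "Ui" wants to follow each user.
predicted = {
    "Ui": {"U7": 0.91, "U2": 0.15, "U9": 0.64, "U4": 0.80},
}

# Offline batch job (e.g. nightly): fill the lookup table once.
recommendations = {user: top_k(scores, k=2) for user, scores in predicted.items()}

# At login time only a dictionary lookup is needed, so latency is not an issue.
print(recommendations["Ui"])  # ['U7', 'U4']
```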

Performance Metric for Supervised Learning

● Both precision and recall are important, hence the F1 score is a good choice here.

● We will also look at the confusion matrix.

● Another reasonable metric is Precision@topK.

Let’s say our K = 10.

Let’s say the recommendations for Ui are {U1, U2, U3, …, U10}, the top 10 most probable vertices or friends Ui may want to follow.

Precision@top10 is then the fraction of these 10 recommendations that are actually correct.

In most social networks we don’t get to show all the users whom Ui may want to follow, as we have limited space. So, this metric is sensible.
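
As a rough illustration of these metrics, the sketch below computes Precision@K for one user and an F1 score from raw counts. The recommended list, the set of users actually followed, and the counts passed to `f1_score` are invented for the example.

```python
def precision_at_k(recommended, actual, k):
    """Fraction of the top-k recommendations that the user really follows."""
    if k <= 0:
        return 0.0
    hits = sum(1 for user in recommended[:k] if user in actual)
    return hits / k


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical top-10 list for Ui and the users Ui ended up following.
recommended = ["U1", "U2", "U3", "U4", "U5", "U6", "U7", "U8", "U9", "U10"]
actual = {"U2", "U5", "U9", "U42"}

print(precision_at_k(recommended, actual, k=10))  # 0.3: 3 of 10 are correct
print(round(f1_score(tp=3, fp=7, fn=1), 4))       # 0.4286
```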

Exploratory Data Analysis(EDA)- basic stats

● Here, we are using the networkx library, one of the most popular libraries for performing computations on a network or graph.

● We will use networkx extensively.

● There is a file called train.csv which is the raw data.

● We will create a new file without the headers “source_node” and “destination_node” present in train.csv.

● Once we have created that, we build the graph using the networkx library. We provide the file without headers and ask networkx to read the edge list and build the graph.

When creating the graph, we ask for a directed graph (nx.DiGraph()).

● There is a very simple function, “nx.info(graph)” (available in networkx releases before 3.0), which gives us summary information about the graph.

It says :

- Type : DiGraph

- Number of nodes : 1862220

- Number of edges : 9437519

- Average in degree : 5.0679

- Average out degree : 5.0679

● Out-degree: the number of edges originating from a vertex, e.g. U1, and going to other vertices.

In-degree: the number of edges coming into U1 from other vertices.

● The graph information above reports the average in-degree and the average out-degree: the in-degree and out-degree of every vertex are measured and then averaged. It tells us that, on average, about 5 edges go into a vertex and about 5 edges come out of a vertex.

● - A subgraph is a subset of the vertices together with the edges between those vertices. If we want to visualize a subgraph, we can easily create one using the function “nx.read_edgelist()”.

- We use the file without headers and specify that we only need the top 50 rows.

- We can easily draw it using the “nx.draw()” function and save the figure.

- In the subgraph information below, the average in-degree and out-degree are 0.75. The value is less than 1 because the 50 sampled edges touch more than 50 distinct nodes, so there are fewer edges than nodes.

- Nodes with no incoming or no outgoing edges in this sample may simply have their other edges outside the first 50 rows. In the full graph, such nodes may be new users, temporarily banned users, or users having zero followers or friends.

This is a small subgraph.

● So, networkx is a very important toolbox.
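
The averages reported above follow from a simple identity: every directed edge adds exactly one to some node's out-degree and one to some node's in-degree, so both averages equal (number of edges) / (number of nodes). The sketch below checks this against the reported figures without using networkx; `average_degree` and the tiny sample are assumptions for illustration.

```python
def average_degree(edges):
    """Average in-degree (equal to average out-degree) of a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) / len(nodes) if nodes else 0.0


# Full graph: edges / nodes reproduces the value printed by the graph summary.
print(round(9437519 / 1862220, 4))  # 5.0679

# A small sample can have more nodes than edges, giving an average below 1.
sample = [(1, 2), (1, 3), (4, 5)]   # 3 edges touching 5 distinct nodes
print(average_degree(sample))       # 0.6
```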

Exploratory Data Analysis(EDA)- Follower and Following stats

The code below tells us how many people there are in the graph.

dict in_degree → keys are the unique nodes in the graph and the corresponding values are the number of incoming edges to each node.

dict out_degree → keys are the unique nodes in the graph and the corresponding values are the number of outgoing edges from each node.
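
A library-free sketch of the two dictionaries described above is given here; it is not the author's original code. The four-edge list is hypothetical, and in networkx the same numbers would come from the graph's `in_degree` and `out_degree` views.

```python
from collections import Counter

# Hypothetical directed edges (source follows destination).
edges = [(1, 2), (1, 3), (2, 3), (4, 1)]

nodes = {node for edge in edges for node in edge}
in_counts = Counter(dst for _, dst in edges)    # incoming edges = followers
out_counts = Counter(src for src, _ in edges)   # outgoing edges = following

# Include every node, so users with zero followers/following get an explicit 0.
in_degree = {node: in_counts[node] for node in nodes}
out_degree = {node: out_counts[node] for node in nodes}

print(len(nodes))       # 4 unique people in the graph
print(in_degree[3])     # 2: node 3 is followed by nodes 1 and 2
print(out_degree[3])    # 0: node 3 follows nobody
```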

In Degree Analysis

● How many followers does each person have?

The number of followers is exactly equal to the in-degree.

In the graph below, we can observe that:

- There are many people with 0 followers.

- There are some users with more than 0 followers, and a sharp spike shows that at least one user has more than 500 followers in our dataset.

- But most of the users have fewer than 40 followers.

● Because of the scale difference in the above graph, let’s zoom in.

- As we move from the 0th user to the 1.86 millionth user, sorted by number of followers, we get this steep curve.

- So we zoom in on the graph from the 0th user to the 1.5 millionth user.

- In the figure below, we can observe that around 1.5 million users have fewer than 7 followers.

● Let’s plot a boxplot.

- Here, there are a bunch of outliers.

- A few people have many followers, but most people have very few.

- The mean, median (50th percentile), 25th percentile, and 75th percentile are all very small, so the plot is quite difficult to read.

● Let’s look at the 90th to 100th percentiles.

- We can observe that 90% of the users have fewer than 12 followers.

- 99% of the users have 40 or fewer followers.

- There is one user with 552 followers.

● Let’s zoom in from the 99th percentile to the 100th percentile.

- Only 0.1% of users have more than 112 followers.

● Of course, we can also plot the PDF, and we observe the same pattern.
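
The percentile readings above can be reproduced with a small helper. The sketch uses the nearest-rank definition, which is an assumption (libraries such as NumPy interpolate by default and may give slightly different values), and the follower counts are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical follower counts: mostly small, with one heavy outlier.
followers = [0, 0, 1, 1, 2, 2, 3, 4, 7, 40]

for p in (50, 90, 100):
    print(p, percentile(followers, p))
# 50 2
# 90 7
# 100 40
```

Even in this toy data the 100th percentile sits far above the 90th, which is the same skew the real in-degree distribution shows.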

Out Degree Analysis

● How many people is each person following?

- There is one person who is following more than 1500 people.

- Most people follow only a few people.

● Let’s zoom in on the graph from the 0th user to the 1.5 millionth user.

- We can observe from the graph that around 1.5 million users follow fewer than 7 people.

● Let’s draw a boxplot for analysis.

- There are a bunch of outliers.

- Reading other statistics such as the mean, median, and percentiles from the boxplot is difficult, as they are very small.

● There is one person who is following 1566 people, and 99% of people are following fewer than 40 people.

● 99.9% of people are following fewer than 123 people.

● PDF of out-degree

● The number of persons who are not following anyone is 274512 (14.74% of all users).

● The number of persons having zero followers is 188043 (10.10% of all users).

● The number of persons who are neither following anyone nor followed by anyone is 0.
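
These three counts can be derived directly from the edge list, as in the sketch below. The edge list and the function name `zero_degree_stats` are hypothetical. Note that when nodes are collected only from an edge list, every node appears in at least one edge, which is consistent with the last count being 0.

```python
def zero_degree_stats(edges):
    """Count users with no outgoing edges, no incoming edges, or neither."""
    nodes = {node for edge in edges for node in edge}
    sources = {src for src, _ in edges}      # users who follow someone
    targets = {dst for _, dst in edges}      # users who have a follower
    no_following = nodes - sources
    no_followers = nodes - targets
    total = len(nodes)

    def pct(count):
        return 100 * count / total if total else 0.0

    return {
        "not_following_anyone": (len(no_following), pct(len(no_following))),
        "zero_followers": (len(no_followers), pct(len(no_followers))),
        "neither": (len(no_following & no_followers),
                    pct(len(no_following & no_followers))),
    }


stats = zero_degree_stats([(1, 2), (1, 3), (2, 3), (4, 1)])
print(stats["not_following_anyone"])  # (1, 25.0): node 3 follows nobody
print(stats["zero_followers"])        # (1, 25.0): nobody follows node 4
print(stats["neither"])               # (0, 0.0)
```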

Both ( followers + following )

In Di-Graph:

degree(Ui) = In_degree(Ui) + Out_degree(Ui)
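
A small sketch of this formula: each edge adds one to the total degree of its source (out-degree) and one to the total degree of its destination (in-degree). The edge list is hypothetical.

```python
from collections import Counter

edges = [(1, 2), (1, 3), (2, 3), (4, 1)]  # hypothetical directed edges

degree = Counter()
for src, dst in edges:
    degree[src] += 1  # contributes to out_degree(src)
    degree[dst] += 1  # contributes to in_degree(dst)

min_degree = min(degree.values())
max_degree = max(degree.values())
nodes_at_min = sum(1 for d in degree.values() if d == min_degree)
nodes_at_max = sum(1 for d in degree.values() if d == max_degree)

print(dict(degree))              # {1: 3, 2: 2, 3: 2, 4: 1}
print(min_degree, nodes_at_min)  # 1 1
print(max_degree, nodes_at_max)  # 3 1
```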

● The minimum number of (followers + following) is 1.

334291 people have this minimum number of (followers + following).

● The maximum number of (followers + following) is 1579.

Only 1 person has this maximum number of (followers + following).

● The number of persons having fewer than 10 (followers + following) is 1320326.

● Number of weakly connected components: 45558

Weakly connected components with 2 nodes: 32195

- A weakly connected component is a maximal set of nodes in which every pair of nodes is connected by some path when edge directions are ignored. If all the nodes in a graph are connected in this way, the entire graph is a single weakly connected component.
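
To make the definition concrete, here is a minimal sketch that finds weakly connected components with a breadth-first search over the graph with edge directions ignored. The function name and the edge list are hypothetical; networkx provides `nx.weakly_connected_components` for the same purpose on a DiGraph.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Group nodes that are connected when edge direction is ignored."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)  # add the reverse link to ignore direction

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        component = set()
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


edges = [(1, 2), (3, 2), (4, 5), (6, 7), (7, 6)]
components = weakly_connected_components(edges)
print(len(components))                            # 3
print(sum(1 for c in components if len(c) == 2))  # 2 components with 2 nodes
```

Nodes 1 and 3 have no directed path between them, yet they land in the same component because both are linked to node 2 once direction is ignored.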
