Performance Comparison of Graph Database and Relational Database
net/publication/370751317
San Jose State University
All content following this page was uploaded by Cajetan Rodrigues on 13 May 2023.
Abstract—We aim to present a comprehensive comparison between a graph database, Neo4j, and a relational database, MySQL, focusing on their performance based on different types of queries. Graph databases utilize graph structures, nodes, edges, and properties to represent data, while relational databases employ tables and relationships between them. This study aims to evaluate the performance of Neo4j and MySQL in terms of data query execution time by examining representative queries from four categories: selection/search, recursion, aggregation, and pattern matching. Real-world data from CareerVillage was used for the experiment. The results show that Neo4j outperforms MySQL in most cases, particularly in pattern matching and recursive queries. However, MySQL has advantages in terms of data consistency and transactional support.

Keywords—Databases, Neo4j, NoSQL, Graph Databases, Relational Databases

I. INTRODUCTION

Graph databases have revolutionized the way data is stored and processed. By representing data as nodes and edges in a graph, they enable us to uncover insights that would be difficult to detect, or would require complex and expensive join operations, with traditional relational databases. They allow us to efficiently navigate through vast and intricate networks of data, making them invaluable tools for applications ranging from e-commerce to scientific research. This research is motivated by comparing MySQL with graph databases and suggesting which database is suited to which scenarios.

Graph databases are particularly useful for applications that deal with complex and interconnected data, such as social networks, recommendation engines, and fraud detection systems. They provide a more natural and intuitive way to represent data than relational databases, especially when dealing with unstructured or semi-structured data. Graph databases can also handle large amounts of data and scale horizontally, making them suitable for applications with a high volume of data.

One of the main reasons for the popularity of graph databases is their ability to perform complex queries quickly and efficiently. Neo4j, for example, provides a query language known as Cypher that allows users to search for patterns and relationships within the data. This makes it easy to perform tasks such as pathfinding, recommendation generation, and fraud detection.

Relational databases are one of the most widely used types of databases, popular for their ability to store and manage large amounts of data in an organized and efficient manner. They represent data in a tabular form, with each table consisting of rows and columns, where each row represents a record and each column represents a specific attribute of that record. Relational databases are based on the principles of relational algebra and are designed to enforce data integrity and consistency.

One of the main reasons for the popularity of relational databases is their ability to handle complex data relationships. By organizing data into tables and establishing relationships between them, relational databases make it easier to perform complex queries and analysis. They also provide a standardized language for querying and manipulating data.

We aim to determine the performance difference between a traditional RDBMS and a graph-based NoSQL database. We execute a comprehensive comparison by using a dataset and querying the same data in both schemas across different categories. To facilitate a
comparison between a graph-based NoSQL database and a traditional relational database management system (RDBMS), we will employ Neo4j as the representative for the graph-based NoSQL database category and MySQL as the exemplar for the traditional RDBMS category. We compare the performance of various operations like search, pattern matching, recursion, and aggregation. Neo4j is touted to be one of the best graph-based systems in the industry, well known for its execution speed and the benefits that come with having a graph structure with nodes and edges to model the data effectively. MySQL is a very popular and widely used RDBMS. The purpose is to show which performs better and how significant a difference the choice of database makes.

This paper is structured as follows: In Section 2, we conduct a survey of previous studies that are relevant to performance comparison between MySQL and Neo4j. Section 3 outlines the dataset used to assess the performance of these two database systems, specifically comparing graph databases (Neo4j) with relational databases (MySQL). Section 4 details the Neo4j test environment. Section 5 presents the SQL test environment. Section 6 presents the implementation and comparison of the SQL and Neo4j queries. Section 7 outlines the performance strategies used and the comparative analysis. Section 8 showcases the performance results. Finally, Section 9 concludes the paper and provides a discussion of the findings.

II. RELATED WORK

Various studies have compared the performance of MySQL and Neo4j graph databases for different types of queries and datasets. Some studies have found that Neo4j performs better than MySQL in terms of query speed, while others have found that MySQL is faster and more memory efficient. The types of queries tested include selection, aggregation, recursion, and pattern matching. The studies also explore the use of graph databases in various domains, such as social network analysis, web-based applications, IoT data management, and Customer Relationship Management (CRM) systems. Overall, the studies suggest that the performance of graph databases is better than that of conventional databases for certain types of queries and datasets.

In [1] and [2], the authors draw comparisons between graph-based and relational databases and highlight the advantages and disadvantages of both. In [2], the authors highlight the use of graphs for tracking relationships and origins from the perspective of data provenance and discuss how relational databases excel at managing structured data and enforcing integrity constraints.

[3] offers a broad and detailed view of various graph database models, such as property graphs, RDF graphs, and hypergraphs, among others. The authors also describe the strengths and weaknesses of different approaches for managing graph data.

[4] concludes that while relational databases excel at managing structured data and enforcing data integrity constraints, graph databases are more effective at handling unstructured and semi-structured data with complex relationships.

In [5], the authors evaluate the performance and scalability of both database models using various metrics, including response time, throughput, and CPU usage. The authors found that the non-relational database model performed better in terms of response time and scalability, while the relational database model performed better in terms of data consistency and availability. For further analysis of graph databases and how to query them, [6] offers a comprehensive overview of query languages for graph databases, providing readers with a solid foundation for understanding how to query and manipulate graph data in different contexts. The authors describe the features of modern graph query languages, such as Cypher, Gremlin, and SPARQL, and provide examples of how to use these languages to perform different types of queries. [7-10] take a deeper look into the performance of graph databases on different datasets and focus on their performance on aggregation and recursive queries.

III. DATASET

A. Collection of Dataset

The CareerVillage dataset provides a valuable resource for researchers interested in studying career guidance and counseling. In this research paper, we will use the dataset to compare the performance of MySQL and Neo4j, two popular database management systems. Specifically, we will analyze how these systems perform when querying and processing the dataset's information, which includes questions asked by students, answers provided by professionals, and demographic data of both students and professionals. Our evaluation criteria will focus on four query groups: selection, recursion, aggregation, and pattern matching. These query groups represent common types of queries that are used to analyze large, complex datasets. By evaluating the performance of MySQL and Neo4j on these query groups, we hope to gain insights into the strengths and
limitations of each database management system. Ultimately, our research aims to provide guidance to researchers and practitioners in selecting the most appropriate database management system for analyzing similar datasets.

B. Dataset representation

IV. NEO4J TEST ENVIRONMENT
1. Begin by invoking the bash script initialize_neo4j.sh.
2. In the bash shell, execute the script called execute.sh.
3. The execute.sh script executes the first Python script called load_nodes.py, which reads the data from the CSV file and creates the corresponding nodes in the Neo4j database.
4. Once the load_nodes.py script completes its execution, the execute.sh script executes the second Python script called create_relationships.py.
5. The create_relationships.py script reads the data from the CSV file and creates the relationships between the nodes created in the previous step.
6. End the script.

Thus, the bash script executes both Python scripts sequentially, where the first script loads the nodes into the Neo4j database, and the second script creates relationships between the nodes. This approach allowed us to automate the entire process of loading data into the Neo4j database and creating relationships between the nodes using a single command.

After loading the dataset into the Neo4j database, we used the Cypher language to write data profiling queries to visualize the created nodes and relationships.

Cypher is the query language used in Neo4j, a popular graph database management system. It is a declarative, pattern-matching language that is specifically designed for querying and manipulating graph data. With Cypher, users can express complex queries in a concise and readable syntax that is easy to understand and maintain.

Cypher provides a range of expressive syntax for filtering, aggregating, and transforming data stored in a graph. It also supports several advanced features such as pattern matching, traversals, path finding, and spatial operations. Cypher queries are constructed using ASCII art-like patterns, which makes them easy to read and understand.

Overall, Cypher is a powerful and flexible language that enables users to query and manipulate graph data in Neo4j quickly and easily. It is a key component of the Neo4j ecosystem, and is widely used by developers, data analysts, and data scientists to build graph-based applications and solve complex data problems.

The following are some of the data profiling queries we used to visualize the created nodes and relationships.

1. Count all nodes.

The following query counts all the nodes loaded on the Neo4j DBMS.

Fig. 2. Query to count all nodes

2. Count all relationships.

The following query counts all the possible relationships between a pair of nodes.

3. List count of each node

The following query counts the number of nodes present in each entity or label.

Fig. 4. Query to list count of each entity

TABLE I
DISPLAYING THE COUNT OF EACH ENTITY

Node                  Count
Matches               4316275
Emails                1850101
Tag Users             136663
Tag Questions         76553
Answers               51123
Students              30971
Professionals         28152
Questions             23931
Tags                  16269
Comments              14966
School Memberships    5638
Group Memberships     1038
Groups                49
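The figures that held the three profiling queries are not reproduced in the text. As a rough sketch only, the same counts can be obtained with standard Cypher idioms such as the ones below; these are assumptions about what such queries typically look like, not the authors' exact statements. `run_query` is a hypothetical caller-supplied function (for example, a thin wrapper around a Neo4j driver session) and is not part of the original setup.

```python
# Hypothetical reconstruction: the original query figures are not available,
# so these are common Cypher idioms, not the authors' exact queries.
PROFILING_QUERIES = {
    # 1. Count all nodes.
    "count_nodes": "MATCH (n) RETURN count(n) AS node_count",
    # 2. Count all relationships.
    "count_relationships": "MATCH ()-[r]->() RETURN count(r) AS relationship_count",
    # 3. Count nodes per label (entity).
    "count_per_label": (
        "MATCH (n) RETURN labels(n) AS label, count(*) AS node_count "
        "ORDER BY node_count DESC"
    ),
}


def run_profiling(run_query):
    """Run every profiling query through a caller-supplied function.

    run_query is assumed to take a Cypher string and return its result;
    how it talks to Neo4j is left to the caller.
    """
    return {name: run_query(cypher) for name, cypher in PROFILING_QUERIES.items()}
```

Keeping the queries in one mapping makes it easy to rerun the same profiling step after every data load and compare the counts against Table I.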
4. Visualize all nodes and relationships.

The following is the representation of the nodes and the relationships between the nodes.

• Installed instance of MySQL server (MySQL server community edition 8.0.32 was used in our implementation)
• Installation of Python (Python version 3.9.13 was used in our implementation)
• Python library: mysql-connector-python
Fig. 8. Entity-Relationship diagram
1. Selection
SQL
SELECT p.* FROM professionals p JOIN
tag_users tu ON p.professionals_id =
tu.tag_users_user_id JOIN tags t ON
tu.tag_users_tag_id = t.tags_tag_id
WHERE tags_tag_name = 'college';
Cypher
MATCH (p:Professionals)-[]->(t:Tags)
WHERE t.tags_tag_name='college'
RETURN p,t
SQL
SELECT * FROM professionals p JOIN
emails e ON p.professionals_id =
e.emails_recipient_id WHERE
p.professionals_id =
'0079e89bf1544926b98310e81315b9f1';
Cypher
MATCH
(p:Professionals{professionals_id:
'0079e89bf1544926b98310e81315b9f1'})-
[:GOT_EMAIL]->(e:Emails)
RETURN e
Fig. 10. Query to find professionals in a specific tag

Fig. 11. EXPLAIN ANALYSE on SQL Query

Q2: Looking for students in a specific group and interested in a specific tag?

SQL
SELECT * FROM students s
JOIN group_memberships gm ON students_id = gm.group_memberships_user_id
JOIN groups_ g ON g.groups_id = gm.group_memberships_group_id
JOIN tag_users tu ON tu.tag_users_user_id = s.students_id
JOIN tags t ON t.tags_tag_id = tu.tag_users_tag_id
WHERE t.tags_tag_name = 'college'
  AND g.groups_group_type = 'youth program';
Cypher
MATCH (t:Tags)<-[:HAS_TAG]-(s:Students)-[:MEMBER_IN]->(b)
WHERE t.tags_tag_name='college'
  AND b.groups_group_type='youth program'
RETURN s,t,b

2. Recursion

Q4: Looking for the questions with answers recursively many times?

SQL
WITH RECURSIVE answer_replies AS (
  SELECT answers_id, answers_author_id, answers_question_id,
         answers_date_added, answers_body
  FROM answers
  WHERE answers_question_id IS NOT NULL
  UNION ALL
  SELECT a.answers_id, a.answers_author_id, a.answers_question_id,
         a.answers_date_added, a.answers_body
  FROM answers a
  INNER JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
)
SELECT * FROM answer_replies ar
LEFT JOIN questions q ON ar.answers_question_id = q.questions_id;
Cypher
MATCH (q:Questions)<-[:IS_REPLY_TO*1..]-(a:Answers)
RETURN q,a
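To make the behaviour of the recursive common table expression easier to follow, here is a minimal sketch that runs the same shape of query on a three-row toy table. Several things are assumptions for illustration: SQLite (through Python's built-in sqlite3 module) stands in for MySQL, the rows are made up, only two columns are kept, and the optional `max_level` bound is expressed as "highest level emitted", which is not the same convention as the `WHERE level <= 2` filter in the paper's bounded variant.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the answers table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE answers (answers_id TEXT PRIMARY KEY, answers_question_id TEXT)"
    )
    # a1 replies to question q1, a2 replies to a1, a3 replies to a2.
    conn.executemany(
        "INSERT INTO answers VALUES (?, ?)",
        [("a1", "q1"), ("a2", "a1"), ("a3", "a2")],
    )
    return conn


def reply_chain(conn, max_level=None):
    """Return (level, answers_id, answers_question_id) rows of the recursion.

    The anchor member selects every answer at level 1; the recursive member
    adds each answer whose answers_question_id points at a row already in
    the result. max_level, if given, is the highest level that is emitted.
    """
    bound = "" if max_level is None else "WHERE ar.level < :max_level"
    sql = f"""
        WITH RECURSIVE answer_replies(level, answers_id, answers_question_id) AS (
            SELECT 1, answers_id, answers_question_id
            FROM answers
            WHERE answers_question_id IS NOT NULL
            UNION ALL
            SELECT ar.level + 1, a.answers_id, a.answers_question_id
            FROM answers a
            JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
            {bound}
        )
        SELECT level, answers_id, answers_question_id
        FROM answer_replies
        ORDER BY level, answers_id
    """
    params = {} if max_level is None else {"max_level": max_level}
    return conn.execute(sql, params).fetchall()
```

Because the anchor member starts from every answer rather than from top-level answers only, the same answer can appear at several levels. On the toy data, `reply_chain(make_demo_db())` returns six rows for three answers, which hints at how the intermediate result can grow on a large table.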
Fig. 13. EXPLAIN ANALYSE SQL Command

Q5: Looking for questions with answers recursively twice?

SQL
WITH RECURSIVE answer_replies AS (
  SELECT 1 AS level, answers_id, answers_author_id, answers_question_id,
         answers_date_added, answers_body
  FROM answers
  WHERE answers_question_id IS NOT NULL
  UNION ALL
  SELECT level+1, a.answers_id, a.answers_author_id, a.answers_question_id,
         a.answers_date_added, a.answers_body
  FROM answers a
  INNER JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
  WHERE level <= 2
)
SELECT * FROM answer_replies ar
LEFT JOIN questions q ON ar.answers_question_id = q.questions_id;
Cypher
MATCH (q:Questions)<-[:IS_REPLY_TO*1..2]-(a:Answers)
RETURN q,a

LEFT JOIN questions q ON ar.answers_question_id = q.questions_id;
Cypher
MATCH (q:Questions)<-[:IS_REPLY_TO*1..3]-(a:Answers)
RETURN q,a

3. Aggregation

Q7: Count the number of professionals who answered the questions.

SQL
SELECT count(professionals_id)
FROM professionals p
JOIN answers a ON p.professionals_id = a.answers_author_id;
Cypher
MATCH (p:Professionals)-[]->(a:Answers)
RETURN count(p)

Fig. 14. Cypher query to count the number of professionals who answered the question
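One detail of the aggregation in Q7 is worth checking on a small example: counting `professionals_id` over the join yields one count per matching answer row, so a professional with several answers is counted several times. The sketch below contrasts that with `COUNT(DISTINCT ...)`. It is an illustration under assumptions, not the authors' test setup: SQLite (via Python's sqlite3) stands in for MySQL and the rows are made up.

```python
import sqlite3


def make_q7_demo_db():
    """In-memory toy tables (made-up rows): p1 wrote two answers, p2 one, p3 none."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE professionals (professionals_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE answers (answers_id TEXT PRIMARY KEY, answers_author_id TEXT)"
    )
    conn.executemany("INSERT INTO professionals VALUES (?)", [("p1",), ("p2",), ("p3",)])
    conn.executemany(
        "INSERT INTO answers VALUES (?, ?)",
        [("a1", "p1"), ("a2", "p1"), ("a3", "p2")],
    )
    return conn


def count_answering_professionals(conn):
    """Return the per-row count and the distinct count over the same join."""
    per_row, distinct = conn.execute(
        """
        SELECT COUNT(p.professionals_id),
               COUNT(DISTINCT p.professionals_id)
        FROM professionals p
        JOIN answers a ON p.professionals_id = a.answers_author_id
        """
    ).fetchone()
    return {"answer_rows": per_row, "distinct_professionals": distinct}
```

If the intent is the number of distinct professionals, the `DISTINCT` form is the one to use; Cypher offers the analogous `count(DISTINCT p)`. Either reading is fine for a timing comparison as long as both systems compute the same quantity.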
Cypher
MATCH (p:Professionals)-[:HAS_TAG]->(t:Tags)
WHERE t.tags_tag_name='college'
RETURN count(p)

SQL
SELECT tags.tags_tag_id, tags_tag_name,
       COUNT(p.professionals_id) AS number_of_professionals
FROM tags
JOIN tag_users tu ON tags.tags_tag_id = tu.tag_users_tag_id
JOIN professionals p ON p.professionals_id = tu.tag_users_user_id
GROUP BY tags.tags_tag_id, tags_tag_name
ORDER BY COUNT(p.professionals_id) DESC
LIMIT 1;
Cypher
MATCH (p:Professionals)-[:HAS_TAG]->(t:Tags)
RETURN t.tags_tag_name AS TagName, COUNT(p)
ORDER BY COUNT(p) DESC
LIMIT 1

4. Pattern Match

Q10: Looking for the question answered in tags?

SQL
SELECT q.questions_id, t.tags_tag_id, a.answers_id
FROM tags t
JOIN tag_questions tq ON t.tags_tag_id = tq.tag_questions_tag_id
JOIN questions q ON tq.tag_questions_question_id = q.questions_id
JOIN answers a ON a.answers_question_id = questions_id;
Cypher
MATCH (a:Answers)-[]->(q:Questions)-[]->(t:Tags)
RETURN a,q,t

Fig. 16. Cypher query to find question answered in tags

Fig. 17. EXPLAIN ANALYSE SQL Command

Q11: Looking for students and professionals with the same group?

SQL
SELECT g.groups_id, professionals_id, students_id
FROM groups_ g
JOIN group_memberships gm ON g.groups_id = gm.group_memberships_group_id
JOIN (SELECT group_memberships_group_id AS group_id, professionals_id
      FROM professionals p
      JOIN group_memberships gm1 ON gm1.group_memberships_user_id = p.professionals_id) pg
  ON pg.group_id = gm.group_memberships_group_id
JOIN (SELECT group_memberships_group_id AS group_id, students_id
      FROM students s
      JOIN group_memberships gm2 ON s.students_id = gm2.group_memberships_user_id) sg
  ON sg.group_id = gm.group_memberships_group_id;
Cypher
MATCH (p:Professionals)-[]->(g:Groups)<-[]-(s:Students)
RETURN p, g, s
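As an aid to reading the pattern-matching queries, the following sketch spells out in plain Python what a pattern of the form (professional)->(group)<-(student) enumerates: every professional paired with every student of the same group. The function name and the input format (lists of `(member, group)` pairs) are hypothetical and chosen for illustration; this is not how either database evaluates the query internally.

```python
from collections import defaultdict
from itertools import product


def same_group_triples(professional_memberships, student_memberships):
    """Yield (professional, group, student) for members sharing a group.

    Both arguments are iterables of (member_id, group_id) pairs.
    """
    profs_by_group = defaultdict(list)
    for prof, group in professional_memberships:
        profs_by_group[group].append(prof)

    students_by_group = defaultdict(list)
    for student, group in student_memberships:
        students_by_group[group].append(student)

    # Only groups that contain at least one professional and one student match.
    for group in sorted(profs_by_group.keys() & students_by_group.keys()):
        for prof, student in product(profs_by_group[group], students_by_group[group]):
            yield prof, group, student
```

The number of triples is the sum, over shared groups, of (professionals in the group) times (students in the group), so a few large groups can dominate the result size. That is one plausible reason the tag-based variant of this pattern caps its output with a LIMIT.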
Q12: Looking for patterns that students and experts in the same tag?

SQL
SELECT pt.tags_id, st.students_id, pt.professionals_id
FROM tags t
JOIN tag_users tu ON t.tags_tag_id = tu.tag_users_tag_id
JOIN (SELECT u.tag_users_tag_id AS tags_id, professionals_id
      FROM professionals p
      JOIN tag_users u ON p.professionals_id = u.tag_users_user_id) pt
  ON pt.tags_id = t.tags_tag_id
JOIN (SELECT u.tag_users_tag_id AS tags_id, students_id
      FROM students s
      JOIN tag_users u ON s.students_id = u.tag_users_user_id) st
  ON st.tags_id = t.tags_tag_id
LIMIT 100000;
Cypher
MATCH (p:Professionals)-[]->(t:Tags)<-[]-(s:Students)
RETURN p, t, s LIMIT 100000

A. Neo4j Performance Strategy

Now that we have our data modelled and set up, we use the Neo4j Browser client in the Neo4j Desktop app to run the Cypher queries discussed in the previous section. Neo4j provides a lot of functionality out of the box, including the EXPLAIN and PROFILE keywords.

The ‘explain’ command is used to show the execution plan of a Cypher query. It provides information on how the query will be executed, such as which indexes will be used, which operations will be performed, and the estimated number of rows that will be processed. The command used is:

Fig. 18. Cypher query using Explain command in Neo4j.

Fig. 19. Cypher query using Explain command in Neo4j.

The ‘profile’ command provides more detailed information than the explain command. It provides information on the execution plan, as well as additional statistics on how the query was executed, such as the number of database hits, the number of rows processed at each stage, and the total processing time. The command used is:

Fig. 20. Cypher query using PROFILE command in Neo4j.

The profile command is more useful than the explain command when optimizing queries because it provides more detailed information about the performance of the query. By examining the statistics provided by the PROFILE command, developers can identify performance bottlenecks and optimize their queries. Figures 21 and 22 show the more detailed execution plan, and at the bottom of Figure 22 we can also see the execution time displayed for the query to complete execution. We use the same procedure for all 12 queries: we run each query 3 times and take the average of the runtimes as the value for further comparison against the relational execution times.
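The averaging step described above (three runs per query, mean of the runtimes) can be sketched as follows. This is an illustrative helper under assumptions, not the authors' tooling: the paper reads execution times from the database's own output, whereas `time_query` below measures wall-clock time around a hypothetical caller-supplied `run_query` callable, which also includes client-side overhead.

```python
import time
from statistics import mean


def time_query(run_query, repeats=3):
    """Call run_query `repeats` times and return each wall-clock time in ms.

    run_query is a hypothetical zero-argument callable that executes one query.
    """
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def average_runtimes(runs_ms):
    """Map each query label to the mean of its recorded execution times (ms)."""
    return {label: mean(times) for label, times in runs_ms.items()}
```

If the times are taken from the database's reported figures instead, they can be passed straight to `average_runtimes` as a mapping from query label to the three recorded values.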
Fig. 21. Visualising the PROFILE command

B. MySQL Performance Strategy

Once the data loading process is complete, we use the ‘mysql’ command-line client tool to query our database and verify that the data is loaded so that we can go ahead with performance evaluation. To analyze the execution of a query and find the exact execution time, we use the EXPLAIN ANALYZE command.

This command is essential as it gives the complete breakdown of how the query was executed (types of join strategy used, intermediate and final result sizes, estimated cost, actual execution times at all steps, etc.) and exactly how much time was spent on each aspect of the query. This can help identify bottlenecks and optimize performance where needed. Additionally, it ignores the time required to render query results on the client side, which is important for calculating an execution time that purely reflects the capability of the MySQL DB engine. For demonstrative purposes, consider the same query that is used in the previous subsection, i.e., query Q4. Fig. 23 shows the output of executing the EXPLAIN ANALYZE command on Q4.

Fig. 24. Zoomed-in version of Fig. 23

We will run the EXPLAIN ANALYZE command on each of the 12 queries three times to find the average execution time of each query so that it can be used for performance comparison with Neo4j in the next section.

• Selection/Search
• Recursive/Related
• Aggregation
• Pattern Matching

For the scope of this comparison, we focus on one of the most practical parameters to judge performance of a query: execution time. We compare the execution times across the four categories, and we include at least 3 queries per category.

The hardware configuration used was an Apple MacBook Pro with an M1 Pro Apple Silicon chip, 16 GB RAM, and the latest version of macOS Ventura. We have three machines of the same configuration, each running a local instance of both databases, i.e. Neo4j Desktop and MySQL, so we can later record and take the average and ensure there was no
swaying of results due to any other external factors related to our local systems.

We then recorded the time it took for both databases to do the job and present the results in the table below:

TABLE III
PERFORMANCE COMPARISON BETWEEN NEO4J & MYSQL

include metrics such as how much memory trade-off there is in storing duplicated data in a NoSQL system and its scalability and cost consequences with respect to the gain in performance we obtain by following the graph structure. Similar to this, there can be further work and analysis based on the use cases, size of data and various other parameters.
[9] L. Jachiet, P. Genevès, N. Gesbert, and N. Layaïda, "On the Optimization of Recursive Relational Queries: Application to Graph Queries," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 681-697, doi: 10.1145/3318464.3380594.

[10] J. Hölsch, T. Schmidt, and M. Grossniklaus, "On the performance of analytical and pattern matching graph queries in neo4j and a relational database," in EDBT/ICDT 2017 Joint Conference: 6th International Workshop on Querying Graph Structured Data (GraphQ), 2017, pp. 15-22, doi: 10.1145/3035918.3035930.