Large Scale and MultiStructured Databases
There are different sources of big data, starting from social media and networks, scientific
instruments, mobile devices, transactions, sensor technology and networks, and genomics.
These are the 5 most important Vs of Big Data:
- Volume: huge amount of data
- Variety: different formats of data from various sources
- Value: extract useful data
- Velocity: high speed of accumulation of data
- Veracity: inconsistencies and uncertainty in data
Input data is divided into logical splits; each mapper, running on a node, produces intermediate results, and these results are re-organized (shuffled and sorted) so that the reducers can aggregate them and produce the final result.
The mapper takes key/value pairs as input and produces key/value pairs as output; all the keys emitted by the different mappers are then sorted and grouped, generating key/value pairs where the value is a list of values. Each group is processed by a specific reducer, which produces an output.
In the word-count example, the intermediate key is the word and the value is the count; all key/value pairs are shuffled and sorted, and each reducer counts the number of elements for its words and produces the result.
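As an illustration, here is a minimal sketch of the classic word-count Mapper and Reducer written with the Hadoop Java MapReduce API; the class names and the tokenization logic are illustrative choices, not part of the original notes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: for each word in the input line, emit the pair (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) after the shuffle/sort phase and sums the counts
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```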
We can see the physical infrastructure of the Google software architecture for Big Data.
The distributed file system was the most important part, and on top of it the MapReduce processing model and BigTable were used; Hadoop and its ecosystem are the open-source counterpart of this architecture.
Hadoop is an open-source framework (the first of its kind), written in Java, used for the distributed storage of very large data sets (Big Data) and for their distributed processing.
This framework consists of a number of modules:
- Hadoop Common: the common utilities that support the other Hadoop
modules.
- Hadoop Distributed File System (HDFS)
- Hadoop YARN – A framework for job scheduling and cluster resource management
- Hadoop MapReduce – programming model
HDFS is a distributed file system written in Java; it scales to clusters with thousands of computing nodes.
Each node stores part of the data in the system, and the system is fault tolerant thanks to data replication.
It is designed for big files and low-cost hardware and is efficient for read and append operations.
Hadoop has some limits: it is optimized for one-pass batch processing of on-disk data. Between iterations we have to push the data back from memory to the file system, so a loop of several map-reduce stages requires repeated memory and disk operations.
For iterative programs it is therefore very slow.
Apache Spark is an open-source framework which has emerged as the next-generation big data processing tool thanks to its enhanced flexibility and efficiency.
Spark allows employing different distributed programming models, such as MapReduce and
Pregel, and has proved to perform faster than Hadoop, especially in case of iterative and online
applications.
Unlike the disk-based MapReduce paradigm supported by Hadoop, Spark employs the concept of
in-memory cluster computing, where datasets are cached in memory to reduce their access
latency.
At a high level, a Spark application runs as a set of independent processes on top of the dataset distributed across the machines of the cluster, and consists of one driver program and several executors.
The driver program, hosted in the master machine, runs the user’s main function and distributes
operations on the cluster by sending several units of work, called tasks, to the executors.
Each executor, hosted in a slave machine, runs tasks in parallel and keeps data in memory or disk
storage across them.
The main abstraction provided by Spark is the resilient distributed dataset (RDD).
RDD is a fault-tolerant collection of elements partitioned across the machines of the cluster that
can be processed in parallel. These collections are resilient, because they can be rebuilt if a portion
of the dataset is lost.
The applications developed using the Spark framework are totally independent of the file system
or the database management system used for storing data.
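As a concrete illustration of the RDD abstraction, here is a minimal word-count sketch using the Spark Java API; the input/output paths and the local master setting are illustrative assumptions.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" runs Spark on the local machine; on a cluster the master is set by the launcher
    SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // an RDD of lines, partitioned across the executors
      JavaRDD<String> lines = sc.textFile("input.txt");

      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // words
          .mapToPair(word -> new Tuple2<>(word, 1))                   // (word, 1)
          .reduceByKey((a, b) -> a + b);                              // sum per word

      counts.cache();                    // keep the RDD in memory for further (iterative) use
      counts.saveAsTextFile("output");   // write the result back to the file system
    }
  }
}
```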
Databases revolution
A database is an organized collection of data.
It can be a book with an enforced structure, such as a dictionary, or a library and other indexed archives of information. These can be seen as the first databases.
In the 19th century we had mechanical databases, such as the loom cards used for creating complex fabric patterns: we needed to store the pattern we wanted to have in our fabric.
Another kind of database was the perforated paper strip for musical boxes; the strips contained the information.
We also had punched cards, used in tabulating machines for producing census statistics.
Revolution zero was the tape revolution: data was stored sequentially on paper tape or magnetic tape.
The available features were ‘fast forward’ and ‘rewind’ through the datasets.
The problem was the lack of direct high-speed access to individual records.
The first electronic databases, supporting OLTP (Online Transaction Processing), appeared around the 1950s.
Data was stored in the file system: we had a database but without a DBMS, using an ISAM (Indexed Sequential Access Method) organization in which records were reached through indexes.
A DBMS is a piece of software in charge of managing the DB in terms of querying, consistency and the other rules we have to follow when working with data.
At that time there was no dedicated data layer, and all the control of the database was in the hands of the application. Every application had to reinvent the database wheel, with problems such as errors in the handling code, corrupted data, concurrent access issues and duplication of logic in each application.
The first database revolution came when a new layer, the DBMS, allowed the separation between the database handling logic and the application itself.
It reduced the programmer overhead and increased the performance and integrity of data.
The DBMS needs two elements to work correctly:
- a schema for the data to store in the database
- an access path for navigating among data records
The navigational model has to do with the organization of the data.
In the hierarchical model the only way to get information was to start from a fixed entry point, for example a product.
We had to define the navigational model keeping in mind which queries we would have to run.
The network model can be better in some cases; if you always start from a product, the hierarchical model can be fine.
The problems of the navigational models were that they ran exclusively on mainframe computers, that only queries that could be anticipated during the initial design were possible (in a hierarchical model you can only start from the product and cannot start from a department, for example), and that it was difficult to add new data. Also, complex analytic queries required hard coding.
The second database revolution started a few years later.
The demand for complex analytics was increasing, existing databases were hard to use, and they mixed logical and physical implementations.
In 1970 a paper was published talking about the Relational theory with the concept of tuples, an
unordered set of attribute values, and relations, collections of distinct tuples.
There were also constraints to enforce consistency of the database and key constraints used to
identify tuples and relationships between tuples.
Operations on relations, such as joins, projections and unions, were introduced too, together with
the third normal form, that says that non-key attributes must be dependent only on the key.
Then the concept of transactions was introduced too, to handle concurrent data change requests on a database system.
In transactions we need to ensure consistency and integrity of data, and a solution is ACID transactions.
A transaction is a transformation of the state which has the properties of atomicity (all or
nothing), durability (survive failures) and consistency (correct).
An ACID transaction should be:
- atomic, the transaction is indivisible.
- consistent, the DB remains in a consistent state before and after the execution of the
transaction.
- isolated, while multiple transactions can be executed simultaneously, one transaction
should not see the effects of other in-progress transactions.
- durable, once a transaction is saved to the DB, its changes are expected to persist even if
there is a failure of operating system or hardware.
If the transaction fails after completion of T1 but before completion of T2 we have an inconsistent
state.
Then the SQL language appeared and was adopted as a standard query language, and a database war started between IBM and Oracle.
In the client-server paradigm the relational database had a lot of importance.
Database systems were hosted on remote servers, while presentation logic was hosted on a PC
terminal. Object-oriented applications had problems with relational databases.
Multiple SQL operations were required to convert from the object-oriented representation to the
relational representation.
A new DBMS model, namely the Object Oriented Database Management System (OODBMS), allowed storing program objects without normalization, and applications were able to load and store their objects easily.
Object-Relational Mapping (ORM) frameworks alleviated the pain of OO programmers.
Internet started to pervade our lives and the era of massive web-scale applications created
pressures on the relational database model. They needed large volumes of read and write
operations, low latency response times and high availability.
The performance of RDBMSs may be improved with vertical scaling (upgrading CPUs, adding memory and faster storage devices), but it is costly and there are limitations on the number of CPUs and the amount of memory.
In any case, there would still be difficulties in modifying the structure of existing data and in deploying on multiple servers, performance issues, and ACID transactions would be hard to implement on multiple servers.
The third database revolution started with the needs of large-scalable DBMSs characterized by:
- Scalability: for example, a spike of traffic on a website.
We could add a new server and then shut it down when traffic becomes normal again, but this is hard to do with an RDBMS.
- Flexibility: for example, an application which handles several kinds of items.
We may want to add/remove/modify one or more fields which describe items in the database.
But modifying the schema of an RDBMS is not an easy task and also involves modifying the application code.
We thus move from a schema model to a schema-less model.
- Availability: for example, websites such as social networks or e-commerce services must always be available for users.
A solution could be to deploy the whole systems on multiple servers. If one server goes
down, the services continue to be available for users, even though performances (and also
data consistency) may be reduced.
Usually RDBMSs run on a single server. If it fails, the services will not be available unless there is a backup server. However, this solution is not efficient, because the second server does not improve the performance of the DBMS during its normal workflow: we cannot exploit replication and sharding to do load balancing.
- Low cost: most of the RDBMSs used by enterprises and organizations often have licensing costs, which depend on the number of users.
A solution might be to use open source DBMSs.
The issue is that enterprises and organizations would have to change their database architectures and even entirely re-design their applications.
So, to recap:
Currently, scalable services for handling relational databases are offered by big enterprises such as
Google and Amazon. Using these services, enterprises may continue to use their software without
the needs to migrate towards NoSQL databases. These services are cloud-based and pay-per-use
services.
The offered solutions are often scalable up to thousands of nodes and ensure high availability of
the service, but still the flexibility is reduced as in classical SQL-based DBMSs.
Everything is managed in the cloud, and a spending manager is needed to keep the costs under control.
Functional requirements describe the main functionalities of the application and modules and they
depend on the type of software, expected users and the type of system where the software is
used.
Functional requirements may be high-level statements of what the system should do.
Functional requirements can define if/then behaviours.
EX. The user shall be able to search either all of the initial set of databases or select a subset from
it. The system shall provide appropriate viewers for the user to read documents in the document
store.
Specifying some fields in these requirements is not an error, but specifying all of them would be.
Non-functional requirements define system properties and constraints (e.g. reliability, response
time and storage requirements).
Ex. The system shall support at least 100 concurrent users, it shall provide at least x MB/s in download, and access shall be granted only to authorized users.
They may also regard the programming language or the development method. Non-functional requirements may be more critical than functional requirements: if they are not met, the system is useless.
Ex. In this application we decided to use C#.
A possible categorization of non-functional requirements is the following:
- Product requirements: requirements which specify that the delivered product must behave in a particular way (execution speed, reliability, etc.)
- Organizational requirements: consequences of organizational policies and procedures
- External requirements: requirements which arise from factors external to the system and its development process (e.g. interoperability and legislative requirements)
Use cases represent a formal way of describing how a system interacts with its environment. A use case illustrates the activities that are performed by the users of the system.
Use cases are sequences of actions a system performs that yield a valuable result for a particular actor.
This is a scenario-based technique in UML.
You can attach mockups to the use case diagram.
Use case diagrams describe what a system does from the standpoint of an external observer, the
emphasis is on what a system does rather than how.
They are closely connected to scenarios, which are examples of what happens when someone interacts with the application.
An actor is a user or outside system that interacts with the system being designed in order to obtain some value from that interaction.
Use cases describe the interaction between the actor and the system.
An actor can be a human, a peripheral device, an external system or subsystem, or a time-based event.
A base use case is dependent on the included use case(s); without them the base use case is incomplete, as the included use cases represent sub-sequences of the interaction that must always happen.
For example, whenever we make an ATM transaction, we also have to run the Customer Authentication use case, otherwise the transaction cannot be executed (e.g. Customer Authentication -> Authentication Check).
The extending use case is dependent on the base use case; it extends the behavior described by
the base use case. The base use case should be a fully functional use case without the extending
use case’s adding functionality.
You cannot invoke Help outside of a Bank ATM Transaction; it is an additional feature that you may have.
Data models provide a structure and a formal description of the data, and of the relationships among data, required by an information system.
EX. Entities: Company, Customer, Departments
Relations: Company has a collection of departments
Entity sets and relationship sets do not depend on the type of database and DBMS.
The overall data modelling process includes three stages:
- Data analysis
- Designing a conceptual data model (entity relationship model)
- Converting the conceptual data model into an actual DB data organization model
We will model this using UML analysis classes.
A department has a collection of employees. The order in which they are added is the reverse of the E/R diagram.
We have a generalization: we can assume that Employee is an abstract class, which cannot be instantiated, and from it we can derive some specialized classes.
In a relational model we would need different tables.
In the graph-based model we have vertices which are entities and edges with some properties.
In particular, they are records, and not general entities.
We found these actors: Manager.
'Selected' is specified to make it clear that a department has to be selected first.
We depict the manager as a person.
Add cannot be an extension of Browse: browsing is not required in order to add, and users do not necessarily browse before adding, even if it is part of the mockup.
So Add is a direct action, connected directly to the actor.
If every time we add a department we also want to view it, we need to add an inclusion from Add to View; in that case the department will be shown directly. (Add is not shown in the diagram.)
If we want to show an acknowledgement we could include add department -> show ack.
An extension of Find Department can be Set Parameters, but this is advised only if we have default parameters.
Once a department has been selected, a manager can browse the list of its employees.
Once an employee has been selected, a manager can browse the list of his/her projects, so we have another extension.
Once an employee has been selected, a manager can assign a project.
Once we view a project, we can show its details.
Putting inside an entity the id of the entity with which it has a 1-* relation is wrong: the link is already specified by the relation in the diagram.
We can avoid attributes and model them with a 1-* relation to a Feature entity and use generalizations, or even use different 1-1 relationships with many entities which represent the features.
ACID vs BASE
Databases must allow users to store and retrieve data. To this aim, DBMSs must be in charge of three tasks:
- Store data persistently
- Maintain data consistency
- Ensure data availability
Most of the recent NoSQL DBMSs can be deployed and used on distributed systems, namely on
multiple servers rather than a single machine.
Data persistency consists in making sure that data is stored in a way that it is not lost when the database server is shut down or, for example, in case of failure.
Data consistency: consider, for example, a financial database and customers that need to add money to the company. We need to decrease the customer balance and increase the total funds available; in order to use the money correctly, Bob has to wait for consistent data.
Data Availability: Data stored in a single server DBMS may be not available for several reasons,
such as failures of the operating system, voltage drops, disks break down. In the figure we show a
possible solution that ensures high data availability: the two-phase commit. We may think of having a backup server.
If the primary server goes down, the backup server takes its place and the DBMS continues to offer its services to the users.
Within a write operation we have to write to the primary server, write the copy to the backup server, wait for the acknowledgment from the backup server and then wait for the final acknowledgment from the primary server.
We have to wait longer, but it is needed if we want full consistency. We also have the network overhead.
You can have consistent data, and you can have a high-availability database, but transactions will
require longer times to execute than if you did not have those requirements.
Everything depends on functional and non-functional requirements.
We can have consistent but not available data: we have to wait for the multi-phase commit to complete.
When dealing with two-phase commits we can have consistency but at the risk of the most recent
data not being available for a brief period of time.
While the two-phase commit is executing, other queries to the data are blocked, the server will
not be available.
The updated data is unavailable until the two-phase commit finishes. This favors consistency over
availability.
We have a trade-off between availability and consistency.
All these considerations are summarized in the CAP theorem (Brewer's theorem), a fundamental theorem of distributed systems.
Distributed databases cannot ensure at the same time:
- Consistency (C): the presence of consistent copies of data on different servers
- Availability (A): providing a response to any query
- Partition tolerance (P): failures of individual nodes or of connections between nodes do not impact the system as a whole
At most two of the previous features may be guaranteed in a distributed database.
The triangle has the three features on its vertices, and on its sides we find the different situations in which two features are present.
ACID transactions are 'too ACID' for new large-scale applications, so we will move towards BASE.
Let’s talk about the latency issue.
The level of consistency may be tunable. We have different types of eventual consistency:
- Read-Your-Writes Consistency ensures that once a user has updated a record, all of his/her reads of that record will return the updated value. The user who made the update reads the correct value, while other users may not see it yet.
Let’s say Alice updates a customer’s outstanding balance to $1,500.
The update is written to one server and the replication process begins updating other
copies.
During the replication process, Alice queries the customer's balance. She is guaranteed to see $1,500 when the database supports read-your-writes consistency.
- Monotonic read consistency ensures that if a user issues a query and sees a result, he/she will never see an earlier version of the value. If you get a result, re-issuing the same query cannot return an earlier result.
Let’s assume Alice is yet again updating a customer’s outstanding balance. The outstanding
balance is currently $1,500. She updates it to $2,500.
Bob queries the database for the customer’s balance and sees that it is $2,500.
If Bob issues the query again, he will see the balance is $2,500 even if all the servers with
copies of that customer’s outstanding balance have not updated to the latest value.
- Monotonic write consistency ensures that if a user makes several update commands, they
will be executed in the order he/she issued them.
Alice is feeling generous today and decides to reduce all customers’ outstanding balances
by 10%. Charlie, one of her customers, has a $1,000 outstanding balance. After the
reduction, Charlie would have a $900 balance. Now imagine if Alice continues to process
orders. Charlie has just ordered $1,100 worth of material. His outstanding balance is now
the sum of the previous outstanding balance ($900) and the amount of the new order
($1,100), namely $2,000. Now consider what would happen if the database performed
Alice’s operations in a different order. Charlie started with a $1,000 outstanding balance.
Next, instead of having the discount applied, his record was first updated with the new
order ($1,100). His outstanding balance becomes $2,100. Now, the 10% discount operation is executed and his outstanding balance is set to $2,100 − $210 = $1,890 (instead of $2,000).
In the first box we have the timeline with the right order.
If we want to ensure causal consistency, we may accept a view order like the second box.
We cannot accept a view like the third box: we cannot see comments on Mike's posts before the posts themselves.
Comments on Mike's posts have causal consistency with respect to Pietro's and Luca's posts.
A comment logically depends on the post.
I may see the comments in a different order, but answers to a comment must be viewed under the comment itself, because they depend on the comment.
Key-value Databases
An array is an ordered list of values, where each value in the array is associated with an integer index. The values are all of the same type.
We have some limitations:
- The index can only be an integer
- The values must all have the same type
Associative arrays are collections of key/value elements that generalize the idea of an ordered list.
The key has the same role as the index in an array, but it can be a generic identifier (string, integer, etc.). The values can be of different types.
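A minimal sketch of an associative array in Java (the attribute names and values are just illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class AssociativeArrayExample {
  public static void main(String[] args) {
    // keys are generic identifiers (strings here), values may have different types
    Map<String, Object> customer = new HashMap<>();
    customer.put("firstName", "Pietro");   // string value
    customer.put("age", 42);               // integer value
    customer.put("balance", 1500.0);       // floating-point value

    System.out.println(customer.get("firstName"));   // prints "Pietro"
  }
}
```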
Key/value databases are similar to associative arrays, but data is persistently stored on disk. There is no need for tables if we do not need groups of related attributes.
The only requirement is that each value has a unique identifier in the form of a key.
We usually define a namespace, the bucket, which is a collection of key/value elements.
Keys must be unique within the namespace defined by each specific bucket.
Developers tend to use key/value databases when ease of storage and retrieval are more
important than organizing data into more complex data structures, such as tables or networks.
Essential features of key/value databases are:
- Simplicity
It’s useful if we do not need all features of the relational models, if we do not have to
define a database schema nor to define data types for each attribute.
We can simply modify the code of our program if we need to handle new attributes,
without the necessity of modifying the databases.
They are also flexible and forgiving: we can make mistakes, assigning a wrong type of data to an attribute or using different types of data for the same attribute. This will not be a problem and we will not get any error from the DBMS, since we do not define the types; on the other hand, we do not have the checks we would have in relational databases.
- Speed
To reduce latency, the best way is to have data in main memory rather than only on disk.
A key/value database can write the recently updated value to disk while the user program is doing something else: write operations may first go to the cache and then, after the control has been returned to the application, the disk can be updated.
Read operations can be faster if data is stored in memory. This is the motivation for using a
cache: the first time the program fetches the data, it will need to read from the disk but
after that the results can be saved in memory.
The next time, data can be searched in the cache first and then, if not found, on disk.
We have to consider that RAM is not infinite.
When the key/value database uses all the RAM memory allocated to it, the database will
need to free some records.
Several algorithms exist for selecting the records to remove from the RAM.
The least recently used (LRU) policy is one of the most used algorithms: data that has not been read or written as recently as other data can be removed from the cache (a minimal LRU sketch is given right after this feature list). In this way we remove data from the cache, not from the database.
Use case: key-value DB used to store items in customers’ online carts.
If we have an item in memory representing the cart of a user who has not visited the site for a while, that data will be removed from the cache.
The user will still find the cart when he/she comes back, because it is still in the DB; we only remove the data from the cache.
- Scalability
The scalability is the capability to add or remove servers from a cluster of servers as
needed to accommodate the load on the system.
These databases can be distributed.
When dealing with NoSQL databases employed in web and other large-scale applications, it is particularly important to correctly handle both reads and writes when the system has to be scaled: we need to estimate the amount of read and write operations and, depending on their balance, decide which architecture to use.
Dividing a table of a relational database into chunks and knowing where to deploy them is not easy.
In the framework of key/value databases, two main approaches are adopted, namely master-slave replication and masterless replication.
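As mentioned under the Speed feature, here is a minimal sketch of LRU eviction for an in-memory cache, built on Java's LinkedHashMap; the capacity and the generic design are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU cache: when it is full, the least recently accessed entry is evicted.
// The entry disappears from the cache only, not from the underlying database.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruCache(int maxEntries) {
    // accessOrder = true: iteration order follows the last access, not insertion
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // called after every put: evict the eldest (least recently used) entry when full
    return size() > maxEntries;
  }

  public static void main(String[] args) {
    LruCache<String, String> cache = new LruCache<>(2);
    cache.put("cart:alice", "item1,item2");
    cache.put("cart:bob", "item3");
    cache.get("cart:alice");              // touch Alice's cart
    cache.put("cart:carol", "item4");     // evicts "cart:bob", the least recently used
    System.out.println(cache.keySet());   // [cart:alice, cart:carol]
  }
}
```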
The master-slave replication is an architecture in which we have a master, that is a server that
accepts write and read requests. This is the only server that can accept write operations.
The master is in charge of collecting all the modifications of the data and propagating the updates to all the slaves. Everything about consistency is managed by the master.
The slaves may only respond to read requests.
This architecture is especially suitable for applications with a growing demand for read operations rather than write operations.
Pros:
- Very simple architecture: only the master communicates with the other servers
- There is no need to coordinate write operations or resolve conflicts between multiple
servers accepting writes
Cons:
- If the master fails, the cluster cannot accept writes.
- In this case, the entire system fails or at least loses a critical capacity, accepting writes.
It depends on the application if it’s possible to wait or not.
We may add more complexity in this architecture increasing the availability.
Each slave server may send a simple message to a random server in the cluster to ask whether it is still active, or ask the other slaves how long ago they last received a message from the master.
If several slave servers do not receive a message from the master within some period, the slaves
may determine the master has failed. The slaves initiate a protocol to promote one of the slaves as
a new master.
It may be harder to handle the load because we have lost one server, but we improve availability at the cost of increased complexity.
The master-slave replication architecture suffers under an increasing number of write requests (only the master accepts writes, and all the updates must be sent to the slaves).
Use case: applications in which the number of writes may increase and reach peaks.
The solution is to use the masterless replication, a replication model in a cluster of servers where
all nodes may accept both read/write requests.
In the picture we see a logical view of a possible architecture: each server can communicate with the servers next to it.
In this architecture there is not a single server that has the master copy of updated data.
There is not a single server that can copy its data to all other servers.
Each node is in charge of helping its neighbors.
The ring structure is an abstraction. Usually, in data center, all the servers are connected to a
single network hub and are able to communicate with each other.
Let's suppose we have a dataset divided into 8 chunks, each server hosts four replicas, and each server is in charge of updating its own data and the data of 3 other servers.
If I write on partition 1, the server may carry the update to 2, 7 and 8, and so on; if server 2 answers, it may update the chunks on its neighbors (for example the chunk on server 3).
Different servers can therefore answer write queries for a specific chunk.
We can use keys that start with 'customer', followed by the ID of the customer and the name of the attribute.
For handling customer information, we can use just one namespace.
If I want to add an attribute, I can do it without problems.
I would not occupy space for null values; I would simply skip that attribute for a customer.
In the same namespace I would be able to put other entities, for example the orders of the
customer.
The general rule for defining keys is the following (single Entity, no relationships):
Entity Name: Entity ID: Entity Attribute
I would have to add a key/value pair for each attribute.
One attribute can also be a complex attribute, for example a complex JSON document; in that case there will be other problems to handle, and we should decide whether a key/value database is still the right choice.
If all the values in your database are JSON documents, it would be better to use a document database.
We can use a hash function together with the modulo operation to find the server that answers a query.
If we divide the hash value (the result) by the number of servers, we can take the remainder, namely the modulo, which can be used to identify the server.
With 8 servers, the modulo spans from 0 to 7, and each of the 8 servers can be assigned a number (its address).
In this way we obtain a sort of load balancing.
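A minimal sketch of this hash-plus-modulo routing in Java; the key format and the number of servers are illustrative assumptions.

```java
public class KeyRouter {
  private final int numServers;   // e.g. 8 servers, addressed 0..7

  public KeyRouter(int numServers) {
    this.numServers = numServers;
  }

  // hash the key and take the remainder of the division by the number of servers
  public int serverFor(String key) {
    // Math.floorMod avoids negative results when hashCode() is negative
    return Math.floorMod(key.hashCode(), numServers);
  }

  public static void main(String[] args) {
    KeyRouter router = new KeyRouter(8);
    // every attribute of the same customer maps deterministically to some server in 0..7
    System.out.println(router.serverFor("cust:1234123:firstName"));
  }
}
```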
Problem: we can use this mechanism to avoid selling tickets for the same seat, at the same venue, in the same city, on the same night to more than one person.
If two fans want to purchase the same ticket, they might both succeed if we are not careful.
We can use the following key for the specific ticket:
Anyone trying to purchase that same seat on the same night would generate the same key, that
will be transformed by the hashing + modulo function in the same server address.
There is no chance for another server to sell that seat: one client would simply have to wait, and the double sale is avoided.
Key/value databases do not expect us to specify types for the values we want to store.
Examples for storing the address of a customer:
Some considerations:
- Consider the design characteristics of the key/value DB we choose to use
- Consult the documentation for limitation on keys and values
- When choosing a key/value DB consider the trade-off of various features
- One key/value DB might offer ACID transactions but set limits on the size of keys and values
- Another key-value data store might allow for large values but limit keys to numbers or strings
- Our application requirements should be considered when weighing the advantages and disadvantages of different DB systems
We could define a function that returns all the addresses in a certain city, given an interval of user IDs.
Some key-value databases incorporate search functionality directly into the database, through indexes.
A built-in search system would index the string values stored in the database and create an index
for rapid retrieval.
The search system keeps a list of words with the keys of each key-value pair in which that word
appears.
This is a problem for writes, because we need to update this data structure, and we also need to reserve additional memory for the indexes; for reads, however, it is a very good solution.
In the key we can specify details regarding the entity and the attribute.
In general, keys may be strings, numbers, lists or other complex structures, such as JSON files.
In order to identify a value in the database, we need the “address” of the specific location in which this value is stored. Hash functions are usually used to obtain the address, namely a number, from an arbitrary key.
Usually, hash functions return values that seem to be random values.
Values returned by hash functions may not be unique (collision problem!), so we will have to manage these collisions.
Hash functions can also be used to distribute the load between the different servers in a cluster.
At a minimum, a key is specified as a string of characters: Patient:A384J:Allergies
Strings representing keys should not be too long, otherwise we will occupy a lot of memory and risk saturating the main memory.
Long keys will use more memory and key-value databases tend to be memory-intensive systems
already.
At the same time, avoid keys that are too short. Short keys are more likely to lead to conflicts in
key names.
A value is an object, so at a low level a set of bytes, that has been associated with a key. Values can be integers, floating-point numbers, strings of characters and even complex objects such as pictures, videos and JSON files.
Key-value implementations will vary in the types of operations supported on values.
Limits on the size of a single value may be fixed by the different key-value database frameworks.
In some applications we may have different modules in which the same attribute name has a different meaning.
Ex. in a social network module 'name' is the nickname of the participant, while another module of the application uses 'name' for personal registry purposes. In such cases it is useful to use separate namespaces.
It is allowed to have duplicate values in a namespace.
Namespaces enable duplicate keys to exist without causing conflicts by maintaining separate collections of keys.
Namespaces are helpful when multiple applications use the same key-value database.
Namespaces allow organizing data into subunits.
For each namespace, the keys are internally generated with a prefix, which is the name of the namespace, but this is transparent for the developer.
Data Partitioning
Subsets of data (partitions or shards) may be handled by the different nodes of a cluster. A cluster
may contain more than one partition.
Several strategies exist for data partitioning.
The main objective of partitioning data is to evenly balance both write and read loads among servers.
If needed, additional nodes would be easily added to the cluster and data appropriately relocated.
An idea is to assign keys starting with ‘A’-‘L’ to the first server and the rest to the second. It can be useful to balance the work; we are not sure the split is exactly one half, but it is an idea.
Particular attention must be paid to relocating data when we add/remove a server.
A partition key identifies the specific partition in which the value has been stored.
Any key in a key-value database is used as a partition key.
In the previous example, the first letter of a key (string) acts as the partition key.
Usually, hashing functions are adopted for identifying the specific cluster or partition in which data
should be stored.
We are not required to define all the keys and types of values we will use prior to adding them to
the database, this schema-less property determines the flexibility of our database.
We may decide to change how we store the attributes of a specific entity.
Regarding the example in the table, we might decide that storing a customer’s full name in a single
value is a bad idea, thus we will separate first and last names.
We then need either to update the application code to handle both ways of representing customer names, or to convert all the instances of one form into the other. In a relational database the DBMS would take care of integrity and references; here we do not have these problems handled for us, so we need to pay more attention to the code.
In a masterless architecture we can exploit the logical ring organization of servers in which servers
can communicate with their neighbors.
Let us consider a hashing function that generates a number from a key and calculates the modulo.
Whenever a piece of data is written to a server, it is also written to the two servers linked to the
original server (high availability). Also, we have load balancing.
If Server 4 fails, both Server 3 and Server 5 could respond to read/write requests for the data on
Server 4. When Server 4 is back online, Servers 3 and 5 can update it.
High availability is ensured by using replication, namely saving multiple copies of the data in the
nodes of a cluster. The number of data replicas is often a parameter to set.
The higher the number of replicas, the less likely we are to lose data, but the lower the performance of the system (in terms of response time).
The lower the number of replicas, the better the performance of the system, but the higher the probability of losing data, which is the probability that all the servers holding a copy fail.
A low number of replicas may be used whenever data is easily regenerated and reloaded. If we
must regenerate data, we may prefer to reduce the number of replicas.
Most NoSQL database frameworks give us the opportunity to set:
N = number of replicas
W = number of copies that must be written before the write can complete
R = number of copies that must be read before assuming that the read value is correct
In the first situation, every time we write we want to maintain strong consistency, thus before returning the control to the application we update all the copies (W = N). In this way, if we read just once (R = 1), we are sure that we read consistent data.
We have the highest level of consistency, the slowest write operations and the fastest read operations.
In the second case we still maintain a good level of consistency because, when writing, before returning the control we must update at least one replica.
If we read just once we are not sure that we get the last updated value, but if we read two copies, we are sure that at least one of them is the most recent (we may have an associated timestamp). We have slightly faster writes but slower reads, while still providing consistency.
We are preferring consistency over availability, in terms of latency.
In the third case we write just once (fastest writes), but in order to ensure consistency we have to read all N copies (slowest reads).
The fastest situation, with just 1 write and 1 read, gives us the highest availability and the lowest latency, but we may have problems with consistency.
What to choose depends on the context: we must decide the key, but also the setup for replicas and sharding.
For sharding we may use a geographical approach, encoding the country in the key, or we may use other strategies.
An example of a sharding function, given a key, translates it into a hash value, which in this case is a number in hexadecimal format.
One of the important characteristics of hash algorithms is that even small changes in the input can
lead to large changes in the output.
Hash functions are generally designed to distribute inputs evenly over the set of all possible
outputs. The output space can be quite large. This is especially useful when hashing keys.
No matter how similar your keys are, they are evenly distributed across the range of possible
output values.
Assume we have a cluster of 16 nodes and each node is responsible for one partition.
The key 'cust:8983:firstName' has a hash value of
4b2cf78c7ed41fe19625d5f4e5e3eab20b064c24
and would be assigned to partition 4.
The key 'cust:8983:lastName' has a hash value of c0017bec2624f736b774efdc61c97f79446fc74f
and would be assigned to node 12 (c is the hexadecimal digit for the base-10 number 12).
Is this a problem? It depends. We can put firstName and lastName together if we want to be sure to retrieve both from the same server.
To resolve the collision problem, in which the hashing function produces the same output for different keys, we can use a collision resolution strategy.
From a logical point of view, the table that maps a hashed key to the corresponding value may include a list of values.
In the list we append all the values associated with that hash; in each block of the list, the original key must also be present.
We may then scan the list looking for the key in order to retrieve the value. We add complexity, but we gain several advantages.
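A minimal sketch of this chaining strategy in Java; the bucket count and the class design are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Collision resolution by chaining: each bucket keeps a list of (original key, value)
// pairs, so two keys hashing to the same bucket can coexist.
public class ChainedStore<V> {
  private static class Entry<V> {
    final String key;   // the original key must be stored in each block of the list
    V value;
    Entry(String key, V value) { this.key = key; this.value = value; }
  }

  private final List<List<Entry<V>>> buckets;

  public ChainedStore(int numBuckets) {
    buckets = new ArrayList<>(numBuckets);
    for (int i = 0; i < numBuckets; i++) buckets.add(new ArrayList<>());
  }

  private List<Entry<V>> bucketFor(String key) {
    return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
  }

  public void put(String key, V value) {
    List<Entry<V>> chain = bucketFor(key);
    for (Entry<V> e : chain) {
      if (e.key.equals(key)) { e.value = value; return; }  // overwrite an existing key
    }
    chain.add(new Entry<>(key, value));                    // otherwise append to the chain
  }

  public V get(String key) {
    for (Entry<V> e : bucketFor(key)) {   // scan the list looking for the original key
      if (e.key.equals(key)) return e.value;
    }
    return null;                          // key not present
  }
}
```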
Use Case: adding or removing a node, even for a short time period.
Problem: we need to change the hashing function and to re-locate all the data among the servers
of the cluster.
If we change the number of servers while using the general hash-and-modulo function, we need to change it, at least the modulo, increasing or decreasing it.
We would also need to relocate all the records in the database, and we would need to stop the servers to relocate everything.
Solution: to exploit the ring structure and a consistent hashing function that allows us to remap
only the keys mapped to the neighbors of the new node.
We have 4 nodes and, in this ring, each portion of the circular array contains the keys assigned to one node.
We need to work only on the keys of nodes 1 and 4, and we can keep using the same hashing function.
The same hashing function is applied to both the keys and the server/partition ID/Address/Name.
The hash values must be in the same range, for example hexadecimal on 32-bit representation, in
this case the result will vary from 0 to 2^32.
We will have a distribution of keys in the circular array, but we can also calculate the hashing value
of the id of the server.
A specific server will host all keys which precede it.
The actual server (Node) kj associated to a specific key (object) oi is its successor in the
hashing/address space. We will consider the first server we find in the ring.
If we remove server 2 and add server 4, we must relocate O18 on K3, the first successor, and
relocate O15 in K4.
The hashing function never changes, and we relocate only two objects.
The higher the range, the lower the probability of collisions, but the more complex the data structure needed to manage the hashing.
We use hashing to balance the load on the servers, and if we change the number of servers consistent hashing may be the solution.
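A minimal sketch of a consistent hashing ring in Java; the use of SHA-1 and the node naming are illustrative assumptions, not the implementation of any specific product.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
  // positions on the ring, sorted by hash value
  private final TreeMap<Long, String> ring = new TreeMap<>();

  private long hash(String value) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-1")
          .digest(value.getBytes(StandardCharsets.UTF_8));
      long h = 0;
      for (int i = 0; i < 8; i++) {     // use the first 8 bytes as the ring position
        h = (h << 8) | (digest[i] & 0xFF);
      }
      return h;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  // nodes and keys are hashed with the same function, into the same range
  public void addNode(String nodeId)    { ring.put(hash(nodeId), nodeId); }
  public void removeNode(String nodeId) { ring.remove(hash(nodeId)); }

  // a key is assigned to its successor on the ring: the first node with position >= hash(key)
  public String nodeFor(String key) {
    SortedMap<Long, String> tail = ring.tailMap(hash(key));
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }

  public static void main(String[] args) {
    ConsistentHashRing ring = new ConsistentHashRing();
    ring.addNode("node1"); ring.addNode("node2"); ring.addNode("node3"); ring.addNode("node4");
    System.out.println(ring.nodeFor("cust:8983:firstName"));
    ring.removeNode("node2");   // only the keys that mapped to node2 move to its successor
    System.out.println(ring.nodeFor("cust:8983:firstName"));
  }
}
```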
If data organization and management is more important than performance, classical relational databases are more suitable than key-value databases.
However, if we are more interested in performance (high availability, short response time, etc.) and/or the data model is not too complicated (no hierarchical organization, flexibility, simplicity, limited number of relationships), we may use key-value databases, even together with other databases.
Indeed, key-value stores are really simple and easy to handle, and data can be modeled in a less complicated manner than in an RDBMS.
Let’s consider this data structure:
We need to join tables to make this operation and join operations are time consuming.
Now, we want to translate the data modelling from a relational model to a key-value model.
Keep in mind that keys embed information regarding Entity Name, Entity Identifier and Entity Attributes. Thus, we can translate the Employees table as follows:
We can also use, for example, System.out.format(“%s %s is %d”, firstName, lastName, age), instead of getString.
If we define the objects in the try block (try-with-resources), they will be automatically closed.
Using the interface java.sql.Statement, the SQL statement is compiled at execution time.
Using prepared statements allows us to pre-fill a statement and improves performance when the
same query is executed multiple times, even with different parameters.
The parameters are indicated by question marks and set with appropriate setters.
INSERT, UPDATE and DELETE statements do not return a ResultSet object.
The executeUpdate() method executes the statement and returns the number of rows actually
inserted/modified/deleted.
We can also make transactions in JDBC, which are sets of operations that are atomic: either all of them are executed or none is. For example, we may want to update salaries in an atomic way.
We can disable autoCommit, execute the two commands, and then commit.
If the commit raises an exception, we would catch it and, in case the problem is not in the connection, we would try to roll back.
We can also set a savepoint so that, when we want to roll back, we can roll back to the savepoint.
This can be useful if the operations are complicated.
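A minimal sketch of such a JDBC transaction with autoCommit disabled, a savepoint and a rollback; the connection URL, credentials and table/column names are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Savepoint;

public class SalaryTransaction {
  public static void main(String[] args) {
    String url = "jdbc:mysql://localhost:3306/company";   // requires the MySQL JDBC driver
    try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
      conn.setAutoCommit(false);                           // start the transaction manually
      Savepoint beforeRaise = conn.setSavepoint("beforeRaise");
      try (PreparedStatement ps = conn.prepareStatement(
               "UPDATE employees SET salary = salary * ? WHERE dept = ?")) {
        ps.setDouble(1, 1.10);
        ps.setString(2, "Sales");
        int updated = ps.executeUpdate();                  // rows actually modified
        ps.setDouble(1, 1.05);
        ps.setString(2, "IT");
        updated += ps.executeUpdate();
        conn.commit();                                     // both updates or none
        System.out.println("Updated rows: " + updated);
      } catch (SQLException e) {
        conn.rollback(beforeRaise);                        // undo back to the savepoint
      }
    } catch (SQLException e) {
      e.printStackTrace();                                 // e.g. connection problems
    }
  }
}
```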
Maven
“Apache Maven is a software project management and comprehension tool. Based on the
concept of a project object model (POM).
Maven can manage a project's build, reporting and documentation from a central piece of
information.”
We will use Maven for managing the dependencies in our Java programs.
Roughly speaking, we will avoid manually adding libraries and APIs.
A POM file is an XML external file that contains information about the project and configuration
details used by Maven to build the project.
It contains default values for most projects.
Whenever we need a new dependency, we add new fields to the dependencies list.
groupId uniquely identifies a project across all projects.
artifactId is the name of the jar without version.
By default, Maven will download from the central repository.
We do not have the relation between message and user, but in the end the senderID was added to the message.
Key-value databases insights
We have different naming conventions for keys:
- Use meaningful and unambiguous naming components, such as 'cust' for customer, and
not for example cc, or 'inv' for inventory.
- Use range-based components when you would like to retrieve ranges of values. Ranges
include dates or integer counters.
- Use a common delimiter when appending components to make a key (e.g. the ‘:’ delimiter).
- Keep keys as short as possible without sacrificing the other characteristics mentioned in this list. The longer the key, the higher the memory occupation.
- A well-designed key pattern helps minimize the amount of code a developer needs to write to create functions that access and set values.
cust:1234123:firstName: “Pietro” custr:1234123:lastName: “Ducange”
In this example the code would have to remember to use 'cust' for firstName and 'custr' for lastName, which is a problem.
- Using generalized set and get functions helps improve the readability of the code and reduces the repeated use of low-level operations, such as concatenating strings and looking up values. In this way we avoid writing the same pieces of code again and again (a minimal sketch is given after this list).
The first thing to do when working with key/value databases is defining these functions.
AppNameSpace is the name of the namespace holding keys and values for this application.
If we want to use an incremental id, we should keep a counter variable inside.
- Consider to have a naming convention for namespaces.
- In production applications, we should include appropriate error checking and handling. It’s
really important to write test code.
If the code is too complicated, there must be something wrong with the choice.
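A minimal sketch of such generalized get/set helpers, assuming a hypothetical key/value client interface exposing get(String)/put(String, String); only the 'AppNameSpace' name comes from the notes above, everything else is illustrative.

```java
// Hypothetical minimal client abstraction, to keep the sketch self-contained.
interface KeyValueClient {
  void put(String key, String value);
  String get(String key);
}

public class CustomerStore {
  private static final String NAMESPACE = "AppNameSpace";
  private final KeyValueClient kv;

  public CustomerStore(KeyValueClient kv) {
    this.kv = kv;
  }

  // single place where the key pattern is defined: AppNameSpace:cust:<id>:<attribute>
  private String key(String custId, String attribute) {
    return NAMESPACE + ":cust:" + custId + ":" + attribute;
  }

  public void setCustAttr(String custId, String attribute, String value) {
    kv.put(key(custId, attribute), value);
  }

  public String getCustAttr(String custId, String attribute) {
    return kv.get(key(custId, attribute));
  }
}
```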
An important choice regards whether to use simple or complex values.
Consider the following function which retrieves both the name and the address:
The function makes 6 accesses to the database (on disk) using the getCustAttr function.
To speed up the function execution, we should reduce the number of times the developer has to call getCustAttr, or cache data in memory.
We may store commonly used attribute values together as follows:
This entire structure can be stored under an order key, such as 'ordID:781379’.
We are forcing a schema, but the advantage of using a structure such as this is that much of the
information about orders is available with a single key lookup.
But as the structure grows in size, the time required to read and write the data can increase, because the value will be stored in more than one memory block.
In general, if we need to use complex structures for the DB of our application, it is better to move towards different architectures, such as document databases.
If we use values that are too complex, performance can be affected by the latency needed to retrieve the whole object.
Limitation of Key-Value DB
- The only way to look up values is by key.
Some DBMSs for key-value DB offer APIs that support common search features, such as
wildcard searches, proximity searches, range searches, and Boolean operators. Search
functions return a set of keys that have associated values that satisfy the search criteria.
- Some key-value databases do not support range queries.
Some DBMSs (ordered key-value databases) keep a sorted structure that allows for range queries and/or support secondary indexes and some text search.
- There is no standard query language comparable to SQL for relational databases.
Case Study: K-V DBs for Mobile App Configuration
We consider a Mobile Application used for tracking customer shipments.
The following configuration information about each customer is stored in a centralized database:
- Customer name and account number
- Default currency for pricing information
- Shipment attributes to appear in the summary dashboard, for example where it started,
where it is now
- Alerts and notification preferences
- User interface options, such as preferred color scheme and font
Most of the operations on the DB will be read operations.
The most suitable solution is a key-value database, also because it is a simple application.
We must define a naming Convention for keys. We choose:
entity type:account number.
Identified entities:
- Customer information, abbreviated ‘cust’
- Dashboard configuration options, abbreviated ‘dshb’
- Alerts and notification specifications, abbreviated ‘alrt’
- User interface configurations, abbreviated ‘ui’
We use complex values, putting things together because we always have to read them together, and we use just one namespace.
The value is a JSON document; we store this information together because it will always be used together.
We specify that, for this user, alerts for pick-up and delay events must be sent to the specified contacts.
LevelDB
If data organization and management is more important than performance, classical relational databases are more suitable than key-value databases.
However, if we are more interested in performance (high availability, short response time, etc.) and/or the data model is not too complicated (no hierarchical organization, limited number of relationships), we may use key/value databases.
Indeed, key/value stores are simple and easy to handle, and data can be modeled in a less complicated manner than in an RDBMS.
If I want features similar to those offered by a cache, a key/value database may be an optimal choice.
We want to translate the data modelling from a relational model to a key/value model.
We need to remember that keys embed information regarding Entity Name, Entity Identifier and
Entity Attributes. The key must be unique, and it also must be easy to retrieve a key/value pair.
At the end of the translation process, data will be organized in a unique bucket:
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping
from string keys to string values.
Main features are:
- Keys and values are arbitrary byte arrays
- Data is stored sorted by key (lexicographic style)
- Callers can provide a custom comparison function to override the sort order
- The basic operations are put(key,value), get(key) and delete(key)
- Multiple changes can be made in one atomic batch
- Users can create a transient snapshot to get a consistent view of data
- Forward and backward iteration is supported over the data
To import LevelDB into your project you must add just one of these dependencies:
1. Port of LevelDB’s C++ version
To open/close a DB connection:
We can also destroy the database if it is already present, just to be sure that the file will be empty.
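A minimal sketch of opening, writing, reading, deleting and closing/destroying a LevelDB store, assuming the pure-Java port (org.iq80.leveldb) with its Iq80DBFactory; the database name and keys are illustrative.

```java
import java.io.File;
import java.io.IOException;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

import static org.iq80.leveldb.impl.Iq80DBFactory.asString;
import static org.iq80.leveldb.impl.Iq80DBFactory.bytes;
import static org.iq80.leveldb.impl.Iq80DBFactory.factory;

public class LevelDbExample {
  public static void main(String[] args) throws IOException {
    // destroy a previous copy of the database, so we start from an empty file
    factory.destroy(new File("exampleDB"), new Options());

    Options options = new Options();
    options.createIfMissing(true);
    DB db = factory.open(new File("exampleDB"), options);   // open the connection
    try {
      db.put(bytes("cust:1234123:firstName"), bytes("Pietro"));            // put
      String name = asString(db.get(bytes("cust:1234123:firstName")));     // get
      System.out.println(name);
      db.delete(bytes("cust:1234123:firstName"));                          // delete
    } finally {
      db.close();   // always close the connection
    }
  }
}
```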
The second version will delete a pair we inserted, while the previous one will delete a key/value pair that does not exist. The order matters.
- Iterators, which allow iterating through the DB without knowing the keys of the stored values.
A snapshot is not a complete copy of the DB; it holds the resources needed to offer a consistent view for the following operations.
If we do not close the snapshot, the differences will get bigger and we will consume a lot of resources.
If we know the first part of the key and want to get all pairs that start with that value:
A HashMap stores key/value pairs, where the key has the first declared data type and the value the second.
We will get the pairs in lexicographic order.
The last output is a dictionary, the union of the first two pieces of information, i.e. the string conversion of a HashMap.
We start from where it first finds the prefix, and we exploit the lexicographic order.
The order is lexicographic, so keys containing the numbers 10 and 11 create problems: since byte-wise (ASCII) comparison is used, they will be ordered as 1, 10, 11, 2, 3 instead of numerically.
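A minimal sketch of such a prefix scan with the org.iq80.leveldb iterator API; the method name and the use of a HashMap for the result are illustrative choices.

```java
import java.util.HashMap;
import java.util.Map;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;

import static org.iq80.leveldb.impl.Iq80DBFactory.asString;
import static org.iq80.leveldb.impl.Iq80DBFactory.bytes;

public class PrefixScan {
  // collect all key/value pairs whose key starts with the given prefix
  public static Map<String, String> findByPrefix(DB db, String prefix) throws Exception {
    Map<String, String> result = new HashMap<>();
    try (DBIterator it = db.iterator()) {
      // seek to the first key >= prefix and exploit the lexicographic (byte-wise) order;
      // note that this order puts "...:10" and "...:11" before "...:2"
      for (it.seek(bytes(prefix)); it.hasNext(); it.next()) {
        String key = asString(it.peekNext().getKey());
        if (!key.startsWith(prefix)) {
          break;   // we are past the prefix range: stop scanning
        }
        result.put(key, asString(it.peekNext().getValue()));
      }
    }
    return result;
  }
}
```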
If we want to avoid this, we can use another version of opening the DB connection:
We specify a new comparator, which is an interface provided by Java.
If i == 1 we are comparing numbers. If we find a difference (comparison != 0), we get out of the loop.
Document Databases
Document databases are non-relational databases that store data as structured documents,
usually in XML or JSON formats.
They ensure a high level of flexibility and also allow handling complex data structures (unlike key-value databases).
Document databases are schema-less.
One record is usually stored in one document, so they recall the organization of data in OOP languages.
They allow complex operations, such as queries and filtering. Some document databases also allow ACID transactions.
XML Documents
XML stands for eXtensible Markup Language.
A markup language specifies the structure and content of a document.
Tags are added to the document to provide the extra information.
XML tags give a reader some idea what some of the data means.
XML can represent almost any form of information.
XML and Cascading Style Sheets (CSS) allowed second-generation websites to separate data and
format.
XML is also the basis for many data interchange protocols and was a foundation for web service
specifications such as SOAP (Simple Object Access Protocol).
XML is a standard format for many document types, including word processing documents and
spreadsheets (docx, xlsx and pptx formats are based on XML).
Advantages:
XML is text (Unicode) based, so it can be transmitted efficiently.
One XML document can be displayed differently in different media and software platforms.
XML documents can be modularized. Parts can be reused.
XML databases are platforms that implement the various XML standards such as XQuery and XSLT. They provide services for the storage, indexing, security, and concurrent access of XML files.
XML databases did not represent a real alternative to RDBMSs.
On the other hand, some RDBMSs introduced XML support, allowing the storage of XML documents within BLOB (binary large object) columns.
We may have a storage layer in which we put the data, where we can find binary XML, the DOM,
indexes for a better navigation in the XML files.
We access the data using a processing layer where we have engines and a module for loading
documents. We can then have a client layer for interacting with the DB.
The drawbacks of XML are:
- XML tags are verbose and repetitious, thus the amount of storage required increases.
- XML documents are wasteful of space and are also computationally expensive to parse.
- In general, XML databases are used as content-management systems: collections of text
files (such as academic papers and business documents) are organized and maintained in
XML format.
- On the other hand, JSON-based document databases are more suitable to support web-
based operational workloads, such as storing and modifying dynamic contents.
JSON Documents
JSON is the acronym of JavaScript Object Notation.
It is used to format data.
Thanks to its integration with JavaScript, a JSON document has been often preferred to an XML
document for data interchanging on the Internet.
Despite the name, JSON is a (mostly) language-independent way of specifying objects as name-
value pairs
Similarities:
– Both are human readable
– Both have very simple syntax
– Both are hierarchical
– Both are language independent
– Both supported in APIs of many programming languages
Differences:
– Syntax is different
– JSON is less verbose
– JSON natively includes arrays; in XML we have to emulate them with repeated tags
– Names in JSON must not be JavaScript reserved words
In XML for each user we have to repeat opening and closing tags.
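As an illustration, the same made-up user record in the two formats (the fields are hypothetical):

<user>
  <name>Ada</name>
  <languages>
    <language>Italian</language>
    <language>English</language>
  </languages>
</user>

{ "name": "Ada", "languages": ["Italian", "English"] }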
Main features of JSON databases:
Data is stored in JSON format.
A document is the basic unit of storage. It includes one or more key-value pairs and
may also contain nested documents and arrays.
Arrays may also contain documents, allowing for a complex hierarchical structure.
A collection or data bucket is a set of documents sharing some common purpose. We can
collect items with different structures that share just some common fields.
Schema-less: the document elements do not have to be defined in advance.
Polymorphic schema: the documents in a collection may have different structures.
A schema is a specification that describes the structure of an object, such as a table.
Data modelers have to define tables in a relational database before developers can execute code
to add, remove, or update rows in the table. Altering the schema afterwards is not good practice.
Document databases do not require this formal definition step.
Developers can create collections and documents in collections by simply inserting them into the
database.
Schema-less Pros: High flexibility in handling the structure of the objects to store
Schema-less Cons: the DBMS may not be able to enforce rules based on the structure of the
data.
Let us compare a document database such as MongoDB with a relational database. If we have a
many-to-many relation, in a relational DB we would need a solution like the one shown (a bridge table),
and the RDBMS would perform all the integrity checks automatically.
We can implement this in a document database: we have documents describing films, and each of them
has an array of sub-documents specifying the actors.
Document databases usually adopts a reduced number of collections for modeling data.
Nested documents are used for representing relationships among the different entities.
Document databases generally do not provide join operations. With this approach we do not need them
if we want to retrieve the actors who acted in a film.
We will have redundancy in the information about actors: if we change the description of one actor, we
need to change it in all the film documents that contain it, we have to do different checks, and we
have to scan more information.
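For instance, a film document with embedded actors might look like the following (the fields are hypothetical):

{
  "title": "The Matrix",
  "year": 1999,
  "actors": [
    { "name": "Keanu Reeves", "role": "Neo" },
    { "name": "Carrie-Anne Moss", "role": "Trinity" }
  ]
}

Note that the same actor appears, duplicated, in every film document in which he or she acted.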
Programmers like to have the JSON structure map closely to the object structure of their code!
Adding a third collection that puts together the information about films and actors is not the right thing
to do.
Document linking approaches are usually a somewhat unnatural style for a document database.
However, for some workloads it may provide the best balance between performance and
maintainability.
When modeling data for document databases, there is no equivalent of third normal form that
defines a “correct” model.
In this context, the nature of the queries to be executed drives the approach to model data.
We collect the needs of the client, we identify the use cases and then we think about how to organize the
database. From the requirements we derive the queries we will run on our database.
We exploit a document linking approach: for each course we specify the enrolled students through their
IDs, and we use the same approach on the student side.
We have to take care of the integrity of the references when updating data in this kind of relationship:
we do not even know whether a given ID represents a student or a course.
Indeed, the DBMS will not control the referential integrity as in relational DBMSs.
We still don’t have a fixed schema.
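A hypothetical pair of linked documents could look like this (IDs and fields are illustrative):

Course document:  { "_id": "C101", "title": "Databases", "enrolled_students": ["S1", "S2"] }
Student document: { "_id": "S1", "name": "Ada", "enrolled_courses": ["C101", "C204"] }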
Depending on the queries we have to run, it could be preferable to use document embedding.
In each document we specify the ID of its parent, followed by its own information, and the same for its
parent, until we reach the root.
A possible application is a GUI in which we can select one object considering the categories.
In this way we have repeated information in each new document: truck_id, driver_name and
date are stored and transmitted each time.
At the end of the day, the DB will include 200 new documents for each truck (we consider 20
transmissions per hour, 10 working hours).
We organize a document for each day, with a header and a field containing a vector in which we specify
time and fuel_consumption. We reduce the amount of information to transmit and store.
We also have a better organization of data for making analytics.
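A hypothetical daily document of this kind (field names are illustrative):

{
  "truck_id": "T-42",
  "driver_name": "Ada",
  "date": "2023-05-10",
  "readings": [
    { "time": "08:03", "fuel_consumption": 12.4 },
    { "time": "08:06", "fuel_consumption": 12.1 }
  ]
}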
Pay attention: we can have a potential performance problem!
When a document is created, the DBMS allocates a certain amount of space for the document.
It expects that the document will be modified, so it allocates more space than needed at the
beginning.
If the document grows more than the allocated space, the DBMS has to relocate it to another
location.
Moreover, the DBMS must free the previously allocated space.
The previous steps can adversely affect the system performance.
A solution for avoiding moving oversized documents is to allocate sufficient space, filled with default
values, at the moment the document is created.
We can consider the maximum dimension of the document. If we realize later that the needed space is
bigger than the allocated one, we can use splitting: we can have more documents, one for the morning,
one for the evening and one for the night.
Regarding the previous problem, the previous solutions may be adopted.
In conclusion, we have to consider the life cycle of a document and plan, if possible, the strategies
for handling its growth.
The documents are not ordered by score, but we may need to retrieve students within a certain range
of scores. We could define an index, i.e. a data structure with a collection of pointers towards the
documents having each specific score.
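A sketch of this idea using the MongoDB Java driver (MongoDB is discussed later; the collection and field names are illustrative): the index on score lets the server answer the range query without scanning every document.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class ScoreIndexExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> students =
                client.getDatabase("school").getCollection("students");
            // Build (once) an ascending index on the score field
            students.createIndex(Indexes.ascending("score"));
            // Range query served through the index
            for (Document d : students.find(Filters.and(
                    Filters.gte("score", 24), Filters.lte("score", 30)))) {
                System.out.println(d.toJson());
            }
        }
    }
}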
In the figure we show the classical scheme of a read-heavy application (business intelligence and
analytics applications) where indexes are really useful.
In this kind of application, the use of several indexes allows the user to quickly access the
database. For example, indexes can be defined for easily retrieving documents describing objects
related to a specific geographic region or to a specific type.
The example of the truck information transmission (each three minutes) is a typical write-heavy
application.
The higher the number of indexes adopted the higher the amount of time required for closing a
write operation. Indeed, all the indexes must be updated (and created at the beginning).
Reducing the number of indexes allows us to obtain systems with fast responses to write operations.
On the other hand, we have to accept to deal with slow read operations.
In conclusion, the number and the type of indexes to adopt must be identified as a trade-off
solution.
We may think of having two databases: a first DB, maybe a relational one, tuned for writes, and another
one where we maintain the information for analytics.
Usually, in this second DB the information is not extracted in real time.
In the first one we may have raw data, which can be extracted, transformed, and loaded into the DB
devoted to analytics. These operations can also be done at night.
There is no need to enforce strict consistency on the analytics side.
We have different types of data regarding the same process: maybe we could use a document DB on the
left and a graph DB on the right.
In this way, this information is not stored in the same portion of the disk.
Sharding or horizontal partitioning is the process of dividing data into blocks or chunks.
Each block, labeled as shard, is deployed on a specific node (server) of a cluster.
Each node can contain only one shard but in case of data replication, a shard can be hosted by
more than one node. It ensures a higher availability and a lower latency.
Advantages of sharding:
Allows handling heavy loads (load balancing) and managing peaks of increase of system
users.
Data may be easily distributed on a variable number of servers that may be added or
removed by request.
Cheaper than vertical scaling (adding ram and disks, upgrading CPUs to a single server).
Combined with replications, ensures a high availability of the system and fast responses.
To implement sharding, document database designers have to select a shard key and a
partitioning method.
A shard key is one or more fields, that exist in all documents in a collection, that is used to separate
documents.
Examples of shard keys may be: Unique document ID, Name, Date, such as creation date, Category
or type, Geographical region.
Any atomic field in a document may be chosen as a shard key.
We need servers to store data and a server to manage sharding. A possible sharding technique
could be geographical.
There are three main categories of partition algorithms, based on:
Range: for example, if all documents in a collection had a creation date field, it could be
used to partition documents into monthly shards.
Hashing: a hash function can be used to determine where to place a document, for
example by hashing the key. Consistent hashing may also be used. Hashing should be
avoided if we run range or aggregate queries (e.g. averages), because otherwise we would
need to access many different servers.
List: for example, let us imagine a product database with several types (electronics,
appliances, household goods, books, and clothes). These product types could be used as a
shard key to allocate documents across five different servers. For example, all data about
electronics can be placed in a server.
If we make analytics, we may look for data from different servers. We could lose in terms of
performance.
Which one to use depends on the requirements, the collection, the queries.
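A minimal sketch of the three strategies for choosing a shard from a shard-key value (the shard counts and categories are made up):

import java.time.LocalDate;
import java.util.List;

public class ShardChooser {
    static final int NUM_SHARDS = 5;
    static final List<String> CATEGORIES =
        List.of("electronics", "appliances", "household", "books", "clothes");

    // Range partitioning: one shard per month of the creation date (12 monthly shards, 0-based)
    static int byRange(LocalDate creationDate) {
        return creationDate.getMonthValue() - 1;
    }

    // Hash partitioning: spreads keys uniformly, but range/aggregate queries hit many shards
    static int byHash(String documentId) {
        return Math.floorMod(documentId.hashCode(), NUM_SHARDS);
    }

    // List partitioning: one shard per product type
    static int byList(String productType) {
        return CATEGORIES.indexOf(productType);
    }

    public static void main(String[] args) {
        System.out.println(byRange(LocalDate.of(2023, 5, 10)));  // 4 (May)
        System.out.println(byHash("order-123"));
        System.out.println(byList("books"));                     // 3
    }
}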
Collections are sets of documents.
A collection can store documents of different types (no need of a specific structure/scheme for a
document).
In general, collections should store documents about the same type of entity.
What is the «type» of entity?
Are these two entities, or are they two instances of the same entity?
It depends on how we are going to use this data.
Let’s consider them as instances of the same entity.
Entity Name: System Event
We need to add a field, doc_type, to discriminate clickstream from server logs.
If you were to store these two document types in a single collection, you would likely need to add
a type indicator so your code could easily distinguish between a web clickstream document and a
server log event document.
If we do something like this, the data will be stored in the same disk blocks. Mixing document
types in the same collection can lead to multiple document types in a disk data block.
This can lead to inefficiencies whenever data is read from disk but not used by the application, which
filters documents based on type. If we use the two types of information separately, this is inefficient.
Filtering collections is often slower than working directly with multiple collections, each of which
contains a single document type.
If you are using disk storage, you will likely retrieve blocks of data that contain both clickstream
and server log documents. This will adversely impact performance.
If we make analysis on timestamp for example, this is not a problem but if for example, we make
queries to find clickstreams this is not a good solution.
We can use indexing, but in general we have to think about how the data will be used.
If we decide to use indexes, recall that they reference data blocks that contain both clickstream
and server log data: the disk will read both types of records even though one will be filtered out in
your application.
We don’t have to scan all of them, we can use the index and retrieve one specific data type.
In the case of a large collection and a large number of distinct "types" of documents, it may be faster
to scan the full document collection rather than use an index. Moreover, we have to update the indexes
after each update.
The tip is to store different types of documents in the same collection only if they will be
frequently used together in the application.
In general, the application code written for manipulating a collection should have:
1) A substantial amount of code that applies to all documents
2) Some amount of code that accommodates specialized fields in some documents.
A case of high-level branching, like in the picture, can indicate a need to create separate collections.
Branching at lower levels is common when some documents have optional attributes.
This example is of documents of different types used often together.
process_click_stream and process_server_log are type-specific functions: we use two different pieces of
code for processing the two kinds of data.
In the first case there is no need to put the documents together, but if we have a large common part, as
with the lower-level branching, it is correct to put the information in the same collection.
The only difference is that an ebook has a size field, which can simply be added to the description.
Let’s suppose that we have this type of data, with an abstract class Product and some
specialization.
For a specific query we have to consider one only collection, two in some cases.
We have to consider the customer too then.
Notice that: If we separate the product into different collections, and the number of product types
grows the number of collections would become unwieldy.
If you find yourself writing separate code to process different document subtypes, you should
consider separating the types into different collections. Poor collection design can adversely affect
performance and slow your application. There are cases where it makes sense to group somewhat
dissimilar objects (for example, small kitchen appliances and books) if they are treated as similar
(for example, they are all products) in your application.
Database normalization is the process of organizing data into tables in such a way as to reduce the
potential for data anomalies. An anomaly is an inconsistency in the data.
Normalization helps avoid data anomalies, but it can cause performance problems.
With normalized data, we need join operations, which must be optimized for improving
performances. It also reduces the amount of redundant data in the database.
If we use denormalized data, we may introduce redundancies and cause anomalies. If we change
information without doing it on duplicates, we may have anomalies.
On the other hand, we may improve the performances of the queries because we reduce the
number of collections and avoid join operations.
Denormalization supports improving read operations when indexes are adopted.
Designing databases entails trade-offs. You could design a highly normalized database with no
redundant data but suffer poor performance. When that happens, many designers turn to
denormalization.
As the name implies, denormalization undoes normalization—specifically, it introduces redundant
data. You might wonder, why introduce redundant data? It can cause data anomalies.
It obviously requires more storage to keep redundant copies of data. The reason to risk data
anomalies and use additional storage is that denormalization can significantly improve
performance.
When data is denormalized, there is no need to read data from multiple tables and perform joins
on the data from the multiple collections. Instead, data is retrieved from a single collection or
document. This can be much faster than retrieving from multiple collections.
In general, we have to achieve a reduced number of collections, especially involved in queries.
How do document data modelers and application developers get better performance? They
minimize the need for joins. This process is known as denormalization. The basic idea is that data
models should store data that is used together in a single data structure, such as a table in a
relational database or a document in a document database.
For those using document databases, avoiding data anomalies is still important, but they are
willing to assume more responsibility to prevent them in return for scalability and flexibility.
For example, if there are redundant copies of customer addresses in the database, an application
developer could implement a customer address update function that updates all copies of an
address.
By denormalizing the design, you could create a collection of documents that would require only
one lookup operation.
Exercise on Document DB
Put the fields in the UML diagram.
Not "search film" but directly "browse films".
Not "find by parameter" but "find film".
Not "by country" but "set country".
We can choose not to put the film information directly in the film document, keeping it separate like this.
But if we do analytics, it does not make sense to keep it separated.
We can have analytic use cases directly connected to the user; if we make analytics on a single
entity, for example film, we could connect them to "display film" with an extend relationship.
*View top rated films (specify the subject of the action)
Get top users is not that right, view user is better.
They decided to use two collections.
Putting delete on the admin is wrong; also, "view top rated by country" and "by year" should be separate use cases.
They have an array of ratings in which they put the rate a user may generate. Exploiting it they
were able to make statistics on the top-rated films.
Top-rated production = production with top-rated films; in this way it is possible to compute it quickly.
They put writers and actors together with the movies, but they did not use an array of documents
describing actors or writers, just a string, which is not a good choice: we need all the fields.
They should embed them as a vector of documents. Every time we have a one-to-many
relationship, we need to use an array.
We also have the production and the country, but there could have been a better choice for this. It
is a denormalized version that could have been better.
For the user, consider security aspects for the password, and use an array of the ratings given to movies,
with sub-information about the movies.
They also created an index for the production.
It is acceptable to have some redundancy, but not that much; queries involving 2 collections can be
accepted.
It would have been better to embed the language instead of using a separate entity.
Column Databases
We have two main activities in a system:
- OLTP - Online Transaction Processing activities: software architectures oriented towards
handling ACID transactions.
- OLAP - Online Analytics Processing activities: software architectures oriented towards
interactive and fast analysis of data. Typical of Business Intelligence Software.
We have two servers, one devoted to OLTP activities and the other to OLAP activities.
Usually, OLTP systems exploit normalized data, and the typical queries are of the kind "who bought
what".
On the other side we have a system devoted to analytics, where the interest is not in fast single-record operations.
Since the beginning of digital files, the data of each record were physically organized in rows,
which were located in one section of the disk.
OLTP processing is mainly oriented towards handling one record at a time, so there is
no problem with this organization.
When the first relational databases were designed, the world was mainly experiencing the OLTP
era.
The record-oriented workflow handled by the first relational databases and the row-oriented
physical structure of early digital files provided good performance.
In the record-based processing era (up to the end of the 80s of the last century), CRUD operations
(Create, Read, Update, Delete) were the most time-critical ones.
Reporting programs typically iterated through entire tables and were run in background
batch mode. The response time to queries was not a critical issue.
When Business Intelligence software started to spread, OLAP processing assumed a big relevance,
thus the time response to queries became a critical issue to appropriately handle.
In the last decades Data warehouses appeared.
Data warehouses: information systems in which relational databases are tasked with supporting
analytic and decision support applications.
The OLAP part was supported by another DBMS, based on a relational database but organized in an
appropriate way, devoted to providing analytic and decision-support services.
We have a separation between the two. The OLTP side produces transactions that, thanks to ETL
processes, pass from a normalized to a denormalized version; the transformed data is loaded into the
data warehouse, which is used on the OLAP side to generate analytics and reporting for the business.
We need a different organization in order to answer queries quickly.
We get a separation between OLTP and OLAP, and on the OLAP side we have the data warehouse.
It is a relational solution where central large fact tables are associated with numerous smaller
dimension tables.
In this example the fact table regards quantitative facts: quantity, price, and discount.
Then we need to know which customer, store, product, and time: these are the dimensions.
Aggregate queries could execute quickly with this organization thanks to the separation.
If we want a statistic regarding the customer, we involve the facts and the customer dimension.
Star schemas do not represent a fully normalized relational model of the data (redundancies are often
accepted, and information sometimes depends only partly on the primary key).
We have higher redundancy in …
The main drawbacks of star schemas are:
- Data processing in data warehouses remained severely CPU and IO intensive.
- The data volumes grew along with the user demands for interactive response times.
- In general, around mid 90s of the last century, the discontent with traditional data
warehousing performance increased.
If data are stored by rows, in order to retrieve the sum of salaries we must scan five blocks.
In the case of column store, a single block access is enough.
In general, queries that work across multiple rows are significantly accelerated in a columnar
database.
Columnar Compression
Data compression algorithms basically remove redundancy within data values.
The higher the redundancy, the higher the compression ratio.
To find redundancy, data compression algorithms usually try to work on localized subsets of the
data.
If data are stored in a column database, very high compression ratio can be achieved with very low
computational overheads.
As an example, often in a columnar database data are stored in a sorted order. In this case, very
high compression ratios can be achieved simply by representing each column value as a “delta”
from the preceding column value.
Column databases perform poorly during single-row modifications: at the level of single rows we can
find problems.
The overhead needed for reading rows can be partly reduced by caching and multicolumn
projections (storing multiple columns together on disk).
In general, handling a row means accessing more than one block of the disk.
Problem: The basic column storage architecture is unable to cope with constant stream of row-
level modifications.
These modifications are needed whenever the data warehouse has to provide real-time "up-to-the-
minute" information.
Solution: the delta store is an area of the database, resident in memory, uncompressed and
optimized for high-frequency data modifications. It is a row-oriented area in memory.
Data in the delta store is periodically merged with the main columnar-oriented store (often data
are highly compressed). If we can accept in a query only the information stored in the columnar
part we can use only this, otherwise queries may need to access both the delta store and the
column store in order to return complete and accurate results.
In this architecture we have the column store and the bulk loads, which load groups of data
through an ETL process from a database devoted to collecting all the transactions.
If we want to make queries that access more than one portion of the disk, we can use projections.
The super-projection contains all columns with unsorted data.
In the first projection we have redundancy, but it allows us to answer a query with just one access to the disk.
This will ensure low latency.
Projections can be created manually or through tools that suggest projections based on the queries.
Hybrid Columnar Compression (Oracle Solution)
We may need to edit rows and also need support for analytics on attributes.
At high-level, rows of data are contained within compression units of about 1 MB.
Inside these blocks we have columnar organization within smaller 8KB blocks.
BigTable was introduced in 2006 by Google and was one of the first column databases.
Main Features:
Column based with a dynamic control over columns
Data values are indexed by row identifier, column name, and a time stamp
Reads and writes of a row are atomic
Rows are maintained in a sorted order
Rows are composed of several column families. Each family consists of a set of related columns
(frequently used together).
Column values within a column family are kept together on the disk, whereas values of the
same row belonging to different families may be stored far apart.
Column families must be defined a priori (like relational tables).
Applications may add and remove columns inside families (like key-value pairs).
Whenever we work with column DBs we have to define a keyspace, which plays a role analogous to the
one we saw in key/value databases.
It is the top-level logical container. It holds column families, row keys, and related data structures.
Typically, there is one keyspace per application.
Row keys are one of the components used to uniquely identify values stored in a database.
Row keys are also used to partition and order data, creating shards that are distributed on the different
servers.
Algorithms for data partitioning and ordering depend on the specific database.
As an example, Cassandra adopts the partitioner, a specific object used for distributing and ordering data.
Moreover, by default, the partitioner randomly distributes rows across nodes.
A column, along with a row key and version stamp, uniquely identifies values.
It has the same role as a key in key/value databases.
Values may be typed, such as in Cassandra, or not, such as in BigTable.
The time stamp or other version stamp is a way to order values of a column.
Columns that are frequently used together are grouped into the same column family.
Column families are analogous to a table in a relational database.
The entries regarding the address can be put together in a column family.
However:
- Rows in columns families are usually ordered and versioned (using timestamp)
- Columns in a family can be modified dynamically.
- Modifications just require making a reference to them from the client application
We do not occupy space for null values and we can make analytics in a fast way with this
approach.
The most recent column databases are designed for ensuring high availability and scalability,
along with different consistency levels.
Clusters and partitions are fundamental for the distributed implementation of the databases.
Thanks to partitions we can help the system to balance the load.
A partition is a logical subset of a database.
Partitions are usually used to store a set of data based on some attribute of the data in column
databases, they can be generated on the basis of:
A range of values, such as the value of a row ID (all IDs in a range)
A hash value, such as the hash value of a column name or the row ID or the columnar
family name
A list of values, such as the names of states or provinces (e.g. Canada in a server and USA in
another)
A composite of two or more of the above options
We should not have to access different servers to get the information of one column family; this is
important to achieve.
Let’s see a collection of data structures and strategies that are used in Cassandra.
If we make a write operation on a server, we need to wait for the time required to store the data on disk,
and we are not sure the data will be stored in adjacent positions. There is a latency before we can return
control to the application. We can mitigate this problem with commit logs.
Commit Logs are append only files that always write data to the end of the file. We write one
record after the other.
These files allow column databases to avoid waiting too much for the definitive writes on
persistent supports.
Whenever a write operation is sent to the database, it is first stored into a commit log. Then the data is
updated in the specific server and disk block. This ensures a high level of control over failures: if we had
a failure while updating, without the commit log we would lose the information.
In this case if we have any failure, we can exploit the commit log to recover the information.
After a failure, the recovery process reads the commit logs and writes entries to partitions.
It reduces latency in write operations and can be used in case of failures to recover information.
Bloom filters are probabilistic data structures.
Bloom filters help to reduce the number of reading operations, avoiding reading partitions that
certainly do not contain a specific searched data.
Given a row key, a column and a timestamp, we are not sure that accessing the disk will actually retrieve
the data.
A Bloom filter tests whether or not an element is a member of a set: the answer of the test is “no”
or “maybe”. Thus, false positive may be obtained.
However, if the answer is "no", the element, namely the searched data, is certainly not present in the
specific partition. False negatives are not allowed: if the answer is "no" we are sure the data is
not present in the database. In this way we limit the number of disk accesses.
If we used an exact membership structure, we would have to store a lot of information and we would
need more memory, but we would be sure whether an element is present or not. With a Bloom filter we
get some certain "no" answers, while in case of "yes" we are not sure it is not a false positive. For a and c
we are sure, not for e.
Let’s see how to implement a bloom filter for a partition.
Let us consider a specific partition. A Bloom filter for this partition can be an array of n binary values.
We want to know whether that specific shard contains the data or not: before accessing the disk, we
want to know if the data is stored there.
At the beginning, all the elements are set to 0, no data are stored inside.
Let us also consider a set of k << n hash functions (n is the dimension of the array) which map each
data item to be stored in the partition into one position of the array.
- When a new data item is added (write) to the partition, all the hash functions are calculated and
the corresponding bits are set to one.
- When a read operation is required, all the hash functions are calculated and the access to
the partition is actually performed if and only if all the corresponding elements in the array
are set to 1, i.e. the k positions are all set to 1.
The maximum false positive rate depends on the values of n and k.
Let us consider a Bloom filter with n=10 and k=3.
The Insert function describes a write operation.
If we want to insert x we calculate k hash values and set them to 1.
The Lookup function describes a read operation. If we have an index set to 0 we will be sure that
y is not present in the partition.
If we had found all 1s, we would have returned "maybe": perhaps another value activated those bits.
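A minimal sketch of such a Bloom filter in Java (the way the k hash positions are derived here, double hashing, is just one common choice, not necessarily the one from the slides):

import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int n;  // number of bits in the filter
    private final int k;  // number of hash functions

    public BloomFilter(int n, int k) {
        this.bits = new BitSet(n);
        this.n = n;
        this.k = k;
    }

    // Derive the i-th hash position from two base hashes (double hashing)
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, n);
    }

    // Insert: set the k corresponding bits to 1
    public void insert(String value) {
        for (int i = 0; i < k; i++) bits.set(position(value, i));
    }

    // Lookup: "no" is certain, "maybe" can be a false positive
    public boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(value, i))) return false;  // certainly absent
        }
        return true;  // maybe present: access the partition to check
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(10, 3);  // n = 10, k = 3 as in the example
        filter.insert("x");
        System.out.println(filter.mightContain("x"));  // true ("maybe")
        System.out.println(filter.mightContain("y"));  // very likely false (a certain "no")
    }
}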
This filter cannot be flushed: if we remove one data item, we cannot set to 0 the positions given by its
hash functions, because we do not know whether other data items activated the same positions; doing
so would allow false negatives.
Removing an element from the simple Bloom filter is therefore impossible, because false negatives are
not allowed.
We reduce latency when we perform read operations with this.
The complexity of accessing the Bloom filter is constant, both for adding and for checking an element.
Important property of Bloom filters is that the time it takes both to add an element and to check if
an element is in the set is constant.
The probability of a false positive decreases as n, the dimension of the Bloom filter, increases,
while it increases as the number of elements stored in the Bloom filter increases. The longer the
vector, the lower the probability of having false positives, but it grows with the number of elements
we are storing in the DB.
Suggestion: use n = O(h) and k = (n/h)·ln 2, where h is the number of values to be checked against the filter.
If we know (or can estimate) how many elements we will insert and check, we can set the parameters in this way.
If h = 10, we could choose n = 20,30,50.
Consistency level refers to the consistency between copies of data on different replicas.
We can consider different level of consistency:
Strict consistency: all replicas must have the same data
Low consistency: the data must be written in at least one replica
Moderate consistency: all the intermediate solutions, for example a certain number of
replicas must be updated before returning the control to the application and then for
others we can have eventual consistency.
The consistency level depends on the specific application requirements. We must decide two
sides of the CAP triangle.
Transferring hash trees (Merkle trees) rather than the entire dataset allows us to transfer less data.
The first check will regard the roots, if the two roots have the same value we won’t have
differences.
If we have any differences we will explore the tree and check for differences for the roots of the
subtrees and so on, until finding the partition with differences. Once identified we can transfer
data and update the value.
One important aspect is communication protocols, that allow the node to communicate so to be
aware about the status of others.
In distributed systems, each node needs to be aware about the status of the other nodes.
A fundamental challenge in any distributed system is keeping member nodes up to date on
information about other members of the system. This is not too much of a problem when the
number of nodes in a cluster is small
The number of total messages to exchange for a complete communication is equal to n*(n– 1)/2,
where n is the number of nodes of a cluster.
As the number of nodes increases, we would spend all the time just on communication; the
fully connected approach is not acceptable.
We have three typical approaches:
Let's see how the architecture of Cassandra works for read/write operations. In the Cassandra
read protocol any node can handle a client request; which node answers depends on several
factors.
Any node knows where a piece of information is located thanks to the gossip protocol, and it acts as a proxy.
Any node in a Cassandra cluster can handle a client request. All nodes have information about the
state of the cluster and can act as a proxy for a client, forwarding the request to the appropriate
node in the cluster.
The Cassandra write protocol is similar. A write request can be received by any node, which
knows which node should be updated, i.e. the primary server holding the primary replica.
If that node is available, the information is updated there; if it is not, the receiving node keeps the
request, for example in a sort of commit log.
A hinted handoff process periodically checks the availability of the server that should be updated and
sends it the update when it becomes available.
What happens if the node interested in a write operation is down?
Two scenarios may happen:
1) The operation will be lost; thus the target node will be in an inconsistent state after its recovery
2) A specific node can be designated to receive write operations and forward them to the target
node when it becomes available.
The hinted handoff process is in charge of:
1. Creating a data structure to store information about the write operation and where it
should ultimately be sent
2. Periodically checking the status of the target server
3. Sending the write operation when the target is available
Once the write data is successfully written to the target node (primary replica), it can be
considered a successful write for the purposes of consistency and replication.
The target node will be in charge to update other eventual replicas.
When we delete an author we have to delete all of their books, so we can keep only that list.
In the main we retrieve a book and we can delete it.
In deleteBook we have the general DB methods that depend on the particular database.
When we delete an author, we scan the list, delete all the books and then delete the author from the
DB.
Column DB Tips
Queries should guide the design choices we make for our application.
Some examples of typical queries, most of them analytical, that may lead to choosing a column DB are:
How many new orders were placed in the Northwest region yesterday?
When did a particular customer last place an order?
What orders are en route to customers in London, England?
What products in the Ohio warehouse have fewer than the stock keeping minimum
number of items?
These are queries that require some sort of statistics, histograms, analytics.
Many-to-many relationships
Let’s see how to use column families in a non-conventional way with a pattern to implement
many-to-many relationships.
Keeping a copy of the product name in the customer table will increase the amount of storage
used, but we improve the read performance.
This is a form of denormalization: we introduce redundancy, because this information is also present in
the table regarding the products; similarly, in the other table we can do the same with columns
containing the id of the user.
We can use this on the analytics side, feeding it through an ETL process starting from another database on the operational side.
We will have one or more general attributes of product in a column family and some columns of
the appliance in another column family.
This information will be stored close in the disk.
A single instance of one entity, such as a particular customer or a specific product, should have all
its attributes in a single row.
In this case, typical of column DBs, some rows store more column values than others.
The figure shows the classical organization in a relational DB of an entity that can assume different
“shapes”.
In order to ensure the atomicity of the operations in a column DB, the same entity as in the figure
must be organized in just one table with 4 column families.
If we add a new product we need to redesign the table, this is the limitation.
One strategy is to define at the beginning an additional column family 'other_products', inside which
we can create any kind of column. Even if we do not fully exploit the features of the database, in this
way we can mix information regarding different entities.
One server is overloaded while the three remaining servers are underloaded.
If we read information from other systems and write it into our system in sequential order, the server
involved in these write operations will always be the same.
We can prevent hotspotting by hashing the sequential values generated by the other systems, i.e.
hashing the row key in order to distribute the load.
Alternatively, we could add a random string as a prefix of the sequential value used as row key, so that
writes are spread over the different servers.
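A small sketch of the two ideas (the prefix format and server count are arbitrary choices):

public class RowKeySalting {
    static final int NUM_SERVERS = 4;

    // Hash the sequential key so that consecutive IDs spread over the servers
    static String hashedKey(long sequentialId) {
        int bucket = Math.floorMod(Long.hashCode(sequentialId), NUM_SERVERS);
        return bucket + ":" + sequentialId;
    }

    // Alternatively, prepend a random prefix to the sequential value
    // (reads then need to check all possible prefixes)
    static String saltedKey(long sequentialId) {
        int salt = (int) (Math.random() * NUM_SERVERS);
        return salt + ":" + sequentialId;
    }

    public static void main(String[] args) {
        for (long id = 100; id < 104; id++) {
            System.out.println(hashedKey(id) + "   " + saltedKey(id));
        }
    }
}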
Another way to avoid it is to use parallel loading of the data: we have several threads in charge of
taking data from the other system, and the different threads will work in parallel on the different servers.
We will not have hotspots even if we continue to use the lexicographic order.
With a single thread in the ETL process, we risk a hotspot.
These strategies would eliminate the effects of the lexicographic order of the source file on the
data load process.
This information could be stored in a single value, but if we do so we will not exploit the features of the
database, i.e. keeping together the attributes that should be retrieved together and separating the others.
Using separate columns for each attribute makes it easier to apply database features to the
attributes, like indexing. If we want to index one attribute, the data must be quickly
retrievable, and this can be implemented using a column. If we put a JSON file inside a single value, we
cannot build an index on one element of the JSON file.
Separating attributes into individual columns allows you to use different column families if
needed.
Index: a specific data structure in a DBMS that allows the database engine to retrieve data faster
than it otherwise would.
In column databases, we can look up a column value to quickly find rows that reference that
column value.
SELECT fname, lname FROM customers WHERE state = ‘OR’;
Primary indexes are indexes on the row keys of a table (created and maintained by the DBMS).
Keys are organized in order to quickly retrieve the row key.
Secondary indexes are indexes created on one or more column value.
In order to optimize the previous query we need an index on the state.
Either the database system or your application can create and manage secondary indexes.
Most DBMSs allow us to define secondary indexes, but some DBMSs do not have this feature,
and in that case it must be implemented on the application side.
General and common sense rule:
“if we need secondary indexes on column values and the column family database system provides
automatically managed secondary indexes, then we should use them.”
In fact, in that case the code is already optimized by the DBMS.
Example: we can speedup this query
SELECT fname, lname FROM customers WHERE state = ‘OR’ AND lname = ‘Smith’ By setting an
index on the state and on the lname attributes.
Typically, the DBMS will use the most selective index first. Suppose we have 10,000 customers in Oregon
and 150 customers with the last name Smith.
The first secondary index to be used will be the one on the last name: we have only 150 records with
Smith, and among these we look for the records from Oregon, without scanning all 10,000 records. This
is done automatically by the DBMS.
The automatic use of secondary indexes has the advantage that we do not have to change our code to
use the indexes if the application requirements change.
Queries can in fact be added or updated.
In the first situation we have few distinct values, more or less equally distributed. An index will probably
not help much, especially if there are roughly equal numbers of each value: we would perform
almost the same number of scans, and we would need to create and manage an additional data structure.
In the second situation we have many distinct values: we will have an entry in the index for each
distinct value, and this is not convenient.
Indexes may not help much here because the index will have to maintain so much data it could
take too much time to search the index and retrieve the data.
In the third situation we have a highly sparse attribute, i.e. too few documents have a value for that
specific column. It is not worth creating and managing an index data structure for this attribute:
we would have problems in following this approach.
In this case, we must explicitly create and manage tables to store data we would like to access via
the index.
The programmer will be responsible for maintaining the indexes and two main strategies may be
adopted:
a) We may perform the write operation, update the table and the index, and then return control
to the application. Updating an index table during write operations keeps the data synchronized
(it is always up to date) but increases the time needed to complete a write operation: we have
extra time for concluding the write operation.
b) We may perform batch updates of the indexes: we perform the write operation, update the table and
return control; then a separate thread will update the index table. This can introduce periods of time
when the data is not synchronized, but this may be acceptable in some cases.
In column DBs the atomicity of write operations is guaranteed at the level of the column family.
We cannot use this as a transaction mechanism: we cannot ensure atomicity for a group of rows. We are
only sure that the update of the column families concludes as all or nothing.
MongoDB
mongoDB stands for “Humongous DB”
Open-source
Document-based
“High performance, high scalability”
Different configurations on CAP triangle (CP and AP mainly)
The data model is document-based (max 16 MB per document): documents are not of infinite size. If we
think an array of transactions will grow exponentially, maybe it is not a good idea to embed that array
inside a user document.
Documents are in BSON format, a binary and more efficient representation of the JSON format,
consisting of field-value pairs. With documents we have flexibility: for example, we do not have to worry
if a piece of information is missing or is additional. Each document is stored in a collection.
Collections are like the tables of relational DBs, but their documents do not need to have uniform
structures. The documents of a collection share a common set of indexes.
JSON stands for “JavaScript Object Notation”.
Easy for humans to write/read, easy for computers to parse/generate
JSON document is an unordered collection of “field: value” pairs
6 main data types (string, number, object, array, boolean, null)
Objects can be nested, we can have embedded objects
JSON objects fields are not ordered, but array elements are, if we have objects inside an
array the order will be maintained
By default, each document contains an _id field. This field has several special characteristics:
The value serves as primary key for collection.
The value is unique, immutable, and may be any non-array type.
Default data type is ObjectId, which is “small, likely unique, fast to generate, and ordered.”
Sorting on an ObjectId value is roughly equivalent to sorting on creation time.
It’s not necessary to specify it in the schema, automatically the MongoDB server will add this field.
There is a default index on _id for every collection; we do not have to ask explicitly for sorting by id
because it is supported automatically. Moreover, newly generated ObjectId values grow roughly
monotonically over time.
High Performance
- MongoDB provides high performance data persistence.
- Support for embedded data models reduces I/O activity on database system. It can be
efficient in reading, because information can be gathered inside the document.
- Indexes support faster queries and can include keys from embedded documents and
arrays. It’s not efficient to maintain a lot of indexes.
Rich Query Language
MongoDB supports a rich query language to support read and write operations (CRUD) as well as:
Data Aggregation
Text Search and Geospatial Queries.
High Availability
MongoDB’s replication facility, called replica set, provides:
automatic failover
data redundancy
Horizontal Scalability
We shard data in a collection.
MongoDB provides horizontal scalability as part of its core functionality:
Sharding distributes data across a cluster of machines.
MongoDB supports creating zones of data based on the shard key.
In a balanced cluster, MongoDB directs reads and writes covered by a zone only to those
shards inside the zone.
We do not perform a query to a specific shard but in the lower layers of MongoDB, queries will be
routed according to some needs.
The core processes in MongoDB are:
mongod: the main daemon process of the MongoDB system. It handles data requests,
manages data access (it will handle concurrency), and performs background management operations;
mongos: the controller and query router placed between client applications and the
sharded cluster. Before the query reaches a mongod process, mongos intercepts it and decides
where to send it; it manages the shards;
mongo: the interactive MongoDB Shell. The MongoDB Shell is transitioning to mongosh
which is shipped with MongoDB Compass (graphical application);
Primary and secondaries are physically inside different machines. We will have the entire DB
replicated to each node.
The write operations will be intercepted by the primary node, while the read operations can
be redirected to the secondary nodes. After the primary has been updated, the secondary
nodes will asynchronously ask for the updates.
There will be an algorithm to elect the primary node and if it goes down, a secondary node will be
elected as the new primary node.
The following is the sharding architecture.
The mongos processes intercept the queries and, coordinating with each other, route them; the config
servers configure the mongos processes so that they all have the same view of the sharding
architecture.
We can have different mongos for different application servers.
This is the overall MongoDB architecture, with the two architectures.
The main application asks the driver to perform the query. The driver sends the query, the query router
intercepts it and finds the correct shard. Within the shard the query is redirected to the primary or to a
secondary server, depending on whether it is a write or a read operation.
Main differences between MongoDB and SQL are:
In MongoDB a schema definition is not required. We can put constraints to emulate something
similar to the schema definition.
Installation requires Community MongoDB and MongoDB Compass.
The default configuration file is, on macOS, in /usr/local/etc/mongod.conf.
After installing it, it will run automatically. We consider a localhost installation where 27017 is the
default port the standalone mongod listens on. Use --port for setting a port number different from
the default one.
#Stop the running mongoDB server daemon
mongod --shutdown
#Start the mongoDB server daemon
mongod
#Start the mongoDB server daemon on port 28018
mongod --port 28018
In this way we can have different server daemons on the same machine.
MongoDB Atlas is useful for cloud deployments.
The mongo shell is an interactive JavaScript interface to MongoDB.
We can use the mongo shell to query and update data as well as perform administrative
operations.
To connect to a MongoDB instance running on localhost with a non-default port 28015:
mongo --port 28015
To connect to a MongoDB instance running on a remote server:
mongo --host mongodb0.example.com --port 28015
We can check the list of databases:
show databases (or show dbs)
We can display the database currently used:
db
We can switch databases (<database> is the name of a specific database):
use <database>
We can create both the database myNewDatabase and the collection myCollection during the
insertOne() operation (if they do not exist):
use myNewDatabase
db.myCollection.insertOne( { x: 1 } );
To look for all documents in a collection:
db.myCollection.find()
We can see the list of collections:
show collections
If we insert a document with two fields with the same name, e.g. {x:5, x:10}, we will keep only the last
value for that field: we cannot have duplicate fields, so the second one overwrites the first.
In MongoDB Compass we first need the URI of the connection to connect to the cluster. We can also
connect our DB to Atlas.
We can also create indexes for a field and specifying the order we want. In Explain Plan we can
evaluate the performance of our queries. This is a useful tool to design indexes.
In Aggregation we can sort by field ($sort {pop: -1}, descending by population) and then limit the
result on the first ($limit 1), and we will see the pipeline.
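The same pipeline ($sort by pop descending, then $limit 1) written with the MongoDB Java driver, assuming a zips-like collection with a pop field:

import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class TopPopulationExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> zips = client.getDatabase("test").getCollection("zips");
            Document top = zips.aggregate(Arrays.asList(
                    Aggregates.sort(Sorts.descending("pop")),  // $sort { pop: -1 }
                    Aggregates.limit(1)                        // $limit 1
            )).first();
            System.out.println(top == null ? "empty collection" : top.toJson());
        }
    }
}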
Graph Databases
Use graph databases only for network structures or something similar.
A graph is a collection of vertexes and edges.
A vertex represents an entity; in this context, by entity we mean an instance of an entity.
In the figure, we show instances of an entity (city) represented by a set of vertices. We are in the
context of highway networks.
The properties that the vertices, namely the cities, may have are:
• Population
• Coordinates
• Name
• Geographic Region
The relationships are specified by edges and connect two instances of the same or of different
entities.
The properties that the edges in the highway network may have are:
- Distance
- Speed limit
- Number of lanes
A weight is a commonly used property of an edge.
It represents some value about the strength of the relationship (cost, distance, etc.).
Directed edges have a direction, whereas undirected edges do not.
The presence or absence of a direction in the edges depends on the specific application.
An undirected edge can always be expressed in terms of two directed edges. If the two directions carry
the same values, it is better to use just one undirected edge to reduce the amount of memory used.
Examples (figures): highway network graphs, family relation graphs.
A path through a graph is a set of vertices along with the edges between those vertices, and allows
us to move from one entity to another.
Paths allow us to capture information about the relationships of vertices in a graph.
There may be multiple paths between two vertices. In this case (highways networks, for instance),
often the problem of finding the best (the shortest, the cheapest, etc.) path is taken into
consideration, if we have a weighted graph.
In a family tree, all elements in the path represent the ancestors of one node.
Proteins, for example, may interact with proteins of the same type (a loop on the entity).
The presence of loops may not make sense in some specific domains, such as family tree graphs.
The union of two graphs is the set of edges and vertexes put together from the two graphs.
A graph is not always connected, so if there is no edge in common, we may end up with isolated vertices.
The intersection of two graphs is the set of edges and vertexes in common of the two graphs.
Graph traversal is the process of visiting all vertices in a graph.
The purpose of this is usually to either set or read some property value in a graph, in edges or
vertexes.
The traversal of the graph is often carried out in a particular way, such as minimizing the cost of
the path.
Traversing the whole graph may be expensive: if our queries need traversal, we have to analyze whether
this traversal is really needed and whether using a graph DB is a good solution.
Two graphs are isomorphic if:
for each vertex in the first graph, there is a corresponding vertex in the other graph
for each edge between a pair of vertices in the first graph, there is a corresponding edge
between the corresponding vertices of the other graph.
The two graphs have the same vertexes and the same relations among them, even if they may be drawn differently.
In the example above, both vertices 1 and 2 will have high betweenness levels. Indeed, they form
a bottleneck in the graph and if one of them were removed, it would leave the graph
disconnected.
A graph is a collection of Vertices and Edges. It will be persistent on the disk and managed by a
DBMS.
Vertex: specifies an entity. Usually, the entities in a graph belong to the same category.
Edge: specifies a relation between two vertices. Relations may be long-term or short-term.
Vertices and edges may have properties.
A highway could be a single edge between two cities, in which case it models the road traffic in
both directions.
Alternatively, a graphical representation could use two edges, one to represent travel in each
direction (for example, east to west and west to east).
Issue: Which is the “right way” to model highways? Answer: It depends!
- If the goal is to model distance and approximate travel times between cities, then a single
edge might be sufficient.
- If we are interested in more detailed descriptions of highways, such as direction, number
of lanes, current construction areas, and locations of accidents, then using two edges is a
better option. However, we need to avoid putting too much information in the graph.
Graph databases adopt vertices and edges to store explicit relations between entities.
In relational databases, connections are not represented as links. Instead, two entities share a
common attribute value, which is known as a key.
Join operations are used in relational databases to find connections or links.
When dealing with huge amounts of data (several tables with too many rows), join operations
become computationally expensive.
Example of query: list all the courses a particular student is enrolled in.
In the relational model we need a join; in the graph, the edges between students and courses allow us
to answer the query quickly.
Instead of performing joins, at a low level we have pointers in the student object that point to the
address of the object describing the course.
If the number of entries grows too much, the join operation becomes really expensive.
Graph databases allow modeling many-to-many relations in an easier way than relational
databases.
Edges allow us to explicitly model many-to-many relations, rather than using tables.
The number of entries would be huge in a social network.
Different types of relations between entities may be handled by exploiting multiple types of edges.
Multiple types of edges can simply be obtained by specifying different values of the edge properties.
When dealing with NoSQL databases, it is not easy to decide which solution should be adopted. In
general, the database designer should have in mind the main kinds of queries that the application will
perform on the data. Graph databases are suitable for problem domains easily described in terms
of entities and relations. Typical queries are:
How many hops (that is, edges) does it take to get from vertex A to vertex B?
How many edges between vertex A and vertex B have a cost that is less than 100?
How many edges are linked to vertex A? (degree)
What is the centrality measure of vertex B?
Is vertex C a bottleneck; that is, if vertex C is removed, which parts of the graph become
disconnected?
In the previous examples of queries, there is less emphasis on:
selecting by particular properties (how many vertices have at least 10 edges?). To answer
this query we would need to traverse the whole graph.
aggregating values across a group of entities (selects all customer orders from the
Northeast placed in the last month and sum the total value of those orders)
Moreover, the previous queries are fairly abstract (computer science/math domain).
Designer needs to translate queries from the problem domain to the domain of experts working
on graphs.
Problems:
We need to make several join operations.
SQL lacks the syntax to easily perform graph traversal, especially traversals where the
depth is unknown or unbounded.
Performance degrades quickly as we traverse the graph. Each level of traversal adds
significant response time to the query.
In the following, we show a SQL query for finding all actors who have ever worked with Keanu
Reeves.
We don't know the size of the graph in advance, and it can also grow a lot.
We can exploit the Acted-in role property of the edge to specify the role played. Exploiting the
graph structure we can build the DB; at low level we will just traverse the graph by following
pointers. We can satisfy this query in the following way:
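A sketch of what such a query could look like in Cypher, assuming Person and Movie nodes, ACTED_IN relationships and a role property on the edge (names are illustrative, not necessarily the exact ones used in the figure):

MATCH (keanu:Person {name: 'Keanu Reeves'})-[r:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(coactor:Person)
RETURN coactor.name, m.title, r.role

Each hop is resolved by following pointers, so the cost does not grow with the total number of rows as it would with SQL joins.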
If we need both analytics queries and this kind of graph query, we can use MongoDB and a graph DB
together, and we will have to take care of the consistency between the two databases: we will have
a server for MongoDB and a server for Neo4j, and the application will deal with both.
If the person who connects two groups of developers is removed, we end up with two disconnected
groups, and the posts of one group will not be seen by the other.
Basic steps for designing GraphDB:
- Identify the queries you would like to perform.
- Identify entities in the graph, among all entities we have.
- Identify relations between entities
- Map domain-specific queries to more abstract queries
- Implement graph queries and use graph algorithms. We need to exploit graph algorithms, in
particular in their optimized form.
Some advice:
1. Pay attention to the dimension of the graph: the bigger the graph, the more computationally
expensive performing queries will be.
2. If available, define and use indexes for improving retrieval time. We need to take care of indexes:
keeping them updated may require traversing the whole graph.
3. The higher the number of edges, the higher the occupied memory. Use appropriate type of
edges:
a. In case of symmetrical relations, use undirected edges; otherwise use directed edges.
b. Properties require memory! If possible, use simple edges with no or few properties,
but do not exploit another DB just to store such details. Also pay attention to the data
type of the properties.
4. Watch for cycles when traversing graphs: in order to avoid endless cycles, use appropriate
data structure for keeping track of visited nodes.
Most of the famous implementations of graph databases do not consider deployment on a
cluster of servers (no data partitions nor replicas). Indeed, they are designed for running on a
single server and can only be scaled vertically.
Thus, the developer must pay special attention to the growth of:
The number of nodes and edges
The number of users
The number and size of properties
Since last year, the community version of Neo4j also makes it possible to have replicas of the graph.
Execution Times of Two Typical Algorithms:
Increasing the number of vertices, the execution time explodes for the maximum clique, i.e. the
largest group of people who all follow each other.
Parent POM
A parent pom is a pom file that can be imported into the application.
We have an application composed of 4 different modules aggregated in the parent pom. We want
to provide a service for books and one for music, so we use two modules in common plus the two
separate modules.
If one of the libraries that is part of the framework requires another library of the same framework,
you don't have to import it manually.
We have the parent containing 4 modules, and the login module requires the cache library.
Suppose we want to upgrade the login-utils version to the latest, but there is also an upgrade of the
cache module: if we are forced to specify the versions of cache and login-utils ourselves in the POM,
there is a high probability of updating login-utils without updating cache. If we use the multi-module
feature, we upgrade the dependency in the parent and not in the single modules, importing a
coherent set of versions. If there is an upgrade of two modules, everything will keep working correctly.
We don't want to expose PriceGridServiceImpl outside; we only want to expose the GridService interface.
Objects that we want to be accessible from outside have public scope; the others have package scope.
Removal of an API item is a problem: if it was public, applications using it will get compilation errors.
Rather than removing an API item, it's better to declare it deprecated, so we keep backward
compatibility.
An SPI (Service Provider Interface) is a set of items, for example an interface, that a library exposes
but does not implement, because it's up to the application to implement it based on its own logic.
The library can then invoke that implementation, so by providing it we can modify the behavior of
the library.
If something mandatory is added to the SPI, when upgrading the library version we need to add an
implementation for it to make the application work.
It’s important to separate API & SPI:
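A minimal Java sketch of this separation, reusing the GridService/PriceGridServiceImpl names from the notes; the PriceProvider SPI interface is a hypothetical example:

// API: the only type exposed to client applications.
public interface GridService {
    double getPrice(String itemId);
}

// SPI: exposed by the library but implemented by the application.
interface PriceProvider {
    double lookupPrice(String itemId);
}

// Implementation detail: package scope, so it is not visible outside the library.
class PriceGridServiceImpl implements GridService {
    private final PriceProvider provider;

    PriceGridServiceImpl(PriceProvider provider) { this.provider = provider; }

    @Override
    public double getPrice(String itemId) { return provider.lookupPrice(itemId); }

    // Kept only for backward compatibility instead of being removed.
    @Deprecated
    double oldGetPrice(String itemId) { return getPrice(itemId); }
}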
We can run a plugin in the pom to parse the source files and do this transformation.
If we want to distribute a library, it’s not a good idea.
Regression: a bug in functionality that previously worked, introduced by a later change.
Always test edge cases.
We can fix a minimum coverage of the code and accept the build only if the measured coverage is
greater than this threshold.
A common approach is:
@Test marks test methods.
@BeforeEach is executed every time before a @Test method.
@BeforeAll is executed once at the beginning.
The same holds for @AfterEach and @AfterAll.
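A minimal JUnit 5 sketch of these annotations (class and method names are illustrative):

import org.junit.jupiter.api.*;

class CalculatorTest {

    @BeforeAll
    static void initAll() { /* runs once, before all tests */ }

    @BeforeEach
    void init() { /* runs before every @Test */ }

    @Test
    void additionWorks() {
        Assertions.assertEquals(4, 2 + 2);
    }

    @AfterEach
    void tearDown() { /* runs after every @Test */ }

    @AfterAll
    static void tearDownAll() { /* runs once, after all tests */ }
}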
We have to increase the visibility scope of the constructor, removing private and flagging it with the
appropriate annotation.
The init method, executed before each of these tests, creates a new instance of the system under
test.
We create a new object because we want to test in isolation.
Since the package name is the same, production code and test code can exploit package visibility.
The one above is the original code.
After the refactoring we have the class on the right:
Mocking frameworks build the mock at run-time, and we just specify the behavior we expect.
For example: whenever the mock's getDataById method is invoked with any possible string, then
return a new Data object constructed in a given way.
We have two fields, a mock and a service under test (SUT).
We initialize the mock and allocate an instance of the SUT.
The first test verifies that the singleton pattern is working: we check that the instances returned by
the getInstance method are the same.
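A sketch of this kind of test with Mockito; DataRepository, DataService and Data are hypothetical names, while getDataById and getInstance come from the notes:

import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.Mockito.*;
import org.junit.jupiter.api.*;

class DataServiceTest {
    private DataRepository repository;   // the mock
    private DataService sut;             // the system under test

    @BeforeEach
    void init() {
        repository = mock(DataRepository.class);
        // whenever getDataById is invoked with any possible string, return a Data built this way
        when(repository.getDataById(anyString())).thenReturn(new Data("42", "stub"));
        sut = new DataService(repository);   // package-visible constructor, used only by tests
    }

    @Test
    void getInstanceAlwaysReturnsTheSameInstance() {
        Assertions.assertSame(DataService.getInstance(), DataService.getInstance());
    }
}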
CRUD operations on MongoDB
Create or insert operations add new documents to a collection.
If the collection does not currently exist, insert operations will create the collection.
In MongoDB, insert operations target a single collection.
Only aggregation will allow us to consider more than one collection.
All write operations in MongoDB are atomic on the level of a single document.
In MongoDB, each document stored in a collection requires a unique _id field that acts as a
primary key.
If an inserted document omits the _id field, the MongoDB driver automatically generates an
ObjectId for the _id field.
The insertOne() method allows us to insert a single document: db.collectionName.insertOne(<document>).
The following example inserts a new document into the inventory collection.
If the document does not specify an _id field, MongoDB adds the _id field with an ObjectId value
to the new document.
We will receive the acknowledgment with the id associated to the object if everything worked.
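A sketch of the insertOne() call described above, on the inventory collection used throughout these examples (field values are illustrative):

db.inventory.insertOne(
   { item: "canvas", qty: 100, tags: ["cotton"], size: { h: 28, w: 35.5, uom: "cm" } }
)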
insertMany() can insert multiple documents into a collection.
To this aim, an array of documents must be passed to the method.
The following example inserts three new documents into the inventory collection.
We have to specify a vector of documents.
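A sketch of the insertMany() call, passing an array of three documents (values are illustrative):

db.inventory.insertMany([
   { item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
   { item: "mat", qty: 85, size: { h: 27.9, w: 35.5, uom: "cm" }, status: "A" },
   { item: "mousepad", qty: 25, size: { h: 19, w: 22.85, uom: "cm" }, status: "P" }
])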
A query filter document can use the query operators to specify conditions in the following form:
{ <field1>: { <operator1>: <value1> }, ... }
The following example retrieves all documents from the inventory collection where status equals
either "A" or "D":
The following example retrieves all documents in the inventory collection where the status equals
"A" AND qty is less than ($lt) 30:
The following example (last query in the figure) retrieves all documents in the collection where the
status equals "A" OR qty is less than ($lt) 30:
To use the update operators, we need to pass to the update methods an update document as
follows:
The db.collection.updateOne() method on the inventory collection updates the first document
where item equals "paper":
The $set operator updates the value of the size.uom field to "cm" and the value of the status field
to "P".
The $currentDate operator updates the value of the lastModified field to the current date. If
lastModified field does not exist, $currentDate will create the field.
The db.collection.updateMany() method on the inventory collection updates all documents where
qty is less than 50:
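Sketches of the two update statements just described, assuming the usual inventory fields:

db.inventory.updateOne(
   { item: "paper" },
   { $set: { "size.uom": "cm", status: "P" }, $currentDate: { lastModified: true } }
)

db.inventory.updateMany(
   { qty: { $lt: 50 } },
   { $set: { "size.uom": "in", status: "P" }, $currentDate: { lastModified: true } }
)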
To replace the entire content of a document except for the _id field (it is immutable), pass an
entirely new document as the second argument to db.collection.replaceOne().
When replacing a document, the replacement document must consist of only field/value pairs;
i.e. it must not include update operator expressions.
The replacement document can have different fields from the original document.
The following example replaces the first document from the inventory collection where item:
"paper":
After the replacement, the document will still contain the item field (present in the new document) and the new fields, but none of the old fields.
To delete all documents from a collection, pass an empty filter document {} to the
db.collection.deleteMany() method.
The following example deletes all documents from the inventory collection:
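A sketch of the call meant here:

db.inventory.deleteMany( { } )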
We can specify criteria, or filters, that identify the documents to delete. The filters use the same
syntax as read operations.
To specify equality conditions, use <field>:<value> expressions in the query filter document:
{ <field1>: <value1>, ... }
A query filter document can use the query operators to specify conditions in the following form:
{ <field1>: { <operator1>: <value1> }, ... }
The following example removes all documents from the inventory collection where the status field
equals "A":
To delete at most a single document that matches a specified filter (even though multiple
documents may match the specified filter) use the db.collection.deleteOne() method.
We can use the mongoexport command-line tool to produce a JSON or CSV export of data stored
in a MongoDB instance. Run mongoexport from the system command line, not the mongo shell.
To export a specified collection to a specified output file from a local MongoDB instance running
on port 27017 use the following command:
mongoexport --collection=<collectionName> --db=<dbName>
--fields=<field1[,field2]> --out=<filename.json(-csv)> --type=<json/csv>
If we want to export from a specific host and port, use the following command
mongoexport --host= <hostAddress> --port=<portNumber>--collection=<collectionName>
--db=<dbName> --fields=<field1[,field2]> --out=<filename.json(-csv)> --type=<json/csv>
The mongoimport tool imports content from an Extended JSON, CSV, or TSV export created by
mongoexport, or potentially, another third-party export tool.
Run mongoimport from the system command line, not the mongo shell.
Starting in MongoDB 4.2, mongoimport expects import data to be in Extended JSON v2.0 by
default.
mongoimport only supports data files that are UTF-8 encoded; using other encodings will produce
errors.
When importing, the original order of the documents is not maintained; to preserve it we have to
use a specific option.
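A sketch of a typical mongoimport invocation; --maintainInsertionOrder is presumably the option alluded to for preserving the original order:

mongoimport --db=<dbName> --collection=<collectionName> --file=<filename.json> --maintainInsertionOrder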
In-memory Databases
The design of a database solution is driven by the application requirements.
Thus, if our application does not need to handle huge amounts of data and to scale both vertically
and horizontally, we can think about adopting solutions different from the ones discussed so far.
In-memory databases may be suitable for applications whose data fits entirely in the memory of a
single server.
Queries will be served from main memory and not from disk, so we will have low latency.
Moreover, a database can also reside in the combined memory capacity of a cluster (if replication is
needed); we can distribute the database using sharding.
If we lose power, we still have to guarantee that the data can be recovered.
Very fast access to the data, avoiding the bottleneck of I/O transfers.
Suitable for applications that do not need to write continuously to a persistent medium.
Ad-hoc architecture, aware that its data is resident in memory and designed to avoid data loss in
the event of a power failure (alternative persistency model).
Cache-less architecture: unlike traditional disk-based databases, a memory caching system is not
needed (everything is already in memory!).
In the following, we show some solutions for avoiding data loss (i.e. for data persistency) in case of
system failure:
- Writing complete database images (called snapshots or checkpoints) to disk files, frequently or
upon request.
- Writing transaction/operation records to an append-only disk file (called a transaction log or
journal) and then returning control to the application. It is similar to a commit log and can be used
to avoid losing information in case of a failure.
- Replicating data to other members of a cluster in memory.
These are the general solutions for data persistency; the architectures that adopt them are the ones
we will see now.
HANA DB is a relational database, product of SAP, which combines an in-memory technology with
a columnar storage solution. It must be installed on an optimized hardware configuration.
Classical row tables are exploited for OLTP operations and are stored in memory.
Columnar tables are often used also for OLAP operations. By default, they are stored on disk and
loaded into memory on demand. However, the database may be configured to load a set of
columns, or entire tables, into memory from the beginning.
Data in the row store is guaranteed to be in memory, while data in the column store is by default
loaded on demand. However, specified columns or entire tables may be configured for immediate
loading on database startup.
The persistence architecture of HANA uses the snapshot and journal file pattern found in Redis
and TimesTen. HANA periodically snapshots the state of memory to Savepoint files. These
Savepoints are periodically applied to the master database files.
ACID transactional consistency is enabled by the transaction “redo” log. As with most ACID-
compliant relational databases, this log is written upon transaction commit, which means that
applications will wait on the transaction log IO to complete before the commit can return control.
HANA’s columnar architecture includes an implementation of the write-optimized delta store
pattern. Transactions to columnar tables are buffered in this delta store. Initially, the data is held
in row-oriented format (the L1 delta). Data then moves to the L2 delta store, which is columnar in
orientation but relatively lightly compressed and unsorted. Finally, the data migrates to the main
column store, in which data is highly compressed and sorted.
The architecture is the following:
On start-up, row-based tables and selected column tables are loaded from the database files (1).
Other column tables will be loaded on demand. Reads and writes to row-oriented tables can be
applied directly (2). Updates to column-based tables will be buffered in the delta store (3), initially
to the L1 row-oriented store. Data in the L1 store will consequently be promoted to the L2 store,
and finally to the column store itself (4). Queries against column tables must read from both the
delta store and the main column store (5).
The row store is the classical relational part; we take snapshots of it in the Savepoint files.
On this store we perform classical read/write operations; it is used for simple queries.
The system may then decide to write data to the columnar part; it is the system that decides where
to write.
Data gradually moves towards a more compressed and better organized columnar store.
Queries against column tables address the three stores: first the main column store, then, if data is
missing, the L2 delta store, and finally the L1 delta store.
We assign to myDocument the first document pointed to by the cursor, and the cursor is then
advanced by one. If we actually have a value, we get the item field of the JSON document; we
transform the content to JSON and obtain the result.
The result will be journal, then notebook and so on, until the end, when we will get null.
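A sketch of the cursor pattern being described, on the inventory collection:

var myCursor = db.inventory.find( { } );
var myDocument = myCursor.hasNext() ? myCursor.next() : null;   // null when the cursor is exhausted
if (myDocument) {
   print(tojson(myDocument.item));   // e.g. "journal", then "notebook", and so on
}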
The limit() method limits the number of documents in the result set (if we don’t want 20 elements
in the result). The following operation returns at most 3 documents in the collection inventory.
The skip() method controls the starting point of the results set. The following operation skips the
first 3 documents in the inventory collection and returns all remaining documents:
We will see the last two objects and the first three objects will be skipped.
The sort() method orders the documents in the result set. By default they will be sorted by _id,
considering the order of insertion. The following operation returns documents in the inventory
collection sorted in ascending order by the qty field:
The following operation returns documents in the inventory collection sorted in descending order
by the qty field:
We will order by status and, among objects with the same status, in descending order by qty.
The previous cursor methods can be combined as follows:
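Sketches of the cursor methods just described and of their combination:

db.inventory.find().limit(3)                            // at most 3 documents
db.inventory.find().skip(3)                             // skip the first 3 documents
db.inventory.find().sort( { qty: 1 } )                  // ascending by qty
db.inventory.find().sort( { status: 1, qty: -1 } )      // by status, then descending by qty
db.inventory.find().sort( { qty: -1 } ).skip(3).limit(2)   // methods combined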
The method db.collection.count(query, options) returns the count of documents that would match
a find() query for the collection.
The method db.collection.count does not perform the find() operation but instead counts and
returns the number of results that match a query.
The following query uses the less than operator ($lt) on the field “h” embedded in the “size” field:
The following query selects all documents where the nested field “h” is less than 15, the nested
field “uom” equals "in", and the “status” field equals "D":
Query an Array
To specify equality condition on an array, use the query document { <field>: <value> } where
<value> is the exact array to match, including the order of the elements.
The following example queries for all documents where the field “tags” value is an array with
exactly two elements, "red" and "blank", in the specified order:
If we wish to find an array that contains both the elements "red" and "blank", without regard to
order or if there are other elements in the array, use the $all operator:
To query if the array field contains at least one element with the specified value, use the filter
{ <field>:<value> } where <value> is the element value.
The following example queries for all documents where “tags“ is an array that contains the string
"red" as one of its elements:
To specify conditions on the elements in the array field, use query operators in the query filter
document { <array field>: { <operator1>: <value1>, ... } }
The following operations queries for all documents where the array ”dim_cm” contains at least
one element whose value is greater than 25 (first query) and is greater than 20 (second query).
The following example, queries for documents where the “dim_cm” array contains elements that
in some combination satisfy the query conditions; e.g., one element can satisfy the greater than 15
condition and another element can satisfy the less than 20 condition, or a single element can
satisfy both:
The following example queries for documents where the “dim_cm” array contains at least one
element that is both greater than ($gt) 22 and less than ($lt) 30:
Using dot notation, we can specify query conditions for an element at a particular index or
position of the array. The array uses zero-based indexing.
The following example queries for all documents where the second element in the array “dim_cm”
is greater than 25:
Use the $size operator to query for arrays by number of elements. For example, the following
selects documents where the array “tags” has 3 elements.
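Sketches of the array queries with operators described above:

db.inventory.find( { dim_cm: { $gt: 25 } } )                            // at least one element > 25
db.inventory.find( { dim_cm: { $gt: 15, $lt: 20 } } )                   // conditions may be met by different elements
db.inventory.find( { dim_cm: { $elemMatch: { $gt: 22, $lt: 30 } } } )   // a single element meeting both
db.inventory.find( { "dim_cm.1": { $gt: 25 } } )                        // second element > 25
db.inventory.find( { tags: { $size: 3 } } )                             // arrays with exactly 3 elements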
1. Import the restaurants dataset from the restaurants.json file.
db.restaurants.find().count();
db.restaurants.find({"borough": "Bronx"});
db.restaurants.find({"borough": "Bronx"}).skip(5).limit(5);
db.restaurants.find({grades: {$elemMatch: {"score": {$gt: 80, $lt: 100}}}});
7. Write a MongoDB query to find the restaurants that do not prepare any cuisine of 'American',
whose grade score is more than 70 and whose latitude is less than -65.754168.
db.restaurants.find(
    {$and: [
        {"cuisine": {$ne: "American "}},
        {"grades.score": {$gt: 70}},
        {"address.coord": {$lt: -65.754168}}
    ]}
);
8. Write a MongoDB query to find the restaurants which do not prepare any cuisine of 'American',
achieved a grade point 'A' and do not belong to the borough Brooklyn.
The documents must be displayed according to the cuisine in descending order.
db.restaurants.find({
    "cuisine": {$ne: "American "},
    "grades.grade": "A",
    "borough": {$ne: "Brooklyn"}
}).sort({"cuisine": -1});
Projection operations and Aggregations in MongoDB
By default, queries in MongoDB return all fields in matching documents.
To limit the amount of data that MongoDB sends to applications, we can include a projection
document to specify or restrict fields to return.
A projection can explicitly include several fields by setting the <field> to 1 in the projection
document.
In the following example, the operation returns all documents that match the query, limited to
fields item and status.
Instead of listing the fields to return in the matching document, you can use a projection to
exclude specific fields.
The following example which returns all fields except for the status and the instock fields in the
matching documents:
With the exception of the _id field, we cannot combine inclusion and exclusion statements in
projection documents: we have to decide whether to list the fields we want to include (1) or the
fields we want to exclude (0).
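Sketches of the inclusion and exclusion projections just described, on the usual inventory collection:

db.inventory.find( { status: "A" }, { item: 1, status: 1 } )       // only _id, item and status
db.inventory.find( { status: "A" }, { status: 0, instock: 0 } )    // everything except status and instock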
For fields that contain arrays, MongoDB provides the following projection operators for
manipulating arrays: $elemMatch, $slice, and $.
$slice allows us to specify which elements of the array we want (-1 = last element).
Specifying 2 we get the first two elements (and similarly for other positive numbers); using -2 we
get the last two elements (and similarly for other negative numbers).
The $elemMatch operator projects the first matching element from an array, based on a condition.
For the first two documents nothing is projected, because the condition is on qty and their
elements do not have that field.
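Sketches of the array projection operators, assuming an instock array of { warehouse, qty } sub-documents:

db.inventory.find( { }, { item: 1, instock: { $slice: -1 } } )                        // only the last element of instock
db.inventory.find( { }, { item: 1, instock: { $elemMatch: { qty: { $gt: 25 } } } } )  // first element with qty > 25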
The { item : null } query matches documents that either contain the item field whose value is null
or that do not contain the item field.
The first document has the item field with a null value; this is possible.
It is also possible to query for documents that contain a field of a specific type.
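Sketches of the null and type queries:

db.inventory.find( { item: null } )                // item is null OR the field is missing
db.inventory.find( { item: { $type: "null" } } )   // item exists and its value is BSON null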
Aggregation
Aggregation operations process data records and return computed results.
Aggregation operations group values from multiple documents together and can perform a variety
of operations on the grouped data to return a single result.
For example, we can calculate statistics on a distribution: MongoDB can quickly group documents
and calculate statistics on the groups.
MongoDB provides three ways to perform aggregation:
• the aggregation pipeline, we consider only this
• the map-reduce function
• single purpose aggregation methods.
Documents enter a multi-stage pipeline that transforms the documents into aggregated results.
For example:
First Stage: The $match stage filters the documents by the status field and passes to the
next stage those documents that have status equal to "A".
Second Stage: The $group stage groups the documents by the cust_id field to calculate
the sum of the amount for each unique cust_id.
The id of the result will be cust_id and for each we calculate the sum of the amount.
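A sketch of this two-stage pipeline, assuming an orders collection with status, cust_id and amount fields:

db.orders.aggregate( [
   { $match: { status: "A" } },
   { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
] )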
The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as
they pass through the pipeline.
Pipeline stages do not need to produce one output document for every input document; e.g.,
some stages may generate new documents or filter out documents.
Pipeline stages can appear multiple times in the pipeline except for $out, $merge, and $geoNear
stages.
The $lookup stage retrieves information from another collection; if we use it, it must be justified.
If we consider more than one collection there is a warning; three or more collections in a single
aggregation is not allowed for sure.
$unwind must be justified too: since we are extracting information starting from an array, we need
to explain why we used an array and not a separate collection.
The new per-state documents have two fields: the _id field and the totalPop field.
Then we have the $match stage that filters the grouped documents to output only those documents
whose totalPop value is greater than or equal to 10 million.
The $match stage does not alter the matching documents but outputs the matching documents
unmodified.
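A sketch of this pipeline on the usual zipcodes collection (state and pop fields assumed):

db.zipcodes.aggregate( [
   { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
   { $match: { totalPop: { $gte: 10 * 1000 * 1000 } } }
] )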
The aggregation pipeline consists of a $group stage followed by another $group stage.
We have to consider the population of each city and then calculate the average step by step.
The first $group stage groups the documents by the combination of city and state, uses the $sum
expression to calculate the population for each combination, and outputs a document for each city
and state combination.
The second $group stage groups the documents in the pipeline by the “_id.state” field (i.e. the
state field inside the _id document), uses the $avg expression to calculate the average city
population (avgCityPop) for each state, and outputs a document for each state.
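A sketch of the two-level grouping just described:

db.zipcodes.aggregate( [
   { $group: { _id: { state: "$state", city: "$city" }, cityPop: { $sum: "$pop" } } },
   { $group: { _id: "$_id.state", avgCityPop: { $avg: "$cityPop" } } }
] )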
Return the largest and the smallest cities by State.
To solve this query we build a pipeline which consists of a $group stage, a $sort stage, another
$group stage, and a $project stage.
Let us consider the first two stages:
The first $group stage groups the documents by the combination of the city and state, calculates
the sum of the pop values for each combination, and outputs a document for each city and state
combination.
The $sort stage orders the documents in the pipeline by the pop field value, from smallest to
largest; i.e. by increasing order. This operation does not alter the documents.
Let’s consider also the second $group stage in the pipeline:
We can exploit the fact that we sorted the cities in the previous stage.
The final $project stage: i) renames the _id field to state and ii) moves the biggestCity, biggestPop,
smallestCity, and smallestPop into biggestCity and smallestCity embedded documents.
Indexes
Indexes support the efficient execution of queries in MongoDB.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a
collection, to select those documents that match the query statement.
If an appropriate index exists for a query, MongoDB can use the index to limit the number of
documents it must inspect.
Indexes are special data structures that store a small portion of the collection’s data set in an easy
to traverse form.
The index stores the value of a specific field or set of fields, ordered by the value of the field.
The ordering of the index entries supports efficient equality matches and range-based query
operations.
In addition, MongoDB can return sorted results by using the ordering in the index.
MongoDB defines indexes at the collection level and supports indexes on any field or sub-field of
the documents in a MongoDB collection.
The index is a collection of keys, i.e. the values of the indexed field, each associated with a
reference to its document.
Rather than scanning all documents, we will scan the keys inside the index and access only the
documents whose key satisfies the condition (for example, only the ones less than 30).
There’s a default index, which is the _id index.
MongoDB creates a unique index on the _id field during the creation of a collection.
A unique index ensures that the indexed fields do not store duplicate values.
The _id index prevents clients from inserting two documents with the same value for the _id field.
The index on the _id field cannot be dropped.
The following method can be used for creating an index (we need to have a collection and inside
at least one document):
db.collection.createIndex(keys, options)
keys -> a document that contains the field and value pairs where:
the field is the index key and
the value describes the type of index for that field. For an ascending index on a field,
specify a value of 1; for descending index, specify a value of -1.
options-> a document which contains a set of options that controls the creation of the index.
The method only creates an index if an index of the same specification does not already exist.
For a single-field index and sort operations, the sort order (i.e. ascending or descending) of the
index key does not matter.
We can view index names using the db.collection.getIndexes() method.
We cannot rename an index once created. Instead, we must drop and re-create the index with a
new name.
We will have a data structure in which we will have the userId in ascending order and among the
same userid we will have the score ordered in descending order.
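A sketch of such a compound index, assuming a records collection:

db.records.createIndex( { userid: 1, score: -1 } )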
Indexes can be created on keys in embedded documents in the same way that they are created on
normal keys.
Let's suppose we have a collection of documents like the following one:
Regarding the indexing of arrays, let's suppose we have a collection of posts extracted from a
social network. Each post includes a set of comments modelled as an array of documents:
To evaluate the performance of a query (the execution time and how many documents are
examined), we can chain the explain("executionStats") cursor method to the end of the find
command, as follows:
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } ).explain("executionStats") which returns the
following document:
We get the number of objects returned, the totalKeysExamined (we have no indexes here, so 0) and
the total number of documents examined.
Let define the following index: db.inventory.createIndex( { quantity: 1 } )
Let us use the command:
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } ).explain("executionStats") which returns the
following document:
When we use the index, we scan only 3 keys and 3 documents. This, together with an analysis of
the amount of read/write operations, can prove that our index is a good choice.
Consider the following query:
db.inventory.find( { quantity: { $gte: 100, $lte: 300 }, type: "food" } ) which returns:
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 5, "item" : "f3", "type" : "food", "quantity" : 300 }
We can define the following two compound indexes:
db.inventory.createIndex( { quantity: 1, type: 1 } )
db.inventory.createIndex( { type: 1, quantity: 1 } )
With compound indexes, the order of the fields matter.
The first index orders by quantity field first, and then the type field.
The second index orders by type first, and then the quantity field.
We can define both and perform the query; MongoDB will automatically use the best one. Using
the explain command, it will show us the winningPlan and the rejected plans.
Which index wins depends on the distribution of the data.
We can also use the hint() method to force MongoDB to use a specific index for the query, for
example to understand how long the query takes with that index.
Let us use the command:
db.inventory.find( { quantity: { $gte: 100, $lte: 300 }, type: "food" } ).hint( { quantity: 1, type: 1 } ).explain("executionStats")
which returns the following document:
MongoDB Java Driver
This driver allows a Java application to manipulate data stored in a MongoDB database.
The recommended way to get started using one of the drivers in your project is with a
dependency management system, such as MAVEN.
It’s possible to work with the Java MongoDB Driver importing the following dependency in the
MAVEN file:
Once we have a MongoDatabase instance, we can use its getCollection() method to access a
collection.
Specify the name of the collection to the getCollection() method. If a collection does not exist,
MongoDB creates the collection when you first store data for that collection.
For example, using the database instance, the following statement accesses the collection named
test in the mydb database:
MongoCollection<Document> collection = database.getCollection("test");
In the collection, Documents are BSON Documents but we can also use JSON Documents.
To insert a single document into the collection, we can use the collection’s insertOne() method.
collection.insertOne(doc);
To insert a set of documents contained in a list, we can use the following example of code:
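A sketch of the insert calls with the Java driver (the document fields are illustrative; the i field is reused in the query examples below):

Document doc = new Document("name", "MongoDB").append("type", "database").append("count", 1);
collection.insertOne(doc);

List<Document> documents = new ArrayList<>();
for (int i = 0; i < 100; i++) {
    documents.add(new Document("i", i));
}
collection.insertMany(documents);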
We can iterate through query results by using a consumer function (statically or locally defined):
the forEach applies printDocuments to all results of the query.
We can also collect the results in a list:
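A sketch of both approaches (Consumer and java.util imports assumed):

Consumer<Document> printDocuments = d -> System.out.println(d.toJson());
collection.find().forEach(printDocuments);                             // apply the consumer to every result

List<Document> allDocs = collection.find().into(new ArrayList<>());    // collect the results in a list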
To query for documents that match certain conditions, pass a filter object to the find() method.
To facilitate creating filter objects, Java driver provides the Filters helper.
The following example retrieves all documents in the collection where 50 < i <= 100 and prints the
returned documents:
The and operator returns a BSON Document, useful if we don’t remember precisely the structure
of the document to use as a filter.
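A sketch of such a filtered query, with the Filters helpers statically imported:

collection.find(and(gt("i", 50), lte("i", 100)))
          .forEach(d -> System.out.println(d.toJson()));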
To update documents in a collection, we can use the collection’s updateOne and updateMany
methods.
These methods need two parameters:
A filter object to determine the document or documents to update.
An update document that specifies the modifications.
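Sketches of the two update methods, assuming static imports of the Filters and Updates helpers:

collection.updateOne(eq("i", 10), set("i", 110));                                 // first matching document

UpdateResult updateResult = collection.updateMany(lt("i", 100), inc("i", 100));   // all matching documents
System.out.println(updateResult.getModifiedCount());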
To delete documents from a collection, we can use the collection’s deleteOne and deleteMany
methods.
As a parameter it requires just a filter object to select the documents to delete.
The example deletes at most one document that meets the filter i equals 110:
collection.deleteOne(eq("i", 110));
The following example deletes all documents where i is greater or equal to 100:
DeleteResult deleteResult = collection.deleteMany(gte("i", 100));
System.out.println(deleteResult.getDeletedCount());
The delete methods return a DeleteResult which provides information about the operation
including the number of documents deleted.
Import static filters, aggregations, projections and accumulators to improve the readability of your
code.
Using static imports allows us to avoid writing the class names when calling sum, match, project, etc.
The match operator filters the documents to pass only the ones that match the specified
condition(s) to the next pipeline stage.
The group operator groups input documents by the specified _id expression (first argument) and
returns one document for each distinct grouping.
This operator can include accumulators (sum, avg, max, min, ...)
If we want to group by multiple fields, the second option has to be used; we cannot use the utility
method.
The project operator passes along the documents with the requested fields to the next stage in
the pipeline. The specified fields can be existing fields from the input documents or newly
computed fields. I can exclude, include or compute new fields.
Sometimes we may want to rename the _id field, because by default the name of the field we
grouped by is not shown: it appears as the new '_id' field.
The sort operator sorts all input documents and returns them to the pipeline in sorted order.
The order can be ascending or descending.
The limit operator limits the number of documents passed to the next stage. In conjunction with
limit, we can implement queries that search, for example, for the top 3 biggest cities
The skip operator, instead, skips over the specified number of documents that pass into the next
stage.
The unwind operator deconstructs an array field from the input documents to output a document
for each element. Each output document is the input document with the value of the array field
replaced by the element.
Given a document with an array inside, after the unwind stage we will have a document for each
value of the array, with all other fields equal to the ones in the original document.
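A sketch of an aggregation built with these operators (Aggregates, Accumulators, Sorts and Filters statically imported; collection and field names are illustrative):

List<Document> topCities = collection.aggregate(Arrays.asList(
        match(eq("state", "TX")),                 // filter
        group("$city", sum("totalPop", "$pop")),  // group by city and sum the population
        sort(descending("totalPop")),             // biggest first
        limit(3)                                  // top 3
)).into(new ArrayList<>());
topCities.forEach(d -> System.out.println(d.toJson()));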
To create an index on a field or fields, pass an index specification document to the createIndex()
method. An index key specification document contains the fields to index and the index type for
each field: new Document(<field1>, <type1>).append(<field2>, <type2>) ...
For an ascending index type, specify 1 for <type>. For a descending index type, specify -1 for
<type>. The following example creates an ascending index on the city field:
collection.createIndex(new Document("city", 1));
Exercise)
ZIPS dataset
1. Find zip codes of Texan cities.
2. Find cities' zips with a population of at least 100’000, but no more than
200’000.
3. Find the 5 most populated cities.
4. For each state, find the average population of its cities.
POSTS dataset
1. Find posts published after 2012-11-20.
2. Find posts with the tag 'computer'.
3. Find, for each tag, the total number of posts.
4. Find the top three commentators according to the number of comments.
5. Find the most versatile commentator. Versatile means that he/she
commented on the highest number of distinct topics (a.k.a. tags). For a tag
to count, he/she must have at least five comments about that topic.
package main.java.exercises;
import com.mongodb.ConnectionString;
import com.mongodb.client.*;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Filters;
import com.mongodb.client.result.DeleteResult;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;
import org.bson.json.JsonWriterSettings;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.function.Consumer;
import org.bson.conversions.Bson;
//posts.aggregate(Arrays.asList(unwind, limit(10))).forEach(printDocuments());
List<Document> results = posts.aggregate(Arrays.asList(unwind, group, sort, limit, project)).into(new ArrayList<>());
System.out.println("==> 3 most popular tags and their posts titles");
results.forEach(printFormattedDocuments());

//4 - For each state, find the average population of its cities
Bson group1 = new Document("$group", new Document("_id", new Document("city", "$city")
        .append("state", "$state"))
        .append("totalPop", new Document("$sum", "$pop")));
Bson project1 = project(fields(excludeId(), computed("city", "$_id.city"),
        computed("state", "$_id.state"), include("totalPop")));
Bson group2 = group("$state", avg("avgStatePop", "$totalPop"));
Bson project2 = project(fields(excludeId(), computed("state", "$_id"), include("avgStatePop")));

ConnectionString uri = new ConnectionString(
        "mongodb+srv://admin:<password>@<cluster-address>/sample_training?retryWrites=true&w=majority");
MongoDB Clustering
We can also create a cluster made of different MongoDB servers; all replications and updates will
be handled automatically.
We can use the cluster as if it were a single, standalone instance.
For testing, we can even run a cluster of different instances on the same machine, using different ports.
In Ubuntu:
mkdir -p ~/data/r{1,2,3}
Open 3 separate terminals and start 3 MongoDB servers on replica set mode (--replSet option):
mongod --replSet lsmdb --dbpath ~/data/r1 --port 27018 --bind_ip localhost --oplogSize 200
mongod --replSet lsmdb --dbpath ~/data/r2 --port 27019 --bind_ip localhost --oplogSize 200
mongod --replSet lsmdb --dbpath ~/data/r3 --port 27020 --bind_ip localhost --oplogSize 200
They will be running but not yet forming a cluster, they don’t know how to reach other members.
To fully configure the replica set, follow these steps:
1. Open a 4th terminal and connect to one of the three instances, for example, the first one
mongo --port 27018
2. Define the following configuration:
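A sketch of the replica-set initialization meant here, assuming the three local instances started above:

rs.initiate({
   _id: "lsmdb",
   members: [
      { _id: 0, host: "localhost:27018" },
      { _id: 1, host: "localhost:27019" },
      { _id: 2, host: "localhost:27020" }
   ]
})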
- w is the minimum number of nodes that must acknowledge a write before the application regains
the flow of execution. If w = 2, when the application performs a write, the acknowledgment from
the cluster will not arrive until at least two nodes (not only the primary) have performed the write.
The default value is majority (more than half of the voting members), but it can be any integer >= 1.
The higher the value, the higher the waiting time, but we will have higher consistency.
- wtimeout is the maximum waiting time (in ms) before regaining the flow of execution. If set to 0,
the application will wait indefinitely until w is satisfied.
If the primary shuts down, a new primary will be elected.
The remaining members will periodically check whether the third member has come back up again.
Connect via VPN using the largescale.ovpn file and through ssh use user=root and pass=root.
Open 3 different terminals and issue the following commands:
#Terminal 1
ssh [email protected]
mkdir data
mongod --replSet lsmdb --dbpath ~/data --port 27020 --bind_ip localhost,172.16.4.99 --oplogSize 200
#Terminal 2
ssh [email protected]
mkdir data
mongod --replSet lsmdb --dbpath ~/data --port 27020 --bind_ip localhost,172.16.4.100 --oplogSize 200
#Terminal 3
ssh [email protected]
mkdir data
mongod --replSet lsmdb --dbpath ~/data --port 27020 --bind_ip localhost,172.16.4.101 --oplogSize 200
Only the primary will return the read value in the first case, and so on.
We can use the tags configured on our replicas to state read preferences on secondary replicas
(we cannot specify a tag preference for the primary).
Exercise)
package main.java;
import com.mongodb.*;
import com.mongodb.client.*;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.DeleteResult;
import com.mongodb.client.result.UpdateResult;
import com.mongodb.connection.ClusterDescription;
import org.bson.Document;
import org.bson.json.JsonWriterSettings;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.function.Consumer;
"mongodb+srv://admin:<password>@largescaleandmultistruc.jhdlw.mongodb.net/
<dbname>?retryWrites=true&w=majority");
MongoClient mongoClientAtlas = MongoClients.create(uri);
// Method 1
MongoClient mongoClient = MongoClients.create(
"mongodb://localhost:27018,localhost:27019,localhost:27020/"
+ "?retryWrites=true&w=majority&wtimeout=10000");
// Method 2
// You can specify one or more nodes in the replica set.
// MongoDB will automatically discover PRIMARY and SECONDARY nodes within the cluster
uri = new ConnectionString("mongodb://localhost:27018");
MongoClientSettings mcs = MongoClientSettings.builder()
.applyConnectionString(uri)
.readPreference(ReadPreference.nearest())
.retryWrites(true)
.writeConcern(WriteConcern.ACKNOWLEDGED).build();
MongoClient mongoClient2 = MongoClients.create(mcs);
/*
// Read Preferences at client level
mongoClient = MongoClients.create(
"mongodb://localhost:27018/?readPreference=secondary");
*/
// Read Preferences at DB level
MongoDatabase db = mongoClient.getDatabase("LSMDB")
.withReadPreference(ReadPreference.secondary());
myColl.find().forEach(printDocuments());
//populateDB(myColl);
Neo4J
Neo4j is a native graph database, built from the ground up to leverage not only data but also data
relationships.
Unlike traditional databases, which arrange data in rows, columns and tables, Neo4j has a flexible
structure defined by stored relationships between data records.
Each data record, or node, stores direct pointers to all the nodes it’s connected to.
Neo4j's design allows it to perform queries with complex connections orders of magnitude faster,
and with more depth, than other databases.
Also, relationships are stored as first-class entities.
MySQL does not handle multiple joins well: in MySQL we need a lot of joins to reach some
information, while here we only need to follow edges.
A "follows" relationship stores in the node the direct address of the node we want to reach, and
that node in turn stores the address of the previous node, so we can traverse the relationship even
in reverse order.
A Neo4J graph is the typical graph composed by:
- Nodes
- Labels
- Relationships
- Properties
- Indexes
- Constraints
A node represents an object in the graph. It stores data similarly to a row in a RDBMS or a
document/item in NoSQL.
A node can:
- have associated properties (schema-free)
- connect with other objects through a relationship (with a direction)
- be labeled
- be indexed
Labels are used to shape the domain by grouping nodes into sets where all nodes that have a
certain label belong to the same set.
- A node can have from zero to many labels.
- Can be added and removed dynamically
- Conventionally expressed in CamelCase
Properties are name-value pairs that are used to add qualities to nodes and relationships.
Property types comprise:
• Number, an abstract type, which has the subtypes Integer and Float
• String
• Boolean
• The spatial type Point
• Temporal types: Date, Time, LocalTime, DateTime, LocalDateTime and Duration
Properties typically contain small pieces of information, because they are stored in memory.
We can keep a subset of the information in the graph DB and another subset in the document DB,
even with some redundancy, if needed.
Constraints are used to make sure that the data adheres to the rules of the domain.
We can create constraints to:
- Enforce uniqueness (e.g. each Person node is unique)
- Enforce existence of a property in a node (e.g. each person must have the property name
defined)
In a relational DB, links between tables are provided using foreign keys, and possibly a lookup table
(the central index in the analogy).
Traversing a foreign key requires an index lookup, and the operation costs the same every time we
repeat the query.
The purpose of graphs is to do rapid traversal. The RDBMS model is too expensive for that.
Typically, in a RDBMS the index is stored in a B+ Tree, so the complexity for searching is O(logN).
Cypher is a declarative query language that focuses on what to get, not how to retrieve it.
Uses keywords such as MATCH, WHERE, CREATE.
Cypher queries run on the DB server where the graph is loaded into the memory.
The server must have enough memory to perform the query.
In the following, we show an example for creating a node in a small social graph:
As shown in the following, CREATE clauses can create many nodes and relationships at once:
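Sketches of the CREATE statements meant here, based on the classic Neo4j social-graph example (names and properties are illustrative):

CREATE (ee:Person { name: 'Emil', from: 'Sweden' })

MATCH (ee:Person) WHERE ee.name = 'Emil'
CREATE (js:Person { name: 'Johan', from: 'Sweden', learn: 'surfing' }),
       (ee)-[:KNOWS { since: 2001 }]->(js)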
With MERGE, the database won't change if there is already a node with that name.
If we merge a node that doesn't exist, a new node will be added.
MERGE checks the node, which has to have the same set of labels and properties.
MERGE covers two cases, create or match: if it matches, there is no effect; otherwise a new node is
created.
We can also merge a node and set properties only if the node needs to be created: if the merge
creates a new node, we put the timestamp in a new attribute.
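Sketches of the MERGE behaviour described above:

MERGE (p:Person { name: 'Johan' })                 // matched if it already exists, created otherwise

MERGE (k:Person { name: 'Kate' })
ON CREATE SET k.created = timestamp()              // set the property only when the node is actually created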
Pattern Matching
We can match for a pattern:
• MATCH clause to describe the pattern from known Nodes to found Nodes
• (ee) starts the pattern with a Person (qualified by WHERE)
• -[:KNOWS]-matches "KNOWS" relationships (in either direction)
• (friends)will be bound to Johan’s friends
Pattern matching can be used to make recommendations. Johan is learning to surf, so he may
want to find a new friend who already does:
• () empty parentheses ignore these nodes: there is a node in the pattern, but we are not interested
in binding or labelling it
• DISTINCT because more than one path may match the pattern, and the same result could
otherwise be returned twice
• Surfer will contain Allison, a friend of a friend who surfs
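Sketches of the two pattern-matching queries just described (names follow the notes; the hobby property is assumed):

MATCH (ee:Person)-[:KNOWS]-(friends)
WHERE ee.name = 'Johan'
RETURN ee, friends

MATCH (js:Person)-[:KNOWS]-()-[:KNOWS]-(surfer)
WHERE js.name = 'Johan' AND surfer.hobby = 'surfing'
RETURN DISTINCT surfer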
In the following we show examples of how to set or update a property for a specific node.
We first have to search for that node/relationship; once matched, we can set the property to a new
value.
The setting may depend on the result of a filtering: we set the property to the new value only if the
condition is satisfied.
Properties on relationships can be set or modified in the same way.
The REMOVE clause removes a property, for example the property worksIn from all the nodes with
label Person.
To delete a relationship of a node, we first find its endpoints.
Deleting a node returns an error if the node still has relationships; it works only if the node is
isolated. To avoid this we have to use DETACH DELETE, which can also be used to delete all nodes
and relationships.
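Sketches of the update and delete statements described above (labels and property names are illustrative):

MATCH (p:Person { name: 'Johan' }) SET p.worksIn = 'Pisa'                           // set or update a property
MATCH (p:Person) WHERE p.age >= 18 SET p.adult = true                               // set only if the condition holds
MATCH (p:Person) REMOVE p.worksIn                                                   // remove a property
MATCH (:Person { name: 'Johan' })-[k:KNOWS]->(:Person { name: 'Kate' }) DELETE k    // delete a relationship
MATCH (p:Person { name: 'Kate' }) DELETE p                                          // fails if relationships still exist
MATCH (n) DETACH DELETE n                                                           // delete all nodes and relationships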
In the first query we will also get Tom Hanks himself in the output; if we don't want him, we need
to add WHERE tom <> coactors.
In the second query we want no more than 4 hops of distance, so we use a range from 1 to 4 on the
relationship; we can avoid specifying the relationship type in the pattern.
In the third query we can have any number of hops, and p takes the shortest path.
strength: how strongly co-co-actors are connected to Tom Hanks, i.e. how many ways there are to
reach them (movies or co-actors in common).
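Sketches of the three queries being discussed, following the standard movies dataset (the second endpoint of the shortest path is illustrative):

MATCH (tom:Person { name: 'Tom Hanks' })-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(coactors)
WHERE tom <> coactors
RETURN DISTINCT coactors.name

MATCH (tom:Person { name: 'Tom Hanks' })-[*1..4]-(hollywood)
RETURN DISTINCT hollywood

MATCH p = shortestPath( (tom:Person { name: 'Tom Hanks' })-[*]-(other:Person { name: 'Kevin Bacon' }) )
RETURN p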
A Neo4j client application will require a driver instance in order to provide access to the database.
• This driver should be made available to all parts of the application that need to interact with
Neo4j.
• The driver can be considered thread-safe, we don’t have to lock resources to avoid deadlocks or
other problems.
• Applications will typically construct a driver instance on startup and destroy it on exit.
Every time we open the driver instance, we must also close it.
• Destroying a driver instance will immediately shut down any connections previously opened via
that driver, and temporary results will be released.
If we have a driver object open but no connection open, then we are not consuming resources.
To construct a driver instance, a connection URI and authentication information must be supplied.
The default username is neo4j and the password is what we defined.
• Additional configuration details can be supplied if required.
• All of these details are immutable for the lifetime of the driver.
• Therefore, if multiple configurations are required (such as when working with multiple database
users) then multiple driver instances must be used.
• When we work with a single server database, we can create a direct driver via bolt URI, for
example: bolt://localhost:7687.
For a cluster, please use the neo4j URI, for example neo4j://localhost:7687, because if we use bolt,
all requests will go directly to the single bolt server we specified, without routing.
Authentication details are provided as an auth token which contains the usernames, passwords or
other credentials required to access the database.
Neo4j supports multiple authentication standards but uses basic authentication by default.
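A sketch of driver construction with basic authentication (URI and credentials are illustrative):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;

public class Neo4jConnection implements AutoCloseable {
    private final Driver driver;

    public Neo4jConnection(String uri, String user, String password) {
        // use neo4j://host:7687 instead of bolt://host:7687 when connecting to a cluster
        driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password));
    }

    @Override
    public void close() {
        driver.close();   // shuts down every connection opened via this driver
    }
}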
A session works within a session context, and within the session context we have different
transactions.
Within a session, transactions are performed according to the CAUSALITY constraint.
This is also guaranteed within a cluster.
The bookmarking system allows Neo4j to recognize which members of a cluster can execute a
transaction (only up-to-date members).
If we bookmark a transaction, Neo4j will look for the members that are up to date with respect to
that transaction.
The routing layer will search for an available member to perform the transaction while respecting
the causality constraints (for example, a read after a write).
We just specify what kind of transaction we want, and it will be handled by the driver.
Sequential transactions can also be executed by different members of the cluster, but in a causally
consistent manner.
Transactions are atomic units of work consisting of one or more Cypher statement executions.
We execute all statements or none.
A transaction is executed within a session.
• To execute a Cypher statement, two pieces of information are required: the statement template,
a string containing placeholders, and a keyed set of parameters whose values will be substituted
into the template.
• The template is a string containing placeholders that are substituted with parameter values at
runtime.
• The Neo4j driver API provides for three forms of transaction:
- Auto-commit transactions
With auto-commit, we run a statement directly on the session. If we have just a single-statement
transaction, it can be executed directly on the session and it will auto-commit.
If it fails, the driver will not retry the command.
- Transaction functions
If the function fails, the system will try to re-execute it, up to a maximum number of retries and
within a configurable timeout.
- Explicit transactions
Of these, only transaction functions can be automatically replayed on failure.
Transaction functions are the recommended form for containing transactional units of work.
Transaction functions are also able to handle connection problems and transient errors using an
automatic retry mechanism.
The retry capability can be configured on Driver construction.
We have a session, but we don't call session.run() directly; we call writeTransaction(), which needs
an anonymous inner class (or a lambda), and inside its execute method we put the code of the
transaction.
We need to specify the return type of the transaction in the function we pass.
In this case we can assign the result of writeTransaction to an integer; if we specify a List of String,
or Void, inside <>, we will have a different return type.
An anonymous inner class does not have a name.
Considering access modes, transactions can be executed in either read or write mode.
In a Causal Cluster, each transaction will be routed by the driver to an appropriate server based on
the mode.
When using a single instance, all transactions will be passed to that one server.
Access modes can be supplied in two ways: per transaction or per session.
In the general case, access mode should always be specified at transaction level, using transaction
functions (writeTransaction or readTransaction).
The session-level setting is only necessary for explicit and auto-commit transactions.
Note that the driver does not parse Cypher and cannot determine whether a transaction is
intended to carry out read or write operations.
The driver does not recognize an operation as a write just because it is a CREATE or similar
statement; we need to use the explicit transaction functions.
As a result of this, a write transaction tagged for read will be sent to a read server but will fail on
execution.
Here we have both a write and a read transaction within a session.
Note: StatementResult has been renamed Result in recent driver versions.
We get the first value with get(0); we can specify either the position or the name of the column we
want to get, and we need to convert it to the correct type.
Cypher
These are the Cypher types that always have the corresponding Java types.
package main.java;
import org.neo4j.driver.*;
import org.neo4j.driver.types.Node;
import org.neo4j.driver.types.Path;
import java.util.ArrayList;
import java.util.List;
import static org.neo4j.driver.Values.parameters;
public void addPerson( final String name, final String from, final int age )
{
try ( Session session = driver.session() ){
session.writeTransaction((TransactionWork<Void>) tx -> { //Void because we don’t want to
return anything
tx.run( "MERGE (p:Person {name: $name, from: $from, age: $age})",
parameters( "name", name, "from", from, "age", age ) );
// MERGE because if it already exists, CREATE will return an exception
// we always have to execute write operations that are idempotent
// we also have to handle exceptions
return null;
});
}
}
//Find the actor who acted in the largest number of movies
private void printMostFamousActor()
{
try ( Session session = driver.session() )
{
String mostFamousActor = session.readTransaction((TransactionWork<String>) tx -> {
String query = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) " +
"RETURN p.name AS Name, count(*) as NumMovies " +
"ORDER BY NumMovies DESC " +
"LIMIT 1";
Result result = tx.run( query );
return result.single().get("Name").asString();
});
System.out.println(mostFamousActor);
}
}
// If there is at least one member of the list to satisfy this condition, this will be true.
Result result = tx.run( query,
parameters("name1", firstActor, "name2", lastActor, "name3", middleActor));
// We can get the path (neo4j type)
Path path = result.single().get("path").asPath();
String pathString = "--";
for(Node n : path.nodes())
{
if(n.hasLabel("Person"))
{
pathString += n.get("name").asString() + "--";
}
else if(n.hasLabel("Movie"))
{
pathString += n.get("title").asString() + "--";
}
}
System.out.println("The length of the path is " + path.length());
System.out.println(pathString);
return null;
});
}
}
}
1. Implement the following queries in Neo4J Java, using the movies graph database
a) Find a title which contains a certain string
b) Find a movie by title and display the cast
c) Find the director that managed the least number of distinct actors
d) Find the top 3 movies with the highest ratio between number of actors and number of
producers
e) Find the shortest path from the youngest actor and the oldest director
package main.java;
import org.neo4j.driver.*;
import java.util.ArrayList;
import java.util.List;
import static org.neo4j.driver.Values.parameters;
@Override
public void close() throws Exception
{
driver.close();
}
return null;
});
}
}
//Find the director that managed the least number of distinct actors
private void printLeastPopularDirector()
{
try ( Session session = driver.session() )
{
session.readTransaction((TransactionWork<Void>) tx -> {
//If we use different variable names, we are stating the people are different
String query = "MATCH (p:Person)-[:DIRECTED]->(m)<-[:ACTED_IN]-(p2:Person)\n" +
"RETURN p.name AS Name, count(DISTINCT p2.name) AS numDistinctActors\n" +
"ORDER BY numDistinctActors\n" +
"LIMIT 1";
Result result = tx.run(query);
while(result.hasNext())
{
Record r = result.next();
String name = r.get("Name").asString();
int nActors = r.get("numDistinctActors").asInt();
System.out.println("The least popular director is " + name + ".");
System.out.println(name + " worked with only " + nActors + " actors");
}
return null;
});
}
}
//Find the top 3 movies with the highest ratio between number of actors and number of producers
private void printTop3MoviesWithHighestActorsProducersRatio()
{
try ( Session session = driver.session() )
{
// we have to take care of duplicates
// for each movie we get the number of actors, then we match the movies produced by a person;
// we need the WHERE clause or we would get the cartesian product: we filter the second match by the result of the previous one
session.readTransaction((TransactionWork<Void>) tx -> {
String query = "MATCH (m:Movie)-[:ACTED_IN]-(p:Person)\n" +
"WITH m.title AS Title, count(*) AS nActors\n" +
"MATCH (m2:Movie)-[:PRODUCED]-(p2:Person) WHERE m2.title = Title\n" +
"RETURN m2.title AS Title, toFloat(nActors / count(*)) AS Ratio, nActors\n" +
"ORDER BY Ratio DESC\n" +
"LIMIT 3";
Result result = tx.run(query);
while(result.hasNext())
{
Record r = result.next();
String title = r.get("Title").asString();
double ratio = r.get("Ratio").asDouble();
System.out.println("\"" + title + "\" with a ratio of " + ratio);
}
return null;
});
}
}
//Find the shortest path from the youngest actor and the oldest director
private void printShortestPathFromYoungestToOldest()
{
try ( Session session = driver.session() )
{
String youngestActor = session.readTransaction((TransactionWork<String>) tx -> {
//Find the youngest actor
String query = "MATCH (m:Movie)-[:ACTED_IN]-(p:Person) " +
"WHERE EXISTS(p.born) " + //if we don’t put it it will return actors with born = null,
and for Neo4j null < any integer
"RETURN p.name as Name, p.born as Birth " +
"ORDER BY Birth DESC " +
"LIMIT 1";
Result result = tx.run(query);
Record r = result.next();
String name = r.get("Name").asString();
int birth = r.get("Birth").asInt();
System.out.println("The youngest actor is " + name + " who is born in " + birth);
return name;
});