Architecture of parallel databases
Parallel databases are designed to store and process large volumes of data by distributing the
workload across multiple processors or nodes in a cluster. The architecture of parallel databases can
be divided into four main components:
1. Compute nodes: These are the individual machines that make up the parallel database
cluster. Each node typically consists of a CPU, memory, and storage.
2. Network: A high-speed interconnect is used to allow the compute nodes to communicate
with each other. The network architecture can range from a simple Ethernet-based network
to a high-performance InfiniBand network.
3. Storage subsystem: This component is responsible for storing and retrieving data. The
storage subsystem is usually made up of a combination of disk-based and memory-based
storage, with the goal of optimizing performance and cost-effectiveness.
4. Database management system (DBMS): The DBMS is responsible for managing the overall
system and coordinating the activities of the compute nodes. It typically includes a query
optimizer, which is responsible for determining the most efficient way to execute a given
query, and a data distribution module, which is responsible for partitioning the data across
the compute nodes.
In addition to these four main components, parallel databases also typically use a shared-nothing
architecture, which means that each compute node operates independently and does not share
memory or disks with other nodes. This allows for efficient scaling of the system as more compute
nodes can be added without requiring changes to the architecture.
Distributed database architecture
Distributed databases are designed to store and manage large volumes of data across multiple
servers or nodes, typically located in different geographic locations. The architecture of a distributed
database can be divided into two main components:
1. Distributed storage: The data is distributed across multiple nodes, often using a technique
called sharding or partitioning. Each node stores a subset of the data and can process queries
on that subset independently. The distribution of data can be based on various criteria, such
as range-based partitioning, hash-based partitioning, or key-based partitioning.
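To make the partitioning criteria concrete, here is a minimal Python sketch of hash-based and range-based partitioning. The node count, range boundaries, and function names are illustrative assumptions, not part of any particular system.

```python
# Sketch of two common partitioning schemes, assuming 4 nodes and integer keys.

NUM_NODES = 4

def hash_partition(key):
    """Assign a row to a node by hashing its key modulo the node count."""
    return hash(key) % NUM_NODES

def range_partition(key, boundaries=(100, 200, 300)):
    """Assign a row to a node by comparing its key against range boundaries."""
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)  # the last node takes the tail range

rows = [42, 150, 250, 999]
print([hash_partition(k) for k in rows])
print([range_partition(k) for k in rows])
```

Hash partitioning spreads load evenly but makes range scans expensive; range partitioning keeps adjacent keys together at the cost of possible skew.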
2. Distributed processing: The processing of queries is distributed across multiple nodes, with
each node processing the portion of the query that pertains to its subset of the data. The
coordination of the query processing across the nodes is managed by a distributed query
processor, which ensures that the results are merged correctly and returned to the user.
In addition to these two main components, a distributed database also includes a number of other
components, such as a distributed transaction manager, which ensures that transactions are
processed correctly across multiple nodes, and a distributed cache, which helps to improve
performance by caching frequently accessed data locally on each node.
One of the key benefits of a distributed database is scalability. As the amount of data and the
number of users accessing the database grows, more nodes can be added to the system, allowing it
to handle larger workloads. Another benefit is fault tolerance, as the distribution of data and
processing across multiple nodes means that the system can continue to operate even if some nodes
fail or become unavailable. However, designing and managing a distributed database can be
complex, and requires careful consideration of factors such as data consistency, availability, and
performance.
Intra-query and inter-query parallelism
In a parallel database, inter-query parallelism and intra-query parallelism are two key concepts used to optimize query processing.
Inter-query parallelism refers to the ability of a parallel database system to execute multiple queries
simultaneously. In other words, multiple users can submit their queries at the same time, and the
system can execute them in parallel, leading to faster query processing.
Intra-query parallelism, on the other hand, refers to the ability of a parallel database system to
execute a single query using multiple processors or nodes simultaneously. The query is divided into
smaller sub-queries, each of which is executed in parallel on a different processor or node. The
results of the sub-queries are then combined to produce the final result.
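The divide-and-merge pattern of intra-query parallelism can be sketched in a few lines: a single aggregate query (a SUM over a table) is split into sub-queries, one per partition, executed on a worker pool, and the partial results are merged. The data and partition layout below are made up for illustration.

```python
# Intra-query parallelism sketch: one logical query, many parallel sub-queries.
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [10, 20, 30],   # partition held by node 0
    [5, 15],        # partition held by node 1
    [100],          # partition held by node 2
]

def sub_query(partition):
    """Each worker computes SUM over its own partition only."""
    return sum(partition)

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(sub_query, partitions))

total = sum(partials)  # the merge step combines the partial results
print(total)
</```

In a real parallel DBMS the workers are separate nodes and the merge runs on a coordinator, but the structure is the same.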
Both inter-query parallelism and intra-query parallelism can significantly improve the performance of
a parallel database system, especially when dealing with large amounts of data. By dividing the work
across multiple processors or nodes, parallel databases can process queries much faster than
traditional serial databases.
Explain the data fragmentation and allocation techniques used in distributed
databases. (4 marks)
In a distributed database, data fragmentation refers to the process of dividing the database into
multiple fragments, each of which can be stored on different nodes of the network. This can help to
improve performance by allowing parallel processing and reducing data transmission requirements.
There are several techniques for data fragmentation:
1. Horizontal fragmentation: In this technique, each fragment contains a subset of rows of a
table. For example, if we have a table with customer information, we can fragment it
horizontally such that each fragment contains information about customers from a particular
region.
2. Vertical fragmentation: In this technique, each fragment contains a subset of columns of a
table. For example, we can fragment the customer information table vertically such that one
fragment contains personal information like name and address, while the other contains
transactional information like purchases and payments.
3. Hybrid fragmentation: In this technique, both horizontal and vertical fragmentation are used
to divide the database.
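Horizontal and vertical fragmentation can be illustrated on an in-memory "customer" table (a list of dicts); the column and fragment names are illustrative assumptions.

```python
# Sketch of horizontal vs. vertical fragmentation of a customer table.

customers = [
    {"id": 1, "name": "Asha",   "region": "North", "purchases": 5},
    {"id": 2, "name": "Ben",    "region": "South", "purchases": 2},
    {"id": 3, "name": "Chitra", "region": "North", "purchases": 9},
]

# Horizontal fragmentation: each fragment holds a subset of ROWS (here, by region).
north_fragment = [row for row in customers if row["region"] == "North"]
south_fragment = [row for row in customers if row["region"] == "South"]

# Vertical fragmentation: each fragment holds a subset of COLUMNS; the primary
# key is kept in every fragment so the rows can be rejoined later.
personal_fragment = [{"id": r["id"], "name": r["name"]} for r in customers]
transactional_fragment = [{"id": r["id"], "purchases": r["purchases"]} for r in customers]

print(len(north_fragment), len(south_fragment))
```

Hybrid fragmentation would simply apply the vertical split to each horizontal fragment (or vice versa).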
Data allocation refers to the process of mapping database fragments to specific nodes in the
network. There are several techniques for data allocation:
1. Centralized allocation: In this technique, a central node is responsible for allocating database
fragments to nodes in the network. This can be simple to implement but can result in a
bottleneck if the central node becomes overwhelmed.
2. Decentralized allocation: In this technique, each node is responsible for allocating fragments
to other nodes in the network. This can reduce the burden on a central node but can result
in fragmentation issues if nodes do not communicate effectively.
3. Replication allocation: In this technique, each fragment is replicated on multiple nodes to
ensure high availability and fault tolerance. This can increase data consistency but can also
result in increased storage requirements.
Client-server architecture
Client-server architecture is a computing model in which a client communicates with a server to
request services or resources. The server processes the requests and returns the results back to the
client. The client and server can be on different computers or on the same computer.
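A minimal sketch of the request/response cycle, using a local TCP socket: the server receives a request, processes it (here, by upper-casing it), and returns the result to the client. Running both ends in one process with a thread is purely for demonstration.

```python
# Client-server sketch over a loopback TCP socket. Port 0 asks the OS
# for any free port; a real deployment would use a fixed, known port.
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def serve_one():
    conn, _ = server.accept()
    with conn:
        data = conn.recv(1024)      # receive the client's request
        conn.sendall(data.upper())  # process it and return the result

t = threading.Thread(target=serve_one)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello server")
reply = client.recv(1024)
client.close()
t.join()
server.close()
print(reply)
```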
Section D
Active databases are databases that are able to perform actions automatically in response to changes
in the data or other events. Designing active databases requires careful consideration of the
principles that guide their functionality. Here are some design principles for active databases:
1. Event-driven design: Active databases should be designed with a focus on events that trigger
actions. This means that the database must be able to detect events, such as changes in
data, and respond to them by executing predefined actions.
2. Rule-based processing: Active databases should be designed to process rules that define
actions to be taken based on certain events. These rules can be simple or complex and may
involve conditions and actions that are triggered by specific events.
3. Real-time processing: Active databases should be able to process data in real time, responding to events and executing actions without delay.
4. Transactional consistency: Active databases should be designed to maintain transactional
consistency, which means that they should be able to guarantee that all changes to the data
are complete and consistent, even if there are multiple users or processes accessing the data
simultaneously.
5. User control: Active databases should be designed to give users control over the actions that
are taken in response to events. This means that users should be able to define the rules and
actions that are executed when events occur.
6. Performance optimization: Active databases should be designed to optimize performance,
which means that they should be able to process events and execute actions quickly and
efficiently.
7. Error handling: Active databases should be designed to handle errors gracefully, which
means that they should be able to detect and respond to errors and exceptions in a way that
does not compromise the integrity of the data.
By adhering to these principles, active databases can be designed to provide automated functionality
that enhances the usability and functionality of the database.
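The event-driven and rule-based principles above are usually described together as the event-condition-action (ECA) model. Here is a minimal sketch of it; the rule table, event names, and row shapes are illustrative assumptions, not a real trigger API.

```python
# ECA rule sketch: when an event occurs, every registered rule for that
# event checks its condition and, if it holds, runs its action.

audit_log = []

# Each rule is an (event, condition, action) triple.
rules = [
    ("update_stock",
     lambda row: row["quantity"] < 10,                          # condition
     lambda row: audit_log.append(f"reorder {row['item']}")),   # action
]

def fire(event, row):
    """Detect an event and run every matching rule whose condition holds."""
    for ev, condition, action in rules:
        if ev == event and condition(row):
            action(row)

fire("update_stock", {"item": "widget", "quantity": 3})
fire("update_stock", {"item": "gear", "quantity": 50})
print(audit_log)
```

In a production DBMS the same idea appears as triggers (`CREATE TRIGGER ... AFTER UPDATE ... WHEN ...`).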
XML (Extensible Markup Language) databases
An XML (Extensible Markup Language) database is a database management system that is specifically
designed to store, manage, and retrieve XML data. Unlike traditional relational databases, XML
databases are optimized for working with XML data and provide a more flexible data model that can
handle complex, hierarchical data structures.
There are two main types of XML databases:
1. Native XML databases: These databases are designed specifically for storing and managing
XML data. They store XML documents directly and provide native support for XML-specific
features such as XPath and XQuery.
2. XML-enabled databases: These databases provide some support for XML data, but are
primarily designed for traditional relational data. They typically store XML data as text in a
BLOB (Binary Large Object) column and provide limited support for XML-specific features.
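The XPath-style querying that native XML databases provide can be sketched with the limited XPath subset in Python's built-in xml.etree.ElementTree (systems such as eXist-db or BaseX support full XPath and XQuery). The sample catalog document is made up.

```python
# Querying hierarchical XML data with an XPath-style expression.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
  <book genre="db"><title>Database Systems</title></book>
  <book genre="xml"><title>XQuery Basics</title></book>
  <book genre="db"><title>Transaction Processing</title></book>
</catalog>
""")

# Titles of all books whose genre attribute is "db".
titles = [b.findtext("title") for b in doc.findall("./book[@genre='db']")]
print(titles)
```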
Some of the key features and benefits of XML databases include:
1. Flexible data model: XML databases provide a flexible data model that can handle complex,
hierarchical data structures, making them well-suited for applications such as content
management, e-commerce, and scientific research.
2. Efficient querying: XML databases provide advanced querying capabilities using XML-specific
query languages such as XPath and XQuery, allowing for efficient retrieval and manipulation
of XML data.
3. Integration with other technologies: XML databases can integrate easily with other
technologies such as web services, XSLT (Extensible Stylesheet Language Transformations),
and XML parsers, making it easier to work with data across different platforms and systems.
4. Improved performance: XML databases can provide improved performance over traditional
relational databases when working with large amounts of XML data, especially when using
native XML databases.
5. Scalability: XML databases can scale horizontally or vertically to handle large amounts of data
and can provide high availability and reliability through features such as replication and
failover.
Some examples of XML databases include eXist-db, MarkLogic, BaseX, and Tamino.
Multimedia databases
A multimedia database is a database that is specifically designed to store, manage, and retrieve
multimedia data such as images, audio, video, and other forms of digital media. Multimedia
databases provide efficient ways of organizing and retrieving multimedia content, making it easier to
access and use this data for various applications.
Multimedia databases have several important characteristics:
1. Large data sizes: Multimedia data can be very large in size and require special storage and
indexing techniques to optimize performance.
2. Various data types: Multimedia data can come in different formats such as images, videos,
and audio, each requiring specialized techniques to manage.
3. Complex data structures: Multimedia data can be structured in various ways, including
hierarchical structures, network structures, and relational structures.
4. Time-based data: Multimedia data is often time-based and requires specialized techniques
to manage and retrieve content based on temporal queries.
5. Content-based retrieval: Retrieval of multimedia data is often based on content rather than
keywords, which requires sophisticated indexing and retrieval techniques.
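Content-based retrieval can be sketched in miniature: each image is reduced to a small feature vector (the three-bin "colour histogram" values below are made up), and a query is answered by ranking images by cosine similarity to the query vector.

```python
# Content-based retrieval sketch: rank media by feature-vector similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

features = {
    "sunset.jpg": [0.9, 0.4, 0.1],
    "forest.jpg": [0.1, 0.8, 0.2],
    "ocean.jpg":  [0.2, 0.3, 0.9],
}

query = [0.8, 0.5, 0.2]  # "find images that look like this one"
ranked = sorted(features,
                key=lambda name: cosine_similarity(features[name], query),
                reverse=True)
print(ranked[0])
```

Real multimedia databases use much richer features (colour, texture, shape, audio fingerprints) and specialized indexes, but the query model is the same: nearest neighbours in feature space rather than keyword match.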
Some of the applications that benefit from multimedia databases include:
1. Digital asset management: Multimedia databases are used for managing and organizing large
collections of multimedia content, such as in libraries, museums, and archives.
2. E-commerce: Multimedia databases are used to store product images and videos, allowing
for efficient retrieval of content for online shopping.
3. Entertainment: Multimedia databases are used in the entertainment industry for managing
and delivering music, movies, and other forms of digital content.
4. Medical imaging: Multimedia databases are used in the medical field to store and retrieve
medical images such as X-rays, MRI scans, and CT scans.
Examples of multimedia databases include Oracle Multimedia, Microsoft SQL Server, and IBM DB2
with multimedia extensions. These databases provide support for multimedia data types and
specialized indexing and retrieval techniques for efficient management and retrieval of multimedia
data.
Cloud storage architectures
The architecture of cloud storage typically involves several layers of abstraction, each providing a
different level of functionality and service to users. These layers include:
1. Hardware layer: This layer includes the physical infrastructure used to store data, such as
servers, storage devices, and networking equipment.
2. Virtualization layer: This layer provides the software infrastructure that abstracts the physical
hardware and provides virtual resources to users. Virtualization allows for greater flexibility
and scalability in allocating resources to meet changing demands.
3. Storage management layer: This layer provides the tools and software needed to manage the
storage infrastructure. It includes features such as data backup, disaster recovery, and data
migration.
4. Data access layer: This layer provides the mechanisms for users to access their data stored in
the cloud. This can include APIs, web interfaces, and command-line interfaces.
5. Security layer: This layer is responsible for ensuring the security of the data stored in the
cloud. It includes features such as encryption, access controls, and identity management.
6. Application layer: This layer includes the applications that use the cloud storage service, such
as file sharing, collaboration tools, and content management systems.
Cloud storage architectures can be deployed in different ways depending on the needs of the
organization. Some common deployment models include:
1. Public cloud: In this model, the cloud storage infrastructure is owned and managed by a
third-party provider and is accessible to users over the internet.
2. Private cloud: In this model, the cloud storage infrastructure is owned and managed by the
organization and is accessible only to authorized users within the organization.
3. Hybrid cloud: This model combines public and private cloud storage, allowing organizations
to leverage the benefits of both while maintaining control over their sensitive data.
The architecture of cloud storage is designed to provide scalable, flexible, and cost-effective storage
solutions to meet the demands of modern organizations.
Enhanced entity-relationship diagram (EER diagram)
An EER diagram helps us create and maintain detailed databases through high-level models and
tools. EER diagrams are developed from basic ER diagrams and are their extended version.
Hence, the EER diagram provides all the elements and units of the basic ER diagram along
with categories, attributes, and added relationships between one or more entities.
EER diagrams help in creating and maintaining well-structured databases using efficient
techniques. In addition, an EER diagram is a visual representation of the plan, or overall
outlook, of the database you intend to create.
When to Use EER Diagrams?
As mentioned above, EER diagrams ease many database management tasks and also play a role
in information modeling. EER diagrams can be leveraged:
If an organization wants to manage the data of all of its employees.
In addition to this, it provides an excellent framework for the management of information
and its flow.
This tool can be used in police departments as well. This is because it helps in maintaining
detailed databases.
Other than this, this tool can be used by universities to keep a record of every student.
Last but not least, this tool can be used by system engineers, network engineers, and
software developers as well.
Advantages of EER models in DBMS
1. It is quite simple to develop and maintain. In addition to this, it is easy to understand and
interpret as well, technically speaking.
2. Everything that is visually represented is easier to understand and maintain, and the same
goes for EER models.
3. It has been an efficient tool for database designers. It serves as a communication tool and
helps display the relationship between entities.
4. You can always convert the EER model into a table. Thus, it can easily be integrated into a
relational model.
Disadvantages/limitations of EER diagrams
1. EER diagrams have limited expressiveness: many real-world constraints cannot be represented directly.
2. There is no single industry-standard notation, so diagrams produced by different tools or designers can be inconsistent.
3. The model captures the structure of data but not its behavior or dynamic aspects.
4. Diagrams for large schemas can become cluttered and hard to read.
What are specialization and generalization?
Specialization is the process of classifying a class of objects into more specialized subclasses.
Generalization is the inverse process of generalizing several classes into a higher-level abstract
class that includes the objects in all these classes. Specialization is conceptual refinement,
whereas generalization is conceptual synthesis. Subclasses are used in the EER model to
represent specialization and generalization. We call the relationship between a subclass and its
superclass an IS-A-SUBCLASS-OF relationship, or simply an IS-A relationship.
What are object structure and object identity? Give an example.
Object identity
> An object has a unique identity, which is represented by an object identifier (OID).
> The OODB system provides a unique OID to each independent object stored in the database.
> No two objects can share the same OID.
> The OID is assigned by the system and does not depend on the object's attribute values.
> The value of an OID is not visible to an external user, but it is used internally by the system to
identify each object uniquely.
Object Structure
An object has associated with it:
> A set of variables that contain the data for the object; the value of each variable is itself an object.
> A set of messages to which the object responds; each message may have zero, one, or more
parameters.
> A set of methods, each of which is a body of code that implements a message; a method returns a
value as the response to the message.
The physical representation of data is visible only to the implementor of the object. Messages and
responses provide the only external interface to an object. The term "message" does not necessarily
imply physical message passing; messages can be implemented as procedure invocations.
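The object model described above can be sketched as a small Python class: each object gets a system-assigned OID, holds its data in variables, and exposes methods as its only external interface. The `Account` class, its counter, and its attribute names are illustrative assumptions.

```python
# Object structure and identity sketch: OID + variables + messages/methods.
import itertools

_oid_counter = itertools.count(1)  # "system" that hands out OIDs

class Account:
    def __init__(self, owner, balance):
        self.oid = next(_oid_counter)  # assigned by the system, independent of attribute values
        self._owner = owner            # variables holding the object's data
        self._balance = balance

    # Messages: the only external interface to the object's state.
    def deposit(self, amount):
        self._balance += amount
        return self._balance

    def balance(self):
        return self._balance

a = Account("Asha", 100)
b = Account("Ben", 100)
print(a.oid != b.oid)  # identical attribute values, yet distinct identities
a.deposit(50)
```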
What is the basic concept of an object in OOP?
Object database definition
An object database is managed by an object-oriented database management system (OODBMS). The
database combines object-oriented programming concepts with relational database principles.
Objects are the basic building block and an instance of a class, where the type is either built-in
or user-defined.
The main characteristic of objects in OODBMS is the possibility of user-constructed types. An object
created in a project or application saves into a database as is.
Object-oriented databases directly deal with data as complete objects. All the information comes in
one instantly available object package instead of multiple tables.
In contrast, the basic building blocks of relational databases, such as PostgreSQL or MySQL, are
tables with actions based on logical connections between the table data.
These characteristics make object databases suitable for projects with complex data which require an
object-oriented approach to programming. An object-oriented management system provides
supported functionality catered to object-oriented programming where complex objects are central.
This approach unifies attributes and behaviors of data into one entity.
Unit 2
Types of SQL Statements
The tables in the following sections provide a functional summary of SQL statements and are divided
into these categories:
Data Definition Language (DDL) Statements
Data Manipulation Language (DML) Statements
Transaction Control Statements
Session Control Statements
System Control Statement
Embedded SQL Statements
Data Definition Language (DDL) Statements
Data definition language (DDL) statements let you perform these tasks:
Create, alter, and drop schema objects
Grant and revoke privileges and roles
Analyze information on a table, index, or cluster
Establish auditing options
Add comments to the data dictionary
The CREATE, ALTER, and DROP commands require exclusive access to the specified object. For
example, an ALTER TABLE statement fails if another user has an open transaction on the specified
table.
The GRANT, REVOKE, ANALYZE, AUDIT, and COMMENT commands do not require exclusive access to
the specified object. For example, you can analyze a table while other users are updating the table.
Oracle Database implicitly commits the current transaction before and after every DDL statement.
Many DDL statements may cause Oracle Database to recompile or reauthorize schema objects. For
information on how Oracle Database recompiles and reauthorizes schema objects and the
circumstances under which a DDL statement would cause this, see Oracle Database Concepts.
DDL statements are supported by PL/SQL with the use of the DBMS_SQL package.
The DDL statements are:
ALTER ... (All statements beginning with ALTER)
ANALYZE
ASSOCIATE STATISTICS
AUDIT
COMMENT
CREATE ... (All statements beginning with CREATE)
DISASSOCIATE STATISTICS
DROP ... (All statements beginning with DROP)
FLASHBACK ... (All statements beginning with FLASHBACK)
GRANT
NOAUDIT
PURGE
RENAME
REVOKE
TRUNCATE
UNDROP
Data Manipulation Language (DML) Statements
Data manipulation language (DML) statements access and manipulate data in existing schema
objects. These statements do not implicitly commit the current transaction. The data manipulation
language statements are:
CALL
DELETE
EXPLAIN PLAN
INSERT
LOCK TABLE
MERGE
SELECT
UPDATE
The SELECT statement is a limited form of DML statement in that it can only access data in the
database. It cannot manipulate data in the database, although it can operate on the accessed data
before returning the results of the query.
The CALL and EXPLAIN PLAN statements are supported in PL/SQL only when executed dynamically.
All other DML statements are fully supported in PL/SQL.
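The main DML statements can be demonstrated with Python's built-in sqlite3 module. Note this is a sketch in SQLite, whose transaction behaviour differs from Oracle's (for example, SQLite does not implicitly commit around DDL the way Oracle Database does); the table and data are made up.

```python
# DML sketch: INSERT, UPDATE, DELETE, then SELECT to read the result.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")  # DDL

conn.execute("INSERT INTO emp (name, salary) VALUES ('Asha', 50000)")
conn.execute("INSERT INTO emp (name, salary) VALUES ('Ben', 45000)")
conn.execute("UPDATE emp SET salary = salary * 1.1 WHERE name = 'Ben'")
conn.execute("DELETE FROM emp WHERE salary < 40000")
conn.commit()  # DML changes are not committed implicitly

rows = conn.execute("SELECT name, salary FROM emp ORDER BY name").fetchall()
print(rows)
```

SELECT is the read-only member of the group: it can shape the result (ORDER BY, aggregates) but never changes stored data.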
Transaction Control Statements
Transaction control statements manage changes made by DML statements. The transaction control
statements are:
COMMIT
ROLLBACK
SAVEPOINT
SET TRANSACTION
All transaction control statements, except certain forms of the COMMIT and ROLLBACK commands,
are supported in PL/SQL. For information on the restrictions, see COMMIT and ROLLBACK.
Session Control Statements
Session control statements dynamically manage the properties of a user session. These statements
do not implicitly commit the current transaction.
PL/SQL does not support session control statements. The session control statements are:
ALTER SESSION
SET ROLE
System Control Statement
The single system control statement, ALTER SYSTEM, dynamically manages the properties of an
Oracle Database instance. This statement does not implicitly commit the current transaction and is
not supported in PL/SQL.
Embedded SQL Statements
Embedded SQL statements place DDL, DML, and transaction control statements within a procedural
language program. Embedded SQL is supported by the Oracle precompilers and is documented in
the following books:
Pro*COBOL Programmer's Guide
Pro*C/C++ Programmer's Guide
Oracle SQL*Module for Ada Programmer's Guide
Schema design in DBMS
A Schema organizes data into Tables with appropriate Attributes, shows the interrelationships
between Tables and Columns, and imposes constraints such as Data types. A well-designed Schema
in a Data Warehouse makes life easier for Analysts by:
removing cleaning and other preprocessing from the analyst’s workflow
absolving analysts from having to reverse-engineer the underlying Data Model
providing analysts with a clear, easily understood starting point for analytics
In other words, a well-designed Schema clears the way to faster and easier creation of Reports and
Dashboards.
By contrast, a flawed Schema requires Data Analysts to do extra modeling and forces every Analytics
query to take more time and system resources, increasing an organization’s costs and irritating
everyone who wants their analytics right away.
Schemas are used to specify data items in both data sources and data warehouses in the Data
Analytics field. However, Data Source Schemas aren’t created with Analytics in mind, whether
they’re databases like MySQL, PostgreSQL, or Microsoft SQL Server, or SaaS
services like Salesforce, Facebook Ads, or Zuora.
The SaaS apps, for example, may offer some broad analytics features, but they only apply to the data
from that particular app. Users also have no control over SaaS Schemas, which are established by
the developers of each program.
When enterprise data is duplicated to a Data Warehouse and linked with data from other
applications, it becomes more useful – and enterprises get to build these Data Architectures.
How to Design a Database Schema?
Database schemas define a database’s architecture and help to ensure database fundamentals such
as the following:
The data is formatted consistently.
Every record entry has a distinct primary key.
Important information is not omitted.
A database schema design can exist as both a visual representation and as a collection of formulas or
use constraints that govern a database. Depending on the database system, developers will then
express these formulas in different data definition languages.
For example, despite the fact that the leading database systems have slightly different definitions of
what schemas are, the CREATE SCHEMA statement is supported by MySQL, Oracle Database, and
Microsoft SQL Server.
Suppose you want to create a database to store information for your company’s accounting
department. This database’s schema could outline the structure of two simple tables:
A) Table1
Title: Users
Fields: ID, Full Name, Email, DOB, Dept
B) Table2
Title: Overtime Pay
Fields: ID, Full Name, Time Period, Hours Billed
This single schema includes useful information such as:
Each table’s title
The fields contained in each table
Table relationships (for example, linking an employee’s overtime pay to their identity via
their ID number)
Any other relevant information
These schema tables can then be converted into SQL code by developers and database
administrators.
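One possible translation of the two tables into SQL DDL, run here through Python's built-in sqlite3; the column names follow the fields listed above, and using the shared ID column as a foreign key is an assumption about how the tables are meant to relate.

```python
# The accounting-department schema expressed as SQL DDL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id        INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    email     TEXT UNIQUE,
    dob       TEXT,
    dept      TEXT
);

CREATE TABLE overtime_pay (
    id           INTEGER,            -- links a pay record back to a user
    full_name    TEXT,
    time_period  TEXT,
    hours_billed REAL,
    FOREIGN KEY (id) REFERENCES users (id)
);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```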
How to design a relational database?
The four stages of relational database design (RDD) are as follows:
Relations and attributes: The various tables and attributes related to each table are
identified. The tables represent entities, and the attributes represent the properties of the
respective entities.
Primary keys: The attribute or set of attributes that help in uniquely identifying a record is
identified and assigned as the primary key
Relationships: The relationships between the various tables are established with the help of
foreign keys. Foreign keys are attributes occurring in a table that are primary keys of another
table. The types of relationships that can exist between the relations (tables) are:
o One to one
o One to many
o Many to many
An entity-relationship diagram can be used to depict the entities, their attributes and the
relationship between the entities in a diagrammatic way.
Normalization: This is the process of optimizing the database structure. Normalization
simplifies the database design to avoid redundancy and confusion. The different normal
forms are as follows:
o First normal form
o Second normal form
o Third normal form
o Boyce-Codd normal form
o Fourth normal form
o Fifth normal form
By applying a set of rules, a table is normalized into the above normal forms in a linearly progressive
fashion. The efficiency of the design gets better with each higher degree of normalization.
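One normalization step can be shown on toy data: an unnormalized orders table repeats customer details on every row, and decomposing it into separate customers and orders tables removes that redundancy. The tables and column names are made up for illustration.

```python
# Normalization sketch: remove redundancy by decomposing one table into two.

orders_unnormalized = [
    {"order_id": 1, "cust_id": 10, "cust_name": "Asha", "city": "Pune",  "amount": 250},
    {"order_id": 2, "cust_id": 10, "cust_name": "Asha", "city": "Pune",  "amount": 120},
    {"order_id": 3, "cust_id": 11, "cust_name": "Ben",  "city": "Delhi", "amount": 400},
]

# Customer attributes depend only on cust_id, so they move to their own
# table keyed by cust_id; each customer now appears exactly once.
customers = {r["cust_id"]: {"cust_name": r["cust_name"], "city": r["city"]}
             for r in orders_unnormalized}
orders = [{"order_id": r["order_id"], "cust_id": r["cust_id"], "amount": r["amount"]}
          for r in orders_unnormalized]

print(len(customers), len(orders))
```

After the split, changing Asha's city means updating one row instead of every order she has placed, which is exactly the update anomaly normalization is meant to avoid.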
Types of decision support systems
Decision support systems can be broken down into categories, each based on their primary sources
of information.
Data-driven DSS
A data-driven DSS is a computer program that makes decisions based on data from internal
databases or external databases. Typically, a data-driven DSS uses data mining techniques to discern
trends and patterns, enabling it to predict future events. Businesses often use data-driven DSSes to
help make decisions about inventory, sales and other business processes. Some are used to help
make decisions in the public sector, such as predicting the likelihood of future criminal behavior.
Model-driven DSS
Built on an underlying decision model, model-driven decision support systems are customized
according to a predefined set of user requirements to help analyze different scenarios that meet
these requirements. For example, a model-driven DSS may assist with scheduling or developing
financial statements.
Communication-driven and group DSS
A communication-driven and group decision support system uses a variety of communication tools --
such as email, instant messaging or voice chat -- to allow more than one person to work on the same
task. The goal behind this type of DSS is to increase collaboration between the users and the system
and to improve the overall efficiency and effectiveness of the system.
Knowledge-driven DSS
In this type of decision support system, the data that drives the system resides in a knowledge base
that is continuously updated and maintained by a knowledge management system. A knowledge-
driven DSS provides information to users that is consistent with a company's business processes and
knowledge.
Document-driven DSS
A document-driven DSS is a type of information management system that uses documents to
retrieve data. Document-driven DSSes enable users to search webpages or databases, or to find
specific search terms. Examples of documents accessed by a document-driven DSS include policies.
Data Warehousing
Background
A database management system (DBMS) stores data in the form of tables, uses the ER model, and
aims to guarantee the ACID properties. For example, a college DBMS has tables for students, faculty,
and so on.
A data warehouse is separate from a DBMS; it stores a huge amount of data, typically collected
from multiple heterogeneous sources such as files and DBMSs. The goal is to produce statistical
results that may help in decision making. For example, a college might want to quickly see different
results, such as how the placement of CS students has improved over the last 10 years in terms of
salaries, counts, etc.
Need for Data Warehouse
An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For storing
data on the terabyte scale, storage shifted to the data warehouse. Besides this, a transactional
database doesn't lend itself to analytics. To perform analytics effectively, an organization keeps a
central data warehouse to closely study its business by organizing, understanding, and using its
historical data for taking strategic decisions and analyzing trends.
Benefits of Data Warehouse:
Better business analytics: A data warehouse plays an important role in a business by storing
and supporting analysis of all the company's past data and records, which deepens the
company's understanding of its data.
Faster queries: A data warehouse is designed to handle large analytical queries, so it runs
them faster than an operational database.
Improved data quality: Data gathered from different sources is stored and analyzed in the
warehouse without being altered or added to by the warehouse itself, so data quality is
maintained; if a data quality issue does arise, the data warehouse team resolves it.
Historical insight: The warehouse stores all your historical data about the business, so it
can be analyzed at any time to extract insights.
Steps of data Warehousing
Step 1: Determine Business Objectives.
Step 2: Collect and Analyze Information.
Step 3: Identify Core Business Processes.
Step 4: Construct a Conceptual Data Model.
Step 5: Locate Data Sources and Plan Data Transformations.
Step 6: Set Tracking Duration.
Step 7: Implement the Plan.
Types of Data Analysis
Four Types of Data Analysis
The four types of data analysis are:
1. Descriptive Analysis
2. Diagnostic Analysis
3. Predictive Analysis
4. Prescriptive Analysis
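Descriptive analysis, the first of the four, simply summarizes what happened. A minimal sketch on hypothetical monthly sales figures (the numbers are invented for illustration):

```python
from statistics import mean, median, stdev

# Hypothetical monthly sales figures; descriptive analysis summarizes
# what happened, without explaining (diagnostic), forecasting
# (predictive), or recommending action (prescriptive).
sales = [120, 135, 128, 150, 142, 160]

print("mean:", round(mean(sales), 1))
print("median:", median(sales))
print("std dev:", round(stdev(sales), 1))
```

Diagnostic, predictive, and prescriptive analysis build on such summaries: they respectively explain why the numbers moved, forecast the next values, and recommend what to do about them.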
KDD
KDD stands for Knowledge Discovery in Databases. It is the process of discovering useful
knowledge from large amounts of data. The KDD process involves several steps, including
data cleaning, data integration, data selection, data transformation, pattern discovery, and
knowledge representation.
Data cleaning involves removing noise, handling missing data, and resolving inconsistencies
in the data. Data integration combines data from multiple sources and resolves differences in
schema and data types. Data selection involves selecting the relevant data for analysis based
on the research question. Data transformation involves converting the data into a form
suitable for analysis, such as normalization, aggregation, or discretization.
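The cleaning and transformation steps can be sketched concretely. Below, missing values (a common form of noise) are filled with the column mean, then the cleaned values are normalized; the records are hypothetical:

```python
# Hypothetical raw measurements; None marks missing values.
raw = [5.0, None, 3.0, 8.0, None, 4.0]

# Data cleaning: fill missing values with the mean of the observed ones.
observed = [x for x in raw if x is not None]
fill = sum(observed) / len(observed)
cleaned = [x if x is not None else fill for x in raw]

# Data transformation: min-max normalization into the [0, 1] range,
# one common normalization technique mentioned above.
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]
print(normalized)  # [0.4, 0.4, 0.0, 1.0, 0.4, 0.2]
```

Aggregation and discretization follow the same pattern: a deterministic transformation applied before analysis, chosen to suit the mining technique that comes next.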
Pattern discovery involves the use of data mining techniques to identify patterns and
relationships in the data. This step includes tasks such as clustering, classification, regression,
association rule mining, and sequence mining. Knowledge representation involves
interpreting and visualizing the patterns and relationships in the data to generate useful
knowledge and insights.
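Of the pattern-discovery tasks listed, association rule mining is easy to show end to end. A minimal sketch on hypothetical market-basket transactions, computing support and confidence for the rule {bread} -> {butter}:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)
count_both = sum(1 for t in transactions if {"bread", "butter"} <= t)

# Support: fraction of all baskets containing both items.
support = count_both / n
# Confidence: of the baskets with bread, the fraction that also have butter.
confidence = count_both / count_bread
print(round(support, 2), round(confidence, 2))  # 0.5 0.67
```

Rules whose support and confidence exceed chosen thresholds become the "patterns" that the knowledge representation step then interprets and visualizes.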
The KDD process is widely used in fields such as business, finance, marketing, healthcare, and
scientific research. It helps organizations to make data-driven decisions and gain a
competitive advantage by discovering hidden patterns, trends, and insights in their data.
Data Mining
Data mining is the process of discovering patterns, trends, and insights from large volumes of data.
It involves applying statistical and machine learning algorithms to extract knowledge from data
and uncover hidden patterns and relationships.
Data mining involves several steps, including data preprocessing, data exploration, model building,
and model evaluation. Data preprocessing involves cleaning and transforming the data to ensure
consistency and accuracy. Data exploration involves visualizing and exploring the data to identify
patterns and relationships. Model building involves selecting and applying appropriate data mining
algorithms to generate predictive models. Model evaluation involves assessing the performance of
the models and selecting the best model for the data.
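The model building and evaluation steps can be illustrated with the simplest possible case: fitting a linear regression by ordinary least squares on a training split of hypothetical data, then scoring it on a held-out test split with mean squared error.

```python
# Hypothetical (x, y) observations; the first four train the model,
# the last two are held out for evaluation.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 12.2)]
train, test = data[:4], data[4:]

# Model building: ordinary least squares fit of y = a*x + b.
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
b = (sy - a * sx) / n                          # intercept

# Model evaluation: mean squared error on the held-out points.
mse = sum((y - (a * x + b)) ** 2 for x, y in test) / len(test)
print(round(a, 2), round(mse, 4))
```

In practice one would compare several candidate models this way and keep the one with the best held-out score; the split prevents the model from being judged on data it has already seen.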
Data mining can be applied to a wide range of data types and domains, such as business,
healthcare, finance, and marketing. It can be used for tasks such as customer segmentation, fraud
detection, risk analysis, and predictive maintenance.
The benefits of data mining include improved decision-making, increased efficiency and
productivity, and better customer insights. It can also help organizations to identify new business
opportunities, reduce costs, and improve customer satisfaction.