Data mining is the process of extracting useful information from a large body of unprocessed raw data. By aggregating these datasets into a summarized form, many problems arising in finance, marketing, and other fields can be addressed. With the enormous volumes of data now available, data mining is a growing field of technology with applications in many industries we depend on in daily life. Much research and development has taken place in this field, and many systems have been introduced. Since data mining involves numerous processes and functions, a well-developed user interface is needed. Although many well-developed user interfaces exist for relational systems, Han, Fu, Wang, et al. proposed the Data Mining Query Language (DMQL) to support the development of further systems and research in this field. DMQL cannot be considered a standard language; it is a derived language that serves as a general query language for data mining tasks. DMQL is implemented in the DBMiner system for mining data from several layers of databases.
Ideas in designing DMQL:
DMQL is designed on the basis of Structured Query Language (SQL), which is itself a relational query language.
- Data mining request: For a given data mining task, the relevant datasets must be specified in the form of a data mining request. For example, because a user may be interested in only a specific part of a database, the data miner can use a database query to retrieve the suitable datasets before mining begins. If that specific data cannot be retrieved directly, the data miner collects a superset from which the required data can be derived. This shows why a query language is needed as a subtask of data mining. Relevant data cannot be extracted from huge datasets by hand, so many automated methods exist, but they can still fail to collect the data the user requested. With DMQL, the user issues a command that retrieves specific datasets from the database, which returns the desired result and better meets the user's expectations.
- Background knowledge: Prior knowledge of the datasets and their relationships in a database helps in mining the data. Knowing these relationships, or any other useful information, eases extraction and aggregation. For instance, a concept hierarchy over the datasets can increase the efficiency and accuracy of the process by making the desired data easier to collect. With a known hierarchy, the data can be generalized with ease.
- Generalization: When the data in a data warehouse is not generalized, it often takes the form of unprocessed primitive integrity constraints, loosely associated multi-valued datasets, and their dependencies. Applying generalization through the query language turns this raw data into a concise abstraction. It also supports multi-level data collection with good-quality aggregation. For larger databases, generalization plays a major role in producing useful results at a conceptual level.
- Flexibility and interaction: To avoid collecting less desirable or unwanted data, suitable thresholds must be specified. This keeps mining flexible and supports interaction with the user. Such threshold values can be supplied as part of the data mining query.
The four parameters of data mining:
- The first parameter is the set of relevant data, fetched from the database in the form of a relational query. Specifying this primitive retrieves the relevant data.
- The second parameter is the kind of knowledge to be extracted. This primitive covers generalization, association, classification, characterization, and discrimination rules.
- The third parameter is the background knowledge, that is, the concept hierarchy or generalization relation over the datasets described in the design ideas above.
- The final parameter is the interestingness of the discovered rules, expressed as a threshold value whose kind depends on the type of rule being mined.
Basic syntax in DMQL:
DMQL adopts an SQL-like syntax. Its grammar is specified in Backus-Naur Form (BNF) notation. In this notation, “[ ]” denotes zero or one occurrence and “{ }” denotes zero or more occurrences.
To retrieve relevant dataset:
Syntax:
use database (database_name)
{use hierarchy (hierarchy_name) for (attribute)}
(rule_specified)
related to(attribute_or_aggregate_list)
from(relation(s)) [where(condition)]
[order by(order_list)]
{with [(type_of)] threshold = (threshold_value) [for(attribute(s))]}
In the above data mining query, the first line selects the database to use (database_name). The second line names the hierarchy (hierarchy_name) to apply to the given attribute. (rule_specified) states the kind of rule to be mined. The related to clause lists the attributes or aggregates that are relevant to the rule and are used in generalization. The from and where clauses select the relations and the condition the data must satisfy. The order by clause sorts the result, and the with clause sets a threshold value, optionally for specific attributes.
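As a rough illustration of how these clauses fit together, the following Python sketch assembles a DMQL query string from its parts. The function build_dmql_query and its parameter names are hypothetical and exist only for this example, the threshold value is made up, and no DMQL execution engine is assumed: the output is plain text.

```python
from typing import Optional, Sequence, Tuple


def build_dmql_query(
    database: str,
    rule: str,
    related_to: Sequence[str],
    relations: Sequence[str],
    condition: Optional[str] = None,
    hierarchies: Sequence[Tuple[str, str]] = (),
    thresholds: Sequence[Tuple[str, float]] = (),
) -> str:
    """Assemble the clauses of a DMQL query in the order given by the syntax."""
    lines = [f"use database {database}"]
    # Zero or more "use hierarchy" clauses (the { } part of the grammar).
    for name, attribute in hierarchies:
        lines.append(f"use hierarchy {name} for {attribute}")
    lines.append(rule)
    lines.append("related to " + ", ".join(related_to))
    from_clause = "from " + ", ".join(relations)
    # The where clause is optional (the [ ] part of the grammar).
    if condition:
        from_clause += f" where {condition}"
    lines.append(from_clause)
    # Zero or more threshold clauses.
    for kind, value in thresholds:
        lines.append(f"with {kind} threshold = {value}")
    return "\n".join(lines)


# Hypothetical usage: all names and values below are illustrative.
query = build_dmql_query(
    database="national_publisher",
    rule="find characteristic rules as buyerPurchasing",
    related_to=["B.category", "B.type", "B.genre"],
    relations=["book B"],
    condition='B.type = "ebook"',
    thresholds=[("noise", 0.05)],
)
print(query)
```

Keeping the optional clauses as parameters with empty defaults mirrors the “[ ]” and “{ }” parts of the grammar: a clause that is not supplied is simply left out of the generated text.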
For the rules in DMQL:
Syntax:
Generalization:
generalize data [into (relation_name)]
Association:
find association rules [as (rule_name)]
Classification:
find classification rules [as (rule_name) ] according to [(attribute)]
Characterization:
find characteristic rules [as (rule_name)]
Discrimination:
find discriminant rules [as (rule_name)]
for (class_1) with (condition_1)
from (relation(s)_1)
in contrast to (class_2) with (condition_2)
from (relation(s)_2)
{ in contrast to (class_i) with (condition_i)
from (relation(s)_i)}
Kinds of thresholds in rule mining:
In data mining, a set of threshold values is important for extracting useful and interesting patterns from a large body of data. Thresholds also help in measuring the relevance of the data and in driving the search for interesting patterns.
The types of thresholds in rule mining can be categorized into three classes.
- Significance threshold: For a pattern to be presented, the data must contain at least some reasonably substantial evidence of it. In mining association rules, this is called the minimum support threshold, and the patterns that meet it are called frequent data items. In mining characteristic rules, it is called the noise threshold, and patterns that fall below it are treated as noise.
- Rule redundancy threshold: This threshold prevents redundancy among the rules presented. That is, a newly presented rule should not duplicate an existing one.
- Rule confidence threshold: For a rule X -> Y, the confidence is the probability of Y given X. This probability must meet the rule confidence threshold for the rule to be reported.
Syntax:
with (threshold_name) threshold = value_of_threshold
Example:
with confidence threshold = 0.9
with redundancy threshold = 0.04
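To make the support and confidence thresholds concrete, here is a minimal Python sketch that computes both measures for a single rule X -> Y over a toy list of transactions and checks them against thresholds. The transactions, item names, and threshold values are invented for illustration, and the helper names support, confidence, and passes are hypothetical; a DMQL system would evaluate these internally.

```python
from typing import FrozenSet, List

# Toy transactions, invented for illustration only.
transactions: List[FrozenSet[str]] = [
    frozenset({"novel", "bookmark"}),
    frozenset({"novel", "bookmark", "notebook"}),
    frozenset({"novel"}),
    frozenset({"notebook"}),
]


def support(itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Estimate P(Y | X) for the rule X -> Y as support(X and Y) / support(X)."""
    base = support(x)
    if base == 0:
        return 0.0  # X never occurs, so the rule has no evidence at all
    return support(x | y) / base


def passes(x: FrozenSet[str], y: FrozenSet[str],
           min_support: float, min_confidence: float) -> bool:
    """True if the rule X -> Y meets both thresholds."""
    return support(x | y) >= min_support and confidence(x, y) >= min_confidence


x = frozenset({"novel"})
y = frozenset({"bookmark"})
rule_support = support(x | y)       # 2 of the 4 transactions contain both items
rule_confidence = confidence(x, y)  # 2 of the 3 transactions containing "novel"

print(rule_support, rule_confidence)
print(passes(x, y, min_support=0.4, min_confidence=0.9))  # fails on confidence
```

With a confidence threshold of 0.9, as in the example above, this rule would be filtered out even though it is frequent enough; lowering the threshold to 0.6 would let it through.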
Representation of concept hierarchies:
Concept hierarchies make data mining more precise. They are built from the relationships among data and the grouping of data. A concept hierarchy should be flexible enough to be adjusted dynamically when new datasets are encountered.
- Schema hierarchy:
Using the relationships between the attributes of a dataset, a concept hierarchy can be specified at the schema level.
Query:
define hierarchy student_result_hierarchy on marks as [year, department, class, section]
In the above example, the attributes are listed from the least general to the most general level, so department is more general than the student's year but less general than the class and section in which the student studies. The following is an example of a built-in hierarchy at the schema level:
Query:
define hierarchy period_hierarchy on date as [day, month, year]
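The effect of a schema-level hierarchy such as period_hierarchy can be sketched in Python as rolling individual dates up to a more general level. This is only an illustration of the idea: the function generalize and the sample purchase dates are assumptions made for this example, not part of DMQL.

```python
from collections import Counter
from datetime import date

# Levels ordered from most specific to most general, as in period_hierarchy.
LEVELS = ["day", "month", "year"]


def generalize(d: date, level: str) -> str:
    """Return the value of d at the requested level of the hierarchy."""
    if level == "day":
        return d.isoformat()
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")


# Hypothetical purchase dates.
purchases = [date(2023, 1, 5), date(2023, 1, 20), date(2023, 2, 3), date(2024, 2, 3)]

# Counting at a higher level merges the rows that share a parent concept.
by_month = Counter(generalize(d, "month") for d in purchases)
by_year = Counter(generalize(d, "year") for d in purchases)
print(by_month)
print(by_year)
```

Moving from month to year reduces four distinct days to two groups, which is the kind of summarization a generalization query relies on.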
- Set-grouping hierarchy:
In this hierarchy, concepts are specified by grouping values into sets, which makes the lower and higher levels explicit.
define hierarchy age_hierarchy for book on audience as
level1: {children, young_adult, adult} < level0: all
level2: {8, ..., 12} < level1: children
level2: {13, ..., 18} < level1: young_adult
level2: {19, ..., 100} < level1: adult
In the above example, the audience of a book is grouped into age ranges.
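A set-grouping hierarchy like age_hierarchy amounts to a lookup from a raw value to the group that contains it. The Python sketch below mirrors the three age ranges from the definition above; the names AGE_GROUPS and generalize_age are hypothetical, and only levels 0 and 1 are modeled.

```python
from typing import Optional

# Inclusive age ranges and the group each belongs to, mirroring age_hierarchy.
AGE_GROUPS = {
    "children": range(8, 13),      # ages 8 to 12
    "young_adult": range(13, 19),  # ages 13 to 18
    "adult": range(19, 101),       # ages 19 to 100
}


def generalize_age(age: int, level: int) -> Optional[str]:
    """Map an age to its concept at level 0 (all) or level 1 (age group)."""
    if level == 0:
        return "all"
    if level == 1:
        for group, ages in AGE_GROUPS.items():
            if age in ages:
                return group
        return None  # the age falls outside every defined range
    raise ValueError("this sketch only models levels 0 and 1")


print(generalize_age(10, 1))
print(generalize_age(18, 1))
print(generalize_age(45, 0))
```

Ages outside the defined ranges return None here rather than being forced into a group; how a real system treats such values is not specified by the hierarchy itself.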
- Operation-derived hierarchy:
This hierarchy applies to numerical attributes. Its levels are derived by an operation, such as comparing ranges or clustering the values with an algorithm.
define hierarchy age_hierarchy for book on audience as
{age_group(1), ..., age_group(3)}
:= cluster(default, age, 3) < all(age)
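The definition above asks the system to derive three age groups by clustering. The Python sketch below shows one very simple way such groups could be derived. It uses equal-width binning as a stand-in, which is an assumption made for this example: the algorithm behind cluster(default, ...) is not specified here. The function name equal_width_groups and the sample ages are hypothetical.

```python
from typing import Dict, List


def equal_width_groups(values: List[int], k: int) -> Dict[str, List[int]]:
    """Split values into k equal-width ranges named age_group(1)..age_group(k)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1  # avoid a zero width when all values are equal
    groups: Dict[str, List[int]] = {f"age_group({i})": [] for i in range(1, k + 1)}
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), k - 1) + 1
        groups[f"age_group({index})"].append(v)
    return groups


# Hypothetical ages of book buyers.
ages = [8, 10, 12, 15, 17, 30, 45, 60, 98]
groups = equal_width_groups(ages, 3)
for name, members in groups.items():
    print(name, members)
```

Equal-width bins are sensitive to outliers, as the single value in the last group shows; a real clustering method would usually place the boundaries according to where the values concentrate.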
- Rule-based hierarchy:
In this hierarchy, the levels are specified by rules. The number of rules is small at the lower levels and grows at the higher levels.
define hierarchy book_royalty_hierarchy on book as
level_1: low_royalty < level_0: all
if (maximum_selling_price) < Rs. 275
level_1: moderate_royalty < level_0: all
if ((maximum_selling_price) ≥ Rs. 275) and ((maximum_selling_price) ≤ Rs. 300)
level_1: high_royalty < level_0: all
if (maximum_selling_price) > Rs. 300
In the above example, a rule-based hierarchy classifies the royalty an author receives for a book according to the range of its maximum selling price.
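A rule-based hierarchy can be read as an ordered chain of conditions that assigns each item to exactly one concept. The Python sketch below expresses that idea for the royalty levels. The function royalty_level is hypothetical, and the cut-off values are passed in as parameters; the values used in the usage lines are placeholders chosen for this example.

```python
def royalty_level(maximum_selling_price: float,
                  low_cutoff: float, high_cutoff: float) -> str:
    """Assign a royalty concept using non-overlapping price ranges.

    Prices below low_cutoff are low, prices above high_cutoff are high,
    and everything in between (inclusive) is moderate.
    """
    if low_cutoff > high_cutoff:
        raise ValueError("low_cutoff must not exceed high_cutoff")
    if maximum_selling_price < low_cutoff:
        return "low_royalty"
    if maximum_selling_price <= high_cutoff:
        return "moderate_royalty"
    return "high_royalty"


# Placeholder cut-offs (in Rs.) for illustration.
for price in (200, 290, 300, 450):
    print(price, royalty_level(price, low_cutoff=275, high_cutoff=300))
```

Checking the conditions in order guarantees that the ranges cannot overlap or leave gaps, which is the property the rules in such a hierarchy need to have.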
For the presentation of pattern:
To improve the user's experience, the user can request that the discovered patterns be displayed in a specified format.
Syntax:
display as (result_format)
Example:
display as graph
Specification of DMQL in a book database:
Consider a book database with the schema below.
Schema:
book(book_name, book_id, book_category, genre, book_type)
author(author_name, phone_no, address)
publishing(publisher_name, publishing_id, cost)
buyer(buyer_name, buyer_id, buyer_address)
As the marketing manager of a national publishing house, Joel wants to characterize the purchasing trends of buyers of books priced at no less than Rs. 300, with respect to the book category, the type of book (e-book or paperback), and the genre. The aim is to find the percentage of buyers showing each characteristic trend. Joel is interested only in e-book purchases, and he wants the result displayed as a table.
Query:
use database national_publisher
use hierarchy age_hierarchy for B.category
mine characteristics as buyerPurchasing
analyze sell%
related to B.category,B.type,B.genre
from buyer X, Book B, purchase P, books_sold S, book_category C
where B.ID = S.book_ID and P.buyer_ID = X.buyer_ID and
B.type = "ebook" and B.price ≥ 300
with noise threshold = 0.7%
display as table
In data mining, efficient graphical user interfaces (GUIs) are required to carry out tasks effectively. Relational database languages like SQL have played a major role in building many such systems. DMQL, however, can be considered a core query language for applications built specifically around data mining, which helps in building a more effective GUI. The collection, manipulation, and presentation of data in data mining become easier if DMQL is standardized as the core query language for data mining processes. DMQL has many advantages in data mining, but it still has limitations. In advanced GUIs that use graphs or other rich presentation structures, it becomes difficult to locate specific places in the display. So DMQL can replace SQL as the core query language in data mining only after extensive experimentation.