DiploCloud: Efficient and Scalable Management of RDF Data
In the Cloud
Aim:
The main aim of this project is to provide an efficient and scalable distributed RDF
data management system for the cloud.
Synopsis:
Despite recent advances in distributed RDF data management, processing large
amounts of RDF data in the cloud is still very challenging. In spite of its seemingly simple
data model, RDF actually encodes rich and complex graphs mixing both instance- and
schema-level data. Sharding such data using classical techniques, or partitioning the graph
using traditional min-cut algorithms, leads to very inefficient distributed operations and to
a high number of joins. In this paper, we describe DiploCloud, an efficient and scalable
distributed RDF data management system for the cloud. Contrary to previous approaches,
DiploCloud runs a physiological analysis of both instance and schema information prior to
partitioning the data. We describe the architecture of DiploCloud, its main data structures,
and the new algorithms we use to partition and distribute data. We also present an
extensive evaluation of DiploCloud showing that our system is often two orders of
magnitude faster than state-of-the-art systems on standard workloads.
Existing System:
Because database retrieval is heavyweight and time-consuming, information
increasingly needs to be addressed through RDF-based representations, in which variables
can be connected in arbitrary ways. Query processing over RDF corresponds to the notion
of basic graph pattern (BGP) matching in SPARQL. Every query represents a graph pattern
consisting of a set of triple patterns over distinguished variables, undistinguished variables,
and constants; hence retrieval based on simple keyword-style search alone is ineffective. A
solution to a graph pattern q on a graph G is a mapping from the variables in q to vertices
in G such that substituting the variables yields a subgraph of G. The substitutions of the
distinguished variables constitute the answers. In fact, such a mapping can be interpreted
as a homomorphism (i.e., a structure-preserving mapping) from the query graph to the data
graph. This task of matching a query graph pattern against the data graph is supported by
various RDF stores, which retrieve data for every triple pattern and join it along the query
edges. While the efficiency of retrieval depends on the physical data organization and
indexing, the efficiency of joins is largely determined by the join implementation and the
join-order optimization strategy. We discuss these performance drivers, which distinguish
existing RDF stores. There is no single dominant system; rather, the state of the art in RDF
data management is constituted by a combination of different concepts.
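The retrieve-and-join scheme described above can be sketched in a few lines. This is an illustrative in-memory version, not any particular RDF store's implementation: triples are plain string arrays, variables are terms starting with `?`, and the class and method names are invented for the example.

```java
import java.util.*;

// Minimal sketch of BGP matching: per-pattern retrieval followed by a join
// on the shared query variables (illustrative, not an actual RDF store).
public class BgpMatch {

    static boolean isVar(String term) { return term.startsWith("?"); }

    // Retrieve variable bindings for a single triple pattern {s, p, o}.
    static List<Map<String, String>> scan(List<String[]> data, String[] pat) {
        List<Map<String, String>> out = new ArrayList<>();
        for (String[] t : data) {
            Map<String, String> binding = new HashMap<>();
            boolean matches = true;
            for (int i = 0; i < 3; i++) {
                if (isVar(pat[i])) binding.put(pat[i], t[i]);
                else if (!pat[i].equals(t[i])) { matches = false; break; }
            }
            if (matches) out.add(binding);
        }
        return out;
    }

    // Join two binding sets along the query edges (shared variables).
    static List<Map<String, String>> join(List<Map<String, String>> left,
                                          List<Map<String, String>> right) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> a : left) {
            for (Map<String, String> b : right) {
                boolean compatible = true;
                for (Map.Entry<String, String> e : a.entrySet()) {
                    String other = b.get(e.getKey());
                    if (other != null && !other.equals(e.getValue())) {
                        compatible = false;
                        break;
                    }
                }
                if (compatible) {
                    Map<String, String> merged = new HashMap<>(a);
                    merged.putAll(b);
                    out.add(merged);
                }
            }
        }
        return out;
    }
}
```

The join-order choice this sketch leaves open is exactly the performance driver discussed above: evaluating the most selective pattern first keeps the intermediate binding sets small.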
Proposed System:
We propose an efficient and scalable distributed RDF data management system for
the cloud: a structure-oriented approach that exploits the structure patterns exhibited by
the underlying data, captured using a height- and label-parameterized structure index for
RDF. To capture the structure of the underlying data, we propose to use the structure
index, a concept that has been successfully applied in the area of XML and
semi-structured data management. It is basically a graph whose vertices represent groups
of data elements that are similar in structure. For constructing this index, we consider
structure patterns that are paths exhibiting certain edge labels. A structure index can be
used as a pseudo-schema for querying and browsing semi-structured RDF data on the
web. Further, we propose to leverage it for RDF data partitioning. To obtain a contiguous
storage of data elements that are structurally similar, vertices of the structure index are
mapped to tables: triples with the same property label, and triples whose subjects share
the same structure, are physically grouped. In such fine-granular groups that match a
given query, a higher proportion of the stored elements are candidate answers. Standard
query processing relies on what we call data-level processing, which consists of
operations that are executed against the data only. We suggest using the structure index
for structure-level query processing. A basic strategy is to match the query against the
structure index first, to identify groups of data that satisfy the query structure. Then, via
standard data-level processing, data in these relevant groups are retrieved and joined.
However, this needs to be performed only for those parts of the query that, in addition to
the structure constraints, also contain constants and distinguished variables representing
more specific constraints that can only be validated against the actual data. Instead of
performing structure- and data-level operations successively and independently of each
other, as in this basic strategy, we further propose an integrated strategy that aims at an
optimal combination of these two types of operations.
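A much-simplified, label-only version of this idea can be sketched as follows. This is not the paper's exact algorithm (it ignores the height parameter and multi-hop paths): subjects are grouped by the set of property labels they exhibit, and structure-level processing prunes to the groups whose structure covers all of the query's properties before any data-level work happens.

```java
import java.util.*;

// Illustrative label-parameterized structure index: group subjects by the
// set of properties they exhibit, then prune groups at the structure level.
public class StructureIndex {

    // Maps each distinct property-label set ("structure") to its subject group.
    static Map<Set<String>, Set<String>> build(List<String[]> triples) {
        Map<String, Set<String>> propsOf = new HashMap<>();
        for (String[] t : triples) {
            // t[0] = subject, t[1] = property label
            propsOf.computeIfAbsent(t[0], k -> new TreeSet<>()).add(t[1]);
        }
        Map<Set<String>, Set<String>> groups = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : propsOf.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new HashSet<>())
                  .add(e.getKey());
        }
        return groups;
    }

    // Structure-level pruning: only groups whose structure exhibits all the
    // query's property labels can contain answers; other groups are skipped
    // entirely, so no data-level retrieval is performed for them.
    static Set<String> candidates(Map<Set<String>, Set<String>> index,
                                  Set<String> queryProps) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<Set<String>, Set<String>> e : index.entrySet()) {
            if (e.getKey().containsAll(queryProps)) result.addAll(e.getValue());
        }
        return result;
    }
}
```

Mapping each group (each key of the index) to its own table is what yields the contiguous storage of structurally similar elements described above.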
Modules:
1. Semantic DB RDF Generation
2. Semantic Web RDF Generation
3. Data Partitioning and Indexing
4. Query Processing over Indexed Data
Semantic DB RDF Generation:
The Resource Description Framework (RDF) representation is constructed for semantic
data over a relational database containing both structured and unstructured data. A schema
is identified for the relational database, and an RDF graph representing the schema of the
database is constructed through the Model provided by the Jena API. The Model contains
all the information about the data linkages in the schema. In this process, the schema can
also be altered according to the administrator's requirements so that the search process is
effective.
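As an illustration of the schema-to-RDF mapping this module performs, the sketch below turns one relational row into triples using plain strings; the project itself builds a Jena Model instead, and the base URI and naming scheme here are assumptions made up for the example.

```java
import java.util.*;

// Hypothetical mapping of a relational row to RDF triples: the primary-key
// value identifies the subject resource, and each remaining column becomes
// one (subject, predicate, object) triple. URIs are illustrative only.
public class DbToRdf {
    static final String BASE = "http://example.org/"; // assumed base URI

    static List<String[]> rowToTriples(String table, String pkColumn,
                                       Map<String, String> row) {
        String subject = BASE + table + "/" + row.get(pkColumn);
        List<String[]> triples = new ArrayList<>();
        for (Map.Entry<String, String> col : row.entrySet()) {
            if (col.getKey().equals(pkColumn)) continue; // key names the subject
            triples.add(new String[]{subject,
                                     BASE + table + "#" + col.getKey(),
                                     col.getValue()});
        }
        return triples;
    }
}
```

With Jena, each emitted array would correspond to one `Statement` added to the Model, which then also records the linkages (foreign keys) between tables.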
Semantic Web RDF Generation:
The Semantic Web RDF generation module generates RDF data for user-entered data.
The user-entered data is converted to an RDF file and stored on the respective server; the
RDF file is constructed using the Jena API. The converted RDF file is exposed as a web
service, and when the RDF file is required, it is sent as a web service response.
Data Partitioning and Indexing:
RDF is also generated by mining the text content uploaded by users in blogs: the contents
of each file are analyzed and the meta-content is derived. The meta-content is the key to
the search process, so that a file can be rendered on demand. The text-mining process
analyzes the text word by word and also picks up the literal meaning behind the groups of
words that constitute each sentence. The words are looked up through the WordNet API so
that related terms can be found and used in the meta-content when generating RDF.
Generally, RDF is served by web services on servers all over the world, making the data
available for distributed access over the web. Hence this process operates in real time, and
the text is also analyzed by a web service provided by an open-source project deployed on
a production server. The user-uploaded content is thus analyzed on real servers with their
own natural-language-processing strategies, and the results are obtained in RDF format so
that they can be understood by other servers.
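A toy version of the meta-content step can be sketched as below. The WordNet lookup is stubbed out as a fixed synonym table (an assumption for the example; the module calls the real WordNet API), and the "analysis" is plain tokenization plus term counting rather than full NLP.

```java
import java.util.*;

// Toy meta-content extraction: tokenize uploaded text and count terms,
// folding related terms together. The synonym table stands in for a
// WordNet API lookup and is invented for this example.
public class MetaContent {

    // Hypothetical stand-in for WordNet: maps a word to a canonical term.
    static final Map<String, String> SYNONYMS =
        Map.of("weblog", "blog", "automobile", "car");

    static Map<String, Integer> extract(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String raw : text.toLowerCase().split("[^a-z]+")) {
            if (raw.isEmpty()) continue;
            String term = SYNONYMS.getOrDefault(raw, raw); // fold related terms
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }
}
```

The resulting term frequencies are what would be written into the generated RDF as meta-content, so that a query for "blog" also finds documents that only say "weblog".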
Query Processing over Indexed Data:
Similar data items that relate to the same resource are grouped together. The data-level
processes are subjected to structure-level processing by indexing the semantic data
elements. Multiple RDF files are grouped and structured together to form a master RDF
dataset that holds all the semantic information of a server and supports reasoning for any
form of query processing. The different resources are strongly interlinked by the
predicates in the triples. Query processing is handled directly on the RDF file by iterating
over its triples and matching them against the service query, and the URI representing the
location of the resource is returned.
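The final lookup step described above can be sketched as a straightforward iteration over the master RDF's triples; class and method names here are illustrative, and a real deployment would of course use an index rather than a linear scan.

```java
import java.util.*;

// Illustrative final step: iterate the master RDF's triples, match the
// service query against predicate and object, and return the URIs of the
// resources found (deduplicated, in first-seen order).
public class ResourceLookup {

    static List<String> locate(List<String[]> masterRdf,
                               String predicate, String object) {
        List<String> uris = new ArrayList<>();
        for (String[] triple : masterRdf) {
            // triple[0] = subject URI, triple[1] = predicate, triple[2] = object
            if (triple[1].equals(predicate) && triple[2].equals(object)
                    && !uris.contains(triple[0])) {
                uris.add(triple[0]);
            }
        }
        return uris;
    }
}
```

The returned URIs are what the web service response hands back to the caller, which can then dereference them to fetch the resources themselves.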
Software Requirements
Windows XP/7
JDK 1.6
J2EE
Tomcat 6.0
MySQL
Hardware Requirements
Hard Disk : 80GB and Above
RAM : 2GB and Above
Processor : P IV and Above
Architecture Diagram:
[Architecture diagram: the admin builds a user-defined schema over the relational DB,
and a schema is also derived from text; blog users upload content over the web to the
server; text mining and indexing over both sources produce the master RDF file, which
is exposed through a web service.]