Group 7 Databases On The Web and Semi Structured Databases

Uploaded by

Tongai Dune
Databases on the Web and Semi-Structured Data

Semi-Structured Data:
• Semi-structured data is a type of data that doesn't conform to the
tabular structure of relational databases or other forms of data tables.
• Semi-structured data is generated from sources such as IoT devices,
mobile apps and web pages.
• Semi-structured data is more flexible and free to evolve over time as
new attributes are added.
Characteristics of semi-structured data:
1. Tags or labels to identify data elements
2. Hierarchical or nested structure
3. Flexible schema or no schema at all
4. Data can be missing or incomplete
5. Data can have varying formats and structures
6. Data can be generated dynamically
7. Data can be large and voluminous
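As a rough illustration of these characteristics, the sketch below holds a few semi-structured records as JSON. The device readings and field names are invented for the example and do not come from any real dataset.

```python
import json

# Hypothetical IoT readings: every record labels its own fields,
# but the records do not share one fixed set of fields.
raw = """
[
  {"device": "sensor-1", "temp_c": 21.5},
  {"device": "sensor-2", "temp_c": 19.0, "location": {"room": "lab", "floor": 2}},
  {"device": "sensor-3", "tags": ["outdoor", "solar"]}
]
"""

records = json.loads(raw)

# Collect every field name that appears in at least one record.
all_fields = sorted({key for record in records for key in record})

# Missing fields are normal, so read them with a fallback value.
temperatures = [record.get("temp_c") for record in records]

# Nested structure is reached by walking down the hierarchy.
room = records[1]["location"]["room"]

print(all_fields)
print(temperatures)
print(room)
```

A relational table would need a column for every field up front; here a new attribute can simply appear in a later record.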
Benefits of semi-structured data:
1. Flexibility and adaptability
2. Ability to handle large volumes of data
3. Easy to integrate with other data sources
4. Supports real-time data processing and analytics
5. Enables Big Data analytics and machine learning
6. Supports IoT and smart devices data management
7. Supports mobile and web applications data management
Challenges of semi-structured data:
1. Difficulty in data modelling and schema design
2. Data quality and integrity issues
3. Data security and privacy concerns
4. Difficulty in data integration and interoperability
5. Requires specialized skills and tools for management and analysis
6. Scalability and performance issues
7. Data governance and compliance challenges
Document Schema
• A document schema, also known as a data schema or schema definition, is a
formal description or blueprint that outlines the structure, organization, and
properties of a document or a collection of documents.
• It defines the allowed fields, their data types, formats, relationships, and
any validation rules associated with them.
• Document schemas are commonly used in various contexts, including
content management systems, document-oriented databases
(e.g., MongoDB), markup and data formats (e.g., XML, JSON), and search engines.

Key elements typically included in a document schema:

1. Fields: A schema specifies the fields or attributes that a document can have. Each field
represents a specific piece of information within the document. For example, a document
schema for a blog post may include fields such as "title," "author," "content," and
"publish_date."

2. Data Types: Each field in a schema is associated with a specific data type, which defines
the kind of data that can be stored in that field. Common data types include strings,
numbers, booleans, dates, arrays, and objects.

3. Field Constraints: Schemas often define constraints or rules that dictate how data should
be stored in each field. These constraints can include requirements such as field length
limits, allowed value ranges, regular expressions for data formats, and mandatory or
optional fields.

4. Relationships: In some cases, documents may have relationships with other documents or
entities. Document schemas can define these relationships, such as one-to-one, one-to-
many, or many-to-many relationships. This allows for structured and interconnected data
modeling.
5. Indexing: Schemas may also specify indexing rules, which determine how documents are
indexed for efficient searching and retrieval. Indexes can be created on specific fields or
combinations of fields to optimize query performance.

6. Validation: Document schemas often include validation rules that enforce data integrity
and accuracy. These rules validate the data against the defined schema to ensure it meets
the specified requirements and constraints.
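The elements above can be sketched in a few lines of code. This is a minimal, hand-rolled illustration for the blog-post example, not the validation mechanism of any particular database; the schema layout, the 100-character title limit and the `validate` helper are assumptions made for the sketch.

```python
from datetime import date

# Hypothetical schema: field name -> (data type, required?, constraint check).
BLOG_POST_SCHEMA = {
    "title": (str, True, lambda v: 0 < len(v) <= 100),
    "author": (str, True, lambda v: len(v) > 0),
    "content": (str, True, lambda v: True),
    "publish_date": (date, False, lambda v: True),
}


def validate(document, schema):
    """Return a list of problems; an empty list means the document is valid."""
    problems = []
    for field, (expected_type, required, constraint) in schema.items():
        if field not in document:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        value = document[field]
        if not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
        elif not constraint(value):
            problems.append(f"constraint failed for {field}")
    # Fields that the schema does not declare are rejected as well.
    for field in document:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


good = {"title": "Hello", "author": "A. Writer", "content": "First post."}
bad = {"title": "", "content": 42, "views": 10}

print(validate(good, BLOG_POST_SCHEMA))
print(validate(bad, BLOG_POST_SCHEMA))
```

Relationships and indexing are left out; in practice they are usually handled by the database rather than by validation code.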

• Document schemas provide a standardized structure for storing and organizing
documents, ensuring consistency and facilitating data retrieval and manipulation. They
serve as a contract or agreement between data producers and consumers, enabling
interoperability and efficient data processing.

• In a document management system such as FileHold, document schemas are the highest
level of the metadata structure. They allow the Library Administrator to control the
documents or files that are added to the Library. Document schemas group like files
together even when they are filed in disparate places across the Library, and they manage
how files are added to the Library and what metadata is collected about them.
• When adding a file to the system, the user selects the schema to associate with the
document in the Document Schema field. The selected schema determines what metadata is
required and the format that the metadata takes.
• Document schemas should be created to fit your business processes. They are mapped
either to a class of documents, such as Executive, Compliance, or Record, or to
individual document types, such as Well Reports, Minutes and Packet Attachments.
• A Publisher works with these document schemas when adding or checking a document
into the Library.
• Descriptive schema names and descriptions increase the effectiveness of the document
management system. A schema name cannot be reused once it has been created, even if the
original schema has been deleted from the system, because of the retention features
within FileHold.
• Unique names for each schema also greatly reduce confusion for administrators and
end users of the system.
Document schemas manage the following document features:

1. General: Set schema name, format, and document numbering conventions.
2. Schema Membership: Define which groups have access to this schema.
3. Metadata: Define the metadata fields that are applied to a document.
4. Workflow: Set up a review and approval process for a document that belongs to this
schema.
5. Courier: Send documents to external individuals or internal FileHold users for viewing
and approval.
6. Custom Naming: Set up naming conventions for the documents.
7. Auto-Filing: Define the destination folder in the library.
8. Event Schedule: Determine when to convert the document to a record, archive, or
delete the document.
9. DB Lookup: Do a database lookup for all metadata fields in the schema.
• XSD stands for XML Schema Definition. It is a detailed plan for organizing and
checking XML documents.
• It is a versatile language that provides a framework to structure and validate XML data
effectively.
• It is a set of rules and restrictions designed to ensure the reliability, consistency,
and correctness of data stored in XML files. It specifies what elements, attributes, and
data types are allowed in an XML document, serving as a rulebook that XML
documents must adhere to.
• It also lets developers define constraints, such as acceptable value ranges for
elements and specific formatting rules for data. XML Schema Definition (XSD) files are
saved with the .xsd file extension.

Features of XSD
• XSD lets you check whether XML documents follow specific rules about how they are
organized and what kind of data they contain.
• It allows you to create detailed data arrangements with nested elements, attributes,
and different data types.
• It helps you define different kinds of data, such as text, numbers, and dates, and
decide how this data should be structured.
• You can also set default values and rules for elements and attributes in your XML
documents.

Explanation of an XSD example (a schema for a "Book" element):

• The xs:schema element declares the XML Schema namespace.
• The xs:element element defines the "Book" element as the root element.
• Inside the "Book" element, xs:complexType defines the type of content it can contain.
• The xs:sequence element specifies that the child elements must appear in the specified
order.
• The child elements "Title," "Author," "PublicationYear," and "Price" are each defined
with xs:element, together with their data types.
• The default attribute gives an element a default value.
• xs:string is used for text data, xs:gYear for a four-digit year, and xs:decimal for a
decimal number.
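The schema being explained is not reproduced in this text. A reconstruction consistent with the explanation might look like the one embedded below; which element carries the default attribute, and the value "0.00", are assumptions made for illustration. The surrounding Python only reads the schema to list what it declares.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Reconstructed schema; the default on Price is an assumption, not from the source.
BOOK_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Title" type="xs:string"/>
        <xs:element name="Author" type="xs:string"/>
        <xs:element name="PublicationYear" type="xs:gYear"/>
        <xs:element name="Price" type="xs:decimal" default="0.00"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# An XSD is itself an XML document, so an ordinary XML parser can read it.
schema_root = ET.fromstring(BOOK_XSD)

# The first top-level xs:element is the root element declaration ("Book").
book_decl = schema_root.find("xs:element", NS)

# Child declarations, in the order xs:sequence requires them to appear.
children = book_decl.findall("xs:complexType/xs:sequence/xs:element", NS)
declared = [(c.get("name"), c.get("type"), c.get("default")) for c in children]

print(book_decl.get("name"))
for name, xsd_type, default in declared:
    print(name, xsd_type, default)
```

This sketch does not validate an instance document. The Python standard library has no XSD validator, so validation needs an XSD-aware library such as lxml.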

Advantages of XSD
• Helps different systems and applications work together by giving them a common language for organizing and
checking data.
• Lets you set default values for missing parts of your data, so that if some information is missing it is
filled in with the default. This keeps your data consistent.
• Checks data to make sure it is correct, stopping mistakes from getting into XML documents.
• Provides different data types, such as primitive, derived, and user-defined data types. This
variety helps keep data organized and dependable.

Disadvantages of XSD
• Understanding and working with XSD can be challenging for those new to XML and related technologies.
• Data validation using XSD can introduce a performance overhead, especially for large XML documents.

Use Cases of XSD
• Helps in mapping XML data to database layouts, making it easier to move data between XML-based systems
and databases.
• Many programs use configuration files written in XML, and XSD describes how these files should be
organized and what rules they need to follow.
• XSD is important for validating XML documents, especially in fields such as finance or healthcare,
where data correctness is critical.
Querying XML data
• Querying XML data involves using a language or tool to extract
specific information from an XML document. The main options are:
1. XPath:
• XPath is a syntax for navigating and selecting nodes in an XML
document.
• It is supported by libraries in many programming languages, including Java, Python
and JavaScript.
• XPath expressions can be used to:
- Select nodes by name, attribute or text content
- Navigate relationships between nodes (e.g. parent, child, ancestor,
descendant)
- Filter nodes based on conditions (e.g. //node[@attribute='value'])
2. XQuery:
• XQuery is a query language for retrieving and manipulating data in
XML documents and XML databases.
• It plays a role similar to SQL but is designed specifically for XML data.
• XQuery can be used to:
- Retrieve specific nodes or data
- Join and merge data from multiple XML documents
- Transform and manipulate XML data
3. XML DOM:
• XML DOM (Document Object Model) is a programming interface for parsing
and manipulating XML documents.
• It represents the XML document as a tree-like structure, allowing you to
navigate and access nodes.
• XML DOM is often used in web development, especially with JavaScript.
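The tree view that the DOM provides can be sketched with Python's standard xml.dom.minidom module. The small library document and its element names are invented for this example.

```python
from xml.dom import minidom

# Hypothetical document used only for this example.
XML_TEXT = (
    "<library>"
    "<book id='b1'><title>XML Basics</title></book>"
    "<book id='b2'><title>Querying XML</title></book>"
    "</library>"
)

dom = minidom.parseString(XML_TEXT)
root = dom.documentElement

# Navigate the tree: each book element has a title child holding a text node.
titles = []
for book in root.getElementsByTagName("book"):
    title_node = book.getElementsByTagName("title")[0]
    titles.append((book.getAttribute("id"), title_node.firstChild.data))

# Manipulate the tree: build a new book element and attach it to the root.
new_book = dom.createElement("book")
new_book.setAttribute("id", "b3")
new_title = dom.createElement("title")
new_title.appendChild(dom.createTextNode("Storing XML"))
new_book.appendChild(new_title)
root.appendChild(new_book)

book_count = len(root.getElementsByTagName("book"))

print(titles)
print(book_count)
```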
XML parsing libraries:
• Many programming languages have libraries for parsing and querying XML
data, such as:
- xml2js (JavaScript)
- xmltodict (Python)
- javax.xml.parsers (Java)
• These libraries provide an easier way to work with XML data, often with a
more intuitive API.
SQL/XML:
• Some databases like Oracle or SQL Server support querying XML data
using SQL extensions.
• This allows you to use SQL queries to extract data from XML columns or
tables.
Examples of querying XML data (XPath expressions in which node, ancestor, descendant and
sibling stand for element names):
• Retrieving all nodes with a specific name: //node
• Retrieving nodes with a specific attribute value: //node[@attribute='value']
• Retrieving nodes with specific text content: //node[text()='value']
• Retrieving nodes under a specific ancestor, or the descendants of a node:
//ancestor//node or //node//descendant
• Retrieving the following siblings of a node: //node/following-sibling::sibling
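Several of these query patterns can be tried with Python's standard xml.etree.ElementTree module, as in the sketch below. The store document and its element names are invented for the example. ElementTree implements only a subset of XPath, so paths start with ".//" instead of "//", and the text() function and the following-sibling axis are not available.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this example.
XML_TEXT = """
<store>
  <section name="databases">
    <book lang="en"><title>XML Basics</title><price>30</price></book>
    <book lang="fr"><title>Le XML</title><price>25</price></book>
  </section>
  <section name="web">
    <book lang="en"><title>Web Services</title><price>40</price></book>
  </section>
</store>
"""

root = ET.fromstring(XML_TEXT)

# //book : every book element anywhere in the document.
all_titles = [b.findtext("title") for b in root.findall(".//book")]

# //book[@lang='en'] : filter on an attribute value.
english = [b.findtext("title") for b in root.findall(".//book[@lang='en']")]

# //book[title='Web Services'] : filter on the text of a child element.
match = root.findall(".//book[title='Web Services']")

# //section[@name='web']//book : books below a particular ancestor.
web_books = [b.findtext("title") for b in root.findall(".//section[@name='web']//book")]

print(all_titles)
print(english)
print(match[0].findtext("price"))
print(web_books)
```

Full XPath support, including the sibling axes, needs a fuller XPath implementation such as the one in lxml.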
Some popular applications of querying XML
data include:
• Data integration and exchange
• Data transformation and mapping
• Data validation and verification
• Data search and retrieval
• Data analytics and reporting
• Data warehousing and business intelligence
Storage of XML Data
• XML data can be stored in various ways depending on the requirements and
systems involved
1. File-Based Storage
• XML data can be stored as individual files on a file system. Each XML document is saved as
a separate file with a unique identifier or name.
• The file-based approach is simple but not suitable for large datasets where
efficient querying and retrieval are required.
• It lacks the isolation, atomicity, concurrency control and access control that a database
system provides.
• Directory structure:
  /xml_data
    Document1.xml
    Document2.xml
    Document3.xml
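A minimal sketch of file-based storage follows, using a temporary directory in place of a real /xml_data folder. The `save_documents` and `find_documents` helpers and the document contents are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def save_documents(directory, documents):
    """Write each XML string to <directory>/<name>.xml and return the paths."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, xml_text in documents.items():
        path = directory / f"{name}.xml"
        path.write_text(xml_text, encoding="utf-8")
        paths.append(path)
    return paths


def find_documents(directory, tag):
    """Return the names of files containing <tag>; every file has to be parsed."""
    hits = []
    for path in sorted(directory.glob("*.xml")):
        if ET.parse(path).getroot().find(f".//{tag}") is not None:
            hits.append(path.name)
    return hits


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "xml_data"
    save_documents(store, {
        "Document1": "<root><title>One</title></root>",
        "Document2": "<root><note>Two</note></root>",
        "Document3": "<root><title>Three</title></root>",
    })
    stored = sorted(p.name for p in store.glob("*.xml"))
    with_title = find_documents(store, "title")

print(stored)
print(with_title)
```

The search has to open and parse every file, which is why this approach scales poorly for querying.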
2. Relational Databases
• In a relational database, XML can be stored either by mapping XML elements and attributes to tables and
columns, or by keeping each document whole in a column of an XML data type.
• This approach provides the efficient querying, indexing and transactional capabilities offered by the database
management system.
• Table structure:
  Table: xml_data
  Columns: id (INT), xml_content (XML)
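Both relational options can be sketched with Python's built-in sqlite3 module. SQLite has no XML column type, so a TEXT column stands in for the XML column here; the book table and the sample documents are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")

# Option 1: keep each document whole in one column (TEXT stands in for XML).
conn.execute("CREATE TABLE xml_data (id INTEGER PRIMARY KEY, xml_content TEXT)")

# Option 2: map elements and attributes to ordinary columns (hypothetical table).
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, price REAL)")

documents = [
    "<book id='1'><title>XML Basics</title><price>30.0</price></book>",
    "<book id='2'><title>Web Services</title><price>40.0</price></book>",
]

for xml_text in documents:
    element = ET.fromstring(xml_text)
    book_id = int(element.get("id"))
    conn.execute("INSERT INTO xml_data VALUES (?, ?)", (book_id, xml_text))
    conn.execute(
        "INSERT INTO book VALUES (?, ?, ?)",
        (book_id, element.findtext("title"), float(element.findtext("price"))),
    )

# The mapped table can be filtered with plain SQL.
cheap = conn.execute("SELECT title FROM book WHERE price < 35 ORDER BY id").fetchall()

# The whole-document table returns the original XML unchanged.
original = conn.execute("SELECT xml_content FROM xml_data WHERE id = 2").fetchone()[0]

print(cheap)
print(original)
```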

3. Native XML Databases
• These are specialized databases designed specifically for storing and managing XML data.
• They provide advanced features such as indexing, querying and full-text search optimized for XML
documents.
• They are useful when dealing with large volumes of XML data and when complex XML processing is required.
• Collection structure:
  /db/xml_data
    Document1.xml
    Document2.xml
    Document3.xml
• eXist-db is a popular native XML database that stores XML documents directly.
4. Object-Oriented Databases
• These databases allow you to store XML documents as objects and provide mechanisms to query
and manipulate the XML data using object-oriented programming paradigms.
• Collection structure:
  Collection: xml_data
    Document1:
      _id: ObjectId("")
      xml_content: "<?xml version="1.0"?><root>…</root>"
5. NoSQL Databases
• Databases such as MongoDB or Couchbase can store XML data as well. These databases offer
flexible schema models and can handle semi-structured data like XML effectively.
• XML data can be stored as JSON-like documents or as binary data within NoSQL databases.
• Document structure:
  Document 1:
    Key: doc1
    Value: "<?xml version="1.0"?><root>…</root>"
6. In-Memory Data Structures
• XML data can be held in memory using data structures such as trees or graphs.
• This approach is suitable for scenarios where fast in-memory processing and manipulation of XML data are
required. However, the data is not persisted and will be lost when the application or
system restarts.
• XML tree structure:
  Root
    Element 1
      Attribute 1
    Element 2
      Text 1
    Element 3
      Text 2
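The tree above can be built and edited in memory with Python's xml.etree.ElementTree, as sketched below. The sketch assumes that Attribute 1 belongs to Element 1 and that each text node belongs to the element listed just before it; the attribute value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Build the tree node by node; nothing is written to disk.
root = ET.Element("Root")
element1 = ET.SubElement(root, "Element1", {"Attribute1": "value"})  # placeholder value
element2 = ET.SubElement(root, "Element2")
element2.text = "Text 1"
element3 = ET.SubElement(root, "Element3")
element3.text = "Text 2"

# In-memory manipulation is a matter of changing Python objects.
element2.text = "Text 1 (edited)"
root.remove(element3)

# The tree exists only while the program runs; serialize it if it must be kept.
serialized = ET.tostring(root, encoding="unicode")
restored = ET.fromstring(serialized)

print([child.tag for child in root])
print(restored.find("Element2").text)
```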
XML APPLICATIONS:
1. Storing Data with Complex Structure
• Many applications need to store data that are structured, but are not easily
modelled as relations.
• Consider, for example, user preferences that must be stored by an application
such as a browser. There are usually a large number of fields, such as home page,
security settings, language settings, and display settings, that must be recorded.
Some of the fields are multivalued, for example, a list of trusted sites, or may be
ordered lists, for example, a list of bookmarks.
• Applications traditionally used some type of textual representation to store
such data. Today, a majority of such applications prefer to store such
configuration information in XML format.
• XML-based representations are now widely used for storing documents,
spreadsheet data and other data that are part of office application packages.
The Open Document Format (ODF) and the Office Open XML (OOXML) format
are the most widely used formats for editable document representation.
• XML is also used to represent data with complex structure that must be
exchanged between different parts of an application.
2. Standardized Data Exchange Formats
• XML-based standards for representation of data have been developed for a variety of specialized
applications, ranging from business applications such as banking and shipping to scientific applications
such as chemistry and molecular biology.
• Some examples:
• The chemical industry needs information about chemicals, such as their molecular structure, and a
variety of important properties, such as boiling and melting points, calorific values, and solubility in
various solvents. ChemML is a standard for representing such information.
• In shipping, carriers of goods and customs and tax officials need shipment records containing detailed
information about the goods being shipped, from whom and to where they were sent, to whom and to
where they are being shipped, the monetary value of the goods, and so on.
• An online marketplace in which businesses can buy and sell goods [a so-called business-to-business (B2B)
market] requires information such as product catalogs, including detailed product descriptions and
price information, product inventories, quotes for a proposed sale, and purchase orders. For example,
the RosettaNet standards for e-business applications define XML schemas and semantics for
representing data as well as standards for message exchange.
• Using normalized relational schemas to model such complex data requirements would result in a large
number of relations that do not correspond directly to the objects that are being modelled.
• The relations would often have large numbers of attributes; explicit representation of
attribute/element names along with values in XML helps avoid confusion between attributes. Nested
element representations help reduce the number of relations that must be represented, as well as the
number of joins required to get required information, at the possible cost of redundancy.
3. Web Services
• Applications often require data from outside of the organization, or from another
department in the same organization that uses a different database.
• In many such situations, the outside organization or department is not willing to
allow direct access to its database using SQL, but is willing to provide limited
forms of information through predefined interfaces.
• When the information is to be used directly by a human, organizations provide
Web-based forms, where users can input values and get back desired information
in HTML form. However, there are many applications where such information
needs to be accessed by software programs, rather than by end users.
• Providing the results of a query in XML form is a clear requirement. In addition, it
makes sense to specify the input values to the query also in XML format.
• In effect, the provider of the information defines procedures whose input and
output are both in XML format. The HTTP protocol is used to communicate the
input and output information, since it is widely used and can go through firewalls
that institutions use to keep out unwanted traffic from the Internet.
• The Simple Object Access Protocol (SOAP) defines a standard for invoking procedures,
using XML for representing the procedure input and output. SOAP defines a standard
XML schema for representing the name of the procedure, and result status indicators
such as failure/error indicators. The procedure parameters and results are
application-dependent XML data embedded within the SOAP XML envelope.
• The SOAP standard is independent of the underlying programming language, and it is
possible for a site running one language, such as C#, to invoke a service that runs on a
different language, such as Java.
• A site providing such a collection of SOAP procedures is called a Web service.
• The Web Services Description Language (WSDL) is a language used to describe a Web
service’s capabilities.
• There is also a standard called Universal Description, Discovery, and Integration (UDDI)
that defines how a directory of available Web services may be created and how a
program may search in the directory to find a Web service satisfying its requirements.
• Invoking a Web Service: To invoke a Web service, a client must prepare an appropriate
SOAP XML message and send it to the service; when it gets the result encoded in XML,
the client must then extract information from the XML result. There are standard APIs in
languages such as Java and C# to create and extract information from SOAP messages.
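The invocation steps can be sketched with Python's standard library: build a SOAP request, then extract the result from a SOAP response. The envelope namespace is the SOAP 1.1 one; the service namespace, the GetPrice procedure and the canned response are hypothetical, and the HTTP exchange itself is left out so the sketch stays self-contained.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "http://example.com/stock"                    # hypothetical service namespace


def build_request(symbol):
    """Wrap a call to the hypothetical GetPrice procedure in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}GetPrice")
    ET.SubElement(call, f"{{{SVC_NS}}}Symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")


def parse_response(xml_text):
    """Extract the price from a response, or raise if the service reported a fault."""
    body = ET.fromstring(xml_text).find(f"{{{SOAP_NS}}}Body")
    fault = body.find(f"{{{SOAP_NS}}}Fault")
    if fault is not None:
        raise RuntimeError(fault.findtext("faultstring"))
    return float(body.findtext(f"{{{SVC_NS}}}GetPriceResponse/{{{SVC_NS}}}Price"))


# A canned response standing in for what the service would send back over HTTP.
CANNED_RESPONSE = f"""<soap:Envelope xmlns:soap="{SOAP_NS}" xmlns:s="{SVC_NS}">
  <soap:Body>
    <s:GetPriceResponse><s:Price>101.5</s:Price></s:GetPriceResponse>
  </soap:Body>
</soap:Envelope>"""

request_xml = build_request("ACME")
price = parse_response(CANNED_RESPONSE)

print(request_xml)
print(price)
```

In a real client the request string would be sent as the body of an HTTP POST to the service endpoint, and the reply passed to `parse_response`.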
4. Data Mediation
• Comparison shopping is an example of a mediation application, in which data about items,
inventory, pricing, and shipping costs are extracted from a variety of Web sites offering a
particular item for sale. The resulting aggregated information is significantly more valuable
than the individual information offered by a single site.
• As another example, suppose we want to provide centralized management of all the accounts
a customer holds at different institutions, e.g. bank accounts, retirement accounts and
credit card accounts.
• XML-based mediation addresses the problem by extracting an XML representation of
account information from the respective Web sites of the financial institutions where the
individual holds accounts. This information may be extracted easily if the institution exports
it in a standard XML format, for example, as a Web service.
• Once the basic tools are available to extract information from each source, a mediator
application is used to combine the extracted information under a single schema.
• The mediator must decide on a single schema that represents all required information, and
must provide code to transform data between different representations.
• XML query languages such as XSLT and XQuery play an important role in the task of
transformation between different XML representations.
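A mediator of this kind can be sketched as follows, with Python standing in for XSLT or XQuery. The two source formats, the unified customer_accounts schema and the convention of recording money owed as a negative balance are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical exports from two institutions, each in its own representation.
BANK_XML = "<accounts><account number='111' balance='250.00' type='savings'/></accounts>"
CARD_XML = "<cards><card><id>999</id><owed>75.50</owed></card></cards>"


def from_bank(xml_text):
    """Translate the bank's attribute-based format into unified records."""
    for a in ET.fromstring(xml_text).iter("account"):
        yield {"institution": "bank", "id": a.get("number"),
               "kind": a.get("type"), "balance": float(a.get("balance"))}


def from_card(xml_text):
    """Translate the card issuer's element-based format into unified records."""
    for c in ET.fromstring(xml_text).iter("card"):
        # Money owed is recorded as a negative balance in the unified schema.
        yield {"institution": "card", "id": c.findtext("id"),
               "kind": "credit", "balance": -float(c.findtext("owed"))}


def mediate(*sources):
    """Combine records from all sources under a single XML schema."""
    summary = ET.Element("customer_accounts")
    for records in sources:
        for r in records:
            ET.SubElement(summary, "account", {
                "institution": r["institution"], "id": r["id"],
                "kind": r["kind"], "balance": f"{r['balance']:.2f}",
            })
    return summary


unified = mediate(from_bank(BANK_XML), from_card(CARD_XML))
net_balance = sum(float(a.get("balance")) for a in unified)

print(ET.tostring(unified, encoding="unicode"))
print(net_balance)
```

Each `from_*` function is the per-source transformation code the mediator must provide; adding an institution means adding one more translator, while the unified schema stays the same.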
Some basic XML applications:
• Web Development: XML is commonly used for data exchange between web applications. It can be used
to represent and transmit structured data, such as configuration files, data feeds, or web service
responses.
• Document Management: XML provides a standardized format for storing and managing structured
documents. It is often used in content management systems (CMS) or document databases to store and
organize data in a hierarchical manner.
• Data Interchange: XML is frequently used as a data interchange format, allowing different systems and
platforms to exchange data in a standardized way. It provides a platform-independent format that can be
easily parsed and processed by different applications.
• Configuration Files: XML is often used for defining configuration settings in various software applications.
Configuration files written in XML can be easily edited and modified, and they provide a structured
representation of application settings.
• Syndication and Feeds: XML is the foundation for various syndication formats such as RSS (Really Simple
Syndication) and Atom. These formats are used to distribute and share content, such as news articles,
blog posts, or podcasts, in a standardized manner.
• Data Representation: XML can be used as a data representation format for storing and exchanging
structured data. It allows users to define custom tags and attributes to represent the specific data
elements and their relationships.
• Middleware Communication: XML is often used as a communication format in middleware systems that
facilitate communication between different software components or systems. It provides a standardized
way to exchange data between disparate systems.
