DS 5110 – Lecture 6
Semi-Structured Data
Roi Yehoshua
Agenda
HTML
Web scraping
XML
JSON
Web API, RESTful services
2 Roi Yehoshua, 2023
Semi-Structured Data
Data that doesn’t conform to a data model, but has some structure
A self-describing structure
Contains tags that describe the data
No separation between data and schema
Examples of semi-structured data:
HTML pages
XML
JSON
E-mails
TCP/IP packets
Semi-Structured Data
Pros
Flexible schema, can be easily changed
Data is portable
Can be used to exchange data between different databases
Support for nested or hierarchical data
Cons
Interpretation of the relationships in the data is more difficult
Queries are less efficient than in the relational model
Storage cost is higher
Cannot define constraints on the data
HTML
HTML is the standard language for creating Web pages
Consists of a series of elements which tell the browser how to display the content
An HTML element is defined by a start tag, some content, and an end tag:
<tagname>Content goes here...</tagname>
Some HTML elements have no content (like the <br> element)
These elements are called empty elements
Elements can have attributes that provide additional information about the element
Attributes are specified in the start tag of the element
For example, the href attribute of <a> specifies the URL of the page the link goes to:
<a href="[Link] google</a>
HTML Example
<!DOCTYPE html>
<html lang="en" xmlns="[Link]
<head>
<meta charset="utf-8" />
<title>Sample Page</title>
</head>
<body>
<h1 style="color:darkcyan">Header</h1>
<p>Some text</p>
<nav>
<a href="page1">To page1</a><br/>
<a href="page2">To page2</a><br/>
<a href="page3">To page3</a><br/>
</nav>
</body>
</html>
DOM (Document Object Model)
The Document Object Model (DOM) is a programming interface for web documents
Used mainly for XML, HTML and SVG documents
Represents the document as a tree of nodes, known as the DOM tree
DOM methods allow programmatic access to the tree
<!DOCTYPE html>
<html lang="en">
<head>
<title>My Document</title>
</head>
<body>
<h1>Header</h1>
<p>Paragraph</p>
</body>
</html>
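A minimal sketch of programmatic access to a DOM tree, using Python's built-in xml.dom.minidom on the page above. It works here only because the sample happens to be well-formed XML; the DOCTYPE line is left out, and the helper name dom_outline is made up for illustration.

```python
from xml.dom import minidom

# The sample page from the slide, written compactly (no whitespace text nodes).
PAGE = (
    '<html lang="en">'
    "<head><title>My Document</title></head>"
    "<body><h1>Header</h1><p>Paragraph</p></body>"
    "</html>"
)


def dom_outline(node, depth=0):
    """Return one indented line per element or text node under `node`."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        lines.append("  " * depth + repr(node.data))
    for child in node.childNodes:
        lines.extend(dom_outline(child, depth + 1))
    return lines


document = minidom.parseString(PAGE)
outline = dom_outline(document.documentElement)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of the DOM tree.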
Web Scraping
Web scraping is a technique for extracting information from websites
It is used to transform semi-structured data (HTML) into structured data (a database or spreadsheet)
Scraping data that is publicly available on the web is generally legal
Be careful when scraping personal data, intellectual property, or confidential data
In any case, check the website’s Terms of Service before scraping its data
Web Scraping
You need two Python libraries for scraping data:
Requests: used for fetching web pages from URLs
BeautifulSoup: a library for pulling data out of HTML and XML files
These packages are included with the Anaconda distribution
To install them manually you can use pip install:
pip install requests
pip install beautifulsoup4
Web Scraping
The following example extracts data about US states from Wikipedia
[Link]
Loading the Page Content
First, we need to retrieve the HTML of the page using the requests library:
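A minimal sketch of this step, assuming the requests package is installed. The helper name fetch_html and the timeout value are illustrative choices, not part of the original slides; the page URL is passed in by the caller.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    return response.text
```

Calling fetch_html with the address of the Wikipedia page would return the raw HTML that the parsing steps work on.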
Parsing the Page Content
Beautiful Soup is a Python library for parsing structured data
To parse the HTML content use the following code:
The second argument, "html.parser", makes sure that you use the appropriate parser for the HTML content
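A short sketch of the parsing step, assuming beautifulsoup4 is installed. A small inline HTML string stands in for the downloaded page so the example runs without network access; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML text returned by the requests library.
html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Some text</p></body></html>"
)

# "html.parser" selects Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")
print(type(soup).__name__)
```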
Navigating using Tag Names
The simplest way to navigate the parsed tree is to use the name of the tag you want
For example, if you want the <title> tag, just say soup.title:
To get only the text content of the tag use its text attribute:
You can use the dot multiple times to zoom in on a deeper part of the tree:
Using a tag name as an attribute will give you only the first tag by that name
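The points above can be sketched on a small inline document (the markup and variable names are made up for illustration; beautifulsoup4 is assumed to be installed):

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><h1>Header</h1>"
    "<p>First <b>bold</b> paragraph</p>"
    "<p>Second paragraph</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title        # the whole <title> element
title_text = soup.title.text  # only its text content
bold_tag = soup.body.p.b      # chain tag names to reach deeper in the tree
first_p = soup.p              # only the first <p>, even though there are two

print(title_tag, title_text, bold_tag, first_p, sep="\n")
```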
Inspecting Elements
For easier viewing, you can prettify any Beautiful Soup object when you print it out
Navigation Attributes
Each tag has a few attributes that allow you to traverse the tree:
.contents – a list of the tag’s children
.children – an iterable over the tag’s children
.descendants – an iterable over all of the tag’s descendants
.parent – the element’s parent
.next_sibling – the next sibling of the element on the same level
.previous_sibling – the previous sibling of the element on the same level
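A small sketch of these attributes on a made-up list, assuming beautifulsoup4 is installed. The markup is written without whitespace between tags so that no whitespace-only text nodes appear among the children.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
children = ul.contents                                # list of the three <li> tags
child_names = [child.name for child in ul.children]   # same children, via an iterator
descendants = list(ul.descendants)                    # each <li> plus the string inside it

second = children[1]
print(second.parent.name)            # name of the enclosing element
print(second.previous_sibling.text)  # text of the <li> before it
print(second.next_sibling.text)      # text of the <li> after it
```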
Searching the Tree
The two most common methods for searching the tree are:
find(name, attrs) – retrieves the first tag that matches your filters
find_all(name, attrs) – retrieves all the tags that match your filters
The name argument searches for only tags with certain names
Any argument that’s not recognized becomes a filter on one of the tag’s attributes
For example, let’s find all the link elements in the page with the CSS class 'image':
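A sketch of such a search on a made-up fragment (beautifulsoup4 assumed). Because class is a reserved word in Python, Beautiful Soup takes the CSS class filter through the keyword argument class_.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a class="image" href="/a.png">A</a>
  <a href="/page">Page</a>
  <a class="image thumb" href="/b.png">B</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                       # first <a> in the document
image_links = soup.find_all("a", class_="image")  # every <a> whose classes include "image"
hrefs = [link["href"] for link in image_links]
print(hrefs)
```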
Searching the Tree
You can also call the find() and find_all() methods on a specific element
This will search only inside the subtree rooted at that element
For example, let’s search for links inside the table
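A sketch of a subtree search on a made-up fragment (beautifulsoup4 assumed): the link outside the table is found by the document-wide search but not by the search rooted at the table.

```python
from bs4 import BeautifulSoup

html = """
<a href="/outside">outside</a>
<table>
  <tr><td><a href="/in1">in 1</a></td></tr>
  <tr><td><a href="/in2">in 2</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")     # searches the whole document
table = soup.find("table")
table_links = table.find_all("a")  # searches only inside the table
print(len(all_links), len(table_links))
```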
Searching for Strings
With the string argument you can search for strings instead of tags:
You can also pass a regular expression to it:
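Both forms can be sketched on a made-up list (beautifulsoup4 assumed). The results are strings from the document rather than tags; each one still knows its enclosing tag through .parent.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Alabama</li><li>Alaska</li><li>Arizona</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Alaska")                    # strings equal to "Alaska"
starts_with_al = soup.find_all(string=re.compile(r"^Al")) # strings matching the pattern
print(exact, starts_with_al)
```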
Attributes
You can access a tag’s attributes by treating the tag like a dictionary:
You can access the attributes dictionary directly as .attrs:
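A sketch of both access styles on a made-up link (beautifulsoup4 assumed). Note that multi-valued attributes such as class come back as a list.

```python
from bs4 import BeautifulSoup

html = '<a id="home" class="nav main" href="/index.html">Home</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]          # dictionary-style access; raises KeyError if missing
missing = link.get("title")  # .get() returns None for a missing attribute
attributes = link.attrs      # the full attribute dictionary
print(href, missing, attributes)
```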
Scraping the Wikipedia Page
Let’s now extract the information we need from the US states Wikipedia page
We first locate the states table as the second table in the page:
Scraping the Wikipedia Page
We need to skip the first two rows of the table, which contain the headers
Scraping the Wikipedia Page
By inspecting the structure of each row we find the location of the data we need
The name of the state is in the first <th> tag, inside an <a> element
The population size is in the fifth <td> element
The area size is in the sixth <td> element
However, in states where the capital is also the largest city, the location of the data changes
The population size is in the fourth <td> element
The area size is in the fifth <td> element
We can identify these states by checking if the second <td> element has colspan="2"
Scraping the Wikipedia Page
The function to get the states data:
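The original code for this slide is not available, so the following is only a sketch of the row layout described above, run on a tiny made-up table rather than the real Wikipedia page (beautifulsoup4 assumed). The names parse_state_row and SAMPLE_TABLE, the sample values, and the conversion of numbers to int are all assumptions.

```python
from bs4 import BeautifulSoup

# Made-up rows that mimic the described layout (not real Wikipedia data).
SAMPLE_TABLE = """
<table>
  <tr><th>Header row 1</th></tr>
  <tr><th>Header row 2</th></tr>
  <tr>
    <th><a href="/wiki/State_A">State A</a></th>
    <td>SA</td><td>Capital A</td><td>Big City A</td><td>1800</td>
    <td>1,000,000</td><td>50,000</td>
  </tr>
  <tr>
    <th><a href="/wiki/State_B">State B</a></th>
    <td>SB</td><td colspan="2">City B</td><td>1850</td>
    <td>2,500,000</td><td>75,000</td>
  </tr>
</table>
"""


def parse_state_row(row):
    """Extract name, population and area from one table row."""
    name = row.find("th").find("a").text.strip()
    cells = row.find_all("td")
    # If capital and largest city share one cell (colspan="2"),
    # every later cell sits one position to the left.
    shift = 1 if cells[1].get("colspan") == "2" else 0
    population = int(cells[4 - shift].text.strip().replace(",", ""))
    area = int(cells[5 - shift].text.strip().replace(",", ""))
    return {"name": name, "population": population, "area": area}


soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")
rows = soup.find("table").find_all("tr")[2:]  # skip the two header rows
states = [parse_state_row(row) for row in rows]
print(states)
```

On the real page the table would first be located among the page's tables, and the area column may need float rather than int.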
Scraping the Wikipedia Page
We can now build a DataFrame from this data:
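A sketch of this step, assuming pandas is installed and that the scraped data is a list of dictionaries, one per state. The two records are placeholders, not real figures.

```python
import pandas as pd

# Placeholder records in the shape a row-parsing function might return.
states = [
    {"name": "State A", "population": 1000000, "area": 50000},
    {"name": "State B", "population": 2500000, "area": 75000},
]

# Each dictionary becomes a row; the keys become column names.
df = pd.DataFrame(states)
print(df)
```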
XML
XML: Extensible Markup Language
Unlike HTML:
Designed to represent data and not UI elements
Extensible: users can define their own tags
Mainly used for data exchange between applications
<purchase_order>
  <id>P-101</id>
  <purchaser> … </purchaser>
  <itemlist>
    <item>
      <id>RS1</id>
      <description>Atom rocket sled</description>
      <quantity>2</quantity>
      <price>199.95</price>
    </item>
    <item>
      <id>SG2</id>
      <description>Superb glue</description>
      <quantity>1</quantity>
      <unit-of-measure>liter</unit-of-measure>
      <price>29.95</price>
    </item>
  </itemlist>
  <total_cost>429.85</total_cost>
</purchase_order>
Structure of XML Data
The building blocks of an XML document are elements
Element: a section of data beginning with <tag> and ending with a matching </tag>
An element can contain text, attributes, and other elements
Elements must be properly nested
Proper nesting
<course> … <title> … </title> … </course>
Improper nesting
<course> … <title> … </course> </title>
An empty element <tag></tag> may be abbreviated as <tag/>
Every document must have a single root element that contains all other elements
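These rules can be checked with Python's built-in xml.etree.ElementTree, which refuses to parse documents that are not well-formed. The helper name is_well_formed is made up for this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


proper = "<course><title>Intro</title></course>"
improper = "<course><title>Intro</course></title>"  # tags closed in the wrong order
two_roots = "<course/><course/>"                    # more than one root element

print(is_well_formed(proper), is_well_formed(improper), is_well_formed(two_roots))

# <tag></tag> and <tag/> parse to the same empty element.
long_form = ET.fromstring("<tag></tag>")
short_form = ET.fromstring("<tag/>")
```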
Attributes
▪ Elements can have attributes
<course course_id="CS-101">
<title>Intro. to Computer Science</title>
<dept_name>Comp. Sci.</dept_name>
<credits>4</credits>
</course>
▪ Attributes are specified by name=value pairs inside the starting tag of an element
▪ An element may have several attributes, but each attribute can only occur once
<course course_id="CS-101" credits="4">
Attributes vs. Subelements
In the context of documents, attributes are part of markup, while subelement
contents are part of the basic document contents
In the context of data representation, the distinction is less relevant
Same information can be represented in two ways
<course course_id="CS-101"> … </course>
<course>
<course_id>CS-101</course_id> …
</course>
Suggestion: use attributes for identifiers of elements, and store all other data as
subelements
Namespaces
A namespace allows organizations to specify globally unique names for the elements
A namespace is defined by an xmlns attribute in the start tag of the root element
<root xmlns:prefix="URL">…</root>
Typically, the URL of the organization’s web site is used as the namespace identifier
The namespace prefix is prepended to each tag or attribute name in the document
prefix:element-name
<university xmlns:yale="[Link]
…
<yale:course>
<yale:course_id>CS-101</yale:course_id>
<yale:title>Intro. to Computer Science</yale:title>
<yale:dept_name> Comp. Sci.</yale:dept_name>
<yale:credits>4</yale:credits>
</yale:course>
…
</university>
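A sketch of how a parser sees namespaced elements, using Python's built-in xml.etree.ElementTree. The namespace URI here is a hypothetical placeholder, since the URL in the example above is not available.

```python
import xml.etree.ElementTree as ET

YALE_NS = "http://example.org/yale"  # hypothetical URI, used only for this sketch

doc = f"""<university xmlns:yale="{YALE_NS}">
  <yale:course>
    <yale:course_id>CS-101</yale:course_id>
    <yale:title>Intro. to Computer Science</yale:title>
  </yale:course>
</university>"""

root = ET.fromstring(doc)
ns = {"yale": YALE_NS}  # maps the prefix used in queries to the namespace URI

course = root.find("yale:course", ns)
title = course.find("yale:title", ns).text
unqualified = root.find("course")  # no namespace given, so nothing matches

print(course.tag)  # ElementTree stores the tag as {namespace-URI}local-name
print(title)
```

The prefix itself is only shorthand; what identifies the element is the URI it is bound to.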
Comparison with Relational Data
Inefficient
Tags, which in effect represent schema information, are repeated
Redundant storage of data
e.g., item descriptions may be repeated in multiple purchase orders that ordered the same item
Better than relational tuples as a data exchange format
Unlike relational tuples, XML data is self-documenting due to the presence of tags
Non-rigid format: tags can be added
XML allows nested structures
Wide acceptance, not only in database systems, but also in browsers, tools, and applications
XML Document Schema
Database schemas constrain what information can be stored, and the data types of
stored values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
Otherwise, a site cannot automatically interpret data received from another site
Two mechanisms for specifying XML schema
Document Type Definition (DTD)
An older format
XML Schema
Newer format, widely used today
XML Schema
An XML Schema describes the structure of an XML document
Schema definitions themselves are specified in XML syntax, using a variety of tags
defined by XML Schema
These tags are typically prefixed by the namespace prefix xs
Elements are specified using the <xs:element> tag
The type of an element can be simple or complex
XML Schema defines a number of built-in types such as string, integer, decimal and date
e.g., <xs:element name="dept_name" type="xs:string"/>
We can use the <xs:complexType> element to create named complex types
<xs:sequence> defines the complex type as a sequence of elements
Attributes are specified using the <xs:attribute> tag
XML Document for the University Data
<university>
  <department dept_name="Comp. Sci.">
    <building>Taylor</building>
    <budget>100000</budget>
  </department>
  <department dept_name="Biology">
    <building>Watson</building>
    <budget>90000</budget>
  </department>
  <course course_id="CS-101" dept_name="Comp. Sci.">
    <title>Intro. to Computer Science</title>
    <credits>4</credits>
  </course>
  ….
  <instructor ID="10101" dept_name="Comp. Sci.">
    <name>Srinivasan</name>
    <salary>65000</salary>
    <teaches>
      <course course_id="CS-101"/>
      ….
    </teaches>
  </instructor>
  ….
</university>
XML Schema for the University Data
<xs:schema xmlns:xs=“[Link]
  <xs:element name="university" type="universityType"/>
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="building" type="xs:string"/>
        <xs:element name="budget" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="dept_name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
  ….
  <xs:element name="instructor">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="salary" type="xs:decimal"/>
        <xs:element name="teaches" type="teachesType"/>
      </xs:sequence>
      <xs:attribute name="ID" type="xs:string"/>
      <xs:attribute name="dept_name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
  ….
  <xs:complexType name="teachesType">
    <xs:sequence>
      <xs:element ref="course" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="universityType">
    <xs:sequence>
      <xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="course" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="instructor" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
Application Program Interfaces to XML
There are two standard APIs to XML data:
SAX (Simple API for XML)
Parses the XML document one piece at a time
Provides event handlers for parsing events
e.g., start of element, end of element
Need to keep track of the program’s position in the document
DOM (Document Object Model)
Represents the XML document as a tree structure
Provides a variety of properties and methods for traversing the DOM tree
Also provides methods for updating the DOM tree
Useful for random-access applications
Supported by many programming languages with slightly different syntaxes
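A sketch of the SAX style using Python's built-in xml.sax, on a fragment of the university data. The handler class name BuildingCollector is made up; the flag it maintains illustrates how a SAX program must track its own position in the document.

```python
import xml.sax


class BuildingCollector(xml.sax.ContentHandler):
    """Collect the text of every <building> element."""

    def __init__(self):
        super().__init__()
        self.buildings = []
        self._inside_building = False  # our own record of where we are
        self._parts = []

    def startElement(self, name, attrs):
        if name == "building":
            self._inside_building = True
            self._parts = []

    def characters(self, content):
        if self._inside_building:
            self._parts.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        if name == "building":
            self.buildings.append("".join(self._parts))
            self._inside_building = False


XML = (
    b"<university>"
    b'<department dept_name="Comp. Sci."><building>Taylor</building></department>'
    b'<department dept_name="Biology"><building>Watson</building></department>'
    b"</university>"
)

handler = BuildingCollector()
xml.sax.parseString(XML, handler)
print(handler.buildings)
```

Nothing is kept in memory except what the handler chooses to store, which is why SAX suits very large documents, while DOM suits random access.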
XML Processing in Python
Python’s interfaces for processing XML are grouped in the xml package
The XML handling submodules are:
xml.sax: a SAX parser
xml.dom: the DOM API definition
xml.dom.minidom: a minimal DOM implementation
xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
A more "Pythonic" API compared to the W3C-controlled DOM
MiniDom Example
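The original code for this slide is not available; the following is a minimal sketch of xml.dom.minidom on a fragment of the university data, with illustrative variable names.

```python
from xml.dom import minidom

UNIVERSITY_XML = """<university>
  <department dept_name="Comp. Sci.">
    <building>Taylor</building>
    <budget>100000</budget>
  </department>
  <department dept_name="Biology">
    <building>Watson</building>
    <budget>90000</budget>
  </department>
</university>"""

dom = minidom.parseString(UNIVERSITY_XML)

departments = []
for dept in dom.getElementsByTagName("department"):
    building = dept.getElementsByTagName("building")[0]
    # The text of an element is stored in a child text node.
    departments.append((dept.getAttribute("dept_name"), building.firstChild.data))

print(departments)
```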
ElementTree Example
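The original code for this slide is not available; the following is a minimal sketch of xml.etree.ElementTree on a fragment of the university data, with illustrative variable names.

```python
import xml.etree.ElementTree as ET

UNIVERSITY_XML = """<university>
  <department dept_name="Comp. Sci.">
    <building>Taylor</building>
    <budget>100000</budget>
  </department>
  <department dept_name="Biology">
    <building>Watson</building>
    <budget>90000</budget>
  </department>
</university>"""

root = ET.fromstring(UNIVERSITY_XML)

# Attributes are read with .get(); child text with .findtext().
budgets = {
    dept.get("dept_name"): int(dept.findtext("budget"))
    for dept in root.findall("department")
}
total_budget = sum(budgets.values())
print(budgets, total_budget)
```

Compared with the DOM API, elements behave more like ordinary Python objects: text and attributes are reached directly, without separate text nodes.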
Querying and Transforming XML Data
Translation of information from one XML schema to another
Querying on XML data
These two tasks are closely related, and are handled by the same tools
Standard XML querying/translation languages
XPath
Simple language consisting of path expressions
XQuery
An XML query language with a rich set of features
XSLT
Simple language designed for translation from XML to XML and XML to HTML
Tree Model of XML Data
Query and transformation languages are based on a tree model of XML data
XPath
XPath is a querying language for selecting nodes from an XML document
A path expression is used to navigate and select elements from the document
Consists of a sequence of steps separated by / or //
/ selects a child node (the first / in the path selects the root node)
// selects all the descendant nodes (including self)
Examples:
/bookstore/book selects all the books
//title selects all the title elements anywhere in the document
/bookstore/book//title selects all the title elements anywhere under a book element
XPath Predicates
Predicates written inside [] are used to find specific nodes in the document
They may follow any step in the path
Index values in predicates start from 1
Can use Boolean operators and, or, and a function not()
A union operator | forms the union of two node sets
Examples:
//book[price < 25] selects books with price less than 25
/bookstore/book[1]/title selects the title of the first book
//book[author='J.K. Rowling']/title selects titles of books authored by J.K. Rowling
//book[price] selects books that have a price subelement
//book[year > 2000 and price < 20] selects books released after 2000 with price less than 20
//book[price > 2 * discount] selects books whose price is greater than twice their discount
XPath Attributes
Attributes are accessed using @
Examples:
/bookstore/book[1]/title/@lang selects the lang attribute of the first book’s title
//title[@lang='en'] selects title nodes that have an attribute lang with a value of 'en'
//title[@lang] selects title nodes that have an attribute lang
XPath Functions
XPath offers a variety of functions to filter your selections:
Number functions: count(), sum(), round(), …
String functions: concat(), contains(), starts-with(), substring(), …
Boolean functions: not(), true(), false(), …
Functions to get properties of nodes: name(), text(), position()
Examples:
//book/title/text() gets the titles of the books (without the enclosing <title> tags)
count(//book) returns the number of books
//book[contains(title, 'Harry')] selects books whose title contains 'Harry'
//book[not(contains(title, 'Harry'))] selects books which don’t have 'Harry' in the title
Class Exercise
Write XPath expressions to find the following nodes:
Select the language of books whose price is greater than 20
Select the title of the books that have more than one author
XPath in Python
To run an XPath query in Python, you can use the lxml library
pip install lxml
from lxml import etree
# Parse the XML file
root = etree.parse('[Link]')
# Run the XPath query
results = root.xpath('//book[price > 20]/title/text()')
# Print the results
for result in results:
    print(result)
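A self-contained variant that runs without an external file, assuming lxml is installed. The bookstore document is made up for illustration (titles, authors and prices are sample values); it exercises predicates, attributes and functions from the preceding slides.

```python
from lxml import etree

# Made-up sample document, parsed from memory instead of a file.
BOOKS_XML = b"""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fr">Sample Book</title>
    <author>A. Author</author>
    <year>1999</year>
    <price>15.00</price>
  </book>
</bookstore>"""

root = etree.fromstring(BOOKS_XML)

expensive = root.xpath("//book[price > 20]/title/text()")           # predicate on a subelement
first_lang = root.xpath("/bookstore/book[1]/title/@lang")           # attribute access
by_author = root.xpath("//book[author='J.K. Rowling']/title/text()")
book_count = root.xpath("count(//book)")                            # number functions return a float
no_harry = root.xpath("//book[not(contains(title, 'Harry'))]/title/text()")

for label, value in [("expensive", expensive), ("first_lang", first_lang),
                     ("by_author", by_author), ("book_count", book_count),
                     ("no_harry", no_harry)]:
    print(label, value)
```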
XML Applications
Storing data with complex structure
e.g., user preferences, configuration files
Storing documents and spreadsheet data
e.g., the OpenDocument Format (ODF), used for storing OpenOffice documents, is based on XML
Numerous other standards for a variety of applications
e.g., ChemML, MathML
Exchanging data between different parts of the application
Standard for data exchange for web services
Remote method invocation over HTTP protocol
XML is used to represent method input and output
Data mediation
Common data representation format to bridge different systems
JSON
JavaScript Object Notation
Textual representation widely used for data exchange
Lightweight compared to XML
Simple to parse
Supported by many programming languages
JSON Syntax
JSON closely resembles the syntax of JavaScript object literals
JSON is built on two structures:
Objects (a collection of key/value pairs)
Arrays (an ordered list of values)
Supported primitive types
Numbers
Strings
Booleans
null
Property names (keys) must be strings
Allows only double-quoted strings
Processing JSON in Python
The json package provides functions for encoding and decoding JSON data
Main functions:
Function                Description
json.dump(obj, file)    Serialize obj as a JSON formatted stream to file
json.dumps(obj)         Serialize obj to a JSON string
json.load(file)         Deserialize file containing a JSON document to a Python object
json.loads(s)           Deserialize string s to a Python object
Objects in the JSON document are converted into Python dictionaries
Arrays in the JSON document are converted into Python lists
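A short round-trip sketch of the four functions, using a department record from the university data. An in-memory io.StringIO buffer stands in for a real file so the example leaves nothing on disk.

```python
import io
import json

dept = {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000}

# String round trip: dumps() encodes, loads() decodes.
text = json.dumps(dept)
restored = json.loads(text)

# File round trip: dump() writes to a file-like object, load() reads from one.
buffer = io.StringIO()
json.dump(dept, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(text)
print(restored == dept, from_file == dept)
```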
JSON Document for the University Data
{
"departments": [
{
"dept_name": "Comp. Sci.",
"building": "Taylor",
"budget": 100000
},
{
"dept_name": "Biology",
"building": "Watson",
"budget": 90000
},
...
],
"courses": [
{
"course_id": "CS-101",
"dept_name": "Comp. Sci.",
"title": "Intro. to Computer Science",
"credits": 4
},
...
],
"instructors": [
{
"ID": "10101",
"dept_name": "Comp. Sci.",
"name": "Srinivasan",
"salary": 65000,
"teaches": ["CS-101", "CS-315", "CS-347"]
},
{
"ID": "83821",
"dept_name": "Comp. Sci.",
"name": "Brandt",
"salary": 92000,
"teaches": ["CS-190", "CS-319"]
},
...
]
}
Reading the Document in Python
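The original code for this slide is not available; the following sketch decodes the instructors part of the document from a string (with a file, json.load would be used instead) and walks the resulting dictionaries and lists. Variable names are illustrative.

```python
import json

text = """
{
  "instructors": [
    {"ID": "10101", "dept_name": "Comp. Sci.", "name": "Srinivasan",
     "salary": 65000, "teaches": ["CS-101", "CS-315", "CS-347"]},
    {"ID": "83821", "dept_name": "Comp. Sci.", "name": "Brandt",
     "salary": 92000, "teaches": ["CS-190", "CS-319"]}
  ]
}
"""

data = json.loads(text)
instructors = data["instructors"]  # a JSON array becomes a Python list

# Each JSON object is a dictionary, so fields are reached by key.
course_counts = {person["name"]: len(person["teaches"]) for person in instructors}
teaches_cs101 = [p["name"] for p in instructors if "CS-101" in p["teaches"]]

print(course_counts)
print(teaches_cs101)
```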
Loading JSON into a DataFrame
You can pass a JSON object (dictionary) directly to the DataFrame constructor
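A sketch of this step, assuming pandas is installed, using the departments array from the university data. Passing the list of department dictionaries gives one row per department.

```python
import json

import pandas as pd

text = """
{"departments": [
  {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
  {"dept_name": "Biology", "building": "Watson", "budget": 90000}
]}
"""

data = json.loads(text)

# The keys of each dictionary become the column names.
departments = pd.DataFrame(data["departments"])
print(departments)
```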
Web APIs
Web APIs are services provided by websites that allow clients to query their content
These services can be accessed from different platforms
e.g., web pages, desktop/mobile applications
Most web APIs support both XML and JSON formats
To use the API, you need to make an HTTP request for a specific URL
Web APIs usually require API keys
These protect the API vendor from malicious use of the service
You must apply to get a key, and include it in your code to access the API functionality
Web APIs
Common web APIs
Google’s suite of APIs enables you to communicate with various Google services
e.g., Google Search, Google Translate, Google Maps, Gmail, etc.
Facebook’s suite of APIs enables you to use various parts of the Facebook ecosystem
e.g., providing app login using Facebook login, accepting in-app payments, etc.
Twitter API allows you to embed Twitter data on your site, e.g., your latest tweets
Map APIs like MapQuest and Google Maps API allow you to do things with maps
Telegram APIs allow you to embed content from Telegram channels on your site
YouTube API allows you to embed YouTube videos on your site, search YouTube, etc.
Pinterest API provides tools to manage Pinterest boards and pins
Twilio API provides frameworks for building voice and video call functionality
REST Architecture
One of the most popular ways to build server APIs is the REST architectural style
REST stands for representational state transfer
Defines an architectural pattern for communication between client and server
Defines the following architectural constraints:
Uniform interface – the server provides a uniform interface for accessing resources
Client-server – the client and the server must be decoupled from each other
Stateless – the server won’t maintain any state between requests from the client
Cacheable – the data retrieved from the server should be cacheable by the client or the server
Layered system – the client may access the server resources indirectly through other layers
such as a proxy or load balancer
Code on demand (optional) – the server may transfer code to the client that it can run
RESTful Web Services
Web services that follow the REST style are known as RESTful web services
These web services expose their data through public URLs
e.g., the URL for the GitHub REST API is [Link]
You access the data by sending an HTTP request to that URL
API Endpoints
A REST API exposes a set of public URLs that map to different actions on the server
These URLs are called endpoints
For example, a web service for product management may have the following APIs:
HTTP Method   API Endpoint               Description
GET           /products                  Get a list of products
GET           /products/<product_id>     Get a single product
POST          /products                  Create a new product
PUT           /products/<product_id>     Update a product
DELETE        /products/<product_id>     Delete a product
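To make the mapping from (method, URL) to server action concrete, here is a toy in-memory dispatcher for the endpoints in the table. It is only a sketch: there is no real HTTP server, the product data is made up, and all function names are hypothetical. For simplicity, any unmatched request gets a 404.

```python
import re

products = {1: {"name": "pen"}, 2: {"name": "notebook"}}  # made-up sample data


def list_products(body=None):
    return 200, list(products.values())


def get_product(product_id, body=None):
    product = products.get(int(product_id))
    return (200, product) if product else (404, None)


def create_product(body=None):
    new_id = max(products, default=0) + 1
    products[new_id] = body
    return 201, {"id": new_id}


def update_product(product_id, body=None):
    if int(product_id) not in products:
        return 404, None
    products[int(product_id)] = body
    return 200, body


def delete_product(product_id, body=None):
    removed = products.pop(int(product_id), None)
    return (204, None) if removed else (404, None)


# Each route pairs an HTTP method and a URL pattern with a handler.
ROUTES = [
    ("GET", r"/products", list_products),
    ("GET", r"/products/(\d+)", get_product),
    ("POST", r"/products", create_product),
    ("PUT", r"/products/(\d+)", update_product),
    ("DELETE", r"/products/(\d+)", delete_product),
]


def dispatch(method, path, body=None):
    """Return (status_code, payload) for a request."""
    for route_method, pattern, handler in ROUTES:
        match = re.fullmatch(pattern, path)
        if route_method == method and match:
            return handler(*match.groups(), body=body)
    return 404, None


print(dispatch("GET", "/products"))
```

The same URL, /products/<product_id>, leads to three different actions depending on the HTTP method, which is the core idea behind REST endpoints.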
Example: Google Books API
The Google Books API allows clients to access the Google Books repository
A Volume represents information about a book or a magazine
Contains metadata, such as title, authors, publisher
Also includes personalized data, such as whether or not it has been purchased
To get information about volumes, you can use one of the following GET requests
These methods apply to the public data about volumes and do not require authentication
Example: Google Books API
For example, to search for books that contain the phrase "Data Science":
[Link]
Example: Google Books API
You can also try the method directly from that page
Getting Data from URLs
The requests module allows you to fetch data from URLs
requests.get(url) sends an HTTP GET request to the specified URL and returns the HTTP response with all its data (status, headers, content, etc.)
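A sketch of a query against a JSON web API, assuming the requests package is installed. The base URL is a placeholder (api.example.com is not a real service) and the function names are illustrative. The first helper only prepares the request, which shows the final URL without touching the network; the second would actually send it.

```python
import requests


def build_search_request(base_url, query):
    """Prepare (but do not send) a GET request with a ?q=... query string."""
    return requests.Request("GET", base_url, params={"q": query}).prepare()


def search(base_url, query, timeout=10):
    """Send the request and return the decoded JSON body."""
    response = requests.get(base_url, params={"q": query}, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx status codes
    # response.status_code, response.headers and response.text are also available
    return response.json()


prepared = build_search_request("https://api.example.com/volumes", "data science")
print(prepared.method, prepared.url)
```

Passing the query through params lets requests handle URL encoding, such as the space in the search phrase.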
Loading Data From JSON
The json module provides functions for encoding and decoding JSON data
json.load(file) converts a JSON file into a Python object (dictionary or list)
json.loads(s) converts a JSON string into a Python object
Loading Data From JSON
Finally, we can create a DataFrame from the dictionary we obtained from the JSON:
Loading Data From JSON
To flatten the JSON, we can use the pandas function json_normalize():
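A sketch of flattening, assuming pandas 1.0 or later (where json_normalize is available as pd.json_normalize). The nested dictionary is a made-up stand-in shaped like a book search response; its field names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up response: each item nests its details under "volumeInfo".
response = {
    "totalItems": 2,
    "items": [
        {"id": "a1", "volumeInfo": {"title": "Book One",
                                    "authors": ["A. Author"], "pageCount": 100}},
        {"id": "b2", "volumeInfo": {"title": "Book Two",
                                    "authors": ["B. Writer", "C. Editor"], "pageCount": 250}},
    ],
}

# Nested dictionaries become dotted column names; lists are kept as-is.
flat = pd.json_normalize(response["items"])
print(flat.columns.tolist())
print(flat)
```

Without flattening, the DataFrame would have a single volumeInfo column holding whole dictionaries, which is awkward to filter or aggregate.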