Twilio XML Schema Overview

DS 5110 – Lecture 6

Semi-Structured Data

Roi Yehoshua
Agenda
 HTML
 Web scraping
 XML
 JSON
 Web API, RESTful services

2 Roi Yehoshua, 2023


Semi-Structured Data
 Data that doesn’t conform to a data model, but has some structure
 A self-describing structure
 Contains tags that describe the data
 No separation between data and schema
 Examples of semi-structured data:
 HTML pages
 XML
 JSON
 E-mails
 TCP/IP packets


Semi-Structured Data
 Pros
 Flexible schema, can be easily changed
 Data is portable
 Can be used to exchange data between different databases
 Support for nested or hierarchical data
 Cons
 Interpretation of the relationships in the data is more difficult
 Queries are less efficient as compared to the relational model
 Storage cost is higher
 Cannot define constraints on the data


HTML
 HTML is the standard language for creating Web pages
 Consists of a series of elements which tell the browser how to display the content
 An HTML element is defined by a start tag, some content, and an end tag:
<tagname>Content goes here...</tagname>

 Some HTML elements have no content (like the <br> element)


 These elements are called empty elements
 Elements can have attributes that provide additional information about the element
 Attributes are specified in the start tag of the element
 For example, the href attribute of <a> specifies the URL of the page the link goes to:
<a href="[Link] google</a>


HTML Example
<!DOCTYPE html>

<html lang="en" xmlns="[Link]


<head>
<meta charset="utf-8" />
<title>Sample Page</title>
</head>
<body>
<h1 style="color:darkcyan">Header</h1>

<p>Some text</p>

<nav>
<a href="page1">To page1</a><br/>
<a href="page2">To page2</a><br/>
<a href="page3">To page3</a><br/>
</nav>
</body>
</html>


DOM (Document Object Model)
 The Document Object Model (DOM) is a programming interface for web documents
 Used mainly for XML, HTML and SVG documents
 Represents the document as a tree of nodes, known as the DOM tree
 DOM methods allow programmatic access to the tree

<!DOCTYPE html>
<html lang="en">
<head>
<title>My Document</title>
</head>
<body>
<h1>Header</h1>
<p>Paragraph</p>
</body>
</html>


Web Scraping
 Web scraping is a technique of extracting information from websites
 It is used to transform semi-structured data (HTML pages) into structured data (a database or spreadsheet)
 Web scraping is generally considered legal when the data is publicly available on the web
 Be careful scraping personal data, intellectual property or confidential data
 In any case, you should check out the Terms of Service of the website before scraping its data


Web Scraping
 You need two Python libraries for scraping data:
 Requests: used for fetching web pages from URLs
 BeautifulSoup: a library for pulling data out of HTML and XML files
 These packages are included with the Anaconda distribution
 To install them manually you can use pip install:
pip install requests
pip install beautifulsoup4


Web Scraping
 The following example extracts data about US states from Wikipedia
[Link]


Loading the Page Content
 First, we need to retrieve the HTML of the page using the requests library:
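A minimal sketch of this step, assuming the requests package is installed. The helper name fetch_html, its url argument and the timeout value are hypothetical, and the page URL itself is not reproduced here.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) with the address of the page returns the raw HTML string, which can then be handed to a parser.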


Parsing the Page Content
 Beautiful Soup is a Python library for parsing HTML and XML documents
 To parse the HTML content use the following code:

 The second argument, "html.parser", makes sure that you use the appropriate parser for the HTML content
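A minimal sketch of the parsing call, assuming the beautifulsoup4 package is installed. The inline html string is a hypothetical stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for the downloaded HTML
html = """<html><head><title>Sample Page</title></head>
<body><h1>Header</h1><p>Some text</p></body></html>"""

# "html.parser" selects the parser that ships with Python's standard library
soup = BeautifulSoup(html, "html.parser")
print(soup.h1)
```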


Navigating using Tag Names
 The simplest way to navigate the parsed tree is to use the name of the tag you want
 For example, if you want the <title> tag, just say soup.title:

 To get only the text content of the tag use its text attribute:

 You can use the dot multiple times to zoom in on a deeper part of the tree:

 Using a tag name as an attribute will give you only the first tag by that name
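The navigation steps on this slide might look as follows. This is a sketch on a small hypothetical page, assuming beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration
html = """<html><head><title>Sample Page</title></head>
<body><nav><a href="page1">To page1</a><a href="page2">To page2</a></nav></body></html>"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title        # the whole <title> element
title_text = soup.title.text  # only the text inside the tag
first_link = soup.body.nav.a  # chained access returns the first <a> only

print(title_tag)
print(title_text)
print(first_link)
```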


Inspecting Elements
 For easier viewing, you can prettify any Beautiful Soup object when you print it out


Navigation Attributes
 Each tag has a few attributes that allow you to traverse the tree:
 .contents – a list of the tag’s children
 .children – an iterable over the tag’s children
 .descendants – an iterable over all of the tag’s descendants
 .parent – the element’s parent
 .next_sibling – the next sibling of the element on the same level
 .previous_sibling – the previous sibling of the element on the same level
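A short sketch of these attributes on a hypothetical three-item list, assuming beautifulsoup4 is installed. The markup has no whitespace between tags, so no whitespace-only text nodes appear among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between the tags
soup = BeautifulSoup("<ul><li>HTML</li><li>XML</li><li>JSON</li></ul>", "html.parser")
ul = soup.ul

items = ul.contents                                  # list of the three <li> tags
child_names = [child.name for child in ul.children]  # iterate over direct children
descendant_count = len(list(ul.descendants))         # <li> tags plus their text nodes

first = items[0]
parent_name = first.parent.name                      # tag name of the parent element
next_text = first.next_sibling.text                  # text of the following <li>
previous_text = items[1].previous_sibling.text       # text of the preceding <li>
```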


Searching the Tree
 The two most common methods for searching the tree are:
 find(name, attrs) – retrieves the first tag that matches your filters
 find_all(name, attrs) – retrieves all the tags that match your filters
 The name argument searches for only tags with certain names
 Any argument that’s not recognized becomes a filter on one of the tag’s attributes
 For example, let’s find all the link elements in the page with the CSS class 'image':
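A sketch of both methods on a hypothetical fragment, assuming beautifulsoup4 is installed. Because class is a reserved word in Python, Beautiful Soup takes the CSS class filter through the class_ keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links that carry the CSS class "image"
html = """<div>
<a class="image" href="/a.png">A</a>
<a href="/about">About</a>
<a class="image" href="/b.png">B</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                       # first matching tag only
image_links = soup.find_all("a", class_="image")  # every <a> with class "image"
hrefs = [link["href"] for link in image_links]
print(hrefs)
```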


Searching the Tree
 You can also call the find() and find_all() methods on a specific element
 This will search only inside the subtree rooted at that element
 For example, let’s search for links inside the table


Searching for Strings
 With the string argument you can search for strings instead of tags:

 You can also pass a regular expression to it:
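Both forms of string search can be sketched as follows, assuming beautifulsoup4 is installed. The list of names is hypothetical sample markup.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list used only for illustration
html = "<ul><li>Alaska</li><li>New York</li><li>New Jersey</li><li>Texas</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# An exact string matches text nodes whose content equals it
exact = [str(s) for s in soup.find_all(string="Texas")]

# A compiled regular expression matches text nodes it can find a match in
new_states = [str(s) for s in soup.find_all(string=re.compile("^New"))]

print(exact)
print(new_states)
```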


Attributes
 You can access a tag’s attributes by treating the tag like a dictionary:

 You can access the attributes dictionary directly as .attrs:
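A sketch of both access styles on a hypothetical link, assuming beautifulsoup4 is installed. Note that Beautiful Soup returns multi-valued attributes such as class as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical link element
soup = BeautifulSoup('<a href="page1" class="image" title="First page">To page1</a>', "html.parser")
link = soup.a

href = link["href"]         # dictionary-style access; raises KeyError if missing
missing = link.get("id")    # .get() returns None for a missing attribute
all_attrs = link.attrs      # the underlying attribute dictionary
print(all_attrs)
```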


Scraping the Wikipedia Page
 Let’s now extract the information we need from the US states Wikipedia page
 We first locate the states table as the second table in the page:


Scraping the Wikipedia Page
 We need to skip the first two rows of the table that contain the headers


Scraping the Wikipedia Page
 By inspecting the structure of each row we find the location of the data we need
 The name of the state is in the first <th> tag, inside an <a> element
 The population size is in the fifth <td> element
 The area size is in the sixth <td> element
 However, in states where the capital = largest city, the location of the data changes
 The population size is in the fourth <td> element
 The area size is in the fifth <td> element
 We can identify these states by checking if the second <td> element has colspan="2"


Scraping the Wikipedia Page
 The function to get the states data:
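One possible shape for such a function is sketched below, assuming beautifulsoup4 is installed. It follows the layout rules described on the previous slides (second table, two header rows, shifted cells when a cell has colspan="2"), but it runs on a hypothetical miniature table with made-up names and numbers rather than on the real page. The function name get_states_data and the dictionary keys are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table layout; all names and values are made up
html = """
<table><tr><td>some other table</td></tr></table>
<table>
<tr><th>header row 1</th></tr>
<tr><th>header row 2</th></tr>
<tr><th><a href="/wiki/A">Aland</a></th><td>AL</td><td>Cap</td><td>Big</td><td>1800</td><td>1,000</td><td>50.5</td></tr>
<tr><th><a href="/wiki/B">Borea</a></th><td>BO</td><td colspan="2">Only</td><td>1850</td><td>2,500</td><td>75</td></tr>
</table>
"""


def get_states_data(soup):
    """Return a list of {name, population, area} dictionaries."""
    table = soup.find_all("table")[1]   # the states table is the second table
    rows = table.find_all("tr")[2:]     # skip the two header rows
    states = []
    for row in rows:
        name = row.find("th").find("a").text.strip()
        cells = row.find_all("td")
        # If the second <td> spans two columns, later cells move one position left
        shift = 1 if cells[1].get("colspan") == "2" else 0
        population = int(cells[4 - shift].text.replace(",", ""))
        area = float(cells[5 - shift].text.replace(",", ""))
        states.append({"name": name, "population": population, "area": area})
    return states


states = get_states_data(BeautifulSoup(html, "html.parser"))
print(states)
```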


Scraping the Wikipedia Page
 We can now build a DataFrame from this data:
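A sketch of this step, assuming pandas is installed and that the scraped data is a list of dictionaries. The rows and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical scraped rows; names and numbers are made up
states = [
    {"name": "Aland", "population": 1000, "area": 50.0},
    {"name": "Borea", "population": 2500, "area": 100.0},
]

# Each dictionary becomes a row and each key becomes a column
df = pd.DataFrame(states).set_index("name")
df["density"] = df["population"] / df["area"]
print(df)
```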


XML
 XML: Extensible Markup Language
 Unlike HTML:
 Designed to represent data and not UI elements
 Extensible: users can define their own tags
 Mainly used for data exchange between applications

<purchase_order>
  <id>P-101</id>
  <purchaser> … </purchaser>
  <itemlist>
    <item>
      <id>RS1</id>
      <description>Atom rocket sled</description>
      <quantity>2</quantity>
      <price>199.95</price>
    </item>
    <item>
      <id>SG2</id>
      <description>Superb glue</description>
      <quantity>1</quantity>
      <unit-of-measure>liter</unit-of-measure>
      <price>29.95</price>
    </item>
  </itemlist>
  <total_cost>429.85</total_cost>
</purchase_order>


Structure of XML Data
 The building blocks of an XML document are elements
 Element: a section of data beginning with <tag> and ending with a matching </tag>
 An element can contain text, attributes, and other elements
 Elements must be properly nested
 Proper nesting
 <course> … <title> … </title> … </course>
 Improper nesting
 <course> … <title> … </course> </title>
 An empty element <tag></tag> may be abbreviated as <tag/>
 Every document must have a single root element that contains all other elements


Attributes
▪ Elements can have attributes
<course course_id="CS-101">
<title>Intro. to Computer Science</title>
<dept_name>Comp. Sci.</dept_name>
<credits>4</credits>
</course>
▪ Attributes are specified by name=value pairs inside the starting tag of an element
▪ An element may have several attributes, but each attribute can only occur once
<course course_id="CS-101" credits="4">


Attributes vs. Subelements
 In the context of documents, attributes are part of markup, while subelement
contents are part of the basic document contents
 In the context of data representation, the distinction is less relevant
 Same information can be represented in two ways
 <course course_id="CS-101"> … </course>
 <course>
<course_id>CS-101</course_id> …
</course>
 Suggestion: use attributes for identifiers of elements, and store all other data as
subelements


Namespaces
 A namespace allows organizations to specify globally unique names for the elements
 A namespace is defined by an xmlns attribute in the start tag of the root element
 <root xmlns:prefix="URL">…</root>
 Typically, the URL of the organization’s web site is used as the namespace identifier
 The namespace prefix is prepended to the name of each tag or attribute in the document
 prefix:element-name
<university xmlns:yale="[Link]

<yale:course>
<yale:course_id>CS-101</yale:course_id>
<yale:title>Intro. to Computer Science</yale:title>
<yale:dept_name> Comp. Sci.</yale:dept_name>
<yale:credits>4</yale:credits>
</yale:course>

</university>

Comparison with Relational Data
 Inefficient
 Tags, which in effect represent schema information, are repeated
 Redundant storage of data
 e.g., item descriptions may be repeated in multiple purchase orders that ordered the same item
 Better than relational tuples as a data exchange format
 Unlike relational tuples, XML data is self-documenting due to the presence of tags
 Non-rigid format: tags can be added
 XML allows nested structures
 Wide acceptance, not only in database systems, but also in browsers, tools, and applications


XML Document Schema
 Database schemas constrain what information can be stored, and the data types of
stored values
 XML documents are not required to have an associated schema
 However, schemas are very important for XML data exchange
 Otherwise, a site cannot automatically interpret data received from another site
 Two mechanisms for specifying XML schema
 Document Type Definition (DTD)
 An older format
 XML Schema
 Newer format, widely used today


XML Schema
 An XML Schema describes the structure of an XML document
 Schema definitions themselves are specified in XML syntax, using a variety of tags
defined by XML Schema
 These tags are typically prefixed by the namespace xs
 Elements are specified using the <xs:element> tag
 The type of an element can be simple or complex
 XML Schema defines a number of built-in types such as string, integer, decimal and date
 e.g., <xs:element name="dept_name" type="xs:string"/>
 We can use the <xs:complexType> element to create named complex types
 <xs:sequence> defines the complex type as a sequence of elements
 Attributes are specified using the <xs:attribute> tag


XML Document for the University Data
<university>
<department dept_name="Comp. Sci.">
<building>Taylor</building>
<budget>100000</budget>
</department>
<department dept_name="Biology">
<building>Watson</building>
<budget>90000</budget>
</department>
<course course_id="CS-101" dept_name="Comp. Sci.">
<title>Intro. to Computer Science</title>
<credits>4</credits>
</course>
….
<instructor ID="10101" dept_name="Comp. Sci.">
<name>Srinivasan</name>
<salary>65000</salary>
<teaches>
<course course_id="CS-101"/>
….
</teaches>
</instructor>
….
</university>

XML Schema for the University Data
<xs:schema xmlns:xs=“[Link]
  <xs:element name="university" type="UniversityType"/>
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="building" type="xs:string"/>
        <xs:element name="budget" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="dept_name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
  ….
  <xs:element name="instructor">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="salary" type="xs:decimal"/>
        <xs:element name="teaches" type="teachesType"/>
      </xs:sequence>
      <xs:attribute name="ID" type="xs:string"/>
      <xs:attribute name="dept_name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
… Contd.

XML Schema for the University Document
….
  <xs:complexType name="teachesType">
    <xs:sequence>
      <xs:element ref="course" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="UniversityType">
    <xs:sequence>
      <xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="course" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="instructor" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>


Application Program Interfaces to XML
 There are two standard APIs to XML data:
 SAX (Simple API for XML)
 Parses the XML document one bit at a time
 Provides event handlers for parsing events
 e.g., start of element, end of element

 Need to keep track of the program’s position in the document


 DOM (Document Object Model)
 Represents the XML document as a tree structure
 Provides a variety of properties and methods for traversing the DOM tree
 Also provides methods for updating the DOM tree
 Useful for random-access applications
 Supported by many programming languages with slightly different syntaxes
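The event-driven style of SAX can be sketched with Python's standard xml.sax module. The handler class CourseHandler and its attributes are hypothetical names. The handler itself has to remember that it is currently inside a <title> element, which is the position tracking mentioned above.

```python
import xml.sax


class CourseHandler(xml.sax.ContentHandler):
    """Collects course ids and titles while the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.course_ids = []
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "course":
            self.course_ids.append(attrs.getValue("course_id"))
        elif name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


xml_data = b"""<university>
  <course course_id="CS-101" dept_name="Comp. Sci.">
    <title>Intro. to Computer Science</title>
    <credits>4</credits>
  </course>
</university>"""

handler = CourseHandler()
xml.sax.parseString(xml_data, handler)
print(handler.course_ids, handler.titles)
```

With DOM, by contrast, the whole tree is built first and can then be visited in any order.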


XML Processing in Python
 Python’s interfaces for processing XML are grouped in the xml package
 The XML handling submodules are:
 xml.sax: a SAX parser
 xml.dom: the DOM API definition
 xml.dom.minidom: a minimal DOM implementation
 xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
 A more "Pythonic" API compared to the W3C-controlled DOM


MiniDom Example
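A minimal sketch of reading the department data with xml.dom.minidom from the standard library. The XML is an excerpt of the university document shown earlier, and the variable names are assumptions.

```python
from xml.dom import minidom

xml_data = """<university>
  <department dept_name="Comp. Sci.">
    <building>Taylor</building>
    <budget>100000</budget>
  </department>
  <department dept_name="Biology">
    <building>Watson</building>
    <budget>90000</budget>
  </department>
</university>"""

dom = minidom.parseString(xml_data)

departments = []
for dept in dom.getElementsByTagName("department"):
    name = dept.getAttribute("dept_name")
    # The text of an element is stored in a child text node
    building = dept.getElementsByTagName("building")[0].firstChild.data
    departments.append((name, building))

print(departments)
```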


ElementTree Example
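A minimal sketch of the same kind of traversal with xml.etree.ElementTree from the standard library. The XML is an excerpt of the university document shown earlier, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

xml_data = """<university>
  <department dept_name="Comp. Sci.">
    <building>Taylor</building>
    <budget>100000</budget>
  </department>
  <department dept_name="Biology">
    <building>Watson</building>
    <budget>90000</budget>
  </department>
</university>"""

root = ET.fromstring(xml_data)

budgets = {}
for dept in root.findall("department"):
    # Attributes are read with .get(), child text with .findtext()
    budgets[dept.get("dept_name")] = float(dept.findtext("budget"))

total_budget = sum(budgets.values())
print(root.tag, budgets, total_budget)
```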


Querying and Transforming XML Data
 Translation of information from one XML schema to another
 Querying on XML data
 Above two are closely related, and handled by the same tools
 Standard XML querying/translation languages
 XPath
 Simple language consisting of path expressions
 XQuery
 An XML query language with a rich set of features
 XSLT
 Simple language designed for translation from XML to XML and XML to HTML


Tree Model of XML Data
 Query and transformation languages are based on a tree model of XML data


XPath
 XPath is a querying language for selecting nodes from an XML document
 A path expression is used to navigate and select elements from the document
 Consists of a sequence of steps separated by / or //
 / selects a child node (the first / in the path selects the root node)
 // selects all the descendant nodes (including self)
 Examples:
 /bookstore/book selects all the books
 //title selects all the title elements anywhere in the document
 /bookstore/book//title selects all the title elements anywhere under a book element


XPath Predicates
 Predicates written inside [] are used to find specific nodes in the document
 They may follow any step in the path
 Index values in predicates start from 1
 Can use Boolean operators and, or, and a function not()
 A union operator | forms the union of two node sets
 Examples:
 //book[price < 25] selects books with price less than 25
 /bookstore/book[1]/title selects the title of the first book
 //book[author='J.K. Rowling']/title selects titles of books authored by J.K. Rowling
 //book[price] selects books that have a price subelement
 //book[year > 2000 and price < 20] selects books released after 2000 with price less than 20
 //book[price > 2 * discount] selects books whose price is greater than twice their discount
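Several of these predicates can be tried out with lxml, assuming it is installed. The bookstore document below is a hypothetical sample with made-up prices and years, since the file used in the lecture is not part of this text.

```python
from lxml import etree

# Hypothetical bookstore; titles, years and prices are sample values
xml_data = b"""<bookstore>
  <book><title lang="en">Harry Potter</title><author>J.K. Rowling</author><year>2005</year><price>29.99</price></book>
  <book><title lang="en">XML Basics</title><author>B. Author</author><year>2003</year><price>19.95</price></book>
  <book><title>Old Notes</title><author>A. Writer</author><year>1999</year></book>
</bookstore>"""
root = etree.fromstring(xml_data)

cheap = root.xpath("//book[price < 25]/title/text()")
first = root.xpath("/bookstore/book[1]/title/text()")
by_author = root.xpath("//book[author='J.K. Rowling']/title/text()")
priced = root.xpath("//book[price]")  # books that have a price subelement
recent_cheap = root.xpath("//book[year > 2000 and price < 20]/title/text()")

print(cheap, first, by_author, len(priced), recent_cheap)
```

The third book has no price subelement, so comparisons on price never select it.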


XPath Attributes
 Attributes are accessed using @

 Examples:
 /bookstore/book[1]/title/@lang selects the lang attribute of the first book's title
 //title[@lang='en'] selects title nodes that have an attribute lang with a value of 'en'
 //title[@lang] selects title nodes that have an attribute lang
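Attribute selection can be sketched the same way with lxml, assuming it is installed. The document is a hypothetical sample in which one title has no lang attribute.

```python
from lxml import etree

# Hypothetical sample; the third title deliberately has no lang attribute
xml_data = b"""<bookstore>
  <book><title lang="en">Book A</title></book>
  <book><title lang="fr">Livre B</title></book>
  <book><title>Book C</title></book>
</bookstore>"""
root = etree.fromstring(xml_data)

first_lang = root.xpath("/bookstore/book[1]/title/@lang")  # attribute values as strings
english = [t.text for t in root.xpath("//title[@lang='en']")]
with_lang = [t.text for t in root.xpath("//title[@lang]")]

print(first_lang, english, with_lang)
```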


XPath Functions
 XPath offers a variety of functions to filter your selections:
 Number functions: count(), sum(), round(), …
 String functions: concat(), contains(), starts-with(), substring(), …
 Boolean functions: not(), true(), false(), …
 Functions to get properties of nodes: name(), text(), position()

 Examples:
 //book/title/text() gets the titles of the books (without the enclosing <title> tags)
 count(//book) returns the number of books
 //book[contains(title, 'Harry')] selects books whose title contains 'Harry'
 //book[not(contains(title, 'Harry'))] selects books which don’t have 'Harry' in the title
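The function examples can be checked with lxml, assuming it is installed. The bookstore below is a hypothetical sample. Note that lxml returns the result of count() as a Python float.

```python
from lxml import etree

# Hypothetical sample document
xml_data = b"""<bookstore>
  <book><title>Harry Potter</title></book>
  <book><title>XML Basics</title></book>
  <book><title>Old Notes</title></book>
</bookstore>"""
root = etree.fromstring(xml_data)

titles = root.xpath("//book/title/text()")
book_count = root.xpath("count(//book)")  # numeric results come back as float
with_harry = root.xpath("//book[contains(title, 'Harry')]/title/text()")
without_harry = root.xpath("//book[not(contains(title, 'Harry'))]/title/text()")

print(titles, book_count, with_harry, without_harry)
```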


Class Exercise
 Write XPath expressions to find the following nodes:
 Select the language of books whose price is greater than 20
 Select the title of the books that have more than one author


XPath in Python
 To run an XPath query in Python, you can use the lxml library
 pip install lxml
from lxml import etree

# Parse the XML file
root = etree.parse('[Link]')

# Run the XPath query
results = root.xpath('//book[price > 20]/title/text()')

# Print the results
for result in results:
    print(result)


XML Applications
 Storing data with complex structure
 e.g., user preferences, configuration files
 Storing documents and spreadsheet data
 e.g., Open Document Format (ODF) for storing Open Office documents is based on XML
 Numerous other standards for a variety of applications
 e.g., ChemML, MathML
 Exchanging data between different parts of the application
 Standard for data exchange for web services
 Remote method invocation over HTTP protocol
 XML is used to represent method input and output
 Data mediation
 Common data representation format to bridge different systems

JSON
 JavaScript Object Notation
 Textual representation widely used for data exchange
 Lightweight compared to XML
 Simple to parse, since its structures map directly to native objects and arrays
 Supported by many programming languages


JSON Syntax
 JSON closely resembles the syntax of JavaScript object literals
 JSON is built on two structures:
 Objects (a collection of key/value pairs)
 Arrays (an ordered list of values)
 Supported primitive types
 Numbers
 Strings
 Booleans
 null
 Property names (keys) must be strings
 Allows only double-quoted strings


Processing JSON in Python
 The json package provides functions for encoding and decoding JSON data
 Main functions:
Function               Description
json.dump(obj, file)   Serialize obj as a JSON-formatted stream to file
json.dumps(obj)        Serialize obj to a JSON string
json.load(file)        Deserialize a file containing a JSON document to a Python object
json.loads(s)          Deserialize string s to a Python object

 Objects in the JSON document are converted into Python dictionaries


 Arrays in the JSON document are converted into Python lists
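The four functions can be sketched in one round trip. An in-memory io.StringIO buffer stands in for a file here, which is an assumption made to keep the example self-contained.

```python
import io
import json

dept = {"dept_name": "Biology", "building": "Watson", "budget": 90000}

# String round trip: dumps() serializes, loads() deserializes
as_text = json.dumps(dept)
from_text = json.loads(as_text)

# File round trip: dump() writes to a file-like object, load() reads from one
buffer = io.StringIO()
json.dump(dept, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(as_text)
```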


JSON Document for the University Data
{
"departments": [
{
"dept_name": "Comp. Sci.",
"building": "Taylor",
"budget": 100000
},
{
"dept_name": "Biology",
"building": "Watson",
"budget": 90000
},
...
],
"courses": [
{
"course_id": "CS-101",
"dept_name": "Comp. Sci.",
"title": "Intro. to Computer Science",
"credits": 4
},
...
]


JSON Document for the University Data
"instructors": [
{
"ID": "10101",
"dept_name": "Comp. Sci.",
"name": "Srinivasan",
"salary": 65000,
"teaches": ["CS-101", "CS-315", "CS-347"]
},
{
"ID": "83821",
"dept_name": "Comp. Sci.",
"name": "Brandt",
"salary": 92000,
"teaches": ["CS-190", "CS-319"]
},
...
]
}


Reading the Document in Python
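A minimal sketch of reading the university document, using an inline string with an excerpt of the data instead of a file (the file name used in the lecture is not given in this text).

```python
import json

# Excerpt of the university document from the previous slides
document = """{
  "departments": [
    {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"dept_name": "Biology", "building": "Watson", "budget": 90000}
  ],
  "instructors": [
    {"ID": "10101", "dept_name": "Comp. Sci.", "name": "Srinivasan",
     "salary": 65000, "teaches": ["CS-101", "CS-315", "CS-347"]},
    {"ID": "83821", "dept_name": "Comp. Sci.", "name": "Brandt",
     "salary": 92000, "teaches": ["CS-190", "CS-319"]}
  ]
}"""

university = json.loads(document)

# JSON objects are now dictionaries and JSON arrays are lists
teaching = {inst["name"]: inst["teaches"] for inst in university["instructors"]}
for name, courses in teaching.items():
    print(name, "teaches", ", ".join(courses))
```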


Loading JSON into a DataFrame
 You can pass a JSON object (dictionary) directly to the DataFrame constructor
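For instance, the departments array of the university document maps naturally to rows. A sketch assuming pandas is installed:

```python
import pandas as pd

# The "departments" array after decoding: a list of dictionaries
departments = [
    {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"dept_name": "Biology", "building": "Watson", "budget": 90000},
]

# Keys become column names, each dictionary becomes one row
df = pd.DataFrame(departments)
print(df)
```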


Web APIs
 Web APIs are services provided by websites that allow clients to query their content
 These services can be accessed from different platforms
 e.g., web pages, desktop/mobile applications
 Most web APIs support both XML and JSON formats

 To use the API, you need to make an HTTP request for a specific URL
 They usually require API keys
 These protect the API vendor from malicious use of the service
 You must apply to get a key, and include it in your code to access the API functionality


Web APIs
 Common web APIs
 Google's suite of APIs enables you to communicate with various Google services
 e.g., Google Search, Google Translate, Google Maps, Gmail, etc.
 Facebook suite of APIs enables you to use various parts of the Facebook ecosystem
 e.g., providing app login using Facebook login, accepting in-app payments, etc.
 Twitter API allows you to embed Twitter data on your site, e.g., your latest tweets
 Map APIs like MapQuest and Google Maps API allow you to do things with maps
 Telegram APIs allow you to embed content from Telegram channels on your site
 YouTube API allows you to embed YouTube videos on your site, search YouTube, etc.
 Pinterest API provides tools to manage Pinterest boards and pins
 Twilio API provides frameworks for building voice and video call functionality


REST Architecture
 One of the most popular ways to build server APIs is the REST architectural style
 REST stands for representational state transfer
 Defines an architectural pattern for communication between client and server
 Defines the following architectural constraints:
 Uniform interface – the server provides a uniform interface for accessing resources
 Client-server – the client and the server must be decoupled from each other
 Stateless – the server won’t maintain any state between requests from the client
 Cacheable – the data retrieved from the server should be cacheable by the client or the server
 Layered system – the client may access the server resources indirectly through other layers
such as a proxy or load balancer
 Code on demand (optional) – the server may transfer code to the client that it can run


RESTful Web Services
 Web services that follow the REST style are known as RESTful web services
 These web services expose their data through public URLs
 e.g., the URL for the GitHub REST API is [Link]
 You access the data by sending an HTTP request to that URL


API Endpoints
 A REST API exposes a set of public URLs that map to different actions on the server
 These URLs are called endpoints
 For example, a web service for product management may have the following APIs:
HTTP Method   API Endpoint             Description
GET           /products                Get a list of products
GET           /products/<product_id>   Get a single product
POST          /products                Create a new product
PUT           /products/<product_id>   Update a product
DELETE        /products/<product_id>   Delete a product


Example: Google Books API
 The Google Books API allows clients to access the Google Books repository
 A Volume represents information about a book or a magazine
 Contains metadata, such as title, authors, publisher
 Also includes personalized data, such as whether or not it has been purchased
 To get information about volumes, you can use one of the following GET requests

 These methods apply to the public data about volumes and do not require authentication


Example: Google Books API
 For example, to search for books that contain the phrase Data Science:
 [Link]
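A search URL of this kind can be assembled with the standard library. The sketch below assumes the public volumes endpoint at https://www.googleapis.com/books/v1/volumes with a q query parameter, because the link on the slide is not reproduced in this text. The helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Google Books volumes search (not taken from the slide)
BASE_URL = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(query: str) -> str:
    """Return the search URL for a free-text query (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the query string
    return BASE_URL + "?" + urlencode({"q": query})


url = build_search_url("Data Science")
print(url)
```

The resulting URL can then be passed to an HTTP client to retrieve the JSON response.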


Example: Google Books API
 You can also try the method directly from that page


Getting Data from URLs
 The requests module allows you to fetch data from URLs
 requests.get(url) sends an HTTP GET request to the specified URL and returns an HTTP response object with all the data (status, headers, content, etc.)


Loading Data From JSON
 The json module provides functions for encoding and decoding JSON data
 json.load(file) converts a JSON file into a Python object (dictionary or list)
 json.loads(s) converts a JSON string into a Python object


Loading Data From JSON
 Finally, we can create a DataFrame from the dictionary we obtained from the JSON:


Loading Data From JSON
 To flatten nested JSON, we can use the pandas function json_normalize():
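A sketch of the flattening step, assuming pandas 1.0 or later, where the function is available as pd.json_normalize. The records are hypothetical and only mimic the nested shape of an API response.

```python
import pandas as pd

# Hypothetical nested records; ids, titles and page counts are made up
volumes = [
    {"id": "v1", "volumeInfo": {"title": "Book A", "pageCount": 100}},
    {"id": "v2", "volumeInfo": {"title": "Book B", "pageCount": 250}},
]

# Nested keys are expanded into dotted column names such as "volumeInfo.title"
df = pd.json_normalize(volumes)
print(df)
```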
