Open XML Deep Dive
Doug Mahugh
Technical Evangelist, Microsoft
[Link]
Satisfy Your Technical Curiosity
Application type: Document Assembly
Server environment: Linux, Java, Apache, MySql
Desktop environment: Office 2007
Session Objectives
Satisfy your curiosity about Open XML:
Architecture
The three main Open XML schemas
Development options
Custom XML support
Development scenarios
Today is the tip of the iceberg
Comprehensive 2-day Open XML Developer
workshop scheduled for Belgium on May 21
Contact Imma Verheyen, Partner Development
Manager: immav@[Link]
Diverse Environments
All you need is ZIP and XML support
Linux Java Microsoft COM
.NET Framework 3.0
Minizip
J2SE [Link] *
ZIP Library [Link]
Xceed ActiveX controls
zLib
Xceed .NET controls
.NET Framework 3.0
XML Library Apache Xerces JAXP
[Link]
MSXML
* Also includes abstractions for OPC concepts
(Open Packaging Convention)
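The "ZIP and XML" claim can be illustrated with nothing more than a standard library. The sketch below uses Python purely for illustration (the slide lists other platforms). It builds a tiny in-memory package and reads the text back. The helper names are hypothetical, the sample is not a complete document package, and the part name `word/document.xml` is hard-coded only for brevity; a robust consumer would locate parts through relationships.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_package():
    """Return the bytes of a tiny ZIP holding one WordprocessingML part.

    Hypothetical sample for illustration; it omits content types and
    relationships, so it is not a complete .docx file.
    """
    document_xml = (
        f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
        "<w:t>HELLO!</w:t></w:r></w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def read_text(package, part_name="word/document.xml"):
    """Unzip one part and collect the content of every w:t element."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(part_name))
    return [t.text or "" for t in root.iter(f"{{{W}}}t")]


print(read_text(build_sample_package()))
```

The same two steps (open the ZIP, parse the XML) apply on any of the platforms listed above.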
Development Scenarios
Document Assembly
Server-based or user-assisted construction of documents from archived content or database content.
Example: Create sales reports from financial and forecast data stored in a CRM system.
Integration & Content Reuse
Much easier to move content between documents, including different document types.
Example: Quickly and efficiently apply content stored in Word documents to Web pages.
Document Sanitization
Remove unwanted content like comments, embedded code or potentially sensitive items from your document when appropriate.
Example: Remove all tracked changes and comments from a Word document before it is published.
Document Interrogation
Query document repositories based on custom data, content types or document metadata.
Example: Search for all documents containing a specific company name or sales contact.
Content Tagging
Adding a tagging schema to content can dramatically improve content searches and the value of the data stored in documents.
Example: Organizations can create their own smart tags then use them as the basis for searches.
Document Archival
Ensuring document formats can be consumed long into the future without vendor-specific clients or applications.
Example: XML-based document archives include the data and presentation information.
XML in Office: the last 10 years
Office 97: existing binary file formats (designed in 1994, launched in Office 97)
Office 2000: Early Innovation (XML Document Properties)
Office XP: First XML Formats (Spreadsheet XML)
Office 2003: Breakthrough XML Support (WordProcessingML, SpreadsheetML, custom-defined schema)
2007 Office system: New XML-based Formats (XML file format is the default, XML PowerPoint format)
Open XML Architecture
Markup Languages: WordprocessingML, SpreadsheetML, PresentationML
Shared Vocabularies: DrawingML, Custom XML, Bibliography, VML (legacy), Metadata, Equations
Open Packaging Convention: Relationships, Content Types, Digital Signatures
Core Technologies: ZIP, XML + Unicode
Open Packaging Convention
Low-level conventions that define the structure of
an Office Open XML document
Also used by XPS, and some third-party
implementations are under development
Key concepts: package, parts, relationships, and
content types
Parts
Stored inside the package in a specific location
Reachable via a URI
Associated with a specific content type
Often XML, but can be of any defined content type (including custom types)
Content Types
Every part must have a content type
Most OXML parts are content type XML
Consumers support a specific set of content
types
You can define custom content types, and
consumers will preserve them – this is a key
area of opportunity for developer innovation
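As a rough illustration of how a consumer might determine a part's content type, the sketch below reads a content-types stream and applies Override entries (by part name) before Default entries (by extension). The sample stream and the function name `content_type_of` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the package content-types stream
CT = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hypothetical sample of a content-types stream
SAMPLE = f"""<Types xmlns="{CT}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_of(types_xml, part_name):
    """Return the content type for part_name, or None if none is declared."""
    root = ET.fromstring(types_xml)
    # An Override for the exact part name wins (compared case-insensitively).
    for override in root.findall(f"{{{CT}}}Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to the Default registered for the extension.
    ext = part_name.rsplit(".", 1)[-1].lower() if "." in part_name else ""
    for default in root.findall(f"{{{CT}}}Default"):
        if default.get("Extension", "").lower() == ext:
            return default.get("ContentType")
    return None


print(content_type_of(SAMPLE, "/word/document.xml"))
print(content_type_of(SAMPLE, "/customXml/item1.xml"))
```

A part with a custom content type would be declared the same way, with its own Override or Default entry.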
Relationships
Tie elements inside the package to each other
Allow you to step through the document without
parsing parts
Are required: a part without a relationship is not
part of the package, and may be discarded
OPC is a Logical Structure
Files and folders – NO! These details may vary.
Parts should be referenced by their relationship type.
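To make the point concrete, here is a minimal sketch that finds the main document part by its relationship type rather than by a fixed path. The sample package deliberately stores the main part in an unusual location; the helper names and that location are hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that identifies the main document part
OFFICE_DOC = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def build_sample_package(main_part="custom/location/main.xml"):
    """Build an in-memory package whose main part lives at main_part."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC}" Target="{main_part}"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr(main_part, "<doc/>")
    return buf.getvalue()


def find_main_part(package):
    """Follow the package-level relationships to the main document part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOC and rel.get("TargetMode") != "External":
            # Package-level targets resolve against the package root.
            return posixpath.normpath(rel.get("Target").lstrip("/"))
    return None


print(find_main_part(build_sample_package()))
```

Because the lookup keys on the relationship type, the code keeps working if a producer chooses different folder or file names.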
Types of Interoperability
Reference Schemas: display-oriented, enable technical interoperability
Custom-defined Schemas: data-oriented, enable semantic interoperability
Brian Jones, ODC2006
WordprocessingML
Document architecture
Document parts: properties, body, comments, images, footnotes/endnotes, numberingDefinitions, headers/footers, styles, fontTable, customXML
Paragraphs, Runs and Text
How text is stored in wordprocessingML
The document element
• Contains a body element
• Contains paragraphs
• Contains runs
• Contains text elements
<document>
<body>
<p>
<r>
<t>HELLO!</t>
</r>
</p>
</body>
</document>
Direct Formatting Example
Simple formatting at paragraph/run levels:
<w:p>
<!-- Paragraph properties specify bold (default for the entire paragraph) -->
<w:pPr>
<w:b/>
</w:pPr>
<w:r>
<w:t>The quick</w:t>
</w:r>
<!-- Run properties specify italics (override for this run) -->
<w:r>
<w:rPr>
<w:i/>
</w:rPr>
<w:t>brown</w:t>
</w:r>
<w:r>
<w:t>fox.</w:t>
</w:r>
</w:p>
Paragraph Properties
Can be set directly or in a paragraph style
24 total property settings
<w:p>
<w:pPr>
<w:widowControl w:val="on" />
<w:keepNext/>
<w:keepLines/>
<w:pageBreakBefore/>
<w:suppressLineNumbers />
<w:suppressAutoHyphens />
<w:textBoxTightWrap />
</w:pPr>
… runs, paragraph content …
</w:p>
Run Properties
Define formatting for
individual characters
Font attributes, size/position,
other settings
24 total properties
<w:r>
<w:rPr>
<w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial" />
<w:b/>
<w:i/>
<w:sz w:val="11" />
<w:dstrike w:val="true" />
</w:rPr>
… run content …
</w:r>
Text <w:t>
The only element in the main story that can
contain text – all other text is in attributes
Three other types of text are allowed in runs:
Deleted text <w:delText>
Field code <w:instrText>
Deleted field codes <w:delInstrText>
By looking only at <w:t> nodes, you can be sure
you’re seeing only displayed text
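A short sketch of that idea: collect only `<w:t>` content and the deleted text drops out automatically. The sample paragraph, including the tracked deletion and its attribute values, is a hypothetical fragment written for this example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph containing one tracked deletion
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>The quick </w:t></w:r>
  <w:del w:id="1" w:author="x">
    <w:r><w:delText>red </w:delText></w:r>
  </w:del>
  <w:r><w:t>brown fox.</w:t></w:r>
</w:p>"""


def displayed_text(xml):
    """Join the content of every w:t element; w:delText is never visited."""
    root = ET.fromstring(xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))


print(displayed_text(SAMPLE))
```

Field codes in `<w:instrText>` are skipped for the same reason: they live in a different element.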
Revision IDs (RSIDs)
RSID values are used to identify a set of
changes that were made during the same
editing session
Found in many elements:
Paragraphs, runs, sections, styles
Table rows, table properties, charts, diagrams
Allows for merging revisions, without the
privacy and security issues involved in tracking
who changed what
Optional, but recommended for applications
that modify existing documents
Images
An image is a w:pict element inside a run <w:r>
The v:imagedata element is defined in VML:
xmlns:v="urn:schemas-microsoft-com:vml"
The actual image is referenced via a relationship:
<w:pict>
<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:250; height:200">
<v:imagedata r:id="rId4"/>
</v:shape>
</w:pict>
The relationship points to an image part in the package:
<Relationship Id="rId4"
Type="[Link]
Target="[Link]"/>
Tables
Tables are a set of paragraphs which are
arranged into rows and columns
In WordprocessingML, tables are block level
content, and are specified using the table
element
Analogous to the HTML <table> element
What’s in a table?
<w:tbl>
<!-- Properties -->
<w:tblPr>
<w:tblStyle w:val="TableGrid"/>
<w:tblW w:w="0" w:type="auto"/>
<w:tblLook w:val="01E0"/>
</w:tblPr>
<!-- Grid -->
<w:tblGrid>
<w:gridCol w:w="2952"/>
<w:gridCol w:w="2952"/>
<w:gridCol w:w="2952"/>
</w:tblGrid>
<!-- Rows -->
<w:tr>
<!-- Cells -->
<w:tc>
<w:tcPr>
<w:tcW w:w="2952" w:type="dxa"/>
</w:tcPr>
<w:p>
<w:r>
<w:t>1,1</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="2952" w:type="dxa"/>
</w:tcPr>
<w:p>
<w:r>
<w:t>1,2</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>
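Since a table is just rows of cells that contain paragraphs, pulling its text out takes only a few lines. The sketch below is one possible approach, not a complete table reader: it ignores merged cells and nested tables, and the function name `table_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed version of the table shown above
SAMPLE = f"""<w:tbl xmlns:w="{W}">
  <w:tblGrid><w:gridCol w:w="2952"/><w:gridCol w:w="2952"/></w:tblGrid>
  <w:tr>
    <w:tc><w:p><w:r><w:t>1,1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>1,2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""


def table_rows(tbl_xml):
    """Return the table as a list of rows, each a list of cell strings."""
    tbl = ET.fromstring(tbl_xml)
    rows = []
    for tr in tbl.findall(f"{{{W}}}tr"):
        cells = []
        for tc in tr.findall(f"{{{W}}}tc"):
            # A cell holds one or more paragraphs; join them with newlines.
            paragraphs = [
                "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                for p in tc.findall(f"{{{W}}}p")
            ]
            cells.append("\n".join(paragraphs))
        rows.append(cells)
    return rows


print(table_rows(SAMPLE))
```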
Styles
A style defines a specific set of values for formatting properties that may be applied as a single logical unit
For example, the Normal style in Word 2007 defines these formatting properties:
Font = Calibri (body)
Font Size = 11 point
Font Language = Word default (as configured by user)
Justification = Left
Line Spacing = Single
Widow/Orphan control
Style Types
WordprocessingML supports six style types:
Paragraph styles
Character styles
Linked styles
Table styles
List styles
Default style (linked type, but applies when no style
specified)
Paragraph Styles Example
Step 1: define a paragraph style
Styles are defined in the style part:
<w:style w:type="paragraph" w:styleId="TestParagraphStyle">
<!-- Common properties -->
<w:name w:val="Test Paragraph Style"/>
<w:qFormat/>
<w:rsid w:val="009E253E"/>
<!-- Paragraph properties -->
<w:pPr>
<w:pStyle w:val="TestParagraphStyle"/>
<w:spacing w:line="480" w:lineRule="auto"/>
<w:ind w:firstLine="1440"/>
</w:pPr>
<!-- Character (run) properties -->
<w:rPr>
<w:rFonts w:ascii="Algerian" w:hAnsi="Algerian"/>
<w:b/>
<w:color w:val="ED1C24"/>
<w:sz w:val="40"/>
</w:rPr>
</w:style>
Paragraph Styles Example
Step 2: apply the style to a paragraph
The pStyle element associates a style with a
paragraph:
<w:p>
<w:pPr>
<w:pStyle w:val="TestParagraphStyle"/>
</w:pPr>
<w:r>
<w:t>Text</w:t>
</w:r>
</w:p>
The paragraph is displayed with the style applied:
Numbering Styles
Flexible hierarchical definition
Numbering styles are styles which define the
structure of a multi-level numbering format
Numbering definition instances are based on an
abstract numbering definition
Abstract numbering definitions define paragraph
properties for up to 9 hierarchical levels
NOTE: items in a list are simply paragraphs. There
is no list “container” as in HTML.
Table Styles
A table style is associated with a table via the tblStyle
element in the table properties:
<w:tbl>
<w:tblPr>
<!-- Table style Style20 is applied to the table -->
<w:tblStyle w:val="Style20"/>
<w:tblW w:w="5000" w:type="pct"/>
<w:tblLook w:val="0220"/>
</w:tblPr>
… tblGrid, table rows and cells …
</w:tbl>
Style Application Hierarchy
Direct formatting overrides style settings
Document Defaults
Table
Numbering
Paragraph
Character
Direct Formatting
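The layering above can be modelled as a chain of lookups in which later layers win. The sketch below is a deliberate simplification: it treats each layer as a plain dictionary with hypothetical property names, and it does not model the additional rules the specification defines for some properties.

```python
from collections import ChainMap


def effective_formatting(doc_defaults, table=None, numbering=None,
                         paragraph=None, character=None, direct=None):
    """Resolve properties, letting later layers override earlier ones.

    ChainMap searches its first mapping first, so the layers are listed
    from highest precedence (direct formatting) to lowest (defaults).
    """
    layers = [direct, character, paragraph, numbering, table, doc_defaults]
    return dict(ChainMap(*[layer or {} for layer in layers]))


result = effective_formatting(
    doc_defaults={"font": "Calibri", "size_pt": 11, "bold": False},
    paragraph={"font": "Algerian", "bold": True},
    direct={"bold": False, "italic": True},
)
print(result)
```

Here the paragraph style changes the font, and direct formatting turns bold back off and adds italics, while the size still comes from the document defaults.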
Subdocuments
Mechanism for “rolling up” documents
Subdocuments are well-formed Open XML
documents and can be edited independently
Subdocuments don’t know they’re part of
something bigger – they’re just stand-alone
documents
Subdocuments
Implementation details
Main document part contains subDoc elements that indicate where to
insert subdocuments
The subdocument’s location is stored in a relationship
Main document part:
<w:body>
<w:subDoc r:id="rId1"/>
<w:subDoc r:id="rId2"/>
<w:subDoc r:id="rId3"/>
</w:body>
Relationships:
<Relationship Id="rId1" Type="…/subDocument" Target="[Link]" TargetMode="External"/>
<Relationship Id="rId2" Type="…/subDocument" Target="[Link]" TargetMode="External"/>
<Relationship Id="rId3" Type="…/subDocument" Target="[Link]" TargetMode="External"/>
Document Sections
A document may be divided into sections
Allows formatting at a higher level than
paragraphs:
Landscape/portrait orientation
Page margins, etc.
Section properties are defined in sectPr:
<w:sectPr>
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"
w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
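The page size and margin values are expressed in twentieths of a point, so 1440 units make one inch. The sketch below converts the values from the example to inches; the function name `page_setup_inches` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_INCH = 1440  # 20 twips per point, 72 points per inch

SECT_PR = f"""<w:sectPr xmlns:w="{W}">
  <w:pgSz w:w="12240" w:h="15840"/>
  <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"
           w:header="720" w:footer="720" w:gutter="0"/>
</w:sectPr>"""


def page_setup_inches(sect_pr_xml):
    """Return page size and margins from a sectPr element, in inches."""
    sect = ET.fromstring(sect_pr_xml)
    size = sect.find(f"{{{W}}}pgSz")
    margins = sect.find(f"{{{W}}}pgMar")

    def inches(element, name):
        # Attributes are namespace-qualified, e.g. w:w and w:top.
        return int(element.get(f"{{{W}}}{name}")) / TWIPS_PER_INCH

    return {
        "width": inches(size, "w"),
        "height": inches(size, "h"),
        "top": inches(margins, "top"),
        "right": inches(margins, "right"),
        "bottom": inches(margins, "bottom"),
        "left": inches(margins, "left"),
    }


print(page_setup_inches(SECT_PR))
```

For the values above this works out to a page of 8.5 by 11 inches with 1.25 inch left and right margins.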
Section Properties
Example
In Word, section properties are
specified in the Page Setup dialog
<w:sectPr>
<w:pgSz w:w="12240" w:h="15840" />
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"
w:header="720" w:footer="720" w:gutter="0" />
<w:cols w:space="720" />
<w:docGrid w:linePitch="360" />
</w:sectPr>
Custom XML Support
Merging the worlds of documents and data
Why Custom XML?
Enables semantic interoperability
Documents can provide a rich view of back-end data
Documents can update back-end data sources
Exposes business data within documents to
heterogeneous systems
Business-specific semantics can be applied to
document data
Separates presentation and data
Custom XML schema support was a key design
objective for Open XML: any schema can be used
in Open XML documents.
Custom XML
Developer options for custom XML support
Custom-defined XML is stored in its own discrete part
Any XML can be stored, with or without a schema
Only one requirement: must be well-formed XML
External applications (client/server) can process the store or populate the store
[Diagram labels: Document Template, Visual document parts, XML data, External System]
Custom XML Properties
Information about a custom XML part is stored
in a custom XML properties part
Stored via an implicit customXmlProps
relationship from the custom XML part
Contains two types of information:
Part ID
Uniquely identifies a part within a document
Maintained through editing sessions
XML Schema references
Structured Document Tags
Known as "content controls" in MS-Office
Smart tags and custom XML markup add semantics,
but do not have any effect on presentation
Sometimes you want to affect presentation
Data-entry restrictions, multi-select, etc.
Solution: the structured document tag <sdt>
Types of Content Controls
Plain text
Combobox
Dropdown list
Document building block
Date picker
Rich text
Picture
Data Binding
2-way synchronization between:
Content controls (structured document tags)
Custom XML nodes (data in your schema)
Data Binding Basics
How to bind XML nodes to structured document tags
Add a <dataBinding> element to the structured
document tag properties <sdtPr>
<dataBinding> specifies a custom XML part (by Custom
XML Data Identifier) and an XPath to a specific node
within that part
Custom XML Data Identifier? What’s that?
The custom XML part has a properties part
Implicit relationship in [Link]
The properties part specifies a Custom XML Data Identifier
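The lookup described above can be sketched as follows: read the store item ID and XPath from the binding, find the custom XML part whose properties part carries the same ID, then evaluate the path. Everything in the sample (the GUID, the invoice data, the function name `resolve_binding`) is hypothetical, and the path handling is limited to simple un-prefixed steps because it relies on ElementTree's XPath subset.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DS = "http://schemas.openxmlformats.org/officeDocument/2006/customXml"

STORE_ID = "{00000000-0000-0000-0000-000000000001}"  # hypothetical identifier

# Structured document tag with a data binding (hypothetical sample)
SDT = f"""<w:sdt xmlns:w="{W}"><w:sdtPr>
  <w:dataBinding w:xpath="/invoice/customer/name" w:storeItemID="{STORE_ID}"/>
</w:sdtPr><w:sdtContent/></w:sdt>"""

# Custom XML properties part and the custom XML part it describes
PROPS = f'<ds:datastoreItem xmlns:ds="{DS}" ds:itemID="{STORE_ID}"/>'
DATA = "<invoice><customer><name>Contoso</name></customer></invoice>"

custom_parts = [(PROPS, DATA)]


def resolve_binding(sdt_xml, parts):
    """Return the text of the bound node, or None if it cannot be found."""
    binding = ET.fromstring(sdt_xml).find(f".//{{{W}}}dataBinding")
    if binding is None:
        return None
    store_id = binding.get(f"{{{W}}}storeItemID")
    xpath = binding.get(f"{{{W}}}xpath")
    for props_xml, data_xml in parts:
        if ET.fromstring(props_xml).get(f"{{{DS}}}itemID") != store_id:
            continue
        root = ET.fromstring(data_xml)
        # ElementTree paths are relative, so match the root step by hand.
        steps = xpath.strip("/").split("/")
        if steps[0] != root.tag:
            return None
        node = root.find("/".join(steps[1:])) if len(steps) > 1 else root
        return None if node is None else node.text
    return None


print(resolve_binding(SDT, custom_parts))
```

A fuller implementation would also honor namespace prefix mappings and positional predicates in the XPath.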
Content Control Toolkit
Open-source developer tool
[Link]
iki/[Link]?ProjectName=d
be
Automatically generates
parts, relationships, and
markup to bind custom XML
parts to content controls
Custom XML Markup
Tagging document content with custom semantics
Allows embedding the structure from any XML schema into a WordprocessingML
document
Schema not required
XML doesn’t have to validate against your schema
Custom XML elements may have custom attributes
Consumers/producers preserve your attributes
Custom XML Markup
Example
XML Mapping in SpreadsheetML
XML elements and attributes may be mapped
to cells and tables
Store a copy of the schema in the workbook
Data is in an external XML file
SpreadsheetML
Document architecture
Workbook parts: properties, styles, sharedStrings, calcChain, sheet1..N
Parts associated with sheets: table, chart, drawing
SpreadsheetML
Performance optimizations
SpreadsheetML has been optimized based on
analysis of typical spreadsheet usage patterns:
Small tag size (often a single character)
Shared strings
Shared formulas
Sparse table markup allowed
Optional r="A1" attribute for faster loading
SpreadsheetML Strings
Two alternatives for storing text strings
1. Inline strings
• Provided for ease of translation/conversion
• Useful in XSLT scenarios
• Excel and other consumers may convert to shared
strings on document save
2. An entry in the shared-strings table
• May be either a simple string or formatted text
These approaches may be mixed/combined
Shared Strings
Repetitive strings are common in typical spreadsheets
Strings are stored in a shared-strings part:
Each unique string is stored once
Cells store the index (0-based) of the string
Benefits:
Users: reduced file size, improved performance
Developers: all strings are in one part, simplifying
search, localization, and other common string-handling
tasks
Shared Strings
Sample shared-strings table
6 string references, 4 unique strings
<sst xmlns="..." count="6" uniqueCount="4">
<si>
<t>Paris</t>
</si>
<si>
<t>Seattle</t>
</si>
<si>
<t>London</t>
</si>
<si>
<t>Copenhagen</t>
</si>
</sst>
Paris = string 0, so a cell refers to it by index:
<row r="1" spans="1:1">
<c r="A1" t="s">
<v>0</v>
</c>
</row>
Inline Strings
No shared-strings part required
Especially useful in XSLT scenarios
If you’re consuming Open XML documents, you must
handle both cases: inline strings and/or shared strings
Excel 2007 converts to shared strings on save
<sheetData>
<row><c t="inlineStr"><is><t>Paris</t></is></c></row>
<row><c t="inlineStr"><is><t>Seattle</t></is></c></row>
<row><c t="inlineStr"><is><t>London</t></is></c></row>
<row><c t="inlineStr"><is><t>Copenhagen</t></is></c></row>
<row><c t="inlineStr"><is><t>Paris</t></is></c></row>
<row><c t="inlineStr"><is><t>London</t></is></c></row>
</sheetData>
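A consumer that handles both cases might look like the sketch below, which resolves `t="s"` cells through the shared-strings table and reads `t="inlineStr"` cells directly. The sample worksheet mixes both kinds on purpose; the function names are hypothetical, and other cell types are simply returned as their stored text.

```python
import xml.etree.ElementTree as ET

S = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

SST = f"""<sst xmlns="{S}" count="6" uniqueCount="4">
<si><t>Paris</t></si><si><t>Seattle</t></si>
<si><t>London</t></si><si><t>Copenhagen</t></si>
</sst>"""

# Hypothetical worksheet mixing shared, inline, and numeric cells
SHEET = f"""<worksheet xmlns="{S}"><sheetData>
<row r="1"><c r="A1" t="s"><v>0</v></c></row>
<row r="2"><c r="A2" t="inlineStr"><is><t>Seattle</t></is></c></row>
<row r="3"><c r="A3" t="s"><v>3</v></c></row>
<row r="4"><c r="A4"><v>42</v></c></row>
</sheetData></worksheet>"""


def load_shared_strings(sst_xml):
    """Return the shared strings as a list indexed from 0."""
    root = ET.fromstring(sst_xml)
    # Joining every t element is a simplification that also covers
    # entries made of several formatted runs.
    return [
        "".join(t.text or "" for t in si.iter(f"{{{S}}}t"))
        for si in root.findall(f"{{{S}}}si")
    ]


def cell_values(sheet_xml, shared):
    """Map each cell reference to its text, whatever the storage style."""
    values = {}
    for c in ET.fromstring(sheet_xml).iter(f"{{{S}}}c"):
        kind = c.get("t")
        v = c.find(f"{{{S}}}v")
        if kind == "s":
            value = shared[int(v.text)]  # index into the shared table
        elif kind == "inlineStr":
            value = "".join(t.text or "" for t in c.iter(f"{{{S}}}t"))
        else:
            value = v.text if v is not None else None
        values[c.get("r")] = value
    return values


print(cell_values(SHEET, load_shared_strings(SST)))
```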
SpreadsheetML Tables
Design goals for SpreadsheetML tables:
1. Separate presentation and data
Data stays in the worksheet
Table definition is in a separate part (referenced via a relationship)
2. Cell definition lightweight but extensible
Complex type with future storage capabilities
Named ranges written in their own collection instead of on each cell
Open XML has different types of tables for each
document type, optimized for different scenarios:
WordprocessingML has its tbl element
SpreadsheetML has its table element
PresentationML uses DrawingML tables (tbl inside graphicData)
SpreadsheetML Table Example
Worksheet part:
<sheetData>
<!-- Headings = shared strings -->
<row r="1" spans="1:2">
<c r="A1" t="s"><v>0</v></c>
<c r="B1" t="s"><v>1</v></c>
</row>
<row r="2" spans="1:2">
<c r="A2"><v>1</v></c>
<c r="B2"><v>4</v></c>
</row>
<row r="3" spans="1:2">
<c r="A3"><v>2</v></c>
<c r="B3"><v>5</v></c>
</row>
<row r="4" spans="1:2">
<c r="A4"><v>3</v></c>
<c r="B4"><v>6</v></c>
</row>
</sheetData>
...
<tableParts count="1">
<tablePart r:id="rId2"/>
</tableParts>
Table-definition part:
<table … ref="A1:B4" …>
<autoFilter ref="A1:B4"/>
<tableColumns count="2">
<tableColumn id="1" name="Column1" />
<tableColumn id="2" name="Column2" />
</tableColumns>
<tableStyleInfo …/>
</table>
AutoFilter Example
Formulas
Stored as plain text
Documented in the spec to provide for predictable interoperability
<row>
<c>
<v>1</v>
</c>
</row>
<row>
<c>
<v>2</v>
</c>
</row>
<row>
<c>
<v>3</v>
</c>
</row>
<row>
<c>
<f>SUM(A1:A3)</f>
</c>
</row>
DrawingML
DrawingML vs. VML
Per the Ecma spec: “VML should be considered
a deprecated format included in Office Open
XML for legacy reasons only.”
VML was not entirely replaced by DrawingML
before submission to Ecma
Main remaining uses of VML:
WordprocessingML: OfficeArt shapes, textboxes
SpreadsheetML/PresentationML: comments,
embedded OLE objects
3-D Effects
[Figure panels: Before; Apply 3-D Scene; Apply 3-D Bevels; Adjust Material types; 3-D Scene Definition]
DrawingML
Implementation varies for each document type
Location varies (main body, drawing part, slide)
Packaging (“shim”) varies
[Figure panels: WordprocessingML (in Word); SpreadsheetML (in Excel); PresentationML (in PowerPoint)]
WordprocessingML
DrawingML is stored in the document body
Shim defines graphic frame
and locked canvas
Shape definition is DrawingML
SpreadsheetML
Drawing is in a separate drawing part
Shim defines anchor
position and type
Shape definition uses
spreadsheetDrawing namespace
for non-visual properties
PresentationML
DrawingML is stored in the slide part
No shim – the shape is in
the shape tree
Shape definition is DrawingML
PresentationML
Document architecture
Themes
Slide Masters
Slide Layouts
Fonts
Slides
Presentation
View Properties
Notes Slides
Notes Masters
Presentation Properties
Handout Masters
Code
Sample Slide
Typical presentationML content
Shape Textbox Chart
Slide Part
Shape tree contains slide content definitions
<p:sld xmlns:p="…/presentationml/2006/main"
xmlns:a="…/drawingml/2006/main" …>
<p:cSld>
<p:spTree>
<!-- Shape -->
<p:sp>
<p:nvSpPr>
<p:cNvPr id="2" name="7-Point Star 1" />
…
<!-- Textbox -->
<p:sp>
<p:nvSpPr>
<p:cNvPr id="3" name="TextBox 2" />
…
<!-- Chart -->
<p:graphicFrame>
<p:nvGraphicFramePr>
<p:cNvPr id="4" name="Chart 3" />
…
</p:spTree>
</p:cSld>
<p:clrMapOvr>
<a:masterClrMapping />
</p:clrMapOvr>
</p:sld>
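As a rough sketch of how the shape tree can be walked, the code below lists the kind, id, and name of each item on a slide. The sample is a completed version of the abbreviated markup above (the elided content is left out, not reconstructed), and the function name `list_shapes` and the set of element kinds it checks are assumptions for this example.

```python
import xml.etree.ElementTree as ET

P = "http://schemas.openxmlformats.org/presentationml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Minimal well-formed slide based on the sample above
SLIDE = f"""<p:sld xmlns:p="{P}" xmlns:a="{A}"><p:cSld><p:spTree>
<p:sp><p:nvSpPr><p:cNvPr id="2" name="7-Point Star 1"/></p:nvSpPr></p:sp>
<p:sp><p:nvSpPr><p:cNvPr id="3" name="TextBox 2"/></p:nvSpPr></p:sp>
<p:graphicFrame><p:nvGraphicFramePr><p:cNvPr id="4" name="Chart 3"/>
</p:nvGraphicFramePr></p:graphicFrame>
</p:spTree></p:cSld></p:sld>"""

# Element kinds treated as slide content in this sketch
CONTENT_KINDS = {"sp", "graphicFrame", "pic", "cxnSp", "grpSp"}


def list_shapes(slide_xml):
    """Return (kind, id, name) for each content item in the shape tree."""
    root = ET.fromstring(slide_xml)
    tree = root.find(f"{{{P}}}cSld/{{{P}}}spTree")
    shapes = []
    for child in tree:
        kind = child.tag.split("}")[1]  # strip the namespace
        if kind not in CONTENT_KINDS:
            continue
        props = child.find(f".//{{{P}}}cNvPr")
        if props is not None:
            shapes.append((kind, props.get("id"), props.get("name")))
    return shapes


for shape in list_shapes(SLIDE):
    print(shape)
```

Note that the text box and the star are both `sp` elements; only the chart uses a `graphicFrame`.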
Chart Part ([Link])
Data source
Shape Textbox Chart
PresentationML Tables
Slide part contains table definition
In a graphicFrame element
All DrawingML is in the slide – no separate “table part”
Header-row formatting
Banded-row formatting
Table position
TableStyleID = GUID
Table definition
[Link]
Formed by 40 companies to share developer information about
the Office Open XML file formats
Articles with source code for C#, VB, Java, PHP, XSLT
Forums for posting technical questions
The Ecma Spec
1. Fundamentals
2. Open Packaging Convention
3. Primer (start here)
4. Markup Language Reference (huge!)
5. Markup Compatibility and Extensibility
Reference Schemas (XSD, RelaxNG)
Tips:
• Start with part 3, Primer
• Use the PDF version of part 4 to look up elements/attributes
Open XML Blogs
Brian Jones: [Link]
Doug Mahugh: [Link]
Kevin Boske: [Link]
Wouter Van Vugt: [Link]
Erika Ehrli: [Link]
See complete list on [Link]