Extensible Markup Language (XML)
An intro to structured flat-file data

Ever wondered how different computer programs talk to each other, or how data can be structured in a way that’s both human-readable and machine-understandable? Enter XML (eXtensible Markup Language), a powerful and versatile data format that you’re likely to encounter in your college career and beyond.
A Quick Trip Down Memory Lane: The History of XML
XML didn’t just appear out of nowhere. Its roots can be traced back to SGML (Standard Generalized Markup Language), a complex ISO standard from the 1980s for defining markup languages. While SGML was incredibly powerful, it was also, well, incredibly complex.
The World Wide Web Consortium (W3C) saw the need for a simpler, more web-friendly version of SGML, something that could handle diverse data structures while still being robust. And so, in 1998, XML 1.0 was released. It quickly gained traction for its simplicity, extensibility, and the fact that it was designed to be easily parsed by computers. Think of it as SGML’s leaner, meaner (and much more user-friendly) cousin.
What Exactly Is XML?
At its core, XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It’s “extensible” because it allows you to define your own tags, unlike HTML which has a predefined set of tags (like <p>, <h1>, <img>).
Here’s an example of what an XML file might look like for a small bookstore catalog:
<bookstore>
<book category="fiction">
<title lang="en">The Hitchhiker's Guide to the Galaxy</title>
<author>Douglas Adams</author>
<year>1979</year>
<price>12.99</price>
</book>
<book category="science">
<title lang="en">Cosmos</title>
<author>Carl Sagan</author>
<year>1980</year>
<price>15.50</price>
</book>
<book category="fiction">
<title lang="en">The Great Gatsby</title>
<author>F. Scott Fitzgerald</author>
<year>1925</year>
<price>12.99</price>
</book>
<book category="science">
<title lang="en">A Brief History of Time</title>
<author>Stephen Hawking</author>
<year>1988</year>
<price>15.50</price>
</book>
</bookstore>
Notice the self-descriptive tags like <book>, <title>, and <author>. This makes it very easy for someone (or something) to understand what kind of data is being presented.
You can view this file directly in the browser, and the structure will be syntax highlighted using the browser’s internal default style for XML files.
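To make the self-describing structure concrete, here is a minimal Python sketch that reads a shortened copy of the bookstore document with the standard library's xml.etree.ElementTree module. The constant BOOKSTORE_XML and the helper summarize are illustrative names chosen for this example, not part of any XML standard.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the bookstore document shown above.
BOOKSTORE_XML = """
<bookstore>
  <book category="fiction">
    <title lang="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
    <year>1979</year>
    <price>12.99</price>
  </book>
  <book category="science">
    <title lang="en">Cosmos</title>
    <author>Carl Sagan</author>
    <year>1980</year>
    <price>15.50</price>
  </book>
</bookstore>
"""


def summarize(xml_text):
    """Return one (category, title, author, year) tuple per <book> (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # root is the <bookstore> element
    rows = []
    for book in root.findall("book"):
        rows.append((
            book.get("category"),        # attribute value
            book.findtext("title"),      # text of the <title> child
            book.findtext("author"),
            int(book.findtext("year")),  # element text is always a string
        ))
    return rows


if __name__ == "__main__":
    for row in summarize(BOOKSTORE_XML):
        print(row)
```

Because the tags name the data they hold, the code can ask for "title" or "category" directly instead of relying on column positions.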
Where is XML Used? A Myriad of Applications
XML is incredibly versatile and is used in a vast array of applications. Here are just a few common scenarios:
Configuration Files: Many software applications use XML to store their settings and configurations.
Data Exchange: It’s a popular format for exchanging data between different systems, especially in older web services.
Document Storage: For structured documents, XML provides a robust way to store content, especially for publishing and content management systems.
RSS Feeds: Many news feeds you might still encounter are powered by XML!
Microsoft Office Formats: Under the hood, modern Word, Excel, and PowerPoint files (.docx, .xlsx, .pptx) are essentially ZIP archives containing collections of XML files.
Ensuring Consistency: What is an XML Schema?
Imagine you’re receiving XML data from various sources. How do you ensure that everyone is sending you data in the exact same format? This is where an XML Schema comes in.
An XML Schema (often using the XSD - XML Schema Definition - language) defines the legal building blocks of an XML document. It specifies:
Which elements and attributes are allowed.
The order of elements.
The data types for elements and attributes (e.g., string, integer, date).
Which elements are mandatory or optional.
Think of it like a blueprint or a contract for your XML data. If an XML document doesn’t conform to its associated schema, it’s considered “invalid.” This is crucial for data integrity and predictable processing.
Here is a full example of an XSD schema that validates our bookstore XML file.
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" attributeFormDefault="unqualified">
<!-- Define the root element, 'bookstore' -->
<xs:element name="bookstore">
<xs:complexType>
<!-- 'bookstore' contains a sequence of 'book' elements -->
<xs:sequence>
<xs:element ref="book" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
<!-- Define the 'book' element -->
<xs:element name="book">
<xs:complexType>
<!-- A 'book' contains a sequence of 'title', 'author', 'year', and 'price' -->
<xs:sequence>
<xs:element name="title">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<!-- 'title' has a required 'lang' attribute -->
<xs:attribute name="lang" type="xs:string" use="required" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
<xs:element name="author" type="xs:string" />
<xs:element name="year" type="xs:integer" />
<xs:element name="price" type="xs:decimal" />
</xs:sequence>
<!-- 'book' has a required 'category' attribute -->
<xs:attribute name="category" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:schema>
The schema is just a plain XML file with elements from the “http://www.w3.org/2001/XMLSchema” namespace. You can view this file in the browser.
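The rules this schema expresses can be illustrated without an XSD processor. The following Python sketch hand-codes a few of them (element order, required attributes, integer and decimal content) using only the standard library. It is an approximation for illustration, not real XSD validation: Python's standard library has no schema validator, so validating against the actual .xsd file would need a third-party library such as lxml or xmlschema. The names check_book, GOOD, and BAD are hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

EXPECTED_CHILDREN = ["title", "author", "year", "price"]

GOOD = ('<book category="science"><title lang="en">Cosmos</title>'
        '<author>Carl Sagan</author><year>1980</year><price>15.50</price></book>')
BAD = ('<book><title>Cosmos</title>'
       '<author>Carl Sagan</author><year>nineteen-eighty</year><price>15.50</price></book>')


def check_book(book):
    """Return a list of problems found in one <book> element (empty if none)."""
    problems = []
    # Mirrors xs:sequence: children must appear in exactly this order.
    if [child.tag for child in book] != EXPECTED_CHILDREN:
        return ["children must be title, author, year, price in that order"]
    # Mirrors use="required" on the two attributes.
    if "category" not in book.attrib:
        problems.append("missing required attribute 'category'")
    if "lang" not in book.find("title").attrib:
        problems.append("missing required attribute 'lang' on <title>")
    # Rough stand-ins for xs:integer and xs:decimal (Python's parsers are
    # slightly more lenient than the XSD types).
    try:
        int(book.findtext("year"))
    except ValueError:
        problems.append("<year> is not an integer")
    try:
        Decimal(book.findtext("price"))
    except InvalidOperation:
        problems.append("<price> is not a decimal")
    return problems


if __name__ == "__main__":
    print(check_book(ET.fromstring(GOOD)))
    print(check_book(ET.fromstring(BAD)))
```

A real schema validator performs these checks, and many more, directly from the XSD, which is why the schema works as a shared contract: nobody has to re-implement the rules by hand as this sketch does.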
Making XML Work for You in VS Code: Formatting and Linting
If you’re using VS Code, you’re in luck! It’s a fantastic editor for working with XML.
Formatting: Keeping your XML clean and readable is essential. VS Code, with the right extensions, can automatically format your XML, ensuring consistent indentation and spacing.
Linting: Linting is like having a spell checker and grammar checker for your code. For XML, a linter will check for well-formedness (basic syntax errors like unclosed tags) and can also validate your XML against a schema, flagging any inconsistencies.
Here are some excellent VS Code extensions to get you started:
XML Tools: This is a must-have! It provides XML formatting, validation, XPath evaluation, and XSLT transformations.
XML: Another popular option with good support for XML editing.
To Format an XML file in VS Code:
Install one of the extensions above.
Open your XML file.
Right-click anywhere in the file and select “Format Document” (or use Shift+Alt+F on Windows, Ctrl+Shift+I on Linux, or Shift+Option+F on macOS).
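The well-formedness half of linting can also be approximated outside the editor. This minimal Python sketch uses the standard library parser to report whether a piece of text is well-formed XML. It does not check a document against a schema, and the function name is_well_formed is an illustrative choice.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if the text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The message includes the line and column where parsing failed.
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    print(is_well_formed("<song><title>Imagine</title></song>"))
    print(is_well_formed("<song><title>Imagine</song>"))  # <title> is never closed
```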
XML as a Data Source: Querying Your Data
Once you have your data structured in XML, how do you get specific pieces of information out of it? This is where query languages come into play.
The primary language for querying XML documents is XPath (XML Path Language). XPath uses path expressions (much like file system paths) to navigate through elements and attributes in an XML document.
Here are some XPath examples based on our book example:
/bookstore/book: Selects all <book> elements that are direct children of the <bookstore> element.
/bookstore/book[1]: Selects the first <book> element.
/bookstore/book[@category='fiction']: Selects all <book> elements where the category attribute is “fiction”.
/bookstore/book/title: Selects all <title> elements that are children of <book> elements.
/bookstore/book/price/text(): Selects the text content of all <price> elements.
Many programming languages (like Python, Java, C#) have built-in libraries to parse XML and use XPath to extract data, making it a very effective data source.
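As a rough illustration, the queries above can be tried in Python with the standard library's xml.etree.ElementTree module. That module supports only a subset of XPath: paths are written relative to the element you search from (here the <bookstore> root), and text() is not available, so the text is read through the .text attribute instead. The sample data is a shortened copy of the bookstore file, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """
<bookstore>
  <book category="fiction">
    <title lang="en">The Hitchhiker's Guide to the Galaxy</title>
    <price>12.99</price>
  </book>
  <book category="science">
    <title lang="en">Cosmos</title>
    <price>15.50</price>
  </book>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>12.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE_XML)  # the <bookstore> element

# /bookstore/book
all_books = root.findall("book")

# /bookstore/book[1]
first_book = root.find("book[1]")

# /bookstore/book[@category='fiction']
fiction = root.findall("book[@category='fiction']")

# /bookstore/book/title
titles = [title.text for title in root.findall("book/title")]

# /bookstore/book/price/text() (via .text, since text() is unsupported here)
prices = [price.text for price in root.findall("book/price")]

print(len(all_books), first_book.findtext("title"))
print([book.findtext("title") for book in fiction])
print(titles)
print(prices)
```

Full XPath support, including absolute paths and text(), generally requires a third-party library; lxml is one commonly used option.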
The Wordiness Problem: Why XML’s Popularity Shifted
Despite its power and widespread adoption, XML has a notable drawback: its verbosity. Take a look at our book example again. Notice how each piece of data, like <title> or <author>, requires both an opening and a closing tag. This “self-describing” nature, while great for human readability and extensibility, means XML files can become quite large and contain a lot of repetitive markup compared to the actual data.
For instance, if you’re dealing with millions of records, this verbosity can lead to:
Larger File Sizes: More bytes transferred over networks, consuming more bandwidth and storage.
Slower Parsing: More text for computers to process, potentially slowing down applications.
Increased Complexity for Simple Data: For basic key-value pairs, the XML structure can feel unnecessarily heavy.
This “wordiness” was a significant factor in the rise of alternative data formats, most notably JSON (JavaScript Object Notation). JSON offers a much more compact syntax for representing data, especially for web-based applications where data size and parsing speed are critical. While XML still holds its ground in many enterprise and document-centric applications, JSON has become the de facto standard for many new web services and APIs due to its leaner structure.
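To get a feel for the difference, this small Python sketch encodes one book record both ways and compares the sizes in bytes. The JSON layout is an assumption made for this example (JSON has no attributes, so category and lang become ordinary keys), and both versions are written without optional whitespace to keep the comparison fair.

```python
import json

# One record from the bookstore example, as compact XML.
book_xml = (
    '<book category="fiction">'
    '<title lang="en">The Hitchhiker\'s Guide to the Galaxy</title>'
    '<author>Douglas Adams</author>'
    '<year>1979</year>'
    '<price>12.99</price>'
    '</book>'
)

# The same record as compact JSON (one possible mapping, chosen for illustration).
book_json = json.dumps(
    {
        "category": "fiction",
        "title": {"lang": "en", "text": "The Hitchhiker's Guide to the Galaxy"},
        "author": "Douglas Adams",
        "year": 1979,
        "price": 12.99,
    },
    separators=(",", ":"),  # drop the optional spaces after commas and colons
)

xml_size = len(book_xml.encode("utf-8"))
json_size = len(book_json.encode("utf-8"))

print(f"XML:  {xml_size} bytes")
print(f"JSON: {json_size} bytes")
```

For a single record the saving is modest, and the exact ratio depends on how long the tag names are relative to the data; the gap matters more as the number of records grows, because every element name is repeated in its closing tag.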
From Data to Presentation: XSLT Transformations
While XML is great for storing data, it doesn’t have a built-in way to define how that data should be presented to a user. This is where XSLT (eXtensible Stylesheet Language Transformations) comes in. Think of XSLT as a powerful instruction manual for converting an XML document into something else, like a human-readable web page (HTML).
XSLT is a declarative, XML-based language that defines a set of rules for transforming an XML document from one format to another. It uses templates that are applied to elements in the source XML document, and these templates specify how to restructure and reformat the data to create the output. This separation of data (XML) and presentation logic (XSLT) is a core concept that allows for great flexibility. For example, the same XML data file can be transformed into HTML for a web browser or into a new, differently structured XML file for data exchange.
The standard and recommended approach for XSLT transformations involves keeping your XML data, XSD schema, and XSLT stylesheet as separate files. This separation provides better maintainability, reusability, and follows best practices for web development. Instead of embedding stylesheets within XML documents, you can reference external XSLT files using a processing instruction or apply transformations programmatically.
Here’s an example using three separate files for a playlist system:
The XML data file: This is where the data resides.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="playlist.xsl"?>
<playlist xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="playlist.xsd">
<song>
<title>Bohemian Rhapsody</title>
<artist>Queen</artist>
</song>
<song>
<title>Imagine</title>
<artist>John Lennon</artist>
</song>
<song>
<title>Hallelujah</title>
<artist>Jeff Buckley</artist>
</song>
</playlist>
The XML Schema for validation: This defines the allowed structure for the data.
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<!-- Define the root element, 'playlist' -->
<xs:element name="playlist">
<xs:complexType>
<xs:sequence>
<xs:element name="stylesheet" minOccurs="0" maxOccurs="1">
<xs:complexType mixed="true">
<xs:sequence>
<xs:any namespace="##any" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
<xs:attribute name="version" type="xs:string" use="required"/>
<xs:anyAttribute namespace="##any" processContents="lax"/>
</xs:complexType>
</xs:element>
<xs:element name="song" minOccurs="1" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="title" type="xs:string"/>
<xs:element name="artist" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
The XSLT stylesheet: This defines how the data is displayed.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<head>
<title>My Playlist</title>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
h2 { color: #333; }
ul { list-style-type: disc; }
li { margin: 5px 0; }
</style>
</head>
<body>
<h2>My Favorite Songs</h2>
<ul>
<xsl:for-each select="playlist/song">
<li>
<xsl:value-of select="title"/> by <xsl:value-of select="artist"/>
</li>
</xsl:for-each>
</ul>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
When a browser that supports XSLT processes the playlist.xml file, it automatically applies the external stylesheet referenced in the processing instruction and displays the formatted HTML output (the <head> section is omitted here for brevity):
<html>
<body>
<h2>My Favorite Songs</h2>
<ul>
<li>
Bohemian Rhapsody by Queen
</li>
<li>
Imagine by John Lennon
</li>
<li>
Hallelujah by Jeff Buckley
</li>
</ul>
</body>
</html>
This approach offers several advantages over embedded stylesheets: the XSLT can be reused across multiple XML files, the schema provides validation independently of the transformation, and each file has a clear, focused responsibility in the overall system.
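For readers more comfortable with imperative code, the logic of the stylesheet can be mirrored in a few lines of Python. This is not XSLT and does not run the stylesheet: Python's standard library has no XSLT processor, so applying playlist.xsl programmatically would need a third-party library such as lxml. The sketch only shows what the for-each and value-of instructions amount to, leaves out the <head> section, and uses the hypothetical helper name playlist_to_html.

```python
import xml.etree.ElementTree as ET
from html import escape

PLAYLIST_XML = """<playlist>
  <song><title>Bohemian Rhapsody</title><artist>Queen</artist></song>
  <song><title>Imagine</title><artist>John Lennon</artist></song>
  <song><title>Hallelujah</title><artist>Jeff Buckley</artist></song>
</playlist>"""


def playlist_to_html(xml_text):
    """Build the same list the stylesheet describes (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # the <playlist> element
    items = []
    # Counterpart of <xsl:for-each select="playlist/song">
    for song in root.findall("song"):
        # Counterpart of the two <xsl:value-of> instructions
        title = escape(song.findtext("title"))
        artist = escape(song.findtext("artist"))
        items.append(f"<li>{title} by {artist}</li>")
    return (
        "<html><body><h2>My Favorite Songs</h2><ul>"
        + "".join(items)
        + "</ul></body></html>"
    )


if __name__ == "__main__":
    print(playlist_to_html(PLAYLIST_XML))
```

The contrast is the point: the Python version spells out each step, while the XSLT version declares a template and leaves the traversal to the processor.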
Conclusion
XML might seem a bit old-school compared to JSON for some modern web applications, but its importance, especially in enterprise systems, document management, and data interchange, remains significant. Understanding XML, schemas, and how to query and transform them with XSLT will give you a solid foundation for working with structured data in many computing contexts. While its verbosity led to a decline in popularity for certain use cases, its extensibility and strict validation capabilities ensure it remains a relevant and powerful tool.
So, dive in, experiment with those VS Code extensions, and start exploring the world of XML!