Every .docx file you open, every sitemap Google crawls, every hospital record exchanged between systems, and every SVG on a modern website shares one thing: they are all XML.
XML (Extensible Markup Language) is a text-based, platform-independent format for storing and transporting structured data. The W3C published the specification in 1998, and it has been quietly running core infrastructure ever since.
This guide covers what XML is, how it structures data, how it differs from HTML and JSON, how parsing works, and where XML still powers real systems in 2025 across finance, healthcare, and web development.
What is XML?

XML (Extensible Markup Language) is a text-based, platform-independent format for storing and transporting structured data. The W3C published the first XML 1.0 specification in February 1998, and the current standard is XML 1.0 (Fifth Edition), with XML 1.1 covering edge cases around Unicode character support.
The word “extensible” is the key distinction. Unlike HTML, which has a fixed set of tags, XML lets you define your own. A tag called <invoice> or <patientRecord> is perfectly valid XML. The format carries no visual presentation rules. It just describes data structure.
Over 35% of organizations still rely on XML for data exchange in finance and healthcare, according to industry reports (MoldStud, 2025). The XML Databases Software Market was valued at $3.2 billion in 2024 and is forecast to reach $7.5 billion by 2033 at a 9.8% CAGR (Verified Market Reports).
What separates XML from other data formats
XML is self-describing. The tag names carry semantic meaning, which matters a lot when two systems that have never communicated before need to exchange data.
3 properties make XML distinct from formats like CSV or binary files:
- Human-readable and machine-readable at the same time
- Hierarchical tree structure supports nested, complex data relationships
- Validation against a schema is a first-class concept: DTD validation is defined in the XML specification itself, and XSD is a companion W3C standard
JSON can’t enforce document structure natively the way XML Schema does. That matters in regulated industries.
The XML 1.0 specification in brief
Published by: W3C (World Wide Web Consortium), 1998
Current version: XML 1.0 Fifth Edition (2008)
Character encoding support: UTF-8 and UTF-16 natively, with optional encoding declarations for others
XML 1.1 exists but sees minimal adoption. Most tools and systems target XML 1.0. Practically speaking, if you’re working with XML today, you’re working with XML 1.0.
How Does XML Structure Data?
XML structures data as a tree of nested elements. Every XML document has exactly 1 root element. All other elements are children or descendants of that root. Attributes provide additional metadata on individual elements without creating new child nodes.
| Component | Role | Example |
|---|---|---|
| Prolog | Declares XML version and encoding | <?xml version="1.0" encoding="UTF-8"?> |
| Root element | Top-level container, required | <catalog> |
| Child elements | Nested data nodes | <product id="101"> |
| Attributes | Metadata within an element’s opening tag | id="101" |
| CDATA section | Raw text block, parser ignores markup inside | <![CDATA[ … ]]> |
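The components in the table can be seen together in a short sketch. The catalog document below is a made-up example, and Python’s standard-library ElementTree is only one of several ways to read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog document: prolog, one root element, child elements, attributes.
CATALOG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <product id="101">
    <name>Desk lamp</name>
    <price currency="USD">24.50</price>
  </product>
  <product id="102">
    <name>Notebook</name>
    <price currency="USD">3.20</price>
  </product>
</catalog>
"""

root = ET.fromstring(CATALOG_XML)  # root is the single <catalog> element

# Walk the child elements and read attributes (get) and nested text (findtext).
products = [
    {
        "id": product.get("id"),
        "name": product.findtext("name"),
        "price": float(product.findtext("price")),
    }
    for product in root.findall("product")
]
```

Attributes come back as strings, so any numeric conversion (as with the price here) is up to the application unless a schema-aware tool is used.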
What makes an XML document well-formed?
A well-formed XML document follows 5 strict rules the parser enforces:
- Exactly 1 root element wraps all content
- Every opening tag has a matching closing tag
- Tags are case-sensitive (<Name> and <name> are different elements)
- Attributes must be quoted
- Elements must be properly nested, never overlapping
Well-formed is the minimum requirement. A parser will reject any document that breaks these rules entirely. “Valid” is the next level up, which means the document also conforms to a DTD or XML Schema.
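As a rough illustration of how a parser enforces these rules, the sketch below feeds one sample per rule to Python’s standard-library ElementTree parser. The helper name is_well_formed and the sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One hypothetical sample per rule, plus a document that follows all of them.
samples = {
    "ok": "<order><item>pen</item></order>",
    "two roots": "<a/><b/>",
    "unclosed tag": "<order><item>pen</order>",
    "case mismatch": "<Name>Ada</name>",
    "unquoted attribute": "<item id=5/>",
    "overlapping": "<b><i>text</b></i>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
```

Only the "ok" sample is accepted. This checks well-formedness only; validity against a DTD or XSD needs a validating tool.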
What is an XML Schema (XSD)?
An XSD (XML Schema Definition) defines what elements and attributes are allowed, their data types, and whether they’re required or optional. It’s written in XML itself.
XSD replaced DTD as the preferred validation method because it supports 40+ built-in data types (string, integer, date, boolean, and more), something DTD cannot do. Approximately 41% of developers prefer XML for configuration management in enterprise applications, partly because XSD lets teams enforce strict data contracts between systems (MoldStud, 2025).
IBM, SAP, and Oracle all ship XSD-based validation in their enterprise integration platforms.
What is the Difference Between XML and HTML?
XML and HTML share the same angle-bracket syntax but solve completely different problems. HTML defines how content looks in a browser. XML defines what data means. They are not competing technologies. They often coexist in the same system.
Tag definitions and flexibility
HTML tag set: Fixed. You use <p>, <div>, <table> because browsers know what those mean.
XML tag set: Fully user-defined. You write <shipmentDate> or <bloodPressureReading> because the meaning comes from your application, not from a browser.
HTML is forgiving by design. A browser will render a page even if tags aren’t closed. XML is not forgiving. One unclosed tag and the XML parser throws an error and stops.
Parsing rules and error handling
HTML parsers are built for real-world messiness. XHTML is the stricter version that applies XML parsing rules to HTML documents. Most web developers moved away from XHTML after the first HTML5 drafts appeared around 2008, because the strictness caused more problems than it solved for general web pages.
For data exchange between systems, strictness is actually useful. A finance system sending payment data to a bank needs the receiving system to reject malformed input, not silently guess what was meant.
When to use each
| Use Case | Use HTML | Use XML |
|---|---|---|
| Web page content | Yes | No |
| Data exchange between APIs | No | Yes |
| Config files | No | Yes |
| Document storage with metadata | No | Yes |
What Are the Core Components of an XML Document?

An XML document has 7 structural components. Not all appear in every document, but understanding each one determines whether you can write, debug, or parse XML reliably.
What are XML namespaces?
Namespaces prevent tag name conflicts when combining XML from 2 or more sources. If both a product catalog and an order system define a <price> element, the parser needs a way to tell them apart.
A namespace declaration binds a prefix to a URI, and that prefix then qualifies a group of elements:
xmlns:cat="http://example.com/catalog"
Then <cat:price> and <ord:price> refer to different things. The URI doesn’t need to point to a real web page. It just functions as a unique identifier string.
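A minimal sketch of the same idea in code, using Python’s standard-library ElementTree. The orders URI and the price values are hypothetical; only the catalog URI comes from the declaration above.

```python
import xml.etree.ElementTree as ET

DOC = """
<order xmlns:cat="http://example.com/catalog"
       xmlns:ord="http://example.com/orders">
  <cat:price>19.99</cat:price>
  <ord:price>17.49</ord:price>
</order>
"""

# Prefix-to-URI map used for lookups. The prefixes here are local to this code;
# only the URIs have to match the document.
NS = {
    "cat": "http://example.com/catalog",
    "ord": "http://example.com/orders",  # hypothetical URI
}

root = ET.fromstring(DOC)
list_price = root.findtext("cat:price", namespaces=NS)
charged_price = root.findtext("ord:price", namespaces=NS)

# Internally the parser expands each name to {URI}local-name.
expanded_name = root.find("cat:price", NS).tag
```

The expanded name shows why the prefix itself is not significant: two documents can use different prefixes for the same URI and still refer to the same element.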
What is a CDATA section?
Purpose: Wraps text that contains characters the XML parser would otherwise misread as markup.
If you need to store a block of JavaScript or CSS inside an XML document without escaping every < and & character, CDATA handles it.
Syntax: <![CDATA[ your raw content here ]]>
The parser skips everything inside CDATA and treats it as plain text. The 3 characters that end a CDATA section (]]>) cannot appear inside one.
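To make the equivalence concrete, the sketch below stores the same JavaScript fragment twice, once in a CDATA section and once with entity escaping. The widget document is invented for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<widget>
  <script><![CDATA[if (a < b && b > 0) { run(); }]]></script>
  <escaped>if (a &lt; b &amp;&amp; b &gt; 0) { run(); }</escaped>
</widget>"""

root = ET.fromstring(DOC)

# Both forms produce the same character data once parsed.
from_cdata = root.findtext("script")
from_entities = root.findtext("escaped")
```

After parsing, the application sees identical text either way. CDATA is a convenience for the author of the document, not a different data type.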
Entity references
XML has 5 predefined entity references for characters with special meaning in markup:
- &lt; for <
- &gt; for >
- &amp; for &
- &quot; for "
- &apos; for '
Custom entities are defined in the DTD. External entity references (pointing to files or URLs) are the source of XXE security vulnerabilities, covered in a later section.
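When generating XML from code, the predefined entities are usually applied by a helper rather than by hand. A small sketch using Python’s xml.sax.saxutils, with a sample string chosen for this example:

```python
from xml.sax.saxutils import escape, unescape

text = "Tom & Jerry's <\"best\"> hits"

# escape() handles &, < and > by default; quotes are added through the extra mapping.
escaped = escape(text, {'"': "&quot;", "'": "&apos;"})

# unescape() reverses the process when given the inverse mapping.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})
```

Libraries that build XML trees (ElementTree, for instance) apply this escaping automatically on serialization, so manual escaping mostly matters when XML is assembled as plain strings.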
How Does XML Parsing Work?
An XML parser reads an XML document and converts it into a data structure the application can work with. There are 3 main parsing models: DOM, SAX, and StAX. Each involves a different trade-off between memory use, speed, and access pattern.
DOM vs. SAX: which parser to use?
DOM (Document Object Model) parsing loads the entire XML document into memory as a tree. You can navigate to any node, modify it, and query it with XPath. The trade-off is memory. A 500 MB XML file parsed with DOM can consume multiple gigabytes of RAM, because the in-memory tree is typically several times larger than the file on disk.
SAX (Simple API for XML) parsing reads the document sequentially, firing events as it encounters elements. It never builds a full in-memory tree.
SAX uses constant memory regardless of file size. The limitation: you can’t go back. Once the parser passes a node, it’s gone.
StAX (Streaming API for XML) is a middle path. Like SAX it streams, but it gives the developer control over when to pull the next event rather than pushing events automatically. Java developers use StAX heavily for large-file processing.
| Parser | Memory Use | Random Access | Best For |
|---|---|---|---|
| DOM | High | Yes | Small documents, complex navigation |
| SAX | Low | No | Large files, read-once processing |
| StAX | Low | No | Large files, developer-controlled flow |
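The three access patterns can be sketched side by side in Python’s standard library. One caveat: Python has no StAX API, so ElementTree’s iterparse stands in here as a loosely comparable pull-style loop, which is an assumption of this example rather than an exact equivalent.

```python
import io
import xml.dom.minidom
import xml.sax
import xml.etree.ElementTree as ET

DOC = b"<catalog><product id='101'/><product id='102'/><product id='103'/></catalog>"

# DOM: the whole tree is built first, then queried in any order.
dom = xml.dom.minidom.parseString(DOC)
dom_ids = [node.getAttribute("id") for node in dom.getElementsByTagName("product")]


# SAX: the parser pushes events into a handler; no tree is kept.
class ProductIdHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "product":
            self.ids.append(attrs.getValue("id"))


handler = ProductIdHandler()
xml.sax.parseString(DOC, handler)
sax_ids = handler.ids

# Pull style: the application drives the loop and discards elements as it goes.
pull_ids = []
for event, elem in ET.iterparse(io.BytesIO(DOC), events=("end",)):
    if elem.tag == "product":
        pull_ids.append(elem.get("id"))
        elem.clear()  # release the element's contents to keep memory use low
```

All three produce the same list of ids. The difference is who controls the flow and how much of the document is held in memory at once.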
XPath: querying the parsed XML tree
XPath is the query language for navigating XML documents. It works on the DOM tree model and uses path expressions similar to file system paths.
//product[@id='101']/price selects the price element inside any product element with the attribute id equal to 101.
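A hedged sketch of that query in Python. The standard-library ElementTree supports only a subset of XPath and expects paths relative to an element, so the expression is written with a leading dot. The catalog content is invented for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catalog>"
    "<product id='101'><price>24.50</price></product>"
    "<product id='102'><price>3.20</price></product>"
    "</catalog>"
)

# ".//" searches all descendants of root; the predicate filters on the id attribute.
price = root.find(".//product[@id='101']/price")

# Without the predicate, every price element in the tree is returned.
all_prices = [p.text for p in root.findall(".//price")]
```

Full XPath implementations accept the expression exactly as written in the text, including the leading //.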
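XPath 1.0 is part of the original W3C XML stack. XPath 2.0 and 3.1 add stronger type support and functional expressions. Support varies by tool: libxml2-based libraries (including Python’s lxml) and .NET’s System.Xml implement XPath 1.0, while processors such as Saxon implement XPath 2.0 and later.
Python’s lxml library, Java’s javax.xml.xpath package, and .NET’s System.Xml all include XPath support. In Java and .NET that support ships with the standard library, so no third-party dependency is needed. lxml is a separate install, and Python’s built-in xml.etree.ElementTree covers a limited XPath subset.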
What is XML Used For?
XML has 5 major real-world application areas. Some are obvious, others less so. The less obvious ones (like every .docx file you’ve ever opened) are where XML’s reach actually surprises most people.
How is XML used in web services?
SOAP (Simple Object Access Protocol) uses XML exclusively for message formatting. Every SOAP request and response is an XML document with a defined envelope, header, and body structure.
About 70% of web services utilize XML for cross-platform communication, according to industry reports (MoldStud, 2025). Banking, healthcare, and government systems still rely on SOAP-based web services because SOAP provides built-in support for WS-Security, message-level encryption, and ACID-compliant transactions that REST does not guarantee by default.
PayPal’s original payment API was SOAP-based. Many financial institutions and healthcare providers still run SOAP endpoints alongside their newer REST APIs to support legacy integrations.
How is XML used in configuration files?
Several major development ecosystems use XML as their configuration format by default:
- Maven (Java): pom.xml defines project dependencies, build plugins, and versioning
- Android development: AndroidManifest.xml declares permissions, activities, and app metadata
- Spring Framework: applicationContext.xml (pre-annotation era) configured dependency injection
- .NET: web.config and app.config control application behavior and connection strings
Approximately 41% of developers prefer XML for configuration management in enterprise applications specifically because XSD validation catches misconfiguration errors before deployment (MoldStud, 2025).
XML in document formats and industry standards
This is where XML’s footprint is largest and most overlooked.
Every .docx, .xlsx, and .pptx file is a ZIP archive containing XML files. The OOXML format (ISO/IEC 29500) is XML. Open any .docx with a file decompressor and you’ll find document.xml, styles.xml, and settings.xml inside.
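The claim is easy to check in code. The sketch below defines a small helper (list_xml_parts is a name chosen for this example) and runs it against a stand-in archive built in memory, since no real document is available here. The stand-in mimics the layout of a .docx but is not a valid Word file.

```python
import io
import zipfile


def list_xml_parts(package_file):
    """Return the sorted names of the XML parts inside a ZIP-based package."""
    with zipfile.ZipFile(package_file) as package:
        return sorted(name for name in package.namelist() if name.endswith(".xml"))


# Stand-in archive for illustration only. A real .docx contains more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/styles.xml", "<styles/>")
    package.writestr("word/media/image1.png", b"")
buffer.seek(0)

parts = list_xml_parts(buffer)
```

Passing the path of an actual .docx, .xlsx, or .pptx file to list_xml_parts should work the same way, because zipfile.ZipFile accepts either a path or a file-like object.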
SVG (Scalable Vector Graphics) is XML. Every SVG file in web design is a valid XML document. FHIR (Fast Healthcare Interoperability Resources) supports XML as one of its 2 primary encoding formats alongside JSON. XBRL, the financial reporting standard used for SEC filings and corporate disclosures, is XML-based. RSS and Atom feeds are XML. XML sitemaps, which Google recommends for large sites, are XML.
What is SOAP and How Does It Use XML?
SOAP (Simple Object Access Protocol) is a messaging protocol that uses XML to structure every message exchanged between a client and a web service. It is not a data format. It is a full communication protocol with rules for message structure, error handling, and transport.
72% of enterprises that have invested in API frameworks specifically highlight SOAP’s robustness for legacy system support (MoldStud, 2025). That number explains why SOAP is still running in production across banking, insurance, and government despite being over 25 years old.
SOAP message structure
Every SOAP message is an XML document with 2 mandatory parts, 1 optional part, and 1 conditional part:
- Envelope: The root XML element that wraps the entire message
- Header (optional): Authentication tokens, transaction IDs, routing information
- Body: The actual request or response data
- Fault (conditional): Error details, carried inside the Body only when the call fails
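The structure above can be sketched by building a small request in code. The envelope namespace is the SOAP 1.1 one. The payments namespace, the GetBalance operation, and the element names inside it are hypothetical and stand in for whatever a real service’s WSDL would define.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/payments"  # hypothetical application namespace

# Choose readable prefixes for serialization (this registry is process-wide).
ET.register_namespace("soap", SOAP_NS)
ET.register_namespace("pay", APP_NS)

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")

# Optional header carrying a made-up transaction id.
header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
ET.SubElement(header, f"{{{APP_NS}}}TransactionId").text = "tx-0001"

# Body carrying the hypothetical request.
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
request = ET.SubElement(body, f"{{{APP_NS}}}GetBalance")
ET.SubElement(request, f"{{{APP_NS}}}AccountId").text = "12345"

message = ET.tostring(envelope, encoding="unicode")
```

In practice this XML is rarely written by hand. Client code generated from the service description usually builds the envelope, and the sketch is only meant to show what ends up on the wire.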
The WSDL (Web Services Description Language) file is the companion XML document that describes what operations a SOAP service offers, what inputs it expects, and what it returns. WSDL is machine-readable. Tools like Apache CXF and .NET’s WCF generate client code directly from WSDL files.
Where SOAP still wins over REST
Through the WS-* family of extensions, SOAP standardizes things REST doesn’t: reliable message delivery, stateful operations, atomic transactions across multiple systems, and WS-Security for message-level encryption. REST leaves those to the developer.
In a 2024 survey, over 60% of enterprises reported still relying on established API standards for critical operations (MoldStud). For payment processing between banks, where a duplicate transaction means a real financial loss, that level of protocol-enforced reliability justifies the verbosity cost of XML.
Salesforce ran a SOAP API as its primary integration method for years before adding a REST API in 2010. The SOAP API is still active and still used by enterprise customers who built integrations against it in the 2000s.
What is the Difference Between XML and JSON?
In 2024, 78% of APIs used JSON for data exchange, yet XML remains the dominant format in enterprise systems, regulated industries, and document-centric workflows (DEV Community, 2024).
That split is not about one format winning. It reflects different tools solving different problems.
| Factor | XML | JSON |
|---|---|---|
| File size | Larger (closing tags add weight) | 20-30% smaller for same data |
| Native array support | No | Yes |
| Schema validation | XSD (mature, 40+ data types) | JSON Schema (less mature) |
| Metadata support | Strong (attributes, namespaces) | Limited |
| JavaScript parsing | Requires XML parser | Native, no library needed |
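Two rows of the table, file size and array support, can be illustrated by serializing one record both ways. The order record is invented for this sketch, and a record this small says little about real-world size ratios.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record with one scalar field and one list.
order = {"id": "A-17", "items": ["pen", "ink", "paper"]}

# JSON has a native array type, so the list maps directly.
json_text = json.dumps(order)

# XML has no array type; a list becomes repeated child elements.
order_el = ET.Element("order", id=order["id"])
items_el = ET.SubElement(order_el, "items")
for item in order["items"]:
    ET.SubElement(items_el, "item").text = item
xml_text = ET.tostring(order_el, encoding="unicode")

# Reading the list back out of XML means collecting the repeated elements.
items_from_xml = [el.text for el in ET.fromstring(xml_text).iter("item")]
```

The XML form is longer mainly because every item carries an opening and a closing tag. In exchange, the id can live in an attribute, which is the kind of metadata slot JSON does not have.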
Where JSON is the better choice
REST APIs, browser-based apps, mobile apps. JSON parses natively in JavaScript with no external library. Smaller payload sizes mean faster responses. For web applications calling a backend service dozens of times per page load, that overhead compounds quickly.
According to Stack Overflow research, JSON is used by 54.7% of developers compared to 42.4% for XML (MoldStud, 2024).
MongoDB, CouchDB, and most modern NoSQL databases use JSON as their native document format. Zero format translation on read or write.
Where XML still wins
XSD is still the gold standard for strict schema validation, particularly in enterprise and regulated environments (JSON Utils, 2025).
3 scenarios where XML is the stronger option:
- Documents with complex metadata or mixed content (legal filings, technical manuals, publishing pipelines)
- Systems requiring WS-Security, digital signatures, or XML Encryption
- Compliance with standards that mandate XML: XBRL for financial reporting, CDA for healthcare, OOXML for office documents
Over 65% of enterprises use both XML and JSON in parallel rather than choosing one exclusively (MoldStud, 2025). The formats coexist. Most serious integration platforms handle both without drama.
If you want to convert between them, tools like an XML to CSV Converter or a CSV to XML Converter handle the format translation. A JSON beautifier or JSON minifier can help when switching between formats during debugging.
What Are XML Transformations?
XML transformation is the process of converting an XML document into a different structure or format using a defined set of rules. The 2 primary technologies for this are XSLT and XQuery, both W3C standards, both still actively used in finance, publishing, healthcare, and government.
How does XSLT work?
XSLT (Extensible Stylesheet Language Transformations) became a W3C Recommendation in 1999. XSLT 3.0, the current version, reached Recommendation status on June 8, 2017.
XSLT uses template-matching rules, not procedural loops. You write templates that match specific XML patterns, and the processor applies them as it traverses the document tree.
Common XSLT output targets:
- HTML (XML catalog to web page)
- PDF via XSL-FO intermediate format
- A different XML structure (transforming one schema into another)
- Plain text for batch exports
XSLT 3.0 added streaming, which processors such as Saxon implement, enabling processing of large XML documents without loading the full document into memory. This matters for HL7 records, financial batch files, and large publishing datasets (xml.com, 2025).
XQuery for XML databases
Over 60% of developers are currently using XQuery and XPath in their XML projects (MoldStud, 2024).
XQuery vs. XSLT in one line: XQuery is SQL for XML documents. XSLT is a stylesheet language for converting XML into something else.
XQuery runs natively in XML databases like MarkLogic, BaseX, and eXist-db. A FLWOR expression (For, Let, Where, Order by, Return) works similarly to a SQL SELECT query but operates on XML tree nodes instead of table rows.
XSLT tooling in production
Altova XMLSpy is trusted by 91% of the Fortune 500 and 5.4 million developers worldwide, with built-in XSLT 1.0, 2.0, and 3.0 processors alongside XQuery 1.0 and 3.0 (Altova, 2025).
Saxon (from Saxonica) is the reference XSLT 3.0 processor. Most Java-based transformation pipelines use Saxon under the hood.
Oxygen XML Editor provides step-through XSLT debugging with breakpoints, XPath inspection, and integration with XQuery. For teams doing serious document transformation work, those debugging tools cut development time significantly.
What Are the Security Vulnerabilities in XML?
XML processing introduces 3 specific attack classes that don’t exist with simpler data formats. All 3 exploit the XML parser itself rather than application logic. Understanding them is non-negotiable before deploying any system that accepts XML input from external sources.
What is an XXE attack?
XXE stands for XML External Entity injection. OWASP listed XXE as a standalone entry (A4) in the 2017 edition of its Top 10. In the 2021 edition, XXE became a sub-category under A05: Security Misconfiguration, where 90% of applications were tested for some form of misconfiguration (OWASP, 2021).
How it works: an attacker submits XML that includes an entity declaration pointing to a local file or internal network resource. If the parser resolves external entities (many do by default), the content of that resource gets returned in the response.
Real examples from 2023-2024:
- CVE-2024-5919: blind XXE in Palo Alto Networks PAN-OS allowed file exfiltration from firewalls to attacker-controlled servers
- CVE-2024-30043: XXE in Microsoft SharePoint Server allowed file reads and SSRF attacks
- CVE-2023-27554: XXE in IBM WebSphere Application Server exposed sensitive data
Fix: disable external entity processing and DTD retrieval in the XML parser configuration. One setting. Most modern libraries have it available.
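What that setting looks like depends on the library. As one hedged example, Python’s standard-library SAX parser exposes it as two feature flags. Recent Python versions already leave external general entities off by default, so setting them explicitly mostly documents the intent. The helper and handler names are assumptions of this sketch.

```python
import io
import xml.sax
from xml.sax.handler import feature_external_ges, feature_external_pes


class TextCollector(xml.sax.ContentHandler):
    """Collects character data so the caller can inspect what was parsed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def make_hardened_parser():
    """Create a SAX parser with external entity loading switched off."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)  # external general entities
    parser.setFeature(feature_external_pes, False)  # external parameter entities
    return parser


def parse_untrusted(data: bytes) -> str:
    """Parse untrusted bytes and return the concatenated text content."""
    parser = make_hardened_parser()
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)
```

Other parsers use different switches for the same purpose, so the equivalent flags should be looked up in the documentation of whichever library is in use.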
The Billion Laughs attack
Named for its structure, not its impact. An attacker defines a nested chain of entity references: 1 entity expands to 10, those expand to 100, which expand to 1,000, and so on.
A tiny XML file triggers exponential memory consumption that crashes the parsing process.
Fix: set a hard limit on entity expansion depth and total entity count at the parser level. Libraries like lxml (Python) and Apache Xerces expose these configuration options.
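Where untrusted input never legitimately needs a DTD, a stricter alternative to expansion limits is to refuse any document that declares one. The sketch below does that with Python’s low-level expat bindings. It is a different technique from the limits described above, and the names DtdForbidden and assert_no_dtd are made up for this example.

```python
import xml.parsers.expat


class DtdForbidden(ValueError):
    """Raised when an untrusted document carries a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Called by expat as soon as it sees <!DOCTYPE ...>, before any entity is expanded.
    raise DtdForbidden(f"DOCTYPE declarations are not accepted (found {name!r})")


def assert_no_dtd(data: bytes) -> None:
    """Scan the document and raise DtdForbidden if it declares a DTD."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = _reject_doctype
    parser.Parse(data, True)


# A deliberately small document in the Billion Laughs style (two levels only).
SUSPICIOUS = b"""<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
]>
<lolz>&lol2;</lolz>"""

try:
    assert_no_dtd(SUSPICIOUS)
    rejected = False
except DtdForbidden:
    rejected = True
```

Because entities can only be declared inside a DTD, rejecting the DOCTYPE closes off both entity expansion and external entity attacks, at the cost of refusing documents that rely on DTDs for legitimate reasons.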
XPath injection
XPath injection is the XML analog of SQL injection. If user input gets inserted into an XPath query without sanitization, an attacker can modify the query logic to bypass authentication or extract data they shouldn’t see.
Prevention covers 3 layers: parameterized XPath queries (when the library supports them), input sanitization before any XML document model is touched, and XSD schema validation to reject structurally invalid documents before processing begins.
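A sketch of the input-handling layer, assuming a hypothetical users document. It combines an allow-list check on the input with a lookup that compares values in application code, so user input never becomes part of an XPath expression.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical user store for illustration.
USERS = ET.fromstring(
    "<users>"
    "<user name='alice' role='admin'/>"
    "<user name='bob' role='viewer'/>"
    "</users>"
)

# Allow-list: only characters expected in a user name, with a length cap.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def find_user(name: str):
    """Return the matching user element, or None if absent or the input is rejected."""
    if not SAFE_NAME.fullmatch(name):
        return None
    # The comparison happens in Python, not inside a query string,
    # so quotes or operators in the input cannot alter any query logic.
    for user in USERS.iter("user"):
        if user.get("name") == name:
            return user
    return None
```

With a full XPath engine, the equivalent safeguard is to bind the input as a query variable where the library offers that, rather than formatting it into the expression string.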
What Tools Are Used to Work With XML?
The XML tooling ecosystem splits cleanly into 4 categories: editors for writing and validating XML, processors for transforming it, libraries for parsing it in code, and databases for storing and querying it at scale.
XML editors and validators
Altova XMLSpy is the top-selling XML editor globally, trusted by 91% of the Fortune 500 (Altova, 2025). Includes graphical XSD schema designer, XSLT/XQuery debugger, and built-in support for XBRL, SOAP, and JSON.
Oxygen XML Editor is the preferred choice for technical writers and documentation teams working with DITA and DocBook. The 28.x release series (2024-2026) added AI-powered content generation and XLIFF translation workflows.
For developers who don’t need a dedicated IDE, VS Code with the XML extension from Red Hat handles validation, XPath evaluation, and schema association without leaving the editor.
Parsing libraries by language
| Language | Library | Parser Type |
|---|---|---|
| Python | lxml | DOM + XPath + XSLT |
| Java | JAXP (with JAXB, JSR 222, for object binding) | DOM, SAX, StAX |
| .NET | System.Xml | DOM, SAX-style XmlReader |
| JavaScript | DOMParser (native) | DOM |
XML databases
MarkLogic is the leading enterprise XML database, used in publishing, financial services, and government for managing large collections of XML and JSON documents alongside full-text search and ACID transactions.
BaseX and eXist-db are open-source alternatives focused on XQuery performance and research use cases. BaseX consistently benchmarks as one of the fastest XQuery processors available.
For command-line validation: xmllint (part of libxml2) validates an XML document against a DTD or XSD in a single command. Ships with most Linux distributions by default. Took me longer than I’d like to admit to realize it was already installed before I went looking for it elsewhere.
How Does XML Relate to Modern Web Standards?
XML didn’t fade out. It became the substrate that powers formats most developers use every day without thinking about it as XML at all.
SVG is XML
Every SVG file is a valid XML document. The W3C SVG specification is built on the XML data model, which means SVG files follow XML well-formedness rules and can be validated, transformed with XSLT, and embedded directly into HTML as inline XML.
That’s why SVG in HTML works without any conversion step. You can animate SVG with CSS, apply XPath to its nodes, or process it with any XML-aware tool.
For practical work: SVG optimization tools like SVGO reduce file size by removing redundant XML nodes and attributes. Editing SVG files directly means editing XML. If you’ve ever looked at SVG source and seen <svg xmlns="http://www.w3.org/2000/svg">, that namespace declaration is pure XML syntax.
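Since SVG is XML, a general-purpose XML library is enough to edit it. The sketch below recolors the circles in a small made-up SVG using Python’s standard-library ElementTree.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Serialize SVG elements without a prefix (this registry is process-wide).
ET.register_namespace("", SVG_NS)

SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="red"/>
  <circle cx="50" cy="50" r="20" fill="red"/>
</svg>"""

root = ET.fromstring(SVG)

# Element names are namespace-qualified after parsing, hence the {URI}circle form.
for circle in root.iter(f"{{{SVG_NS}}}circle"):
    circle.set("fill", "navy")

recolored = ET.tostring(root, encoding="unicode")
```

The result is still a standalone SVG document and can be written to a file or inlined into HTML.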
Office documents and financial reporting
Every .docx, .xlsx, and .pptx file created by Microsoft Office or Google Docs is a ZIP archive containing XML files. OOXML (Office Open XML, ISO/IEC 29500) is the international standard, and it is XML through and through.
XBRL (eXtensible Business Reporting Language) is the XML-based standard for financial disclosures. The SEC requires XBRL tagging for financial statements filed by public companies in the US. Altova XMLSpy holds the XBRL-Certified Software designation from XBRL International.
Healthcare and XML
FHIR (Fast Healthcare Interoperability Resources) supports both JSON and XML as primary encoding formats. According to a 2025 State of FHIR survey, 71% of organizations report using FHIR for at least some healthcare data exchange use cases (HL7 International, 2025).
HL7 v3 and CDA (Clinical Document Architecture) are XML-only formats. CDA encodes patient clinical documents as XML and remains widely deployed across hospital systems globally.
Cleveland Clinic uses FHIR to sync patient records across its entire hospital network, relying on XML and JSON alongside REST APIs for health data exchange (TechMagic, 2024).
XML sitemaps and web infrastructure
The sitemaps.org protocol that Google supports is XML-based. Every sitemap.xml file submitted to Google Search Console is an XML document structured according to the sitemaps.org XML schema.
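A minimal sketch of generating such a file, using the sitemaps.org namespace and Python’s standard-library ElementTree (the xml_declaration argument needs Python 3.8 or later). The URLs and dates are placeholders, and build_sitemap is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Serialize sitemap elements without a prefix (this registry is process-wide).
ET.register_namespace("", SITEMAP_NS)


def _q(tag: str) -> str:
    """Qualify a tag name with the sitemap namespace."""
    return f"{{{SITEMAP_NS}}}{tag}"


def build_sitemap(pages) -> bytes:
    """Build sitemap bytes from (url, last-modified date) pairs."""
    urlset = ET.Element(_q("urlset"))
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, _q("url"))
        ET.SubElement(url, _q("loc")).text = loc
        ET.SubElement(url, _q("lastmod")).text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Placeholder pages for illustration.
sitemap_bytes = build_sitemap(
    [
        ("https://example.com/", "2025-01-15"),
        ("https://example.com/blog/what-is-xml", "2025-02-01"),
    ]
)
```

Writing sitemap_bytes to a file named sitemap.xml gives a document ready for submission. Larger sites usually generate this from their content database on a schedule.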
RSS and Atom feeds (still used by podcasting platforms, news aggregators, and content management systems) are XML documents. The Atom format is formally specified in RFC 4287 as an XML vocabulary with its own namespace.
XML isn’t a legacy technology that survived by luck. It’s the document format standard that most of the structured information on the web is actually built on. JSON gets more attention. XML does more work quietly.
FAQ on XML
What does XML stand for?
XML stands for Extensible Markup Language. It is a W3C standard published in 1998 for storing and transporting structured data in a plain text, human-readable, and machine-readable format that works across any platform or system.
What is XML used for?
XML is used for data interchange between systems, configuration files, web services via SOAP, document formats like .docx and .xlsx, RSS feeds, XML sitemaps, and industry standards including FHIR in healthcare and XBRL in financial reporting.
What is the difference between XML and HTML?
HTML defines how content looks in a browser using fixed tags. XML defines data structure using user-defined tags. HTML tolerates unclosed tags. XML does not. One handles presentation, the other handles data.
Is XML still used in 2025?
Yes. Over 35% of organizations still rely on XML for data exchange in finance and healthcare. Every .docx file, every SVG graphic, and every XML sitemap Google crawls is XML. It is embedded in infrastructure most developers touch daily.
What is the difference between XML and JSON?
JSON is 20-30% smaller, parses natively in JavaScript, and dominates REST APIs. XML supports stronger schema validation via XSD, handles metadata and namespaces better, and remains required in regulated industries and legacy enterprise systems.
What is an XML schema?
An XML Schema Definition (XSD) defines the structure, element names, data types, and rules an XML document must follow. XSD supports 40+ built-in data types and replaced DTD as the preferred validation method for enterprise XML documents.
What is XML parsing?
XML parsing is how software reads and processes an XML document. The 3 main models are DOM (loads full tree into memory), SAX (sequential event-driven streaming), and StAX (developer-controlled streaming). Each trades memory use against access flexibility.
What is XSLT?
XSLT (Extensible Stylesheet Language Transformations) converts XML documents into other formats, including HTML, PDF, or different XML structures. It uses template-matching rules rather than procedural loops. XSLT 3.0 added streaming for large-file processing via Saxon.
What is an XXE vulnerability?
An XXE (XML External Entity) attack exploits XML parsers that resolve external entity references. Attackers submit crafted XML to read server files or trigger SSRF. Fix it by disabling external entity processing in the parser configuration before accepting any XML input.
What tools are used to work with XML?
Common tools include Altova XMLSpy (used by 91% of Fortune 500), Oxygen XML Editor, Saxon for XSLT processing, lxml for Python, JAXB for Java, and xmllint for command-line validation. MarkLogic and BaseX handle XML database storage and XQuery.
Conclusion
XML is far more than a legacy format. It is the document encoding standard behind OOXML, XBRL financial reporting, HL7 clinical records, and every SVG file rendered in a browser today.
Understanding XML document structure, XSD validation, and the difference between DOM and SAX parsing gives you real leverage when working with enterprise data interchange or regulated systems.
The XML vs JSON debate misses the point. Most production environments use both.
Security matters too. XXE injection and the Billion Laughs attack are live threats in any system that accepts XML input without proper parser configuration.
XML has been running quietly for 27 years. It is not going anywhere soon.
