Python XML processing tutorial

Although JSON now dominates front-end and back-end interactions on the Web due to its lightweight, easy-to-read and write features, XML’s structural constraints, namespaces, mixed content and other capabilities make it still irreplaceable and is commonly used in enterprise applications, configuration files, industry data exchange and other scenarios.

This article will help you quickly get started with Python's built-in XML parsing and generation tools, and demonstrate the complete process using a real weather interface.


1. XML you can see

You must have come into contact with the following use cases, they are all important "bases" of XML:

  • Androidactivity_layout.xmllayout file
  • RSS 2.0/Atom news feed
  • Word .docx、Excel .xlsxThe contents of the underlying compressed package
  • SOAP interface request/response for banking, insurance and other industries
  • Jenkins Pipeline configuration (commonly used for old versions or complex requirements) -Maven'spom.xml, Web applicationweb.xml

These scenarios either require strict format validation or namespace isolation, which is what XML excels at.


2. Python built-in parser comparison

The Python standard library provides two core parsing solutions. There is no absolute good or bad, only scene adaptation.

DOM(Document Object Model)

"Load the entire document tree into memory and change it as you want"

✅ ADVANTAGES

  • Supports random access to any node, and it is very convenient to jump between father, son, and sibling nodes.
  • You can directly modify, add, and delete XML structures
  • Python built-inxml.dom.minidomIt is a lightweight DOM implementation, suitable for novices

❌ Disadvantages

  • Memory consumption is directly proportional to the size of the XML file. Large files (more than 100 MB) may directly burst the memory.
  • Parsing is slower than event-driven solutions

SAX(Simple API for XML)

"Parse while reading, trigger events when encountering elements, and never look back."

✅ ADVANTAGES

  • Extremely low memory usage (only saves the current context), can handle GB-level files
  • Extremely fast parsing speed, implemented in C languagexml.parsers.expatExcellent performance

❌ Disadvantages

  • Only one-way sequential reading can be done, and previous nodes cannot be traced back or modified.
  • You need to write your own event processing logic (such as tracking the current label path)

📌 Practical priority rule: Use DOM for small files and need to modify the structure; for large files and read-only parsing, SAX is preferred. This article focuses on high-performance and versatile SAX, and also briefly introduces the usage of DOM.


3. Get started with SAX quickly

The core idea of ​​SAX: Register three hook functions → Feed XML data → The parser automatically triggers the hook.

Three hooks correspond to three key events:

  1. Encountered start tag (such as<book>, <title lang="en">
  2. Encountering Plain text/CDATA (text content in tags)
  3. Encountered end tag (such as</book>, </title>

3.1 Basic example: traverse all nodes

from xml.parsers.expat import ParserCreate

# 自定义 SAX 事件处理类
class SimpleSaxHandler:
    def __init__(self):
        # 用于追踪当前标签路径(复杂解析时必备)
        self.tag_stack = []

    def start_element(self, tag_name, attrs):
        self.tag_stack.append(tag_name)
        indent = "  " * (len(self.tag_stack) - 1)
        print(f"{indent}<{tag_name}", end="")
        # 打印属性
        for k, v in attrs.items():
            print(f' {k}="{v}"', end="")
        print(">")

    def char_data(self, raw_text):
        text = raw_text.strip()
        if text:
            indent = "  " * len(self.tag_stack)
            print(f"{indent}{text}")

    def end_element(self, tag_name):
        indent = "  " * (len(self.tag_stack) - 1)
        print(f"{indent}</{tag_name}>")
        self.tag_stack.pop()

# 测试 XML
demo_xml = """<?xml version="1.0" encoding="UTF-8"?>
<tech_blog>
    <title>Python XML 入门</title>
    <tags>
        <tag lang="Python">XML</tag>
        <tag lang="Python">SAX</tag>
    </tags>
</tech_blog>
"""

handler = SimpleSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(demo_xml)

After running it you will see indented structured output that looks almost the same as the original XML.

3.2 Advanced techniques: processing segmented text

⚠️ Important pitfalls: SAXCharacterDataHandlerThe complete text within the tag will not be returned in one go. Multiple callbacks may be triggered if the text contains newlines, special characters, or if the parser's internal buffer is full.

Solution: Temporarily cache the text fragments in memory, and then splice them when the end tag is used.

class FullTextSaxHandler:
    def __init__(self):
        self.tag_stack = []
        self.text_parts = []

    def start_element(self, tag_name, attrs):
        self.tag_stack.append(tag_name)
        # 每次进入新标签,清空上一个标签的文本片段
        self.text_parts = []

    def char_data(self, raw_text):
        # 直接追加原始片段,不 strip,避免丢失有意保留的空白
        self.text_parts.append(raw_text)

    def end_element(self, tag_name):
        full_text = "".join(self.text_parts).strip()
        if full_text:
            indent = "  " * (len(self.tag_stack) - 1)
            print(f"{indent}{tag_name}: {full_text}")
        self.tag_stack.pop()

This way, even if the text is split, what you end up with is the complete label content.


4. How to generate XML?

The simplest way to generate XML is string concatenation, but you must pay attention to escaping special characters (such as&&amp;<&lt;), otherwise the generated document may be illegal.

4.1 Simple scenario: string concatenation + escaping

Using the standard libraryxml.sax.saxutils.escape()Basic escaping can be handled automatically:

from xml.sax.saxutils import escape

def build_article(title: str, content: str) -> str:
    xml_parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<article>',
        f'  <title>{escape(title)}</title>',
        f'  <content>{escape(content)}</content>',
        '</article>'
    ]
    return "\n".join(xml_parts)

# 测试带有“危险”字符的内容
print(build_article("Python & Go 对比", "1+1 < 3, Go > Python 语法糖?"))

4.2 Complex scenes: using ElementTree

When the XML structure is multi-layered and has many attributes, handwritten strings are prone to errors. It is recommended to use the standard libraryxml.etree.ElementTree

import xml.etree.ElementTree as ET

def build_book_catalog():
    # 创建根元素
    root = ET.Element("catalog")

    # 第一本书(带属性)
    book1 = ET.SubElement(root, "book", id="bk001", price="49.9")
    ET.SubElement(book1, "author").text = "Gambardella, Matthew"
    ET.SubElement(book1, "title").text = "XML Developer's Guide"
    ET.SubElement(book1, "genre").text = "Computer"

    # 第二本书
    book2 = ET.SubElement(root, "book", id="bk002", price="39.5")
    ET.SubElement(book2, "author").text = "Ralls, Kim"
    ET.SubElement(book2, "title").text = "Midnight Rain"
    ET.SubElement(book2, "genre").text = "Fantasy"

    # 写入文件
    tree = ET.ElementTree(root)
    tree.write("books.xml", encoding="utf-8", xml_declaration=True)

    # 也可以直接返回字符串
    return ET.tostring(root, encoding="unicode", xml_declaration=True)

print(build_book_catalog())

ElementTreeIt not only supports DOM-style random access, but also provides a convenient output interface, which is very suitable as a "Swiss Army Knife" in Python.


5. Practical combat: Parsing the real weather XML interface

We use the free WeatherAPI to demonstrate the complete parsing process of SAX. Please enterAPI_KEYReplace it with your own free key (or find another weather interface that provides XML output).

from xml.parsers.expat import ParserCreate
from urllib.request import urlopen
from typing import Dict, Any

class WeatherSaxHandler:
    def __init__(self):
        self.result: Dict[str, Any] = {}
        self.tag_stack = []
        self.text_parts = []

    def start_element(self, tag_name, attrs):
        self.tag_stack.append(tag_name)
        self.text_parts = []
        # 从属性中直接提取城市名称(属性数据不会被分割)
        if tag_name == "location":
            self.result["city"] = attrs.get("name", "未知城市")

    def char_data(self, raw_text):
        self.text_parts.append(raw_text)

    def end_element(self, tag_name):
        full_text = "".join(self.text_parts).strip()
        if full_text:
            tag_path = "/".join(self.tag_stack)  # 利用标签路径精准定位
            if tag_path == "current/temp_c":
                self.result["temperature"] = float(full_text)
            elif tag_path == "current/condition/text":
                self.result["weather_desc"] = full_text
            elif tag_path == "current/wind_kph":
                self.result["wind_speed"] = float(full_text)
        self.tag_stack.pop()

def fetch_beijing_weather() -> Dict[str, Any]:
    # 免费 API 密钥,请替换为你自己的
    API_KEY = "your_actual_key_here"
    API_URL = f"https://api.weatherapi.com/v1/current.xml?key={API_KEY}&q=Beijing&aqi=no"

    try:
        with urlopen(API_URL, timeout=4) as resp:
            xml_data = resp.read().decode("utf-8")

        handler = WeatherSaxHandler()
        parser = ParserCreate()
        parser.StartElementHandler = handler.start_element
        parser.EndElementHandler = handler.end_element
        parser.CharacterDataHandler = handler.char_data
        parser.Parse(xml_data)

        return handler.result
    except Exception as e:
        return {"error": str(e)}

if __name__ == "__main__":
    weather = fetch_beijing_weather()
    if "error" in weather:
        print(f"获取天气失败:{weather['error']}")
    else:
        print("--- 北京实时天气 ---")
        print(f"城市:{weather['city']}")
        print(f"温度:{weather['temperature']}℃")
        print(f"天气:{weather['weather_desc']}")
        print(f"风速:{weather['wind_speed']} km/h")

The whole process is: define event processing rules → obtain original XML from the network → SAX parsing → extract target data, clear and efficient.


6. Avoid pitfalls and best practices

  1. SAX text must be merged: do not trust directlyCharacterDataHandlerThe returned fragment is always assembled on the closing tag.
  2. Special characters must be escaped: Must be used when generating XMLescape()(or ElementTree's built-in escaping), which the parser will automatically restore when parsing.
  3. Set parsing restrictions to prevent attacks: Python 3.8+ has default restrictions on entity expansion, etc., but it is still recommended not to parse very large XML from unknown sources.
  4. High-frequency parsing recommended lxml: If you need XPath, complex namespace, and extremely high speed, you can introduce a third-party librarylxml, which is faster and more powerful than the standard library.
  5. Don’t reinvent the wheel: The libraries for parsing XML are already very mature, so don’t parse them by handwriting string processing yourself.

7. When should you choose XML?

Even with the popularity of JSON, XML is still the best choice in the following scenarios:

  • XSD/DTD structure verification required (such as financial, medical data exchange)
  • Requires namespace to isolate tags with the same name (such as different versions of configuration files)
  • Requires Mixed content (e.g. tags and text interleaved in XHTML)
  • Connect to Legacy SOAP system

If you are building a new project, give priority to YAML for configuration, JSON for front-end and back-end interaction, and Protocol Buffers or MessagePack for high-performance cross-language transmission. And XML is still an unshakable standard in scenarios where you need "rigorous, orderly, and cross-organization exchange."