Modern HTML parsing tutorial

Whether you are building a web crawler, a search-engine indexer, or a web data analysis pipeline, HTML parsing is a core step. It turns messy HTML markup into structured information so that programs can actually "read" web pages. This tutorial compares several mainstream Python HTML parsing options and then works through a hands-on example.

1. Comparison of HTML parsing methods

The Python ecosystem offers several HTML parsing tools. We compare them one by one in terms of ease of use, performance, and functionality.

1.1 Traditional solution: built-in HTMLParser

The Python standard library ships with the html.parser module. It needs no installation and handles basic tag parsing.

from html.parser import HTMLParser

class BasicTagPrinter(HTMLParser):
    # Called for each opening tag
    def handle_starttag(self, tag, attrs):
        print(f"🔖 Start tag: <{tag}>")
        for attr_name, attr_val in attrs:
            print(f"   📎 Attribute: {attr_name} = {attr_val}")

    # Called for text between tags
    def handle_data(self, data):
        stripped_data = data.strip()
        if stripped_data:
            print(f"📝 Text: {stripped_data}")

# Try it out
parser = BasicTagPrinter()
parser.feed('<div class="content">Hello <b>World</b></div>')

:::tip Features of the built-in HTMLParser
Advantages:

  • Part of the standard library, zero dependencies
  • Lightweight, suitable for very simple scenarios

Shortcomings:

  • It neither builds a document tree nor repairs malformed markup, so non-standard HTML can produce unexpected event sequences
  • The API is low-level and requires tracking tag state by hand, which slows development
:::
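To make the "tracking tag state by hand" point concrete, here is a minimal sketch of a link collector built on html.parser. The class name `LinkCollector` and its attributes are invented for this illustration; only `HTMLParser` and its `handle_*` callbacks come from the standard library.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs by remembering whether we are inside an <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []            # finished (href, text) pairs
        self._current_href = None  # href of the <a> being read, or None outside a link
        self._text_parts = []      # text fragments seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text while a link with an href is open
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None

collector = LinkCollector()
collector.feed('<p>See <a href="https://example.com">the <b>example</b> site</a>.</p>')
print(collector.links)  # [('https://example.com', 'the example site')]
```

Even this small task needs three callbacks and two pieces of state. In this sketch, anchors without an href are simply skipped.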

1.2 Modern choice: BeautifulSoup

BeautifulSoup is one of the most popular Python HTML parsing libraries. It hides the low-level details, offers a friendly API with jQuery-style CSS selectors, and tolerates non-standard HTML, for example by closing tags that were left open.

Install

pip install beautifulsoup4

Basic usage

from bs4 import BeautifulSoup

# Sample HTML document
sample_html = """
<html>
 <head><title>My Sample Page</title></head>
 <body>
  <div class="post-content">
   <h1>This is a test article</h1>
   <p>First paragraph of text</p>
   <a href="https://example.com">Visit the example site</a>
  </div>
 </body>
</html>
"""

# Create the soup (Python's built-in html.parser is the backend here)
soup = BeautifulSoup(sample_html, 'html.parser')

# Quick extraction
print(soup.title.text)              # Output: My Sample Page
print(soup.find('a')['href'])      # Output: https://example.com
print(soup.select_one('.post-content h1').text)  # CSS selector; output: This is a test article

1.3 High-performance choice: lxml

If you need to process large volumes of HTML or want maximum performance, use lxml as the parsing backend for BeautifulSoup. It is generally faster than html.parser and is also lenient with malformed markup.

from bs4 import BeautifulSoup

# Just change the second argument to 'lxml' (run pip install lxml first)
soup = BeautifulSoup(sample_html, 'lxml')
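Since lxml is an optional dependency, a script that should run on machines without it can fall back to the built-in backend. The sketch below assumes a helper named `make_soup`, which is hypothetical; `FeatureNotFound` is the exception BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred="lxml"):
    """Parse with the preferred backend, falling back to html.parser if it is missing."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # lxml is not installed: use the standard-library backend instead
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")
print(soup.title.text)
print(soup.p.text)
```

The two backends can build slightly different trees from broken markup, so it is worth testing extraction code against whichever backend will run in production.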

2. Hands-on: scraping events from the Python website

We use BeautifulSoup for a small task: fetch the list of recent events from the official Python website and walk through the parsing process.

import requests
from bs4 import BeautifulSoup

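# Target URL
target_url = "https://www.python.org/events/python-events/"

# Send the request (a browser-like User-Agent reduces the chance of being blocked)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get(target_url, headers=headers, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors
response.encoding = response.apparent_encoding  # Use the encoding detected from the body

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all event items. The selectors assume the page keeps its events in
# <ul class="list-recent-events"> with one <li> per event; check the markup in
# your browser's developer tools if nothing is found.
event_items = soup.select('ul.list-recent-events li')

print(f"📅 Found {len(event_items)} recent events:\n")
for idx, event in enumerate(event_items, 1):
    # Extract the event details
    event_name = event.find('h3').text.strip()
    event_date = event.find('time')['datetime']
    event_location = event.find('span', class_='event-location').text.strip()

    print(f"[Event {idx}] {event_name}")
    print(f"🕐 Date: {event_date}")
    print(f"📍 Location: {event_location}\n")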
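Chained calls such as `event.find('h3').text` raise AttributeError as soon as one element is missing, which happens whenever a site changes its markup. The following offline sketch shows one defensive variant. The fixture HTML, the default selector, and the helpers `text_or_default` and `parse_events` are all invented for illustration and should be adapted to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fixture: the second item deliberately lacks date and location
FIXTURE = """
<ul class="list-recent-events">
  <li>
    <h3>PyCon Example</h3>
    <time datetime="2030-05-01T00:00:00+00:00">01 May</time>
    <span class="event-location">Example City</span>
  </li>
  <li>
    <h3>Meetup Without Details</h3>
  </li>
</ul>
"""

def text_or_default(node, default="unknown"):
    """Return the stripped text of a tag, or a default when find() returned None."""
    return node.get_text(strip=True) if node is not None else default

def parse_events(html, selector="ul.list-recent-events li"):
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for item in soup.select(selector):
        time_tag = item.find("time")
        events.append({
            "name": text_or_default(item.find("h3")),
            "date": time_tag.get("datetime", "unknown") if time_tag is not None else "unknown",
            "location": text_or_default(item.find("span", class_="event-location")),
        })
    return events

events = parse_events(FIXTURE)
for event in events:
    print(event)
```

Parsing a saved fixture like this also lets you test the extraction logic without hitting the live site on every run.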

3. Solutions to common problems

3.1 🚀 Handling content loaded dynamically by JavaScript

Many modern pages render their content with JavaScript, so the raw HTML fetched with requests may not contain the target data. In that case, render the page first with requests-html or selenium. Note that the first call to render() in requests-html downloads a Chromium build, so it can take a while.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get(target_url)  # target_url as defined in section 2
resp.html.render()  # Executes the page's JavaScript

# Parse the rendered HTML
soup = BeautifulSoup(resp.html.html, 'html.parser')

3.2 📝 Fixing garbled characters

Websites use different character encodings. A practical way to avoid garbled text is to let requests detect the encoding from the response body through apparent_encoding instead of hard-coding one.

response = requests.get(target_url)
response.encoding = response.apparent_encoding  # Instead of hard-coding utf-8/gbk
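apparent_encoding is a statistical guess and can be wrong on short or mixed-language pages. When that happens, one fallback is to decode the raw bytes (`response.content` in requests) yourself. The sketch below is one possible approach: the helper `decode_body` and its candidate list are assumptions chosen for illustration, and the sample bytes are created locally so the example runs offline.

```python
def decode_body(raw, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"

# Simulate a GBK-encoded response body
gbk_bytes = "中文网页".encode("gbk")
text, used = decode_body(gbk_bytes)
print(text, used)
```

Strict UTF-8 goes first because it rejects most byte sequences from other encodings, while permissive legacy encodings would accept almost anything and hide the error.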

3.3 🔐 Handling pages that require login

Use requests.Session() to keep the session: log in first, then request the protected page. The session object stores the cookies set at login and sends them with later requests.

session = requests.Session()

# Submit the login form first
login_payload = {'username': 'your_name', 'password': 'your_pass'}
session.post('https://example.com/login', data=login_payload)

# Then request the page that requires login
protected_resp = session.get('https://example.com/protected')

4. Crawler best practices

Please adhere to the following principles when crawling data to protect yourself and reduce pressure on the target site.

  1. Comply with robots.txt: visit <target site>/robots.txt first to see which paths may be crawled
  2. Control request frequency: add a reasonable delay between requests instead of flooding the site
    import time
    time.sleep(1)  # Wait at least 1 second between requests
  3. Set a User-Agent: as in the hands-on code above, send a browser-like User-Agent header
  4. Add exception handling: make the crawler more robust
    try:
        response = requests.get(target_url, timeout=5)
        response.raise_for_status()  # Raises an exception for HTTP error status codes
    except requests.exceptions.RequestException as e:
        print(f"⚠️ Request failed: {e}")
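The robots.txt check in the first principle can be automated with the standard library's urllib.robotparser. The sketch below parses an inline rule set so that it runs offline. The rules, the URLs, and the agent name `demo-crawler` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

AGENT = "demo-crawler"
rules = build_rules(ROBOTS_TXT)

for url in ("https://example.com/events/", "https://example.com/private/data.html"):
    print(url, "allowed" if rules.can_fetch(AGENT, url) else "blocked")

# Honor Crawl-delay when the site declares one, otherwise default to 1 second
delay = rules.crawl_delay(AGENT) or 1
print(f"Wait {delay}s between requests")
```

Against a live site, `set_url()` followed by `read()` on the same class fetches and parses the file, and the resulting `delay` can be passed to `time.sleep()` between requests.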

5. Summary

  • Simple one-off tasks: Python's built-in html.parser is enough
  • Production or complex scenarios: prefer BeautifulSoup with the lxml backend, which balances performance and ease of use
  • Dynamic pages: combine it with requests-html or selenium

Together, these tools cover the large majority of Python HTML parsing needs.

6. Further reading