The Complete Guide to Scrapy Selectors: Extracting Data with CSS and XPath

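Selector basics

In a Scrapy spider, the Selector is what locates the data you want inside a page. It supports two query languages, CSS selectors and XPath expressions, which together cover almost every need for locating nodes in HTML and XML documents.

How does a Selector work?

The process breaks down into five steps:

Receive the HTML or XML content
Parse it into an element tree (Scrapy's selectors are built on the parsel library, which uses lxml)
Match your selector against the nodes of that tree
Collect the matched nodes
Return the result (text, attribute values, or the nodes themselves)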

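Getting a Selector from the Response

In Scrapy, the Response object already carries a Selector, so you rarely need to construct one yourself. There are two common ways to use it:

def relationship_with_response(response):
    """
    Show how Response and Selector relate.
    """
    # Option 1 (recommended): call response.css() or response.xpath() directly
    title1 = response.css('h1::text').get()
    title2 = response.xpath('//h1/text()').get()

    # Option 2: go through response.selector
    selector = response.selector
    title3 = selector.css('h1::text').get()
    title4 = selector.xpath('//h1/text()').get()

    # Both options return the same results
    return {
        'title': title1,
        'equivalent': title1 == title3 and title2 == title4
    }

Prefer response.css() and response.xpath(); they are shortcuts for the same Selector and are shorter to write.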

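CSS selectors in detail

CSS selectors will feel familiar to anyone with front-end experience, and they work just as well in Scrapy. Their strength is a compact, readable syntax.

Common CSS selector types

1. Basic selectors

Selector type	Example	Description
Element selector	`div`	Every `<div>` element
Class selector	`.product`	Elements whose class list contains `product`
Multiple classes	`.hot.item`	Elements that have both the `hot` and `item` classes
ID selector	`#main`	The element whose id is `main`
Attribute selector	`[href]`	Every element that has an href attribute
Attribute value match	`[href="https://example.com"]`	href is exactly that URL
Attribute substring match	`[class*="product"]`	The class attribute value contains the substring `product`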

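2. Combinators

Relationship	Syntax	Description
Descendant	`div p`	p elements at any depth inside a div
Child	`div > p`	p elements that are direct children of a div
Adjacent sibling	`h1 + p`	A p that immediately follows an h1 at the same level
General sibling	`h1 ~ p`	Every p at the same level that comes after an h1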

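3. Pseudo-class selectors

Pseudo-classes filter elements by their position among siblings or by what they do not match:

# .product elements that are the first child of their parent
response.css('.product:first-child .name::text').get()
# .product elements that are the last child of their parent
response.css('.product:last-child .name::text').get()
# .product elements that are the nth child of their parent (here the 3rd)
response.css('.product:nth-child(3) .name::text').get()
# Exclude elements that carry a given class
response.css('.product:not(.sold-out) .name::text').getall()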

CSS selectors in practice

def css_selectors_practical(response):
    """
    Practical CSS selector usage.
    """
    results = {}

    # Text of all h1, h2 and h3 headings; ::text must be repeated
    # on every branch of a comma-separated selector group
    results['titles'] = response.css('h1::text, h2::text, h3::text').getall()

    # href attribute of every link
    results['links'] = response.css('a::attr(href)').getall()

    # External links only (href starting with http:// or https://)
    results['external_links'] = response.css(
        'a[href^="http://"]::attr(href), a[href^="https://"]::attr(href)'
    ).getall()

    # Name of the first product
    results['first_product'] = response.css(
        '.product:first-child .name::text'
    ).get()

    return results

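XPath selectors in detail

XPath was designed as a query language for XML, and on complex HTML it is more expressive than CSS, especially when you need to filter by text content, attribute values, or relationships between nodes.

XPath basic syntax

Syntax	Meaning
`/`	Start from the root node, or step to a direct child
`//`	Match at any depth
`.`	The current node
`..`	The parent node
`@`	An attribute
`text()`	The text nodes directly inside a node

A few common paths:

# Every div in the document
//div

# A div that is the document's root element (in HTML the root is <html>, so this rarely matches)
/div

# div whose class attribute is exactly "product"
//div[@class="product"]

# Every div that is the first div child of its parent
//div[1]

# The first div in the whole document
(//div)[1]

# Every div that is the last div child of its parent
//div[last()]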

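XPath axes

Axes let you move between nodes in any direction:

Axis	Selects
`parent::*`	The parent element
`child::*`	Child elements
`ancestor::*`	All ancestors
`descendant::*`	All descendants
`following-sibling::*`	Siblings that come after the node
`preceding-sibling::*`	Siblings that come before the node

In practice:

# The parent of each h2
response.xpath('//h2/parent::*')

# The first sibling p after each h2
response.xpath('//h2/following-sibling::p[1]/text()')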

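Common XPath functions

# Attribute value contains a substring
//div[contains(@class, "product")]

# Attribute value starts with a string
//a[starts-with(@href, "http://")]

# Trim and collapse whitespace (applied to the first matching text node)
normalize-space(//h1/text())

# A div that is the second div child of its parent
//div[position()=2]

# Count matching nodes
count(//div[@class="item"])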

XPath in practice

def xpath_selectors_practical(response):
    """
    Practical XPath usage.
    """
    results = {}

    # Merge the text of several heading levels
    results['headings'] = response.xpath(
        '//h1/text() | //h2/text() | //h3/text()'
    ).getall()

    # All external links
    results['external_links'] = response.xpath(
        '//a[starts-with(@href, "http://") or starts-with(@href, "https://")]/@href'
    ).getall()

    # Text of the first paragraph after each h2
    results['next_p'] = response.xpath(
        '//h2/following-sibling::p[1]/text()'
    ).getall()

    # Names of products that contain a discount badge
    # (@class="..." is an exact match on the whole attribute value)
    results['discount_products'] = response.xpath(
        '//div[@class="product" and .//span[@class="discount"]]//h3/text()'
    ).getall()

    return results

The get and getall methods

The two methods you will use most when extracting data are get() and getall(), and they behave quite differently.

The get() method

get() returns only the first match. If nothing matches it returns None, unless you pass a default value.

def get_method_example(response):
    """
    get() example.
    """
    # Text of the first h1; may be None
    title = response.css('h1::text').get()

    # First link; returns a custom string when nothing matches
    first_link = response.css('a::attr(href)').get(default='No link found')

    return {
        'title': title,
        'first_link': first_link
    }

The getall() method

getall() returns a list of all matches. If nothing matches, it returns an empty list [].

def getall_method_example(response):
    """
    getall() example.
    """
    # href of every link
    all_links = response.css('a::attr(href)').getall()

    # Returns [] when nothing matches
    no_match = response.css('.nonexistent::text').getall()

    return {
        'all_links': all_links,
        'no_match': no_match
    }

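Performance comparison

Both calls run the same query: response.css() and response.xpath() evaluate the selector against the whole document and return every match as a SelectorList. The difference lies in serialization. get() converts only the first match to a string, while getall() converts all of them. get() is therefore usually somewhat faster, and the gap grows with the number of matches.

import time

def performance_test(response):
    """
    Compare get() and getall() timings.
    """
    # Time get(); perf_counter() is the clock meant for measuring intervals
    start = time.perf_counter()
    for _ in range(1000):
        response.css('div.item h2::text').get()
    get_time = time.perf_counter() - start

    # Time getall()
    start = time.perf_counter()
    for _ in range(1000):
        response.css('div.item h2::text').getall()
    getall_time = time.perf_counter() - start

    return {
        'get_time': get_time,
        'getall_time': getall_time,
        'get_is_faster': get_time < getall_time
    }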

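Advanced selector techniques

Mixing CSS and XPath

You can use CSS to narrow down to a region quickly and then XPath for fine-grained extraction, or the other way round.

def mixed_selectors(response):
    """
    Mix CSS and XPath in one chain.
    """
    results = {}

    # Locate product blocks with CSS, then use a relative XPath for the child
    results['css_then_xpath'] = response.css('.product').xpath('./h2/text()').getall()

    # Locate with XPath, then use CSS for the price
    results['xpath_then_css'] = response.xpath('//div[@class="item"]').css('.price::text').getall()

    return results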

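Nested selectors for lists

For list-style data, the most reliable approach is to select every "row" first and then pull the fields out of each row.

def nested_extraction(response):
    """
    Nested extraction example.
    """
    products = []

    # Select every product container
    for product in response.css('.product'):
        # Each query below is scoped to the current container
        item = {
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get(),
            'url': product.css('a::attr(href)').get()
        }
        products.append(item)

    return products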

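A robust selector strategy

Page layouts change often. Keeping several fallback selectors for each field improves a spider's chances of surviving a redesign.

def robust_extraction(response):
    """
    Extraction that tolerates layout changes.
    """
    def extract_with_fallbacks(selectors):
        """Try each selector in order and return the first non-empty result."""
        for sel in selectors:
            try:
                if sel.startswith('xpath:'):
                    # Entries prefixed with "xpath:" are XPath expressions
                    result = response.xpath(sel[len('xpath:'):]).get()
                else:
                    result = response.css(sel).get()

                if result and result.strip():
                    return result.strip()
            except Exception:
                # A malformed selector should not abort the whole chain
                continue
        return None

    # Several candidate paths for the title, most specific first
    title_selectors = [
        'h1.product-title::text',
        'h1::text',
        'title::text',
        'xpath://h1/text()',
        'xpath://title/text()'
    ]

    return {
        'title': extract_with_fallbacks(title_selectors)
    }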

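Performance optimization

More specific selectors are usually faster

A broad selector such as * makes the engine test every element in the document. Where you can, narrow the search with tag names, class names, and other constraints.

def optimized_selectors(response):
    """
    Prefer specific selectors.
    """
    # Recommended: a specific selector
    good = response.css('div.product.highlighted .name::text').get()

    # Not recommended: too broad
    # bad = response.css('*[class*="product"] *::text').get()

    return good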

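Batch processing to reduce tree traversal

Select the parent containers once, then extract the fields inside each container. Each nested query searches only that container's subtree, and the fields of one record stay together.

def batch_processing(response):
    """
    Batch processing example.
    """
    # Recommended: select all product divs once, then extract per product
    products = response.css('div.product')
    data = []
    for product in products:
        data.append({
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get()
        })

    # Not recommended: two separate full-document queries, whose results
    # no longer line up if any product lacks a name or a price
    # names = response.css('div.product .name::text').getall()
    # prices = response.css('div.product .price::text').getall()

    return data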

Practical scenarios

Extracting e-commerce product data

Below is a configurable extractor that tries several fallback selectors per field and handles the specification table separately.

class ProductExtractor:
    """Extractor for e-commerce product data."""

    def __init__(self):
        # Several candidate selectors per field, tried in order
        self.selectors = {
            'name': ['h1.product-title::text', 'h1::text'],
            'price': ['.price::text', '.current-price::text'],
            'description': ['.product-detail::text', '.description::text'],
            # Note: only the first matching image URL is returned,
            # because _extract_with_fallbacks() uses get()
            'images': ['.gallery img::attr(src)', '.product-image::attr(src)']
        }

    def extract(self, response):
        """Extract the product data."""
        product = {}

        for field, sels in self.selectors.items():
            product[field] = self._extract_with_fallbacks(response, sels)

        # Special case: the specification table
        product['specs'] = self._extract_specs(response)

        return product

    def _extract_with_fallbacks(self, response, selectors):
        """Try each selector and return the first non-empty result."""
        for sel in selectors:
            result = response.css(sel).get()
            if result and result.strip():
                return result.strip()
        return None

    def _extract_specs(self, response):
        """Extract the specification table as a dict."""
        specs = {}
        for row in response.css('.spec-table tr'):
            # First cell is the name, last cell is the value
            key = row.css('td:first-child::text').get()
            value = row.css('td:last-child::text').get()
            if key and value:
                specs[key.strip()] = value.strip()
        return specs

Extracting news article content

News pages vary widely in structure. A "container plus fallback" strategy adapts to most sites.

class NewsExtractor:
    """Extractor for news article content."""

    def extract(self, response):
        """Extract the article fields."""
        return {
            'title': self._extract_title(response),
            'author': self._extract_author(response),
            'date': self._extract_date(response),
            'content': self._extract_content(response),
            'tags': self._extract_tags(response)
        }

    def _extract_title(self, response):
        selectors = ['h1.article-title::text', 'h1::text', 'title::text']
        return self._try_selectors(response, selectors)

    def _extract_author(self, response):
        selectors = ['.author::text', '.byline::text']
        author = self._try_selectors(response, selectors)
        if author:
            # Drop common byline prefixes
            author = author.replace('作者：', '').replace('By ', '').strip()
        return author

    def _extract_date(self, response):
        selectors = ['.publish-date::text', 'time::text']
        return self._try_selectors(response, selectors)

    def _extract_content(self, response):
        """Extract the body text."""
        containers = ['.article-content', '.content', '.post-content']
        for container in containers:
            paragraphs = response.css(f'{container} p::text').getall()
            # Accept a container only if it yields more than two text fragments
            if len(paragraphs) > 2:
                return '\n'.join(p.strip() for p in paragraphs if p.strip())
        return None

    def _extract_tags(self, response):
        """Extract tags and remove duplicates."""
        tags = response.css('.tag::text, .tags a::text').getall()
        # dict.fromkeys() removes duplicates while keeping first-seen order
        return list(dict.fromkeys(tag.strip() for tag in tags if tag.strip()))

    def _try_selectors(self, response, selectors):
        """Try each selector in order and return the first non-empty result."""
        for sel in selectors:
            result = response.css(sel).get()
            if result and result.strip():
                return result.strip()
        return None

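Common problems and solutions

Problem 1: extracted text is surrounded by whitespace

Option 1: use Python's strip()
Option 2: use XPath's normalize-space()

# Clean up in Python
text = response.css('h1::text').get()
clean_text = text.strip() if text else ''

# Do it in one step with XPath (returns '' when there is no match)
clean_text = response.xpath('normalize-space(//h1/text())').get()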

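Problem 2: you need content with its HTML tags

In some cases you want an HTML fragment rather than plain text.

# Get the element's HTML, including its own tag (outer HTML)
html_content = response.css('.content').get()

# Conversely, get plain text only, with all tags removed
plain_text = response.xpath('string(//div[@class="content"])').get()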

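Problem 3: the selector finds nothing at all

There are three likely causes:

Dynamic rendering: the data is loaded by JavaScript after the page loads. View the page source to check whether it is present in the raw HTML.
Wrong selector: debug it interactively in the Scrapy shell.
Changed page structure: review and maintain your selector list regularly.

For dynamic content, switch to a browser-based tool (such as Selenium or Playwright) or call the underlying API directly.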

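Best practices

Principles for writing selectors

Be as specific as you can: avoid * and overly broad paths.
Keep several fallback selectors for key fields, in case the site is redesigned.
Use CSS for simple cases and XPath for complex logic.
Manage selectors in one place: put them in a config file or at the top of a class so they are easy to change.

Error handling strategy

Always assume an extraction result may be None.
Make good use of the default parameter of get().
Record how often and where extraction fails, to make troubleshooting easier.
Implement a fallback path: when the primary selector fails, switch to a backup selector automatically.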

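💡 Key takeaway: Selectors are the foundation of data extraction in Scrapy. Once you are fluent in both CSS and XPath, your spiders become more accurate and more stable, and well-chosen fallback selectors greatly reduce maintenance work.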
