ImagesPipeline and FilesPipeline: Downloading Media Assets and Linking Them to Items

📂 Stage: Phase 2, Data Flow (Data Processing)


1. ImagesPipeline

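# settings.py
# Enable the pipeline; "myproject" is a placeholder for your project package.
ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs (the default)
IMAGES_STORE = 'images'           # directory where downloaded images are saved

# pipelines.py (ImagesPipeline requires the Pillow package)
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for image_url in item.get('image_urls', []):
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, detail) tuples, one per request.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Failed to download any image")
        item['images'] = image_paths
        return item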
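
The filtering step in item_completed can be tried outside Scrapy. The sketch below assumes the commonly documented shape of results: a list of (success, detail) tuples, where detail is a dict with url, path, and checksum keys on success. The helper name collect_paths and the sample data are hypothetical, and a plain RuntimeError stands in for the Failure object Scrapy passes when a download fails.

```python
def collect_paths(results):
    """Return the storage paths of the downloads that succeeded."""
    return [detail["path"] for ok, detail in results if ok]


# Hypothetical sample data shaped like the `results` argument.
sample_results = [
    (True, {
        "url": "https://example.com/a.jpg",
        "path": "full/aaa.jpg",
        "checksum": "abc123",
    }),
    # Stand-in for the Failure object passed for a failed download.
    (False, RuntimeError("download failed")),
]

paths = collect_paths(sample_results)
print(paths)
```

Because failed entries are skipped, an item whose downloads all failed ends up with an empty list, which is the case the pipeline above turns into a DropItem.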

2. FilesPipeline

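# settings.py must set FILES_STORE and enable this pipeline in ITEM_PIPELINES.
from urllib.parse import urlparse
from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last path segment as the file name, ignoring any query string.
        name = urlparse(request.url).path.split("/")[-1]
        return f'files/{name}'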
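
Naming a file after the last URL segment is fragile: a URL that ends in a slash yields an empty name, and a query string can leak into the name if it is not stripped. As a minimal sketch, the helper below (safe_file_name is a hypothetical name, not part of Scrapy) shows one way to derive a name defensively, falling back to a SHA-1 hash of the URL when no segment is available. The same logic could be called from a file_path override.

```python
import hashlib
from urllib.parse import urlparse


def safe_file_name(url):
    """Build a storage path from a URL, ignoring the query string."""
    # Last segment of the URL path, e.g. "report.pdf".
    name = urlparse(url).path.rsplit("/", 1)[-1]
    if not name:
        # No usable segment (URL ends in "/"): fall back to a hash of the URL.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"files/{name}"


print(safe_file_name("https://example.com/docs/report.pdf?token=abc"))
print(safe_file_name("https://example.com/docs/"))
```

Note that two different URLs ending in the same segment still map to the same path, so a hash-based name is the safer choice when collisions matter.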

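3. Summary

ImagesPipeline: automatically downloads the images listed on an item
FilesPipeline: automatically downloads arbitrary files

Benefits:
- Automatic management: downloads are scheduled and stored for you
- Automatic deduplication: media that was downloaded recently is not fetched again
- Automatic renaming: by default, files are stored under a name derived from a hash of their URL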
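
The renaming behavior can be approximated in a few lines. This is a sketch of the default scheme as commonly described for ImagesPipeline (a SHA-1 hash of the URL, stored under full/ with a .jpg extension), not a copy of Scrapy's implementation; default_image_path is a hypothetical name.

```python
import hashlib


def default_image_path(url):
    """Approximate the hash-based storage path used for downloaded images."""
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{guid}.jpg"


first = default_image_path("https://example.com/a.jpg")
second = default_image_path("https://example.com/b.jpg")
print(first)
print(second)
```

The same URL always maps to the same path, which is what lets the pipeline recognize a file it has already stored.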

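💡 Remember: downloading media through a pipeline is usually much faster than fetching files one by one by hand, because the requests go through Scrapy's asynchronous downloader.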


🔗 Further Reading