Selenium/Playwright Integration: Handling JavaScript-Rendered Pages

📂 Stage: Phase 3, Attack and Defense Drills (Middleware and Anti-Scraping)


1. Selenium Integration

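from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit Chrome when the spider closes so no browser process is left behind
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only opted-in requests are rendered; the implicit None lets Scrapy download the rest
        if request.meta.get('use_selenium'):
            self.driver.get(request.url)
            return HtmlResponse(
                url=request.url,
                body=self.driver.page_source.encode('utf-8'),
                encoding='utf-8',
                request=request
            )

    def spider_closed(self, spider):
        self.driver.quit()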

2. Playwright Integration

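from playwright.async_api import async_playwright
from scrapy.http import HtmlResponse

class PlaywrightMiddleware:
    # Awaiting asyncio code in a middleware requires Scrapy to run on the asyncio reactor
    async def process_request(self, request, spider):
        if request.meta.get('use_playwright'):
            # Launching a browser per request is simple but slow
            async with async_playwright() as p:
                browser = await p.chromium.launch()
                try:
                    page = await browser.new_page()
                    await page.goto(request.url)
                    content = await page.content()
                finally:
                    # Close the browser even if navigation fails
                    await browser.close()

                return HtmlResponse(
                    url=request.url,
                    body=content.encode('utf-8'),
                    encoding='utf-8',
                    request=request
                )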
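
Because the Playwright middleware awaits asyncio-based calls, Scrapy has to run on the asyncio reactor for it to work. Some newer Scrapy releases already default to it, while older ones need it set explicitly. A sketch of the relevant settings follows; the module path myproject.middlewares.PlaywrightMiddleware and the priority 543 are assumptions for illustration.

```python
# settings.py (fragment)

# Playwright's async API needs an asyncio event loop,
# so Scrapy must use the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; adjust to where the middleware class lives
    "myproject.middlewares.PlaywrightMiddleware": 543,
}
```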

3. Summary

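Browser automation options:

Selenium: mature and stable
Playwright: fast and modern

When to use each:
- Static pages: plain Scrapy requests, no browser
- Dynamic pages: Selenium or Playwright
- Complex interactions: Playwright

💡 Remember: browser automation is slow, but sometimes it is necessary. Prefer plain static crawling, and fall back to a browser only when nothing else works.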
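
One way to apply the "static first" rule is to check whether the data you expect is already visible in the raw HTML before reaching for a browser. The helper below is a rough heuristic sketch, not part of Scrapy, Selenium, or Playwright: needs_browser and its parameters are hypothetical names, and it only looks for a known piece of text outside script and style tags.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def needs_browser(static_html: str, expected_text: str) -> bool:
    """Return True when expected_text is missing from the visible static HTML."""
    parser = _VisibleText()
    parser.feed(static_html)
    parser.close()
    return expected_text not in " ".join(parser.chunks)
```

If the helper returns False for a sample page fetched without a browser, plain requests are probably enough for that site. If it returns True, the content is likely injected by JavaScript and the use_selenium or use_playwright flag is worth setting.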


🔗 Further Reading