Python Crawler Proxy Configuration Guide (2024 Edition)

The biggest headaches for a crawler are getting its IP blocked within seconds and running into target sites that restrict access by region. Don't worry: this guide walks through proxy configuration for urllib, requests, httpx, and Playwright in one go, covering HTTP/HTTPS (including authentication) and SOCKS5. The code can be used right away.


1. Preparation: 5 Things You Must Know Before You Start

Take a minute to read these 5 points before configuring a proxy; they can save a lot of troubleshooting time.

  1. Know the address format. A proxy is usually written as IP:port. For example, the local tool Clash usually serves its HTTP proxy on 127.0.0.1:7890 and SOCKS5 on 127.0.0.1:7891. Don't mix them up.

  2. Find available proxies.

  • For hands-on testing: try the free section of Kuaidaili (快代理), but expect slow speeds and short-lived IPs.
  • For production: consider paid services such as Abuyun, Zhandaye, or Oxylabs, billed by traffic or by duration. Their stability and success rate are much higher than those of free proxies.

  3. Be clear about the protocol type. Not every HTTP proxy can forward HTTPS traffic. If the target site uses HTTPS, be sure to choose a proxy that supports HTTPS, or go straight to a full-protocol proxy (HTTP/HTTPS/SOCKS5).

  4. Encode special characters in credentials. If the username or password contains characters such as @ or #, they must be URL-encoded (for example, @ is written as %40); otherwise the proxy address will be parsed incorrectly.

  5. Verify right after configuring. Every code example ends with a test request to http://httpbin.org/get; check that the "origin" field of the returned JSON is the proxy IP. Don't skip this step.
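
Point 4 can be sketched with the standard library alone. The helper name build_proxy_url and the sample credentials below are hypothetical, chosen only for illustration; the one real API involved is urllib.parse.quote, which percent-encodes every reserved character when called with safe="".

```python
from urllib.parse import quote


def build_proxy_url(username: str, password: str, host: str, port: int,
                    scheme: str = "http") -> str:
    """Return a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" makes quote() encode every reserved character, including @ : / and #
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


# Made-up credentials that would break the URL if left unencoded
print(build_proxy_url("user@corp", "p#ss:word", "127.0.0.1", 7890))
```

The returned string can be used wherever the later examples expect a proxy address with credentials.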


2. urllib: The Built-in Library Can Use Proxies Too

urllib is the request library that ships with Python. It is rarely used these days, but it still shows up in zero-dependency projects and in legacy code.

Basic HTTP/HTTPS proxy

Use ProxyHandler to create a proxy handler, pass it to build_opener to get a custom opener, and then use that opener in place of the default urlopen.

from urllib.request import ProxyHandler, build_opener

proxy_addr = "127.0.0.1:7890"
# Note: the dictionary keys must be "http" and "https"
proxy_config = {
    "http": f"http://{proxy_addr}",
    "https": f"http://{proxy_addr}"
}

opener = build_opener(ProxyHandler(proxy_config))
with opener.open("http://httpbin.org/get") as resp:
    print(resp.read().decode("utf-8"))
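
As a rough sketch of how the opener and the verification step fit together, the snippet below wraps the opener in a function, installs it so that plain urlopen calls use it too, and pulls the "origin" field out of an httpbin-style response. The names make_proxy_opener and extract_origin are hypothetical, and the response body is a canned sample so the code runs without a live proxy. The comma handling assumes httpbin may list more than one address in "origin".

```python
import json
from urllib.request import ProxyHandler, build_opener, install_opener


def make_proxy_opener(proxy_addr: str):
    """Build an opener that routes HTTP and HTTPS through one HTTP proxy."""
    proxy_url = f"http://{proxy_addr}"
    return build_opener(ProxyHandler({"http": proxy_url, "https": proxy_url}))


def extract_origin(body: str) -> str:
    """Return the first IP in the "origin" field of an httpbin /get response."""
    # If several comma-separated addresses are reported, keep the first one
    return json.loads(body)["origin"].split(",")[0].strip()


opener = make_proxy_opener("127.0.0.1:7890")
# After install_opener(), plain urlopen() calls go through this opener as well
install_opener(opener)

# A canned response body, so the example runs without a live proxy
sample_body = '{"origin": "203.0.113.7", "url": "http://httpbin.org/get"}'
print(extract_origin(sample_body))
```

With a real proxy running, the body would come from opener.open("http://httpbin.org/get").read().decode("utf-8"), and the extracted value should equal the proxy's public IP.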

Proxy with authentication

Simply embed the username and password in the address, in the form username:password@ip:port. Remember to URL-encode special characters.

username = "your_user"
password = "your_pwd"
proxy_addr = f"{username}:{password}@127.0.0.1:7890"
# The rest of the code is the same as above

SOCKS5 proxy

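urllib itself does not support SOCKS, so a third-party library, PySocks, is needed to patch the underlying socket. Note that this replaces socket.socket for the whole process, so every connection opened afterwards goes through the proxy.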

pip install PySocks

import socks
import socket
from urllib.request import urlopen

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 7891)
socket.socket = socks.socksocket

with urlopen("http://httpbin.org/get") as resp:
    print(resp.read().decode("utf-8"))

3. requests: A More Concise Way to Set Proxies

Proxy configuration in requests is much more concise than in urllib: every request method accepts a proxies parameter.

Basic HTTP/HTTPS proxy

import requests

proxy_addr = "127.0.0.1:7890"
proxies = {
    "http": f"http://{proxy_addr}",
    "https": f"http://{proxy_addr}"
}

resp = requests.get("http://httpbin.org/get", proxies=proxies)
print(resp.text)

Proxy with authentication

Likewise, just add the username and password to the address.

proxy_addr = "your_user:your_pwd@127.0.0.1:7890"
# The proxies configuration stays the same
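
One way to keep the credentials out of the source file is to read them from environment variables and build the proxies dict from them. This is only a sketch: the variable names PROXY_USER and PROXY_PASS and the helper proxies_from_env are assumptions made for this example, not something requests defines.

```python
import os
from urllib.parse import quote


def proxies_from_env(host: str = "127.0.0.1", port: int = 7890) -> dict:
    """Build a requests-style proxies dict from PROXY_USER / PROXY_PASS (hypothetical names)."""
    user = os.environ.get("PROXY_USER")
    password = os.environ.get("PROXY_PASS")
    if user and password:
        # Percent-encode so characters such as @ or # survive URL parsing
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    else:
        auth = ""  # fall back to an unauthenticated proxy
    proxy_url = f"http://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Demo values only; in practice the variables are set outside the script
os.environ["PROXY_USER"] = "your_user"
os.environ["PROXY_PASS"] = "p@ss"
proxies = proxies_from_env()
print(proxies["http"])
```

The resulting dict can be passed unchanged as requests.get(url, proxies=proxies).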

SOCKS5 proxy

Install the socks extra for requests and SOCKS5 is supported directly.

pip install requests[socks]

import requests

proxies = {
    "http": "socks5://127.0.0.1:7891",
    "https": "socks5://127.0.0.1:7891"
}

resp = requests.get("http://httpbin.org/get", proxies=proxies)
print(resp.text)
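
One detail worth knowing about the scheme: with socks5:// requests resolves hostnames on the local machine, while socks5h:// hands DNS resolution to the proxy. The helper below is a small sketch of switching between the two; the name socks5_proxies and its remote_dns flag are hypothetical.

```python
def socks5_proxies(host: str, port: int, remote_dns: bool = True) -> dict:
    """Return a requests-style proxies dict for a SOCKS5 proxy (hypothetical helper)."""
    # socks5h:// asks the proxy to resolve hostnames; socks5:// resolves them locally
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


print(socks5_proxies("127.0.0.1", 7891))
```

Remote resolution is usually the safer default when the target domain is not resolvable, or should not be looked up, from the local network.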

4. httpx: An Async Tool for the New Era

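httpx supports HTTP/2 and native async, and is increasingly popular for high-concurrency crawlers. The keys of the proxy configuration use the full prefix (http:// and https://), which makes it easy to assign different proxies to different domains. Note that the proxies argument shown here was deprecated in httpx 0.26 and removed in 0.28; on those versions, pass proxy= for a single proxy or mounts= for per-prefix routing.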

Basic HTTP/HTTPS proxy (synchronous)

import httpx

proxy = "http://127.0.0.1:7890"
proxies = {
    "http://": proxy,
    "https://": proxy
}

with httpx.Client(proxies=proxies) as client:
    resp = client.get("http://httpbin.org/get")
    print(resp.text)
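
Because the keyword argument changed between httpx releases (proxies= on older versions, proxy= from 0.26 on), a script that has to run on both can pick the argument from the installed version. The following is a minimal sketch under that assumption; proxy_kwargs and make_client are hypothetical helper names, and the version parsing only handles plain "major.minor.patch" strings.

```python
def proxy_kwargs(httpx_version: str, proxy_url: str) -> dict:
    """Pick the proxy keyword arguments that match an httpx version string."""
    major, minor = (int(part) for part in httpx_version.split(".")[:2])
    if (major, minor) >= (0, 26):
        # Newer releases take a single proxy URL via proxy=
        return {"proxy": proxy_url}
    # Older releases take a mapping of URL prefixes via proxies=
    return {"proxies": {"http://": proxy_url, "https://": proxy_url}}


def make_client(proxy_url: str):
    """Create an httpx.Client routed through proxy_url on either API generation."""
    import httpx  # imported lazily so proxy_kwargs() works without httpx installed

    return httpx.Client(**proxy_kwargs(httpx.__version__, proxy_url))


print(proxy_kwargs("0.28.1", "http://127.0.0.1:7890"))
print(proxy_kwargs("0.24.0", "http://127.0.0.1:7890"))
```

With httpx installed, make_client("http://127.0.0.1:7890") can be used as a context manager exactly like the client in the example above.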

SOCKS5 proxy (synchronous & asynchronous)

First install the extension library dedicated to SOCKS support:

pip install httpx-socks

Synchronous usage

import httpx
from httpx_socks import SyncProxyTransport

transport = SyncProxyTransport.from_url("socks5://127.0.0.1:7891")
with httpx.Client(transport=transport) as client:
    resp = client.get("http://httpbin.org/get")
    print(resp.text)

Asynchronous usage

import httpx
import asyncio
from httpx_socks import AsyncProxyTransport

async def check():
    transport = AsyncProxyTransport.from_url("socks5://127.0.0.1:7891")
    async with httpx.AsyncClient(transport=transport) as client:
        resp = await client.get("http://httpbin.org/get")
        print(resp.text)

asyncio.run(check())

5. Automation tools (Selenium/Playwright)

When you run into dynamically rendered pages, you need to drive a real browser. Playwright is currently the first choice, with intuitive configuration and excellent support for dynamic content. Selenium is configured along similar lines: pass --proxy-server to the corresponding driver's options via add_argument.
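
For Selenium, the idea can be sketched as follows, assuming Chrome and the selenium package. The helper names proxy_server_argument and make_chrome_driver are hypothetical; ChromeOptions.add_argument and webdriver.Chrome(options=...) are the real Selenium calls. As far as I know, the --proxy-server switch takes no credentials, so this sketch only covers unauthenticated proxies.

```python
def proxy_server_argument(host: str, port: int, scheme: str = "http") -> str:
    """Format the Chrome command-line switch that sets the proxy (hypothetical helper)."""
    return f"--proxy-server={scheme}://{host}:{port}"


def make_chrome_driver(host: str, port: int, scheme: str = "http"):
    """Start Chrome through Selenium with traffic sent via the given proxy."""
    from selenium import webdriver  # imported lazily; requires `pip install selenium`

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_server_argument(host, port, scheme))
    return webdriver.Chrome(options=options)


print(proxy_server_argument("127.0.0.1", 7891, scheme="socks5"))
```

A typical session would call driver = make_chrome_driver("127.0.0.1", 7890), then driver.get("http://httpbin.org/get"), print driver.page_source, and finish with driver.quit().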

Playwright proxy configuration

The proxy information goes directly into the proxy parameter of launch(). HTTP, HTTPS, and SOCKS5 are all supported, and authentication has its own dedicated fields.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "socks5://127.0.0.1:7891",  # http:// and https:// are also supported
            # "username": "your_user",            # uncomment if authentication is required
            # "password": "your_pwd"
        },
        headless=False  # set to False while debugging to show the browser window
    )
    page = browser.new_page()
    page.goto("http://httpbin.org/get")
    print(page.content())
    browser.close()
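
Since Playwright takes the credentials as separate fields rather than inside the address, a proxy URL written in the username:password@ip:port style used earlier has to be split first. Here is a minimal sketch using only the standard library; the helper name playwright_proxy is hypothetical, and it assumes the username and password fields expect the raw, decoded values.

```python
from urllib.parse import unquote, urlsplit


def playwright_proxy(proxy_url: str) -> dict:
    """Split a proxy URL into server/username/password fields (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        # Credentials inside a URL are percent-encoded; decode them for the separate fields
        settings["username"] = unquote(parts.username)
        settings["password"] = unquote(parts.password or "")
    return settings


print(playwright_proxy("http://your_user:p%40ss@127.0.0.1:7890"))
```

The returned dict can be passed as the proxy argument of launch().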

6. Avoid Pitfalls & Best Practices

  • Don't use free proxies in production. Free IPs are short-lived and slow, and target sites identify and block them easily. They are only suitable for temporary testing.

  • Prefer full-protocol proxies. A proxy clearly marked as supporting HTTP/HTTPS/SOCKS5 reduces configuration errors and makes code easier to reuse.

  • Use a proxy pool for high concurrency. Don't hammer a site from a single IP. Open-source projects such as proxy_pool can be used to build automatic IP scheduling.

  • Don't hardcode sensitive information. Don't write the proxy username and password directly in the code; read them from environment variables or a configuration file to protect the account.

  • Rotate and verify regularly. A production script should check the availability of the IPs in the proxy pool on a schedule (for example, every 5 minutes) and remove failed nodes.
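
The rotation and pruning ideas above can be sketched as a tiny in-memory pool. This is an illustration only, not how proxy_pool or any particular project implements it: the ProxyPool class, its method names, and the sample addresses are all hypothetical, and the health check is a stand-in for a real test request through each proxy.

```python
import itertools
from typing import Callable, Iterable, List


class ProxyPool:
    """A tiny in-memory proxy pool: round-robin rotation plus health pruning."""

    def __init__(self, proxies: Iterable[str]):
        self._proxies: List[str] = list(proxies)
        self._cycle = itertools.cycle(list(self._proxies))

    def get(self) -> str:
        """Return the next proxy in round-robin order."""
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        return next(self._cycle)

    def prune(self, is_alive: Callable[[str], bool]) -> List[str]:
        """Drop proxies that fail the is_alive check and return the removed ones."""
        dead = [p for p in self._proxies if not is_alive(p)]
        self._proxies = [p for p in self._proxies if p not in dead]
        # Rebuild the cycle so removed proxies are never handed out again
        self._cycle = itertools.cycle(list(self._proxies))
        return dead


pool = ProxyPool(["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"])
# Stand-in health check; a real one would send a test request through the proxy
removed = pool.prune(lambda proxy: proxy != "10.0.0.2:8000")
print(removed, [pool.get() for _ in range(3)])
```

In a production script, prune would be called on a timer with a check that requests http://httpbin.org/get through each proxy. The sketch is not thread-safe, so concurrent crawlers would need a lock around get and prune.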


7. Complete sample code

Complete, runnable examples for all of the libraries above, including proxy pool management, reading environment variables, and batch verification, are collected in the GitHub repository below. Feel free to use them:

👉 Python3WebSpider/ProxyTest