A complete guide to modern web crawler proxy technology

Can you build a modern crawler without touching proxies? Most likely it will run into Cloudflare protection within 10 minutes of starting, or its IP will be blacklisted for 7 days. This guide breaks down the core principles of proxies and how to implement them in practice: about two thousand words of practical material plus runnable code to help you understand how to use proxies.


1. Basics of proxy technology

1.1 Core working principle

The proxy server is like a "transit courier" between the client and the target server:

  1. You hand the request (the package) to the courier without revealing your real address.
  2. The courier swaps in its own identity information and sends the request to the target server.
  3. The target server hands the response to the courier, and the courier passes it back to you intact.

Throughout the process, the target server only sees the courier's information (the proxy IP) and knows nothing about your real IP.

graph LR
    A[Client] -->|Hides real IP| B[Proxy server]
    B -->|Forwards via proxy IP| C[Target server]
    C -->|Returns data| B
    B -->|Relays data| A
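
The flow above can be observed directly by asking an echo service which address it sees, once without and once with a proxy. The following is a minimal sketch using only the standard library; the helper names build_opener_for and visible_ip are made up for illustration, and the proxy address in the usage note is a placeholder, not a working proxy.

```python
import json
import urllib.request


def build_opener_for(proxy_url=None):
    """Return a urllib opener that routes HTTP and HTTPS through proxy_url.

    With proxy_url=None an empty mapping is passed, which tells urllib
    to connect directly and ignore any system-wide proxy settings.
    """
    mapping = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(mapping))


def visible_ip(proxy_url=None, timeout=10):
    """Return the IP address that httpbin.org reports for this request."""
    opener = build_opener_for(proxy_url)
    with opener.open("https://httpbin.org/ip", timeout=timeout) as resp:
        return json.load(resp)["origin"]
```

Calling visible_ip() and then visible_ip("http://127.0.0.1:7890") should print two different addresses if the proxy is working: the first is your own IP, the second is the courier's.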

1.2 Why proxies are necessary

  • Get past anti-crawling restrictions: bypass risk control based on per-IP request frequency (for example, a site may allow at most 100 requests per hour from one IP before triggering a CAPTCHA).
  • Get around geo-blocking: access content restricted to specific regions, such as overseas versions of Google Scholar or US Netflix.
  • Ensure data completeness: the same site can show completely different results in different regions (for example, Amazon's regional sites, or Meituan in different cities).
  • Support distributed crawling: when hundreds or thousands of nodes work together, an IP pool spreads the load and keeps any single IP from being blocked.
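
The per-IP frequency limit in the first bullet also tells you roughly how large an IP pool has to be. Here is a small back-of-the-envelope sketch; the function name min_pool_size and the safety_factor parameter are assumptions for illustration, and the 100 requests per hour figure is just the example limit used above.

```python
import math


def min_pool_size(target_per_hour, per_ip_limit_per_hour, safety_factor=0.5):
    """Estimate how many IPs are needed to reach a target request rate.

    Each IP is only used up to safety_factor * per_ip_limit_per_hour,
    leaving headroom below the (assumed) threshold that triggers risk control.
    """
    if per_ip_limit_per_hour <= 0 or not 0 < safety_factor <= 1:
        raise ValueError("limit must be positive and safety_factor in (0, 1]")
    usable_per_ip = per_ip_limit_per_hour * safety_factor
    return math.ceil(target_per_hour / usable_per_ip)


# 10,000 requests/hour against a 100 requests/hour/IP limit,
# using each IP at half its budget:
print(min_pool_size(10_000, 100))  # 200
```

Real limits are rarely published, so treat the result as a starting point and adjust it based on the block rate you actually observe.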

2. Common protocols and anonymity levels

2.1 Comparison of mainstream proxy protocols

| Protocol type | Common ports | Applicable scenarios | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| HTTP/HTTPS | 8080/3128 | Web pages, API requests | Low cost, widely supported | Only supports HTTP(S); may add Via/X-Forwarded-For headers |
| SOCKS4/4a | 1080 | Any TCP traffic | Protocol-independent; SOCKS4a supports server-side DNS resolution (avoids local DNS pollution) | No UDP support; simple authentication only |
| SOCKS5 | 1080 | Full protocol traffic (including UDP, FTP) | Multiple authentication methods, UDP forwarding, flexible DNS configuration, very low performance overhead | Slightly more complicated than SOCKS4 |

Practical suggestion: for crawling web pages, prefer a high-anonymity HTTP proxy; use SOCKS5 for other TCP scenarios.
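
In code, switching between these protocols mostly comes down to the scheme of the proxy URL. The sketch below builds such URLs and the proxies mapping that requests expects; the helper names proxy_url and requests_proxies are hypothetical, and the hosts and credentials are placeholders. In requests, the socks5h scheme asks the proxy to resolve DNS, while socks5 resolves names locally.

```python
from urllib.parse import quote

# Schemes this sketch accepts; "socks5h" means DNS is resolved by the proxy.
ALLOWED_SCHEMES = {"http", "https", "socks5", "socks5h"}


def proxy_url(scheme, host, port, user=None, password=None):
    """Build a proxy URL, percent-encoding credentials so '@' or ':' are safe."""
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    auth = ""
    if user is not None:
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


def requests_proxies(url):
    """Return the mapping expected by the proxies= argument of requests."""
    return {"http": url, "https": url}
```

Usage would look like requests.get(target, proxies=requests_proxies(proxy_url("socks5h", "10.0.0.1", 1080)), timeout=10). SOCKS support in requests is an optional extra, installed with pip install "requests[socks]".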

2.2 Anonymity classification (anti-detection core)

A proxy's anonymity level directly determines how the anti-crawling system judges your traffic. Use the following code to quickly check whether a proxy is exposing your real information:

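import requests

def quick_check_anonymity(proxy):
    """
    Check a proxy's anonymity level using httpbin.
    :param proxy: proxy address, e.g. 'http://ip:port' or 'socks5://ip:port'
    """
    # Plain HTTP on purpose: HTTPS is tunnelled, so a proxy cannot add headers there.
    test_url = "http://httpbin.org/headers"
    try:
        resp = requests.get(
            test_url,
            proxies={"http": proxy, "https": proxy},
            timeout=10
        )
        resp.raise_for_status()
        headers = resp.json()["headers"]

        if headers.get("X-Forwarded-For"):
            # The proxy forwards a client address, normally your real IP.
            return "❌ Transparent proxy: X-Forwarded-For exposes the client IP"
        elif "Via" in headers:
            # The proxy announces itself but does not pass your IP along.
            return "⚠️ Ordinary anonymous proxy: real IP hidden, but proxy use is visible"
        else:
            return "✅ High anonymity: no proxy headers reached the target server"
    except Exception as e:
        return f"💀 Invalid proxy: {e}"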

In short, the three levels differ as follows:

  • Transparent proxy: tells the website outright "I am a proxy, and the client's real IP is xxx", which is practically running naked.
  • Ordinary anonymous proxy: hides your real IP but still leaves proxy fields such as Via (or an X-Forwarded-For that does not contain your address) in the request headers, so it is easy to identify as a proxy.
  • High-anonymity proxy: looks exactly like a normal user, adds no extra headers, and cannot be identified as a proxy from the request headers.
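
The three levels can be told apart more reliably by comparing the headers the target received against your own real IP. The following is a sketch of that decision only, with no network access; the function name classify_anonymity, the list of proxy-related headers, and the returned labels are illustrative assumptions rather than any standard.

```python
import re

# Headers that commonly reveal a proxy (an illustrative, non-exhaustive list).
PROXY_HEADERS = ("via", "x-forwarded-for", "forwarded", "x-real-ip", "proxy-connection")


def classify_anonymity(headers, real_ip):
    """Classify a proxy from the headers the target saw and your real IP.

    Returns "transparent" if real_ip appears in any header value,
    "anonymous" if proxy headers are present without it, else "elite".
    """
    normalized = {k.lower(): str(v) for k, v in headers.items()}
    # Match the IP as a whole token so 1.2.3.4 does not match 11.2.3.45.
    ip_pattern = re.compile(r"(?<![\d.])" + re.escape(real_ip) + r"(?![\d.])")
    if any(ip_pattern.search(value) for value in normalized.values()):
        return "transparent"
    if any(name in normalized for name in PROXY_HEADERS):
        return "anonymous"
    return "elite"
```

Here headers would be the dictionary returned by httpbin.org/headers through the proxy, and real_ip the address returned by a direct request to httpbin.org/ip.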

3. Engineering solution: proxy pool and rotation

3.1 How to choose the proxy type

| Type | IP source | Block rate | Speed | Cost | Best scenario |
| --- | --- | --- | --- | --- | --- |
| Data center proxy | Cloud data centers (AWS, Azure, etc.) | Extremely high | Extremely fast | Extremely low | High-frequency testing, public APIs with very weak anti-crawling |
| Residential proxy | IPs assigned by ISPs to home users | Low | Medium | High | E-commerce data collection, general crawling |
| Mobile proxy | 4G/5G cellular network IPs | Almost 0 | Slow | Extremely high | Social platforms, highly sensitive sites (subject to compliance) |

Most commercial projects use residential proxies as the main force, with a small number of data center proxies for verification and low-risk tasks.

3.2 Minimalist Redis proxy pool (including rotation)

There is no need to bring in a bloated third-party framework. The SimpleProxyPool below relies only on Redis to validate, store, and hand out proxies. It also has score-based ranking and random rotation built in, so you do not keep using the same IP.

import random
import time
from concurrent.futures import ThreadPoolExecutor

import redis
import requests

class SimpleProxyPool:
    def __init__(self, redis_host="localhost", redis_port=6379):
        self.redis = redis.Redis(host=redis_host, port=redis_port,
                                 db=0, decode_responses=True)
        self.test_url = "http://www.baidu.com"   # validation target; an overseas site works too
        self.valid_key = "proxies:valid"         # zset; score = number of successful checks
        self.pending_key = "proxies:pending"     # set of proxies awaiting validation

    def add_proxies(self, proxy_list):
        """Add a batch of proxies to the pending set."""
        if proxy_list:  # SADD needs at least one member
            self.redis.sadd(self.pending_key, *proxy_list)

    def _validate_single(self, proxy):
        """Validate one proxy: bump its score on success, drop it on failure."""
        try:
            resp = requests.get(self.test_url,
                                proxies={"http": proxy, "https": proxy}, timeout=3)
            resp.raise_for_status()
            # Each successful check adds 1 to the score.
            self.redis.zincrby(self.valid_key, 1, proxy)
        except Exception:
            self.redis.zrem(self.valid_key, proxy)
        # Either way, the proxy is no longer pending.
        self.redis.srem(self.pending_key, proxy)

    def run_validation(self, threads=10):
        """Validate all pending and already-valid proxies in a thread pool."""
        candidates = (list(self.redis.smembers(self.pending_key)) +
                      list(self.redis.zrange(self.valid_key, 0, -1)))
        # Deduplicate
        candidates = list(set(candidates))
        with ThreadPoolExecutor(max_workers=threads) as executor:
            executor.map(self._validate_single, candidates)

    def get_best_proxy(self):
        """Return the highest-scoring proxy (for tasks that need maximum stability)."""
        top = self.redis.zrevrange(self.valid_key, 0, 0)
        return top[0] if top else None

    def get_random_proxy(self, min_score=1):
        """Return a random proxy with score >= min_score, giving simple rotation."""
        candidates = self.redis.zrangebyscore(self.valid_key, min_score, float("inf"))
        return random.choice(candidates) if candidates else None

# ---- Usage example ----
if __name__ == "__main__":
    pool = SimpleProxyPool()
    # Add the proxies bought from your provider, in the form "http://user:pass@ip:port"
    pool.add_proxies(["http://127.0.0.1:7890"])  # example only; replace with real proxies

    # Re-validate automatically every 10 minutes
    while True:
        pool.run_validation()
        valid_count = pool.redis.zcard(pool.valid_key)
        print(f"Valid proxies: {valid_count}")
        time.sleep(60 * 10)

Rotation strategy:

  • For core tasks that require high stability, use get_best_proxy() to take the proxy with the highest score.
  • For general crawling tasks, use get_random_proxy(min_score=1) to draw randomly from validated proxies and spread the load.
  • Combine this with a retry mechanism: if a request fails, remove that proxy automatically and try again with a new one.
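
The retry idea in the last bullet might look like the sketch below. It is deliberately decoupled from Redis: fetch_with_rotation is a hypothetical helper that takes three callables, so any pool can be plugged in. The callable names get_proxy, fetch, and on_failure are assumptions, not part of SimpleProxyPool.

```python
def fetch_with_rotation(url, get_proxy, fetch, on_failure, max_attempts=3):
    """Try url through up to max_attempts proxies, discarding ones that fail.

    get_proxy()        -> a proxy URL, or None when the pool is empty
    fetch(url, proxy)  -> the response; raises an exception on failure
    on_failure(proxy)  -> removes or penalises the failed proxy
    """
    last_error = None
    for _ in range(max_attempts):
        proxy = get_proxy()
        if proxy is None:
            break  # nothing left to try
        try:
            return fetch(url, proxy)
        except Exception as exc:
            last_error = exc
            on_failure(proxy)  # drop the bad proxy before the next attempt
    raise RuntimeError(f"all proxy attempts failed for {url}") from last_error
```

With the pool above, one possible wiring is get_proxy=pool.get_random_proxy, on_failure=lambda p: pool.redis.zrem(pool.valid_key, p), and fetch=lambda u, p: requests.get(u, proxies={"http": p, "https": p}, timeout=10).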

4. Pitfalls to avoid and compliant use

4.1 Pitfalls you must avoid

  1. Using free proxies to save money: 99% of free proxies are either transparent or have already been abused. Any website with even slightly strict access control will block them outright.
  2. Uncontrolled request frequency: even with a high-anonymity residential IP, 100 requests per second will trigger risk control. Always add a random delay.
  3. Ignoring browser fingerprinting: changing the IP without disguising the fingerprint is wasted effort. When using Selenium or Playwright, be sure to use an anti-detection tool (such as undetected-chromedriver for Selenium).
  4. Crawling sensitive content: personal privacy data, copyrighted content, and unauthorized commercial data are off limits, both legally and ethically.
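
The random delay from pitfall 2 can be added with a small throttle object that is called before every request. This is a sketch under assumptions: the class name JitteredThrottle is made up, and the default 2 to 3 second window is borrowed from the pacing suggested in section 4.2 rather than from any site's published limit. The sleep and clock parameters exist only so the behaviour can be tested without waiting.

```python
import random
import time


class JitteredThrottle:
    """Space out requests by a random delay between min_delay and max_delay."""

    def __init__(self, min_delay=2.0, max_delay=3.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep
        self._clock = clock
        self._next_allowed = float("-inf")  # the first request may go immediately

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        remaining = self._next_allowed - now
        if remaining > 0:
            self._sleep(remaining)
            now = self._next_allowed
        # Schedule the earliest time for the request after this one.
        self._next_allowed = now + random.uniform(self.min_delay, self.max_delay)
        return max(remaining, 0.0)
```

Create one throttle per target site and call throttle.wait() right before each request; using one throttle per proxy instead would pace each IP independently.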

4.2 Compliance Best Practices

  • Strictly follow the rules in the target website's robots.txt.
  • Keep the request frequency within a reasonable range (for example, one request every 2 to 3 seconds, and lower during peak periods).
  • Set a clear User-Agent that includes contact details, for example: Mozilla/5.0 (compatible; MyScraperBot/1.0; +contact@example.com)
  • Keep complete crawl logs, including the timestamp, request URL, and proxy IP used, for compliance review.
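
Two of these practices, honouring robots.txt and keeping crawl logs, can be sketched with the standard library alone. The names below (BOT_NAME, load_robots, is_allowed, log_crawl) and the sample robots.txt content are illustrative assumptions. Note that urllib.robotparser matches rules against the product token, so the short bot name is passed to it rather than the full User-Agent string.

```python
import logging
from urllib.robotparser import RobotFileParser

BOT_NAME = "MyScraperBot"  # product token used for robots.txt matching
USER_AGENT = f"Mozilla/5.0 (compatible; {BOT_NAME}/1.0; +contact@example.com)"

log = logging.getLogger("crawl")


def load_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url):
    """Return True if robots.txt permits BOT_NAME to fetch url."""
    return parser.can_fetch(BOT_NAME, url)


def log_crawl(url, proxy_ip, status):
    """Record one request; the logging formatter supplies the timestamp."""
    log.info("url=%s proxy=%s status=%s", url, proxy_ip, status)
```

To get timestamps in the log, configure logging once at startup, for example with logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s"). If the site declares one, parser.crawl_delay(BOT_NAME) returns its Crawl-delay value, which can feed the request pacing.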

5. Recommended resources

  • Open-source tools

    • ProxyBroker: automatically finds and validates free proxies (for learning, not for production)
    • scrapy-rotating-proxies: proxy rotation middleware for Scrapy
    • curl-cffi: impersonates browser TLS fingerprints so Python requests look more like a real browser
  • Testing tools

    • httpbin.org/headers: view the request headers actually received and verify proxy anonymity
    • browserleaks.com: check browser fingerprint and IP information
  • Learning materials

    • "Web Scraping with Python", 2nd Edition (O'Reilly)
    • mitmproxy official documentation: understanding proxy interception and traffic debugging
