Python Crawler Proxy Setup Guide (2024 Edition)
The two biggest headaches for a crawler are IPs that get blocked almost immediately and target sites that restrict access by region. This guide walks through proxy configuration for urllib, requests, httpx, and Playwright, covering HTTP/HTTPS (including authentication) and SOCKS5, with code you can use right away.
1. Preparation: 5 Things You Must Know Before You Start
Spend a minute on these 5 points before configuring a proxy; they can save a lot of troubleshooting time.
- Know the address format: A proxy is usually written as IP:port. For example, the local tool Clash typically exposes its HTTP proxy at 127.0.0.1:7890 and its SOCKS5 proxy at 127.0.0.1:7891. Don't mix the two up.
- Find available proxies:
  - Hands-on testing: the free section of Kuaidaili (快代理) works, but those proxies are slow and short-lived.
  - Production: consider paid services such as Abuyun, Zhandaye, or Oxylabs, billed by traffic or by duration. Their stability and success rate are much higher than free services.
- Be clear about the protocol: Not every HTTP proxy can forward HTTPS traffic. If the target site uses HTTPS, choose a proxy that supports HTTPS, or use a full-protocol proxy (HTTP/HTTPS/SOCKS5).
- Encode special characters in credentials: If the username or password contains characters such as @ or #, URL-encode them (for example, @ becomes %40); otherwise the proxy address is parsed incorrectly.
- Verify right after configuring: Send a test request to http://httpbin.org/get and check that the "origin" field of the returned JSON is the proxy IP. Don't skip this step.
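The encoding rule for credentials can be sketched with the standard library alone. This is a minimal sketch: build_proxy_url is a hypothetical helper name, and the username and password are placeholder values.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" makes quote() encode every reserved character, including @, # and :
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Placeholder credentials, for illustration only.
print(build_proxy_url("user@corp", "p#ss", "127.0.0.1", 7890))
# -> http://user%40corp:p%23ss@127.0.0.1:7890
```

Encoding each part separately, before joining, is what keeps a literal @ in the password from being mistaken for the separator in front of the host.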
2. urllib: the standard library can handle proxies too
urllib ships with Python. It is rarely the first choice, but it still turns up in legacy projects and in projects that aim for zero dependencies.
Basic HTTP/HTTPS proxy
Use ProxyHandler to create a proxy handler, pass it to build_opener to get a custom opener, and then use that opener in place of the default urlopen.
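A minimal sketch of that flow, assuming a local HTTP proxy at the Clash default address mentioned earlier. The names make_opener and fetch_origin are illustrative, not part of urllib.

```python
import json
import urllib.request

PROXY = "http://127.0.0.1:7890"  # assumed local proxy; replace with your own


def make_opener(proxy_url):
    """Build an opener that sends both HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_origin(opener, url="http://httpbin.org/get", timeout=10):
    """Return the "origin" field reported by httpbin, i.e. the IP the site sees."""
    with opener.open(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["origin"]


# Usage (needs a live proxy):
#   print(fetch_origin(make_opener(PROXY)))
```

Calling urllib.request.install_opener(opener) would make the plain urlopen use the proxy globally; keeping the opener as a local object avoids changing behaviour for other code in the same process.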
Proxy with authentication
Embed the username and password in the address, in the form username:password@ip:port. Remember to URL-encode special characters.
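As a sketch, the authenticated variant only changes how the proxy URL is built. The helper name make_auth_opener and the credentials in the usage comment are placeholders.

```python
import urllib.request
from urllib.parse import quote


def make_auth_opener(user, password, host, port):
    """Build an opener for a proxy that requires a username and password."""
    # Percent-encode the credentials so characters like @ do not break URL parsing.
    proxy_url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (placeholder credentials, needs a live proxy):
#   opener = make_auth_opener("user", "p@ss", "127.0.0.1", 7890)
#   print(opener.open("http://httpbin.org/get", timeout=10).read().decode())
```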
SOCKS5 proxy
urllib itself does not support SOCKS. Use the third-party library PySocks to patch the underlying socket.
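One common way to do this with PySocks is to set a default proxy and replace socket.socket. The sketch below assumes PySocks is installed (pip install PySocks) and uses the Clash SOCKS5 default address from this guide; enable_socks5 is a hypothetical helper name.

```python
import socket


def enable_socks5(host, port):
    """Route every newly created socket through a SOCKS5 proxy."""
    import socks  # third-party PySocks; imported here so the module loads without it

    socks.set_default_proxy(socks.SOCKS5, host, port)
    # Global monkeypatch: affects all sockets created in this process from now on.
    socket.socket = socks.socksocket


# Usage (needs PySocks and a live SOCKS5 proxy):
#   enable_socks5("127.0.0.1", 7891)
#   import urllib.request
#   print(urllib.request.urlopen("http://httpbin.org/get", timeout=10).read().decode())
```

Because the patch is process-wide, it also redirects traffic from any other library that opens sockets afterwards, which may or may not be what you want.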
3. requests: the most popular library, super simple to configure
Proxy configuration in requests is much more concise than in urllib: every request method accepts a proxies parameter.
Basic HTTP/HTTPS proxy
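A minimal sketch of the proxies parameter, assuming the same local proxy address as before. build_proxies and fetch_origin are illustrative names.

```python
PROXY = "http://127.0.0.1:7890"  # assumed local proxy; replace with your own


def build_proxies(proxy_url):
    """Map both URL schemes to the same proxy, in the format requests expects."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_origin(proxy_url=PROXY, url="http://httpbin.org/get", timeout=10):
    """Return the IP that httpbin sees for a request sent through the proxy."""
    import requests  # third-party (pip install requests); imported lazily

    resp = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    resp.raise_for_status()
    return resp.json()["origin"]


# Usage (needs requests and a live proxy):
#   print(fetch_origin())
```

The keys of the dictionary are the scheme of the target URL, not of the proxy, so an "https" entry pointing at an http:// proxy address is normal.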
Proxy with authentication
Likewise, just add the username and password to the address.
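As a sketch, an authenticated proxy can be set once on a Session so that every request reuses it. The helper names and the credentials in the usage comment are placeholders.

```python
from urllib.parse import quote


def auth_proxies(user, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    proxy_url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def make_session(user, password, host, port):
    """Create a Session whose requests all go through the authenticated proxy."""
    import requests  # third-party (pip install requests); imported lazily

    session = requests.Session()
    session.proxies.update(auth_proxies(user, password, host, port))
    return session


# Usage (placeholder credentials, needs a live proxy):
#   session = make_session("user", "p@ss", "127.0.0.1", 7890)
#   print(session.get("http://httpbin.org/get", timeout=10).json()["origin"])
```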
SOCKS5 proxy
Install the SOCKS extra for requests (pip install "requests[socks]") and SOCKS5 proxies work directly.
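With the extra installed, the only change is the scheme of the proxy URL. The sketch below assumes the Clash SOCKS5 default address; socks_proxies is a hypothetical helper. The socks5h scheme asks the proxy to resolve hostnames, while socks5 resolves them locally.

```python
def socks_proxies(host, port, remote_dns=True):
    """Return a requests-style proxies dict for a SOCKS5 proxy."""
    # socks5h:// lets the proxy do DNS resolution; socks5:// resolves on this machine.
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_origin(host="127.0.0.1", port=7891, url="http://httpbin.org/get"):
    """Return the IP that httpbin sees for a request sent through the SOCKS5 proxy."""
    import requests  # third-party; needs the socks extra, imported lazily

    resp = requests.get(url, proxies=socks_proxies(host, port), timeout=10)
    resp.raise_for_status()
    return resp.json()["origin"]


# Usage (needs requests[socks] and a live SOCKS5 proxy):
#   print(fetch_origin())
```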
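4. httpx: an async-ready client for the new era
httpx supports HTTP/2 and native async, and is increasingly popular for high-concurrency crawlers. Proxy routing keys use the full scheme prefix (http://, https://), which makes it easy to assign different proxies to different schemes or domains. Note that the proxies dictionary keyed this way was deprecated in httpx 0.26 and removed in 0.28; newer versions take a single proxy through the proxy argument and prefix-keyed routing through mounts.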
Basic HTTP/HTTPS proxy (synchronous)
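A minimal sketch using prefix-keyed mounts, which should work on httpx versions both before and after the proxies argument was removed. The proxy address is an assumption, and proxy_routes and make_client are illustrative names.

```python
PROXY = "http://127.0.0.1:7890"  # assumed local proxy; replace with your own


def proxy_routes(http_proxy, https_proxy=None):
    """Map full scheme prefixes to proxy URLs; HTTPS falls back to the HTTP proxy."""
    return {"http://": http_proxy, "https://": https_proxy or http_proxy}


def make_client(http_proxy=PROXY, https_proxy=None):
    """Build a synchronous httpx client that routes each scheme through its proxy."""
    import httpx  # third-party (pip install httpx); imported lazily

    mounts = {
        prefix: httpx.HTTPTransport(proxy=url)
        for prefix, url in proxy_routes(http_proxy, https_proxy).items()
    }
    return httpx.Client(mounts=mounts, timeout=10.0)


# Usage (needs httpx and a live proxy):
#   with make_client() as client:
#       print(client.get("http://httpbin.org/get").json()["origin"])
```

If every request goes through one proxy, httpx.Client(proxy=PROXY) is shorter on httpx 0.26 and later.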
SOCKS5 proxy (synchronous & asynchronous)
First install an extension library that adds SOCKS support:
Synchronous usage
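The sketch below assumes SOCKS support comes from the httpx[socks] extra (pip install "httpx[socks]"); the original does not name the package, so treat that as an assumption. The proxy address is the Clash SOCKS5 default, and fetch_origin_sync is an illustrative name.

```python
SOCKS5_PROXY = "socks5://127.0.0.1:7891"  # assumed local SOCKS5 proxy


def fetch_origin_sync(proxy_url=SOCKS5_PROXY, url="http://httpbin.org/get"):
    """Return the IP that httpbin sees for a request sent through the SOCKS5 proxy."""
    import httpx  # third-party; needs SOCKS support installed, imported lazily

    transport = httpx.HTTPTransport(proxy=proxy_url)
    with httpx.Client(transport=transport, timeout=10.0) as client:
        resp = client.get(url)
        resp.raise_for_status()
        return resp.json()["origin"]


# Usage (needs httpx with SOCKS support and a live SOCKS5 proxy):
#   print(fetch_origin_sync())
```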
Asynchronous usage
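The asynchronous version mirrors the synchronous one, swapping in the async transport and client. As before, the proxy address and the function name are assumptions for illustration, and SOCKS support is assumed to be installed.

```python
import asyncio

SOCKS5_PROXY = "socks5://127.0.0.1:7891"  # assumed local SOCKS5 proxy


async def fetch_origin_async(proxy_url=SOCKS5_PROXY, url="http://httpbin.org/get"):
    """Return the IP that httpbin sees, using an async client behind the proxy."""
    import httpx  # third-party; needs SOCKS support installed, imported lazily

    transport = httpx.AsyncHTTPTransport(proxy=proxy_url)
    async with httpx.AsyncClient(transport=transport, timeout=10.0) as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.json()["origin"]


# Usage (needs httpx with SOCKS support and a live SOCKS5 proxy):
#   print(asyncio.run(fetch_origin_async()))
```

For many URLs, create one AsyncClient and share it across tasks started with asyncio.gather, rather than opening a client per request as this short sketch does.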
5. Automation tools (Selenium/Playwright)
When a page is rendered dynamically, you need to drive a real browser. Playwright is currently the first choice, with intuitive configuration and excellent support for dynamic content. Selenium is configured along the same lines: pass --proxy-server through add_argument on the corresponding driver's options.
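For Selenium, the idea can be sketched with Chrome as the example driver. The proxy address is an assumption, and proxy_argument and make_chrome_driver are hypothetical helper names.

```python
PROXY = "http://127.0.0.1:7890"  # assumed local proxy; replace with your own


def proxy_argument(proxy_server):
    """Build the Chromium command-line flag that sets the browser's proxy."""
    return f"--proxy-server={proxy_server}"


def make_chrome_driver(proxy_server=PROXY):
    """Start Chrome through Selenium with all traffic routed via proxy_server."""
    from selenium import webdriver  # third-party (pip install selenium); lazy import

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(proxy_server))
    return webdriver.Chrome(options=options)


# Usage (needs selenium, Chrome, and a live proxy):
#   driver = make_chrome_driver()
#   driver.get("http://httpbin.org/get")
#   print(driver.page_source)
#   driver.quit()
```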
Playwright proxy configuration
The proxy settings go directly into the proxy parameter of launch(). HTTP, HTTPS, and SOCKS5 are supported, and authentication has its own fields.
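A minimal sketch with Playwright's synchronous API and Chromium. The proxy address is an assumption, and build_proxy_settings and fetch_origin are illustrative names.

```python
import json


def build_proxy_settings(server, username=None, password=None):
    """Build the dict passed as launch(proxy=...); credentials are optional."""
    settings = {"server": server}
    if username is not None:
        settings["username"] = username
        settings["password"] = password or ""
    return settings


def fetch_origin(proxy_settings, url="http://httpbin.org/get"):
    """Open url in a proxied Chromium and return the IP that httpbin reports."""
    from playwright.sync_api import sync_playwright  # third-party; lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=proxy_settings)
        try:
            page = browser.new_page()
            page.goto(url)
            body = page.inner_text("body")  # httpbin returns raw JSON text
        finally:
            browser.close()
    return json.loads(body)["origin"]


# Usage (needs playwright, an installed Chromium, and a live proxy):
#   print(fetch_origin(build_proxy_settings("http://127.0.0.1:7890")))
```

Keeping the credentials in their own fields means they do not need to be URL-encoded into the server address.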
6. Avoid Pitfalls & Best Practices
- Don't use free proxies in production: Free IPs are short-lived and slow, and target sites identify and block them easily. They are only suitable for temporary testing.
- Prefer full-protocol proxies: A proxy that is explicitly marked as supporting HTTP/HTTPS/SOCKS5 reduces configuration errors and makes code easier to reuse.
- Use a proxy pool for high concurrency: Don't hammer a site from a single IP. Open-source projects such as proxy_pool can provide automatic IP scheduling.
- Don't hardcode sensitive information: Don't write the proxy username and password directly in code. Read them from environment variables or a configuration file to protect the account.
- Rotate and verify regularly: A production script should check the availability of the IPs in its proxy pool at regular intervals (for example, every 5 minutes) and remove failed nodes.
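The last two points can be sketched together with the standard library. This is an illustration under stated assumptions, not the proxy_pool project's API: the environment variable name PROXY_URL and the functions proxy_from_env, is_alive, and prune_pool are all hypothetical.

```python
import http.client
import os
import urllib.request


def proxy_from_env(environ=None):
    """Read the proxy URL from the PROXY_URL environment variable (hypothetical name)."""
    environ = os.environ if environ is None else environ
    return environ["PROXY_URL"]


def is_alive(proxy_url, test_url="http://httpbin.org/get", timeout=5):
    """Return True if a request through proxy_url succeeds within the timeout."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and connection resets are all OSError subclasses.
        return False


def prune_pool(pool, check=is_alive):
    """Return only the proxies that pass the availability check."""
    return [proxy for proxy in pool if check(proxy)]


# Usage (needs live proxies): call this on a schedule, e.g. every 5 minutes.
#   pool = prune_pool(pool)
```

Passing the check as a parameter makes it easy to swap in a stricter test, such as comparing the returned "origin" against the proxy's IP.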
7. Complete sample code
Complete runnable examples for all of the libraries above, including proxy pool management, reading environment variables, and batch verification, are collected in the GitHub repository:

