Python3 crawler tutorial: setting up ADSL dial-up proxy service
Overview
When you work against targets with strict anti-crawling measures, such as e-commerce data collection and self-media monitoring, having your IPs blocked is almost routine. Third-party paid proxies are either very expensive or shared by so many users that the IP is already on the target website's blacklist before you start crawling.
This tutorial walks you through building an ADSL dial-up proxy service from scratch that is fully independent and under your control, has predictable costs, draws on tens of millions of IP addresses, and offers high anonymity. You do not share the resource pool with anyone: every proxy IP is a "fresh" address dynamically allocated by the operator, and every redial gives you a new identity.
Compared with the proxy services on the market, which are often billed by volume, the main advantages of this solution are:
- ✅ Purely dynamic IPs: every redial forces a new address, IP segments are assigned randomly, and there is almost no duplication;
- ✅ Exclusive bandwidth: each cloud host has 1-2 Mbps of downlink bandwidth to itself, so the connection is stable and free of interference;
- ✅ Fully customizable configuration: anonymity, dialing frequency, and open ports are all under your control;
- ✅ Easy horizontal scaling: run multiple cloud hosts at the same time and the IP pool scales easily into the hundreds or thousands.
1. ADSL technical principles (plain-language version)
ADSL stands for "Asymmetric Digital Subscriber Line". We only need to grasp two core ideas and one advantage.
Two core ideas
- Asymmetric bandwidth (small upstream, large downstream): downstream (pulling data from the server) is fast, while upstream (sending requests to the server) is slow. This matches the crawler's "small request, large response" pattern, so you pay almost nothing for upstream bandwidth you would not use.
- Redialing forces an IP change: just like home modem dial-up in the early years, every time you reconnect, the operator allocates a new address at random from a massive dynamic IP pool. Two consecutive dials usually land in different network segments, so it is difficult for a website to associate them.
One advantage
IPs in the tens of millions: the ADSL dynamic pools of mainstream cloud service providers cover operator IP resources in many provinces and cities across the country. The duplication rate is negligible, so there is little need to worry about "dirty IPs that others have already used".
2. Preparation
2.1 Purchase ADSL dial-up cloud host
Currently, there are two main types of cloud hosts available on the market:
- Pure dial-up cloud: the bandwidth is cheap (1-2 Mbps downlink is enough for daily use) and the IP pool is relatively clean. Asyun and Yuncube are common choices. Be sure to select "Dynamic IP ADSL" instead of "Static Cloud" when ordering.
- Alibaba Cloud/Tencent Cloud Lightweight Dial-up: stability is slightly higher, but you need to configure PPPoE yourself, so the barrier to entry is also slightly higher.
Recommended configuration: prepare at least two 1-core, 1 GB cloud hosts with 1-2 Mbps downlink, and stagger their dial-up times so that the hosts never all reconnect at the same moment and leave you temporarily without a proxy.
2.2 SSH connection server
Get the IP, port, username (usually `root`), and password of the cloud host, then log in remotely from a terminal over SSH, for example with `ssh root@<server-ip> -p <port>`.
3. Test dialing function
Before installing the proxy, first make sure that the cloud host itself can dial up normally and that the IP actually changes.
3.1 Common dialing commands
The dial-up scripts provided by different vendors may differ slightly. Try the generic commands first (`pppoe-stop` followed by `pppoe-start`); if they do not work, contact customer service directly to obtain the vendor-specific script.
3.2 Verify IP changes
Check the address of the `ppp0` interface before and after dialing (for example with `ifconfig ppp0`) and compare the two readings.
If the IP shown on the `inet` line changes after dialing, the dialing function is working normally.
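The before/after comparison can also be scripted. The following is a minimal sketch, assuming `ifconfig` is installed on the host; the helper names `parse_inet`, `current_ip`, and `ip_changed` are hypothetical and not part of any tool mentioned in this tutorial.

```python
import re
import subprocess

# Matches the IPv4 address on the `inet` line of `ifconfig` output
# (both the "inet 1.2.3.4" and the older "inet addr:1.2.3.4" layouts).
INET_RE = re.compile(r"\binet\s+(?:addr:)?(\d{1,3}(?:\.\d{1,3}){3})")


def parse_inet(output):
    """Return the first IPv4 address found on an `inet` line, or None."""
    match = INET_RE.search(output)
    return match.group(1) if match else None


def current_ip(ifname="ppp0"):
    """Read the interface address by running `ifconfig <ifname>`."""
    result = subprocess.run(["ifconfig", ifname], capture_output=True, text=True)
    return parse_inet(result.stdout)


def ip_changed(before, after):
    """True only if both readings succeeded and the address differs."""
    return before is not None and after is not None and before != after
```

A typical use would be to call `current_ip()` once, redial, call it again, and pass both values to `ip_changed`.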
4. Set up Squid proxy server
Next we use Squid to expose the cloud host's dynamic IP as an HTTP/HTTPS proxy that can be reached from outside. It must also be configured as a high-anonymity proxy so that the target server cannot tell that the request was forwarded through a proxy.
4.1 Install Squid
On CentOS/Red Hat systems, install it directly with yum (`yum -y install squid`).
4.2 Basic startup configuration
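Start the service, enable it at boot, and check its running status, in that order (on systemd-based systems: `systemctl start squid`, `systemctl enable squid`, `systemctl status squid`).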
4.3 Core configuration (high anonymity + public network access)
Edit Squid's main configuration file `/etc/squid/squid.conf` (with `vim` or `nano`) and make the following 5 changes in order.
① Allow access from all public IPs
Find the default access control rules, delete or comment out `http_access deny all`, then add `http_access allow all` in its place.
② Add global ACL rules
Near the `acl` definitions at the beginning of the file, add a new ACL line that covers all source addresses.
③ Enable high-anonymity mode (required!)
Append 3 lines of configuration at the end of the file that strip every header that could reveal the presence of the proxy, such as `Via` and `X-Forwarded-For`.
④ Change the default port (recommended)
Squid listens on port 3128 by default, which scanners find easily. It is recommended to switch to a custom port, for example `3328`, through the `http_port` directive.
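When several cloud hosts need the same changes, steps ①, ③, and ④ can be applied by a small script instead of by hand. The sketch below is one possible approach, not the tutorial's original configuration: the three `request_header_access` lines are an assumption (they are standard Squid directives commonly used for this purpose), the function name `patch_squid_conf` is hypothetical, and step ② is left out because the exact ACL line depends on your Squid version.

```python
import re

# Assumed anonymity directives; substitute the lines your setup requires.
ANONYMITY_LINES = [
    "request_header_access Via deny all",
    "request_header_access X-Forwarded-For deny all",
    "request_header_access From deny all",
]


def patch_squid_conf(text, port=3328):
    """Return squid.conf text with open access, a custom port, and anonymity lines."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "http_access deny all":
            # Step 1: comment out the default deny rule and allow everyone.
            lines.append("# " + stripped)
            lines.append("http_access allow all")
        elif re.fullmatch(r"http_port\s+\d+", stripped):
            # Step 4: only the plain "http_port <number>" form is rewritten.
            lines.append(f"http_port {port}")
        else:
            lines.append(line)
    # Step 3: append each anonymity directive once, at the end of the file.
    for directive in ANONYMITY_LINES:
        if directive not in lines:
            lines.append(directive)
    return "\n".join(lines) + "\n"
```

In practice you would read `/etc/squid/squid.conf`, pass its contents through `patch_squid_conf`, write the result back (keeping a backup), and then restart Squid. Running the function on an already patched file leaves it unchanged.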
⑤ Restart Squid to make the configuration take effect
4.4 Test the proxy locally
On the cloud host itself, use `curl` to call a test endpoint through the newly built proxy, for example `curl -x http://127.0.0.1:3328 https://httpbin.org/ip`.
If the returned IP matches the cloud host's current `ppp0` address, the proxy is working normally.
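The same check can be done from Python with only the standard library. This is a minimal sketch, assuming the proxy listens on port `3328` as configured above and that the test endpoint returns httpbin's `{"origin": "..."}` JSON body; the function names are hypothetical.

```python
import json
import urllib.request


def parse_origin(body):
    """Split httpbin's "origin" field into a list of addresses."""
    return [part.strip() for part in json.loads(body)["origin"].split(",")]


def exit_ips_via_proxy(proxy="http://127.0.0.1:3328",
                       url="https://httpbin.org/ip", timeout=10):
    """Fetch the test URL through the proxy and return the reported addresses."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return parse_origin(resp.read().decode("utf-8"))


def proxy_is_working(origins, ppp0_ip):
    """Pass only if the endpoint saw exactly the host's ppp0 address.

    More than one address, or a different one, suggests that a forwarding
    header leaked through or that traffic is not leaving via ppp0.
    """
    return origins == [ppp0_ip]
```

For example, `proxy_is_working(exit_ips_via_proxy(), "<current ppp0 address>")` should return `True` on a correctly configured host.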
5. Implement dynamic IP automatic management
The proxy now exists, but dialing and recording the IP by hand every time is too much trouble. The Python tool `adslproxy` automates the whole flow: "automatic dialing → obtain the new IP → register it in the Redis proxy pool".
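To make that flow concrete, here is a minimal sketch of a single dial cycle. It is not `adslproxy`'s actual implementation: every name in it is hypothetical, and the individual steps (redialing, reading the IP, testing, writing to Redis) are passed in as callables so the control flow can be read and tested on its own.

```python
import time


def dial_once(redial, read_ip, is_usable, register, unregister, name, port=3328):
    """Run one cycle: withdraw the old proxy, redial, register the new one.

    Returns the registered "ip:port" string, or None if the cycle failed.
    """
    unregister(name)          # the old address is about to disappear
    redial()                  # e.g. run the vendor's dial-up script
    ip = read_ip()            # e.g. parse the ppp0 address
    if ip is None:
        return None           # dialing failed; stay unregistered
    proxy = f"{ip}:{port}"
    if not is_usable(proxy):
        return None           # the new address does not pass the test request
    register(name, proxy)     # e.g. store it in a Redis hash under this host's name
    return proxy


def run_forever(cycle_seconds, **steps):
    """Repeat the cycle with a fixed pause between redials."""
    while True:
        dial_once(**steps)
        time.sleep(cycle_seconds)
```

Withdrawing the old entry before redialing means clients never pick up an address that is about to vanish, at the cost of the host being absent from the pool for the duration of the dial.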
5.1 Install dependencies
First install Python 3, pip3, and the `adslproxy` library (for example with `pip3 install adslproxy`).
5.2 Prepare Redis (optional but recommended)
If you have multiple cloud hosts, or want to uniformly obtain proxy addresses from the outside, you need a public network accessible Redis service. You can purchase a separate lightweight cloud server and build it yourself, or you can directly use managed Redis (more stable) provided by Alibaba Cloud, Tencent Cloud, etc.
5.3 Set environment variables
On the cloud host, set the environment variables that `adslproxy` uses to connect to Redis and to dial.
5.4 Start the dial-up process
Run the dial-up process in the background and let it redial automatically in cycles.
You can run `tail -f /var/log/adslproxy_dial.log` to follow the log in real time and confirm that dialing succeeds and that each new IP is stored in Redis.
6. Get IP from proxy pool (Python call/API interface)
`adslproxy` not only maintains the proxy pool for us; it also provides two convenient ways to read from it.
6.1 Using built-in API services
① Start API
Install `adslproxy` on your local computer or on a server that can reach Redis, set the same Redis environment variables, and then start the API service.
The default port is `8000`; to change it, use the `-p` parameter.
② Commonly used API interfaces
6.2 Use Python to read Redis directly (more flexible)
In scenarios that require deeper customization, you can also read the proxy addresses directly from Redis.
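A minimal sketch of reading a proxy directly, using the `redis` package (redis-py). It assumes the pool is stored as a Redis hash that maps a host name to an `ip:port` string; the hash key `"adsl"` and the function names are assumptions, so check the key your own deployment writes to.

```python
import random


def connect(host, port=6379, password=None):
    """Create a Redis client that returns str values instead of bytes."""
    # Imported lazily so the helpers below work without redis-py installed.
    import redis
    return redis.Redis(host=host, port=port, password=password,
                       decode_responses=True)


def pick_proxy(client, key="adsl"):
    """Return a random "ip:port" from the hash, or None if the pool is empty."""
    proxies = client.hvals(key)
    if not proxies:
        return None
    return random.choice(proxies)


def as_requests_proxies(proxy):
    """Build the proxies mapping that HTTP clients such as requests expect."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
```

A crawler would call `pick_proxy(connect("<redis-host>", password="<password>"))` before each request and retry with another proxy when `None` is returned or the request fails, since a host may be in the middle of redialing.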
7. FAQ & Best Practices
FAQ
- **What should I do if the dial-up fails?**
  - First run `pppoe-stop; pppoe-start` manually and watch the console for error messages;
  - Check the broadband account, password, and PPPoE configuration files of the cloud host (you can ask customer service for the host-specific `/etc/ppp/chap-secrets` or `/etc/ppp/peers/pppoe`).
- **Proxy access is blocked?**
  - On the cloud host, run `curl -x http://127.0.0.1:3328 https://httpbin.org/ip` to confirm that the local proxy is available;
  - Check that the firewall allows the proxy port (for example: `firewall-cmd --zone=public --add-port=3328/tcp --permanent && firewall-cmd --reload`).
- **Redis cannot be connected?**
  - Check the `bind` item in the Redis configuration file; change it to `bind 0.0.0.0` or comment it out;
  - Confirm that the firewall has opened port 6379 and that the password is correct.
Best Practices
- Stagger dialing across hosts: use `adslproxy`'s `DIAL_CYCLE` environment variable to give each host a different dial-up period (such as 15 min, 17 min, 20 min), so that the hosts do not all reconnect at the same time and leave you temporarily without proxies.
- Regularly clean up expired proxies: `adslproxy` removes unavailable proxies automatically, but you can add a scheduled task that runs the Redis `HDEL` command periodically to clean up leftovers.
- Customize availability testing as needed: use the `PROXY_TEST_URL` environment variable to replace the test URL with the website you actually want to crawl (such as `https://www.taobao.com`), so that a proxy that passes the test is known to work against the target site.
- Protect the API: if the API service must be exposed to the public network, put it behind an Nginx reverse proxy and add simple token authentication or an IP allowlist to prevent abuse.
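The effect of staggered periods can be checked with a little arithmetic. The sketch below assumes that all hosts start dialing at minute 0 and redial at exact multiples of their period; the function names are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def minutes_until_all_coincide(periods):
    """Minutes until every host redials at the same instant again."""
    return reduce(lcm, periods)


def simultaneous_redials(periods, horizon):
    """Minutes in 1..horizon at which at least two hosts redial together."""
    return [t for t in range(1, horizon + 1)
            if sum(t % p == 0 for p in periods) >= 2]
```

With periods of 15, 17, and 20 minutes, all three hosts coincide only every 1020 minutes (17 hours), whereas three hosts on an identical 15-minute period would all drop out together every 15 minutes. Pairs still overlap occasionally (the 15- and 20-minute hosts meet every 60 minutes), which is one reason to run more than two hosts.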
8. Summary
With this tutorial you have walked through the entire process of building, from scratch, an ADSL dial-up proxy service that is independently controllable, low-cost, and backed by tens of millions of IPs. This approach addresses the persistent problems of the crawler field: paid proxies are expensive, shared proxies are dirty, and the number of IPs is a bottleneck. It is especially suitable for collecting data from platforms with strict anti-crawling measures, such as e-commerce, self-media, and recruitment websites.
Now, you can focus on the crawler logic itself, and leave the remaining network identity issues to this dynamic ADSL proxy pool!

