title: ADSL dial-up proxy-usage description: Python3 crawler tutorial: setting up ADSL dial-up proxy service

Python3 crawler tutorial: setting up ADSL dial-up proxy service

Overview

When you collect data from sites with strict anti-crawling measures, such as e-commerce platforms or self-media monitoring, getting your IP blocked is almost routine. Third-party paid proxies are either expensive or shared by many users, in which case the IP may already be on the target site's blacklist before you start crawling.

This tutorial walks you through building an ADSL dial-up proxy service from scratch: one that you fully control, with predictable costs, an IP pool in the tens of millions, and high anonymity. You do not share the resource pool with anyone. Every proxy IP is a "fresh" address dynamically allocated by the carrier, so each redial gives you a new identity.

Compared with the commercial proxy services that typically bill by volume, the main advantages of this solution are:

  • Purely dynamic IPs: every redial forces a new address, assigned at random from the carrier's IP ranges, with almost no repeats;
  • Dedicated bandwidth: each cloud host has 1-2 Mbps of downlink, and the connection is stable and interference-free;
  • Fully customizable configuration: anonymity level, dialing frequency, and open ports are all under your control;
  • Easy horizontal scaling: run multiple cloud hosts in parallel and the IP pool grows readily to hundreds or thousands of addresses.

ADSL stands for "Asymmetric Digital Subscriber Line". You only need to grasp two core ideas and one advantage.

Two core ideas

  1. Asymmetric bandwidth: small upstream, large downstream. Downstream traffic (pulling data from the server) is fast, while upstream traffic (sending requests to the server) is slow. This matches a crawler's "small request, large response" pattern, so you pay almost nothing for upstream bandwidth you would not use.

  2. Redialing forces an IP change. As with home modem dial-up in the early days, each time you reconnect, the carrier assigns a new address at random from a very large dynamic IP pool. Two consecutive dials usually land in different network segments, which makes it hard for a website to link them.

One advantage

IPs in the tens of millions: the ADSL dynamic pools of mainstream cloud providers cover carrier IP resources in many provinces and cities across the country. The repeat rate is low enough to ignore, and you need not worry about "dirty" IPs that others have already used.


2. Preparation

2.1 Purchase ADSL dial-up cloud host

Currently, there are two main types of cloud hosts available on the market:

  • Pure dial-up cloud: bandwidth is cheap (1-2 Mbps downlink is enough for daily use) and the IP pool is relatively clean. Asyun and Yuncube are common choices. Be sure to select "Dynamic IP ADSL" rather than "Static Cloud" when ordering.
  • Alibaba Cloud/Tencent Cloud lightweight dial-up: stability is somewhat better, but you must configure PPPoE yourself, so the barrier to entry is higher.

Recommended configuration: prepare at least 2 cloud hosts (1 core, 1 GB RAM, 1-2 Mbps downlink each), then stagger their dial-up times so that the hosts do not all reconnect at once and leave you temporarily without any proxy.

2.2 Connect to the server over SSH

Get the cloud host's IP, port, username (usually root), and password, then log in remotely from a terminal:

# Replace your_server_ip and port_number with the actual values
ssh root@your_server_ip -p port_number

3. Test dialing function

Before installing the proxy, first make sure that the cloud host itself can dial normally and that its IP actually changes.

3.1 Common dialing commands

The dial-up scripts supplied by different vendors may differ slightly. Try the generic commands first; if they do not work, contact customer service for the vendor-specific script:

pppoe-start  # start dialing
pppoe-stop   # hang up

3.2 Verify IP changes

Check the address of the ppp0 interface before and after dialing and compare the two:

ifconfig | grep ppp0 -A 2

If the IP shown on the inet line changes after dialing, the dialing function is working.
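The before/after comparison can also be scripted. The following is a minimal sketch, assuming you capture the text printed by ifconfig before and after a redial; the helper names `extract_inet_ip` and `ip_changed` are hypothetical, and the sample addresses are documentation placeholders.

```python
import re
from typing import Optional


def extract_inet_ip(ifconfig_output: str, ifname: str = "ppp0") -> Optional[str]:
    """Return the IPv4 address on the inet line of `ifname`, or None if absent."""
    in_block = False
    for line in ifconfig_output.splitlines():
        # A line that starts in column 0 is an interface header such as "ppp0: flags=...".
        if line and not line[0].isspace():
            in_block = re.match(rf"{re.escape(ifname)}\b", line) is not None
        if in_block:
            # Handles both "inet 1.2.3.4" and the older "inet addr:1.2.3.4" layout.
            match = re.search(r"inet (?:addr:)?(\d{1,3}(?:\.\d{1,3}){3})", line)
            if match:
                return match.group(1)
    return None


def ip_changed(before: str, after: str, ifname: str = "ppp0") -> bool:
    """True only if both outputs contain an address for `ifname` and they differ."""
    old_ip = extract_inet_ip(before, ifname)
    new_ip = extract_inet_ip(after, ifname)
    return old_ip is not None and new_ip is not None and old_ip != new_ip
```

In practice you would feed it the output of `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`, taken once before and once after running the dial commands.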


4. Set up Squid proxy server

Next, we use Squid to turn the cloud host's dynamic IP into an HTTP/HTTPS proxy that can be reached from outside. It must also be configured as a high-anonymity proxy, so that the target server cannot tell that the request was forwarded through a proxy.

4.1 Install Squid

Install directly with yum on CentOS/RedHat system:

yum install squid -y

4.2 Basic startup configuration

Execute startup, set auto-start at boot, and check the running status in sequence:

systemctl start squid
systemctl enable squid   # start automatically at boot
systemctl status squid   # "active (running)" means it started successfully

4.3 Core configuration (high anonymity + public network access)

Edit Squid's main configuration file /etc/squid/squid.conf (with vim or nano) and make the following 5 changes in order.

① Allow access from all public IPs

Find the default access control rules, delete or comment out http_access deny all, then add:

http_access allow all

② Add global ACL rules

Near the acl definitions at the top of the file, add a new line:

acl localnet src 0.0.0.0/0

③ Enable high-anonymity mode (required!)

Append the following 3 lines at the end of the file to strip the headers that could reveal the proxy:

request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access From deny all

④ Change the default port

Squid listens on port 3128 by default, which is easily found by port scanners. It is recommended to change it to a custom port, for example 3328:

http_port 3328

⑤ Restart Squid to make the configuration take effect

systemctl restart squid
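If you manage several hosts, the edits in ①-④ can be applied by a script instead of by hand. This is a rough sketch under stated assumptions: it works on the text of squid.conf, the function name `apply_squid_changes` is hypothetical, and it only handles the simple line layout described above. Run it on a copy and review the result before restarting Squid.

```python
HEADER_RULES = [
    "request_header_access Via deny all",
    "request_header_access X-Forwarded-For deny all",
    "request_header_access From deny all",
]


def apply_squid_changes(conf_text: str, port: int = 3328) -> str:
    """Return a copy of the squid.conf text with changes 1-4 from this section applied."""
    out = []
    acl_added = False
    port_set = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Change 2: put the new acl just before the first existing acl line.
        if stripped.startswith("acl ") and not acl_added:
            out.append("acl localnet src 0.0.0.0/0")
            acl_added = True
        if stripped == "http_access deny all":
            # Change 1: comment out the deny rule and allow everyone instead.
            out.append("# " + line)
            out.append("http_access allow all")
        elif stripped.startswith("http_port "):
            # Change 4: keep a single http_port line with the custom port.
            if not port_set:
                out.append(f"http_port {port}")
                port_set = True
        else:
            out.append(line)
    if not acl_added:
        out.insert(0, "acl localnet src 0.0.0.0/0")
    if not port_set:
        out.append(f"http_port {port}")
    # Change 3: append the header-stripping rules at the end of the file.
    out.extend(HEADER_RULES)
    return "\n".join(out) + "\n"
```

A typical use would read /etc/squid/squid.conf, write the returned text to a new file, and compare the two with diff before swapping them.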

4.4 Test the proxy locally

On the cloud host itself, use curl to call a test endpoint through the newly built proxy:

# Replace the port with the value you configured
curl -x http://127.0.0.1:3328 https://httpbin.org/ip

If the returned IP matches the current ppp0 address of the cloud host, the proxy is working.
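The same check can be made from Python with only the standard library. This is a sketch, assuming the test endpoint returns JSON of the form {"origin": "..."} as httpbin.org/ip does; the function names are hypothetical.

```python
import json
import urllib.request


def parse_origin(body: str) -> str:
    """Extract the caller IP from an httpbin-style JSON body."""
    origin = json.loads(body)["origin"]
    # Some deployments report several comma-separated addresses; keep the first.
    return origin.split(",")[0].strip()


def fetch_origin_via_proxy(proxy_url: str,
                           test_url: str = "https://httpbin.org/ip",
                           timeout: float = 10.0) -> str:
    """Request `test_url` through `proxy_url` and return the IP the server saw."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(test_url, timeout=timeout) as resp:
        return parse_origin(resp.read().decode("utf-8"))


# Example (needs network access and a running proxy):
# print(fetch_origin_via_proxy("http://127.0.0.1:3328"))
```

Compare the returned value with the ppp0 address; if they are equal, requests are leaving through the dial-up line.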


5. Implement dynamic IP automatic management

We now have a proxy, but dialing by hand and recording the IP every time is tedious. The Python tool adslproxy automates the whole flow: dial automatically → obtain the new IP → register it in a Redis proxy pool.

5.1 Install dependencies

First install Python3, pip3, and the adslproxy library:

yum install python3 python3-pip -y
pip3 install adslproxy

5.2 Prepare Redis

If you have multiple cloud hosts, or want to fetch proxy addresses from outside in one place, you need a Redis service that is reachable over the public network. You can buy a separate lightweight cloud server and run Redis yourself, or use the managed Redis offered by Alibaba Cloud, Tencent Cloud, and others (more stable).

5.3 Set environment variables

On the cloud host, set the parameters adslproxy needs for connecting to Redis and for dialing:

# Replace with your own Redis connection details
export REDIS_HOST='your_redis_public_ip'
export REDIS_PORT='6379'
export REDIS_PASSWORD='your_redis_password'
export REDIS_DB='0'          # database number, 0 by default

# Proxy port, dial script, and dial interface
export PROXY_PORT='3328'
export DIAL_BASH='pppoe-stop; sleep 3; pppoe-start'  # sleep 3 avoids conflicts between back-to-back dials
export DIAL_IFNAME='ppp0'      # usually ppp0; ask customer service if unsure

# Each cloud host must have a unique name, otherwise hosts overwrite each other's entries
export CLIENT_NAME='adsl1'
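Before starting the dial process, it can help to confirm that every variable from this section is actually exported in the current shell. The sketch below is a hypothetical pre-flight check, not part of adslproxy, and it may be stricter than the tool itself (for example, it reports an empty REDIS_PASSWORD as unset).

```python
import os

# The variables exported in this section.
EXPECTED_VARS = [
    "REDIS_HOST", "REDIS_PORT", "REDIS_PASSWORD", "REDIS_DB",
    "PROXY_PORT", "DIAL_BASH", "DIAL_IFNAME", "CLIENT_NAME",
]
NUMERIC_VARS = ["REDIS_PORT", "REDIS_DB", "PROXY_PORT"]


def check_environment(env=None):
    """Return a list of problems found; an empty list means nothing looks wrong."""
    env = os.environ if env is None else env
    problems = []
    for name in EXPECTED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in NUMERIC_VARS:
        value = env.get(name)
        if value and not value.isdigit():
            problems.append(f"{name} must be a number, got {value!r}")
    return problems


# Example: print each problem, one per line, before launching the dialer.
# for problem in check_environment():
#     print(problem)
```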

5.4 Start the dial-up process

Put the dial-up process to run in the background and let it work automatically in cycles:

nohup adslproxy dial > /var/log/adslproxy_dial.log 2>&1 &

You can follow the log in real time with tail -f /var/log/adslproxy_dial.log to confirm that dialing works and that the IP is stored in Redis.
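To make the "dial → obtain new IP → register in Redis" flow concrete, here is a conceptual sketch of a single cycle. It is not adslproxy's actual implementation: the function `run_dial_cycle` and its parameters are hypothetical, and the dialing, IP lookup, and storage steps are passed in as callables so the logic can be read (and tested) without a modem or a Redis server.

```python
import time
from typing import Callable, Optional


def run_dial_cycle(dial: Callable[[], None],
                   get_ip: Callable[[], Optional[str]],
                   register: Callable[[str, str], None],
                   unregister: Callable[[str], None],
                   client_name: str,
                   proxy_port: int,
                   retries: int = 3,
                   wait_seconds: float = 3.0) -> Optional[str]:
    """Redial once and publish the new 'ip:port'; return it, or None on failure."""
    # The old address stops working as soon as the line drops, so remove it first.
    unregister(client_name)
    for _ in range(retries):
        dial()
        ip = get_ip()
        if ip:
            proxy = f"{ip}:{proxy_port}"
            register(client_name, proxy)
            return proxy
        # No address yet: wait briefly, then dial again.
        time.sleep(wait_seconds)
    return None
```

On a real host, `dial` could run the DIAL_BASH command through `subprocess.run(..., shell=True)`, `get_ip` could parse the ppp0 address, and `register`/`unregister` could wrap Redis `HSET`/`HDEL` calls; all of that wiring is left as an assumption here.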


6. Get IP from proxy pool (Python call/API interface)

adslproxy not only maintains the proxy pool for us, it also gives us two convenient ways to read from it.

6.1 Use the built-in API service

① Start the API

On your local computer, or on any server that can reach Redis, install adslproxy as well, set the Redis environment variables, and then start the server:

adslproxy server

The default port is 8000. To change it, use the -p parameter:

adslproxy server -p 9000

② Commonly used API interfaces

Interface path   Method   Function
/random          GET      Get one available proxy at random
/all             GET      Get the list of all currently available proxies
/count           GET      Query the number of currently available proxies
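A crawler can call these endpoints with the standard library alone. The sketch below assumes the API is reachable at a base URL such as http://127.0.0.1:8000 and that the response body is plain text; the exact response format is an assumption, so check what your server actually returns. The helper names are hypothetical.

```python
import urllib.request


def api_url(base_url: str, endpoint: str) -> str:
    """Join the base URL and an endpoint such as 'random' without doubling slashes."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")


def call_api(base_url: str, endpoint: str, timeout: float = 5.0) -> str:
    """GET one of the endpoints (/random, /all, /count) and return the body as text."""
    with urllib.request.urlopen(api_url(base_url, endpoint), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


# Example (needs a running adslproxy server):
# proxy = call_api("http://127.0.0.1:8000", "random")
# count = call_api("http://127.0.0.1:8000", "count")
```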

6.2 Use Python to read Redis directly (more flexible)

In some scenarios that require in-depth customization, you can also directly retrieve the proxy address from Redis:

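import random

import redis

# Connect to Redis (replace with your actual connection details)
r = redis.StrictRedis(
    host='your_redis_public_ip',
    port=6379,
    password='your_redis_password',
    db=0,
    decode_responses=True   # return str instead of bytes, so no manual decoding
)

# Fetch all proxies. Assumption: the pool is a hash whose fields are the
# CLIENT_NAME values and whose values are the proxy addresses, so read the
# values (hkeys would return the client names). The key name must match the
# one your adslproxy deployment writes to.
proxies = r.hvals('adslproxy:proxies')

# Pick one at random
if proxies:
    proxy = random.choice(proxies)
    print(f"Current available proxy: http://{proxy}")
else:
    print("No proxy is available right now!")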
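Once you have a proxy string, it has to be handed to your HTTP client. As a small sketch, assuming the pool stores addresses in "ip:port" form and that you use the third-party requests library, a hypothetical helper can build the mapping that requests expects:

```python
def to_requests_proxies(proxy: str) -> dict:
    """Turn 'ip:port' into the mapping accepted by the `proxies` argument of requests."""
    url = proxy if proxy.startswith("http://") else f"http://{proxy}"
    # The same HTTP proxy is used for both plain and TLS traffic.
    return {"http": url, "https": url}


# Example (requires `pip3 install requests` and a live proxy):
# import requests
# resp = requests.get("https://httpbin.org/ip",
#                     proxies=to_requests_proxies(proxy), timeout=10)
# print(resp.text)
```

Setting a timeout is worth the extra argument here, because a dial-up proxy can disappear mid-request when its host redials.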

7. FAQ & Best Practices

FAQ

  1. What should I do if dialing fails?
  • First run pppoe-stop; pppoe-start manually and watch the console for error messages;
  • Check the cloud host's broadband account, password, and PPPoE configuration files (ask customer service for the host-specific /etc/ppp/chap-secrets or /etc/ppp/peers/pppoe).
  2. Requests through the proxy are blocked?
  • On the cloud host, run curl -x http://127.0.0.1:3328 https://httpbin.org/ip to confirm that the proxy works locally;
  • Check that the firewall allows the proxy port (for example: firewall-cmd --zone=public --add-port=3328/tcp --permanent && firewall-cmd --reload).
  3. Cannot connect to Redis?
  • Check the bind directive in the Redis configuration file; change it to bind 0.0.0.0 or comment it out;
  • Confirm that the firewall has opened port 6379 and that the password is correct.
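For the second and third questions, a quick way to separate firewall problems from application problems is to test whether the TCP port is reachable at all from the machine that runs your crawler. A minimal sketch using only the standard library (the function name `port_is_open` is hypothetical):

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example: check the proxy port and the Redis port from outside.
# print(port_is_open("your_server_ip", 3328))
# print(port_is_open("your_redis_public_ip", 6379))
```

If the port is closed from outside but the local curl test succeeds, the firewall or the provider's security group is the likely cause.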

Best Practices

  1. Stagger dialing across hosts. Use adslproxy's DIAL_CYCLE environment variable to give each host a different dial period (for example 15min, 17min, 20min), so that the hosts do not all reconnect at the same moment and leave you temporarily without a proxy.

  2. Clean up expired proxies regularly. adslproxy removes unavailable proxies automatically, but you can also schedule a job that runs Redis HDEL periodically to clear any leftovers.

  3. Customize the availability test as needed. Use the PROXY_TEST_URL environment variable to replace the test URL with the site you actually want to crawl (such as https://www.taobao.com), so that a proxy that passes the test is known to work against your target.

  4. Protect the API. If the API service must be exposed to the public network, put it behind an Nginx reverse proxy and add simple token authentication or an IP allowlist to prevent abuse.
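The effect of the staggered periods in the first practice can be worked out with a little arithmetic. Assuming all hosts start dialing at the same moment and keep strictly to their periods, they all redial together again only at the least common multiple of those periods. The helper below is an illustrative sketch, not part of adslproxy.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def minutes_until_all_coincide(cycles):
    """Minutes until every host redials at the same moment again."""
    return reduce(lcm, cycles)


print(minutes_until_all_coincide([15, 17, 20]))  # 1020 minutes, i.e. 17 hours
print(minutes_until_all_coincide([15, 15, 15]))  # 15 minutes
```

With periods of 15, 17, and 20 minutes, all three hosts drop at once only every 17 hours, whereas identical 15-minute periods would empty the pool on every cycle. Pairs of hosts still coincide more often (15 and 20 meet every 60 minutes), which is one reason to run more than two hosts.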


8. Summary

Having worked through this tutorial, you have covered the entire process of building, from scratch, an ADSL dial-up proxy service that you control yourself, that is low-cost, and that draws on an IP pool in the tens of millions. This solution addresses the persistent pain points of crawling: paid proxies are expensive, shared proxies are dirty, and the number of IPs is a bottleneck. It is especially suited to collecting from platforms with strict anti-crawling measures, such as e-commerce, self-media, and recruitment websites.

Now you can focus on the crawler logic itself and leave the network identity problem to this dynamic ADSL proxy pool!