Python urllib Tutorial (2024 Edition)

When third-party libraries are not an option, Python's own urllib is the built-in tool for making HTTP requests. It needs no pip install and works right out of the box, which makes it a good fit for quick scripts, embedded devices, teaching demos, or any "standard library only" situation. Production projects typically use requests, but understanding urllib shows you what actually happens underneath an HTTP request.

This tutorial follows 2024 best practices to help you master the most commonly used urllib features. It starts from scratch, and every example can be run directly.


1. Get to know the four core modules

urllib is a package made up of four submodules, each with its own job:

  1. urllib.request – Opens URLs, sends requests, and reads responses. It is the main entry point to the package.
  2. urllib.error – Defines the exceptions raised by urllib.request (HTTPError, URLError), so you can catch them and make your code more robust.
  3. urllib.parse – Handles URL encoding, query-string building, and URL splitting; effectively a URL toolbox.
  4. urllib.robotparser – Parses robots.txt so a crawler can follow a site's rules (optional, but highly recommended).

Keep these four in mind; everything that follows revolves around them (mainly the first three).
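
To make the "URL toolbox" description concrete, here is a minimal sketch of urllib.parse that runs entirely offline. The URL is a hypothetical one chosen for illustration; it is never actually requested.

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# Hypothetical URL, used only for illustration (no request is sent).
url = "https://example.com/search?q=python+urllib&page=2"

# Split the URL into its components.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)

# parse_qs turns the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)

# Change one parameter and rebuild the URL.
query["page"] = ["3"]
new_url = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
print(new_url)
```

The doseq=True argument tells urlencode to expand the list values that parse_qs produces instead of encoding each list as a single string.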


2. Let’s start with the simplest GET request

GET is like entering the URL in the browser address bar and pressing Enter. It is the most commonly used HTTP method.

2.1 Three lines of code, plus automatic release of resources

Use urlopen() to open a URL. Pairing it with a with statement closes the network connection automatically and prevents resource leaks, the same golden rule that applies to files in Python.

from urllib.request import urlopen

# A public test API that returns a single to-do item
TEST_URL = "https://jsonplaceholder.typicode.com/todos/1"

with urlopen(TEST_URL) as response:
    # Inspect the status code and reason (e.g. 200 OK)
    print(f"Status: {response.status} | Reason: {response.reason}")
    
    # Read the Content-Type response header
    content_type = response.getheader('Content-Type')
    print(f"Content-Type: {content_type}")
    
    # Read the body (raw bytes)
    raw_data = response.read()
    # Decode to a string (UTF-8 covers the vast majority of cases)
    text = raw_data.decode("utf-8")
    print(f"\nResponse body:\n{text}")

Tip: Using with is like telling Python, "When I am done with this connection, close the door for me." Don't skip it!
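
Hard-coding UTF-8 works for most APIs, but a server may declare a different charset in its Content-Type header. The sketch below shows one way to honor that declaration. The helper name decode_body is hypothetical, and the header object is built by hand so the example runs offline; with a real response you would pass resp.read() and resp.headers instead.

```python
from email.message import Message

def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode raw bytes using the charset declared in Content-Type, if any."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors="replace")

# Offline demonstration with a hand-built header object.
headers = Message()
headers["Content-Type"] = "text/html; charset=iso-8859-1"
raw = "café".encode("iso-8859-1")
print(decode_body(raw, headers))

# With a live response (assumption: resp comes from urlopen):
#     text = decode_body(resp.read(), resp.headers)
```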

2.2 Put a "browser" cloak on the request

Many websites inspect the User-Agent header (the client identifier). By default urllib identifies itself as Python-urllib/3.x, and some sites block that outright. To change it, construct a Request object and add request headers to it.

from urllib.request import Request, urlopen

URL = "https://jsonplaceholder.typicode.com/todos/1"

# Create the Request object first
req = Request(URL)

# Add common request headers
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36")
req.add_header("Accept", "application/json")  # Tell the server we prefer JSON

# Then open it just like before
with urlopen(req) as resp:
    print(resp.read().decode("utf-8"))

With a User-Agent set, your request is much less likely to be rejected as an automated client.
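
Before sending anything, you can inspect a Request object to confirm what it will carry. The following is a small offline sketch; the User-Agent value my-script/1.0 is a made-up placeholder. Note that Request normalizes header names with str.capitalize(), so lookups need the form "User-agent".

```python
from urllib.request import Request

req = Request(
    "https://jsonplaceholder.typicode.com/todos/1",
    # "my-script/1.0" is a hypothetical identifier for illustration.
    headers={"User-Agent": "my-script/1.0", "Accept": "application/json"},
)

# No data was attached, so the method defaults to GET.
print(req.get_method())

# Header names are stored capitalized: "User-agent", "Accept".
print(req.header_items())
print(req.get_header("User-agent"))
```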


3. POST request: Submit data to the server

POST is commonly used for logging in, uploading forms, and writing data. The core step is to encode the form data and attach it to the request body.

3.1 Send a traditional form (application/x-www-form-urlencoded)

Most login endpoints use this format: key=value pairs joined with &, e.g. username=test&password=123. We generate it with urllib.parse.urlencode.

from urllib.request import Request, urlopen
from urllib.parse import urlencode

# Simulated login information
login_data = {
    "username": "test_user",
    "password": "test_pass_123",
    "remember": "true"
}

# Step 1: turn the dict into a URL-encoded string
encoded_str = urlencode(login_data)
# Step 2: convert the string to UTF-8 bytes (urllib only accepts bytes as POST data)
encoded_bytes = encoded_str.encode("utf-8")

# Step 3: create the Request, specifying the method and data
req = Request(
    url="https://jsonplaceholder.typicode.com/posts",  # POST endpoint for testing
    data=encoded_bytes,
    method="POST"   # POST is already the default when data is provided, but being explicit is clearer
)

# Step 4: tell the server we are sending form data
req.add_header("Content-Type", "application/x-www-form-urlencoded")
req.add_header("User-Agent", "Chrome/130.0.0.0")

with urlopen(req) as resp:
    print(resp.read().decode("utf-8"))

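Note: The form data must be encoded to bytes; passing a str makes urlopen raise a TypeError. When data is present and no Content-Type is set, urllib adds application/x-www-form-urlencoded on its own, but setting the header explicitly keeps the intent obvious and avoids surprises with strict backends.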
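
The same pattern extends to JSON request bodies, which many APIs expect instead of form data. The sketch below only builds the Request and does not send it; build_json_post is a hypothetical helper name, and the payload fields are illustrative.

```python
import json
from urllib.request import Request

def build_json_post(url: str, payload: dict) -> Request:
    """Build (but do not send) a POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")  # urllib needs bytes
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json; charset=utf-8"},
    )

req = build_json_post(
    "https://jsonplaceholder.typicode.com/posts",
    {"title": "hello", "userId": 1},  # illustrative payload
)
print(req.get_method())
print(req.data)

# To actually send it:
#     with urlopen(req, timeout=10) as resp:
#         print(resp.read().decode("utf-8"))
```

The only differences from the form example are the serializer (json.dumps instead of urlencode) and the Content-Type value.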


4. JSON responses: parse them with the built-in json module

Almost all modern APIs return JSON. Python's own json module handles it easily.

import json
from urllib.request import Request, urlopen

def get_todo(todo_id: int) -> dict:
    """Fetch a to-do item by ID and return it as a dict"""
    url = f"https://jsonplaceholder.typicode.com/todos/{todo_id}"
    req = Request(url, headers={"User-Agent": "Chrome/130.0.0.0"})
    
    with urlopen(req) as resp:
        # urlopen already raises HTTPError for 4xx/5xx; this catches other non-200 codes
        if resp.status != 200:
            raise Exception(f"Request failed, status: {resp.status}")
        # Read the bytes and parse them as JSON directly
        return json.loads(resp.read())

# Try it out
try:
    todo = get_todo(2)
    print(f"Title: {todo['title']}")
    print(f"Completed: {todo['completed']}")
except Exception as e:
    print(f"Something went wrong: {e}")

As you can see, json.loads() accepts bytes directly, so there is no need to decode to a string first; parsing happens in one step.
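
A quick offline check of that behavior, plus a defensive variant for responses that turn out not to be JSON (an HTML error page, for example). The helper name parse_json is hypothetical, and the sample bytes stand in for a real response body.

```python
import json
from typing import Any, Optional

def parse_json(raw: bytes) -> Optional[Any]:
    """Return the parsed JSON value, or None if the bytes are not valid JSON."""
    try:
        return json.loads(raw)
    except ValueError:
        # Covers json.JSONDecodeError and UnicodeDecodeError (both are ValueError subclasses).
        return None

# Sample bytes standing in for resp.read()
raw = b'{"id": 2, "title": "demo", "completed": false}'

todo = parse_json(raw)
print(todo == json.loads(raw.decode("utf-8")))  # bytes and str inputs parse identically
print(parse_json(b"<html>not json</html>"))
```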


5. Three frequently used practical techniques

5.1 Use a proxy

On an intranet or behind a restrictive firewall you often need to configure a proxy. urllib does this with ProxyHandler.

from urllib.request import ProxyHandler, build_opener, install_opener, urlopen

# Assume your proxy is running at 127.0.0.1:7890
PROXY = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890"
}

# Create the proxy handler
proxy_handler = ProxyHandler(PROXY)
# Build an opener
opener = build_opener(proxy_handler)
# Install it globally (every later urlopen call goes through the proxy)
install_opener(opener)

# Test: show the current egress IP
with urlopen("https://ifconfig.me/ip") as resp:
    print(f"IP seen through the proxy: {resp.read().decode('utf-8').strip()}")

If you only want to use the proxy for specific requests, skip the global install and call opener.open() instead of urlopen().
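
A minimal sketch of that per-request approach follows. The helper name make_proxy_opener is hypothetical, and the proxy address is the same assumed local address used above; building the opener does not touch the network, so the snippet runs even when no proxy is listening.

```python
from urllib.request import OpenerDirector, ProxyHandler, build_opener

def make_proxy_opener(proxy_url: str) -> OpenerDirector:
    """Build an opener that routes HTTP and HTTPS through proxy_url, without installing it globally."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)

# Assumed local proxy address, as in the example above.
opener = make_proxy_opener("http://127.0.0.1:7890")

# Confirm the opener carries our proxy configuration.
proxy_handlers = [h for h in opener.handlers if isinstance(h, ProxyHandler)]
print(proxy_handlers[0].proxies)

# Only requests made through this opener use the proxy:
#     with opener.open("https://ifconfig.me/ip", timeout=10) as resp:
#         print(resp.read().decode("utf-8").strip())
# Plain urlopen() calls elsewhere are unaffected.
```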

5.2 Skip SSL certificate verification (testing only!)

The development environment may use a self-signed certificate, causing the SSL handshake to fail. You can temporarily skip verification, but it is absolutely prohibited in production environments.

import ssl
from urllib.request import urlopen

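# Create a context that does not verify certificates
# (note: the leading underscore marks this as a private helper of the ssl module)
ctx = ssl._create_unverified_context()

# Test against a well-known self-signed test site
with urlopen("https://self-signed.badssl.com/", context=ctx) as resp:
    print(f"Request succeeded, status: {resp.status}")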

⚠️ Doing so exposes you to man-in-the-middle attacks; use it only for local debugging.
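
A safer option, where it is available, is to keep verification on and explicitly trust the self-signed certificate. The sketch below assumes you have that certificate as a PEM file; the helper name make_ssl_context and the file path in the comment are hypothetical.

```python
import ssl
from typing import Optional

def make_ssl_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying context that optionally trusts an extra CA or self-signed cert file."""
    ctx = ssl.create_default_context()  # verification and hostname checks stay enabled
    if cafile is not None:
        ctx.load_verify_locations(cafile=cafile)
    return ctx

ctx = make_ssl_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
print(ctx.check_hostname)

# With a development certificate (hypothetical path):
#     ctx = make_ssl_context("dev-server-cert.pem")
#     with urlopen("https://localhost:8443/", context=ctx) as resp:
#         ...
```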

5.3 Timeout setting: Don’t keep requests waiting

By default urlopen has no timeout. If the other side never responds, your program can hang indefinitely. Pass the timeout parameter to set the maximum number of seconds to wait.

from urllib.request import urlopen

try:
    with urlopen("https://httpbin.org/delay/10", timeout=5) as resp:
        print(resp.read().decode())
except Exception as e:
    print(f"Timeout or other error: {e}")

In practice, a timeout of 10 seconds or less is a reasonable value for most calls.
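
One subtlety: depending on where the timeout happens, it can typically surface either wrapped in a URLError (for example while connecting) or as a bare timeout exception (while waiting for or reading the response). The sketch below is one way to recognize both; is_timeout is a hypothetical helper, and the exceptions are constructed by hand so the example runs offline.

```python
import socket
from urllib.error import URLError

def is_timeout(exc: BaseException) -> bool:
    """Return True if exc represents a timeout, whether or not it is wrapped in URLError."""
    timeout_types = (TimeoutError, socket.timeout)
    if isinstance(exc, URLError):
        return isinstance(exc.reason, timeout_types)
    return isinstance(exc, timeout_types)

# Hand-built exceptions standing in for real failures.
print(is_timeout(URLError(socket.timeout("timed out"))))  # wrapped timeout
print(is_timeout(socket.timeout("timed out")))            # bare timeout
print(is_timeout(URLError("name resolution failed")))     # some other network error
```

In a real script you would call it inside an except block, for instance `except Exception as e: if is_timeout(e): ...`.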


6. Exception handling: make the code more robust

Network requests are full of pitfalls: disconnection, 404, 500, timeout... Code without exception capture is a time bomb.

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
import socket

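def safe_get(url: str, timeout: int = 10) -> str | None:
    """GET request helper with full exception handling"""
    req = Request(url, headers={"User-Agent": "Chrome/130.0.0.0"})
    
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as e:
        # The server returned an error status code (4xx, 5xx)
        print(f"HTTP error: {e.code} - {e.reason}")
        # Error responses sometimes carry a body too; e.read() returns it
        return None
    except URLError as e:
        # Network-level errors: DNS failure, connection refused, etc.
        if isinstance(e.reason, socket.timeout):
            print(f"Request timed out (>{timeout}s)")
        else:
            print(f"URL error: {e.reason}")
        return None
    except socket.timeout:
        # A timeout while reading the response may be raised without a URLError wrapper
        print(f"Request timed out (>{timeout}s)")
        return None
    except Exception as e:
        # Fallback: any other unexpected exception
        print(f"Unexpected error: {e}")
        return None

# A few typical failure scenarios
safe_get("https://jsonplaceholder.typicode.com/todos/999999")  # likely 404
safe_get("https://this-domain-does-not-exist.invalid")         # URLError
safe_get("https://httpbin.org/delay/15", timeout=5)            # timeout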

With solid exception handling, your program can fail gracefully instead of crashing outright.
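
The comment about e.read() deserves a concrete illustration, since API error responses often explain what went wrong. The sketch below builds an HTTPError by hand so it runs offline; describe_http_error is a hypothetical helper, and the URL and error body are invented for the example.

```python
import io
from email.message import Message
from urllib.error import HTTPError

def describe_http_error(e: HTTPError, limit: int = 200) -> str:
    """Summarize an HTTPError, including the start of its response body."""
    body = e.read().decode("utf-8", errors="replace")
    return f"{e.code} {e.reason}: {body[:limit]}"

# Simulated 404 with a JSON error body (all values are made up).
hdrs = Message()
hdrs["Content-Type"] = "application/json"
fake_error = HTTPError(
    "https://example.com/missing", 404, "Not Found", hdrs,
    io.BytesIO(b'{"error": "no such item"}'),
)
print(describe_http_error(fake_error))
```

Inside safe_get, the same helper could replace the plain print in the HTTPError branch.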


7. Best Practice Checklist for 2024

  1. Always manage connections with a with statement
  2. Always set a User-Agent
  3. Always specify a timeout (3 to 10 seconds is usually appropriate)
  4. Parse JSON directly with json.loads()
  5. Check robots.txt before crawling; urllib.robotparser helps you stay within a site's rules
  6. For more complex needs (session persistence, file uploads), switch to requests to save time and effort
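
Item 5 is the only practice not demonstrated elsewhere in this tutorial, so here is a minimal sketch of urllib.robotparser. The rules are fed in as text so the example runs offline; the crawler name my-crawler and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (made up for illustration).
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# For a live site you would instead do:
#     rp.set_url("https://example.com/robots.txt")
#     rp.read()

print(rp.can_fetch("my-crawler", "https://example.com/private/data"))
print(rp.can_fetch("my-crawler", "https://example.com/public/page"))
```

Calling can_fetch() before each request is a cheap way to keep a crawler inside the limits a site has published.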

8. urllib vs requests: a comparison at a glance

Feature               | urllib (built-in)                                      | requests (third-party)
Installation          | No installation required                               | pip install requests
Code volume           | Relatively verbose                                     | Concise and elegant
Functionality         | Covers the basics                                      | Extremely rich (Keep-Alive, cookie persistence, OAuth, etc.)
Recommended scenarios | Teaching, restricted environments, single-file scripts | 99% of real-world projects

Achieving the same thing with requests takes far less code:

import requests

# GET + JSON parsing
todo = requests.get("https://jsonplaceholder.typicode.com/todos/1").json()
print(todo["title"])

# POST
resp = requests.post("https://jsonplaceholder.typicode.com/posts",
                     data={"username": "test", "password": "123"},
                     headers={"User-Agent": "Chrome/130.0.0.0"},
                     timeout=10)
print(resp.status_code)

But once you have studied urllib, reading the requests source code becomes much clearer: the basic ideas behind the two are largely the same.


9. Hands-on exercise: Obtaining weather information

Below we use urllib to call the free wttr.in weather service, which returns a one-line weather summary. The exercise ties together GET requests, header settings, and exception handling.

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from urllib.parse import quote

def get_weather(city: str) -> str | None:
    """Return a one-line weather summary for the city (format: city: condition temperature)"""
    # quote() percent-encodes spaces and non-ASCII characters in the city name
    url = f"https://wttr.in/{quote(city)}?format=3"   # ?format=3 returns a single line of text
    req = Request(url, headers={"User-Agent": "curl/7.68.0"})
    
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8").strip()
    except HTTPError as e:
        print(f"The city name may not exist ({e.code})")
        return None
    except URLError as e:
        print(f"Network connection error: {e.reason}")
        return None

if __name__ == "__main__":
    city_name = input("Enter a city name (pinyin or English): ")
    weather = get_weather(city_name)
    if weather:
        print(f"\n{weather}")

Run it and enter beijing or london to see a one-line weather summary. It is a practical way to consolidate what you have learned.


Summary

urllib is the "original HTTP toolbox" of the Python world. It may not be fancy, but it is reliable, dependency-free, and available everywhere. Once you master it, you can handle HTTP tasks in standard-library-only environments and gain a deeper understanding of the HTTP protocol itself.

The advice for 2024 is clear: **use urllib for small scripts and requests for serious projects**. Whichever tool you use, remember exception handling, timeouts, and a User-Agent, and your network code is already more than halfway there.