Python Hash Algorithm Security Practice Guide

1. What is a hash algorithm?

A hash algorithm (also called a digest algorithm or hash function) is a basic building block of computer security. It converts input data of arbitrary length (a sentence, a picture, a software installation package) into a fixed-length "digital fingerprint". This fingerprint is called a hash value or digest and is usually written in hexadecimal.

Don’t get hung up on the mathematical details behind it, just remember its 5 core security features:

  • Determinism: The same input always produces the same result, no matter when or on which machine it is calculated.
  • High-speed calculation (normal scenario): Fingerprints can be calculated in milliseconds to seconds for files ranging from several KB to several GB.
  • 🔒 Irreversibility in practice (for secure algorithms): Given only the fingerprint, recovering the original data is computationally infeasible.
  • 🦋 Butterfly effect (avalanche effect): Changing even one character of the input produces a completely different fingerprint.
  • 🎯 Extremely low collision probability (for secure algorithms): In practice, finding two different inputs that produce the same fingerprint is infeasible.
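Three of these properties (determinism, fixed output length, and the avalanche effect) are easy to observe directly. The snippet below is a minimal sketch using SHA-256; the helper name `sha256_hex` and the sample strings are chosen here for illustration only.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 fingerprint of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


a = sha256_hex(b"hello world")
b = sha256_hex(b"hello world")
c = sha256_hex(b"hello worle")  # only the last character differs

# Determinism: identical input gives an identical fingerprint
print(a == b)

# Fixed length: 64 hex characters whether the input is empty or 1 MB
print(len(sha256_hex(b"")), len(sha256_hex(b"x" * 1_000_000)))

# Avalanche effect: count how many of the 256 output bits changed
diff_bits = bin(int(a, 16) ^ int(c, 16)).count("1")
print(f"{diff_bits} of 256 bits differ")
```

For a well-behaved hash, roughly half of the output bits are expected to flip after a one-character change, so the last line should report a number in the neighbourhood of 128.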

2. hashlib quick start

The Python standard library module hashlib provides the mainstream hashing algorithms. There is no need to install third-party libraries; it works out of the box.

2.1 Basic usage: single piece of text hashing

All algorithms follow the same three steps: create the object -> feed it data -> read the fingerprint. For short text, you can chain the calls directly:

import hashlib

# Note: the input must be of type bytes
text = b"Python is the best language for security?"

# MD5 (128 bits / 16 bytes → 32 hex characters)
# ⚠️ Broken and obsolete; not suitable for any security scenario
md5_digest = hashlib.md5(text).hexdigest()
print("MD5:", md5_digest)

# SHA-1 (160 bits / 20 bytes → 40 hex characters)
# ⚠️ Also not recommended for security scenarios such as digital signatures
sha1_digest = hashlib.sha1(text).hexdigest()
print("SHA-1:", sha1_digest)

# SHA-256 (256 bits / 32 bytes → 64 hex characters)
# ✅ The current general-purpose secure choice
sha256_digest = hashlib.sha256(text).hexdigest()
print("SHA-256:", sha256_digest)

# SHA-512 (512 bits / 64 bytes → 128 hex characters)
# ✅ Higher security margin (slightly slower, but unnoticeable in ordinary use)
sha512_digest = hashlib.sha512(text).hexdigest()
print("SHA-512:", sha512_digest)
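The chained calls hide the three steps, so it can help to see them written out. The sketch below splits the same input across two `update()` calls (the split point of 10 bytes is arbitrary) and checks that the result matches the one-shot form.

```python
import hashlib

text = b"Python is the best language for security?"

# Step 1: create the hash object
hasher = hashlib.sha256()

# Step 2: feed it data; update() may be called any number of times
hasher.update(text[:10])
hasher.update(text[10:])

# Step 3: read the fingerprint
step_by_step = hasher.hexdigest()

# The one-shot form hashes the same bytes in a single call
one_shot = hashlib.sha256(text).hexdigest()
print(step_by_step == one_shot)  # True

# digest() returns the same value as raw bytes instead of hex text
print(hasher.digest().hex() == step_by_step)  # True
```

Feeding the data in pieces is equivalent to hashing the concatenation of those pieces, which is what makes incremental hashing of large inputs possible.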

:::tip Tips
If you want to see which algorithms your system supports, you can use:

  • hashlib.algorithms_available: all algorithms available in the current environment (including extras supplied by the local system)
  • hashlib.algorithms_guaranteed: the algorithms guaranteed to be available on every platform

:::
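Both attributes are sets of algorithm names, so ordinary set operations work on them. A small sketch follows; the helper `is_portable` is a name made up for this example, and the printed lists depend on your Python build.

```python
import hashlib

# Names that every Python installation is documented to support
print(sorted(hashlib.algorithms_guaranteed))

# Names that exist only because of the local build (may be an empty list)
extras = hashlib.algorithms_available - hashlib.algorithms_guaranteed
print(sorted(extras))


def is_portable(name: str) -> bool:
    """Return True if the algorithm name is guaranteed on all platforms."""
    return name in hashlib.algorithms_guaranteed


print(is_portable("sha256"))
print(is_portable("not-a-real-algorithm"))
```

Checking against `algorithms_guaranteed` before calling `hashlib.new()` is one way to keep code portable across machines.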

3. Hash processing of large files/streaming data

If you read() a video file of several GB into memory in one go and only then compute its hash, memory use spikes and the machine may well freeze. This is where hashlib's incremental update feature comes in.

import hashlib

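def compute_file_hash(file_path: str, algorithm: str = "sha256") -> str:
    """
    Compute the hash of a large file in chunks to avoid exhausting memory.

    :param file_path: path to the file
    :param algorithm: a hash algorithm guaranteed on all platforms (sha256 / sha512 recommended)
    :return: hexadecimal hash value
    """
    # SHAKE algorithms need an explicit output length for hexdigest(), so reject them here
    if algorithm not in hashlib.algorithms_guaranteed or algorithm.startswith("shake_"):
        raise ValueError(f"Unsupported guaranteed algorithm: {algorithm}")

    hasher = hashlib.new(algorithm)
    chunk_size = 8192  # read 8 KB at a time, balancing speed and memory use

    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):   # walrus operator, Python 3.8+
            hasher.update(chunk)

    return hasher.hexdigest()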


if __name__ == "__main__":
    try:
        file_hash = compute_file_hash("test.txt")
        print(f"SHA-256 fingerprint of test.txt: {file_hash}")
    except FileNotFoundError:
        print("File not found; please create test.txt first")
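A quick way to convince yourself that chunked hashing is safe is to compare it with hashing the same bytes in one call. The sketch below writes random data to a temporary file; the function name `chunked_sha256` and the 100 KB payload size are assumptions made for this illustration.

```python
import hashlib
import os
import tempfile


def chunked_sha256(path: str, chunk_size: int = 8192) -> str:
    """Hash a file by reading it chunk_size bytes at a time."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


# 100,000 bytes is deliberately not a multiple of the chunk size
payload = os.urandom(100_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    streamed = chunked_sha256(tmp_path)
    tiny_chunks = chunked_sha256(tmp_path, chunk_size=7)
    in_memory = hashlib.sha256(payload).hexdigest()
    print(streamed == in_memory)     # True
    print(tiny_chunks == in_memory)  # True: chunk size does not affect the result
finally:
    os.remove(tmp_path)
```

The chunk size only affects speed and memory use, never the fingerprint. On Python 3.11 and later, the standard library also offers `hashlib.file_digest()`, which does the chunked reading for you.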

4. Password storage: from “falling into pitfalls” to “avoiding pitfalls”

Password storage is the most common security scenario for hashing algorithms, but you must never use basic hashing (even SHA‑256) directly! The following is a step-by-step upgrade from the worst implementation to the industry recommended solution.

❌ Pitfall 1: Storing passwords in plain text

Once the database leaks, every user password is exposed at a glance. This mistake is rare nowadays, but it is still worth guarding against.

❌ Pitfall 2: Applying a basic hash only once

An attacker can use a rainbow table (in essence a large precomputed dictionary of "common password → basic hash") to look up the plaintext almost instantly.

import hashlib

fake_db = {}

def unsafe_save_password(username: str, plain_pwd: str) -> None:
    """Store the password with plain SHA-256 only; weak passwords fall to a rainbow table immediately."""
    pwd_bytes = plain_pwd.encode("utf-8")
    fake_db[username] = hashlib.sha256(pwd_bytes).hexdigest()

unsafe_save_password("bob", "123456")
print(fake_db)  # a rainbow-table lookup reveals that the password is 123456
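To make the attack concrete, here is a simplified sketch of the attacker's side. It uses a plain precomputed lookup table rather than a real rainbow table (which trades computation for storage using hash chains), but the effect on unsalted hashes is the same. The password list and the leaked database are hypothetical.

```python
import hashlib


def sha256_hex(password: str) -> str:
    """Unsalted SHA-256 of a password, as stored by the unsafe scheme."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Hypothetical attacker-side table: computed once, reused against every leak
common_passwords = ["123456", "password", "qwerty", "letmein"]
lookup_table = {sha256_hex(p): p for p in common_passwords}

# Hypothetical leaked database: username -> unsalted SHA-256 hash
leaked_db = {
    "bob": sha256_hex("123456"),
    "erin": sha256_hex("x9#Lq!vT2m"),
}

# Cracking is a dictionary lookup per user; no brute force is needed
cracked = {
    user: lookup_table[pwd_hash]
    for user, pwd_hash in leaked_db.items()
    if pwd_hash in lookup_table
}
print(cracked)  # {'bob': '123456'}
```

Because the hash depends only on the password, one table works against every user and every leaked database that uses the same scheme.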

⚠️ Transitional version: salted hashing

A salt is a randomly generated byte string that is different for each user. When saving a password, concatenate the salt and the password and hash the result; do the same when verifying. Even if an attacker obtains the database, a precomputed rainbow table no longer works: a separate table would have to be built for each user, which greatly increases the cost.

import os
import hashlib

fake_db_salted = {}

def salted_save_password(username: str, plain_pwd: str) -> None:
    """Generate a unique 16-byte salt for each user and store it alongside the hash."""
    # os.urandom yields cryptographically secure random bytes; the plain random module is not safe here
    salt = os.urandom(16)
    salted_pwd = salt + plain_pwd.encode("utf-8")
    hash_val = hashlib.sha256(salted_pwd).hexdigest()

    # The salt must be stored with the hash, otherwise verification is impossible
    fake_db_salted[username] = {
        "salt": salt.hex(),
        "hash": hash_val
    }

def salted_verify_password(username: str, plain_pwd: str) -> bool:
    """Verify a password against its salted hash."""
    if username not in fake_db_salted:
        return False

    user_data = fake_db_salted[username]
    salt = bytes.fromhex(user_data["salt"])
    input_salted_pwd = salt + plain_pwd.encode("utf-8")
    input_hash = hashlib.sha256(input_salted_pwd).hexdigest()

    return input_hash == user_data["hash"]

# Example
salted_save_password("charlie", "abc123!@#")
print(salted_verify_password("charlie", "abc123!@#"))  # True
print(salted_verify_password("charlie", "wrong"))      # False
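The effect of the salt is easiest to see when two users pick the same password. In this sketch the helper `salted_sha256` is a hypothetical name that mirrors the scheme above (salt bytes followed by the UTF-8 password).

```python
import hashlib
import os


def salted_sha256(password: str, salt: bytes) -> str:
    """Hash salt + password with SHA-256, as in the salted scheme."""
    return hashlib.sha256(salt + password.encode("utf-8")).hexdigest()


password = "123456"      # two users happen to choose the same weak password
salt_a = os.urandom(16)  # salt for user A
salt_b = os.urandom(16)  # salt for user B

hash_a = salted_sha256(password, salt_a)
hash_b = salted_sha256(password, salt_b)
unsalted = hashlib.sha256(password.encode("utf-8")).hexdigest()

# Same password, different stored hashes
print(hash_a != hash_b)

# Neither matches the unsalted hash found in a generic precomputed table
print(hash_a != unsalted and hash_b != unsalted)

# Verification still works because the salt is stored with the hash
print(salted_sha256(password, salt_a) == hash_a)
```

All three lines should print True; two independent 16-byte salts colliding is possible in principle but has negligible probability.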

Although this is safer than the previous two approaches, a basic hash is too fast: modern GPUs can compute SHA-256 billions of times per second, so brute-forcing weak passwords is still only a matter of time. What we need instead is a "slow hash".

✅ Advanced version: Use built-in PBKDF2

PBKDF2 (Password-Based Key Derivation Function 2) is a slow hash function designed specifically for passwords. By repeating the underlying hash many times (for example 100,000 iterations), it makes every guess correspondingly more expensive, stretching a brute-force attack from seconds to days or even years.

import os
from hashlib import pbkdf2_hmac

fake_db_pbkdf2 = {}

# Global configuration parameters, kept in one place for easy tuning
PBKDF2_ALGORITHM = "sha256"
PBKDF2_ITERATIONS = 100_000   # at least 100,000; raise toward a million if the hardware allows
PBKDF2_KEY_LENGTH = 32        # 32-byte output (64 hex characters)
PBKDF2_SALT_LENGTH = 16       # salt of at least 16 bytes

def pbkdf2_save_password(username: str, plain_pwd: str) -> None:
    """Store a password using the PBKDF2 slow hash."""
    salt = os.urandom(PBKDF2_SALT_LENGTH)
    hash_key = pbkdf2_hmac(
        PBKDF2_ALGORITHM,
        plain_pwd.encode("utf-8"),
        salt,
        PBKDF2_ITERATIONS,
        PBKDF2_KEY_LENGTH
    )

    fake_db_pbkdf2[username] = {
        "salt": salt.hex(),
        "hash": hash_key.hex(),
        "algorithm": PBKDF2_ALGORITHM,
        "iterations": PBKDF2_ITERATIONS,
        "key_length": PBKDF2_KEY_LENGTH
    }

def pbkdf2_verify_password(username: str, plain_pwd: str) -> bool:
    """Verify a PBKDF2-hashed password."""
    if username not in fake_db_pbkdf2:
        return False

    user_data = fake_db_pbkdf2[username]
    salt = bytes.fromhex(user_data["salt"])
    input_hash_key = pbkdf2_hmac(
        user_data["algorithm"],
        plain_pwd.encode("utf-8"),
        salt,
        user_data["iterations"],
        user_data["key_length"]
    )

    # Compare the hashes (in production, use a dedicated constant-time comparison to prevent timing attacks)
    return input_hash_key.hex() == user_data["hash"]
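The comment on the last line asks for a constant-time comparison in production. One option in the standard library is `hmac.compare_digest`, shown in the minimal sketch below. The helper names `derive` and `verify` are made up for this example, and the parameters simply reuse the figures from the code above.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # same figure as the PBKDF2 example above


def derive(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, 32
    )


def verify(password: str, salt: bytes, expected: bytes) -> bool:
    """Compare derived and stored keys without leaking where they differ."""
    return hmac.compare_digest(derive(password, salt), expected)


salt = os.urandom(16)
stored = derive("StrongPassw0rd!2024", salt)

print(verify("StrongPassw0rd!2024", salt, stored))  # True
print(verify("weakpass", salt, stored))             # False
```

An ordinary `==` may return as soon as it finds the first differing byte, so its running time can hint at how much of a guess was correct; `compare_digest` is designed to avoid that leak.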

PBKDF2 is already a solid choice, but Argon2, the winner of the Password Hashing Competition (PHC), and the long-proven bcrypt are better still. Their libraries take care of salt generation, parameter encoding, and constant-time comparison for you, which makes them easier to use correctly.

Here we take the widely used bcrypt as an example (Argon2 is available through the argon2-cffi library):

pip install bcrypt

import bcrypt

fake_db_bcrypt = {}

def bcrypt_save_password(username: str, plain_pwd: str) -> None:
    """
    bcrypt generates the salt and records its configuration automatically;
    the returned string contains everything needed for verification.
    rounds=12 means 2^12 = 4096 iterations; raise it to 14 if the hardware allows.
    """
    salt = bcrypt.gensalt(rounds=12)
    hashed_pwd = bcrypt.hashpw(plain_pwd.encode("utf-8"), salt)
    fake_db_bcrypt[username] = hashed_pwd.decode("utf-8")  # store as a string

def bcrypt_verify_password(username: str, plain_pwd: str) -> bool:
    """bcrypt extracts the salt and cost from the stored string and compares safely."""
    if username not in fake_db_bcrypt:
        return False

    stored_pwd = fake_db_bcrypt[username].encode("utf-8")
    return bcrypt.checkpw(plain_pwd.encode("utf-8"), stored_pwd)


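# Wrap the same logic in a user-management class
class BcryptUserManager:
    def __init__(self):
        self._users = {}  # username -> bcrypt hash string

    def register(self, username: str, plain_pwd: str) -> None:
        if username in self._users:
            raise ValueError(f"Username '{username}' already exists")
        # Store in self._users so that the duplicate check above actually works
        hashed_pwd = bcrypt.hashpw(plain_pwd.encode("utf-8"), bcrypt.gensalt(rounds=12))
        self._users[username] = hashed_pwd.decode("utf-8")

    def login(self, username: str, plain_pwd: str) -> bool:
        stored_pwd = self._users.get(username)
        if stored_pwd is None:
            return False
        return bcrypt.checkpw(plain_pwd.encode("utf-8"), stored_pwd.encode("utf-8"))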


if __name__ == "__main__":
    manager = BcryptUserManager()
    manager.register("david", "StrongPassw0rd!2024")
    print(manager.login("david", "StrongPassw0rd!2024"))  # True
    print(manager.login("david", "weakpass"))              # False

:::danger The iron law of password storage

  • Password storage **must only use** dedicated slow hash functions (PBKDF2, bcrypt, Argon2)
  • Each password must have a unique, cryptographically secure random salt
  • The number of iterations / work factor must be set high enough to make brute-force cracking extremely slow
  • Never use MD5 / SHA-1 / raw SHA-256 for password storage

:::

5. Other security/non-security uses of hashing algorithms

In addition to password storage, hashing algorithms have a variety of legitimate applications:

  1. Data integrity verification: After downloading a file, compare its hash with the officially published SHA-256 fingerprint to confirm that the file was not tampered with or corrupted in transit.
  2. Data deduplication: The "instant upload" feature of cloud drives first computes the file's fingerprint. If the server already holds a file with the same fingerprint, it links that copy to your account and skips the upload.
  3. Digital signature pre-hashing: Digital signatures usually do not sign the original file directly. They sign the file's hash value instead, because the file may be large and signature algorithms are slow.
  4. Blockchain: Each block contains the hash value of the previous block, forming a chain in which any modification of an earlier block is detectable.
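The chain structure in the last item can be illustrated in a few lines. This is a toy sketch, not how any real blockchain is implemented: the block layout, the JSON encoding, and the names `block_hash`, `build_chain`, and `is_valid` are all assumptions made for the example.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(block: dict) -> str:
    """Hash a block's contents using a stable JSON encoding."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_chain(payloads: list) -> list:
    """Create blocks where each one records the hash of its predecessor."""
    chain = []
    prev = GENESIS_PREV
    for index, data in enumerate(payloads):
        block = {"index": index, "data": data, "prev_hash": prev}
        chain.append(block)
        prev = block_hash(block)
    return chain


def is_valid(chain: list) -> bool:
    """Recompute every link; any edited earlier block breaks the next link."""
    prev = GENESIS_PREV
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        prev = block_hash(block)
    return True


chain = build_chain(["genesis", "alice pays bob 5", "bob pays carol 2"])
print(is_valid(chain))  # True

chain[1]["data"] = "alice pays bob 500"  # tamper with a middle block
print(is_valid(chain))  # False
```

Editing block 1 changes its hash, so the `prev_hash` recorded in block 2 no longer matches. Note that this toy check cannot detect a change to the very last block, since nothing refers to its hash yet; real systems add further mechanisms on top.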

6. Summary

hashlib is a very useful part of the Python standard library, but it only delivers its value when used in the right place:

| Scenario | Recommended algorithm |
| --- | --- |
| File deduplication, fast verification | MD5 / SHA-1 (fast) |
| General security scenarios (digital signatures, etc.) | SHA-256 / SHA-512 |
| Password storage | bcrypt / Argon2 / PBKDF2 |

Remember these red lines and you can avoid most pitfalls:

  • 🚫 Don’t use MD5 / SHA‑1 for security related work
  • 🧂 Each password must be equipped with a unique random salt
  • 🐢 The password must use slow hashing, and the number of iterations must be high enough
  • 🔑 Force users to set strong passwords and use two-step authentication (2FA) if possible

I hope this guide can help you use hashing algorithms safely and efficiently.