Python Serialization and Deserialization Guide

1. Serialization overview

While a program is running, all of its variables, whether simple dictionaries and lists or instances of complex custom classes, live only in memory. Once the program exits, the operating system reclaims that memory and the data is gone.

In real projects, though, we often need to keep data around:

Save game progress to a save file and pick it up again next time;
Cache data fetched by a crawler locally to avoid repeated requests;
Pass structured information through APIs between front end and back end, or between microservices.

To do that, the live objects in memory must be converted into a format that can be stored (written to a file or a database) or transmitted (sent over the network). This process is called serialization. Python's binary serialization module calls it pickling, as in preserving the data for later.

The reverse, restoring that flattened data into objects that can be used directly in memory, is deserialization (unpickling).

Next, let’s take a look at the two most commonly used serialization schemes in Python.

2. Python pickle module

pickle is the simplest and most direct binary serialization tool that ships with Python. It handles almost all of Python's built-in types, and even instances of custom classes (it stores the instance's state plus a reference to the class, not the class's code).

2.1 Basic usage

There are only four core APIs, which are easy to remember:

dumps(obj): serialize the object into a bytes object;
dump(obj, file): serialize the object and write it directly to a file object;
loads(bytes): deserialize a bytes object back into an object;
load(file): deserialize from a file object back into an object.

import pickle

# A dictionary mixing several types; pickle handles all of them
game_save = {
    "player": "小明",
    "level": 15,
    "inventory": ["Iron Sword", "Health Potion x3", "120 Gold"],
    "hp": 88.5,
    "is_alive": True
}

# 1. Serialize to a bytes object in memory
serialized_bytes = pickle.dumps(game_save)
print(f"Serialized bytes (first 20): {serialized_bytes[:20]}")

# 2. Write directly to a local file (always use binary write mode 'wb')
with open("game_save.pkl", "wb") as f:
    pickle.dump(game_save, f)

# 3. Deserialize from the local file (binary read mode 'rb')
with open("game_save.pkl", "rb") as f:
    loaded_save = pickle.load(f)

print(f"Loaded player: {loaded_save['player']}, level {loaded_save['level']}")

2.2 Limitations of pickle

pickle is convenient, but its applicable scenarios are narrow and it has three main flaws:

Warning: core limitations

Python only: the binary data produced by pickle is not understood by other languages (Java, Go, JavaScript, etc.), so it can only be used to exchange data between Python programs.
Fragile across versions: pickle's protocols are backward compatible, so a newer Python can generally read data written by an older one, but the reverse may fail when the writer used a protocol the older interpreter does not support. Data crossing the Python 2.x / 3.x boundary needs extra care, and because a pickle refers to classes by module and name, renaming or restructuring your own classes can make old archives unreadable.
High-risk security hole: **Never deserialize pickle data from untrusted sources!** Loading a pickle means running a stream of instructions on pickle's own small virtual machine, and those instructions can import and call arbitrary Python callables. A maliciously constructed pickle file can therefore run arbitrary system commands: delete files, steal private data, or take over the machine.
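To make the security point concrete, here is a harmless sketch of the mechanism. The class names `Innocent` and `RestrictedUnpickler` and the helper `restricted_loads` are made up for illustration; the `__reduce__` hook and the `find_class` override are standard pickle features. The payload only calls `sorted`, but an attacker could name any importable callable instead.

```python
import io
import pickle


class Innocent:
    """Looks like plain data, but tells pickle what to call at load time."""

    def __reduce__(self):
        # pickle records "call sorted([3, 1, 2])" in the byte stream.
        # A real attack would reference something like os.system here.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Innocent())
result = pickle.loads(payload)
# The loaded value is whatever the callable returned, not an Innocent instance.
print(result)  # [1, 2, 3]


class RestrictedUnpickler(pickle.Unpickler):
    """Rejects every global lookup, so only plain built-in data can load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


plain = restricted_loads(pickle.dumps({"level": 15}))
print(plain)  # {'level': 15}

try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(f"Blocked: {exc}")
```

Restricting `find_class` narrows the attack surface but should not be read as making untrusted pickles safe; for data from outside, a format such as JSON is the more cautious choice.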

3. JSON serialization

JSON (JavaScript Object Notation) is currently the most widely used cross-language text serialization format. Python, JavaScript, Java, Go, and other mainstream languages all support it out of the box, and because JSON is plain text it is also easy for humans to read.

3.1 Data type correspondence table

JSON is a "lightweight" format that supports only six basic types. Python's standard library json module maps types automatically during serialization and deserialization:

JSON type	Python type
`{}` (object)	`dict`
`[]` (array)	`list` / `tuple` (a tuple comes back as a `list` after deserialization)
`"string"` (string)	`str`
`1234.56` (number)	`int` or `float`
`true` / `false`	`True` / `False`
`null`	`None`
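A short round trip makes the mapping above visible, including the two conversions that tend to surprise people: tuples come back as lists, and non-string dictionary keys come back as strings. The sample dictionary is invented for illustration.

```python
import json

original = {
    "point": (1, 2),   # tuple -> JSON array
    "ratio": 0.5,      # float -> JSON number
    "active": True,    # True -> true
    "note": None,      # None -> null
    1: "int key",      # non-string keys are converted to strings
}

text = json.dumps(original)
print(text)
# {"point": [1, 2], "ratio": 0.5, "active": true, "note": null, "1": "int key"}

restored = json.loads(text)
print(type(restored["point"]))  # <class 'list'>
print("1" in restored, 1 in restored)  # True False
```

Because of these conversions, `json.loads(json.dumps(x))` is not guaranteed to equal `x`.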

3.2 Basic usage

The json module's API closely mirrors pickle's: there are still four core functions, but they work with text (str, or files opened in text mode) rather than bytes:

import json

# A Python dictionary that follows JSON's rules
api_data = {
    "code": 200,
    "msg": "success",
    "data": {
        "user_id": 10086,
        "username": "Alice",
        "favorites": ["Python", "读书", "旅行"]
    }
}

# 1. Serialize to a JSON string (text)
json_str = json.dumps(api_data)
print(f"Serialized JSON text: {json_str}")

# 2. Pretty-print: indent sets the number of spaces per level for readability
pretty_json = json.dumps(api_data, indent=2)
print(f"Formatted JSON:\n{pretty_json}")

# 3. Write directly to a local file (text write mode 'w'; pass the encoding explicitly)
with open("api_response.json", "w", encoding="utf-8") as f:
    json.dump(api_data, f, indent=2, ensure_ascii=False)  # ensure_ascii is explained in 3.3

# 4. Deserialize a JSON string back into Python objects
loaded_data = json.loads(json_str)
print(f"Loaded user ID: {loaded_data['data']['user_id']}")

# 5. Deserialize from the local file
with open("api_response.json", "r", encoding="utf-8") as f:
    loaded_from_file = json.load(f)

3.3 Handling Chinese characters (a practical tip)

Tip (displaying Chinese text): By default, json.dumps escapes non-ASCII characters (such as Chinese) into \uXXXX form. Programs on either end parse this correctly, but it is hard for humans to read. Adding ensure_ascii=False keeps the original characters. Also remember to pass encoding="utf-8" explicitly when reading and writing files, to avoid garbled text.

chinese_data = {"name": "小红", "address": "北京市朝阳区"}

# Default behavior: Chinese characters are escaped
print(json.dumps(chinese_data))
# Output: {"name": "\u5c0f\u7ea2", "address": "\u5317\u4eac\u5e02\u671d\u9633\u533a"}

# Human-readable version that keeps the Chinese characters
print(json.dumps(chinese_data, ensure_ascii=False))
# Output: {"name": "小红", "address": "北京市朝阳区"}

4. Serialize custom objects

By default, the json module cannot serialize instances of custom classes; we have to supply our own "object → dictionary" conversion logic. There are two common approaches.

4.1 Simple method: use `__dict__` directly

If your class is just a plain data container, with no complex inheritance and nothing that needs to stay hidden, you can use the __dict__ attribute that Python instances carry. It holds the instance's attributes as a dictionary. Note that this includes "private" double-underscore attributes, which are stored under their name-mangled names.

import json

class SimpleStudent:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

s1 = SimpleStudent("Bob", 20, 88)

# Serialize: the default parameter supplies the conversion function
simple_json = json.dumps(s1, default=lambda obj: obj.__dict__, ensure_ascii=False)
print(simple_json)  # {"name": "Bob", "age": 20, "score": 88}

# Deserialize: object_hook supplies the restore function
# (it is called for every JSON object decoded, so this only suits flat data)
def dict_to_simple_student(d):
    return SimpleStudent(d["name"], d["age"], d["score"])

loaded_s1 = json.loads(simple_json, object_hook=dict_to_simple_student)
print(f"Loaded student: {loaded_s1.name}, score {loaded_s1.score}")
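One caveat with the `__dict__` approach is worth checking before relying on it: double-underscore attributes are not hidden from `__dict__`, they are merely renamed. The `Account` class below is a made-up example, and filtering on a leading underscore is just one possible convention, not a rule of the json module.

```python
import json


class Account:
    def __init__(self, owner, pin):
        self.owner = owner
        self.__pin = pin  # stored under the mangled name _Account__pin


acct = Account("Bob", "1234")
print(acct.__dict__)  # {'owner': 'Bob', '_Account__pin': '1234'}

# Dumping __dict__ as-is leaks the "private" value into the JSON text
text = json.dumps(acct, default=lambda obj: obj.__dict__)
print(text)  # {"owner": "Bob", "_Account__pin": "1234"}

# One way to keep only public attributes: skip names starting with "_"
public_only = {k: v for k, v in vars(acct).items() if not k.startswith("_")}
print(json.dumps(public_only))  # {"owner": "Bob"}
```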

4.2 A more flexible method: implement dedicated `to_dict` and `from_dict` methods

When your class has private attributes or attributes inherited from a parent class, or when you want to tag the output with class information so that deserialization can be handled in one place, it is better to implement the conversion methods inside the class itself.

import json

class FlexibleStudent:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score  # needed by to_dict below
        # Private attribute: appears in __dict__ only under the mangled name _FlexibleStudent__secret
        self.__secret = "My dream is to become a programmer"
    
    def to_dict(self):
        # Build the dictionary to serialize by hand; a __class__ marker carries the type information
        return {
            "__class__": "FlexibleStudent",
            "name": self.name,
            "age": self.age,
            "score": self.score,
            # If needed, selected private attributes can be exposed by hand
            # "secret": self.__secret
        }
    
    @classmethod
    def from_dict(cls, d):
        if d.get("__class__") == "FlexibleStudent":
            return cls(d["name"], d["age"], d["score"])
        # Not our class: return the dictionary unchanged so other data is restored normally
        return d

s2 = FlexibleStudent("小红", 19, 92)

# Serialize
flexible_json = json.dumps(s2, default=lambda obj: obj.to_dict(), ensure_ascii=False)
print(flexible_json)

# Deserialize: passing this class method as the hook is all that is needed
loaded_s2 = json.loads(flexible_json, object_hook=FlexibleStudent.from_dict)
print(f"Loaded student type: {type(loaded_s2)}")  # <class '__main__.FlexibleStudent'>
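When more than one class carries a `__class__` marker, a single hook can dispatch on it. The following is a minimal sketch of that idea, not the only way to do it: the classes `Course` and `Enrollment`, the `REGISTRY` dictionary, and the `decode` function are all hypothetical names introduced here. It relies on two documented behaviors of the json module: the result of `default` is itself encoded recursively, and `object_hook` is called for inner objects before outer ones.

```python
import json


class Course:
    def __init__(self, title):
        self.title = title

    def to_dict(self):
        return {"__class__": "Course", "title": self.title}

    @classmethod
    def from_dict(cls, d):
        return cls(d["title"])


class Enrollment:
    def __init__(self, student, course):
        self.student = student
        self.course = course  # a nested Course instance

    def to_dict(self):
        # The nested Course is left as an object; json calls default on it again
        return {"__class__": "Enrollment", "student": self.student, "course": self.course}

    @classmethod
    def from_dict(cls, d):
        # By the time this runs, d["course"] has already been turned into a Course
        return cls(d["student"], d["course"])


# Maps the marker stored in the JSON text to the class that rebuilds it
REGISTRY = {"Course": Course, "Enrollment": Enrollment}


def decode(d):
    cls = REGISTRY.get(d.get("__class__"))
    return cls.from_dict(d) if cls else d  # unknown dictionaries pass through


text = json.dumps(Enrollment("Alice", Course("Python")), default=lambda o: o.to_dict())
restored = json.loads(text, object_hook=decode)
print(type(restored).__name__, restored.student, restored.course.title)
# Enrollment Alice Python
```

Looking classes up in an explicit registry, rather than importing whatever name the data mentions, keeps the input from choosing which code runs.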

5. Security and Best Practices

No matter which serialization scheme you choose, keep the following points in mind:

Never use pickle on untrusted data: this cannot be emphasized enough. Only use pickle in scripts, local caches, or internal pipelines that you fully control.
Validate the structure of JSON input strictly: deserialized data may have missing fields or unexpected types. Check it manually, or use a validation library such as pydantic, to make sure the structure is what you expect.
Always handle exceptions: a missing file, insufficient permissions, or malformed JSON can occur at any time. Example:

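import json

malformed_json = '{"name": "Bob", age: 20}'  # Invalid: the key age is missing its double quotes

try:
    data = json.loads(malformed_json)
except json.JSONDecodeError as e:
    print(f"JSON parsing failed at position {e.pos}: {e.msg}")

# File errors are raised when the file is opened, so catch them around open().
# "missing.json" is a placeholder name for a file that does not exist.
try:
    with open("missing.json", "r", encoding="utf-8") as f:
        data = json.load(f)
except FileNotFoundError:
    print("File not found")
except PermissionError:
    print("Permission denied")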
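The manual structure check mentioned above can be as small as a single function. This is a sketch using only the standard library; `parse_user` and the expected fields (`name`, `age`) are assumptions chosen to match the example data, not part of any library.

```python
import json


def parse_user(text):
    """Parse JSON text and verify it has the shape {"name": str, "age": int}."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str) or not name:
        raise ValueError("'name' must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(age, int) or isinstance(age, bool) or age < 0:
        raise ValueError("'age' must be a non-negative integer")
    return {"name": name, "age": age}


print(parse_user('{"name": "Bob", "age": 20}'))  # {'name': 'Bob', 'age': 20}

for bad in ['{"name": "Bob"}', '{"name": "Bob", "age": "20"}', '[1, 2]', '{"name": ']:
    try:
        parse_user(bad)
    except ValueError as exc:  # JSONDecodeError is a subclass of ValueError
        print(f"Rejected {bad!r}: {exc}")
```

Since `json.JSONDecodeError` derives from `ValueError`, callers can catch one exception type for both syntax and structure problems.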

6. Performance optimization (big data scenario)

When the data being processed reaches the GB range, or JSON is encoded and decoded constantly under high-concurrency network traffic, the standard library json module may become a performance bottleneck. In that case, consider the following alternatives.
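Before switching libraries, it can help to measure whether json is actually the slow part. The snippet below is a rough sketch using the standard `timeit` module; the sample data and the repeat count of 20 are arbitrary choices, and the numbers will vary by machine.

```python
import json
import timeit

# Sample payload: 10,000 small records (size chosen arbitrarily)
sample = [{"id": i, "value": f"data_{i}"} for i in range(10_000)]
text = json.dumps(sample)

runs = 20
dump_seconds = timeit.timeit(lambda: json.dumps(sample), number=runs) / runs
load_seconds = timeit.timeit(lambda: json.loads(text), number=runs) / runs

print(f"json.dumps: {dump_seconds * 1000:.2f} ms per call")
print(f"json.loads: {load_seconds * 1000:.2f} ms per call")
print(f"payload size: {len(text.encode('utf-8')) / 1024:.0f} KiB")
```

Running the same measurement against a candidate replacement on your own data gives a more reliable comparison than published benchmarks.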

6.1 Faster JSON library

orjson
One of the fastest JSON libraries in the Python ecosystem. Install with: pip install orjson
Its dumps returns bytes instead of str, and it can serialize common types such as datetime and UUID natively.
ujson
Also generally faster than the standard library, with an API somewhat closer to the standard json module than orjson's. Install with: pip install ujson

import orjson

big_data = [{"id": i, "value": f"data_{i}"} for i in range(100_000)]

# Serialization returns bytes, which can be written to a file or sent over the network as-is
orjson_bytes = orjson.dumps(big_data)

# Deserialize
loaded_big_data = orjson.loads(orjson_bytes)

6.2 Binary cross-language format

If you need faster parsing and more compact output than plain-text JSON can give, you can switch to a binary format:

MessagePack
Structurally similar to JSON, but usually smaller and faster to process. Install with: pip install msgpack
Protocol Buffers (protobuf)
A structured binary serialization format from Google. It is typically very fast and very compact, but the data structures must be defined in advance in a .proto file, so the cost of getting started is somewhat higher.

7. Summary

When choosing a serialization solution, make trade-offs based on your core needs:

Solution	Applicable scenarios	Advantages	Disadvantages
`pickle`	Temporary data storage and local caches inside Python	Supports almost all Python types, easy to use	Not cross-language, unsafe with untrusted data, fragile across versions
Standard `json`	Front-end/cross-language APIs, small data storage	Cross-language, safe, human-readable	Average performance, only supports basic types
`orjson` / `ujson`	High-frequency JSON processing, large-scale data storage or transmission	Cross-language, safe, very fast	Slightly lower compatibility, `orjson` returns `bytes`
Protocol Buffers	Cross-language, strictly structured, very high-frequency data exchange	Very high performance, very compact output	Requires `.proto` definitions, steeper learning curve

Finally, it bears repeating: **Safety first: only unpickle data that you produced yourself!**
