A complete guide to detecting text encoding using chardet
In Python text processing, encoding detection is a pitfall you cannot avoid. Whether it is a garbled web page fetched by a crawler, a CSV file exported by a legacy system, or a text attachment sent over email or Bluetooth, such "unidentified" binary data will not tell you which encoding it uses. The encode() and decode() methods are simple, but they presuppose that you know which encoding to pass, and in practice you often do not.
This is where a tool that can automatically "sniff out" the encoding becomes extremely valuable: chardet.
Why chardet?
chardet is one of the oldest and most widely used character encoding detection libraries in the Python ecosystem. Its core idea is simple and direct: using built-in character distribution models, it analyzes the byte patterns and statistical regularities of the input binary data, then outputs the most likely encoding together with a confidence score.
Compared with trying encodings by hand one by one, or relying on obscure tools, chardet's advantages are clear:
- Out of the box: the API is minimal, and detection takes just a few lines of code
- Wide coverage: supports dozens of mainstream and niche encodings, including multi-byte encodings for Chinese, Japanese, and Korean
- Incremental detection: large files need not be read entirely into memory; detection can stop early once enough characteristic bytes have been seen
Installation
chardet can be installed with pip or conda; choose whichever suits your environment.
1. Install via pip (recommended)
The most common way, which works in both virtual environments and a global Python, is to run pip install chardet.
If you run into permission issues, install into your user directory with pip install --user chardet (this avoids polluting the system environment with sudo).
2. Install via conda
If you use Anaconda or Miniconda, chardet may already be preinstalled. If not, run conda install chardet.
Core usage
The core API of chardet is very simple and consists mainly of two entry points:
- chardet.detect(): suited to small pieces of binary data (such as API response fragments)
- chardet.universaldetector.UniversalDetector: suited to large files or streaming data, with incremental analysis
Scenario 1: Small text detection (one read)
Pass the binary data directly to detect(), which returns a dictionary with three fields: encoding, confidence, and language.
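As a minimal sketch of such a call, the snippet below runs detect() on three hand-made samples. The sample strings and the results name are illustrative and not taken from the original article. Exact encodings and confidence values depend on the chardet version and the input, so no output is shown.

```python
import chardet

# Illustrative samples: the same Chinese sentence in two encodings, plus plain ASCII.
text = "你好，世界！这是一段用于编码检测的中文文本。"
samples = {
    "ascii": b"Hello, chardet!",
    "utf-8": text.encode("utf-8"),
    "gbk": text.encode("gbk"),
}

# detect() takes bytes and returns a dict describing its best guess.
results = {label: chardet.detect(raw) for label, raw in samples.items()}

for label, result in results.items():
    print(label, "->", result["encoding"], result["confidence"], result.get("language"))
```

Printing the whole dictionary rather than only the encoding makes it easier to notice low-confidence guesses.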
⚠️ Two details are worth noting here:
- GBK-encoded text is reported as GB2312: GBK is a superset of GB2312, and chardet's models lean towards the more basic encoding;
- The confidence for pure ASCII is 1.0, for Chinese UTF-8 it is 0.99, and for GBK only about 0.74. The takeaway: the longer the text and the more distinctive its features, the higher the confidence.
Scenario 2: Large file detection (streaming reading)
Calling detect() directly on a log file of several hundred MB will make memory usage skyrocket. Instead, use UniversalDetector and feed it data line by line; the detector sets its done flag once it has seen "enough", at which point you can stop reading.
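The streaming pattern can be sketched roughly as follows. The helper name detect_file and its max_lines parameter are assumptions made for illustration; feed(), done, close(), and result are the UniversalDetector members used.

```python
from chardet.universaldetector import UniversalDetector


def detect_file(path, max_lines=None):
    """Feed a file to chardet line by line and return the result dict.

    detect_file and max_lines are hypothetical names for this sketch.
    """
    detector = UniversalDetector()
    with open(path, "rb") as f:  # binary mode: the detector needs raw bytes
        for line_no, line in enumerate(f, start=1):
            detector.feed(line)
            if detector.done:  # the detector is confident enough already
                break
            if max_lines is not None and line_no >= max_lines:
                break  # optional cap so a huge file is never read in full
    detector.close()  # finalizes the guess; must be called before reading result
    return detector.result
```

A call such as detect_file("app.log", max_lines=1000) (the file name is a placeholder) returns the same kind of dictionary as detect().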
Advanced usage: a safe decoding function
Even when chardet returns a candidate encoding, decoding can still fail, for example because the text is too short to carry enough features or because a multi-byte sequence was truncated at a boundary. We can write a fallback function that keeps the program both smart and robust.
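One possible shape for such a fallback function is sketched below. The name safe_decode, the 0.7 threshold, and the fallback list are assumptions chosen for illustration, and should be adjusted to the encodings you actually expect.

```python
import chardet


def safe_decode(data, min_confidence=0.7, fallbacks=("utf-8", "gb18030")):
    """Decode bytes using chardet's guess first, then a list of fallbacks.

    The threshold and fallback encodings are illustrative defaults.
    """
    guess = chardet.detect(data)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    candidates = []
    if encoding and confidence >= min_confidence:
        candidates.append(encoding)
    candidates.extend(fallbacks)

    for candidate in candidates:
        try:
            return data.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know: try the next one.
            continue

    # Last resort: never raise, but mark undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")
```

The last line trades correctness for robustness: the caller always gets a string, and replacement characters in the output signal that no candidate fit.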
Commonly supported encodings
chardet supports dozens of encodings. Here are some of the most common ones in Chinese, English, Japanese, and Korean scenarios:
- Single-byte encoding: ASCII, ISO-8859-1 (latin1), Windows-1252
- UTF series: UTF-8, UTF-16 BE/LE, UTF-32 BE/LE
- Chinese encoding: GB2312, GBK, GB18030, Big5 (Traditional)
- Japanese encoding: EUC-JP, Shift_JIS, ISO-2022-JP
- Korean encoding: EUC-KR, ISO-2022-KR
The complete list can be found in chardet's GitHub documentation.
Three key reminders
1. Don't trust the detection result blindly; learn to read its confidence
- Confidence > 0.9: generally safe to use as-is
- Confidence 0.7 to 0.9: try it first, but keep a fallback encoding ready
- Confidence < 0.7: skip it and go straight to your own fallback strategy
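The three bands above can be expressed as a small pure function. The name pick_strategy and the returned labels are hypothetical; the function only needs a dictionary shaped like the one detect() returns.

```python
def pick_strategy(result, high=0.9, low=0.7):
    """Map a detection result to one of three handling strategies.

    pick_strategy and its string labels are illustrative names.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    if encoding is None or confidence < low:
        return "fallback"  # ignore the guess, use your own strategy
    if confidence > high:
        return "trust"  # use the detected encoding directly
    return "try_with_fallback"  # try the guess, keep a fallback ready


print(pick_strategy({"encoding": "GB2312", "confidence": 0.74}))  # try_with_fallback
```

Keeping the thresholds as parameters makes it easy to tune them against your own data.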
2. Don't swallow large files "in one gulp"
The first few hundred to a thousand lines of text usually contain enough features; reading everything just wastes memory. The UniversalDetector approach above is the right way.
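If you would rather sample the head of a file yourself, a sketch like the following reads only the first lines as raw bytes. The helper read_head and its default of 1000 lines are assumptions for illustration. A file with extremely long lines would still be read in large pieces, so this fits ordinary line-oriented text best.

```python
from itertools import islice


def read_head(path, max_lines=1000):
    """Return the first max_lines lines of a file as raw bytes.

    read_head is a hypothetical helper; reading whole lines avoids cutting
    a multi-byte character in half in ASCII-compatible encodings.
    """
    with open(path, "rb") as f:
        return b"".join(islice(f, max_lines))
```

The returned bytes can then be handed to a detector instead of the full file content.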
3. Mind the "superset" relationships between encodings
chardet may return a more basic encoding, for example identifying GBK as GB2312. When decoding fails, try its supersets in order: GB2312 → GBK → GB18030.
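The superset chain can be sketched with the standard library alone. The SUPERSET_CHAIN table and the decode_with_supersets helper are illustrative names, and the table covers only the GB family mentioned above.

```python
# Hypothetical lookup table: detected encoding -> wider encodings to try next.
SUPERSET_CHAIN = {
    "gb2312": ["gbk", "gb18030"],
    "gbk": ["gb18030"],
}


def decode_with_supersets(data, encoding):
    """Decode with the detected encoding, widening to supersets on failure.

    Returns (text, encoding_used); re-raises the last error if all fail.
    """
    candidates = [encoding] + SUPERSET_CHAIN.get(encoding.lower(), [])
    last_error = None
    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


# "龍" exists in GBK but not in GB2312, so the first attempt fails.
raw = "龍".encode("gbk")
text, used = decode_with_supersets(raw, "GB2312")
print(text, used)
```

Returning the encoding that actually worked makes it possible to record the mismatch between the detected and the effective encoding.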
Comparison of two alternatives
chardet works well enough, but in demanding scenarios you may need more speed or accuracy. Here are two alternatives:
1. cchardet: the speed demon
This is a chardet-style detector implemented in C/C++ instead of pure Python, reported to be 10 to 100 times faster. Its detect() interface mirrors chardet's, so switching is almost seamless.
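A common way to switch is an import with a fallback, sketched below on the assumption that at least one of the two packages is installed. The wrapper detect_encoding is a hypothetical helper; it reads only the encoding and confidence keys, which keeps it independent of any extra fields a particular library adds.

```python
try:
    import cchardet as chardet  # C-backed detector, used when available
except ImportError:
    import chardet  # pure-Python fallback


def detect_encoding(data):
    """Return (encoding, confidence) from whichever detector was imported."""
    result = chardet.detect(data)
    return result.get("encoding"), result.get("confidence")
```

Because the rest of the code only refers to the name chardet, nothing else needs to change when the faster library is present.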
2. charset-normalizer: precision targeting
A library that has emerged in recent years, tuned to identify short texts and edge-case encodings more accurately. Its API is similar.
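A minimal sketch using charset-normalizer's from_bytes() entry point follows. The helper name detect_and_decode is an assumption; best() returns the top-ranked match, or None when nothing plausible was found.

```python
from charset_normalizer import from_bytes


def detect_and_decode(data):
    """Return (encoding, decoded_text), or (None, None) if detection fails.

    detect_and_decode is a hypothetical helper for this sketch.
    """
    best = from_bytes(data).best()
    if best is None:
        return None, None
    # str(best) yields the text decoded with the detected encoding.
    return best.encoding, str(best)
```

Unlike detect(), this returns the decoded text along with the encoding, so detection and decoding happen in one step.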
Summary
chardet is a simple, reliable, wide-coverage encoding detection tool that can solve more than 90% of your Python encoding detection problems. Recommended best practices:
- Use detect() directly for small text, and UniversalDetector for large files
- Always pair detection with a fallback encoding and exception handling (the safe_decode template can be used directly)
- For short text and high-accuracy scenarios, switch to charset-normalizer
- For maximum speed, go with cchardet
- Make a habit of logging the detected encoding and its confidence, so troubleshooting takes less effort
With chardet and these techniques, you no longer have to try encoding after encoding by hand while staring at a screen full of garbled characters!

