If you are using Python 3 on Windows, you may have seen a Unicode decoding error when opening files in UTF-8 format:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 5: illegal multibyte sequence
If you read the same file on Linux or macOS, it opens without any error. Why is there a difference? The cause is the default encoding Python chooses on each platform.
The reason and the solution
Reading UTF-8 files
To show the default encoding used by Python on your platform, use the following snippet:
import locale
locale.getpreferredencoding()
locale.getpreferredencoding() returns the encoding used for text files on the system. On Linux, the output is UTF-8, and on my Windows machine (a Simplified Chinese edition of Windows 10 Pro), the output is cp936. In other words, UTF-8 is the default encoding on Linux and cp936 is the default encoding on this Windows installation (see the note at the end of this post). cp936 is short for "Microsoft code page 936", which encodes all the characters in the GBK character set.
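To check which encoding open() actually picks when none is given, one option is to compare the locale value with the encoding attribute of an opened file object. The following is only a minimal sketch: the helper name default_open_encoding and the scratch file are illustrative and not part of the original snippet.

```python
import locale
import os
import tempfile


def default_open_encoding():
    """Return the encoding open() uses when no encoding argument is given."""
    # Create an empty scratch file so there is something to open.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "r") as f:  # deliberately no encoding=...
            return f.encoding
    finally:
        os.remove(path)


if __name__ == "__main__":
    print("locale preferred encoding:", locale.getpreferredencoding(False))
    print("encoding used by open():  ", default_open_encoding())
```

On a setup like the Windows machine described above, both lines should name the same codec, although the capitalization of the name may differ.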
The solution to this issue is quite simple. If you are sure that the file to read is encoded in UTF-8, you can pass encoding="utf-8" to the builtin open() function:
with open("test.txt", "r", encoding="utf-8") as f:
    text = f.read()
The official Python 3 documentation for the encoding parameter says:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
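The decoding failure can be reproduced on any platform by passing the Windows default explicitly. The sketch below writes a small UTF-8 file and reads it back twice, once with encoding="gbk" (standing in for the cp936 default) and once with encoding="utf-8". The helper names read_with and demo, the sample text, and the scratch file are made up for illustration.

```python
import os
import tempfile


def read_with(path, encoding):
    """Read the whole file with the given encoding; return None if decoding fails."""
    try:
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        print(f"{encoding}: {exc}")
        return None


def demo():
    # Sample text (an assumption for this sketch). Its UTF-8 bytes are not
    # valid GBK, so a GBK decoder either fails or yields unrelated characters.
    text = "中 文"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        as_gbk = read_with(path, "gbk")
        as_utf8 = read_with(path, "utf-8")
    finally:
        os.remove(path)
    return text, as_gbk, as_utf8


if __name__ == "__main__":
    original, as_gbk, as_utf8 = demo()
    print("read as gbk:  ", repr(as_gbk))
    print("read as utf-8:", repr(as_utf8))
```

Only the read with the explicit UTF-8 encoding is expected to return the original text.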
Writing UTF-8 files
You may wonder why you have not run into any encoding issues when writing non-ASCII characters to files on Windows. The likely reason is that the gbk codec happens to be able to encode the characters you used. If you use characters outside the GBK character set, you will see encoding errors. To verify this, use the following snippet:
with open("test.txt", "w") as f:
    f.write("조선말")
If you run the above script on Windows, you may see the following error message:
UnicodeEncodeError Traceback (most recent call last)
      1 with open("test.txt", "w") as f:
----> 2     f.write("조선말")
UnicodeEncodeError: 'gbk' codec can't encode character '\uc870' in position 0: illegal multibyte sequence
To solve this issue and save files in UTF-8 encoding, pass encoding="utf-8" to open() when writing text to files.
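As a rough sketch of this fix, the same Korean string can first be checked against the gbk codec directly with str.encode, which reproduces the failure without depending on the platform default, and then written and read back with an explicit encoding="utf-8". The helper names can_encode and roundtrip_utf8 and the scratch file are hypothetical, chosen only for this example.

```python
import os
import tempfile


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def roundtrip_utf8(text):
    """Write text to a scratch file as UTF-8 and read it back the same way."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    korean = "조선말"
    print("gbk can encode it:  ", can_encode(korean, "gbk"))
    print("utf-8 can encode it:", can_encode(korean, "utf-8"))
    print("round trip matches: ", roundtrip_utf8(korean) == korean)
```

Passing the encoding explicitly on both the write and the read keeps the result independent of the platform default.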
- UTF-8 mode in Python 3.7
- Note that the output of locale.getpreferredencoding() differs between regions.
License CC BY-NC-ND 4.0