If you use Python 3 on Windows, you may have seen a Unicode decoding error when opening a UTF-8 encoded file:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 5: illegal multibyte sequence

If you read the same file on Linux or macOS, it opens without any error. Why is there a difference? It comes down to the default encoding Python chooses on each platform.

The reason and the solution

Reading UTF-8 files

To show the default encoding used by Python on your platform, use the following snippet:

import locale
print(locale.getpreferredencoding())

The function locale.getpreferredencoding() returns the encoding the system prefers for text files, which is what open() falls back to when no encoding is given. On Linux, the output is UTF-8; on my Windows machine (a Simplified Chinese edition of Windows 10 Pro), the output is cp936. In other words, Python defaults to UTF-8 on Linux and to cp936 on this Windows system [1]. cp936 is short for "Microsoft code page 936", Microsoft's code page for the GBK character set, which is why the error message names the gbk codec.
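The mismatch can be reproduced on any platform by decoding UTF-8 bytes with the gbk codec directly. The following is a minimal sketch, not part of the original script: the helper try_decode and the sample character are chosen purely for illustration.

```python
import locale

# Encoding that open() uses when no encoding argument is passed.
default_encoding = locale.getpreferredencoding()
print("default text encoding:", default_encoding)

# The character "中" occupies three bytes in UTF-8: E4 B8 AD.
utf8_bytes = "中".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails.

    Hypothetical helper, used only for this illustration.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# Decoding with the encoding the bytes were written in recovers the text.
print(try_decode(utf8_bytes, "utf-8") == "中")

# gbk reads bytes in pairs: E4 B8 forms one character, and the trailing
# AD is a lead byte with nothing after it, so decoding fails.
print(try_decode(utf8_bytes, "gbk"))
```

Some UTF-8 byte sequences happen to form valid GBK pairs and decode silently into the wrong characters, so the absence of an error does not prove the text was read correctly.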

The solution to this issue is quite simple. If you are sure that the file is encoded in UTF-8, pass encoding="utf-8" to the built-in open() function:

with open("test.txt", "r", encoding="utf-8") as f:
    text = f.read()

The official Python 3 documentation for the encoding parameter says:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
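To see the effect of the encoding parameter without a Windows machine, the sketch below writes a small UTF-8 file and reads it back twice. Passing encoding="gbk" is an assumption used here to mimic what a bare open() would do on a Chinese-locale Windows; the temporary path and the sample content are made up for the example.

```python
import os
import tempfile

# Hypothetical sample file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Write raw UTF-8 bytes so the platform default encoding plays no part.
with open(path, "wb") as f:
    f.write("中".encode("utf-8"))

# Explicit encoding: behaves the same on Windows, Linux, and macOS.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Forcing gbk stands in for the default on a Chinese-locale Windows.
try:
    with open(path, "r", encoding="gbk") as f:
        f.read()
    gbk_failed = False
except UnicodeDecodeError as exc:
    gbk_failed = True
    print("gbk read failed:", exc)

print("utf-8 read matches:", text == "中")
```

Because the encoding is spelled out in both calls, the result no longer depends on the machine the script runs on.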

Writing UTF-8 files

You may wonder why writing non-ASCII characters to files on Windows has never caused you any encoding problems. The likely reason is that the characters you wrote happen to be in the GBK character set, so the default gbk codec could encode them. Characters outside GBK trigger an encoding error. To verify this, run the following snippet:

with open("test.txt", "w") as f:
    f.write("조선말")

If you run the above script on Windows, you may see the following error message:

UnicodeEncodeError                        Traceback (most recent call last)
      1 with open("test.txt", "w") as f:
----> 2     f.write("조선말")
      3

UnicodeEncodeError: 'gbk' codec can't encode character '\uc870' in position 0: illegal multibyte sequence

To solve this issue and save files in UTF-8, pass encoding="utf-8" to open() when writing text to files as well.
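A minimal sketch of the fix, runnable on any platform: it first confirms that the gbk codec cannot encode the Korean sample text, then writes the text with an explicit UTF-8 encoding and reads it back. The temporary path and the variable names are chosen only for this example.

```python
import os
import tempfile

text = "조선말"

# GBK contains no Hangul, so this fails regardless of the platform.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# An explicit encoding makes the write independent of the system default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the round trip.
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print("gbk can encode the text:", gbk_ok)
print("round trip matches:", round_trip == text)
```

Using the same explicit encoding for both reading and writing keeps files portable between Windows and other systems.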

References


  1. Note that the output of locale.getpreferredencoding() depends on the system's region and language settings, so it may differ on your machine.