Библиотека Python разработчика

20835 @BookPython

Библиотека Python разработчика. Books on Python programming.

4 years ago

The same string can be represented in different ways in Unicode, and the standard accounts for this. It defines two types of equivalence: sequences can be canonically equivalent or compatible.

Canonically equivalent sequences look exactly the same but contain different code points. For example, ö can be a single LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) or a combination of o and a combining diaeresis: LATIN SMALL LETTER O (U+006F) + COMBINING DIAERESIS (U+0308).

Compatible sequences look different but may be treated the same semantically, e.g. ﬀ (LATIN SMALL LIGATURE FF, U+FB00) and ff.

For each of these types of equivalence, you can normalize a Unicode string by composing or decomposing sequences. In Python, you can use the unicodedata module for this:

    import unicodedata

    modes = [
        # Compose canonically equivalent
        'NFC',
        # Decompose canonically equivalent
        'NFD',
        # Compose compatible
        'NFKC',
        # Decompose compatible
        'NFKD',
    ]

    # 'ﬀ + ö': the ff ligature (U+FB00) and a precomposed ö (U+00F6)
    s = '\ufb00 + \u00f6'
    for mode in modes:
        norm = unicodedata.normalize(mode, s)
        print('\t'.join([
            mode,
            norm,
            str(len(norm.encode('utf8'))),
        ]))

Output (mode, normalized string, length in UTF-8 bytes):

    NFC     ﬀ + ö   8
    NFD     ﬀ + ö   9
    NFKC    ff + ö  7
    NFKD    ff + ö  8
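
A practical consequence of the equivalences described above is that a plain == comparison can report two visually identical strings as different. The following is a minimal sketch of comparing strings after normalization. The helper name same_text is hypothetical and not part of unicodedata; only unicodedata.normalize comes from the standard library.

```python
import unicodedata

composed = "\u00f6"     # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # LATIN SMALL LETTER O + COMBINING DIAERESIS
ligature = "\ufb00"     # LATIN SMALL LIGATURE FF


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalizing both to one form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent sequences differ as raw code points
# but match once both sides are normalized to NFC.
raw_equal = composed == decomposed
canonical_equal = same_text(composed, decomposed)

# Compatible sequences only match under the compatibility forms (NFKC/NFKD).
ligature_nfc = same_text(ligature, "ff", "NFC")
ligature_nfkc = same_text(ligature, "ff", "NFKC")

print(raw_equal, canonical_equal, ligature_nfc, ligature_nfkc)
# prints: False True False True
```

Which form to compare under depends on the use case: NFC or NFD keeps the ligature distinct from ff, while NFKC or NFKD folds them together, which may suit search or identifier matching but discards the distinction.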