Python Developer's Library (@BookPython)

The same string can be represented in different ways in Unicode, and the standard accounts for this. It defines two types of equivalence: sequences can be canonically equivalent or compatible. Canonically equivalent sequences look exactly the same but consist of different code points. For example, ö can be the single code point LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) or a combination of o and a combining diaeresis: LATIN SMALL LETTER O (U+006F) + COMBINING DIAERESIS (U+0308). Compatible sequences may look different but can be treated as semantically the same, e.g. ﬀ (LATIN SMALL LIGATURE FF, U+FB00) and ff. For each type of equivalence, you can normalize a Unicode string by composing or decomposing sequences. In Python, you can use the unicodedata module for this:

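import unicodedata

modes = [
    # Canonical decomposition, then canonical composition
    'NFC',
    # Canonical decomposition
    'NFD',
    # Compatibility decomposition, then canonical composition
    'NFKC',
    # Compatibility decomposition
    'NFKD',
]
# ﬀ is the single ligature character U+FB00, not two letters f
s = 'ﬀ + ö'

for mode in modes:
    norm = unicodedata.normalize(mode, s)
    # Print the mode, the normalized string, and its length in UTF-8 bytes
    print('\t'.join([
        mode,
        norm,
        str(len(norm.encode('utf8'))),
    ]))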

NFC     ﬀ + ö   8
NFD     ﬀ + ö   9
NFKC    ff + ö  7
NFKD    ff + ö  8
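
A practical use of normalization is comparing strings that look the same but are stored differently. Here is a minimal sketch of that idea; the helper names canonically_equal and compatibility_equal are hypothetical and not part of unicodedata, and the choice of NFC and NFKC as the comparison forms is an assumption (NFD and NFKD would work just as well for comparison).

```python
import unicodedata

composed = '\u00f6'      # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = 'o\u0308'   # LATIN SMALL LETTER O + COMBINING DIAERESIS
ligature = '\ufb00'      # LATIN SMALL LIGATURE FF


def canonically_equal(a, b):
    """Hypothetical helper: compare strings up to canonical equivalence."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


def compatibility_equal(a, b):
    """Hypothetical helper: compare strings up to compatibility equivalence."""
    return unicodedata.normalize('NFKC', a) == unicodedata.normalize('NFKC', b)


# A plain comparison sees different code points
print(composed == decomposed)                   # False
# Canonical normalization makes both spellings of ö identical
print(canonically_equal(composed, decomposed))  # True
# The ligature is only compatible with 'ff', not canonically equivalent
print(canonically_equal(ligature, 'ff'))        # False
print(compatibility_equal(ligature, 'ff'))      # True
```

Compatibility normalization discards distinctions such as ligatures, so it suits searching and matching better than storing text you need to reproduce exactly.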