The same string can be represented in different ways in Unicode, and the standard accounts for this. It defines two types of equivalence: canonical equivalence and compatibility equivalence.
Canonically equivalent sequences look exactly the same but contain different code points. For example, ö can be either a single LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) or a letter o followed by a combining diaeresis: LATIN SMALL LETTER O (U+006F) + COMBINING DIAERESIS (U+0308).
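To make the difference concrete, here is a minimal Python sketch that prints the code points behind both spellings of ö. The helper name code_points is made up for this illustration; the escapes are used so that an editor cannot silently convert one form into the other.

```python
# Two spellings of the same visible character.
precomposed = '\u00f6'   # LATIN SMALL LETTER O WITH DIAERESIS
combined = 'o\u0308'     # LATIN SMALL LETTER O + COMBINING DIAERESIS


def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return ['U+%04X' % ord(ch) for ch in text]


print(code_points(precomposed))  # ['U+00F6']
print(code_points(combined))     # ['U+006F', 'U+0308']
print(precomposed == combined)   # False: same look, different strings
```

Without normalization, Python compares the two strings code point by code point, so they are not equal even though they render identically.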
Compatible sequences look different but may be treated the same semantically, e.g. ﬀ (LATIN SMALL LIGATURE FF, U+FB00) and ff (two separate letters f).
For each of these types of equivalence, you can normalize a Unicode string by composing or decomposing its sequences. In Python, you can use the unicodedata module for this:
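import unicodedata

modes = [
    # Canonical decomposition, then canonical composition
    'NFC',
    # Canonical decomposition
    'NFD',
    # Compatibility decomposition, then canonical composition
    'NFKC',
    # Compatibility decomposition
    'NFKD',
]

# 'ﬀ + ö': the ff ligature (U+FB00) and a precomposed ö (U+00F6)
s = '\ufb00 + \u00f6'

for mode in modes:
    norm = unicodedata.normalize(mode, s)
    print('\t'.join([
        mode,
        norm,
        str(len(norm.encode('utf8'))),
    ]))

This prints each mode, the normalized string, and its length in UTF-8 bytes:

NFC     ﬀ + ö    8
NFD     ﬀ + ö    9
NFKC    ff + ö   7
NFKD    ff + ö   8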
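A common use of normalization is comparing strings that may come from different sources. The following is only a sketch: same_text is a hypothetical helper, not part of unicodedata, and the default form is an assumption you should adjust to your own needs.

```python
import unicodedata


def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form.

    same_text is a made-up helper for this example.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent: equal under any of the four forms.
print(same_text('\u00f6', 'o\u0308'))          # True

# Only compatible: the ligature survives NFC, so the strings differ...
print(same_text('\ufb00', 'ff'))               # False

# ...but a compatibility form replaces the ligature with two letters f.
print(same_text('\ufb00', 'ff', form='NFKC'))  # True
```

Note that NFKC and NFKD lose information: once ﬀ has become ff, the ligature cannot be restored. They are usually a better fit for searching and comparing than for storing the original text.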