Question

Unicode look-alikes

I am looking for an easy way to match all unicode characters that look like a given letter. Consider an example for a selection of characters that look like small letter n.

import re
import sys

sys.stdout.reconfigure(encoding='utf-8')

look_alike_chars = [ 'n', 'ń', 'ņ', 'ň', 'ʼn', 'ŋ', 'ո', 'п', 'ո']
pattern = re.compile(r'[nńņňʼnŋոп]')


for char in look_alike_chars:
    if pattern.match(char):
        print(f"Character '{char}' matches the pattern.")
    else:
        print(f"Character '{char}' does NOT match the pattern.")

Instead of r'[nńņňʼnŋոп]' I expect something like r'[\LookLike{n}] where LookLike is a token.

If this is not possible, do you know of a program or website that lists all the look-alike symbols for a given ASCII letter?

 3  159  3
1 Jan 1970

Solution

 5

There is a Python library called Unidecode that will translate some of these special Unicode character to the corresponding ascii character. A code sample should be this:

import unidecode

input_string = "ñ"
clean_string = unidecode.unidecode(input_string) # = 'n'

However, the correspondences are made based on their actual idiom equivalent.

So whenever you want to translate a capital Greek 'PI' (written п), it will return a 'p' because pi is equivalent to a 'p' and not to an 'n'. Besides there is no publicly available database of such correspondences. Feel free to build one and share it!

2024-07-08
Lrx

Solution

 0

There isn't any specific way to achieve this, you'll need to build up a set of characters, and possibly add additional characters automation misses.

You need to find:

  1. equivalence sets [=n=] or [\p{toNFKD=/n/] but Python regular expression engines do not support equivalence sets. So we need to build it.
  2. you need in-script confusables
  3. you need in script transliterations that resolve to the character you are trying to match.

Then build a pattern out of those

I will try to limit the number of packages required, but robust unicode support is needed:

import icu
import regex

To start with we will need some helper functions.

def NFKD(chars):
    normaliser = icu.Normalizer2.getNFKDInstance()
    return normaliser.normalize(chars)
def to_ascii(char):
    transliterator = icu.Transliterator.createInstance('Latin-ASCII')
    return transliterator.transliterate(char)
def single_letter(chars):
    pattern = r'^[\p{sc=Common}\p{sc=Inherited}]*\p{Letter}[\p{sc=Common}\p{sc=Inherited}]*$'
    if regex.match(pattern, chars):
        return chars
    return ''

NFKD() speaks for itself: we need decomposed compatibility sequences that match the target letter.

to_ascii() will asciify and Latin string.

single_letter() will weed out compatibility decompositions or asciification that resolve to two or more letters.

We then define a set of characters we want to restrict our choices from. We aren't interested in lookalikes from other scripts, just Latin, But I will include Inherited and Common scripts as well.

LATIN = set(icu.UnicodeSet(r'[[\p{sc=Latin}][\p{sc=Inherited}][\p{sc=Common}]]'))

We will test our target character against this set using a series of set comprehensions, then create a pattern from the union of those sets:

def get_target_pattern(target):
    equivalence = {x for x in LATIN if target in single_letter(NFKD(x))}
    checker = icu.SpoofChecker()
    checker.setChecks(icu.USpoofChecks.ALL_CHECKS)
    confusable = {x for x in LATIN if checker.areConfusable(target, x) != 0}
    asciified = {x for x in LATIN if target in single_letter(to_ascii(x))}
    return f'[{''.join(set.union(equivalence, confusable, asciified))}]'

For n this gives us:

pattern = get_target_pattern('n')
print(pattern)
# [𝗻ň𝓃ɲnǹ𝖓𝘯𝓷𝐧ŋ𝑛ṋ𝔫ꞑₙṇ𝒏ꝴƞᶇ⒩ⓝɳⁿṅ𝙣ȵñńņᵰʼnnṉꞥ𝕟𝚗𝗇]

Update:

There are a number of ways of modifying the results:

  1. Use NFD instead of NFKD
  2. Add Cyrillic and Greek script confusables
def normalize(chars, nf="nfd"):
    match nf.lower():
        case "nfc":
            form = "nfc"
            mode = icu.UNormalizationMode2.COMPOSE
        case "nfkc":
            form = "nfkc"
            mode = icu.UNormalizationMode2.COMPOSE
        case "nkkc_cf":
            form = "nfkc_cf"
            mode = icu.UNormalizationMode2.COMPOSE
        case "nfkd":
            form = "nfkc"
            mode = icu.UNormalizationMode2.DECOMPOSE
        case "nfd":
            form = "nfc"
            mode = icu.UNormalizationMode2.DECOMPOSE
    normalizer = icu.Normalizer2.getInstance(None, form, mode)
    return normalizer.normalize(chars)

LATN_CODESPACE = set(icu.UnicodeSet(r'[[\p{sc=Latin}][\p{sc=Inherited}][\p{sc=Common}]]'))
LCG_CODESPACE = set(icu.UnicodeSet(r'[[\p{sc=Latin}][\p{sc=Cyrillic}][\p{sc=Greek}][\p{sc=Inherited}][\p{sc=Common}]]'))

def get_target_pattern(target, norm_form = "nfkd", extended=False):
    equivalence = {x for x in LATN_CODESPACE if target in single_letter(normalize(x, norm_form))}
    checker = icu.SpoofChecker()
    checker.setChecks(icu.USpoofChecks.ALL_CHECKS)
    if extended:
        checker.setChecks(icu.USpoofChecks.MIXED_SCRIPT_CONFUSABLE)
        confusable = {x for x in LCG_CODESPACE if checker.areConfusable(target, x) != 0}
    else:
        checker.setChecks(icu.USpoofChecks.WHOLE_SCRIPT_CONFUSABLE)
        confusable = {x for x in LATN_CODESPACE if checker.areConfusable(target, x) != 0}
    asciified = {x for x in LATN_CODESPACE if target in single_letter(to_ascii(x))}
    return f'[{''.join(set.union(equivalence, confusable, asciified))}]'

get_target_pattern('n')
get_target_pattern('n', norm_form="nfd")
get_target_pattern('n', norm_form="nfd", extended = True)
2024-07-09
Andj