This is a Python package providing search-and-replace in a unidecode-compatible manner. This means that unidecode equivalences are taken into account when searching, while the replacement is still done on the original string.
For example, the German character "ö" unidecodes as "o". So, when searching a German text, one could search for "o" and expect to find every "o", but also every "ö" (the opposite is also true: searching for "ö" would find both of these characters). Unidecode Replace will do this:
>>> from unidecode_replace import unidecode_replace
>>> unidecode_replace("Übergröße", "ö", "!!!")
'Übergr!!!ße'
>>> unidecode_replace("Übergröße", "o", "!!!")
'Übergr!!!ße'
>>> unidecode_replace("Just an ordinary 'o'", "ö", "!!!")
"Just an !!!rdinary '!!!'"
Note: If you like the features described in this document but want a regular (not unidecode-compatible) search, set the unidecoded_search flag to False.
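To make the idea of "search on the unidecoded text, replace in the original" more concrete, here is a minimal sketch. It is not the package's implementation: TRANSLIT is a tiny hypothetical stand-in for unidecode, and decode_with_map and sketch_replace are illustrative names. The u2i list only borrows its name from the mapping mentioned later in this document.

```python
# Hypothetical stand-in for unidecode: a tiny transliteration table.
TRANSLIT = {"ö": "o", "Ü": "U", "ß": "ss"}


def decode_with_map(string):
    """Return (decoded, u2i); u2i[k] is the index in `string` of the
    character that produced decoded[k]."""
    parts = []
    u2i = []
    for i, ch in enumerate(string):
        piece = TRANSLIT.get(ch, ch)
        parts.append(piece)
        u2i.extend([i] * len(piece))  # one entry per decoded character
    return "".join(parts), u2i


def sketch_replace(string, search, sub):
    """Find `search` in the decoded text, but replace in the original."""
    decoded, u2i = decode_with_map(string)
    needle, _ = decode_with_map(search)
    if not needle:
        return string
    result = []
    copied_to = 0  # index in `string` up to which output was produced
    start = decoded.find(needle)
    while start != -1:
        end = start + len(needle)
        orig_start = u2i[start]
        orig_end = u2i[end - 1] + 1
        # Assumption of this sketch: skip a match that begins inside an
        # original character that was already replaced.
        if orig_start >= copied_to:
            result.append(string[copied_to:orig_start])
            result.append(sub)
            copied_to = orig_end
        start = decoded.find(needle, end)
    result.append(string[copied_to:])
    return "".join(result)


print(sketch_replace("Übergröße", "o", "!!!"))
```

Because "ß" expands to two decoded characters, the index map is what lets a match in the decoded text be translated back to a span of the original string.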
The core of the package is the unidecode_replace function, which is a wrapper for the class UnidecodeReplace that does all the work. Its signature is this:
def unidecode_replace(
    string: str,
    search: SearchT,
    sub: SubT,
    *,
    allow_overlaps: bool = False,
    re_search: bool = False,
    re_flags: re.RegexFlag = re.RegexFlag(0),
    count: Optional[int] = None,
    pos: Optional[int] = None,
    endpos: Optional[int] = None,
    unidecoded_search: bool = True,
    str_case_sensitive: bool = True,
) -> str: ...
The arguments are as follows:

string
: The input string on which the search and replace are performed.

search
: A search string or a regex, or a sequence of them. For more details, see Search(es).

sub
: A replacement string or a function, or a sequence of them. For more details, see Substitution(s).

allow_overlaps
: If True, overlapping instances of found text are replaced.

re_search
: If True and search is a string, it'll be compiled as a regular expression. If it's a sequence, any string in it will also be compiled.

re_flags
: Flags used when compiling regular expressions (if re_search is set to True).

count
: If set, only the first count occurrences are replaced.

pos, endpos
: If defined, search and replace will only be done from position pos (default: 0) to endpos - 1 (defaulting to the end of string).

unidecoded_search
: If set to False, the search will be literal (i.e., not based on the unidecoded version of string). This behaves equivalently to str.replace and re.sub, but with the added benefit of doing multiple searches in one pass and allowing one to use pos and endpos.

str_case_sensitive
: If set to True, string searches will be performed in a case-sensitive manner. If set to False, string searches will be performed in a case-insensitive manner. This setting does not affect regex searching.
The function returns a copy of string with matched substrings replaced.
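The count, pos and endpos arguments can be understood in terms of the standard re module. The following sketch uses only the standard library and a hypothetical helper name (windowed_sub); it illustrates the semantics described above for a single compiled pattern, not how the package implements them. Treating count=None as "no limit" is an assumption of the sketch.

```python
import re


def windowed_sub(pattern, sub, string, count=None, pos=None, endpos=None):
    """Replace matches of a compiled `pattern`, but only those found
    between positions `pos` and `endpos - 1` of `string`."""
    pos = 0 if pos is None else pos
    endpos = len(string) if endpos is None else endpos
    result = []
    copied_to = 0  # index in `string` up to which output was produced
    replaced = 0
    # Pattern.finditer accepts pos and endpos, which re.sub does not.
    for m in pattern.finditer(string, pos, endpos):
        if count is not None and replaced >= count:
            break
        result.append(string[copied_to:m.start()])
        result.append(m.expand(sub))  # expand group references in `sub`
        copied_to = m.end()
        replaced += 1
    result.append(string[copied_to:])
    return "".join(result)


print(windowed_sub(re.compile("a"), "X", "banana", count=2))
print(windowed_sub(re.compile("a"), "X", "banana", pos=2, endpos=4))
```

Text outside the window is copied unchanged, so the result always has the untouched prefix and suffix of the input.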
The unidecode_wrap function does pretty much the same thing as unidecode_replace, except that the sub argument is replaced by two arguments, prefix and suffix:
def unidecode_wrap(
    string: str,
    search: SearchT,
    prefix: str,
    suffix: str,
    *,
    allow_overlaps: bool = False,
    re_search: bool = False,
    re_flags: re.RegexFlag = re.RegexFlag(0),
    count: Optional[int] = None,
    pos: Optional[int] = None,
    endpos: Optional[int] = None,
    unidecoded_search: bool = True,
    str_case_sensitive: bool = True,
) -> str: ...
It wraps each found substring between prefix and suffix. The main purpose is to easily mark what was found. For example,
>>> unidecode_wrap("This is some string.", "some", "<b>", "</b>")
'This is <b>some</b> string.'
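Wrapping can be seen as a replacement whose value is computed from the text that was found. The sketch below shows that idea with the standard re module only; sketch_wrap is a hypothetical name, and the search here is a plain literal one with no unidecode equivalences.

```python
import re


def sketch_wrap(string, search, prefix, suffix):
    """Wrap every literal occurrence of `search` in prefix and suffix."""

    # A single callable builds each replacement from the matched text.
    def wrap(match):
        return prefix + match.group(0) + suffix

    # re.escape makes `search` match literally, not as a regex.
    return re.sub(re.escape(search), wrap, string)


print(sketch_wrap("This is some string.", "some", "<b>", "</b>"))
```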
Let us consider two main arguments - search(es) and substitution(s) - in more detail.
As mentioned above, the search argument can be a string, a regular expression (re.Pattern), or a sequence of one or more of these.
If it is a string, then:
- If re_search is set to False, it will be matched to string as it is.
- If re_search is set to True, it will be compiled as a regular expression before being matched to string.
If it is a regex pattern (i.e., a compiled regex), or a string compiled because re_search is set to True, the matching is the usual regex one.
If it is a sequence, each item (a string or a regex pattern) is treated as just described.
If the str_case_sensitive flag is set to False, all string matches are done in a case-insensitive manner; if it is True, they are case-sensitive. This setting does not affect regular expressions.
When a string is compiled as a regex (because re_search is set to True), the value of re_flags is used as flags in re.compile. For example, include re.I here to get case-insensitive search. However, for regex pattern searches (i.e., those that are already compiled), re_search and re_flags have no effect.
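The rules above about re_search, re_flags and str_case_sensitive can be summarized in a short sketch. The helper name to_pattern is hypothetical, and modelling a plain string search as an escaped regex is an assumption made for illustration; the package may match plain strings differently.

```python
import re


def to_pattern(search, re_search=False, re_flags=re.RegexFlag(0),
               str_case_sensitive=True):
    """Turn one search item into a compiled pattern (illustration only)."""
    if isinstance(search, re.Pattern):
        # Already compiled: re_search and re_flags have no effect.
        return search
    if re_search:
        # A string treated as a regex: re_flags go to re.compile and
        # str_case_sensitive is not consulted.
        return re.compile(search, re_flags)
    # A plain string: matched literally, optionally ignoring case.
    flags = re.RegexFlag(0) if str_case_sensitive else re.IGNORECASE
    return re.compile(re.escape(search), flags)


print(bool(to_pattern("a.c").search("abc")))                  # literal
print(bool(to_pattern("a.c", re_search=True).search("abc")))  # regex
```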
To avoid unidecode-compatible searching (for example, treating "ö" and "o" as the same character), set unidecoded_search to False.
As mentioned above, the substitution argument sub can be a string, a callable, or a sequence of these. Let us see how these act one-on-one (i.e., when there is only one search and one sub item).
- If search is a string and sub is a string, then any matched substring of string will be replaced by sub.
- If search is a string and sub is a callable, it should accept one string (the found substring), and any matched substring of string will be replaced by whatever that callable returns.
- If search is a regex pattern and sub is a string, then any matched substring of string will be replaced by the expanded sub (i.e., any regex groups in it will be replaced by the values of those groups).
- If search is a regex pattern and sub is a callable, it should accept one argument m, and any matched substring of string will be replaced by whatever that callable returns.
The argument m mentioned in the last item above is a UnidecodeReMatch instance. It can be used like a normal re.Match because it wraps the original one (accessible as m.original), but its values are properly mapped to string, whereas those of m.original refer to the unidecoded version of string.
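The four combinations of search and sub described above can be sketched with the standard library. The function build_replacement is a hypothetical name, and it receives a plain re.Match for regex searches, whereas the package passes a UnidecodeReMatch to callables.

```python
import re


def build_replacement(found, sub):
    """Compute the replacement for one match.

    `found` is a str for string searches or an re.Match for regex ones.
    """
    if callable(sub):
        return sub(found)         # callable: gets the found text or match
    if isinstance(found, re.Match):
        return found.expand(sub)  # regex + string: expand group references
    return sub                    # string + string: use `sub` as it is


m = re.search(r"(\w+)@(\w+)", "user@host")
print(build_replacement("some", str.upper))
print(build_replacement(m, r"\2:\1"))
```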
If given as sequences, search and sub should be of the same length and they will match 1-to-1, meaning that the first search item will be replaced with the first substitution item, the second search item will be replaced with the second substitution item, etc.
There is one exception to this rule: if search is a sequence, then sub can still be a single item or a sequence containing only one item. In this special case, all matched search items will be substituted by that same one substitution item. This is effectively used by unidecode_wrap, which internally provides a single callable sub.
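The pairing rule, including the single-sub exception, might be sketched as follows. The helper pair_items is hypothetical, and raising ValueError on a length mismatch is an assumption of this sketch; this document does not say how the package reacts to mismatched lengths.

```python
def pair_items(search, sub):
    """Pair search items with substitution items, 1-to-1."""
    searches = list(search) if isinstance(search, (list, tuple)) else [search]
    subs = list(sub) if isinstance(sub, (list, tuple)) else [sub]
    if len(subs) == 1:
        # The exception: one substitution is shared by all search items.
        subs = subs * len(searches)
    if len(searches) != len(subs):
        # Assumption of this sketch, not documented package behavior.
        raise ValueError("search and sub must have the same length")
    return list(zip(searches, subs))


print(pair_items(["a", "b"], ["1", "2"]))
print(pair_items(["a", "b"], "x"))
```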
There are two auxiliary functions whose functionality is more related to unidecode, but they are exposed in this package:
- can_be_unidecoded(string: str) -> bool: Return True if the string can be unidecoded without errors.
- get_invalid_chars(string: str) -> set[str]: Return a set of chars in string that fail to unidecode.
Apart from unidecode_replace and unidecode_wrap, the package also exposes the following classes (but most of them won't be imported with from unidecode_replace import *):
- UnidecodeReplace: The class that does all the actual work. Since unidecode_replace exposes all of its functionality, you'd only want to use this if you were to inherit from it.
- SearchItem, SearchItemStr, SearchItemRegex: Internal classes that hold (search, sub) pairs, perform the actual searching, and generate the replacement strings. You probably don't need these for more than type annotations when inheriting from UnidecodeReplace and extending its search capabilities.
- UnidecodeReMatch: The wrapper class for re.Match, used to map the results of regular expressions performed on unidecoded strings back to the original ones. Also unlikely to be used for more than type annotations, but this one is exposed in __all__, because one might need it to annotate the callables used as substitutions when searching with regular expressions.
- SearchT, SubT, u2iT: Types for the search and sub arguments, and for the u2i attribute (an internal mapping used by UnidecodeReplace to map indexes from the unidecoded string to the original (input) one).