ericcornelissen/wordrow

Improve functionality regarding replacing words with prefix or suffix

Closed this issue · 1 comments

Replace the ad-hoc functionality that only replaces words when they are preceded by a space with something more principled: give the creator of the mapping the power to decide whether a word should be replaced when it has a prefix and/or suffix.

This is useful in a few scenarios, for example:

  • One may want to replace all instances of "mail" with "email". But if every occurrence of "mail" is replaced, including the one inside "email", then "email" becomes "eemail". Here, an instance of "mail" with a prefix should not be replaced.
  • In another scenario, one may want to replace words ending in "ize" with "ise" (to convert American English spelling to British English spelling). Here, an instance of "ize" with a prefix should be replaced.

A similar argument can be made for suffixes.
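
To make the first scenario concrete, here is a tiny illustration using plain substring replacement. It is not how the tool works internally (that is an assumption-free demo of the pitfall only); the sample sentence is made up.

```python
text = "Send me a mail or an email."

# Naive substring replacement also rewrites the "mail" inside "email".
naive = text.replace("mail", "email")
print(naive)  # Send me a email or an eemail.
```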

Proposal

Define a very simple language to deal with suffixes and prefixes based on the - (hyphen) symbol.

To replace words only if they match the "from" value completely (i.e. the instance is surrounded by whitespace), simply add the word as is (without any changes) to the mapping file. For example, to replace "mail" with "email" (in CSV):

mail, email

Which should work as illustrated by (see footnote 1):

- I received a mail and an email.
+ I received a email and an email.

To replace words if they match the "from" value completely or with a prefix (i.e. after the instance is whitespace, but before the instance can be whitespace or letters), add the word with - in front of it. For example, to replace "ize" with "ise" at the end of any word (in CSV):

-ize, -ise

Which should work as illustrated by:

- They realize that they should not idealize.
+ They realise that they should not idealise.

To replace words if they match the "from" value completely or with a suffix (i.e. before the instance is whitespace, but after the instance can be whitespace or letters), add the word with - behind it. For example, to replace "color" with "colour" at the start of any word (in CSV):

color-, colour-

Which should work as illustrated by:

- The colors are amazing on this colorful painting. What is your favourite color?
+ The colours are amazing on this colourful painting. What is your favourite colour?
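
The three cases above could be sketched roughly as follows. This is only an illustration of the proposed semantics, not the tool's implementation: the names split_markers and apply_rule are hypothetical, and the sketch assumes that "surrounded by whitespace" really means "not adjacent to a word character" (Python's \w, which also covers digits and underscores), since the examples end in punctuation.

```python
import re

def split_markers(value):
    """Return (word, allow_prefix, allow_suffix) for one mapping value."""
    allow_prefix = value.startswith("-")  # "-ize": a prefix may precede the word
    allow_suffix = value.endswith("-")    # "color-": a suffix may follow the word
    start = 1 if allow_prefix else 0
    end = len(value) - 1 if allow_suffix else len(value)
    return value[start:end], allow_prefix, allow_suffix

def apply_rule(text, src, dst):
    """Apply one 'src, dst' mapping to text, keeping any prefix/suffix intact."""
    word, allow_prefix, allow_suffix = split_markers(src)
    new_word, _, _ = split_markers(dst)
    # Either capture the surrounding letters or require that there are none.
    before = r"\w*" if allow_prefix else r"(?<!\w)"
    after = r"\w*" if allow_suffix else r"(?!\w)"
    pattern = re.compile(f"({before}){re.escape(word)}({after})")
    return pattern.sub(lambda m: m.group(1) + new_word + m.group(2), text)

print(apply_rule("I received a mail and an email.", "mail", "email"))
print(apply_rule("They realize that they should not idealize.", "-ize", "-ise"))
print(apply_rule("What is your favourite color? I like colorful colors.", "color-", "colour-"))
```

Using a function as the replacement (rather than a template string) avoids having to escape backslashes in the "to" value.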

Discussion

Hyphens in the middle

Hyphens in the middle of a mapping string should not be affected by this proposal at all. For example, for the (CSV) mapping:

e-mail, email

the hyphen between the "e" and "m" does not mean anything special.
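
A small sketch of why this falls out naturally: if only the first and last character of a mapping value are inspected, a hyphen in the middle is never looked at. The function parse_line and the shape of the returned rule are hypothetical, and the sketch assumes a simple two-column CSV line without quoting.

```python
def parse_line(line):
    """Parse one 'from, to' CSV line into a rule (hypothetical shape)."""
    src, dst = (part.strip() for part in line.split(","))
    return {
        "from": src,
        "to": dst,
        # Only the outermost characters act as markers.
        "allow_prefix": src.startswith("-"),
        "allow_suffix": src.endswith("-"),
    }

rule = parse_line("e-mail, email")
print(rule["allow_prefix"], rule["allow_suffix"])  # False False
```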

Mapping inversion

This could prove difficult to implement when combined with the --invert option... 😟

Regular expressions

Arguably, this could be implemented using regular expressions. However, I believe this is unnecessarily powerful and may very well be unfamiliar to users of this tool. The proposed solution should be much more intuitive to use and read.

If required, regular expressions could be layered on top of, or implemented in parallel to, this proposal:

  • On top: since hyphens have no special meaning in regular expressions at the start or end of an expression, hyphens in these places could be escaped (e.g. \-ise to match "-ise" exactly). Moreover, this kind of escaping may turn out to be necessary regardless.
  • In parallel: to avoid confusion it might be useful to make the use of regular expressions explicit in mapping files. This can be achieved, for example, by surrounding the mapping string with forward slashes (/), which are used in some programming languages to denote regular expression literals.
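
Both ideas could be combined in the code that reads a mapping value. The sketch below is purely hypothetical (neither the \- escape nor the /.../ syntax exists yet, and parse_value is an invented name): it treats /.../ as an explicit regular expression, \- at either end as a literal hyphen, and a bare - at either end as the prefix/suffix marker.

```python
def parse_value(value):
    """Return (kind, text, allow_prefix, allow_suffix) for a mapping value."""
    # In parallel: /.../ marks an explicit regular expression.
    if len(value) >= 2 and value.startswith("/") and value.endswith("/"):
        return ("regex", value[1:-1], False, False)

    allow_prefix = allow_suffix = False
    lead = trail = ""
    if value.startswith("\\-"):      # escaped: a literal leading hyphen
        lead, value = "-", value[2:]
    elif value.startswith("-"):      # unescaped: prefix marker
        allow_prefix, value = True, value[1:]
    if value.endswith("\\-"):        # escaped: a literal trailing hyphen
        trail, value = "-", value[:-2]
    elif value.endswith("-"):        # unescaped: suffix marker
        allow_suffix, value = True, value[:-1]
    return ("literal", lead + value + trail, allow_prefix, allow_suffix)

print(parse_value("\\-ise"))     # ('literal', '-ise', False, False)
print(parse_value("-ise"))       # ('literal', 'ise', True, False)
print(parse_value("/colou?r/"))  # ('regex', 'colou?r', False, False)
```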

  1. Perhaps it would be nice if we could replace "a mail" by "an email" automatically (inserting an extra "n") even though the mapping is [mail -> email]. However, this won't work for all languages... For now this is solved easily by also adding a mapping for [a email -> an email].

This could prove difficult to implement when combined with the --invert option... 😟

Although a bit awkward, I don't think it poses a problem. Consider the following mapping file:

foo, -bar-

This reads something like "replace all instances of foo - without prefix or suffix - by bar - maintaining the prefix and suffix". It sounds weird, and is not something anyone should ever use, but it is functionally correct. That is, the program should have no problem with this mapping: it can just map foo to bar, and it never has to care about any prefix or suffix.
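
A minimal sketch of that reasoning, assuming that only the markers on the "from" side decide what is matched and that inverting simply swaps the two columns (both are assumptions; invert and affix_flags are hypothetical names):

```python
def affix_flags(value):
    """Return (allow_prefix, allow_suffix) based on the outermost hyphens."""
    return value.startswith("-"), value.endswith("-")

def invert(mapping):
    """Swap the 'from' and 'to' columns of every rule."""
    return [(dst, src) for src, dst in mapping]

mapping = [("foo", "-bar-")]
for src, dst in mapping + invert(mapping):
    # The flags of the "from" value are the only ones that affect matching.
    print(src, "->", dst, affix_flags(src))
```

In the forward direction the "from" value foo carries no markers, so the markers on -bar- never come into play; after inversion the "from" value -bar- allows both a prefix and a suffix.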