2Toad/Profanity

Support for other languages than English

Closed this issue · 10 comments

Thanks for this great tool! I'm wondering: Are there any plans to add support for other languages as well?

Greetings @devmount. It's certainly crossed my mind, but until now, nobody has voiced an interest. Do you have a particular language in mind?

Well, German, if I may make a wish 😇 I can open a PR with a list of German profane words, if that helps!

@devmount please attach the list to this issue (as a text document) rather than opening a PR. That would be very helpful in moving this request forward.

Will do, thanks!

Thank you for suggesting this, @devmount. I've outlined some initial requirements and acceptance criteria for the ticket, which I'll use to implement this new feature. Please let me know your thoughts:

Profanity is currently limited to filtering English words. To enhance its functionality, we need to extend support to multiple languages, starting with German. This update should allow the filter to accurately detect and filter profanity in German, while maintaining its existing support for English. The decision has been made to handle filtering based on language codes (e.g., en, de, es), rather than locale-specific codes (e.g., en-US, en-GB), to keep the implementation simple and scalable.

Requirements

  1. Language Selection: Implement a mechanism to specify the language via ProfanityOptions (e.g., profanityOptions.language="de" for German). Default to English (en).
  2. German Profanity List: Add a comprehensive list of common German profanity words to the ./data folder. This list should be stored separately from the English words to ensure clarity and ease of future expansion to other languages.
  3. Word Loading Update: Update the library's word loading process to handle multiple language sets. Ensure that new languages can be added easily in the future.
  4. Testing:
    • Write unit tests for German profanity filtering.
    • Ensure the existing English profanity filter tests continue to pass.

Acceptance Criteria

  1. The library should correctly identify and filter German profanity when the language is set to German (de).
  2. English profanity filtering should remain unaffected (en).
  3. The filter should allow switching between languages via a configuration option.
  4. The library should default to English (en).

Out of Scope

  1. Locale-specific support (e.g., en-US, en-GB) is not required at this stage.
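
For illustration, here is a minimal sketch of how requirements 1 to 3 could fit together: one word list per language, and a single `language` option that defaults to `en`. Everything in it is an assumption made for the example: `SketchOptions` merely stands in for ProfanityOptions, `parseWordList` and `wordsFor` are hypothetical helpers, the one-word-per-line format is a guess, and the word lists are mild placeholders rather than the contents of any real data file.

```typescript
// Hypothetical sketch: none of these names or word lists come from the library.
type WordLists = Record<string, string[]>;

// Assumed file format: one word per line; blank lines are ignored.
function parseWordList(raw: string): string[] {
  return raw
    .split(/\r?\n/)
    .map((line) => line.trim().toLowerCase())
    .filter((line) => line.length > 0);
}

// Stand-ins for the contents of separate per-language files in ./data.
const lists: WordLists = {
  en: parseWordList("darn\nheck\n"),
  de: parseWordList("Mist\nDepp\n"),
};

// Stands in for ProfanityOptions with the proposed `language` property.
class SketchOptions {
  language = "en"; // Requirement 1: default to English
}

// Looks up the word list for the configured language and fails loudly
// when no list has been registered for it.
function wordsFor(options: SketchOptions, available: WordLists = lists): string[] {
  const words = Object.prototype.hasOwnProperty.call(available, options.language)
    ? available[options.language]
    : undefined;
  if (!words) {
    throw new Error(`No word list for language "${options.language}"`);
  }
  return words;
}
```

Under these assumptions, adding a language only means adding one more entry to `lists`; nothing else in the lookup has to change.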

I totally agree with those requirements! Everything sounds reasonable to me. Keeping the data files separate for each language should keep everything organized. Locale-specific support is IMHO not necessary for this tool. Maybe in the future (but out of scope for now) it would be nice to combine languages (e.g. filtering for EN and DE at the same time), since many German swear words are borrowed from English.

For retrieving data, you could look at repos like https://github.com/thisandagain/washyourmouthoutwithsoap (MIT licensed). I've curated a list of German profane words for now:

de_profanity.txt

Thanks for the suggestion, @devmount, and for sharing the German list of profane words. washyourmouthoutwithsoap's method of auto-generating profanity lists for other languages based on the core English list is an interesting approach. While it may not be as accurate as having a native speaker create the list, it could still serve as a solid starting point. I'm re-evaluating this ticket's approach to multi-language support and will update the requirements and acceptance criteria accordingly.

You're welcome. Yes, curated lists from native speakers will always be more accurate. But since you don't need to support a lot of different languages from the start, you can just make the tool support multiple languages in general and add new languages when contributors provide them.

After further consideration, I've updated the requirements and AC so that multiple languages can be specified per call to exists() and censor(), rather than locking the user into a list of languages when a custom Profanity instance is created. I've added caching (keyed by languages) so we don't lose any performance with this approach.
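
To make the "caching (keyed by languages)" idea concrete, here is a rough sketch of one way it could work; it is not taken from the library's source. The names `SAMPLE_WORDS`, `languagesKey`, `regexFor`, and `compileCount` are hypothetical, the word lists are placeholders, and compiling each language combination into a single regular expression is itself an assumption about what gets cached.

```typescript
// Hypothetical sketch: names and word lists are illustrative, not the library's internals.
const SAMPLE_WORDS: Record<string, string[]> = {
  en: ["darn", "heck"],
  de: ["mist", "depp"],
};

// Order-insensitive key, so ["en", "de"] and ["de", "en"] share one cache entry.
function languagesKey(languages: string[]): string {
  return languages.slice().sort().join(",");
}

const regexCache: Record<string, RegExp> = {};
let compileCount = 0; // only here to make cache hits observable

// Returns a case-insensitive regex matching any listed word of the given
// languages, compiling it at most once per language combination.
function regexFor(languages: string[]): RegExp {
  const key = languagesKey(languages);
  let cached = regexCache[key];
  if (!cached) {
    const words: string[] = [];
    for (const lang of languages) {
      const list = Object.prototype.hasOwnProperty.call(SAMPLE_WORDS, lang)
        ? SAMPLE_WORDS[lang]
        : undefined;
      if (!list) {
        throw new Error(`Unsupported language: ${lang}`);
      }
      words.push(...list);
    }
    // An empty word list gets a pattern that can never match.
    cached = words.length > 0
      ? new RegExp(`\\b(${words.join("|")})\\b`, "i")
      : /(?!)/;
    regexCache[key] = cached;
    compileCount++;
  }
  return cached;
}
```

With this shape, repeated calls for the same set of languages pay the compilation cost only once, whichever order the caller lists the languages in.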


Profanity is currently limited to filtering English words. To enhance its functionality, we need to extend support to multiple languages, starting with German. This update should allow the filter to accurately detect and filter profanity in German, while maintaining its existing support for English. The decision has been made to handle filtering based on language codes (e.g., en, de, es), rather than locale-specific codes (e.g., en-US, en-GB), to keep the implementation simple and scalable.

Requirements

  1. Language Selection: Implement a mechanism to specify the default languages via ProfanityOptions (e.g., profanityOptions.languages=["de"] for German). Default to English (en).
  2. German Profanity List: Add a comprehensive list of common German profanity words to the ./data folder. This list should be stored separately from the English words to ensure clarity and ease of future expansion to other languages.
  3. Word Loading Update: Update the library's word loading process to handle multiple language sets. Ensure that new languages can be added easily in the future.
  4. Testing:
    • Write unit tests for German profanity filtering.
    • Ensure the existing English profanity filter tests continue to pass.

Acceptance Criteria

  1. The library should correctly identify and filter German profanity when the language is set to German (de).
  2. English profanity filtering should remain unaffected (en).
  3. ProfanityOptions contains a new languages: string[] property that defaults to English (["en"]).
  4. exists() and censor() take a new optional languages: string[] argument.
  5. If languages is not passed to exists() or censor(), the call falls back to the languages specified in ProfanityOptions.languages.

Out of Scope

  1. Locale-specific support (e.g., en-US, en-GB) is not required at this stage.
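
As a rough illustration of acceptance criteria 3 to 5, the sketch below shows a per-call `languages` argument that falls back to the configured default. It is not the library's implementation: `FilterOptionsSketch` stands in for ProfanityOptions, `MultiLanguageFilter` and `DEMO_WORDS` are hypothetical, the word lists are placeholders, and censoring with one `*` per character is an arbitrary choice for the example.

```typescript
// Hypothetical sketch of the acceptance criteria; not the library's actual code.
const DEMO_WORDS: Record<string, string[]> = {
  en: ["darn", "heck"],
  de: ["mist", "depp"],
};

// Stands in for ProfanityOptions with the proposed `languages` property.
class FilterOptionsSketch {
  languages: string[] = ["en"]; // AC 3: defaults to English
}

class MultiLanguageFilter {
  constructor(
    private readonly options: FilterOptionsSketch = new FilterOptionsSketch()
  ) {}

  // AC 4: optional per-call languages.
  exists(text: string, languages?: string[]): boolean {
    return this.pattern(languages, "i").test(text);
  }

  // Replaces every character of each matched word with "*".
  censor(text: string, languages?: string[]): string {
    return text.replace(this.pattern(languages, "gi"), (match) =>
      match.replace(/./g, "*")
    );
  }

  // AC 5: fall back to options.languages when the caller passes none.
  private pattern(languages: string[] | undefined, flags: string): RegExp {
    const active =
      languages && languages.length > 0 ? languages : this.options.languages;
    const words: string[] = [];
    for (const lang of active) {
      const list = Object.prototype.hasOwnProperty.call(DEMO_WORDS, lang)
        ? DEMO_WORDS[lang]
        : undefined;
      if (!list) {
        throw new Error(`Unsupported language: ${lang}`);
      }
      words.push(...list);
    }
    // An empty word list gets a pattern that can never match.
    return words.length > 0
      ? new RegExp(`\\b(${words.join("|")})\\b`, flags)
      : new RegExp("(?!)", flags);
  }
}

const filter = new MultiLanguageFilter();
filter.exists("So ein Mist!");                       // false: only "en" is active by default
filter.exists("So ein Mist!", ["de"]);               // true: German requested for this call
filter.censor("Darn, so ein Mist!", ["en", "de"]);   // "****, so ein ****!"
```

One caveat of this sketch: `\b` in a JavaScript regular expression only treats ASCII letters, digits, and underscore as word characters, so German words containing ä, ö, ü, or ß would need different boundary handling. The sketch also rebuilds the pattern on every call; a cache keyed by the language combination would avoid that.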

Wow, thank you so much for implementing this in just a few days!! Much appreciated. I'll let you know if I find any issues after testing the multi-language feature.
Thanks again!