This module can be used to replace keywords in sentences or extract keywords from sentences.
$ pip install flashtext
- Extract keywords
>>> from flashtext.keyword import KeywordProcessor >>> keyword_processor = KeywordProcessor() >>> keyword_processor.add_keyword('Big Apple', 'New York') >>> keyword_processor.add_keyword('Bay Area') >>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.') >>> keywords_found >>> # ['New York', 'Bay Area']
- Replace keywords
>>> keyword_processor.add_keyword('New Delhi', 'NCR region') >>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.') >>> new_sentence >>> # 'I love New York and NCR region.'
- Case Sensitive example
>>> from flashtext.keyword import KeywordProcessor >>> keyword_processor = KeywordProcessor(case_sensitive=True) >>> keyword_processor.add_keyword('Big Apple', 'New York') >>> keyword_processor.add_keyword('Bay Area') >>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.') >>> keywords_found >>> # ['Bay Area']
- No clean name for Keywords
>>> from flashtext.keyword import KeywordProcessor >>> keyword_processor = KeywordProcessor() >>> keyword_processor.add_keyword('Big Apple') >>> keyword_processor.add_keyword('Bay Area') >>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.') >>> keywords_found >>> # ['Big Apple', 'Bay Area']
- Add Multiple Keywords simultaneously
>>> from flashtext.keyword import KeywordProcessor >>> keyword_processor = KeywordProcessor() >>> keyword_dict = { >>> "java": ["java_2e", "java programing"], >>> "product management": ["PM", "product manager"] >>> } >>> # {'clean_name': ['list of unclean names']} >>> keyword_processor.add_keywords_from_dict(keyword_dict) >>> # Or add keywords from a list: >>> keyword_processor.add_keywords_from_list(["java", "python"]) >>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform') >>> # output ['product management', 'java']
- To Remove keywords
>>> from flashtext.keyword import KeywordProcessor >>> keyword_processor = KeywordProcessor() >>> keyword_dict = { >>> "java": ["java_2e", "java programing"], >>> "product management": ["PM", "product manager"] >>> } >>> keyword_processor.add_keywords_from_dict(keyword_dict) >>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform')) >>> # output ['product management', 'java'] >>> keyword_processor.remove_keyword('java_2e') >>> # you can also remove keywords from a list/ dictionary >>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]}) >>> keyword_processor.remove_keywords_from_list(["java programing"]) >>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform') >>> # output ['product management']
For detecting Word Boundary currently any character other than this \w [A-Za-z0-9_] is considered a word boundary.
- To set or add characters as part of word characters
>>> from flashtext.keyword import KeywordProcessor >>> keyword_processor = KeywordProcessor() >>> keyword_processor.add_keyword('Big Apple') >>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.')) >>> # ['Big Apple'] >>> keyword_processor.add_non_word_boundary('/') >>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.')) >>> # []
Documentation can be found at FlashText Read the Docs.
$ git clone https://github.com/vi3k6i5/flashtext $ cd flashtext $ pip install pytest $ python setup.py test
$ git clone https://github.com/vi3k6i5/flashtext $ cd flashtext/docs $ pip install sphinx $ make html $ # open _build/html/index.html in browser to view it locally
It's a custom algorithm based on Aho-Corasick algorithm and Trie Dictionary.
To do the same with regex it will take a lot of time:
Docs count | # Keywords | Regex | flashtext |
---|---|---|---|
1.5 million | 2K | 16 hours | Not measured |
2.5 million | 10K | 15 days | 15 mins |
The idea for this library came from the following StackOverflow question.
- Issue Tracker: https://github.com/vi3k6i5/flashtext/issues
- Source Code: https://github.com/vi3k6i5/flashtext/
The project is licensed under the MIT license.