johnbumgarner/wordhoard

Questionable behavior of find_synonyms()


With the following code in generic_utils.py:

from wordhoard import Synonyms
def synonym(word):
    syn = Synonyms(word)
    syn_res = syn.find_synonyms()
    return syn_res

Run from a terminal with a clean state:

>>> import generic_utils as gu
>>> gu.synonym('mother')
['ma', 'mom', 'mum', 'dam', 'mama', 'mater', 'mommy', 'mummy', 'mamma', 'mammy', 'momma', 'parent', 'para i', 'supermom', 
'puerpera', 'old lady', 'old woman', 'primipara', 'quadripara', 'quintipara', 'birth mother', 'mother-in-law',
'foster mother', 'female parent', 'surrogate mother', 'biological mother']
>>> gu.synonym('mother')
['noun']

Environment info, with wordhoard==1.5.3 and Python 3.10.10:

backoff==2.2.1
beautifulsoup4==4.12.2
certifi==2022.12.7
charset-normalizer==3.1.0
cloudscraper==1.2.71
deckar01-ratelimit==3.0.2
deepl==1.14.0
idna==3.4
lxml==4.9.2
pyparsing==3.0.9
requests==2.28.2
requests-toolbelt==1.0.0
soupsieve==2.4.1
urllib3==1.26.15

This seems to be an error with caching. I was once able to get the error message below, but I am not 100% sure that it is related.

ERROR:wordhoard.synonyms:A KeyError occurred in the following code segment:
ERROR:wordhoard.synonyms:  File "/<path>/.conda/envs/sam/lib/python3.10/site-packages/wordhoard/synonyms.py", line 571, in _query_thesaurus_com
    self._update_cache(part_of_speech_category, synonyms_list)
  File "/<path>/.conda/envs/sam/lib/python3.10/site-packages/wordhoard/synonyms.py", line 134, in _update_cache
    caching.insert_word_cache_synonyms(self._word, pos_category, synonyms)
  File "/<path>/.conda/envs/sam/lib/python3.10/site-packages/wordhoard/utilities/caching.py", line 65, in insert_word_cache_synonyms
    temporary_dict_synonyms[word][pos_category] += deduplicated_values
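
For illustration, here is a minimal sketch of how a line of this shape can raise a KeyError. It assumes the cache is a plain nested dict keyed by word and then by part of speech; the real structure in caching.py may differ, and the names cache, insert_unsafe and insert_safe are hypothetical.

```python
# Hypothetical stand-in for the in-memory cache; the real structure in
# wordhoard/utilities/caching.py may differ.
cache = {}


def insert_unsafe(word, pos_category, synonyms):
    # Same shape as the failing line: "+=" needs both keys to exist already.
    cache[word][pos_category] += synonyms


def insert_safe(word, pos_category, synonyms):
    # setdefault creates any missing level before the list is extended.
    cache.setdefault(word, {}).setdefault(pos_category, []).extend(synonyms)


try:
    insert_unsafe("mother", "noun", ["mom", "mum"])
except KeyError as error:
    print("KeyError:", error)

insert_safe("mother", "noun", ["mom", "mum"])
insert_safe("mother", "noun", ["parent"])
print(cache)
```

The traceback does not show which key was missing, so this only demonstrates the general failure mode, not the confirmed cause.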

I was able to fix the behaviour by disabling caching entirely, changing the line

check_cache = self._check_cache()
to
check_cache = [False]

I have never used generic_utils, so I need to look into what it does.

This concerns me:

gu.synonym('mother')
['noun']

and I need to create a Python 3.10.10 environment to see what breaks.

The reason that I use caching is to prevent redundant queries for words.

I will look into this and get back to you.

I have never used generic_utils, so I need to look into what it does.

The code in generic_utils.py is provided in the issue. It is just a wrapper in the first code cell.

The reason that I use caching is to prevent redundant queries for words.

I understand, but this is a quick and dirty fix until the caching works.

When I execute this code in Python 3.9.16, I get no errors in wordhoard_error.yaml:

from wordhoard import Synonyms
def synonym(word):
    syn = Synonyms(word)
    syn_res = syn.find_synonyms()
    return syn_res

words = ['mother', 'mother']
for word in words:
    results = synonym(word)
    print(results)

Hmm, I checked wordhoard_error.yaml after running the provided code and found no errors, but I still see the same behavior with Python 3.10.10 (sadly, it would be tough to change versions, as I already have a lot of libraries installed for this one):

['ma', 'mum', 'dam', 'mom', 'mama', 'mommy', 'momma', 'mater', 'mammy', 'mamma', 'mummy', 'parent', 'para i', 'old lady', 'puerpera', 'supermom', 'primipara', 'old woman', 'quadripara', 'quintipara', 'birth mother', 'female parent', 'mother-in-law', 'foster mother', 'surrogate mother', 'biological mother']
['noun']

Checking with the debugger, at this line:

part_of_speech = list(check_cache[1].keys())[0]

check_cache[1] has the following value:

{'noun': ['mom', 'parent', 'female parent', 'momma', 'mama', 'mammy', 'mommy', 'ma', 'mom', ...]}

That is where 'noun' comes from.
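
A minimal sketch of the effect, using a shortened sample of the value seen in the debugger. Here itertools.chain.from_iterable is only a stand-in for cleansing.flatten_multidimensional_list, and the variable names are made up for the example.

```python
from itertools import chain

# Sample cached entry, shaped like the check_cache[1] value shown above.
cached_entry = {"noun": ["mom", "parent", "female parent", "momma", "mom"]}

# Iterating over a dict yields its keys, which reproduces the ['noun'] result.
from_keys = sorted(set([word.lower() for word in cached_entry]))
print(from_keys)

# Flattening the values instead gives the synonyms themselves.
synonyms = list(chain.from_iterable(cached_entry.values()))
from_values = sorted(set(word.lower() for word in synonyms))
print(from_values)
```

Dict iteration behaves this way on every Python 3 version, so the first call presumably succeeds only because it takes the non-cached path.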

Yes, noun is the part_of_speech. I need to create a Python 3.10.10 environment to see what errors I get.

So the problem is related to Python 3.10.10. I need to rework the code to support Python 3.10.10. I will post an update here when I have fixed this issue.

@johnbumgarner I have opened a PR. It seems questionable that the list format does not use the synonyms variable, which the other two output formats do use.

What does this mean: "it does seem questionable why you don't pass the synonyms variable that is passed in 2 other format types"?

elif valid_word is True:
    check_cache = self._check_cache()
    if check_cache[0] is True:
        part_of_speech = list(check_cache[1].keys())[0]
        synonyms = cleansing.flatten_multidimensional_list(list(check_cache[1].values()))
        if self._output_format == 'list':
            return sorted(set([word.lower() for word in check_cache[1]]))
        elif self._output_format == 'dictionary':
            output_dict = {self._word: {'part_of_speech': part_of_speech, 'synonyms': sorted(set(
                synonyms), key=len)}}
            return output_dict
        elif self._output_format == 'json':
            json_object = json.dumps({self._word: {'part_of_speech': part_of_speech,
                                                   'synonyms': sorted(set(synonyms), key=len)}},
                                     indent=4, ensure_ascii=False)
            return json_object

In the provided snippet from the current version of find_synonyms, there are three code paths based on self._output_format.
Only the list path uses check_cache[1] directly, and iterating over that dictionary yields its keys (the part of speech) rather than the synonyms. The JSON and dictionary paths return the newly created synonyms variable, which already holds the required information.

Maybe fixing this would make it work with Python 3.10 while keeping it backward compatible?

This is precisely what I changed in the ac1b33b PR.
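
As a rough sketch of that idea (not necessarily the exact diff in the PR), all three output formats can be built from the same flattened synonyms list. The standalone function format_cached_synonyms is hypothetical, and itertools.chain.from_iterable again stands in for cleansing.flatten_multidimensional_list.

```python
import json
from itertools import chain


def format_cached_synonyms(word, cached_entry, output_format="list"):
    """Hypothetical helper: format a cached {part_of_speech: [synonyms]} entry."""
    part_of_speech = list(cached_entry.keys())[0]
    # Stand-in for cleansing.flatten_multidimensional_list.
    synonyms = list(chain.from_iterable(cached_entry.values()))
    if output_format == "list":
        # Use the flattened values, not the dict itself (which yields keys).
        return sorted(set(item.lower() for item in synonyms))
    payload = {word: {"part_of_speech": part_of_speech,
                      "synonyms": sorted(set(synonyms), key=len)}}
    if output_format == "dictionary":
        return payload
    if output_format == "json":
        return json.dumps(payload, indent=4, ensure_ascii=False)
    raise ValueError(f"unsupported output format: {output_format}")


entry = {"noun": ["mom", "parent", "mom", "ma"]}
for fmt in ("list", "dictionary", "json"):
    print(format_cached_synonyms("mother", entry, fmt))
```

The list branch keeps the alphabetical sort from the original line, while the other two branches keep their sort by length.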

I believe that the issue is in the code below, because the dictionary and JSON code paths work.

if check_cache[0] is True:
    part_of_speech = list(check_cache[1].keys())[0]
    synonyms = cleansing.flatten_multidimensional_list(list(check_cache[1].values()))
    if self._output_format == 'list':
        return sorted(set([word.lower() for word in check_cache[1]]))

Totally agree. The issue is the use of check_cache[1] in the list branch; it should be synonyms instead, as discussed in the previous comment and the PR.

This is correct. I see that I need to do more testing before pushing out a new release. Thanks for finding this bug.