Case of hashtags (keywords) should be retained

Question

Case of hashtags (keywords) should be retained

wilddeej opened this issue 6 years ago · 16 comments

Currently the code that gathers the list of keywords converts these all to lower-case (to prevent duplication - which is useful), but when the keywords are applied, they should be in the case as originally specified.

Implementation notes:
Since the keywords need to be kept in a list (as already decided), it seems that a parallel dict could be kept that is keyed by the same lower-case keyword, with the original/unchanged keyword as the value. Is there another alternative that could be implemented in a single data structure, without losing the ordering provided by list implementation?
Another alternative is just to use the unaltered keywords as list entries, but this will mean a more complicated operation will be needed to prevent keyword duplication (rather than just lower-casing a potential entry and using "is not in keywords" to know if that new entry should be added).

Answer 1 · 2018-12-07T09:02:42.000Z

I don't think there's a need for a dict, the list can handle it. It's pretty simple to ignore the case when removing duplicated elements from a list (or to prevent their addition to this list if it already contains the same keyword with a different case, FWIW).

If we want to keep one particular case for a keyword for which we have several variants, we need to decide which ones should be discarded. Such a destructive process implies the definition of rules that have to be documented and understood by users. Sadly, they won't be able to do anything to comply with these rules (i.e. to choose what to keep / what should be discarded), unless they can modify the rss/atom feed. If it's the case, keyword deduplication should probably be handled at the feed level, not in feedspora.

I think the best default would be a case-sensitive deduplication. Maybe the deduplication should be case sensitive or insensitive depending on a parameter?

Answer 2 · 2018-12-07T14:22:40.000Z

Since we already have the priority of keyword sources defined, I think we just keep the first version of whatever case and leave out any duplicates found later. You're correct that this will have more to do with the feeds than anything else, except the fact that the user-defined keywords will take the highest priority.
I like your idea of adding "case-sensitive keywords" as an option too, that should also be worked into the mix!
It sounds like the entry-specific keyword list (created by the feed parsers, of course) should just include every keyword, unless it's a case-sensitive duplicate. Then there should be a client-specific function to build its list of keywords to process by

Prepending the user-defined keywords
Eliminating any duplicates (based on the client-specific tags_use_case parameter (proposed name)
Truncating the list per the max_tags parameter

Hmmm, I can see some potential confusion in some of our option names - we should decide on whether the term we use is "keywords" or "tags" and then stick to that convention. Do you know of a client consensus on which term would be best?

Answer 3 · 2018-12-07T15:07:02.000Z

I think it may make sense to have the parser retrieve the full and potentially redundant list of keywords/tags/whatever, so that the whole deduplication logic can be implemented as a single clean dedicated function in the GenericClient.

We can implement a boolean flag in the configuration file to toggle case-sensitivity (tag_dedup_use_case?). It probably makes sense to keep case-insensitive deduplication as default, since that's the current behavior?

WRT option names, I've no idea. You have probably noticed my English is a bit, say... "international" (at best). Both keyword and tag would be ok for me, but they aren't very accurate. Perhaps hashtag would be easier to understand. If it's too long and too twitt-ish, I tend to think tag could be the best alternative (closer to hashtag, and shorter than both). What do you think?

Answer 4 · 2018-12-07T15:57:46.000Z

I agree that it would be simpler and more straightforward to have all the tag-resolution code in a single client-specific function, and the feed parsers should just stay simple in their operation.
As far as terms, let's stick with "tag" for the preferred term, as you suggest.
In thinking about the option name, I realized that we may want to have multiple options, at some point. For example, it might be desirable for the user to turn off the tag extraction from content, and only use tags in the title and that they specify. So maybe what we really want here is tag_filter_opts that can be a comma-separated set of options, with currently supported values of case-insensitive, and include_from_content (both currently implemented as default behavior). I can't think of more option values at this point (unless we wanted to also consider include_from_title?), but this leaves the door open for more things later.

Answer 5 · 2018-12-07T16:02:59.000Z

This tag_filter_opts is a very good idea. :-)

Answer 6 · 2018-12-07T17:45:16.000Z

OK, I was planning to take this one on as my next task. Let me know if you are OK with this, or if you think there is something else that should be given priority (or if you had your eye on this one)!

Answer 7 · 2018-12-07T20:05:11.000Z

I'm progressing well on this one, but it looks like you are making modifications directly in dev? That could be problematic, moving forward...

Answer 8 · 2018-12-07T20:08:00.000Z

Indeed, I realized I should have implemented my changes in another branch. Really sorry about that. I started a major refactoring to put each client in a dedicated file. I should definitely have told you :-/ It's over now, I hope your modifications can be ported to this new file layout?

Answer 9 · 2018-12-07T21:34:45.000Z

Fortunately, I think I grabbed it after you had already finished breaking things up, so no worries there, I've been applying my changes in that new structure with no issue.
But the pytests are in bad shape, it appears.

Answer 10 · 2018-12-07T21:46:11.000Z

There are also some new and puzzling pylint errors too.

I am "mostly" done with the tag code modifications (which also include some steps towards uniformity in client functionality), but I definitely wanted to add some more tests to confirm the new tags/filtering functionality. I can certainly wait to do that later; should I submit a PR now for the code changes I have, or wait until things are straightened out more in dev?

Answer 11 · 2018-12-07T21:49:04.000Z

All the tests are working on my machine, either by running ./test.py in the main directory, or with pytest . or pytest <test_file.py> in tests. I agree that the file layout should be optimized, I haven't had the time to do it yet.

Edit: The big changes in dev are finished, you can submit your PR (I won't be able to check it now, need to sleep (UTC+1 here)).

Edit: From what I see here, the pylint report for all *.py files the src directory is fine (see here). It finds similar lines in several files because there's some overlap between the different test_output implementations (it was hidden before because of the concatenation, but now it's in plain sight). Or maybe you see other issues in the report?

Answer 12 · 2018-12-07T22:57:10.000Z

Those pylint errors you mention are what I meant, so that makes sense.
As far as the pytest stuff, sounds like I just need to modify how I'm running it and all will be well. I might be able to put in the test changes I wanted anyway, based on that, so I guess DON'T expect a PR in the immediate term, but should be able to provide one early tomorrow (I'm UTC-7).

Answer 13 · 2018-12-09T22:58:54.000Z

I've got the changes for this issue finished and I've verified their operation via interactive testing. Unfortunately, I have been unable to get the pytest tests working correctly, and I'm not sure if that's because of

the changes I introduced and have somehow failed to reflect in the tests
the problem you pointed out, caused by the dict implementation and ordering being therefore unpredictable
a combination of the above, or something else I'm not catching

The strangest error output is the following:
ERROR:root:Cannot connect Twitter_full : new_init() takes 1 positional argument but 3 were given

The part that is strange about this is that I didn't make any changes having to do with client initialization (including in the tests), so I'm not sure what to do about this one, or if maybe it's a red herring and something else is wrong.

It looks like I'll need to merge in your latest changes first anyway, since I have changes to golden files too... Just wanted to let you know where this issue was as far as getting a PR in...

Answer 14 · 2018-12-10T06:08:55.000Z

The strangest error output is the following:

I had this error too and fixed it with 7fd83f2. I monkeypatch'd the __init__ of clients in tests/client_test.py, and this new_init remains active for other tests as well.

I also improved the reliability of tests by fixing #44 with f730897. You can run them with make test. I also added CI support for branches/PR. It took quite a bit of time, but I think it's now much easier to find out if something goes wrong, and what exactly.

Answer 15 · 2018-12-10T19:52:11.000Z

OK, I've got your dev merged back into my branch, resolved all conflicts, and re-incorporated all necessary changes into the golden files.
Unfortunately, the only way I could get all the tests to pass was to comment out a couple of assertions. I did my best to track down the underlying error condition, but was unable to get them resolved. Maybe you could take a look and give some further insight? If so, just look for "#!#" comment strings (lines 49 and 215 of tests/client_test.py) in wilddeej@36cf618
Meanwhile I want to add some additional tests for the new functionality and add some updates to the template config, so am not ready for a PR yet (still).

Answer 16 · 2018-12-16T17:20:20.000Z

This is now fixed in dev as of PR #46.