bmabey/pyLDAvis

[feature] decouple visualisation UI's topic numbering from their labels

ed9w2in6 opened this issue · 2 comments

We have a whole family of issues that are just about the numbering of topics during visualisation:

They can all be resolved just by decoupling the numbering from the labels, which also removes the need for the sort_topics and start_index options in the Python API.

I am not going into implementation details or a specification of outcomes here, but here are some ideas:

Outline

python API side

We currently generate topic numbers in topic_top_term_df in _prepare.py. The numbering comes from enumerate with start_index, which the user supplies to the prepare method and which is then passed through the _topic_info method.

topic_dfs = map(topic_top_term_df, enumerate(top_terms.T.iterrows(), start_index))
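
To make the coupling concrete, here is a minimal pure-Python sketch of what enumerate with a start value does. The rows list is a made-up stand-in for top_terms.T.iterrows(), not real pyLDAvis data; only the 'Topic%d' format string is taken from _prepare.py.

```python
# Hypothetical stand-in for top_terms.T.iterrows(): (original_topic_id, row)
# pairs, already in the order they will be displayed.
rows = [(7, "row for original topic 7"), (2, "row for original topic 2")]

start_index = 1  # plays the same role as the start_index option

# enumerate() numbers items purely by position, so the generated label
# depends only on the display order and on start_index, never on the
# original topic id.
labels = [
    "Topic%d" % new_topic_id
    for new_topic_id, (original_topic_id, _row) in enumerate(rows, start_index)
]

print(labels)  # ['Topic1', 'Topic2']
```

The original ids 7 and 2 never reach the label, which is why the displayed number and the name cannot currently be chosen independently.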

Sorting is orthogonal to this logic, hence we can safely ignore it when changing this code:

if (sort_topics):
    topic_proportion = (topic_freq / topic_freq.sum()).sort_values(ascending=False)
else:
    topic_proportion = (topic_freq / topic_freq.sum())

The number generated from enumerate will ultimately be used to name the topic, stored as Category:

'Category': 'Topic%d' % new_topic_id,

I believe we should allow the user to supply a list of strings instead.

If we change this, we also need to change sorted_terms, which rebuilds the category name from the topic number:

class PreparedData(namedtuple('PreparedData', ['topic_coordinates', 'topic_info', 'token_table',
                                               'R', 'lambda_step', 'plot_opts', 'topic_order'])):
    def sorted_terms(self, topic=1, _lambda=1):
        """Returns a dataframe using _lambda to calculate term relevance of a given topic."""
        tdf = pd.DataFrame(self.topic_info[self.topic_info.Category == 'Topic' + str(topic)])
        if _lambda < 0 or _lambda > 1:

and make sure none of the names is "Default", since we use it as the default category:

default_term_info = pd.DataFrame({
    'saliency': saliency,
    'Term': vocab,
    'Freq': term_frequency,
    'Total': term_frequency,
    'Category': 'Default'})

And that is for the topic_info data only; we would have to do the same for mdsData and token_table too.
Clearly a better way is to side-step all of this: just accept a list of desired names and store it in the PreparedData namedtuple.
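
For comparison, if the names were written into Category directly, a check along these lines would be needed. This is only a sketch: check_topic_names is a hypothetical helper that does not exist in pyLDAvis, and the uniqueness rule is my own assumption, made because Category is used to filter rows.

```python
RESERVED_CATEGORY = "Default"  # the category name used by default_term_info


def check_topic_names(topic_names):
    """Hypothetical helper: reject names that would clash inside Category."""
    names = [str(name) for name in topic_names]
    # Assumption: rows are filtered by Category, so names must be unique.
    if len(set(names)) != len(names):
        raise ValueError("topic names must be unique")
    # "Default" already labels the corpus-wide term info.
    if RESERVED_CATEGORY in names:
        raise ValueError("%r is reserved and cannot be a topic name" % RESERVED_CATEGORY)
    return names


print(check_topic_names(["Sports", "Politics"]))  # ['Sports', 'Politics']
```

Keeping the names out of Category avoids needing this check in three separate tables.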

Solution: side step at JS visualisation side

Currently, our visualisation logic hard-codes the assumption that Category must be of the form "TopicN", where N is a number:

function reorder_bars(increase) {
    // grab the bar-chart data for this topic only:
    var dat2 = lamData.filter(function(d) {
        return d.Category == "Topic" + vis_state.topic;
    });

Therefore, again, the path of least friction is to side-step it by changing only the visualisation logic:

  1. RHS Table title
    .attr("y", -30)
    .attr("class", "bubble-tool") // set class so we can remove it when highlight_off is called
    .style("text-anchor", "middle")
    .style("font-size", "16px")
    .text("Top-" + R + " Most Relevant Terms for Topic " + topics + " (" + Freq + "% of tokens)");
  2. circle label
    .style("font-size", "11px")
    .style("fontWeight", 100)
    .text(function(d) {
        return d.topics;
    });

Of these, 2 is optional. So only 3 changes in total!


Summary, changes needed

  1. new parameter for topic names
  2. store it at PreparedData
  3. change RHS Table title, optionally the circle labels too
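
The three changes can be sketched in Python as follows. PreparedDataSketch is a cut-down stand-in for PreparedData, and the topic_names field and table_title helper are hypothetical; the real title is built in the JS code, so this only mirrors the string it would produce. Topic numbers are assumed to be 1-based here.

```python
from collections import namedtuple

# Simplified stand-in for PreparedData with one extra, hypothetical field.
PreparedDataSketch = namedtuple(
    "PreparedDataSketch",
    ["topic_info", "R", "topic_order", "topic_names"],
)


def table_title(data, topic, freq):
    """Build the RHS table title from the stored display name.

    `topic` is the number the UI uses internally (assumed 1-based), so the
    Category values and all filtering logic can stay untouched.
    """
    name = data.topic_names[topic - 1]
    return "Top-%d Most Relevant Terms for %s (%s%% of tokens)" % (data.R, name, freq)


data = PreparedDataSketch(
    topic_info=None, R=30, topic_order=[2, 1],
    topic_names=["Sports", "Politics"],
)
print(table_title(data, 1, 12.5))
# Top-30 Most Relevant Terms for Sports (12.5% of tokens)
```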

Are you creating a matching pull request?

@msusol Yes, still WIP though. Ideally the code base would be cleaned up, but I do not have such plans.
My plan is just a quick hack, as mentioned above:

  1. add a new param to prepare, defaulting to None, with some logic to generate dummy topic names if it is None.
  2. store it at PreparedData
  3. change the visualisation accordingly:
    • RHS Table title
    • the circle labels too, if it looks good.
    • allow selecting a topic by its name too, if not too difficult
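
A rough sketch of step 1 and of the select-by-name idea, under stated assumptions: resolve_topic_names and topic_number_for are hypothetical names, the "Topic N" dummy format is a guess at a sensible default, and none of this is code from the pending pull request.

```python
def resolve_topic_names(n_topics, topic_names=None, start_index=1):
    """Hypothetical helper: return one display name per topic.

    With topic_names=None, dummy names are generated, so existing callers
    of prepare would see no change in behaviour.
    """
    if topic_names is None:
        return ["Topic %d" % i for i in range(start_index, start_index + n_topics)]
    names = [str(name) for name in topic_names]
    if len(names) != n_topics:
        raise ValueError("expected %d topic names, got %d" % (n_topics, len(names)))
    return names


def topic_number_for(names, query, start_index=1):
    """Map a display name back to the number the UI uses internally."""
    return names.index(query) + start_index


print(resolve_topic_names(3))  # ['Topic 1', 'Topic 2', 'Topic 3']
names = resolve_topic_names(2, ["Sports", "Politics"])
print(topic_number_for(names, "Politics"))  # 2
```

Looking a topic up by name reduces to a list search because the numbers themselves stay positional.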