MaartenGr/BERTopic

AttributeError: 'BERTopic' object has no attribute 'c_tf_idf'


I'm following the steps in issue #360 to test how metadata influences the prevalence/content of topics.

But I get AttributeError: 'BERTopic' object has no attribute 'c_tf_idf' when running:

ests = estimate_effect(topic_model=topic_model,
                       topics=[-1, 0],
                       metadata=metadata,
                       docs=enr_df_docs,
                       probs=probs,
                       estimator="content ~ score",
                       y="content")
print([est.summary() for est in ests])

There is a bunch of code in that issue, so I'm not sure which you are referring to. Could you share it?

I tried the following:

I first ran the basic BERTopic model:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(enr_df_docs)

I then ran the estimate_effect function from the comment in that issue:

from typing import Any, Callable, List, Mapping, Union

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.base.wrapper as wrap
from sklearn.metrics.pairwise import cosine_similarity


def estimate_effect(topic_model,
                    docs: List[str],
                    topics: Union[int, List[int]],
                    metadata: pd.DataFrame,
                    y: str = "prevalence",
                    probs: np.ndarray = None,
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    
    """ Estimate the effect of metadata on topic prevalence and topic content
    
    Arguments:
        docs: The original list of documents the model was trained on
        probs: An m×n probability matrix, where *m* is the number of documents and
               *n* the number of topics. It represents the probabilities of all topics
               across all documents.
        topics: The topic(s) for which you want to estimate the effect of metadata
        metadata: The metadata in a dataframe. Make sure that the columns have exactly
                  the same names as the variables in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        estimator: Either the formula used in the estimator or a custom estimator.
                   When used as a formula, it follows R-style formulas, for example:
                      * 'prevalence ~ rating'
                      * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'.
                   The custom estimator should be a `statsmodels.formula.api` model;
                   currently, `statsmodels.api` is not supported.
        estimator_kwargs: The arguments needed within the estimator, needs at 
                          least a "formula" argument
                          
    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """

    data = metadata.copy()
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    data["docs"] = docs
    fitted_estimators = []
    
    if isinstance(topics, int):
        topics = [topics]
    
    # As a proxy for topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document
    # belonging to that topic also means that the document talks more about that topic
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence.
            # Exclude probs == 1 as no zero-one inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered, family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document level by calculating each document's c-TF-IDF
    # representation. Based on that representation, we calculate its cosine similarity
    # with its topic's c-TF-IDF representation. The assumption here is that we expect
    # different similarity scores if a covariate changes the topic content.
    elif y == "content":
        for topic in topics:
            # Extract the documents belonging to this topic
            selected_data = data.loc[data.topics == topic, :]
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic+1]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=selected_data, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=selected_data, 
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()  # perhaps remove the gamma + link?
            fitted_estimators.append(est)

    return fitted_estimators

The code for prevalence works well:

ests = estimate_effect(topic_model=topic_model,
                       topics=[-1, 1],
                       metadata=metadata,
                       docs=enr_df_docs,
                       probs=probs,
                       estimator="prevalence ~ score",
                       y="prevalence")
print([est.summary() for est in ests])

But the code for content returns an error:

ests = estimate_effect(topic_model=topic_model,
                       topics=[-1, 0],
                       metadata=metadata,
                       docs=enr_df_docs,
                       probs=probs,
                       estimator="content ~ score",
                       y="content")
print([est.summary() for est in ests])

I guess I messed something up here, but I didn't really change any code:

elif y == "content":
        for topic in topics:
            # Extract the documents belonging to this topic
            selected_data = data.loc[data.topics == topic, :]
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic+1]

Sorry for the trouble, and thanks in advance for your response.

I think you need to change .c_tf_idf to .c_tf_idf_ in order to access the correct variable. I believe the attribute was renamed a while ago, which explains your issue.
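In other words, the similarity line in the content branch would become:

sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf_)

And if you want to double-check which attribute name your installed version actually exposes, something like this should list the candidates:

print([attr for attr in dir(topic_model) if "c_tf_idf" in attr])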