MaartenGr/BERTopic

multimodal problem


I am a student from China, and I really appreciate your project. I am now trying to do some interesting work, but I have encountered some problems. My idea is to perform topic modeling using product images and text reviews. Since the clip-ViT-B-32 encoder does not support Chinese, I am using another CLIP model trained on Chinese data to generate image_features and text_features. I then concatenate them into combined_features, which I pass as the embeddings for BERTopic, together with each image's corresponding review as the docs. The good news is that the model runs, but there is a problem with the topic representation: it only produces meaningless English words and numbers. Since I am not an expert in multimodal computing, I don't know which part of the pipeline has gone wrong.
[Screenshots of the topic representation output attached.]

Most likely, you are not using the right processor in the CountVectorizer. Could you share your full code? Also, please check out the FAQ.

import os
import pandas as pd
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

# Load the CLIP model

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-H-14", device=device, download_root='./')

# Path to the image folder and to the Excel file with the caption list

image_folder = "C:/soft/pycharm/file11111111/爬虫/合并后的图片"
excel_file = "C:/soft/pycharm/file11111111/爬虫/合并后的文档.xlsx"

# Read the caption list from the Excel file

captions_df = pd.read_excel(excel_file, names=['index', 'text','usefulVoteCount'])

# Lists to store the feature vectors of all images and all texts

all_image_features = []
all_text_features = []

# Iterate over every image in the image folder

for filename in os.listdir(image_folder):
    if filename.endswith(".jpg"):  # assume all images are in jpg format
        # Load and preprocess the image
        image_path = os.path.join(image_folder, filename)
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

        # Encode the image with the CLIP model
        with torch.no_grad():
            image_features = model.encode_image(image)
            # Normalize the features
            image_features /= image_features.norm(dim=-1, keepdim=True)

        # Store the image feature vector
        all_image_features.append(image_features)

        # Encode the text using the corresponding caption (matched by filename)
        index = int(filename.split("_")[1].split(".")[0]) - 1  # extract the index number from the filename
        text = clip.tokenize([captions_df.loc[index, 'text']]).to(device)

        # Encode the text with the CLIP model
        with torch.no_grad():
            text_features = model.encode_text(text)
            # Normalize the features
            text_features /= text_features.norm(dim=-1, keepdim=True)

        # Store the text feature vector
        all_text_features.append(text_features)

import numpy as np

# Concatenate all image features and text features into one embedding per sample

combined_image_features = torch.cat(all_image_features, dim=0)
combined_text_features = torch.cat(all_text_features, dim=0)
combined_features = torch.cat((combined_image_features, combined_text_features), dim=1)

# Convert combined_features to a NumPy array

combined_features = combined_features.cpu().numpy()

# Convert the caption column to a Python list

docs = captions_df['text'].tolist()

# Check the shape of combined_features

print(f"combined_features shape: {combined_features.shape}")
print(f"Number of documents: {len(docs)}")

# combined_features already has the correct shape (num_samples, embedding_dim),
# so assign it directly to embeddings

embeddings = combined_features
import jieba
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.vectorizers import ClassTfidfTransformer

# Step 1 - Reduce dimensionality

umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine', random_state=42)

# Step 2 - Cluster reduced embeddings

hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', prediction_data=True)

# Step 3 - Create topic representation

from sklearn.feature_extraction.text import CountVectorizer
stoplists = list(pd.read_csv('停用词.txt', names=['w'], sep='\t', encoding='utf-8').w)
vectorizer_model = CountVectorizer(stop_words=stoplists, ngram_range=(1,1))
ctfidf_model = ClassTfidfTransformer()
topic_model = BERTopic(
    umap_model=umap_model,              # reduce dimensionality
    hdbscan_model=hdbscan_model,        # cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # tokenize topics
    ctfidf_model=ctfidf_model,          # extract topic words
    nr_topics=None,                     # keep all topics
    top_n_words=10,
)

# Train the model

topics, probs = topic_model.fit_transform(docs, embeddings)
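
To inspect the fitted topics afterwards, the standard BERTopic accessors can be used (a minimal sketch; get_topic_info and get_topic are existing BERTopic methods):

# Inspect the resulting topics and their top words
print(topic_model.get_topic_info().head())   # one row per topic, with size and name
print(topic_model.get_topic(0))              # (word, c-TF-IDF weight) pairs for topic 0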

> Most likely, you are not using the right processor in the CountVectorizer. Could you share your full code? Also, please check out the FAQ.

Here is my full code. Thanks for your help!

Thanks! Definitely check out the FAQ; it should solve your problem since your input is Chinese text.
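
For reference, the FAQ's suggestion for Chinese boils down to tokenizing with jieba inside the CountVectorizer, roughly like this (a sketch; adapt it to your own pipeline):

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
import jieba

def tokenize_zh(text):
    # Segment Chinese text into words with jieba
    return jieba.lcut(text)

vectorizer_model = CountVectorizer(tokenizer=tokenize_zh)
topic_model = BERTopic(vectorizer_model=vectorizer_model)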

> Thanks! Definitely check out the FAQ; it should solve your problem since your input is Chinese text.

Thank you very much for your response. In fact, I had tried the def tokenize_zh(text) method before, but it still failed. I think there might be more complex reasons, because when I used the sentence embeddings generated by the sentence transformer for analysis, the model was extremely successful, but when I used CLIP to generate the embeddings, problems arose. My current approach is to take the clustering results generated by the model and feed the documents under each topic to an LLM for topic word extraction. Also, more and more Chinese scholars are using your model for research and applications, because it is really great!
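
Roughly, that grouping step looks like this (a sketch; get_document_info is a standard BERTopic method, and query_llm stands in for whatever LLM call is actually used):

# Group the documents by their assigned topic and ask an LLM for keywords per topic
doc_info = topic_model.get_document_info(docs)        # one row per document, with its Topic
for topic_id, group in doc_info.groupby("Topic"):
    if topic_id == -1:                                # skip the outlier topic
        continue
    sample = "\n".join(group["Document"].head(20))    # a handful of reviews from this topic
    prompt = f"Extract 10 topic keywords from these reviews:\n{sample}"
    print(topic_id, query_llm(prompt))                # query_llm is a hypothetical helper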

Sorry for the late reply!

> Thank you very much for your response. In fact, I had tried the def tokenize_zh(text) method before, but it still failed. I think there might be more complex reasons, because when I used the sentence embeddings generated by the sentence transformer for analysis, the model was extremely successful, but when I used CLIP to generate the embeddings, problems arose.

I fixed some things in BERTopic v0.16.1 that might relate to the problem you had. You should indeed still use tokenize_zh but the problems with CLIP should/might be resolved.

> Also, more and more Chinese scholars are using your model for research and applications, because it is really great!

Thank you for sharing this! Wonderful to hear that more Chinese scholars are using BERTopic. If you ever have any feedback, feel free to reach out!