multimodal problem
I am a student from China, and I really appreciate your project. I am trying to do topic modeling on product images and their text reviews, but I have run into a problem. Since the clip-ViT-B-32 encoder does not support Chinese, I am using another CLIP model trained on Chinese data to generate image_features and text_features. I then concatenate them into combined_features, which I pass to BERTopic as the embeddings, and I pass each image's corresponding review as the docs. The good news is that the model runs, but the topic representations are wrong: they contain only meaningless English words and numbers. Since I am not an expert in multimodal computing, I don't know which part of the pipeline has gone wrong.
Most likely, you are not using the right processor in the CountVectorizer. Could you share your full code? Also, please check out the FAQ.
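For context on that diagnosis: scikit-learn's CountVectorizer lowercases the text and, by default, extracts tokens with the regular expression (?u)\b\w\w+\b. Chinese is written without spaces, so each unbroken run of characters becomes a single token that rarely repeats across documents, while English words and numbers are split out cleanly and can end up dominating the topic words. The sketch below reproduces that behavior with the standard library only; the function name and the sample review are made up for illustration.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Mimic CountVectorizer's default analysis: lowercase, then regex-tokenize."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


review = "这个手机很好用 iPhone 15 电池不错"  # made-up sample review
print(default_tokens(review))
# ['这个手机很好用', 'iphone', '15', '电池不错']
```

Each Chinese clause survives as one long token, which is why a word-level tokenizer for Chinese is needed before the counts become meaningful.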
import os
import pandas as pd
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

# Load the CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-H-14", device=device, download_root='./')

# Paths to the image folder and to the Excel file that holds the captions
image_folder = "C:/soft/pycharm/file11111111/爬虫/合并后的图片"
excel_file = "C:/soft/pycharm/file11111111/爬虫/合并后的文档.xlsx"

# Read the captions from the Excel file
captions_df = pd.read_excel(excel_file, names=['index', 'text', 'usefulVoteCount'])

# Store the feature vectors of all images and all texts
all_image_features = []
all_text_features = []

# Iterate over every image in the image folder
for filename in os.listdir(image_folder):
    if filename.endswith(".jpg"):  # assumes all images are in jpg format
        # Load and preprocess the image
        image_path = os.path.join(image_folder, filename)
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        # Encode the image with the CLIP model
        with torch.no_grad():
            image_features = model.encode_image(image)
        # Normalize the features
        image_features /= image_features.norm(dim=-1, keepdim=True)
        # Store the image feature vector
        all_image_features.append(image_features)
        # Encode the matching caption (looked up via the file name)
        index = int(filename.split("_")[1].split(".")[0]) - 1  # extract the index from the file name
        text = clip.tokenize([captions_df.loc[index, 'text']]).to(device)
        # Encode the text with the CLIP model
        with torch.no_grad():
            text_features = model.encode_text(text)
        # Normalize the features
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Store the text feature vector
        all_text_features.append(text_features)

import numpy as np

# Concatenate the image features and the text features into one embedding per sample
combined_image_features = torch.cat(all_image_features, dim=0)
combined_text_features = torch.cat(all_text_features, dim=0)
combined_features = torch.cat((combined_image_features, combined_text_features), dim=1)

# Convert combined_features to a NumPy array
combined_features = combined_features.cpu().numpy()

# Convert the captions to a Python list
docs = captions_df['text'].tolist()

# Check the shape of combined_features
print(f"combined_features shape: {combined_features.shape}")
print(f"Number of documents: {len(docs)}")

# combined_features already has the correct shape (num_samples, embedding_dim),
# so assign it directly to embeddings
embeddings = combined_features

import jieba
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.vectorizers import ClassTfidfTransformer

# Step 1 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine', random_state=42)

# Step 2 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', prediction_data=True)

# Step 3 - Create topic representation
from sklearn.feature_extraction.text import CountVectorizer
stoplists = list(pd.read_csv('停用词.txt', names=['w'], sep='\t', encoding='utf-8').w)
vectorizer_model = CountVectorizer(stop_words=stoplists, ngram_range=(1, 1))
ctfidf_model = ClassTfidfTransformer()

topic_model = BERTopic(
    umap_model=umap_model,              # Step 1 - Reduce dimensionality
    hdbscan_model=hdbscan_model,        # Step 2 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 3 - Tokenize topics
    ctfidf_model=ctfidf_model,          # Step 3 - Extract topic words
    nr_topics=None,                     # no topic reduction (was the string 'none')
    top_n_words=10,
)

# Train the model
topics, probs = topic_model.fit_transform(docs, embeddings)
Here is my full code. Thanks for your help.
Thanks! Definitely check out the FAQ; it should solve your problem, since your input is Chinese text.
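The FAQ entry referred to here suggests giving CountVectorizer a tokenizer built on jieba. A minimal sketch along those lines follows. The segment parameter, the whitespace filter, and the build_vectorizer helper are additions for illustration and are not part of the FAQ or of BERTopic; jieba and scikit-learn are imported lazily so that the tokenizer can also be tried with any other segmenter.

```python
def tokenize_zh(text, segment=None):
    """Split Chinese text into words, dropping whitespace-only tokens.

    segment is a hypothetical hook: any callable mapping a string to a
    list of tokens. When omitted, jieba.lcut is used.
    """
    if segment is None:
        import jieba  # lazy import: only needed for the default segmenter
        segment = jieba.lcut
    return [token for token in segment(text) if token.strip()]


def build_vectorizer(stop_words=None):
    """Build a CountVectorizer that tokenizes with tokenize_zh (illustrative helper)."""
    from sklearn.feature_extraction.text import CountVectorizer
    return CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
```

With the code from this thread, the result would be passed as vectorizer_model=build_vectorizer(stop_words=stoplists) when constructing BERTopic.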
Thank you very much for your response. In fact, I had already tried the tokenize_zh(text) approach, but it still failed. I think there might be a more complex cause: when I used sentence embeddings generated by a sentence transformer, the model worked extremely well, but the problems arose when I used CLIP to generate the embeddings. My current approach is to take the clustering results produced by the model and feed the documents under each topic to an LLM for topic word extraction. Also, more and more Chinese scholars are using your model for research and applications, because it is really great!
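The workaround described above, handing each topic's documents to an LLM, starts from the topic assignments that fit_transform returns. A rough sketch of the grouping step is shown here; the function name, the max_docs cap, and the sample data are hypothetical, and the LLM call itself is left out.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics, max_docs=20):
    """Collect up to max_docs documents per topic, skipping outliers (topic -1)."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        if topic == -1:  # BERTopic assigns -1 to outlier documents
            continue
        if len(grouped[topic]) < max_docs:
            grouped[topic].append(doc)
    return dict(grouped)


# Made-up reviews and topic assignments, standing in for docs and topics
docs = ["屏幕很清晰", "电池续航不错", "屏幕颜色鲜艳", "物流太慢"]
topics = [0, 1, 0, -1]
for topic, members in group_docs_by_topic(docs, topics).items():
    print(topic, members)
```

Capping the number of documents per topic keeps each prompt short; which documents best represent a topic is a separate choice that this sketch does not address.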
Sorry for the late reply!
I fixed some things in BERTopic v0.16.1 that might relate to the problem you had. You should indeed still use tokenize_zh, but the problems with CLIP should/might be resolved.
Thank you for sharing this! Wonderful to hear that more Chinese scholars are using BERTopic. If you ever have any feedback, feel free to reach out!