Thanks for your interest in our WWW 2023 paper A Prompt Log Analysis of Text-to-Image Generation Systems. In this repository, we provide code for data preparation and prompt analysis. For the processed data that we used in the paper (Midjourney, DiffusionDB, SAC, and LAION), please visit the Google Drive folder. For the analysis results and visualizations, please refer to the Google Drive folder and see Data File Descriptions for descriptions of the files.
In our paper, we analyze user-input text-to-image prompts from three large-scale datasets:
Besides, we analyzed the LAION-400M text-image dataset, which is often been used to train text-to-image models.
Dataset processing includes extraction of basic fields (e.g., prompt, timestamp, and user ID) from raw datasets, prompt standardization, and tokenization. For details about dataset processing, please refer to the appendix of our paper.
To process the datasets we used in the paper, run the following command:
python prepare_data.py --dataset_name <dataset_name> --raw_data_root <raw_dataset_root/path> --save_root <result_folder>
--dataset_name
supports inputting one dataset name (choosing frommidjourney
/diffusiondb
/sac
/laion
)--raw_data_root
is the root folder (for Midjourney and LAION) or file path (for DiffusionDB and SAC) of the downloaded raw datasets--save_root
is the folder to save the processed data.
To process a subset of LAION-400M, the augment --num_samples
needs to be specified as the number of samples. The default value is 1M:
python prepare_data.py --dataset_name LAION --raw_data_root <raw_dataset_root/path> --save_root <result_folder> --num_samples 1000000
Term-level analysis contains basic statistics, first-order analysis, second-order analysis, and out-of-vocabulary (OOV) word analysis. For OOV analysis, please first conduct data processing for both the user-input prompt dataset (midjourney/diffusiondb/sac) and LAION.
python term_level.py --dataset_name <dataset_name> --data_path <tokenized_file_path> --laion_path <tokenized_laion_file_path> --save_root <result_folder>
--reweighting_type
allows choosing different reweighting methods, including PMI, tstats and chi-square. We recommend using the default chi-square reweighting method, which also corresponds to the results presented in the paper.--cut_off_k
adjusts the threshold of ranking for reweighting (10000 by default), selecting a limited number of terms and term pairs.
Note that it's recommended to create a new folder for --save_root
to save the analysis result files.
Prompt-level analysis contains basic statistics, DiffusionDB prompt frequency ranking plotting, timestamp plotting, rating histogram plotting (only for SAC), and correlation calculation for ratings and prompt lengths (only for SAC).
python prompt_level.py --dataset_name <dataset_name> --data_path <tokenized_file_path> --save_root <result_folder>
--topk_prompts
adjusts the number of top frequent prompts to be saved in the file.
Session-level analysis contains session grouping, prompt popularity analysis, editing analysis and repeated prompt analysis. See more details about sessions in our paper. Note that since sessions are grouped with a threshold of time, session-level analysis only supports Midjourney and DiffusionDB.
python session_level.py --dataset_name <dataset_name> --data_path <tokenized_file_path> --threshold <session_threshold> --save_root <result_folder>
--threshold
sets the time threshold for sessions, and the default value is 30 (min).
Our analysis results and visualizations are provided in the Google Drive folder. Here are the file descriptions.
No. | Filename | Description |
---|---|---|
1 | Midjourney_tokenized.csv | Preprocessed and tokenized file for the Midjourney dataset. |
2 | DiffusionDB_tokenized.csv | Preprocessed and tokenized file for the DiffusionDB dataset. |
3 | SAC_tokenized.csv | Preprocessed and tokenized file for the SAC dataset. |
4 | LAION_tokenized.csv | Preprocessed and tokenized file for the LAION dataset (subset of 1M). |
5 | Midjourney_all_terms_by_freq.csv | List of all terms in the Midjourney dataset, ranked by frequency in descending order. Table 3 displays the top 20 terms. |
6 | DiffusionDB_all_terms_by_freq.csv | List of all terms in the DiffusionDB dataset, ranked by frequency in descending order. Table 3 displays the top 20 terms. |
7 | SAC_all_terms_by_freq.csv | List of all terms in the SAC dataset, ranked by frequency in descending order. Table 3 displays the top 20 terms. |
8 | Midjourney_reweighted_pairs.csv | List of all term pairs in the Midjourney dataset, ranked by frequency in descending order. Table 4 displays the top 20 pairs. |
9 | DiffusionDB_reweighted_pairs.csv | List of all term pairs in the DiffusionDB dataset, ranked by frequency in descending order. Table 4 displays the top 20 pairs. |
10 | SAC_reweighted_pairs.csv | List of all term pairs in the SAC dataset, ranked by frequency in descending order. Table 4 displays the top 20 pairs. |
11 | DiffusionDB_prompt_freq.csv | Frequencies of prompts in the DiffusionDB dataset (only includes entries with a frequency greater than 1). Table 5 displays the top 20 most frequent prompts. |
12 | DiffusionDB_prompt_userfreq.csv | Prompts shared across users in the DiffusionDB dataset (only includes entries with a frequency greater than 1). Some of the most shared prompts are listed in Table 7. |
13 | Midjourney_replaced_freq.csv | Frequencies of term replacements in sessions of the Midjourney dataset. Table 9 displays the top 30 most frequent replacements. |
14 | DiffusionDB_replaced_freq.csv | Frequencies of term replacements in sessions of the DiffusionDB dataset. Table 9 displays the top 30 most frequent replacements. |
15 | SAC_term_avg_rating.csv | Terms ranked by average ratings in descending order. |
16 | Midjourney_oov_terms.csv | Out-of-vocabulary (OOV) terms in the Midjourney dataset. |
17 | DiffusionDB_oov_terms.csv | Out-of-vocabulary (OOV) terms in the DiffusionDB dataset. |
18 | SAC_oov_terms.csv | Out-of-vocabulary (OOV) terms in the SAC dataset. |
19 | Midjourney_word_embedding_top5000.html | HTML file displaying the visualization of term embeddings for the most frequent 5000 terms in the Midjourney dataset. The embeddings are obtained using BERT and reduced to 2 dimensions using UMAP. Hover over data points to view corresponding terms. |
20 | DiffusionDB_word_embedding_top5000.html | HTML file displaying the visualization of term embeddings for the most frequent 5000 terms in the DiffusionDB dataset. The embeddings are obtained using BERT and reduced to 2 dimensions using UMAP. Hover over data points to view corresponding terms. |
21 | SAC_word_embedding_top5000.html | HTML file displaying the visualization of term embeddings for the most frequent 5000 terms in the SAC dataset. The embeddings are obtained using BERT and reduced to 2 dimensions using UMAP. Hover over data points to view corresponding terms. |
In our data files, all files follow a similar structure with common columns, such as prompt
(cleaned prompts) and tokenized
(tokenized prompts). Additionally, each dataset-specific file includes certain columns for analysis purposes, while other information is provided for reference. Here is a breakdown of the columns in each file:
-
Midjourney_tokenized.csv:
prompt
: Cleaned prompts.tokenized
: Tokenized prompts.url
: Links to generated images (may expire due to Discord's policy).timestamp
: Timestamp of the prompt.author_id
: ID of the prompt author.username
: Username of the prompt author.
-
DiffusionDB_tokenized.csv:
prompt
: Cleaned prompts.tokenized
: Tokenized prompts.author_id
: ID of the prompt author.timestamp
: Timestamp of the prompt.image_name
: Name of the associated image.
-
SAC_tokenized.csv:
prompt
: Cleaned prompts.tokenized
: Tokenized prompts.pid
: Prompt ID. One prompt can have multiplepid
entries.method
: Type of the method used to generate images corresponding to the prompt.iid
: Image ID associated with the prompt.path
: Path where the image is saved.rating
: Rating given to the prompt.
-
LAION_tokenized.csv:
prompt
: Cleaned prompts.tokenized
: Tokenized prompts.sample_id
: ID of the prompt sample.image_url
: URL of the associated image.
Please note that for the analysis, only the timestamp
, author_id
, and rating
columns are utilized, while the other columns provide additional information for reference.
Please feel free to contact us if you have any questions.
@inproceedings{10.1145/3543507.3587430,
author = {Xie, Yutong and Pan, Zhaoying and Ma, Jinge and Jie, Luo and Mei, Qiaozhu},
title = {A Prompt Log Analysis of Text-to-Image Generation Systems},
year = {2023},
isbn = {9781450394161},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3543507.3587430},
doi = {10.1145/3543507.3587430},
abstract = {Recent developments in large language models (LLM) and generative AI have unleashed the astonishing capabilities of text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a “prompt”. These systems have immediately received lots of attention from researchers, creators, and common users. Despite the plenty of efforts to improve the generative models, there is limited work on understanding the information needs of the users of these systems at scale. We conduct the first comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the glory of the Web search industry and research. Compared with Web search queries, text-to-image prompts are significantly longer, often organized into special structures that consist of the subject, form, and intent of the generation tasks and present unique categories of information needs. Users make more edits within creation sessions, which present remarkable exploratory patterns. There is also a considerable gap between the user-input prompts and the captions of the images included in the open training data of the generative models. Our findings provide concrete implications on how to improve text-to-image generation systems for creation purposes.},
booktitle = {Proceedings of the ACM Web Conference 2023},
pages = {3892–3902},
numpages = {11},
keywords = {Query Log Analysis., Prompt Analysis, AI-Generated Content (AIGC), AI for Creativity, Text-to-Image Generation},
location = {Austin, TX, USA},
series = {WWW '23}
}