NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media, EMNLP 2021

This repository contains our dataset of automatically generated out-of-context image-caption pairs in the news media. For inquiries and requests, please contact graceluo@berkeley.edu.

Requirements

Make sure you are running Python 3.6+.

Getting Started

  1. Request the VisualNews Dataset. Place the files under the visual_news folder.
  2. Run ./download.sh to download our matches and populate the news_clippings folder; the matches belong in news_clippings/data/.
  3. Optionally, run analyses of your own using the embeddings we provide; these belong in news_clippings/embeddings/.

All ids and image paths provided in our data/ folder correspond exactly to those listed in the data.json file in VisualNews.

Your file structure should look like this:

news_clippings
│
├── data/
└── embeddings/

visual_news
│
├── origin/
│    └── data.json
│        ...
└── ...
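
Before loading anything, it can help to confirm that this layout is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the list of paths to check is an assumption taken from the tree shown here, relative to a common root directory.

```python
from pathlib import Path

# Paths taken from the layout shown in this README, relative to a common root.
EXPECTED_PATHS = [
    "news_clippings/data",
    "news_clippings/embeddings",
    "visual_news/origin/data.json",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


# Report anything that still needs to be downloaded or moved into place.
missing = missing_paths(".")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Layout looks complete.")
```

Run it from the directory that contains both news_clippings and visual_news, or pass that directory as `root`.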

Data Format

The data is ordered such that every even-indexed sample (0, 2, 4, ...) is pristine, and the sample that immediately follows it is its associated falsified sample.

  • id: the id of the VisualNews sample associated with the caption
  • image_id: the id of the VisualNews sample associated with the image
  • similarity_score: the similarity measure used to generate the sample (i.e. clip_text_image, clip_text_text, sbert_text_text, resnet_place)
  • falsified: a binary indicator of whether the caption / image pair is a mismatch we generated, as opposed to the original pair in VisualNews
  • source_dataset (Merged / Balanced only): the index of the sub-split name in source_datasets
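
To make these fields concrete, here is a small sketch that summarizes a list of annotations. It runs on a hand-made toy dictionary rather than the real files. The ids are invented, the helper names `count_by_similarity` and `sub_split_name` are hypothetical, and the assumption that `source_datasets` is a list stored next to `annotations` in the same JSON object should be checked against the downloaded data.

```python
from collections import Counter

# Toy stand-in for a Merged / Balanced split file. All values are made up;
# only the field names and the sub-split names come from this README.
data = {
    "source_datasets": ["person_sbert_text_text", "scene_resnet_place"],  # assumed location
    "annotations": [
        {"id": 1, "image_id": 1, "similarity_score": "sbert_text_text",
         "falsified": False, "source_dataset": 0},
        {"id": 1, "image_id": 7, "similarity_score": "sbert_text_text",
         "falsified": True, "source_dataset": 0},
        {"id": 2, "image_id": 2, "similarity_score": "resnet_place",
         "falsified": False, "source_dataset": 1},
        {"id": 2, "image_id": 9, "similarity_score": "resnet_place",
         "falsified": True, "source_dataset": 1},
    ],
}


def count_by_similarity(annotations):
    """Count how many samples were generated with each similarity measure."""
    return Counter(ann["similarity_score"] for ann in annotations)


def sub_split_name(data, ann):
    """Resolve the source_dataset index of a sample to its sub-split name."""
    return data["source_datasets"][ann["source_dataset"]]


print(count_by_similarity(data["annotations"]))
print(sub_split_name(data, data["annotations"][2]))
```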

Here's an example of how you can start using our matches:

    import json

    # Index the VisualNews annotations by id for constant-time lookup.
    with open("visual_news/origin/data.json") as f:
        visual_news_data = json.load(f)
    visual_news_data_mapping = {ann["id"]: ann for ann in visual_news_data}

    # Load one NewsCLIPpings split and take its first sample.
    with open("news_clippings/data/merged_balanced/val.json") as f:
        data = json.load(f)
    annotations = data["annotations"]
    ann = annotations[0]

    # The caption and the image may come from different VisualNews samples.
    caption = visual_news_data_mapping[ann["id"]]["caption"]
    image_path = visual_news_data_mapping[ann["image_id"]]["image_path"]

    print("Caption: ", caption)
    print("Image Path: ", image_path)
    print("Is Falsified: ", ann["falsified"])
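
Because pristine and falsified samples alternate, consecutive annotations can be read as pairs. The sketch below assumes 0-based indexing, so that index 0 is pristine and index 1 is its falsified counterpart, and it verifies that assumption through the `falsified` flag instead of taking it for granted. The helper name `iter_pairs` is hypothetical and the toy annotations are invented for illustration.

```python
def iter_pairs(annotations):
    """Yield (pristine, falsified) tuples from an alternating annotation list."""
    if len(annotations) % 2 != 0:
        raise ValueError("expected an even number of annotations")
    for i in range(0, len(annotations), 2):
        pristine, falsified = annotations[i], annotations[i + 1]
        # Fail loudly if the ordering assumption does not hold.
        if pristine["falsified"] or not falsified["falsified"]:
            raise ValueError(f"unexpected ordering at index {i}")
        yield pristine, falsified


# Toy annotations with made-up ids; real ones come from data["annotations"].
toy_annotations = [
    {"id": 1, "image_id": 1, "falsified": False},
    {"id": 1, "image_id": 7, "falsified": True},
    {"id": 2, "image_id": 2, "falsified": False},
    {"id": 2, "image_id": 9, "falsified": True},
]

for pristine, falsified in iter_pairs(toy_annotations):
    print(pristine["image_id"], "->", falsified["image_id"])
```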

Embeddings

We include the following precomputed embeddings:

  • clip_image_embeddings: 512-dim image embeddings from CLIP ViT-B/32.
    Contains embeddings for samples in all splits.
  • clip_text_embeddings: 512-dim caption embeddings from CLIP ViT-B/32.
    Contains embeddings for samples in all splits.
  • sbert_embeddings: 768-dim caption embeddings from SBERT-WK.
    Contains embeddings for samples in all splits.
  • places_resnet50: 2048-dim image embeddings using ResNet50 trained on Places365.
    Contains embeddings only for samples in the scene_resnet_place split (where [PERSON] entities were not detected in the caption).

The following embedding types were not used in the construction of our dataset, but you may find them useful.

  • facenet_embeddings: 512-dim embeddings for each face detected in the images using FaceNet. If no faces were detected, the entry is None.
    Contains embeddings only for samples in the person_sbert_text_text split (where [PERSON] entities were detected in the caption).

All embeddings are dictionaries of {id: numpy array} stored in pickle files for train / val / test. You can access the features for each image / caption by its id like so:

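    import pickle

    # Each pickle file holds a {id: numpy array} dictionary for one split.
    with open("news_clippings/embeddings/clip_image_embeddings/test.pkl", "rb") as f:
        clip_image_embeddings = pickle.load(f)

    sample_id = 701864  # avoids shadowing the built-in id()
    print(clip_image_embeddings[sample_id])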
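
One simple analysis is to score how well a caption matches an image by taking the cosine similarity of their CLIP embeddings. The sketch below is an illustration only and is not the procedure used to build the dataset. It uses just the standard library so that it works on plain lists as well as on the numpy arrays stored in the pickle files, and the 3-dim toy vectors stand in for the real 512-dim embeddings. It also assumes that text embeddings are looked up by the caption `id` and image embeddings by `image_id`; the helper name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (lists or numpy arrays)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings keyed by made-up ids; real ones come from the pickle files.
text_embeddings = {1: [1.0, 0.0, 0.0]}
image_embeddings = {1: [1.0, 0.0, 0.0], 7: [0.0, 1.0, 0.0]}

ann = {"id": 1, "image_id": 7, "falsified": True}
score = cosine_similarity(text_embeddings[ann["id"]],
                          image_embeddings[ann["image_id"]])
print("Caption / image similarity:", score)
```

For large-scale analyses over the real arrays, a vectorized numpy implementation would be considerably faster than this element-wise loop.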

Available Upon Request

We have additional metadata available upon request, such as the spaCy and REL named entities, timestamp, location of the original article content, etc.

We also have sbert_embeddings_dissecting, which has an embedding for each token and its weighting from running the "dissecting" setting of SBERT-WK, available upon request.

Training

To run the benchmarking experiments we reported in our paper, look at the README for news_clippings_training/.

Citing

If you find our dataset useful for your research, please cite the following paper:

@article{luo2021newsclippings,
  title={NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media},
  author={Luo, Grace and Darrell, Trevor and Rohrbach, Anna},
  journal={arXiv:2104.05893},
  year={2021}
}