Visual Semantic Relatedness Dataset for Image Captioning

Ahmed Sabir1, Francesc Moreno-Noguer2, Lluís Padró1
Universitat Politècnica de Catalunya, TALP Research Center1
Institut de Robòtica i Informàtica Industrial, CSIC-UPC2


Abstract

Modern image captioning systems rely heavily on extracting knowledge from images to capture the concept of a static story. In this paper, we propose a textual visual context dataset for captioning, in which the publicly available COCO Captions dataset (Lin et al., 2014) has been extended with information about the scene (such as the objects in the image). Since this information has a textual form, it can be used to leverage any NLP task, such as text similarity or semantic relatedness methods, in captioning systems, either as an end-to-end training strategy or as a post-processing approach.


Overview

We enrich COCO Captions with textual visual context information. We use ResNet152, CLIP, and Faster R-CNN to extract object information for each image, and apply three filtering approaches to ensure the quality of the dataset:

(1) Threshold: filter out predictions for which the object classifier is not confident enough.
(2) Semantic alignment: use semantic similarity to remove duplicated objects.
(3) Semantic relatedness score as a soft label: guarantee that the visual context and the caption are strongly related. In particular, we use Sentence-RoBERTa to produce a cosine-similarity soft score, which is then binarized with a threshold to obtain the final label (1 if the score is at least 0.2, 0.3, or 0.4, depending on the setting; 0 otherwise), as sketched below.
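A minimal sketch of the soft-label step using the sentence-transformers package; the checkpoint name below is an assumption for illustration, and may differ from the Sentence-RoBERTa model actually used to build the dataset.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-roberta-large-v1")  # assumed checkpoint

def relatedness_label(visual_context, caption, threshold=0.4):
    """Cosine similarity between visual context and caption, binarized
    with one of the annotation thresholds (0.2, 0.3, or 0.4)."""
    emb = model.encode([visual_context, caption], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return score, int(score >= threshold)

print(relatedness_label("pitcher ballplayer",
                        "a baseball player preparing to throw the ball."))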

For a quick start, please have a look at the demo.

Proposed Approach

We also propose a strategy, based on a BERT-CNN model, to estimate the most closely related (and unrelated) visual concepts given the caption description.
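A rough sketch of what such a BERT-CNN relatedness scorer can look like, assuming HuggingFace transformers and PyTorch; the checkpoint, filter sizes, and classification head are illustrative, not the exact configuration from the paper.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertCNN(nn.Module):
    def __init__(self, name="bert-base-uncased", n_filters=100, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        # One 1-D convolution per n-gram size over the BERT token embeddings.
        self.convs = nn.ModuleList(nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)  # (batch, hidden, seq) layout expected by Conv1d
        feats = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        return torch.sigmoid(self.fc(torch.cat(feats, dim=1)))  # relatedness in [0, 1]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok("pitcher ballplayer",
            "a baseball player preparing to throw the ball.",
            return_tensors="pt")
score = BertCNN()(batch["input_ids"], batch["attention_mask"])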

Resulting Dataset

We rely on the COCO Captions dataset to extract the visual context, and we propose two visual context datasets:

VisualCOCO

    visual context, caption description
    umbrella dress human face, a woman with an umbrella near the sea.
    bathtub tub, this is a bathroom with a jacuzzi shower sink and toilet.
    snowplow shovel, the fire hydrant is partially buried under the snow.
    desktop computer monitor, a computer with a flower as its background sits on a desk.
    pitcher ballplayer, a baseball player preparing to throw the ball.
    groom restaurant, a black and white picture of a centerpiece to a table at a wedding.
OverlappingCOCO

    visual context, caption description, overlapping information
    pole streetsign flagpole, a house that has a pole with a sign on it,{'pole'}.
    stove microwave refrigerator, an older stove sits in the kitchen next to a bottle of cleaner,{'stove'}.
    racket tennis ball ballplayer, a tennis player swinging a racket at a ball,{'tennis', 'racket', 'ball'}.
    grocery store dining table restaurant, a table is full of different kinds of food and drinks,{'table'}.
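Records in both listings can be parsed with a few lines of Python. This is an illustrative sketch assuming one "visual context, caption[,{overlap set}]" record per line, as in the examples above; it is not the repository's official loader.

def parse_record(line):
    visual, rest = line.split(", ", 1)
    if ",{" in rest:  # OverlappingCOCO rows carry an extra set column
        caption, overlap = rest.split(",{", 1)
        items = overlap.rstrip("}.").replace("'", "").split(", ")
        return visual, caption, set(items)
    return visual, rest, None

print(parse_record("pitcher ballplayer, a baseball player preparing to throw the ball."))
print(parse_record("racket tennis ball ballplayer, a tennis player swinging a racket "
                   "at a ball,{'tennis', 'racket', 'ball'}."))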

Another task that can benefit from the proposed dataset is investigating the contribution of the visual context to gender bias. To that end, we also propose a gender-neutral dataset.

Gender NeutralCOCO

    visual context, caption description
    pizza, a person cutting a pizza with a fork and knife.
    suit, a person in a suit and tie sitting with his hands between his legs.
    paddle, a person riding a colorful surfboard in the water.
    ballplayer, a young person in a batting stance in a baseball game.
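A sketch of one way such captions can be neutralized, mapping gendered nouns to person/people; the substitution rules actually used to build the dataset may be more elaborate (note, for instance, that "his" in the suit example above is left unchanged).

import re

GENDERED = {"man": "person", "woman": "person", "men": "people",
            "women": "people", "boy": "person", "girl": "person"}

def neutralize(caption):
    pattern = r"\b(" + "|".join(GENDERED) + r")\b"
    return re.sub(pattern, lambda m: GENDERED[m.group(1).lower()],
                  caption, flags=re.IGNORECASE)

print(neutralize("a woman with an umbrella near the sea."))
# -> "a person with an umbrella near the sea."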
visual       +person   +man   +woman      m      w   to-m
Clothing        3950    3360     1490    .85    .37    .69
footwear        2810    1720      220    .61    .07    .88
racket          1360     440      150    .32    .11    .74
surfboard        820      80       10    .09    .01    .88
tennis           140     200       60    1.4    .42    .76
motorcycle       480      40       20    .08    .04    .66

Frequency count of object + gender in the proposed dataset (m = man/person, w = woman/person, to-m = man/(man+woman)).
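The ratio columns follow directly from the raw counts; the column definitions above are inferred from the numbers in the table (which appear truncated, not rounded, to two decimals).

rows = {"Clothing": (3950, 3360, 1490), "footwear": (2810, 1720, 220)}
for name, (person, man, woman) in rows.items():
    m, w = man / person, woman / person   # gendered vs. neutral mentions
    to_m = man / (man + woman)            # share of gendered mentions that are male
    print(f"{name}: m={m:.2f} w={w:.2f} to-m={to_m:.2f}")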

In most cases, the dataset contains more gender-neutral mentions (person) than gendered ones (man or woman). However, like COCO itself, which is biased toward men, the proposed dataset inherits that bias.


Citation

@inproceedings{sabir2023visual,
  title={Visual Semantic Relatedness Dataset for Image Captioning},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5597--5605},
  year={2023}
}

Contact: Ahmed Sabir