DeepFashion2 Dataset

DeepFashion2 is a comprehensive fashion dataset. It contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers. In total, it has 801K clothing items, where each item in an image is labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks, and a per-pixel mask. There are also 873K commercial-consumer clothes pairs.
The dataset is split into a training set (391K images), a validation set (34K images), and a test set (67K images).
Examples of DeepFashion2 are shown in Figure 1.

Figure 1: Examples of DeepFashion2.

From (1) to (4), each row represents clothes images with different variations. In each row, the images are partitioned into two groups: the left three columns show clothes from commercial stores, while the right three columns show clothes from customers. In each group, the three images indicate three levels of difficulty with respect to the corresponding variation. Furthermore, in each row, the items in the two groups are from the same clothing identity but from two different domains, that is, commercial and customer. Items of the same identity may have different styles, such as color and printing. Each item is annotated with landmarks and masks.

Announcements

xx-xx-xx Validation set of DeepFashion2 is released.

Download the Data

The links to the validation set are provided below. We will soon release the training set and the test set.

Images(val_imags.zip)
Annotations(val_annos.zip)

Data Organization

Each image in a separate image set has a unique six-digit number, such as 000001.jpg. A corresponding annotation file in JSON format is provided in the annotation set, such as 000001.json.
Each annotation file is organized as below:

source: a string, where 'shop' indicates that the image is from a commercial store, while 'user' indicates that the image was taken by a user.
pair_id: a number. Images from the same shop and their corresponding consumer-taken images have the same pair id.
- item 1
  - category_name: a string which indicates the category of the item.
  - category_id: a number which corresponds to the category name. In category_id, 1 represents short sleeve top, 2 represents long sleeve top, 3 represents short sleeve outwear, 4 represents long sleeve outwear, 5 represents vest, 6 represents sling, 7 represents shorts, 8 represents trousers, 9 represents skirt, 10 represents short sleeve dress, 11 represents long sleeve dress, 12 represents vest dress, and 13 represents sling dress.
  - style: a number that distinguishes clothing items from images with the same pair id. Clothing items with different style numbers from images with the same pair id differ in style, such as color, printing, or logo. A clothing item from a shop image and a clothing item from a user image form a positive commercial-consumer pair if they have the same style number, that number is greater than 0, and they come from images with the same pair id.
  - bounding_box: [x1,y1,x2,y2], where x1 and y1 represent the lower left point coordinate of the bounding box, and x2 and y2 represent the upper right point coordinate of the bounding box.
  - landmarks: [x1,y1,v1,...,xn,yn,vn], where v represents the visibility: v=2 visible; v=1 occluded; v=0 not labeled. Landmark definitions differ across categories. The order of landmark annotations is shown in Figure 2.
  - segmentation: [[x1,y1,...,xn,yn],[ ]], where each [x1,y1,...,xn,yn] represents a polygon; a single clothing item may contain more than one polygon.
  - scale: a number, where 1 represents small scale, 2 represents modest scale and 3 represents large scale.
  - occlusion: a number, where 1 represents slight occlusion (including no occlusion), 2 represents medium occlusion and 3 represents heavy occlusion.
  - zoom_in: a number, where 1 represents no zoom-in, 2 represents medium zoom-in and 3 represents large zoom-in.
  - viewpoint: a number, where 1 represents no wear, 2 represents frontal viewpoint and 3 represents side or back viewpoint.
- item 2
  ...
- item n
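
As a rough illustration of the layout above, the following sketch flattens one annotation into a list of items and regroups the landmark list into (x, y, v) triplets. The key name `item1`, the helper name `parse_annotation`, and all sample values are assumptions made for this example; check them against the released annotation files before relying on them.

```python
import json


def parse_annotation(annotation: dict) -> list:
    """Flatten one DeepFashion2-style annotation into a list of items."""
    items = []
    for key, value in annotation.items():
        # Assumption: per-item entries are the keys that start with "item";
        # "source" and "pair_id" are image-level fields.
        if not key.startswith("item"):
            continue
        flat = value["landmarks"]
        # Regroup [x1, y1, v1, ..., xn, yn, vn] into (x, y, v) triplets.
        landmarks = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
        x1, y1, x2, y2 = value["bounding_box"]
        items.append({
            "key": key,
            "category_id": value["category_id"],
            "style": value["style"],
            # abs() keeps the size positive whichever corner comes first.
            "box_size": (abs(x2 - x1), abs(y2 - y1)),
            "landmarks": landmarks,
            # v == 2 marks a visible landmark.
            "num_visible": sum(1 for _, _, v in landmarks if v == 2),
            "num_polygons": len(value["segmentation"]),
        })
    return items


# Hypothetical annotation with made-up values, for illustration only.
example = json.loads("""
{
  "source": "user",
  "pair_id": 1,
  "item1": {
    "category_name": "trousers",
    "category_id": 8,
    "style": 1,
    "bounding_box": [10, 20, 110, 220],
    "landmarks": [15, 25, 2, 100, 25, 1, 0, 0, 0],
    "segmentation": [[10, 20, 110, 20, 110, 220, 10, 220]],
    "scale": 2, "occlusion": 1, "zoom_in": 1, "viewpoint": 2
  }
}
""")
parsed = parse_annotation(example)
```

In practice the dictionary would come from `json.load` on a file such as 000001.json rather than from an inline string.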

The definitions of landmarks and skeletons of the 13 categories are shown below. The numbers in the figure represent the order of landmark annotations of each category in the annotation file. A total of 294 landmarks covering 13 categories are defined.

Figure 2: Definitions of landmarks and skeletons.

In the validation set, we provide query image names in list_query.txt and gallery image names in list_gallery.txt. In the Commercial-Consumer Clothes Retrieval benchmark, during evaluation, each query clothing item with a style number greater than 0 has a corresponding ground truth gallery clothing item, which has the same style and pair_id as the query clothing item. A query clothing item may have more than one ground truth gallery clothing item.
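
The matching rule above can be sketched as a small filter. This is a minimal illustration, not the official evaluation code: the function name `ground_truth_gallery`, the flat per-item dictionaries, and the sample file names are all hypothetical.

```python
def ground_truth_gallery(query: dict, gallery: list) -> list:
    """Return the gallery items that count as ground truth for one query item."""
    # Query items with style 0 have no ground truth gallery item.
    if query["style"] <= 0:
        return []
    return [
        g for g in gallery
        if g["pair_id"] == query["pair_id"] and g["style"] == query["style"]
    ]


# Hypothetical items: each dict describes one clothing item in one image.
gallery_items = [
    {"image": "000010.jpg", "pair_id": 7, "style": 1},
    {"image": "000011.jpg", "pair_id": 7, "style": 2},
    {"image": "000012.jpg", "pair_id": 7, "style": 1},
    {"image": "000013.jpg", "pair_id": 8, "style": 1},
]
query_item = {"image": "000001.jpg", "pair_id": 7, "style": 1}

matches = ground_truth_gallery(query_item, gallery_items)
match_images = [m["image"] for m in matches]
```

Here the query matches two gallery items, which reflects the note that a query may have more than one ground truth item.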

Dataset Statistics

Table 1 shows the statistics of images and annotations in DeepFashion2.

Table 1: Statistics of DeepFashion2.

	Train	Validation	Test	Overall
images	390,884	33,669	67,342	491,895
bboxes	636,624	54,910	109,198	800,732
landmarks	636,624	54,910	109,198	800,732
masks	636,624	54,910	109,198	800,732
pairs	685,584	query: 12,550 gallery: 37,183	query: 24,402 gallery: 75,347	873,234

Figure 3 shows the statistics of different variations and the numbers of items of the 13 categories in DeepFashion2.

Figure 3: Statistics of DeepFashion2.

Benchmarks

Clothes Detection

This task detects clothes in an image by predicting a bounding box and a category label for each detected clothing item. The evaluation metrics are the bounding box average precisions ${AP}_{box}$, ${AP}_{box}^{IoU=0.50}$, and ${AP}_{box}^{IoU=0.75}$.

Table 2: Clothes detection on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.

	Scale			Occlusion			Zoom-in			Viewpoint			Overall
	small	moderate	large	slight	medium	heavy	no	medium	large	no wear	frontal	side or back
AP	0.604	0.700	0.660	0.712	0.654	0.372	0.695	0.629	0.466	0.624	0.681	0.641	0.667
AP50	0.780	0.851	0.768	0.844	0.810	0.531	0.848	0.755	0.563	0.713	0.832	0.796	0.814
AP75	0.717	0.809	0.744	0.812	0.768	0.433	0.806	0.718	0.525	0.688	0.791	0.744	0.773
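
The IoU thresholds in the detection metrics refer to the intersection over union between a predicted box and a ground truth box. The sketch below computes that quantity for two boxes in the [x1,y1,x2,y2] format; it is an illustrative helper (the name `box_iou` is an assumption), not the benchmark's evaluation code, and it sorts the coordinates so that it does not depend on which corner is listed first.

```python
def box_iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    # Sort each coordinate pair so the result is independent of corner order.
    ax_lo, ax_hi = sorted((box_a[0], box_a[2]))
    ay_lo, ay_hi = sorted((box_a[1], box_a[3]))
    bx_lo, bx_hi = sorted((box_b[0], box_b[2]))
    by_lo, by_hi = sorted((box_b[1], box_b[3]))

    # Overlap along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax_hi, bx_hi) - max(ax_lo, bx_lo))
    inter_h = max(0.0, min(ay_hi, by_hi) - max(ay_lo, by_lo))
    intersection = inter_w * inter_h

    area_a = (ax_hi - ax_lo) * (ay_hi - ay_lo)
    area_b = (bx_hi - bx_lo) * (by_hi - by_lo)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes that overlap on half of their width.
iou = box_iou([0, 0, 10, 10], [5, 0, 15, 10])
```

A detection would count as a match at the IoU=0.50 threshold only if this value is at least 0.5; the example above gives 50 / 150, so it would not.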

Landmark and Pose Estimation

This task aims to predict landmarks for each detected clothing item in each image. Similarly, we employ the evaluation metrics used by COCO for human pose estimation by calculating the average precision for keypoints, ${AP}_{pt}$, ${AP}_{pt}^{OKS=0.50}$, and ${AP}_{pt}^{OKS=0.75}$, where OKS indicates the object landmark similarity.

Table 3: Landmark estimation on different validation subsets, including scale, occlusion, zoom-in, and viewpoint. Each cell gives two results: evaluation on visible landmarks only, and evaluation on both visible and occluded landmarks, respectively.

	Scale			Occlusion			Zoom-in			Viewpoint			Overall
	small	moderate	large	slight	medium	heavy	no	medium	large	no wear	frontal	side or back
AP	0.587 / 0.497	0.687 / 0.607	0.599 / 0.555	0.669 / 0.643	0.631 / 0.530	0.398 / 0.248	0.688 / 0.616	0.559 / 0.489	0.375 / 0.319	0.527 / 0.510	0.677 / 0.596	0.536 / 0.456	0.641 / 0.563
AP50	0.780 / 0.764	0.854 / 0.839	0.782 / 0.774	0.851 / 0.847	0.813 / 0.799	0.534 / 0.479	0.855 / 0.848	0.757 / 0.744	0.571 / 0.549	0.724 / 0.716	0.846 / 0.832	0.748 / 0.727	0.820 / 0.805
AP75	0.671 / 0.551	0.779 / 0.703	0.678 / 0.625	0.760 / 0.739	0.718 / 0.600	0.440 / 0.236	0.786 / 0.714	0.633 / 0.537	0.390 / 0.307	0.571 / 0.550	0.771 / 0.684	0.610 / 0.506	0.728 / 0.641
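
For readers unfamiliar with the OKS thresholds used in the landmark metrics, the sketch below follows the COCO keypoint definition, where each labeled landmark contributes exp(-d^2 / (2 s^2 k^2)), with d the distance between prediction and ground truth, s^2 the object area, and k a per-landmark constant. The uniform value k=0.05 and the function name are placeholders chosen for this example; they are not the constants used by the DeepFashion2 benchmark.

```python
import math


def object_keypoint_similarity(pred, gt, area, k=0.05) -> float:
    """COCO-style similarity between predicted and ground truth landmarks.

    pred: list of (x, y) predictions.
    gt:   list of (x, y, v) ground truth landmarks, v as in the annotations.
    area: object area, used as the squared scale s^2.
    k:    per-landmark constant (hypothetical uniform value).
    """
    total = 0.0
    labeled = 0
    for (px, py), (gx, gy, v) in zip(pred, gt):
        if v == 0:
            # Landmarks that are not labeled do not enter the score.
            continue
        squared_distance = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-squared_distance / (2.0 * area * k ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0


# Hypothetical landmarks: the third ground truth point is not labeled.
gt_landmarks = [(10.0, 10.0, 2), (50.0, 10.0, 1), (0.0, 0.0, 0)]
perfect = object_keypoint_similarity(
    [(10.0, 10.0), (50.0, 10.0), (99.0, 99.0)], gt_landmarks, area=10000.0)
shifted = object_keypoint_similarity(
    [(13.0, 14.0), (50.0, 10.0), (99.0, 99.0)], gt_landmarks, area=10000.0)
```

A perfect prediction scores 1.0, and the score decays smoothly as predictions move away from the labeled landmarks, more slowly for larger objects.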

Figure 4 shows the results of landmark and pose estimation.

Figure 4: Results of landmark and pose estimation.

Clothes Segmentation

This task assigns a category label (including the background label) to each pixel in an item. The evaluation metric is the average precision computed over masks, including ${AP}_{mask}$, ${AP}_{mask}^{IoU=0.50}$, and ${AP}_{mask}^{IoU=0.75}$.

Table 4: Clothes Segmentation on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.

	Scale			Occlusion			Zoom-in			Viewpoint			Overall
	small	moderate	large	slight	medium	heavy	no	medium	large	no wear	frontal	side or back
AP	0.634	0.700	0.669	0.720	0.674	0.389	0.703	0.627	0.526	0.695	0.697	0.617	0.680
AP50	0.831	0.900	0.844	0.900	0.878	0.559	0.899	0.815	0.663	0.829	0.886	0.843	0.873
AP75	0.765	0.838	0.786	0.850	0.813	0.463	0.842	0.740	0.613	0.792	0.834	0.732	0.812

Figure 5 shows the results of clothes segmentation.

Figure 5: Results of clothes segmentation.

Consumer-to-Shop Clothes Retrieval

Given a detected item from a consumer-taken photo, this task aims to search the commercial images in the gallery for the items that correspond to this detected item. In this task, top-k retrieval accuracy is employed as the evaluation metric. We emphasize the retrieval performance while still considering the influence of the detector. If a clothing item fails to be detected, this query item is counted as missed.

Table 5: Consumer-to-Shop Clothes Retrieval on different subsets of some validation consumer-taken images. Each query item in these images has over 5 identical clothing items in validation commercial images. Each cell gives two results: evaluation on the ground truth box and evaluation on the detected box, respectively. The evaluation metric is top-20 accuracy, except in the Overall columns, which report top-1, top-10, and top-20 accuracy.

	Scale			Occlusion			Zoom-in			Viewpoint			Overall
	small	moderate	large	slight	medium	heavy	no	medium	large	no wear	frontal	side or back	top-1	top-10	top-20
class	0.513 / 0.445	0.619 / 0.558	0.547 / 0.515	0.580 / 0.542	0.556 / 0.514	0.503 / 0.361	0.608 / 0.557	0.557 / 0.514	0.441 / 0.409	0.555 / 0.508	0.580 / 0.529	0.533 / 0.519	0.122 / 0.104	0.363 / 0.321	0.464 / 0.417
pose	0.695 / 0.619	0.775 / 0.695	0.729 / 0.688	0.752 / 0.704	0.729 / 0.668	0.698 / 0.559	0.769 / 0.700	0.742 / 0.693	0.618 / 0.572	0.725 / 0.682	0.755 / 0.690	0.705 / 0.654	0.255 / 0.234	0.555 / 0.495	0.647 / 0.589
mask	0.641 / 0.584	0.705 / 0.656	0.663 / 0.632	0.688 / 0.657	0.656 / 0.619	0.645 / 0.512	0.708 / 0.663	0.670 / 0.630	0.556 / 0.541	0.650 / 0.628	0.690 / 0.645	0.653 / 0.602	0.187 / 0.175	0.471 / 0.421	0.573 / 0.529
pose+class	0.752 / 0.691	0.786 / 0.730	0.733 / 0.705	0.754 / 0.725	0.750 / 0.706	0.728 / 0.605	0.789 / 0.746	0.750 / 0.709	0.620 / 0.582	0.726 / 0.699	0.771 / 0.723	0.719 / 0.684	0.268 / 0.244	0.574 / 0.522	0.665 / 0.617
mask+class	0.679 / 0.623	0.738 / 0.696	0.685 / 0.661	0.711 / 0.685	0.695 / 0.659	0.651 / 0.568	0.742 / 0.708	0.699 / 0.667	0.569 / 0.566	0.677 / 0.659	0.719 / 0.676	0.678 / 0.657	0.214 / 0.200	0.510 / 0.463	0.607 / 0.564
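
Top-k retrieval accuracy, as used in this benchmark, can be sketched as follows. The sketch assumes the common definition in which a query is correct when at least one ground truth gallery item appears among its top k results, and it represents a query whose item was not detected with `None` so that it stays in the denominator as a miss. The function name and data layout are hypothetical and are not taken from the official evaluation code.

```python
def top_k_accuracy(queries, k: int) -> float:
    """Fraction of queries with a ground truth item among the top k results.

    queries: list of (ranked_gallery_ids, ground_truth_ids) pairs, where
             ranked_gallery_ids is None if the query item was not detected.
    """
    if not queries:
        return 0.0
    hits = 0
    for ranked, truth in queries:
        if ranked is None:
            # Undetected query item: counted as missed, never as a hit.
            continue
        if any(gallery_id in truth for gallery_id in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Hypothetical rankings for four query items.
example_queries = [
    (["a", "b", "c"], {"c"}),   # ground truth found at rank 3
    (["d", "e", "f"], {"x"}),   # ground truth never retrieved
    (None, {"a"}),              # detector missed the item
    (["q", "r", "s"], {"q"}),   # ground truth found at rank 1
]
top1 = top_k_accuracy(example_queries, k=1)
top3 = top_k_accuracy(example_queries, k=3)
```

Keeping undetected queries in the denominator is what lets the detected-box results fall below the ground-truth-box results for the same retrieval model.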

Figure 6 shows queries with their top-5 retrieved clothing items. The first and seventh columns show the customer images with bounding boxes predicted by the detection module, and the second to sixth columns and the eighth to twelfth columns show the retrieval results from the store.

Figure 6: Results of clothes retrieval.

Citation

If you use the DeepFashion2 dataset in your work, please cite it as:

@article{DeepFashion2,
  author = {Yuying Ge and Ruimao Zhang and Lingyun Wu and Xiaogang Wang and Xiaoou Tang and Ping Luo},
  title={A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images},
}

Please note that the article is in submission at present.
