/DeepFashion2

DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf

Primary LanguageJupyter Notebook

DeepFashion2 Dataset

image

DeepFashion2 is a comprehensive fashion dataset. It contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers. It totally has 801K clothing clothing items, where each item in an image is labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks and per-pixel mask.There are also 873K Commercial-Consumer clothes pairs.
The dataset is split into a training set (391K images), a validation set (34k images), and a test set (67k images).
Examples of DeepFashion2 are shown in Figure 1.

Figure 1: Examples of DeepFashion2.

image From (1) to (4), each row represents clothes images with different variations. At each row, we partition the images into two groups, the left three columns represent clothes from commercial stores, while the right three columns are from customers.In each group, the three images indicate three levels of difficulty with respect to the corresponding variation.Furthermore, at each row, the items in these two groups of images are from the same clothing identity but from two different domains, that is, commercial and customer.The items of the same identity may have different styles such as color and printing.Each item is annotated with landmarks and masks.

Announcements

  • xx-xx-xx Validation set of DeepFashion2 is released.

Download the Data

Below provides the links to validation set. We will soon release training set and test set.

  • Images(val_imags.zip)
  • Annotations(val_annos.zip)

Data Organization

Each image in seperate image set has a unique six-digit number such as 000001.jpg. A corresponding annotation file in json format is provided in annotation set such as 000001.json.
Each annotation file is organized as below:

  • source: a string, where 'shop' indicates that the image is from commercial store while 'user' indicates that the image is taken by users.
  • pair_id: a number. Images from the same shop and their corresponding consumer-taken images have the same pair id.
    • item 1
      • category_name: a string which indicates the category of the item.
      • category_id: a number which corresponds to the category name. In category_id, 1 represents short sleeve top, 2 represents long sleeve top, 3 represents short sleeve outwear, 4 represents long sleeve outwear, 5 represents vest, 6 represents sling, 7 represents shorts, 8 represents trousers, 9 represents skirt, 10 represents short sleeve dress, 11 represents long sleeve dress, 12 represents vest dress and 13 represents sling dress.
      • style: a number to distinguish between clothing items from images with the same pair id. Clothing items with different style numbers from images with the same pair id have different styles such as color, printing, and logo. In this way, a clothing item from shop images and a clothing item from user image are positive commercial-consumer pair if they have the same style number greater than 0 and they are from images with the same pair id.
      • bounding_box: [x1,y1,x2,y2],where x1 and y_1 represent the lower left point coordinate of bounding box, x_2 and y_2 represent the upper right point coordinate of bounding box.
      • landmarks: [x1,y1,v1,...,xn,yn,vn], where v represents the visibility: v=2 visible; v=1 occlusion; v=0 not labeled. We have different definitions of landmarks for different categories. The orders of landmark annotations are listed in figure 2.
      • segmentation: [[x1,y1,...xn,yn],[ ]], where [x1,y1,xn,yn] represents a polygon and a single clothing item may contain more than one polygon.
      • scale: a number, where 1 represents small scale, 2 represents modest scale and 3 represents large scale.
      • occlusion: a number, where 1 represents slight occlusion(including no occlusion), 2 represents medium occlusion and 3 represents heavy occlusion.
      • zoom_in: a number, where 1 represents no zoom-in, 2 represents medium zoom-in and 3 represents lagre zoom-in.
      • viewpoint: a number, where 1 represents no wear, 2 represents frontal viewpoint and 3 represents side or back viewpoint.
    • item 2
      ...
    • item n

The definition of landmarks and skeletons of 13 categories are shown below. The numbers in the figure represent the order of landmark annotations of each category in annotation file. A total of 294 landmarks covering 13 categories are defined.

Figure 2: Definitions of landmarks and skeletons.

image In validation set, we provide query image names in list_query.txt and gallery image names in list_gallery.txt. In Commercial-Consumer Clothes Retrieval benchmark, during evaluation, each query clothing item with style number greater than 0 has corresponding ground truth gallery clothing item, which has the same style and pair_id with the query clothing item. A query clothing item may have more than one ground truth gallery clothing item.

Dataset Statistics

Tabel 1 shows the statistics of images and annotations in DeepFashion2.

Table 1: Statistics of DeepFashion2.

Train Validation Test Overall
images 390,884 33,669 67,342 491,895
bboxes 636,624 54,910 109,198 800,732
landmarks 636,624 54,910 109,198 800,732
masks 636,624 54,910 109,198 800,732
pairs 685,584 query: 12,550
gallery: 37183
query: 24,402
gallery: 75,347
873,234

Figure 3 shows the statistics of different variations and the numbers of items of the 13 categories in DeepFashion2.

Figure 3: Statistics of DeepFashion2.

image

Benchmarks

Clothes Detection

This task detects clothes in an image by predicting bounding boxes and category labels to each detected clothing item. The evaluation metrics are the bounding box's average precision ,,.

Table 2: Clothes detection on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.

Scale Occlusion Zoom_in Viewpoint Overall
small moderate large slight medium heavy no medium large no wear frontal side or back
AP 0.604 0.700 0.660 0.712 0.654 0.372 0.695 0.629 0.466 0.624 0.681 0.641 0.667
AP50 0.780 0.851 0.768 0.844 0.810 0.531 0.848 0.755 0.563 0.713 0.832 0.796 0.814
AP75 0.717 0.809 0.744 0.812 0.768 0.433 0.806 0.718 0.525 0.688 0.791 0.744 0.773

Landmark and Pose Estimation

This task aims to predict landmarks for each detected clothing item in an each image.Similarly, we employ the evaluation metrics used by COCOfor human pose estimation by calculating the average precision for keypoints ,, where OKS indicates the object landmark similarity.

Table 3: Landmark Estimation on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.Results of evaluation on visible landmarks only and evaluation on both visible and occlusion landmarks are separately shown in each row

Scale Occlusion Zoom_in Viewpoint Overall
small moderate large slight medium heavy no medium large no wear frontal side or back
AP 0.587
0.497
0.687
0.607
0.599
0.555
0.669
0.643
0.631
0.530
0.398
0.248
0.688
0.616
0.559
0.489
0.375
0.319
0.527
0.510
0.677
0.596
0.536
0.456
0.641
0.563
AP50 0.780
0.764
0.854
0.839
0.782
0.774
0.851
0.847
0.813
0.799
0.534
0.479
0.855
0.848
0.757
0.744
0.571
0.549
0.724
0.716
0.846
0.832
0.748
0.727
0.820
0.805
AP75 0.671
0.551
0.779
0.703
0.678
0.625
0.760
0.739
0.718
0.600
0.440
0.236
0.786
0.714
0.633
0.537
0.390
0.307
0.571
0.550
0.771
0.684
0.610
0.506
0.728
0.641

Figure 4 shows the results of landmark and pose estimation.

Figure 4: Results of landmark and pose estimation.

image

Clothes Segmentation

This task assigns a category label (including background label) to each pixel in an item.The evaluation metrics is the average precision including ,, computed over masks.

Table 4: Clothes Segmentation on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.

Scale Occlusion Zoom_in Viewpoint Overall
small moderate large slight medium heavy no medium large no wear frontal side or back
AP 0.634 0.700 0.669 0.720 0.674 0.389 0.703 0.627 0.526 0.695 0.697 0.617 0.680
AP50 0.831 0.900 0.844 0.900 0.878 0.559 0.899 0.815 0.663 0.829 0.886 0.843 0.873
AP75 0.765 0.838 0.786 0.850 0.813 0.463 0.842 0.740 0.613 0.792 0.834 0.732 0.812

Figure 5 shows the results of clothes segmentation.

Figure 5: Results of clothes segmentation.

image

Consumer-to-Shop Clothes Retrieval

Given a detected item from a consumer-taken photo, this task aims to search the commercial images in the gallery for the items that are corresponding to this detected item. In this task, top-k retrieval accuracy is employed as the evaluation metric. We emphasize the retrieval performance while still consider the influence of detector. If a clothing item fails to be detected, this query item is counted as missed.

Table 5: Consumer-to-Shop Clothes Retrieval on different subsets of some validation consumer-taken images. Each query item in these images has over 5 identical clothing items in validation commercial images. Results of evaluation on ground truth box and detected box are separately shown in each row. The evaluation metrics are top-20 accuracy.

Scale Occlusion Zoom_in Viewpoint Overall
small moderate large slight medium heavy no medium large no wear frontal side or back top-1 top-10 top-20
class 0.513
0.445
0.619
0.558
0.547
0.515
0.580
0.542
0.556
0.514
0.503
0.361
0.608
0.557
0.557
0.514
0.441
0.409
0.555
0.508
0.580
0.529
0.533
0.519
0.122
0.104
0.363
0.321
0.464
0.417
pose 0.695
0.619
0.775
0.695
0.729
0.688
0.752
0.704
0.729
0.668
0.698
0.559
0.769
0.700
0.742
0.693
0.618
0.572
0.725
0.682
0.755
0.690
0.705
0.654
0.255
0.234
0.555
0.495
0.647
0.589
mask 0.641
0.584
0.705
0.656
0.663
0.632
0.688
0.657
0.656
0.619
0.645
0.512
0.708
0.663
0.670
0.630
0.556
0.541
0.650
0.628
0.690
0.645
0.653
0.602
0.187
0.175
0.471
0.421
0.573
0.529
pose+class 0.752
0.691
0.786
0.730
0.733
0.705
0.754
0.725
0.750
0.706
0.728
0.605
0.789
0.746
0.750
0.709
0.620
0.582
0.726
0.699
0.771
0.723
0.719
0.684
0.268
0.244
0.574
0.522
0.665
0.617
mask+class 0.679
0.623
0.738
0.696
0.685
0.661
0.711
0.685
0.695
0.659
0.651
0.568
0.742
0.708
0.699
0.667
0.569
0.566
0.677
0.659
0.719
0.676
0.678
0.657
0.214
0.200
0.510
0.463
0.607
0.564

Figure 6 shows queries with top-5 retrieved clothing items. The first and the seventh column are the images from the customers with bounding boxes predicted by detection module, and the second to the sixth columns and the eighth to the twelfth columns show the retrieval results from the store.

Figure 6: Results of clothes retrieval.

image

Citation

If you use the DeepFashion2 dataset in your work, please cite it as:

@article{DeepFashion2,
  author = {Yuying Ge and Ruimao Zhang and Lingyun Wu and Xiaogang Wang and Xiaoou Tang and Ping Luo},
  title={A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images},
}

Please note that the article is in submission at present.