StyleNet: Non Reproducible Results
tusharkr opened this issue · 15 comments
I would like to state categorically that the paper "StyleNet: Generating Attractive Visual Captions with Styles" from Microsoft is non-reproducible. This conclusion comes not just from the code in this repo: our own extensive experiments have led us to believe that this paper is a work of fiction. We have also contacted the lead authors, Chuang Gan and Zhe Gan. However, we did not get any reasonable explanation of why this architecture does not work. It is unfortunate that this paper also has a significant number of citations. At this point, how it was accepted at CVPR remains a big question.
Also, the new dataset mentioned in the paper is not available as a whole. Only part of it is available, which makes this task even more questionable.
Overall, I would request readers stumbling across this not to waste their time reproducing this paper!!
Thank you for telling us. I had just decided to base my work on StyleNet and try to reproduce it. You have saved me time.
Thanks for the reminder. But could you explain why you think the dataset is not available? I took a look at the dataset and saw no problem with it. Maybe you think it is difficult to distinguish the romantic captions from the humorous ones?
The dataset should contain 10k images according to the paper. However, in reality only 7k images are available. We confronted the author about this, and he did not give any specific reason why the remaining 3k are missing. Moreover, there are 3 captions per image for the neutral style, whereas there is only 1 caption per image for the humorous style. This makes the training impossible. This is why I have categorically stated that this paper is just a work of fiction.
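For anyone who wants to check these counts on their own copy of the data, here is a minimal sketch. The function name `caption_stats`, the dictionary layout, and the toy entries are all hypothetical and are not the actual FlickrStyle10K file format; you would need to load the real files into this shape yourself.

```python
from collections import Counter


def caption_stats(captions_by_style):
    """Summarize image and caption counts per style.

    captions_by_style: {style: [(image_id, caption), ...]}
    (a hypothetical in-memory layout, not the dataset's file format).
    """
    stats = {}
    for style, pairs in captions_by_style.items():
        # Count how many captions each image has within this style.
        per_image = Counter(image_id for image_id, _ in pairs)
        stats[style] = {
            "images": len(per_image),
            "captions": len(pairs),
            "min_per_image": min(per_image.values()) if per_image else 0,
            "max_per_image": max(per_image.values()) if per_image else 0,
        }
    return stats


# Toy data only, to show the shape of the check; not real dataset entries.
toy = {
    "neutral": [
        ("img1", "a dog runs"),
        ("img1", "a dog on grass"),
        ("img1", "a dog running outside"),
    ],
    "humorous": [("img1", "a dog runs to find a bone")],
}
stats = caption_stats(toy)
```

Comparing `images` and `max_per_image` across styles shows directly whether the image count and the captions-per-image ratio match what the paper describes.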
I got it, thank you.
You are welcome. I have spent close to 6 months trying to reproduce this paper. After I asked a couple of confrontational questions, the authors stopped responding. I would suggest not wasting your time on this or any similar paper written by the first author of this paper.
First of all, I want to point out that this repo is not the official repo. Actually, there is a lot of work following this paper that addresses the limited paired stylized data through unpaired training.
Two points:
Firstly, it does not matter whether this is the official repo or not: technically, the paper is non-reproducible and the architecture simply does not work. Since this link is where most researchers end up (in fact, I have an email in which the second author himself asked me to try this repo), it is good to tell them in advance that not only this repo but the paper itself is non-reproducible.
Secondly, the fact that others were inspired by this design (or that other papers refer to this paper) does not guarantee the reproducibility of this paper.
Good to know that you are trying to reproduce it.
However, on our side, we fixed all the bugs in this repo. We also wrote the code from scratch by reading the paper. In the end, we wasted 7 months trying all possible combinations, but we could not reproduce even a partial result. That is why I am stating that this paper is just fiction. It is an insult to the CVPR tradition. I still wonder how the authors were able to convince the CVPR reviewers.
Wait... @Doragd, so you think the FlickrStyle10K (in fact, 7K) dataset is feasible for stylized image captioning, but the results in StyleNet are exaggerated?
And by the way, what is the result in your picture? I have read MSCap, but there is no similar result there.
@njucckevin First of all, 7k data is somewhat feasible for training a stylized image captioning model, but in my opinion StyleNet, which depends only on four stylized parameter matrices, cannot learn to express style, especially given its strange training method. My result is rather rough, and I will refine it soon. Feel free to contact me to obtain my refined version.
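To make the point about the four style matrices concrete, here is a rough parameter-count sketch. It assumes the factored-LSTM form as I understand it from the StyleNet paper: each of the four gate input matrices is factored as U S V, only S is style-specific, and U, V, the recurrent weights, and the biases are shared. The function name and all dimensions below are illustrative placeholders, not values taken from the paper or from this repo.

```python
def factored_lstm_param_counts(embed_dim, hidden_dim, factor_dim, num_gates=4):
    """Count style-specific vs. shared parameters in a factored LSTM.

    Assumption (my reading of the paper): per gate, the input weight is
    U (hidden x factor) @ S (factor x factor) @ V (factor x embed),
    and only S changes with the style.
    """
    # Shared input factors U and V, one pair per gate.
    shared_input = num_gates * (hidden_dim * factor_dim + factor_dim * embed_dim)
    # Style-specific S matrices, one per gate.
    style_specific = num_gates * factor_dim * factor_dim
    # Shared recurrent weights and biases, one set per gate.
    shared_recurrent = num_gates * (hidden_dim * hidden_dim + hidden_dim)
    shared = shared_input + shared_recurrent
    return {
        "style_specific": style_specific,
        "shared": shared,
        "style_fraction": style_specific / (style_specific + shared),
    }


# Placeholder dimensions chosen only for illustration.
counts = factored_lstm_param_counts(embed_dim=300, hidden_dim=512, factor_dim=512)
```

This only counts parameters. It says nothing by itself about whether the style matrices can or cannot learn to express style.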
I got it. Thanks~