Label dataset
marco-c opened this issue · 21 comments
The labeling can be performed using the label.py script.
This script shows you a pair of images and then waits for a key press:
- 'y' labels them as compatible;
- 'd' labels them as compatible with content differences (e.g. on a news site, two screenshots can be compatible even though they show different news, simply because the news shown depends on when the screenshot was taken and not on the browser);
- 'n' labels them as not compatible;
- 'RETURN' skips them (in case you are not sure yet);
- 'ESCAPE' terminates the current labeling session and stores the current results.
More details about the three-label system are in the documentation at https://github.com/marco-c/autowebcompat#labeling.
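As a rough illustration of the key handling described above, here is a minimal sketch. The names `KEY_ACTIONS` and `handle_key` are hypothetical and are not taken from label.py; the real script may store labels differently.

```python
# Hypothetical key-to-label mapping; not the actual label.py implementation.
KEY_ACTIONS = {
    'y': 'y',        # compatible
    'd': 'd',        # compatible, with content differences
    'n': 'n',        # not compatible
    'return': None,  # skip this pair without storing a label
}


def handle_key(key, image_name, labels):
    """Update `labels` in place; return False when the session should end."""
    if key == 'escape':
        return False  # caller is expected to store `labels` and stop
    if key not in KEY_ACTIONS:
        return True  # ignore keys that have no meaning here
    label = KEY_ACTIONS[key]
    if label is not None:
        labels[image_name] = label
    return True
```

Skipped pairs simply stay absent from the dictionary, so they can be shown again in a later session.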
@marco-c A CNN mostly learns patterns in the image (edges, corners, and their correlations). From example 2, it is evident that it will be difficult for a NN to learn such adversarial cases and classify the two screenshots as compatible.
To better separate Y+D from N, or even Y from D+N, I think we can focus more on finding ROIs (attention-based) and feeding those patches to the NN. This can be our fallback if nothing works well after the training approach you suggested.
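To make the "patches" idea concrete, here is a deliberately simplified sketch. It swaps the attention-based ROI detection for a plain pixel diff: it finds the bounding box of the pixels that differ between two equally sized images and crops that region. Images are assumed to be nested lists of pixel values, and `diff_bounding_box` and `crop` are hypothetical helpers, not part of the repository.

```python
def diff_bounding_box(img_a, img_b):
    """Return (left, top, right, bottom) around differing pixels, or None."""
    rows, cols = [], []
    for r, (row_a, row_b) in enumerate(zip(img_a, img_b)):
        for c, (px_a, px_b) in enumerate(zip(row_a, row_b)):
            if px_a != px_b:
                rows.append(r)
                cols.append(c)
    if not rows:
        return None  # images are identical
    # Right and bottom are exclusive, matching Python slice semantics.
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def crop(img, box):
    """Cut the patch described by `box` out of a nested-list image."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]
```

A learned attention mechanism would choose regions very differently; this only shows the shape of the data that would be fed to the network.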
At the beginning, I would start with screenshots based on equal page sources (same content), so only Y vs D+N. Furthermore, I would try to normalise the device settings to bring the page rendered in Firefox closer to the one rendered in Chrome. And maybe we could remove the system look-and-feel elements by injecting a small script before the screenshot is taken.
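The two groupings mentioned in this thread (Y vs D+N, and Y+D vs N) amount to collapsing the three labels into a binary target. A minimal sketch, assuming labels are held in a dictionary from image name to 'y', 'd' or 'n'; `to_binary` is a hypothetical helper, not an existing function in the repository:

```python
def to_binary(labels, positive):
    """Map each three-way label to 1 if it is in `positive`, else 0."""
    return {name: int(label in positive) for name, label in labels.items()}


labels = {'a.png': 'y', 'b.png': 'd', 'c.png': 'n'}

# Y vs D+N: only fully compatible pairs count as positive.
y_vs_dn = to_binary(labels, positive={'y'})

# Y+D vs N: content differences are still treated as compatible.
yd_vs_n = to_binary(labels, positive={'y', 'd'})
```

Keeping the grouping as a parameter makes it cheap to train on both variants and compare them.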
@marco-c I'd like to label parts of our dataset. How do you suggest I go about doing that? As far as I've seen, there is no script that merges labels from the label_persons directory into the actual labels directory.
@Shashi456 I think you are talking about generate_labels.py.
@sagarvijaygupta Oh, I thought it wasn't updated for the new files :P. Regardless, shouldn't we spend some time labeling the dataset? We may need it this summer.
> @marco-c I'd like to label parts of our dataset. How do you suggest I go about doing that? As far as I've seen, there is no script that merges labels from the label_persons directory into the actual labels directory.
The script hasn't been updated yet to deal with bounding boxes, but you can already start labeling and pushing your labels file to the repo. Then, once we have the script done, we will actually combine the labeling done by you and the labeling done by other persons.
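How the labels from different persons will be combined is not specified here. One plausible approach is a majority vote per image, dropping images on which the labelers tie. The sketch below assumes each person's labels are already loaded into a dictionary; `merge_labels` is a hypothetical name and this is not the behaviour of generate_labels.py.

```python
from collections import Counter


def merge_labels(per_person):
    """Merge {person: {image_name: label}} by strict majority vote."""
    votes = {}
    for labels in per_person.values():
        for name, label in labels.items():
            votes.setdefault(name, Counter())[label] += 1

    merged = {}
    for name, counter in votes.items():
        ranked = counter.most_common()
        # Keep the label only if it has strictly more votes than the runner-up.
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            merged[name] = ranked[0][0]
    return merged
```

Tied images are left out rather than guessed, so they can be reviewed again by hand.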
I am running label.py on my Mac, and I am finding that it is slow or unresponsive on non-y images. For instance, it takes a long time from when I try to draw a bounding box to when it shows up, and for the 'T', resizing arrow, and movement arrow to show up. Clicking on any of them causes everything to disappear until I release my mouse, plus a couple of seconds.
Is this a problem that anyone else has come up against?
It could be a Mac issue, I think nobody has tested it on a Mac yet. Could you try in a Linux VM?
@marco-c I am not having that problem on the Linux VM, so I can label a lot faster now. A couple of questions:
- Applying labels: suppose two images seem to differ only in the position the page has been scrolled to (e.g. image 1 looks like image 2, except that image 2 has been scrolled down and thus exposes more of the page content). Would these be considered compatible, not compatible, or compatible but different?
- Getting my labels into the main repo: Should I open a PR for a new branch off of my forked master that is the same as the upstream master, except that it includes my new labels?
Also, how would you label a pair of images when they show the same page except that one is in English and the other in Italian?
> Getting my labels into the main repo: Should I open a PR for a new branch off of my forked master that is the same as the upstream master, except that it includes my new labels?
Yes! You can open a PR that says "Add some labels from Shane Sims".
For the scroll one we have marked them as incompatible in screenshots
IIRC I've marked them as compatible, didn't I?
No, maybe not. They should be incompatible (e.g. if clicking on a button causes a scroll in one browser, it should cause a scroll in the other browser too).
Line 162 in b18eae0
And if this script works differently in the two browsers, should that also count as an incompatibility?
It shouldn't, but it's hard to tell whether it was this script that failed or something else.
Maybe we should just assume this always works.
Okay!
@marco-c @sagarvijaygupta While I was labeling the dataset, one recurring theme was that Chrome shows a scrollbar. Almost all image pairs with a scrollbar are very similar, but the scrollbar adds a shift that makes the overlay look incompatible.
Should we update the crawler options for Chrome to remove the scrollbar, or add guidance for this case to the labeling guide?
@Shashi456 it is already removed from the crawler.