jacobandreas/geca

Documentation for applying to new dataset

tomhosking opened this issue · 4 comments

Hi, I'm interested in applying GECA to a new dataset - could you provide some brief documentation or examples on how I might augment an arbitrary list of utterances using your implementation? Thanks!

Hi Jacob,

I've tried to put together a minimum working example to then export to my own project + framework, but I'm finding it difficult. With compute adjacency set to True (I think otherwise it doesn't actually do anything?), I tried the following:

from data.builder import OneShotDataset

train_data = [
    ((), tuple('red lorry'.split())),
    ((), tuple('red car'.split())),
    ((), tuple('yellow lorry'.split())),
]

ds = OneShotDataset(train_data, [], [])

print(ds.multiplicity)

defaultdict(<function OneShotDataset._compute_adjacency.<locals>.<lambda> at 0x7f1604c318c8>, {(1, 6, 5, 8, 2): 2, (1, 6, 5, 9, 2): 1, (1, 6, 7, 5, 2): 2, (1, 6, 10, 5, 2): 1, (1, 6, 5, 2): 3, 1: 0, 2: 0, 5: 0, 6: 0, 8: 0, 9: 0, 7: 0, 10: 0})

print(ds.templ_to_templ)

defaultdict(<class 'set'>, {1: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 2: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 5: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 6: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 8: {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)}, 9: {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)}, 7: {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)}, 10: {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)}})

print(ds.comp_pairs)

[]

Iterating through ds.sample_comp_train() then throws an error, since comp_pairs is empty.

My understanding is that this should at the very least add 'yellow car' to the dataset?
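To make that expectation concrete, here is a rough sketch of the kind of substitution I have in mind. It is not the GECA implementation: it only handles single-token fragments, and the `augment` helper and its "shared environment" rule are my own simplification for illustration.

```python
from collections import defaultdict


def augment(sentences):
    """Simplified single-token fragment swap (illustrative, not GECA itself)."""
    # Map each token to the set of (prefix, suffix) environments it occurs in.
    envs = defaultdict(set)
    for sent in sentences:
        for i, tok in enumerate(sent):
            envs[tok].add((sent[:i], sent[i + 1:]))

    seen = set(sentences)
    new = set()
    for a in envs:
        for b in envs:
            # Only swap tokens that share at least one environment.
            if a == b or not (envs[a] & envs[b]):
                continue
            # Put b into every environment where a was observed.
            for prefix, suffix in envs[a]:
                candidate = prefix + (b,) + suffix
                if candidate not in seen:
                    new.add(candidate)
    return sorted(new)


train = [tuple(s.split()) for s in ["red lorry", "red car", "yellow lorry"]]
print(augment(train))  # [('yellow', 'car')]
```

Here 'red' and 'yellow' both appear before 'lorry', so 'yellow' is substituted into the other environment of 'red', which gives 'yellow car'.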

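If I understand these lines correctly, comp_pairs will never get populated, since the keys in templ_to_templ are a different type of structure from the keys in multiplicity. The templ_to_templ keys are single integer ids, while the non-zero entries in multiplicity are keyed by tuples, so every lookup of multiplicity[templ1] returns the default of 0: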

        comp_pairs = []
        for templ1 in self.templ_to_templ:
            if self.multiplicity[templ1] <= 1:
                continue
            for templ2 in self.templ_to_templ[templ1]:
                comp_pairs.append((templ1, templ2))
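The behaviour can be reproduced without the library by copying the values printed above into plain dictionaries. This is only a sketch: it assumes the default factory of multiplicity returns 0 (inferred from the zero entries in the printed output), and it replays the loop on copied data rather than running the GECA code itself.

```python
from collections import defaultdict

# Tuple-keyed counts copied from the printed ds.multiplicity.
# Assumption: the default factory returns 0.
multiplicity = defaultdict(lambda: 0, {
    (1, 6, 5, 8, 2): 2,
    (1, 6, 5, 9, 2): 1,
    (1, 6, 7, 5, 2): 2,
    (1, 6, 10, 5, 2): 1,
    (1, 6, 5, 2): 3,
})

# Integer-keyed sets copied from the printed ds.templ_to_templ.
all_templs = set(multiplicity)
suffix_pair = {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)}
prefix_pair = {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)}
templ_to_templ = defaultdict(set, {
    1: set(all_templs), 2: set(all_templs),
    5: set(all_templs), 6: set(all_templs),
    8: set(suffix_pair), 9: set(suffix_pair),
    7: set(prefix_pair), 10: set(prefix_pair),
})

comp_pairs = []
for templ1 in templ_to_templ:
    # templ1 is an int, so this lookup misses every tuple key and yields 0.
    if multiplicity[templ1] <= 1:
        continue
    for templ2 in templ_to_templ[templ1]:
        comp_pairs.append((templ1, templ2))

print(comp_pairs)  # []

# Each missed lookup also inserted the int key with value 0.
int_keys = sorted(k for k in multiplicity if isinstance(k, int))
print(int_keys)  # [1, 2, 5, 6, 7, 8, 9, 10]
```

The second print would also explain the `1: 0, 2: 0, ...` entries in the multiplicity output above: a defaultdict inserts the missing key on lookup, so those zeros are likely a side effect of this loop rather than real counts.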

A standalone MWE would be really helpful for using GECA in other research!

Thanks

This is extremely late, but there's now a minimal example under data/colors.py. Hope you got it working!