Documentation for applying to new dataset

Question

tomhosking opened this issue 4 years ago · 4 comments

Hi, I'm interested in applying GECA to a new dataset - could you provide some brief documentation or examples on how I might augment an arbitrary list of utterances using your implementation? Thanks!

Answer 1 · 2020-08-21T16:44:26.000Z

Hi Tom, Apologies for taking nearly a month to get to this! I've updated the README with slightly more detailed instructions; let me know if you still have questions. J

Answer 2 · 2020-09-10T15:53:02.000Z

Hi Jacob,

I've tried to put together a minimum working example to then export to my own project + framework, but I'm finding it difficult. With compute adjacency set to True (I think otherwise it doesn't actually do anything?), I tried the following:

from data.builder import OneShotDataset

train_data = [
    ((), tuple('red lorry'.split())),
    ((), tuple('red car'.split())),
    ((), tuple('yellow lorry'.split())),
]

ds = OneShotDataset(train_data, [], [])

print(ds.multiplicity)

defaultdict(<function OneShotDataset._compute_adjacency.<locals>.<lambda> at 0x7f1604c318c8>, {(1, 6, 5, 8, 2): 2, (1, 6, 5, 9, 2): 1, (1, 6, 7, 5, 2): 2, (1, 6, 10, 5, 2): 1, (1, 6, 5, 2): 3, 1: 0, 2: 0, 5: 0, 6: 0, 8: 0, 9: 0, 7: 0, 10: 0})

print(ds.templ_to_templ)

defaultdict(<class 'set'>, {1: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 2: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 5: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 6: {(1, 6, 10, 5, 2), (1, 6, 5, 2), (1, 6, 7, 5, 2), (1, 6, 5, 9, 2), (1, 6, 5, 8, 2)}, 8: {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)}, 9: {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)}, 7: {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)}, 10: {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)}})

print(ds.comp_pairs)

[]

Iterating through ds.sample_comp_train() then throws an error, since comp_pairs is empty.

My understanding is that this should at the very least add 'yellow car' to the dataset?
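To make that expectation concrete, here is a minimal standalone sketch of the substitution idea, restricted to single-token fragments. It does not use the repository's code, and the function name `augment` and the (prefix, suffix) representation of an environment are assumptions made for illustration only: two tokens are treated as interchangeable if they occur in at least one common environment, and each is then substituted into the other's remaining environments.

```python
from collections import defaultdict


def augment(utterances):
    """Toy single-token augmentation; NOT the GECA repository's implementation."""
    seen = {tuple(u.split()) for u in utterances}

    # Map each environment (the tokens before and after one removed token)
    # to the set of tokens observed filling that gap.
    env_to_frags = defaultdict(set)
    for toks in seen:
        for i, tok in enumerate(toks):
            env = (toks[:i], toks[i + 1:])
            env_to_frags[env].add(tok)

    # Tokens that share at least one environment are treated as interchangeable.
    interchangeable = defaultdict(set)
    for frags in env_to_frags.values():
        for frag in frags:
            interchangeable[frag] |= frags - {frag}

    # Substitute each token's interchangeable partners into its environments,
    # keeping only utterances that were not already in the input.
    new = set()
    for (prefix, suffix), frags in env_to_frags.items():
        for frag in frags:
            for other in interchangeable[frag]:
                candidate = prefix + (other,) + suffix
                if candidate not in seen:
                    new.add(candidate)
    return sorted(" ".join(c) for c in new)


augmented = augment(["red lorry", "red car", "yellow lorry"])
print(augmented)
```

On the three training utterances above, 'lorry' and 'car' both follow 'red', so 'car' is substituted for 'lorry' after 'yellow', which yields the single new utterance 'yellow car'.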

If I understand these lines correctly, comp_pairs will never get populated, since the keys in templ_to_templ have a different structure from the keys in multiplicity (single ids versus tuples of ids), so every multiplicity lookup returns the default of 0:

comp_pairs = []
for templ1 in self.templ_to_templ:
    if self.multiplicity[templ1] <= 1:
        continue
    for templ2 in self.templ_to_templ[templ1]:
        comp_pairs.append((templ1, templ2))
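The mismatch can be reproduced without the repository. The sketch below copies a subset of the values from the printed output above and runs the same loop over them. One assumption: `multiplicity` is modelled here as `defaultdict(int)`; the real default factory is a lambda whose body is not shown, although the `1: 0, 2: 0, ...` entries in the printout suggest it returns 0.

```python
from collections import defaultdict

# Counts are stored under tuple keys (values copied from the printout above).
# ASSUMPTION: the default factory returns 0, modelled here with int.
multiplicity = defaultdict(int, {
    (1, 6, 5, 8, 2): 2,
    (1, 6, 5, 9, 2): 1,
    (1, 6, 7, 5, 2): 2,
    (1, 6, 10, 5, 2): 1,
    (1, 6, 5, 2): 3,
})

# The adjacency map is keyed by single ids, not tuples (subset of the printout).
templ_to_templ = defaultdict(set, {
    8: {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)},
    9: {(1, 6, 5, 8, 2), (1, 6, 5, 9, 2)},
    7: {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)},
    10: {(1, 6, 7, 5, 2), (1, 6, 10, 5, 2)},
})

comp_pairs = []
for templ1 in templ_to_templ:
    # templ1 is an int, so this lookup misses every tuple key, inserts a
    # fresh 0 entry into multiplicity, and the pair is always skipped.
    if multiplicity[templ1] <= 1:
        continue
    for templ2 in templ_to_templ[templ1]:
        comp_pairs.append((templ1, templ2))

print(comp_pairs)         # []
print(8 in multiplicity)  # True: the failed lookup created the key
```

The lookups also explain the zero-valued integer keys in the printed `multiplicity`: a `defaultdict` inserts the default value whenever a missing key is read with `[]`.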

A standalone MWE would be really helpful for using GECA in other research!

Thanks

Answer 3 · 2020-10-12T13:45:41.000Z

Thanks, try now? There was a bug in the case where the window size parameter was not set. If this works for you, feel free to submit a PR with the MWE!

Answer 4 · 2021-01-11T16:43:31.000Z

This is extremely late, but there's now a minimal example under data/colors.py. Hope you got it working!