langcog/childesr

get types/tokens by transcript_id and target_child_id

Closed this issue · 5 comments

...would be useful. Indexing into the db with target_child_id is simpler than using child name + corpus name, since the id is unique (I think).

Resolution: target_child + corpus name is unique and stable across versions (more or less). In contrast, id fields are not stable across versions, because the db import generates new ids each time. So we are going to discourage users from relying on ids, since version upgrades will likely break code that does.
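A minimal sketch of why name-based keys survive a re-import while ids do not. The two data frames are toy stand-ins for the same children in two hypothetical db versions; the id values are invented for illustration, and only the column names (`corpus_name`, `target_child_name`, `target_child_id`) are meant to mirror the package's output.

```r
library(dplyr)

# Toy stand-ins for the same children in two db versions (ids are invented).
v1 <- tibble(
  corpus_name       = c("Brown", "Brown", "Providence"),
  target_child_name = c("Adam", "Eve", "Alex"),
  target_child_id   = c(101L, 102L, 103L)
)
v2 <- tibble(
  corpus_name       = c("Brown", "Brown", "Providence"),
  target_child_name = c("Adam", "Eve", "Alex"),
  target_child_id   = c(2001L, 2002L, 2003L)  # regenerated on import
)

# Joining on ids matches nothing once the ids have been regenerated.
by_id <- inner_join(v1, v2, by = "target_child_id")

# Joining on corpus name + child name recovers every child.
by_name <- inner_join(
  v1, v2,
  by = c("corpus_name", "target_child_name"),
  suffix = c("_v1", "_v2")
)

nrow(by_id)    # 0
nrow(by_name)  # 3
```

In real code the same idea would mean passing corpus and child names to the query functions and storing those names, rather than ids, in any saved analysis.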

I'm finding that not all transcripts have a target_child_name field. E.g.,

get_transcripts(corpus = "Braunwald") %>% count(target_child_name)

For this specific corpus it appears to be a parsing error in the Braunwald/0diary files, but there are other corpora where no Target_Child is marked. We should discuss further; thanks for reopening.
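To see how widespread the gap is, one option is to count missing names per corpus. This is only a sketch on invented rows: `transcripts` is a toy stand-in for what `get_transcripts()` returns, and it assumes that NULL database fields arrive in R as `NA`.

```r
library(dplyr)

# Toy stand-in for get_transcripts() output; all rows are invented.
transcripts <- tibble(
  transcript_id     = 1:5,
  corpus_name       = c("A", "A", "A", "B", "B"),
  target_child_name = c("Kim", NA, NA, "Lee", "Lee")
)

# Per corpus: how many transcripts exist, and how many lack a target child name.
missing_by_corpus <- transcripts %>%
  group_by(corpus_name) %>%
  summarise(
    n_transcripts = n(),
    n_missing     = sum(is.na(target_child_name))
  ) %>%
  filter(n_missing > 0)

missing_by_corpus
```

With the toy rows above, only corpus "A" is reported, with 2 of its 3 transcripts missing a name.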

So it turns out the updated CHILDES XML does seem to contain a name for every child (previously these fields could be missing, but it seems the child "code" is now used as the name when the name is blank). This will go into the new db version when fresh XML is pulled, and should resolve this issue.

Some corpora, like Garvey, still have more than one target child per transcript, so the target_child_* fields will be NULL for all utterances, tokens, etc. associated with these 2-child transcripts.

e.g. Garvey: https://childes.talkbank.org/browser/index.php?url=Eng-NA/Garvey/valabe.cha

get_transcripts(corpus = "Garvey") %>% select(target_child_id, target_child_age, target_child_name, target_child_sex)

Also, in some cases the "Target_Child" convention is not followed, e.g. this Spanish transcript: https://childes.talkbank.org/browser/index.php?url=Spanish/Hess/d12a1ex1.cha. The relevant target_child fields will also be NULL there.
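Code that groups by child therefore needs to decide what to do with rows that have no target child. A hedged sketch of one approach, dropping those rows before counting tokens and types per child and transcript: the `tokens` table is invented, its column names (`corpus_name`, `target_child_name`, `transcript_id`, `gloss`) are assumed to match the package's token output, and NULL fields are assumed to arrive as `NA`.

```r
library(dplyr)

# Toy stand-in for a token table; all rows are invented.
tokens <- tibble(
  transcript_id     = c(1L, 1L, 1L, 2L, 2L, 3L),
  corpus_name       = c("A", "A", "A", "A", "A", "B"),
  target_child_name = c("Kim", "Kim", "Kim", NA, NA, "Lee"),
  gloss             = c("more", "juice", "more", "hi", "there", "ball")
)

# Drop rows with no target child, then key on corpus + child name (not ids).
counts <- tokens %>%
  filter(!is.na(target_child_name)) %>%
  group_by(corpus_name, target_child_name, transcript_id) %>%
  summarise(
    n_tokens = n(),
    n_types  = n_distinct(gloss),
    .groups  = "drop"
  )

counts
```

Transcript 2 disappears from the result because it has no target child; whether to drop such transcripts or handle them separately (e.g. by speaker) depends on the analysis.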

The new db version now makes most corpus_name + child_name pairs unique, except for the cases where there is no target child.