langcog/childesr

Underspecified age filtering behavior

Closed this issue · 2 comments

Elika Bergelson asked me and Charlotte Moore for an estimate of how many child-produced tokens "before 36 months" are present in CHILDES.

Charlotte ran
childesr::get_tokens(collection = "Eng-NA", role_exclude = "Target_Child", token = "*", age = c(0, 37))
to get 3,363,486 tokens from 2018.1

I multiplied 36 * 30.5 to get 1098 days and used the SQL query
select count(id) from token where collection_name = "Eng-NA" and target_child_age <= 1098 and speaker_code != "CHI"
to get 3,259,979 tokens from 2018.1

There are two things here:

  1. The age filter may be affected by the bug (or deviation from expectations) that Jess found: the age filter returns tokens belonging to children who produce tokens in that age range, not necessarily only the tokens that fall within that age range. @JMankewitz @mikabr is this true?

  2. The documentation says that the age range is (inclusive, exclusive), but it is unclear whether c(0, 37) means 0-36 or 0-36.999999. @mikabr

The main cause for the difference here is the role param in get_tokens filters on speaker_role, not speaker_code. childesr also uses a different conversion for months -> days than you did here.

avg_month <- 365.2425 / 12

select count(id) from token where collection_name = "Eng-NA" and target_child_age <= 1126.164 and speaker_role != "Target_Child" returns 3,363,486 tokens from 2018.1

childesr::get_tokens(collection = "Eng-NA", role_exclude = "Target_Child", token = "*", age = c(0, 37)) also returns 3,363,486 tokens.
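
The gap between the two cutoffs can be checked with a short sketch. It only reproduces the arithmetic quoted in this thread; the helper name months_to_days is made up for illustration and is not part of childesr.

```r
# Average month length quoted above for childesr
avg_month <- 365.2425 / 12

# Hypothetical helper (not a childesr function): convert months to days
months_to_days <- function(months, days_per_month = avg_month) {
  months * days_per_month
}

# Cutoff used in the hand-written SQL query: 36 months at 30.5 days each
manual_cutoff <- months_to_days(36, days_per_month = 30.5)

# Cutoff implied by age = c(0, 37) with the average month length
childesr_cutoff <- months_to_days(37)

print(manual_cutoff)                              # 1098
print(round(childesr_cutoff, 3))                  # 1126.164
print(round(childesr_cutoff - manual_cutoff, 3))  # 28.164
```

The two upper bounds differ by about 28 days, so part of the count difference comes from the cutoff alone, separately from the speaker_role vs. speaker_code filter.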

However, to answer the questions above:

  1. The behavior of the age param in get_tokens is different from the age param in get_speaker_statistics and get_participants. get_tokens should return the tokens where age 1 <= target_child_age <= age 2:

    content %<>% dplyr::filter(target_child_age >= days_1,
                               target_child_age <= days_2)

This means that childesr::get_tokens(collection = "Eng-NA", role_exclude = "Target_Child", token = "*", age = c(0, 37)) should return tokens where target_child_age is >= 0 and <= 37 * 30.43688 (1126.164 days).

  2. It seems like we should either change the language to be (inclusive, inclusive) or change the specification here to be target_child_age <= (days_2 - avg_month)? Thoughts @mikabr?

We've changed the documentation on age to say inclusive.
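
For anyone landing here later, a minimal sketch of what the inclusive upper bound means in practice. It uses base R on invented toy data rather than the real token table, and it assumes days_1 and days_2 are computed as the age bounds in months times avg_month, which matches the 37 * 30.43688 figure above but is not copied from the package source.

```r
avg_month <- 365.2425 / 12

# Toy data invented for illustration; not drawn from CHILDES
tokens <- data.frame(
  id = 1:5,
  target_child_age = c(0, 500, 36 * avg_month, 37 * avg_month, 1200)
)

# Assumed conversion of age = c(0, 37) from months to days
age <- c(0, 37)
days_1 <- age[1] * avg_month
days_2 <- age[2] * avg_month

# (inclusive, inclusive): the behavior discussed in this thread
inclusive <- tokens[tokens$target_child_age >= days_1 &
                      tokens$target_child_age <= days_2, ]

# (inclusive, exclusive): what the old wording of the docs suggested
exclusive <- tokens[tokens$target_child_age >= days_1 &
                      tokens$target_child_age < days_2, ]

print(inclusive$id)  # 1 2 3 4
print(exclusive$id)  # 1 2 3
```

Only the row sitting exactly on the upper bound differs between the two readings.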