Underspecified age filtering behavior
Closed this issue · 2 comments
Elika Bergelson asked me and Charlotte Moore for an estimate of how many child-produced tokens "before 36 months" are present in CHILDES.
Charlotte ran
childesr::get_tokens(collection = "Eng-NA", role_exclude = "Target_Child", token = "*", age = c(0, 37))
to get 3,363,486 tokens from 2018.1
I multiplied 36 * 30.5 to get 1098 days and used the SQL query
select count(id) from token where collection_name = "Eng-NA" and target_child_age <= 1098 and speaker_code != "CHI"
to get 3,259,979 tokens from 2018.1.
There are two things here:
- The age filter may be affected by the bug (or deviation from expectations) that Jess found: the age filter returns tokens belonging to children who produce tokens in that age range, not necessarily only tokens that are themselves in that age range. @JMankewitz @mikabr is this true?
- The documentation says that the age range is (inclusive, exclusive), but it is unclear whether c(0, 37) means 0-36 or 0-36.999999 months. @mikabr
The main cause of the difference here is that the role param in get_tokens filters on speaker_role, not speaker_code. childesr also uses a different months -> days conversion than you did here.
Line 12 in d47235c
select count(id) from token where collection_name = "Eng-NA" and target_child_age <= 1126.164 and speaker_role != "Target_Child"
returns 3,363,486 tokens from 2018.1
childesr::get_tokens(collection = "Eng-NA", role_exclude = "Target_Child", token = "*", age = c(0, 37))
also returns 3,363,486 tokens.
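To make the two sources of disagreement concrete, here is a minimal base R sketch on a toy table. The rows are invented for illustration (including one hypothetical speaker coded CHI whose role is not Target_Child), so the counts say nothing about the real database; only the two cutoffs and the column names come from this thread.

```r
# Toy token table; every row here is invented for illustration only.
tokens <- data.frame(
  speaker_code     = c("CHI", "MOT", "CHI", "FAT", "CHI"),
  speaker_role     = c("Target_Child", "Mother", "Child", "Father", "Target_Child"),
  target_child_age = c(400, 1098, 1100, 1120, 1126),
  stringsAsFactors = FALSE
)

# Cutoffs quoted in this thread.
cutoff_manual   <- 36 * 30.5       # 1098 days
cutoff_childesr <- 37 * 30.43688   # about 1126.164 days

# Filter from the original SQL query: speaker_code and the 1098-day cutoff.
n_by_code <- sum(tokens$speaker_code != "CHI" &
                 tokens$target_child_age <= cutoff_manual)

# Filter matching the get_tokens call: speaker_role and the childesr cutoff.
n_by_role <- sum(tokens$speaker_role != "Target_Child" &
                 tokens$target_child_age <= cutoff_childesr)

print(c(by_code = n_by_code, by_role = n_by_role))
```

On this toy data the role-based filter keeps more rows, both because it does not drop the hypothetical CHI speaker with a non-target role and because its age cutoff is about 28 days later.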
However, to answer the questions above:
- The behavior of the age param in get_tokens is different from the age param in get_speaker_statistics and get_participants. get_tokens should return the tokens where age 1 <= target_child_age <= age 2
Line 483 in d47235c
This means that childesr::get_tokens(collection = "Eng-NA", role_exclude = "Target_Child", token = "*", age = c(0, 37))
should return tokens where target_child_age is >= 0 and <= 37 * 30.43688 (1126.164 days).
We've changed the documentation on age to say inclusive.
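For anyone checking their own numbers, the inclusive bounds described above can be sketched in base R as follows. The helper in_age_range is hypothetical and not part of childesr; it only assumes the 30.43688 days-per-month constant quoted in this thread and inclusive comparison on both ends.

```r
# Days-per-month constant quoted in this thread.
days_per_month <- 30.43688

# Hypothetical helper (not a childesr function): TRUE where an age in days
# falls inside the inclusive range given in months as c(lower, upper).
in_age_range <- function(age_days, age_months) {
  lower <- age_months[1] * days_per_month
  upper <- age_months[2] * days_per_month
  age_days >= lower & age_days <= upper
}

# Made-up ages in days, chosen to sit on either side of the upper bound.
ages <- c(0, 1098, 1126.164, 1126.17, 1200)
in_range <- in_age_range(ages, c(0, 37))
print(in_range)
```

With age = c(0, 37) the upper bound is 37 * 30.43688 days, so an age of 1126.164 days is kept and 1126.17 days is not.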