ASER Lorenz curves by grade
Opened this issue · 47 comments
@lcrouch1952
Following up on our conversation on Friday, I generated Lorenz curves for each grade (0 to 12) for the reading-in-local-language variable (which I estimated using the ineqord command). There doesn't seem to be a lot of inequality there, just some variability between grades...
"CFul" stands for Cowell-Flachaire upward-looking status
"CFdl" - Cowell-Flachaire downward-looking status
https://www.dropbox.com/home/dukeInternInequalityOutputs/Mavzuna
@lcrouch1952
We have very slightly more interesting results for reading in English. Please see the graphs in the Dropbox:
https://www.dropbox.com/home/dukeInternInequalityOutputs/Mavzuna
@mavzunat - we probably don't have our Dropboxes mapped the same way you do, so when you send a link, it ought to be generated via the right-click / Share or right-click / Get Dropbox Link options. See image below for what I mean.
With the current link, @lcrouch1952 and I won't necessarily easily arrive at your intended destination. (Plus you could provide a link to the specific files themselves, if necessary.)
Ok, on the grades: why are we graphing by poorest, poor, etc.? That's making it difficult to see the differences between grades. Also, would it be possible to put all grades in one big file so that it is super easy to scroll up and down?
@mavzunat - also, you could just embed the images by dragging/dropping into this comment field, if that's easier. Here I've pulled in the Cowell-Flachaire upward-looking index results for grade 12 in...Urdu?
...and English?
My first gut reaction is that I wonder whether we've calculated something wrong. If I recall correctly, @lcrouch1952 was theorizing that inequality would diminish as you get into the upper grades... but to this extent? It looks like it has almost entirely disappeared!
So would a proper interpretation of these kinds of results be that, whatever the cross-income-quartile differences in learning outcomes, within quartiles learning outcomes are distributed almost perfectly along the line of equality? That feels improbable, although I suppose it's not impossible.
Same for the English. Can we have just one graph per grade (not poor, rich, etc.), with all graphs in one file?
@mavzunat @TSSlade I think we need to get away from the poorest, poor, middle stuff and look at all groups together... but ideally in one huge graph with all grades (and maybe one with all ages?)... Also, and I am stupid not to have had this insight before, my sense is that for an ordinal variable like this, the inequality should be calculable, at least approximately, by looking at the frequency distributions... So, if someone can calculate the frequency distributions of the five or six or whatever reading levels, by grade, and dump that into an Excel sheet, along with the "big graph" with all the grades one by one but no longer the poorest, poor stuff... then I could take a look and see whether one can have an intuition about what's going on by looking at those data? Or, if it would provide you with pleasure (!), you guys do it and show me the results...
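(For reference, one way those grade-by-grade relative frequencies might be produced in Stata is sketched below; reading_local and grade are the variables named in this thread, while the output filename is purely illustrative.)

```stata
* Column shares of each reading level within each grade, for a quick look:
tabulate reading_local grade, column nofreq

* The same shares as a dataset that can be dumped to Excel:
preserve
contract grade reading_local, freq(n)            // one row per grade x level
bysort grade: egen N_g = total(n)                // students per grade
gen share = n / N_g                              // relative frequency within grade
drop n N_g
reshape wide share, i(reading_local) j(grade)    // one share column per grade
export excel using "reading_distr_by_grade.xlsx", firstrow(variables) replace
restore
```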
@mavzunat @TSSlade Actually, Tim, looking at the discussion, it is possible that things would get very flat by grade 12. After all, we are talking about Grade 2 reading materials!! Here is another idea: do the Grade 2 calculations in the two states of India with the highest per capita GDP, and in the two lowest. Here are the GDP data by state: https://en.wikipedia.org/wiki/List_of_Indian_states_and_union_territories_by_GDP_per_capita
Then, also, consider doing it for the two states with the lowest and the two with the highest income Gini coefficients. Here are some income Gini estimates: https://www.indiastat.com/SOCIO_PDF/91/fulltext.pdf
Before I address everything else in this thread, just a quick clarification request: we are working on data from Pakistan, not India, right? I have been running all my analyses on data from Pakistan, which I retrieved from a link provided in the dukeintern_Analysis_SOW_2020... Word file. So, I am a little confused by this last request...
> Ok, on the grades: why are we graphing by poorest, poor, etc.? That's making it difficult to see the differences between grades. Also, would it be possible to put all grades in one big file so that it is super easy to scroll up and down?
I hope this is what you wanted to see. Please follow the link https://www.dropbox.com/sh/1kodjty5v63gw91/AADN4lOABkjymGM10lU7TrOqa?dl=0
@lcrouch1952
Or, alternatively, maybe this gives a better overview?
> Before I address everything else in this thread, just a quick clarification request: we are working on data from Pakistan, not India, right?
Correct. This is Pakistan data you should be working on. A lot of the core ASER work was done in India by the NGO Pratham, and @lcrouch1952 was recently facilitating a workshop with one of their leaders. I suspect he has India on the brain as a result. :-) Also, India's data is among the broader set of data we'd apply this work to once we've got the initial proof of concept worked out.
> I think we need to get away from the poorest, poor, middle stuff and look at all groups together... but ideally in one huge graph with all grades (and maybe one with all ages?)... Also, and I am stupid not to have had this insight before, my sense is that for an ordinal variable like this, the inequality should be calculable, at least approximately, by looking at the frequency distributions... So, if someone can calculate the frequency distributions of the five or six or whatever reading levels, by grade, and dump that into an Excel sheet, along with the "big graph" with all the grades one by one but no longer the poorest, poor stuff... then I could take a look and see whether one can have an intuition about what's going on by looking at those data? Or, if it would provide you with pleasure (!), you guys do it and show me the results...
@lcrouch1952 I think I understand what you are trying to do. If I understand correctly, the Cowell-Flachaire downward-looking status is already doing it. In the attached file I show the distribution of reading_local and the calculated CFdl and CFul variables.
ASER-mt-S01-V01-distributions.pdf
What CFdl does is assign a value to each student corresponding to where that student falls in the cumulative distribution of the reading_local levels. I provided the example of a student at level 1, "letters", of the reading_local variable, which corresponds to 35.5% of the cumulative distribution; CFdl assigns the value 0.355 (35.5%) to that student. CFul is way off.
@lcrouch1952 Please let me know if that's what you had in mind. If not, please elaborate further. Thank you!
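(For what it's worth, this understanding of the two statuses can be written out by hand as below: a sketch over the pooled sample, assuming reading_local is coded with 1 = lowest level. It mimics what ineqord reports; it is not ineqord's own code, and CFdl_hand / CFul_hand are illustrative names.)

```stata
* Hand-rolled Cowell-Flachaire statuses over the pooled sample.
gen long N = _N
bysort reading_local: gen long n_at = _N                // students at own level
sort reading_local
gen long pos = _n                                       // position once sorted by level
bysort reading_local (pos): gen long cum_le = pos[_N]   // students at or below own level
gen CFdl_hand = cum_le / N                              // downward-looking status
gen CFul_hand = (N - cum_le + n_at) / N                 // upward-looking: at or above
```

On this reading, a student at the "letters" level gets CFdl = 0.355 exactly when 35.5% of students read at or below that level, matching the example above.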
And, yes, that is precisely what I had in mind. But what I think I'd like to look at is simply the data in the first table in that PDF, for all grades, grade by grade, put into Excel. Is that possible? I'd like to fool around with it for a bit.
Good! But I am not able to export it out of Stata into an Excel document: too many observations... I could break it up into separate grades and send you each grade separately. Please let me know. But I would have to do it tomorrow. @lcrouch1952
@lcrouch1952 You are absolutely right! I did not think this through. Will post it here shortly.
@lcrouch1952
The table is a little longer than expected, as children within the same grade can be at different levels of reading. I hope that's what you had in mind.
ASER-mt-S01-V01-distr_noduplicates.xlsx
@TSSlade @lcrouch1952
As I am looking at these data, I see that each child is tested in both the local language and English, which, if I am not mistaken, was not the case in the Kenya study. Looking closer at the data, I see that a child can achieve drastically different levels of reading across the languages: for example, reading no words in English but sentences in the local language, which is a huge gap. I am thinking maybe we should use the average of the scores across the two languages, rather than assessing each language individually? As we are interested in measuring learning inequality overall, not in a specific language, right?
| grade | reading_local | reading_English | CFdllocal | CFullocal | CFdlEnglish | CFulEnglish |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Beginner/Nothing | Beginner/Nothing | 0.1807231 | 1.0000000 | 0.3409618 | 1.0000000 |
| 0 | letters | capital Letters | 0.3550061 | 0.819277 | 0.456943 | 0.659038 |
I think I can explain this part, which will also test my understanding of what's going on (a good exercise!). As I mentioned earlier, the same student can achieve different levels of reading across the languages. CFdllocal, CFullocal, CFdlEnglish, and CFulEnglish are calculated from the reading_local and reading_English variables, respectively, using the ineqord command. So, whenever a student is at the "Beginner/Nothing" level in English, the ineqord command assigns the student the number 0.3409618, which is the proportion of all students (across all grades) who achieve this level (and who tend to be in the lower grades anyway)... The reason it repeats so many times is that around 1/3 of the sample reads at the Beginner/Nothing level in English. As I am writing this, I see that maybe the correct way of doing it is calculating CFdllocal, CFullocal, CFdlEnglish, and CFulEnglish for each grade separately??? Is it so???
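(If it helps, the same hand calculation from the earlier sketch can be done within each grade rather than over the pooled sample; again all generated variable names are illustrative, shown here for the English variable.)

```stata
* Downward-looking status computed within each grade rather than pooled (sketch):
bysort grade: gen long N_g = _N
bysort grade (reading_English): gen long pos_g = _n
bysort grade reading_English (pos_g): gen long cum_le_g = pos_g[_N]
gen CFdlEnglish_bygrade = cum_le_g / N_g                // share at or below, within grade
```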
Local language

| | Grade 0 | Grade 1 | Grade 2 | Etc. |
| --- | --- | --- | --- | --- |
| Beginner/Nothing | 0.8 | 0.6 | 0.5 | Etc. |
| letters | 0.1 | 0.2 | 0.3 | Etc. |
| words | 0.1 | 0.1 | 0.2 | Etc. |
| sentences | 0 | 0.1 | 0 | Etc. |
| story | 0 | 0 | 0 | Etc. |
| total | 1 | 1 | 1 | |
English

| | Grade 0 | Grade 1 | Grade 2 | Etc. |
| --- | --- | --- | --- | --- |
| Beginner/Nothing | 0.9 | 0.8 | 0.7 | Etc. |
| letters | 0.05 | 0.1 | 0.1 | Etc. |
| words | 0.05 | 0.1 | 0.1 | Etc. |
| sentences | 0 | 0 | 0.1 | Etc. |
| story | 0 | 0 | 0 | Etc. |
| total | 1 | 1 | 1 | |
I think what you want me to do is transform the data from the "long" format into the "wide" format. Sorry, my mistake. I didn't quite think of it this way when you mentioned you expected a much smaller number of observations...
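(For the record, that long-to-wide step is a one-liner in Stata, assuming a long file with one row per language x reading level x grade and a relative-frequency column; all names here are illustrative.)

```stata
* From long (language, reading_level, grade, share) to one column per grade:
reshape wide share, i(language reading_level) j(grade)
```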
And note this particular input, bolded here: "Are these [numbers] in any way already modeled? If so what we need is the raw [relative frequencies], unmodeled in any way."
Got it!
@lcrouch1952 Is this what you had in mind?
ASER-mt-S01-V01-distr_both.xlsx
Please look at this one instead; I missed a column in the previous one. Both languages
> As I am writing this, I see that maybe the correct way of doing it is calculating CFdllocal, CFullocal, CFdlEnglish, and CFulEnglish for each grade separately??? Is it so???
Following up on my own comment here: I went ahead and recalculated the CFdl and CFul indices for each grade separately and generated new Lorenz curves. They don't look much different from what we had before...
@lcrouch1952 @TSSlade, I can't find the Excel file Louis is referring to...
@lcrouch1952 I received your comments left on June 26th, 10:42 am, with the good and bad news, but I still don't see an Excel file... the word "here" in your previous comment is not hyperlinked.
@lcrouch1952 - unfortunately, I believe the trade-off for the convenience of responding by email is that attachments don't get transmitted. Hyperlinks, however, should. (I think?) Did your response contain a hyperlink, or an attached document?
> The first is to apply the algorithm to ordinalized Kenya PRIMR data. I had set out some ideas for doing that in GitHub or e-mail or both.
I gather you meant applying the ineqord command to the ordinalized Kenya data. I did so for Tusome. Please see the Lorenz curves generated based on "CFdl", the Cowell-Flachaire downward-looking status. It doesn't look good...
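(For context, the "ordinalizing" step for a continuous fluency score can be sketched as below; "cwpm" is an assumed variable name and the cut-points are illustrative, not the ones actually used here.)

```stata
* Band a continuous fluency score into ASER-like ordered levels:
egen reading_ord = cut(cwpm), at(0 1 10 30 60 1000) icodes
replace reading_ord = reading_ord + 1    // recode to 1..5, with 1 = reads no words
```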
ASER-mt-S01-V01-distr_English + lcrouch simulations.xlsx
So this is the ASER file. What I did was to "cardinalize" the ordinal data, and that looks more like Kenya. And when you take Kenya and ordinalize it, it looks like ASER. So that's bad news for the ordinal approach UNLESS we find a defensible way to cardinalize ordinal data. But that's science. Some things don't work. Therefore you create reasonable approximations.
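(The exact mapping used for the "cardinalization" isn't shown in this thread; as a sketch, it could be something like assigning a representative fluency value to each level.)

```stata
* Illustrative cardinalization: these particular values are assumptions, not
* the mapping actually used for the graphs above.
recode reading_local (1 = 0) (2 = 5) (3 = 20) (4 = 45) (5 = 80), gen(reading_card)
```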
@TSSlade @lcrouch1952
How about we predict the probability that a student with certain characteristics falls into one of the 5 reading levels, via an "oprobit" model (a probability model for an ordinal response variable in Stata)? We fit the model and calculate the inequality based on the predicted values... Is it completely crazy???
I went ahead and tested this on the Tusome data. Attached you will see the original Lorenz curve and the curve generated from the predicted probabilities of the ordinalized Tusome data (midline), as a sort of robustness check...
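(A sketch of that oprobit step; the regressors here are placeholders, since the actual specification is not given in this thread.)

```stata
* Ordered probit for the 5 reading levels, then per-level predicted probabilities:
oprobit reading_local i.grade i.female i.wealth_quartile
predict p1 p2 p3 p4 p5, pr    // predicted probability of each of the 5 levels

* One possible way to collapse the probabilities into a single predicted score
* (an assumption, not necessarily what was done for the graphs here):
gen E_level = 1*p1 + 2*p2 + 3*p3 + 4*p4 + 5*p5
```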
> The second is for me to finally (!!) read more about how the algorithm works. At the risk of making you feel like I am a very old man who forgets everything, may I ask you to re-send me the PDF, or the URL, with the mathematical explanation of the algorithm? I can promise to do this over the weekend.
@lcrouch1952, my apologies, I completely missed this request. I will send you the log file containing the details behind how the Lorenz curves were generated, and an excerpt from the Stata manual with a description of the ineqord command. Is that what you are looking for? I don't have details on the mathematical derivations that go into ineqord, but I will look around.
@lcrouch1952 , the promised material:
CS20-mt-S02-V02-log_bygr.pdf
ineqord.pdf
@TSSlade @lcrouch1952 I have given it another try before throwing in the towel, and I think I may be onto something here.
First of all, note that the comparison in the following graphs is with the Midline curve on the original Lorenz graph. As we have already found, the inequality based on the ordinalized variables is much lower (a flatter curve) than that based on the original variable. The conclusion: the method is not robust to changes to the structure of the variable. If it were, we would see similar levels of inequality. Bad news! The reason appears to be that we cannot map the original variable perfectly, 1-to-1, to the ordinalized variable...
After further deliberation, mostly because I don't want to admit defeat by the data, I came to the realization that, yes, we cannot map the variables perfectly, but we can do exactly that for students who read 0 words!
So the problem is that we allowed the ineqord command to assign an actual status value to students who read zero words. We can perfectly identify those students in both the Kenya and ASER data.
In the attached graphs you will see a comparison of three Lorenz curves: one based on the original (continuous) variable; one based on the ordinalized variable turned continuous (CFdl_lang) by the ineqord command; and one based on the same ineqord output, but with the values of the CFdl_lang variable replaced with 0 for students at reading level 1 (Beginner/no words).
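(Concretely, the modification amounts to one replace; the variable names are illustrative.)

```stata
* Keep ineqord's status everywhere except for identified zero-word readers:
gen CFdl_lang_mod = CFdl_lang
replace CFdl_lang_mod = 0 if reading_level == 1   // level 1 = Beginner / reads no words
```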
English Language, Grade 1: Original, ordinalized, modified ordinalized.
English Language, Grade 2: Original, ordinalized, modified ordinalized.
Kiswahili, Grade 1: Original, ordinalized, modified ordinalized.
Please let me know what you think.