Arabic-Learner-Corpus-Considerations

Anthony Verardi | a.verardi@pitt.edu | University of Pittsburgh

Project completed 4/24/2020

About the Project

This project explores the contents of the Arabic Learner Corpus (ALC) to assess how they might be applied to Second Language Acquisition/Teaching. The ALC is a collection of written and spoken texts collected from learners of Modern Standard Arabic (MSA) in Saudi Arabia, including both native speaker learners (learning MSA as a prestige variant) and non-native speaker learners. The XML files are also accompanied by metadata about each participant and each observation of their data.

Directory

Folders

Notebooks: Jupyter Notebooks that contain all of the coding and preliminary analysis done for this project
- ALC Data Organization: a Notebook containing my data (re)organization and cleaning process for the ALC dataset
  - ALC_Data_Organization (GitHub Style)
  - ALC_Data_Organization (Jupyter Style)
- ALC Data Analysis: a Notebook containing the actual analysis performed on my restructured version of the ALC
  - ALC_Data_Analysis (GitHub Style)
  - ALC_Data_Analysis (Jupyter Style)
- ALC Scrap Code: a "code graveyard" for ideas that didn't pan out and code that didn't work out quite right
  - ALC_Scrap_Code (GitHub Style)
  - ALC_Scrap_Code (Jupyter Style)
Presentation: a short presentation outlining the preliminary findings of this project, available as either a full PowerPoint presentation with voiceover or .pdf slides
Data: samples of the dataset used for this project, namely the first 1000 original XML files (GitHub won't allow me to upload > 1000 files). Note: none of the original XML files have been altered! The cleaning process was done entirely on imported data in my Organization Notebook, leaving the originals untouched.
Visualizations: image file copies of all visualizations created over the course of this project
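
As a rough illustration of the "import, then clean in memory" approach described for the Data folder, the sketch below reads XML files without ever writing to them. It is not the code from the Organization Notebook: the function name `load_xml_records` and the record fields are hypothetical, and it assumes nothing about the ALC's actual XML schema beyond the files being well-formed XML.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def load_xml_records(data_dir):
    """Parse every .xml file in data_dir into an in-memory record.

    Hypothetical helper: files are only read, never written, so the
    originals on disk stay untouched.
    """
    records = []
    for path in sorted(Path(data_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        records.append({
            "file": path.name,
            "root_tag": root.tag,
            # Join all text nodes; a schema-aware loader would pick out
            # specific elements instead of flattening everything.
            "text": " ".join(t.strip() for t in root.itertext() if t.strip()),
        })
    return records
```

The resulting list of dictionaries can be passed to `pandas.DataFrame` for cleaning, so any changes apply only to the imported copy.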

Files

.gitignore: a list of filetypes my repository is set to ignore on my local rig
final_report.md: the final report for this project containing full analysis and conclusions
LICENSE.md: the license under which this project has been made publicly available; you can find a quick overview of the license on this page
README.md: the document you are currently reading!
progress_report.md: markdown file documenting the development of this project
project_plan.md: markdown file containing the original and revised project plans for this work

Licensing

This project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). This license permits others to share (mirror) and adapt (borrow and alter) this work, provided that they credit the author and don't use it for commercial purposes.

Original corpus credit to:

Alfaifi, A., Atwell, E. and Hedaya, I. (2014). Arabic Learner Corpus (ALC) v2: A New Written and Spoken Corpus of Arabic Learners. In the proceedings of the Learner Corpus Studies in Asia and the World (LCSAW) 2014, 31 May - 01 Jun 2014. Kobe, Japan. http://www.arabiclearnercorpus.com.

Have a comment? Visit my guest book here!
