In this repo I describe how we can do basic data cleaning using:
- Bash commands / shell scripting
- Python scripts
- The pandas library

My inspiration was a data cleaning process I implemented to build a data warehouse for data analysis and data mining.
Assume we have 20 CSV files that we want to merge into one. The file names follow the pattern example_1.csv, example_2.csv, and so on.

Create a new file with the first row (the column names) of our files:

```shell
head -1 example_1.csv > data.csv
```
Remove the first row from all the CSV files:

```shell
for i in example_{1..20}.csv; do sed -i '' -e 1d "$i"; done
```

(`sed -i ''` is the BSD/macOS form; with GNU sed, drop the empty string and use plain `sed -i`.)
Merge the files:

```shell
cat example_{1..20}.csv >> data.csv
```
Remove duplicates (identical rows):

```shell
cat data.csv > temp_data
sort -r temp_data | uniq > data.csv
```
Let's say we want to create a new table with columns 2, 3 and 5 of data.csv:

```shell
awk -F, 'BEGIN {OFS=","} {print $2,$3,$5}' data.csv > table1.csv
```
Let's say we want to create a new table with the unique values of column 4 of data.csv:

```shell
cut -d ',' -f 4 data.csv | sort -r | uniq > table2.csv
```
We have a table (table3.csv) and we want to check whether there are any duplicate ids (column 1).

To check for duplicates, compare the output of the two commands below; if the counts are the same, there are no duplicates:

```shell
cut -d ',' -f 1 table3.csv | sort | uniq | wc -l
cat table3.csv | sort | uniq | wc -l
```
If the counts differ, we use the command below to find the duplicates (each id is printed with its occurrence count, so the duplicates are the ids with a count greater than 1):

```shell
cut -d ',' -f 1 table3.csv | sort | uniq -c
```
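Since this repo also covers pandas, the merge-and-dedup steps above can be sketched in pandas as well. This is a minimal sketch, not code from the repo: the helper names are mine, and the file names follow the example_*.csv pattern assumed above.

```python
import glob

import pandas as pd

def merge_and_dedup(pattern="example_*.csv", out_path="data.csv"):
    """Merge every CSV matching pattern, drop identical rows, save to out_path."""
    frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
    data = pd.concat(frames, ignore_index=True).drop_duplicates()
    data.to_csv(out_path, index=False)
    return data

def duplicate_ids(data):
    """Return rows whose first-column id appears more than once (uniq -c analogue)."""
    first_col = data.columns[0]
    return data[data.duplicated(subset=first_col, keep=False)]
```

Unlike the shell pipeline, `pd.read_csv` keeps the header out of the data rows, so there is no separate header-handling step.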
- create-date-table.ipynb
- In our example we create a new table (dates.csv) from the bday column of data.csv. The columns of the new table are: id, month, year (we could also include a day column if we wanted to).
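  A minimal pandas sketch of what that notebook builds might look like the following, assuming the bday values parse as dates (e.g. YYYY-MM-DD); the function name and format are my assumptions, not the notebook's actual code.

```python
import pandas as pd

def make_date_table(data, bday_col="bday"):
    """Build an id/month/year table from the unique values of a birthday column.

    Assumes bday values parse as dates (e.g. YYYY-MM-DD); the real notebook
    may use a different format or column name.
    """
    dates = pd.to_datetime(data[bday_col].drop_duplicates()).reset_index(drop=True)
    return pd.DataFrame({
        "id": range(1, len(dates) + 1),  # surrogate key for each unique date
        "month": dates.dt.month,
        "year": dates.dt.year,
    })
```

  Adding a day column would just mean one more line: `"day": dates.dt.day`.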
- A Python script that replaces the values of a column with the matching id from another table.
- In our example we match each bday value of data.csv with a row of dates.csv and then update the bday column with the corresponding id.
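  A minimal pandas sketch of that matching step, assuming the dates table keeps the original date string in a `date` column next to its `id` (the column names and the helper are my assumptions, not the script's actual code):

```python
import pandas as pd

def replace_bday_with_id(data, dates, bday_col="bday"):
    """Replace each bday value with the id of the matching row in the dates table.

    Assumes the dates table has 'date' and 'id' columns; the real script's
    matching columns may differ.
    """
    mapping = dict(zip(dates["date"], dates["id"]))
    data = data.copy()
    data[bday_col] = data[bday_col].map(mapping)
    return data
```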
- Any contribution that will enhance the current project is welcome.
- If you have any questions about this project you can open an issue.
- If you want to contribute to this project just create a pull request.
This project is licensed under the GNU Lesser General Public License v3.0; see the LICENSE.md file for details.