Welcome! This repo was created to upload our data analisys project.
This project had the intention of covering all steps of an ELT data integration, from extraction, passing through a creation of a data lake, uploaded to a database (data warehouse) and being transformed for data visualization.
The data was acquired through the TSE (Brazilian Superior Electoral Court) repository.
To do so, it was used scrapy library to create a crawler that look up for HTML tags and download the correspondent files.
file_url = response.css('.resource-url-analytics::attr(href)').get()
Using the following Python libraries:
import os
import csv
import pandas
import random
import zipfile
import pathlib
import sqlalchemy
The data was handled in a way to minimize and optmize memory and storage, so the steps were the following:
- Extracting file
- Deleting original zip file
- Creating a list with files path
- Creating and uploading table to postgreSQL
It was used only a 10% sample of the data, chosen randomly.
💡 Tip: If you want to use the totality of the datasets, then you should use copy_table
function combined with pandas .head
method. Once COPY TABLE
statement require less memory usage.
Docker was used to easy the applications setup by creating containers as shown on the diagram:
PostgreSQL was chosen for its open source characteristics and easy conection to Dremio.
And lastly Dremio was used as a mid man between PostgreSQL and PowerBI because of its PowerBI integration and interactive analytics directly on the data lake.
Dremio was mainly used to create virtual datasets, known as views in a SQL database. A series of SQL queries were made to format data type, remove the missing values from the data source and direct connecting those views to PowerBI.
Select
Cast(Replace("ANO_ELEICAO", '"', '') AS varchar(100)) AS "ANO_ELEICAO",
Cast(Replace("SG_UF", '"', '') AS varchar(100)) AS "SG_UF",
Cast(Replace("CD_MUNICIPIO", '"', '') AS varchar(100)) AS "CD_MUNICIPIO",
Cast(Replace("NM_MUNICIPIO", '"', '') AS varchar(100)) AS "NM_MUNICIPIO",
Cast(Replace("CD_GENERO", '"', '') AS varchar(100)) AS "CD_GENERO",
Cast(Replace("DS_GENERO", '"', '') AS varchar(100)) AS "DS_GENERO",
Cast(Replace("QT_ELEITORES_PERFIL", '"', '') AS bigint) AS "QT_ELEITORES_PERFIL",
Cast(Replace("QT_ELEITORES_BIOMETRIA", '"', '') AS bigint) AS "QT_ELEITORES_BIOMETRIA",
Cast(Replace("QT_ELEITORES_DEFICIENCIA", '"', '') AS bigint) AS "QT_ELEITORES_DEFICIENCIA",
Cast(Replace("QT_ELEITORES_INC_NM_SOCIAL", '"', '') AS bigint) AS "QT_ELEITORES_INC_NM_SOCIAL"
From "perfil_eleitorado_1994.csv"
Prior to the formatting queries, a verification of the columns was done:
Select
CASE
WHEN (Select COUNT("CD_MUN_SIT_BIOMETRICA")
FROM "perfil_eleitorado_1994.csv"
Where "CD_MUN_SIT_BIOMETRICA" = '-3') = COUNT("CD_MUN_SIT_BIOMETRICA") THEN TRUE
ELSE FALSE
END AS "CD_MUN_SIT_BIOMETRICA",
CASE
WHEN (Select COUNT("NR_ZONA")
FROM "perfil_eleitorado_1994.csv"
Where "NR_ZONA" = '-1') = COUNT("NR_ZONA") THEN TRUE
ELSE FALSE
END AS "NR_ZONA",
CASE
WHEN (Select COUNT("CD_ESTADO_CIVIL")
FROM "perfil_eleitorado_1994.csv"
Where "CD_ESTADO_CIVIL" = '-3') = COUNT("CD_ESTADO_CIVIL") THEN TRUE
ELSE FALSE
END AS "CD_ESTADO_CIVIL",
CASE
WHEN (Select COUNT("CD_FAIXA_ETARIA")
FROM "perfil_eleitorado_1994.csv"
Where "CD_FAIXA_ETARIA" = '-3') = COUNT("CD_FAIXA_ETARIA") THEN TRUE
ELSE FALSE
END AS "CD_FAIXA_ETARIA",
CASE
WHEN (Select COUNT("CD_GRAU_ESCOLARIDADE")
FROM "perfil_eleitorado_1994.csv"
Where "CD_GRAU_ESCOLARIDADE" = '0') = COUNT("CD_GRAU_ESCOLARIDADE") THEN TRUE
ELSE FALSE
END AS "CD_GRAU_ESCOLARIDADE"
FROM "perfil_eleitorado_1994.csv"
PowerBI is a widely used collection of software services, easy to use and with a self-service approach to Business Inteligence. The data visualization of this project was made with PowerBI.
As an example the following images are some of the graphics created with PowerBI