/data-engineer-portfolio

This is a repository that presents my background, skills, and projects, and keeps track of my progression in Data Analytics and Data Science.

Hoang Anh Tuan - A Data Engineering Portfolio

Exploring Data Realms: Anh Tuan embarks on a journey through data engineering after three years of education in Software Development, followed by two years specializing in Data Engineering.

Update: February 2nd, 2024

My CV / Resume: DE Resume Version 10

About Me

Based: Hanoi, Vietnam

Status: Currently Open for a Data Engineer Job

Contact me: Zalo (+84) 38 230 4536

Email: anhtuanhoang2001@gmail.com

LinkedIn: Click Here

Hello, my name is Anh Tuan! I am a data addict with a solid background in computer science and a love of numbers, statistics and data in general. With a degree in "Computer Science - Software Development," I bring to the field of data a unique blend of technical expertise and an analytical approach. As a candidate seeking the role of Data Engineer, I am delighted to demonstrate my technical, analytical and logical talents in the field of Data Engineering and Data Science.

During my time at university, I excelled at developing software and building websites that met each course's requirements. I then realized that I might be even better at dealing with data, so I applied my coding and software development skills to the Data Science playground. Now I can confidently say that I work well with complex data and am proficient in constructing efficient data pipelines, developing expansive data lakes, and optimizing data warehouses using a blend of programming expertise and cloud technologies. Furthermore, through coursework in university laboratories, I gained experience in leadership, data management, statistical analysis, machine learning and data manipulation, all of which I believe will be useful for my next adventure as a Data Engineer.

In my spare time, I am always excited to investigate new data analysis and manipulation tools and approaches, always looking for ways to broaden my knowledge and sharpen my skills. If you're looking for someone with a unique combination of software development and data engineering skills, I'd love to talk about how I might help your projects and organization. Please contact me and let's begin a data-driven journey together!

The CV is included within the repository

Why Data Engineer?

At the beginning of my journey into the realm of data, I initially aspired to a career in Data Analysis. However, as my university education progressed, I dug deeply into various subjects and consistently performed at the top, achieving the highest distinctions in disciplines such as Object-Oriented Programming and Software Testing. While my focus wasn't solely on excelling in these areas, my commitment to excellence resulted in commendable performance. As I approached the end of my time at university, I realized the potential for synergy between my aptitude for Data Analysis and my proficiency in Software Development. This realization led me to the role that combines both fields: Data Engineer, which was one of the best decisions I have ever made.

Why not Data Scientist or AI Scientist?

In fact, while I was taking data engineering courses, I did consider becoming a Data Scientist, who would spend the entire working day researching ways to make machine learning algorithms more accurate and identifying the key learning characteristics of a dataset, or an Artificial Intelligence Scientist, who would create extremely sophisticated language models and dedicate a lifetime to refining them. Yes, I gave it some thought, but in the end I simply love being a Data Engineer; perhaps an angel led me down this career path, who knows.

I possessed skills that were applicable across all three roles, enabling me to pivot between them with little friction. But at the end of the day, technology is technology, and I simply do not want to switch.

Why did I attend two universities but only have one Bachelor's Degree?

For your information, Hanoi University of Science and Technology (HUST) is widely considered the best university in Vietnam, ranked 350 globally (according to IDP Education, 2022); its graduates generally go on to strong careers.

On the other hand, Swinburne University of Technology Hanoi is an Australia-based technology university, ranked #43 in the world across all disciplines and #2 in the world for Mathematics and Computer Science (according to IDP Education, 2022).

Why would I transfer to a different university after only four months of study, you might wonder? The studying environment and discipline hold the key to the answer. I had a really difficult time adjusting to the relentless, high-pressure environment during my time as a student at HUST, and I had a lot of trouble communicating. I then made the decision to look for a university from across the globe, and luckily for me, one of my best friends who was studying in Australia at the time mentioned Swinburne University. That was at the beginning of 2020, and once I took my very first class at this university, everything changed. Since then, I have been in love with the institution; the atmosphere for education is excellent, people interact with one another, and there is no backstabbing, no trash-talking, no arguing, just peace.

What did I prepare for the Data Engineering Career?

What did I prepare? Everything. But if you are curious, here is the breakdown of all the knowledge I have gathered to become a part of the Data Engineering community:

0. English Proficiency ------------------------------- [✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸]

1. SQL ------------------------------------------------ [✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸※※]

2. Power BI / Tableau ------------------------------- [✸✸✸✸✸✸✸✸✸✸※※※※※※※※※※]

3. Python -------------------------------------------- [✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸✸※※※※]

4. Pandas -------------------------------------------- [✸✸✸✸✸✸✸✸✸✸✸✸✸※※※※※※※]

5. Matplotlib ---------------------------------------- [✸✸✸✸✸✸✸✸✸✸✸✸✸✸※※※※※※]

6. Machine Learning Models ----------------------- [✸✸✸✸✸✸✸※※※※※※※※※※※※※]

7. Spark ---------------------------------------------- [✸✸✸✸✸✸✸✸※※※※※※※※※※※※]

8. Hadoop Ecosystem ------------------------------ [✸✸✸✸✸※※※※※※※※※※※※※※※]

9. MongoDB / Redis -------------------------------- [✸✸✸✸✸✸※※※※※※※※※※※※※※]

10. Git ------------------------------------------------ [✸✸✸✸✸※※※※※※※※※※※※※※※]

11. Amazon Web Services --------------------------- [✸✸✸✸✸※※※※※※※※※※※※※※※]

12. C++ ---------------------------------------------- [✸✸✸✸✸※※※※※※※※※※※※※※※]

13. Kotlin -------------------------------------------- [✸✸✸✸✸✸✸※※※※※※※※※※※※※]

14. Airflow ------------------------------------------- [✸✸✸✸✸※※※※※※※※※※※※※※※]

15. HTML / CSS -------------------------------------- [✸✸✸✸✸✸✸✸✸✸✸✸✸※※※※※※※]

16. Object-Oriented Programming ---------------------- [✸✸✸✸✸✸✸※※※※※※※※※※※※※]

17. Data Structures and Patterns ------------------- [✸✸✸✸✸※※※※※※※※※※※※※※※]

18. Agile Methodology ----------------------------- [✸✸✸✸✸✸✸✸✸※※※※※※※※※※※]

19. Tkinter UI

20. Natural Language Processing

21. OpenAI API

22. Deep Learning Models

23. Google Cloud Platform

24. Prompt Engineering

25. CI/CD

26. TensorFlow

27. Docker

Education

Awards

Related Awards:

  • Best performance in the subject "INF10002 - Database Analysis and Design" with the academic score of 99/100
  • Championship of FPT Hanoi Debate Contest, organized by Swinburne Alliance Program
  • Qualified High Distinction in the subject "COS20001 - User-Centred Design"
  • Qualified High Distinction in the subject "COS10022 - Introduction to Data Science"
  • Qualified High Distinction in the subject "COS20019 - Cloud Computing Architecture"
  • Best performance of Subject "INF20016 - Big Data Management" with the score of 94/100
  • Best performance of Subject "SWE30009 – Software Testing and Reliability" with the score of 90/100
  • Qualified High Distinction in the subject "ICT30005-Professional Issues in Information Technology"

Portfolio Projects

In my portfolio, you'll find a collection of diverse data science and data analysis projects that showcase my skills in extracting valuable insights from data, solving real-world problems, and making data-driven decisions. The sections below contain the project files and their details.

Table of Contents

1. Programming (Python)

Google Cloud Platform Data Engineer Certificate Assistant Tool

Code: Professional Cloud Data Engineer Certification Assistance Tool

Description: The GCP Quiz Application is a comprehensive tool designed for users to enhance their knowledge and proficiency in Google Cloud Platform (GCP). It offers a diverse range of quizzes covering various GCP services and concepts, including compute, storage, networking, security, and more. The application facilitates effective learning through interactive quizzes, detailed explanations for correct and incorrect answers, and progress tracking features. Additionally, it incorporates customizable quiz options, enabling users to tailor quizzes based on their specific learning objectives and preferences.

Goal: To help users deepen their understanding of GCP services and concepts, ultimately empowering them to excel in GCP-related certifications, projects, and professional endeavors. By offering a platform for practicing and assessing knowledge, the application aims to foster continuous learning and skill development within the GCP ecosystem.

Skills: Object-oriented Programming, Quiz customization, interactive learning, progress tracking, GCP service knowledge, exam preparation, skill enhancement

Technology: Python, Tkinter, Pandas, Time Management Modules

Results: Users can effectively test and improve their knowledge of GCP services and concepts through interactive quizzes, achieving a deeper understanding and proficiency in Google Cloud Platform.
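
To give a flavor of how such a quiz loop can be wired together, here is a minimal, hypothetical Tkinter sketch; the questions, class name, and layout are invented for illustration and are not the repository's actual code:

```python
# Minimal Tkinter quiz loop (illustrative sketch only; questions are hypothetical).
import tkinter as tk

QUESTIONS = [
    {"text": "Which GCP service is a managed data warehouse?",
     "options": ["BigQuery", "Cloud SQL", "Dataproc"], "answer": "BigQuery"},
    {"text": "Which service runs Apache Spark and Hadoop clusters?",
     "options": ["Dataflow", "Dataproc", "Pub/Sub"], "answer": "Dataproc"},
]

class QuizApp:
    def __init__(self, root):
        self.index = 0
        self.score = 0
        self.choice = tk.StringVar()
        self.question_label = tk.Label(root, wraplength=400, font=("Arial", 12))
        self.question_label.pack(pady=10)
        self.option_frame = tk.Frame(root)
        self.option_frame.pack()
        tk.Button(root, text="Submit", command=self.submit).pack(pady=10)
        self.show_question()

    def show_question(self):
        # Render the current question and rebuild the radio-button options
        q = QUESTIONS[self.index]
        self.question_label.config(text=q["text"])
        for widget in self.option_frame.winfo_children():
            widget.destroy()
        self.choice.set("")
        for option in q["options"]:
            tk.Radiobutton(self.option_frame, text=option, value=option,
                           variable=self.choice).pack(anchor="w")

    def submit(self):
        # Score the answer, then advance or show the final result
        if self.index >= len(QUESTIONS):
            return
        if self.choice.get() == QUESTIONS[self.index]["answer"]:
            self.score += 1
        self.index += 1
        if self.index < len(QUESTIONS):
            self.show_question()
        else:
            self.question_label.config(text=f"Done! Score: {self.score}/{len(QUESTIONS)}")

if __name__ == "__main__":
    root = tk.Tk()
    root.title("GCP Quiz (sketch)")
    QuizApp(root)
    root.mainloop()
```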

2. Data Science using Python

Public Repository: Python DA

Customer Churn Prediction

Code: Customer Churn Prediction

Description: The project encompasses a multifaceted approach, including crucial tasks such as data cleaning, in-depth data analysis, and comprehensive data visualization to glean insights. Additionally, it employs the creation of a confusion matrix to evaluate model performance, addressing the data imbalance through techniques like data balancing. Furthermore, the project incorporates data scaling methods to enhance model efficiency and employs a range of machine learning algorithms to build predictive models, facilitating accurate predictions regarding customer churn behavior.

Goal: To forecast customer behavior using a dataset comprising 19 feature columns and a target column, predicting customer return likelihood

Skills: data cleaning, data analysis, data visualization, confusion matrix, balancing data, data scaling, machine learning, predictive models

Technology: Python, Pandas, Numpy, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, Logistic Regression, Decision Tree, Random Forest, K-Neighbors Classifier, GaussianNB, Support Vector Machine, SMOTE, GridSearch

Results: Predicted customer return behavior using machine learning models trained on feature-rich data, achieving an accuracy of 97% in determining potential churn.
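
As a rough illustration of the workflow described above (imbalanced target, SMOTE rebalancing, grid search, confusion matrix), here is a sketch using synthetic data in place of the real 19-feature churn dataset; it is not the notebook's exact code:

```python
# Sketch of the churn-prediction workflow on synthetic, imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Imbalanced binary target, similar in spirit to a churn label
X, y = make_classification(n_samples=2000, n_features=19, weights=[0.85, 0.15],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Rebalance the training split only, so the test set stays realistic
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Small grid search over a Random Forest
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
                    cv=3, scoring="f1")
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))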

Travel Insurance Prediction

Code : Travel Insurance Prediction

Description: By using machine learning models to forecast outcomes based on consumers who have already purchased insurance, this project helps the insurance company focus on new clients based on the qualities and characteristics they provided. By targeting the right consumers, the company can cut costs, make better use of its data, and increase profits by leveraging the models' predictions.

Goal: To predict which customers would want to buy the Travel Insurance and which would not.

Skills: data cleaning, data analysis, data visualization, confusion matrix, balancing data, data scaling, machine learning, predictive models

Technology: Python, Pandas, Numpy, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, ROC Curve, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, K-Neighbors Classifier, GaussianNB, Multi-layer Perceptron, XGBoost, CatBoost, SMOTE, time, random

Results: Utilized various machine learning models to predict customers' likelihood to purchase travel insurance, enhancing the company's ability to target potential clients and optimize profits.
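
A minimal sketch of the multi-model comparison idea, using synthetic data rather than the actual insurance records; the models shown are only a subset of those listed above:

```python
# Sketch of comparing several classifiers on one dataset by accuracy and ROC AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, accuracy_score

X, y = make_classification(n_samples=1500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "K-Neighbors": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name:20s} accuracy={accuracy_score(y_test, model.predict(X_test)):.3f} "
          f"roc_auc={roc_auc_score(y_test, proba):.3f}")
```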

Australia Rain Prediction

Code : Australia Rain Prediction

Description: This project is centered around predicting tomorrow's rain in Australia. The dataset used comprises approximately 10 years of daily weather observations from multiple locations across the country. The primary objective of this project is to develop classification models that can determine whether it will rain tomorrow (Yes or No) based on historical weather data.

Goal: To predict whether it will rain the next day based on today's weather features

Skills: data cleaning, data analysis, data visualization, data scaling, machine learning, predictive models

Technology: Python, Pandas, Numpy, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, ROC Curve, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, K-Neighbors Classifier, Label Encoder, Plotly Express, StandardScaler

Results: Developed and evaluated classification models, including Random Forest, for predicting rain in Australia based on historical weather data, with the Random Forest model achieving the highest accuracy and precision.
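
The preprocessing pattern above (label encoding of categorical weather columns plus scaling before classification) can be sketched as follows; the toy weather rows are invented and stand in for the real 10-year dataset:

```python
# Sketch: encode categorical columns, scale features, fit a Random Forest.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Sydney", "Darwin", "Perth", "Darwin"],
    "MinTemp": [13.4, 18.2, 15.1, 24.0, 17.5, 25.3],
    "Humidity3pm": [22, 55, 30, 80, 48, 75],
    "RainToday": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "RainTomorrow": ["No", "Yes", "No", "Yes", "No", "Yes"],
})

# Encode categorical columns to integers
for col in ["Location", "RainToday", "RainTomorrow"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns="RainTomorrow")
y = df["RainTomorrow"]

# Scale features; trees do not strictly need it, but it keeps the pipeline uniform
# across the different models compared in the notebook
X_scaled = StandardScaler().fit_transform(X)

clf = RandomForestClassifier(random_state=0).fit(X_scaled, y)
print("Training accuracy on the toy sample:", clf.score(X_scaled, y))
```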

Student Performance in Exam Analysis

Code : Student Performance in Exam Analysis

Description: This project is an extensive data analysis of the factors influencing students' test scores. Utilizing data visualization and statistical techniques, I uncovered key insights into the role of poverty, parental education, ethnicity, and test preparation in student performance. The project not only provided valuable insights for educational improvement but also emphasized the importance of data-driven decision-making in addressing academic disparities.

Goal: To analyze a dataset containing information about students' test scores and various demographic factors to gain insights into the relationships between these variables

Skills: data cleaning, data analysis, data visualization, storytelling.

Technology: Python, Pandas, Numpy, Matplotlib, Histplot, Math, Seaborn

Results: A comprehensive analysis of the factors impacting students' test scores, highlighting the significance of socioeconomic status, parental education, ethnicity, and test preparation, which can inform targeted educational interventions and policy decisions
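
A small sketch of the grouped-comparison style of analysis used here, with invented records standing in for the exam dataset:

```python
# Sketch: average test score grouped by test-preparation status, plus a bar chart.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "test_preparation": ["none", "completed", "none", "completed", "none", "completed"],
    "parental_education": ["high school", "bachelor's", "some college",
                           "bachelor's", "high school", "master's"],
    "math_score": [58, 74, 61, 81, 55, 88],
})

# Numeric summary of the group difference
print(df.groupby("test_preparation")["math_score"].mean())

# Visual comparison of the same grouping
sns.barplot(data=df, x="test_preparation", y="math_score")
plt.title("Average math score by test preparation")
plt.tight_layout()
plt.show()
```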

Stellar Classification

Code : Stellar Classification

Description: This data analysis and data science project classifies celestial objects such as stars, quasars, and galaxies. Working with a dataset containing 18 curated feature columns, the project aims to support astronomical science and discovery in several ways.

Goal: To classify the type of celestial object for further astronomical purposes

Skills: data cleaning, data analysis, data visualization, machine learning, predictive models, confusion matrix, balancing data.

Technology: Python, Pandas, Numpy, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, ROC Curve, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, K-Neighbors Classifier, GaussianNB, Multi-layer Perceptron

Results: Applied machine learning models to classify celestial objects, including stars, quasars, and galaxies, enhancing our understanding of the universe and supporting data-driven advancements in astronomical research.
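
For illustration, a sketch of multi-class evaluation with a confusion matrix and classification report, on synthetic features rather than the real stellar data; the class names are placeholders:

```python
# Sketch of multi-class evaluation (3 classes standing in for galaxy/quasar/star).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report

X, y = make_classification(n_samples=900, n_features=18, n_informative=10,
                           n_classes=3, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

clf = GradientBoostingClassifier(random_state=7).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows = true class, columns = predicted class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["galaxy", "quasar", "star"]))
```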

Mobile Price Classification

Code : Mobile Price Classification

Description: The project involved working with a dataset encompassing 3000 rows and 20 distinct features representing various attributes of phones, including RAM, battery life, dimensions (width, height, depth), pixel specifications, color, and more. Each phone's data was utilized to predict its price class, categorized into four classes: lowest price (Class 0), normal price (Class 1), above-average price (Class 2), and expensive phones (Class 3).

Goal: To leverage machine learning techniques on a dataset comprising 3000 phone entries with 20 features to accurately predict the price class of each phone, categorized into distinct pricing tiers.

Skills: data cleaning, data analysis, data visualization, machine learning, predictive models, confusion matrix, data scaling

Technology: Python, Pandas, Numpy, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, MinMaxScaler, Logistic Regression, Decision Tree, Random Forest, XGBoost, Support Vector Machine, GaussianNB

Results: The project successfully developed predictive models that accurately categorized phones into price classes, enabling effective classification of phones based on their attributes into different pricing tiers.
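
A quick sketch of the min-max scaling step used before classification, with made-up phone specs (not rows from the real 3000-entry dataset):

```python
# Sketch: MinMaxScaler squeezes every feature into [0, 1] so no attribute dominates
# a model purely by its raw magnitude (e.g. RAM in MB vs. pixel height).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

specs = pd.DataFrame({
    "ram_mb": [512, 2048, 4096, 8192],
    "battery_mah": [1500, 3000, 4000, 5000],
    "px_height": [800, 1280, 1920, 2560],
})

scaled = pd.DataFrame(MinMaxScaler().fit_transform(specs), columns=specs.columns)
print(scaled)
```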

Fatalities in Israel-Palestine War Analysis

Code : Fatalities in Israel-Palestine War Analysis

Description: The "Fatalities in Israel-Palestine War Analysis" project is a comprehensive data analysis and visualization endeavor aimed at shedding light on the grim and complex topic of fatalities within the Israel-Palestine conflict from 2000 to 2023. This project utilizes a wealth of data to provide a deep and insightful exploration of the various aspects of casualties, including their numbers, causes, the involved forces, geographical distribution, and more. The project starts by collecting and organizing a vast dataset containing statistics related to fatalities during the Israel-Palestine conflict. This data is meticulously curated and processed to ensure accuracy and relevance.

Goal: To analyse the number of deaths and their causes in order to understand the destruction within the war zone from 2000 to 2023

Skills: data cleaning, data analysis, data visualization, time series, geographical analysis.

Technology: Python, Pandas, Matplotlib, Histplot, Time Series.

Results: Conducted a comprehensive analysis of fatalities within the Israel-Palestine conflict from 2000 to 2023, exploring casualties, their causes, geographic distribution, and more, to shed light on the complex and grim topic of war-related deaths.
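
A minimal sketch of the time-series aggregation idea (rolling event-level records up to yearly fatality counts), using invented rows rather than the real dataset:

```python
# Sketch: aggregate event-level records into yearly fatality totals with pandas.
import pandas as pd

events = pd.DataFrame({
    "date_of_event": pd.to_datetime(["2000-03-11", "2000-07-02", "2001-01-15",
                                     "2001-09-30", "2002-04-08"]),
    "fatalities": [1, 3, 2, 5, 4],
})

# Group by calendar year and sum the fatalities per year
yearly = events.groupby(events["date_of_event"].dt.year)["fatalities"].sum()
print(yearly)
```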

Airline Passenger Satisfaction Analysis

Code : Airline Passenger Satisfaction Analysis

Description: This project assists an airline in identifying areas for improvement based on feedback from each client who used its services. The data covers a wide range of information, including customer gender, age, and type, as well as numerous rating criteria such as Inflight service and Cleanliness. It comprises 23 fields in total that are used to determine whether or not the customer is satisfied. My EDA was followed by several recommendations for problem-solving tactics, each supported by data-driven insights. I then trained 10 different machine learning models to determine which one was the most effective.

Goal: To determine in which areas the airline needs improvement

Skills: data cleaning, data analysis, data visualization, machine learning, predictive models, confusion matrix, balancing data.

Technology: Python, Pandas, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, RobustScaler, ROC Curve, CatBoost, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Ada Boost, K-Neighbors Classifier, GaussianNB, Multi-layer Perceptron

Results: Utilized various machine learning models to identify areas for improvement in an airline's services based on customer feedback, enhancing the airline's strategy and customer satisfaction
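
A short sketch of why RobustScaler is useful for outlier-heavy columns: it centres on the median and scales by the interquartile range, so one extreme value does not distort the whole feature. The delay column below is invented, not a field from the real survey:

```python
# Sketch: StandardScaler vs. RobustScaler on a column with one extreme outlier.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

delays = np.array([[0.0], [5.0], [10.0], [15.0], [20.0], [600.0]])  # one extreme value

print("StandardScaler:", StandardScaler().fit_transform(delays).ravel().round(2))
print("RobustScaler:  ", RobustScaler().fit_transform(delays).ravel().round(2))
```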

Diabetes Prediction

Code : Diabetes Analysis and Prediction

Description: The dataset includes hundreds of thousands of patient records used to test and screen for Diabetes. Each record contains diabetes-related statistics (age, gender, hypertension history, heart disease history, smoking history, BMI, HbA1c level, blood glucose level), and the last column is the diabetes label: a 1 means the patient tested positive for Diabetes, and a 0 means no Diabetes. The project's workflow is as follows: Data Collection, Data Cleaning, Data Preprocessing, Exploratory Data Analysis, Disease Prediction Using Machine Learning Algorithms.

Goal: To determine which factors could be causes of Diabetes, and to predict the disease based on the statistics.

Skills: data cleaning, data analysis, data visualization, machine learning, predictive models, confusion matrix, balancing data.

Technology: Python, Pandas, Matplotlib, Seaborn, MinMaxScaler, GridSearchCV, Confusion Matrix, Classification Report, Random Forest, Logistic Regression, Gaussian Naive Bayes, K-Nearest Neighbors, ROC Curve

Results: Identified key factors contributing to Diabetes, developed predictive models, and gained insights into disease causation, enabling early detection and improved health outcomes.
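
A sketch of the preprocessing step for records like these: one-hot encoding the categorical columns (gender, smoking history) before fitting a model. The patient rows are invented, not real data:

```python
# Sketch: one-hot encode categorical medical fields, then fit a simple classifier.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "smoking_history": ["never", "former", "current", "never"],
    "bmi": [24.1, 31.0, 27.5, 22.8],
    "HbA1c_level": [5.2, 7.1, 6.4, 5.0],
    "diabetes": [0, 1, 1, 0],
})

# Expand categorical columns into indicator columns; numeric columns pass through
X = pd.get_dummies(df.drop(columns="diabetes"), drop_first=True)
y = df["diabetes"]

model = LogisticRegression(max_iter=1000).fit(X, y)
print(X.columns.tolist())
print("Training accuracy on the toy sample:", model.score(X, y))
```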

Water Potability Prediction

Code : Water Potability Prediction

Description: This dataset provides measurements and predictions of water quality related to potability, which is the acceptability of water for human consumption. The major goal of the dataset is to provide insights into water quality characteristics and aid in deciding whether or not the water is potable. In this project, I employed machine learning techniques to predict the potability of water sample records. The outcomes of models such as the Categorical Boosting Classifier and the Gaussian Naive Bayes Classifier may not be optimal due to the instability and weak correlation between data fields.

Goal: To determine whether the water is drinkable based on the given measurements

Skills: data cleaning, data analysis, data visualization, machine learning, predictive models, confusion matrix, balancing data.

Technology: Python, Pandas, Matplotlib, Histplot, Correlation Heatmap, Confusion Matrix, Classification Report, Gaussian Naive Bayes, CatBoost, SMOTE, Binning, Clustering

Results: Successfully predicted water potability, enhancing the understanding of water quality and suitability for human consumption using machine learning techniques.
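
A tiny sketch of the binning step, grouping a continuous measurement into labelled ranges; pH is used here as an assumed example and the values are invented:

```python
# Sketch: bin a continuous water-quality measurement into labelled ranges.
import pandas as pd

ph = pd.Series([5.8, 6.9, 7.2, 8.4, 9.1, 6.2])
bands = pd.cut(ph, bins=[0, 6.5, 8.5, 14], labels=["acidic", "neutral", "alkaline"])
print(pd.concat([ph.rename("pH"), bands.rename("pH_band")], axis=1))
```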

Breast Cancer Prediction

Code : Breast Cancer Prediction

Description: I designed and executed a data-driven project focused on predicting the diagnosis of breast cancer. Leveraging a dataset comprising 27 feature columns related to health, I conducted a comprehensive analysis to forecast the likelihood of malignancy ("M") or benignity ("B"). The project encompassed crucial stages, including data cleaning, exploratory data analysis (EDA), and the application of various machine learning classifier models.

Goal: To determine each patient's diagnosis: "B" (benign, low risk) or "M" (malignant, positive for breast cancer)

Skills: data cleaning, data analysis, data visualization, machine learning, confusion matrix, scaling.

Technology: Python, Pandas, Numpy, Matplotlib, Correlation Heatmap, Confusion Matrix, Classification Report, ROC Curve, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, K-Neighbors Classifier, GaussianNB, MinMaxScaler, StandardScaler

Results: Effectively utilized machine learning to predict breast cancer diagnosis, distinguishing between benign and malignant cases.
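
A sketch of the ROC-curve evaluation used for a binary diagnosis classifier, with synthetic features standing in for the 27 health-related columns:

```python
# Sketch: fit a classifier, then plot its ROC curve and report the AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=600, n_features=27, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```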

3. Apache Spark (Scala Spark, PySpark)

Public Repository: Apache Spark

Billionaire Research And Exploration

Code: Billionaire Research And Exploration

Description: A comprehensive analysis project utilizing PySpark and the Billionaires Dataset to delve into the world of extreme wealth. The project involves data cleaning, exploration, and in-depth analysis of various attributes associated with billionaires worldwide. It encompasses observing trends, applying data scaling techniques, and implementing machine learning algorithms, such as K-means clustering and logistic regression using MLlib.

Goal: To leverage PySpark and the Billionaires Dataset to conduct comprehensive data analysis, implement machine learning algorithms, and derive actionable insights into wealth distribution, demographic trends, and industry patterns among billionaires worldwide

Skills: data cleaning, data analysis, data visualization, machine learning, predictive models, confusion matrix, data scaling.

Technology: Apache Spark, PySpark, SQL, Python, Pandas, Matplotlib, Correlation Heatmap, Confusion Matrix, MLlib, K-Means Clustering, Logistic Regression

Results: Yielded insightful findings, including distinct clusters of billionaires based on wealth attributes identified through K-means clustering, alongside predictive analysis of wealth accumulation factors using Logistic Regression, offering a great understanding of wealth distribution and demographic trends among billionaires globally.
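
A hedged sketch of the MLlib clustering step (VectorAssembler followed by K-means); the rows and column names below are invented stand-ins for the Billionaires Dataset, and pyspark must be installed:

```python
# Sketch: assemble numeric columns into a feature vector, then cluster with MLlib K-means.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("billionaires-sketch").getOrCreate()

df = spark.createDataFrame(
    [(101.0, 58), (94.0, 61), (4.2, 45), (3.1, 39), (1.5, 72)],
    ["net_worth_billions", "age"],
)

# MLlib estimators expect a single vector column of features
features = VectorAssembler(inputCols=["net_worth_billions", "age"],
                           outputCol="features").transform(df)

model = KMeans(k=2, seed=1, featuresCol="features").fit(features)
model.transform(features).select("net_worth_billions", "age", "prediction").show()

spark.stop()
```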

Big Data Fundamentals Journey with PySpark

Description: Embark on a transformative journey into the world of Big Data using PySpark. This project is a progressive exploration, divided into four stages, reflecting my evolution from a novice to an intermediate PySpark enthusiast.

Part 1 - Introduction to PySpark
  • URL: Part 1 - Introduction to PySpark
  • Description: In the initial phase, dive into the foundational concepts of PySpark. Learn its architecture, core functionalities, and basic data manipulation techniques. Gain hands-on experience with small-scale datasets, mastering RDDs (Resilient Distributed Datasets) and DataFrame APIs.
Part 2 - Data Transformation and Processing
  • URL: Part 2 - DataFrame in Pyspark
  • Description: Progress to a more sophisticated stage where data transformation becomes key. Explore PySpark's powerful capabilities in processing large volumes of data efficiently. Work with transformations, aggregations, and complex operations on diverse datasets.
Part 3 - Data Visualization using PySpark
  • URL: Part 3 - Data Visualization using PySpark
  • Description: Dive into the world of data visualization with PySpark. Discover techniques to visualize and comprehend data at scale. Leverage PySpark's visualization libraries and tools to create insightful visual representations of complex datasets.
Part 4 - Machine Learning Application using MLlib
  • URL: Part 4 - Machine Learning Application using MLlib
  • Description: Take the leap into machine learning with PySpark. Explore the application of various machine learning algorithms using PySpark's MLlib library. Build and evaluate models for regression, classification, and clustering, empowering you to extract meaningful insights and predictions from Big Data.
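
As a taste of the DataFrame transformations and aggregations covered in Parts 1 and 2 above, here is a small, self-contained sketch; the data is invented and pyspark must be installed:

```python
# Sketch: filter, derive a column, and aggregate with the PySpark DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-fundamentals-sketch").getOrCreate()

sales = spark.createDataFrame(
    [("2023-01-05", "Books", 12.5), ("2023-01-06", "Books", 30.0),
     ("2023-01-06", "Games", 59.9), ("2023-01-07", "Games", 19.9)],
    ["order_date", "category", "amount"],
)

(sales
 .filter(F.col("amount") > 15)                                   # transformation
 .withColumn("amount_with_tax", F.round(F.col("amount") * 1.1, 2))
 .groupBy("category")                                            # aggregation
 .agg(F.count("*").alias("orders"), F.sum("amount_with_tax").alias("revenue"))
 .show())

spark.stop()
```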

4. Structured Query Language (SQL)

Public Repository: SQL

AdventureWorks Practical SQL - Knowledge Showcase Project

Note: Taking advantage of SQL Server's ability to import and restore .bak files, I used the AdventureWorks 2022 Database to build a substantial project. It serves as a comprehensive showcase, covering a range of SQL commands from fundamental to advanced levels, including intricate topics such as Dynamic SQL and user-defined functions.

Repository: AdventureWorks Practical Knowledge Showcase Project

Technology: SQL - SQL Server

Description: This initiative utilizes the AdventureWorks 2022 Database as the foundational data source for knowledge acquisition. It is compartmentalized into 20 distinct sections, each designed to cover specific SQL concepts. Within each section, there are 50 SQL requirements, each articulated in a concise format ranging from a single line to a maximum of 50 lines, depending on the complexity of the task. The complexity of requirements intensifies progressively, ensuring comprehensive coverage from foundational to advanced levels. Each requirement is uniquely numbered from 1 to 1000, incorporating distinctive elements to prevent replication. The following is an exhaustive list detailing each segment of this comprehensive project, along with their respective specifics.

Part 1 - Getting used to beginner level of SQL commands (Query 1 to 50)
  • URL: Part 1 - Getting used to beginner level of SQL commands
  • Description: This section serves as a foundation for the entire project, focusing on elementary SQL commands like SELECT, WHERE, LIKE, and ORDER BY. Challenges here are concise, typically solvable within five lines of code. While predominantly straightforward, a few tasks introduce beginners to slightly more complex queries involving multiple subqueries and/or CTEs, offering a gentle transition to more intricate SQL operations.
  • Length: 50 Problems - 314 Lines
  • Main Knowledge: TOP, Subqueries, CTE, Temporary Tables, DATEPART, DATEDIFF, GROUP BY, Aggregate Functions
Part 2 - Diving Deeper into Data Manipulation (Query 51 to 100)
  • URL: Part 2 - Diving Deeper into Data Manipulation
  • Description: Part 2 marks an elevation in SQL proficiency, delving into an intermediate level of database manipulation. Explore a diverse array of challenges featuring a blend of SQL commands, including GROUP BY, JOIN, HAVING, and complex conditional queries. Tasks vary in complexity, from analyzing data distributions to manipulating multiple datasets through joins and aggregations. This segment aims to solidify foundational knowledge while introducing nuanced SQL operations, laying the groundwork for advanced data querying and analysis.
  • Length: 50 Problems - 392 Lines
  • Main Knowledge: Subqueries, CTE, Multi-CTE, Temporary tables, DATEPART, DATEDIFF, ROUND, CAST AS Data-type
Part 3 - SQL Metamorphosis: Crafting with REPLACE, CAST, and NTILE (Query 101 to 150)
  • URL: Part 3 - SQL Metamorphosis: Crafting with REPLACE, CAST, and NTILE
  • Description: Part 3 delves into intermediate SQL operations, building on foundational concepts. It explores various functionalities like REPLACE, Common Table Expressions (CTEs), and conditional operations using CASE/WHEN statements. The tasks in this section involve grouping data using NTILE, conditional aggregations, and creating calculated columns based on specific conditions. These exercises challenge learners with tasks like analyzing credit ratings, implementing conditional logic for grouping data, and performing segmented computations. While not overly complex, this section presents a bridge between basic and more intricate SQL operations, offering learners a chance to expand their skills within a comfortable yet engaging environment.
  • Length: 50 Problems - 556 Lines
  • Main Knowledge: CTE, Multi-CTE, CAST, CASE/WHEN, NTILE, NEWID (Shuffle Random Order), REPLACE
Part 4 - SQL Shapeshifters: Introduction to Data Ranking and Splitting (Query 151 to 200)
  • URL: Part 4 - SQL Shapeshifters: Introduction to Data Ranking and Splitting
  • Description: This set of exercises delves into multifaceted queries leveraging a combination of SQL features. It involves filtering and extracting specific data sets based on various conditions using different SQL techniques like JOIN, EXISTS, and complex calculations involving date functions, mathematical operations, and conditional statements. Tasks range from utilizing JOIN operations to derive specific information about employees, employing EXISTS for complex filtering, to utilizing PIVOT for aggregating data based on different criteria. The use of CTEs and window functions like RANK(), PARTITION BY, and NTILE() adds complexity by manipulating data partitions and performing ranking operations.
  • Length: 50 Problems - 738 Lines
  • Main Knowledge: CTE, Multi-CTE, Subqueries, EXISTS, PIVOT, CASE/WHEN, RANK, PARTITION BY, NTILE
Part 5 - Introduction to Functions: Learning about Table-Returned Functions (Query 201 to 250)
  • URL: Part 5 - Introduction to Functions: Learning about Table-Returned Functions
  • Description: Part 5 introduces the concept of SQL functions returning tables. These functions are designed to retrieve specific sets of data from the employee database based on various input parameters. They employ a range of SQL techniques such as string manipulation, numeric comparisons, and conditional filtering to extract information. The functions are diverse, tackling different aspects of data retrieval. These functions offer a deeper understanding of how to use SQL functions to create tailored queries, leveraging parameters to extract precise datasets. This set of exercises demonstrates the versatility of SQL functions in isolating targeted information from complex databases.
  • Length: 50 Problems - 957 Lines
  • Main Knowledge: FUNCTIONS Return Tables
Part 6 - Dynamic SQL and Stored Procedures (Query 251 to 300)
  • URL: Part 6 - Dynamic SQL and Stored Procedures
  • Description: In this section, the focus shifts towards leveraging the power of dynamic SQL, stored procedures, and variable manipulation within SQL Server. The exercises delve into the realm of dynamic SQL, allowing for the creation and execution of SQL statements at runtime. Participants will engage with stored procedures, a powerful feature that encapsulates SQL logic for reuse and efficient management. The exercises showcase the usage of DECLARE and SET to handle variables within SQL Server procedures. These variables allow for the storage and manipulation of values, enabling dynamic control over queries and stored procedure behavior. Additionally, the concept of EXEC (Execute) is introduced, demonstrating its use to run dynamic SQL statements or execute stored procedures within other procedures. The tasks within this part emphasize the importance and utility of dynamic SQL in constructing adaptable and flexible queries, especially in scenarios where SQL statements need to be generated based on varying conditions or user inputs.
  • Length: 50 Problems - 761 Lines
  • Main Knowledge: Dynamic SQL, Stored Procedures, DECLARE, SET Variables, EXEC
Part 7 - Advanced Procedures and Function Problems (Query 301 to 350)
  • URL: Part 7 - Advanced Procedures and Function Problems
  • Description: Part 7 explores SQL functions' power to return tables or specific values and delves into dynamic SQL concepts. It introduces COALESCE for handling NULL values effectively and demonstrates TRY_CAST's safe data type conversion. Additionally, it covers CROSS APPLY, enabling the combination of rows from multiple tables. These exercises deepen understanding of functions' versatility, NULL value management, safe type conversions, and advanced querying techniques like CROSS APPLY.
  • Length: 50 Problems - 1051 Lines
  • Main Knowledge: FUNCTIONS Return Tables/Values, Dynamic SQL, COALESCE, TRY_CAST, CROSS APPLY
Part 8 - The Art of SQL Logic: Looping Techniques and Conditional Structures (Query 351 to 400)
  • URL: Part 8 - The Art of SQL Logic: Looping Techniques and Conditional Structures
  • Description: While loops and basic logic were barely mentioned in the previous parts, so Part 8 is dedicated to helping learners grasp logic and develop a problem-solving mindset around loops and conditions. This section presents the knowledge required to understand the WHILE and IF constructs, with a smaller focus on tables and data than the previous sections. Although many task requirements may look similar, each challenge was developed uniquely, and solving them required a great deal of critical thinking. Furthermore, the BIT type, known as a "Boolean" in other programming languages and holding either 1 (True) or 0 (False), appears in some of the very first problems of this section. In combination with the complex use of Subqueries and WHERE EXISTS, this section strengthens problem-solving skills.
  • Length: 50 Problems - 1079 Lines
  • Main Knowledge: Subqueries, Loops using WHILE, Conditions using IF, WHERE EXISTS, BIT
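
As a small illustration of the CTE and window-function pattern covered in Parts 3 and 4, here is a sketch run from Python against an in-memory SQLite database. The repository itself targets SQL Server and the AdventureWorks schema, so the table and query below are invented stand-ins (window functions require SQLite 3.25+, which ships with modern Python builds):

```python
# Sketch: a CTE plus RANK() OVER (PARTITION BY ...) executed from Python via sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_order (order_id INTEGER, territory TEXT, total REAL);
    INSERT INTO sales_order VALUES
        (1, 'North', 120.0), (2, 'North', 340.0), (3, 'North', 90.0),
        (4, 'South', 500.0), (5, 'South', 210.0);
""")

query = """
WITH ranked AS (
    SELECT territory,
           order_id,
           total,
           RANK() OVER (PARTITION BY territory ORDER BY total DESC) AS rank_in_territory
    FROM sales_order
)
SELECT * FROM ranked WHERE rank_in_territory <= 2;
"""

for row in conn.execute(query):
    print(row)
conn.close()
```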

THANKS FOR YOUR ATTENTION!

Sincerely,

tuanx18