Corona Virus and Wealth

Data Analytics (Course) Project

Intro:

The agency wants to know whether population wealth is an indicator of per capita COVID-19 cases and deaths, the death rate among confirmed cases, the rate of change of these factors, and spread to adjacent jurisdictions. Answering this requires understanding where COVID-19 is most likely to spread, where it is likely to spread fastest, and where it is likely to cause the most deaths. At the state level, medical assets such as people (doctors, nurses, and respirator technical staff), medical equipment, beds, supplies, and medicine could then be allocated proactively to predicted hotspots. One hypothesis, among many, is that population demographics related to wealth might provide insight into COVID-19 spread.

Modules:

The data includes county-level indicators for over 60 populations, including population density, race, poverty level, housing size, sources of income, employment status, whether living alone, language barriers, immigration status, and disability status.

Data Analytics:

The aggregate() function in R splits the data into subsets, computes summary statistics for each subset, and returns the result in grouped form. It is useful for aggregate operations such as sum, count, mean, minimum, and maximum. We aggregated the county rows by state using the SUM function. We then mutated 24 new columns that express features as percentages of the population, such as the share of people who are Black, white, underage, employed, or graduates and who could be affected by the coronavirus, and we summarized these 24 new columns.
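The project did this step in R. As a language-neutral illustration only, the same split, sum, and percentage idea can be sketched in plain Python. The rows, the column names (`population`, `employed`), and the helper names `aggregate_sum` and `add_percentage` are hypothetical and do not come from the project data.

```python
from collections import defaultdict

# Hypothetical county-level rows; values and column names are illustrative only.
counties = [
    {"state": "A", "population": 1000, "employed": 600},
    {"state": "A", "population": 3000, "employed": 1200},
    {"state": "B", "population": 2000, "employed": 1500},
]

def aggregate_sum(rows, by, columns):
    """Sum the given columns within each group (a group-by SUM)."""
    totals = defaultdict(lambda: {c: 0 for c in columns})
    for row in rows:
        for c in columns:
            totals[row[by]][c] += row[c]
    return dict(totals)

def add_percentage(totals, part, whole, name):
    """Add a derived column holding `part` as a percentage of `whole`."""
    for values in totals.values():
        values[name] = 100.0 * values[part] / values[whole]
    return totals

by_state = aggregate_sum(counties, "state", ["population", "employed"])
by_state = add_percentage(by_state, "employed", "population", "pct_employed")
print(by_state)
```

Summing counts first and computing the percentage afterwards matters: averaging county-level percentages directly would weight small and large counties equally.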

Machine Learning:

KNN (k-nearest neighbors) algorithms classify new data points based on a similarity measure (e.g. a distance function) to the stored training data. A new point is assigned the class chosen by a majority vote of its k nearest neighbors.
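A minimal sketch of the majority-vote rule, assuming Euclidean distance and k = 3. The points and the `"low"`/`"high"` labels are made up for illustration, and `knn_predict` is a hypothetical helper, not the project's code.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training data with two classes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (8.5, 8.5)))
```

Because the vote depends on raw distances, features on very different scales (for example population density versus a percentage) are usually standardized first.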

SVM module:

SVM, or Support Vector Machine, is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The idea of SVM is simple: the algorithm finds a line or hyperplane that separates the data into classes. Observation: the training accuracy is ~65% and the testing accuracy is 69%.
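The separating-hyperplane idea can be sketched as follows. This only shows how an already fitted linear SVM classifies a point and how the hinge loss used in soft-margin training penalizes points on the wrong side of the margin. The weights and bias are hypothetical values chosen for illustration, not parameters learned in this project.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1

def hinge_loss(weights, bias, x, y):
    """Zero when x is on the correct side with margin >= 1, otherwise positive."""
    return max(0.0, 1.0 - y * decision_value(weights, bias, x))

# Hypothetical hyperplane x1 + x2 - 10 = 0 (not fitted to any real data).
weights, bias = [1.0, 1.0], -10.0

print(classify(weights, bias, (8, 8)))
print(classify(weights, bias, (1, 2)))
print(hinge_loss(weights, bias, (5, 5.5), 1))
```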

Random Forest:

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. The method constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' predictions (classification) or the mean of their predictions (regression).
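The aggregation step alone can be sketched as below, assuming the per-tree predictions are already available. Training the trees is out of scope here, and `forest_classify` and `forest_regress` are hypothetical helper names.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Classification: return the most common class among the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs from three trees for a single observation.
print(forest_classify(["high", "low", "high"]))
print(forest_regress([2.0, 4.0, 6.0]))
```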

Observation:

The training accuracy is 74%, estimated with 10-fold cross-validation repeated 3 times.
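To make the resampling scheme concrete, here is a minimal sketch of how 10 folds repeated 3 times yield 30 train/test splits. The function name `repeated_kfold`, the sample count of 50, and the seed are assumptions for illustration. The project's actual resampling was presumably handled by its R tooling.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # Reshuffle before each repeat so the folds differ between repeats.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train, test

splits = list(repeated_kfold(n_samples=50))
print(len(splits))  # 10 folds x 3 repeats = 30 splits
```

A model would be fitted on each `train` set and scored on the matching `test` set, and the reported accuracy would be the average over all 30 scores.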

Conclusion:

This level of feature analysis and prediction can support early containment of the spread and help protect both the economy and individual lives through a statistically grounded system.