phase-1-project: A Jupyter Notebook repository from Okodoimonicah

# ANALYSIS OF BEST PERFORMING MOVIES CURRENTLY

Author: Monicah Iwagit Okodoi
Client: Microsoft

## Project Overview

With major companies now creating original video content, Microsoft has decided to launch a new movie studio but lacks knowledge of video content creation. Microsoft asked me to identify the steps it should take to enter this field. I was provided with several data files to analyze, and I use my findings to give the head of Microsoft’s new movie studio recommendations for succeeding in movie creation.

## Business Problem

Microsoft wants to start creating original video content but does not know enough about movie creation to move forward with its plan.

## Objectives

Microsoft has the following objectives:
• Find which movie genres perform well in the dataset and receive the most public attention.
• Determine the best time to release a movie.
• Identify which director is associated with the most popular genre.

Analyzing several data frames read from the box office data helped uncover patterns and relationships that support better business decisions. Data mining helps spot movie trends across various attributes, develop smarter methods for movie creation, and predict movie performance more accurately.

## METHOD: CRISP-DM

I follow the CRISP-DM process for this task. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:

1. Business understanding – to venture into movie production.
2. Data understanding – the data, drawn from top movie websites, was already provided.
3. Data preparation – cleaning data, removing unwanted columns, removing outliers, and converting columns to preferred data types.
4. Modeling – visualization with matplotlib.
5. Evaluation.
6. Deployment.

## Data and Analysis Overview

In this analysis, I examine large data sets covering different types of movies. The data includes many kinds of information about each movie, such as the release date, the director, the studio, average rating, rating, and domestic and foreign gross, along with other details obtained from different movie sites, as the separate data files show. I used three data sources to get the most comprehensive view of current movie performance.

• The Box Office Mojo data: provided as a zipped CSV file containing 5 columns and 3387 movies in total. The data set was obtained from the Box Office Mojo website and covers 2010-2018. In this data, IFC is the studio that appears most often.
• Rotten Tomatoes data: provided in CSV format with 1560 rows and 12 columns. Value counts show that Drama is the most common genre, followed by Comedy.
• IMDB data: provided as a SQL database. I worked with the movie basics and movie ratings tables in order to compare movie performance by average rating and genre.

I performed descriptive analysis on each data set to identify trends that bear on what makes a movie successful. The analysis relies mainly on graphs of particular attributes.
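
As a rough illustration of the IMDB comparison described above, the sketch below joins a movie basics table with a movie ratings table and averages the rating per genre. It is not the notebook's actual code. The table names `movie_basics` and `movie_ratings` follow the tables named in the text, but the column names (`movie_id`, `primary_title`, `genres`, `averagerating`, `numvotes`), the comma-separated genre format, and all sample rows are assumptions made for the example. It uses only the Python standard library and an in-memory database so it can run without the project's data files.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows; the real IMDB database is not reproduced here.
BASICS = [
    ("tt001", "Movie A", "Drama"),
    ("tt002", "Movie B", "Comedy,Drama"),
    ("tt003", "Movie C", "Action"),
    ("tt004", "Movie D", None),  # missing genre, excluded by the query
]
RATINGS = [
    ("tt001", 8.0, 1200),
    ("tt002", 6.0, 800),
    ("tt003", 7.0, 500),
    ("tt004", 9.0, 50),
]


def build_sample_db():
    """Create an in-memory database with assumed IMDB-style tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movie_basics "
        "(movie_id TEXT PRIMARY KEY, primary_title TEXT, genres TEXT)"
    )
    conn.execute(
        "CREATE TABLE movie_ratings "
        "(movie_id TEXT, averagerating REAL, numvotes INTEGER)"
    )
    conn.executemany("INSERT INTO movie_basics VALUES (?, ?, ?)", BASICS)
    conn.executemany("INSERT INTO movie_ratings VALUES (?, ?, ?)", RATINGS)
    return conn


def mean_rating_by_genre(conn):
    """Join the two tables and average the rating for each individual genre."""
    rows = conn.execute(
        """
        SELECT b.genres, r.averagerating
        FROM movie_basics AS b
        JOIN movie_ratings AS r ON r.movie_id = b.movie_id
        WHERE b.genres IS NOT NULL
        """
    ).fetchall()
    ratings = defaultdict(list)
    for genres, rating in rows:
        # A movie tagged "Comedy,Drama" counts toward both genres.
        for genre in genres.split(","):
            ratings[genre.strip()].append(rating)
    return {genre: sum(vals) / len(vals) for genre, vals in ratings.items()}


conn = build_sample_db()
genre_means = mean_rating_by_genre(conn)
conn.close()

# Print genres from highest to lowest mean rating.
for genre, mean in sorted(genre_means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genre}: {mean:.2f}")
```

In a notebook, the same query could be loaded into a data frame and the resulting genre averages plotted as a bar chart with matplotlib, in line with the graph-based analysis described above.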
