### https

https://www.overleaf.com/project/5bbdd48863057b0809cda168


Your Project Title Goes Here

A Mini Project Report Submitted by

Ms. Nikita S. Nale 1841045

Ms. Harshada S. Mane 1841042

In partial fulfillment for the requirement of Laboratory Practice-II of

Bachelor of Computer Engineering

Under the guidance of

Prof. Padulkar D. M (designation of guide)

Department of Computer Engineering

Vidya Pratishthan’s Kamalnayan Bajaj Institute of Engineering and Technology

Bhigawan Road, Vidyanagari

Baramati-413133

2018-2019

Vidya Pratishthan’s

Kamalnayan Bajaj Institute of Engineering and Technology, Baramati

Department of Computer Engineering

Certificate

This is to certify that the following students

Ms. Nikita S. Nale 1841045

Ms. Harshada S. Mane 1841042

have successfully completed their project work on TITLE OF YOUR PROJECT GOES HERE during the academic year 2018-2019 in partial fulfillment towards the completion of Laboratory Practice-II in Computer Engineering.

Project Guide (Prof. Padulkar D. M)

HoD, Deptt. of Comp. Engg. (Prof. Mrs. S. S. Nandgaonkar)

Principal (Dr. R. S. Bichkar)

Internal Examiner    External Examiner

Acknowledgments

We feel happy in forwarding this project report as an image of sincere effort. We are pleased to acknowledge Prof. Padulkar D. M for their invaluable guidance during this project work. We are also equally indebted to our principal, Dr. R. S. Bichkar, for his valuable help whenever needed.

Ms. Nikita S. Nale

Ms. Harshada S. Mane


Abstract

As we know, predicting a movie’s success is a difficult problem. A movie’s success does not depend only on its quality; external factors such as competing movies and the time of year also affect it, and these factors impact the box-office sales for the movie’s opening. We introduce a simple solution for predicting movie success in terms of rating and revenue. This approach gives a reasonable appraisal of success, allowing theatre planning to a certain extent, even for small studios. The prediction of movie success is therefore of great importance to the industry. In this project we present a detailed study of Logistic Regression, Naive Bayes, and K-Nearest Neighbours applied to movie data to predict the movie success rate.


Contents

Acknowledgments
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Overview
  1.2 Brief Description
  1.3 Problem Definition
2 Literature Survey
3 Dataset Description
  3.1 Introduction
    3.1.1 Purpose
    3.1.2 Project Scope
    3.1.3 Assumptions and Dependencies
4 Data Preprocessing and Visualization
  4.1 Steps in Data Preprocessing
  4.2 Visualization
5 Classification
  5.1 Logistic Regression
  5.2 Naive Bayes Classifier
  5.3 K-Nearest Neighbours
6 Confusion Matrix
  6.1 Analyse Confusion Matrix
    6.1.1 Logistic Regression
    6.1.2 Naive Bayes Classifier
    6.1.3 KNN Classifier
  6.2 Compare Classifiers
7 Result Analysis
  7.1 Logistic Regression
  7.2 Naive Bayes Classifier
  7.3 KNN Classifier
8 Conclusion and Future Work
A Glossary
Bibliography

Identifying the Movie Success Rate

VPKBIET, Baramati

List of Tables

6.1 Confusion matrix of Logistic Regression
6.2 Confusion matrix of Naive Bayes Classifier
6.3 Confusion matrix of KNN Classifier

List of Figures

4.1 Movie with year vs rating

1 Introduction

1.1 Overview

The film industry is one of the largest sources of entertainment in the world. It produces thousands of films annually and earns billions of dollars in revenue. Large production houses dominate the industry and spend heavily on promoting their movies, so advertising contributes substantially to a movie's total budget. When a movie fails, the producer can lose much of that investment. If the success rate of a movie could be known in advance, production houses could adjust their release schedules to maximize profit, releasing when the market is favorable and holding back when it is not. Predicting a movie's opening at the box office is therefore very important for increasing its chance of success.

1.2 Brief Description

The objective of our project is to predict the success rate of a movie based on attributes such as the actors involved, the directors, the release year, the movie genre, the total runtime, the user rating, the number of votes, the total revenue generated, the overall metascore, the age of the users watching and voting or rating, the geographical areas where the movie was released, and other influences such as political movements and ongoing trends. The most difficult task is obtaining a dataset for this kind of prediction and analysis.

The following are the steps we performed:

1. Searched for available datasets, then picked and created a suitable dataset.

2. Labeled the attributes.

3. Applied data preprocessing.

4. Used this data as input to the machine learning and data mining algorithms for predicting the movie success rate.

5. Split the data into training and testing sets.

6. Applied the Logistic Regression, K-Nearest Neighbor, and Naïve Bayes classifiers.

7. Computed the results of the algorithms by means of the confusion matrix, accuracy, recall, and precision.

8. Analyzed the effect of various attributes on the success rate of a movie. These attributes include rating, votes, actors, directors, revenue, and metascore.

1.3 Problem Definition

To identify the success rate of a movie based on its rating and revenue, so that production houses can avoid losses and increase their profit.

2 Literature Survey

Darin Im and Minh Thao [1] follow the functional steps of data extraction, data preprocessing, data integration and transformation, feature selection, and finally classification. They also used a movie dataset, as in [2], and, based on an algorithm of their own design, set parameters to classify a movie as a success or a failure. Although their implementation showed a high prediction accuracy, their algorithm suffers from poor time complexity, because the initial data retrieval takes a long time to create a training set for even a few tuples of data. We plan to build on their idea by adding our own algorithm that converts the value of a classifying parameter, such as an actor name, the revenue of a movie, or its rating, to a numerical value. This value is then put into a broader formula that relates all classifying parameters of the test data and decides whether the movie will be successful or not.
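One simple way to turn a string attribute such as a director's name into a number is to give each distinct value an integer code. The sketch below illustrates only that conversion step. The function name `build_encoder` and the sample names are hypothetical, and this is not necessarily the conversion algorithm referred to above.

```python
def build_encoder(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return codes


# Hypothetical director names, used only to show the mapping.
directors = ["Director X", "Director Y", "Director X", "Director Z"]
codes = build_encoder(directors)
encoded = [codes[name] for name in directors]
print(codes)
print(encoded)
```

Integer codes impose an arbitrary ordering on the names, so a distance-based classifier such as K-Nearest Neighbours may need a different encoding (for example one indicator column per name).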

3 Dataset Description

3.1 Introduction

The movie dataset contains the following attributes:

1) Rank – Rank of the movie

2) Title – Title of the movie

3) Genre – Genre of the movie

4) Description – Description of the movie

5) Director – Director of the movie

6) Actors – Actors of the movie

7) Year – Year of the movie

8) Runtime (Minutes) – Runtime of the movie

9) Rating – Rating of the movie

10) Votes – Votes of the movie

11) Revenue (Millions) – Revenue of the movie

12) Metascore – Metascore of the movie

3.1.1 Purpose

The purpose of our project is to predict the success rate of a movie based on attributes such as the actors involved, the directors, the release year, the movie genre, the total runtime, the user rating, the number of votes, the total revenue generated, and the overall metascore.

3.1.2 Project Scope

This project focuses on predicting the movie success rate based on the rating and revenue of movies. The approach gives a reasonable appraisal of success, allowing theatre planning to a certain extent, even for small studios. The prediction of movie success is therefore of great importance to the industry. If the likelihood of a movie's success could be known beforehand, production houses could adjust the release of their movies to gain maximum profit.

3.1.3 Assumptions and Dependencies

We assume that the success of a movie at its box-office opening can be determined from the rating and revenue of the movie.

4 Data Preprocessing and Visualization

4.1 Steps in Data Preprocessing

1. Import the libraries

2. Import the data-set

3. Check for missing values

4. Inspect the categorical values

5. Split the data-set into training and test sets
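The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the code used in the project. The sample rows are hypothetical (only the column names come from Chapter 3), and the 25% test fraction and the helper name `train_test_split` are assumptions.

```python
# Step 1: import the libraries.
import csv
import io
import random

# Hypothetical sample rows; only the column names come from the dataset description.
SAMPLE_CSV = """Title,Genre,Year,Rating,Revenue (Millions),Metascore
Movie A,Action,2014,8.0,300.0,75
Movie B,Comedy,2016,6.0,,50
Movie C,Horror,2016,7.0,100.0,
Movie D,Drama,2015,7.5,50.0,60
Movie E,Action,2012,5.5,200.0,40
Movie F,Drama,2013,6.5,20.0,55
"""

# Step 2: import the data-set.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Step 3: count missing values per column (empty strings in this sketch)
# and keep only the rows that have every field filled in.
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
complete = [r for r in rows if all(v != "" for v in r.values())]

# Step 4: inspect a categorical column and give each category an integer code.
genres = sorted({r["Genre"] for r in complete})
genre_code = {genre: i for i, genre in enumerate(genres)}


# Step 5: split into training and test sets (test fraction is an assumption).
def train_test_split(data, test_fraction=0.25, seed=0):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(complete)
print(missing)
print(genre_code)
print(len(train), "training rows,", len(test), "test rows")
```

Dropping incomplete rows is the simplest way to handle missing values; filling them with a column mean or median is a common alternative when the dataset is small.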

4.2 Visualization

Figure 4.1: Movie with year vs rating

5 Classification

5.1 Logistic Regression

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable, that is, the dependent variable is binary and contains only data coded as 1 or 0. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor variables [4].

The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest and a set of independent (predictor or explanatory) variables. The logistic regression equation is

$$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

Here p is the probability of presence of the characteristic of interest, $X_1, \dots, X_k$ are the predictor variables, and $b_0, \dots, b_k$ are the fitted coefficients.

The logit transformation is defined as the logged odds:

$$\text{odds} = \frac{p}{1-p}, \qquad \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
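The odds and logit definitions can be checked numerically. The sketch below also shows the inverse of the logit (the sigmoid), which turns a linear score back into a probability. The coefficient values and the two input features are hypothetical and are not fitted to the movie dataset.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def logit(p):
    """Logged odds: ln(p / (1 - p))."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of the logit: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, intercept, coefficients):
    """Probability from a linear score: intercept + sum of coefficient * feature."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


p = 0.8
print(odds(p), logit(p), sigmoid(logit(p)))

# Hypothetical intercept and coefficients for two illustrative features.
prob = predict_probability([7.5, 0.6], intercept=-6.0, coefficients=[0.8, 1.0])
label = 1 if prob >= 0.5 else 0
print(prob, label)
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, which is why 0.5 is the usual threshold for assigning the label 1.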

5.2 Naive Bayes Classifier

The Naïve Bayes algorithm is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform competitively with, and sometimes outperform, more sophisticated classification methods.

Formula:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
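A small worked example may make the formula concrete. The counts and likelihood values below are hypothetical and chosen only so the arithmetic is easy to follow; `posterior` and `naive_score` are illustrative helper names.

```python
import math


def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical counts: 100 movies, 30 successful (A), 40 highly rated (B), 24 both.
total, n_a, n_b, n_both = 100, 30, 40, 24
p_a = n_a / total
p_b = n_b / total
p_b_given_a = n_both / n_a
p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(p_a_given_b)


def naive_score(prior, likelihoods):
    """Unnormalised class score: prior times the product of per-feature likelihoods.

    Multiplying the likelihoods is exactly the independence assumption.
    """
    return prior * math.prod(likelihoods)


# Hypothetical priors and per-feature likelihoods for two classes.
score_success = naive_score(0.3, [0.8, 0.5])
score_failure = naive_score(0.7, [0.2, 0.4])
predicted = "success" if score_success > score_failure else "failure"
print(predicted)
```

The posterior agrees with the direct count ratio 24/40, and the classifier picks the class with the larger score without needing to compute P(B).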


5.3 K-Nearest Neighbours

In the classification setting, the K-nearest neighbor algorithm essentially boils down to forming a majority vote among the K instances most similar to a given unseen observation. Similarity is defined according to a distance metric between two data points. A popular choice is the Euclidean distance, given by

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
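The majority vote and the Euclidean distance can be written in a few lines of plain Python. This is a minimal sketch, not the implementation used in the project. The training points and labels are hypothetical, and k = 5 is taken from the value reported in the result analysis.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_x, train_y, query, k=5):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training points; label 1 = success, 0 = failure.
train_x = [(8.1, 7.6), (7.9, 8.0), (7.5, 7.0), (5.2, 4.0), (4.8, 3.5), (6.0, 4.5), (8.4, 8.8)]
train_y = [1, 1, 1, 0, 0, 0, 1]

print(knn_predict(train_x, train_y, (7.8, 7.5), k=5))
print(knn_predict(train_x, train_y, (5.0, 4.0), k=5))
```

Because the distance sums squared differences over all features, features on larger scales (for example votes versus rating) dominate unless the data is scaled first.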

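6 Confusion Matrix

Table 6.1: Confusion matrix of Logistic Regression

                  Predicted negative   Predicted positive
Actual negative   TN=69                FP=6
Actual positive   FN=8                 TP=14

Table 6.2: Confusion matrix of Naive Bayes Classifier

                  Predicted negative   Predicted positive
Actual negative   TN=71                FP=4
Actual positive   FN=16                TP=6

Table 6.3: Confusion matrix of KNN Classifier

                  Predicted negative   Predicted positive
Actual negative   TN=74                FP=1
Actual positive   FN=15                TP=7

In all three tables the positive class is the one with 22 test samples (FN + TP), which is the class for which the precision and recall in Section 6.1 are reported.

6.1 Analyse Confusion Matrix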

6.1.1 Logistic Regression

Accuracy: 85.56%

Precision: 0.7

Recall: 0.636

6.1.2 Naive Bayes Classifier

Accuracy: 79.38%

Precision: 0.6

Recall: 0.272

6.1.3 KNN Classifier

Accuracy: 83.5%

Precision: 0.875

Recall: 0.318
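These figures can be checked directly from the confusion-matrix counts. The sketch below assumes the positive class is the one with 22 test samples, so the true-positive counts are 14 (Logistic Regression), 6 (Naive Bayes), and 7 (KNN). Under that reading, the standard formulas reproduce the reported accuracy, precision, and recall up to rounding in the last digit.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Counts from Tables 6.1 to 6.3, with the 22-sample class taken as positive (assumption).
counts = {
    "Logistic Regression": dict(tp=14, fp=6, fn=8, tn=69),
    "Naive Bayes": dict(tp=6, fp=4, fn=16, tn=71),
    "KNN": dict(tp=7, fp=1, fn=15, tn=74),
}

results = {name: metrics(**c) for name, c in counts.items()}
for name, (accuracy, precision, recall) in results.items():
    print(f"{name}: accuracy={accuracy:.4f} precision={precision:.3f} recall={recall:.3f}")
```

All three classifiers have high accuracy but much lower recall, because the positive class makes up only 22 of the 97 test samples.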

6.2 Compare Classifiers

The accuracies of the three models were close (79.38% to 85.56%). Logistic Regression (85.56%) and K-Nearest Neighbours (83.5%) had the highest accuracy in our case for predicting movie success.

7 Result Analysis

7.1 Logistic Regression

With binary values as input, the Logistic Regression classifier achieves a good accuracy of 85.5 percent. The prediction accuracy is quite high, and the algorithm is very stable when the dataset has more than one independent variable.

7.2 Naive Bayes Classifier

The accuracy of the Naive Bayes classifier is 79.3 percent.

7.3 KNN Classifier

Based on the above results, it can be inferred that the K-Nearest Neighbor classifier with k = 5 has a good accuracy of 83.5 percent.

8 Conclusion and Future Work

A larger training set is the key to improving the performance of the model. Additional features such as geographic location, the age of viewers and voters, and current trends should also be considered. News analysis, movie plot analysis, and social network data analysis could be carried out, and the information thus obtained could be added to the training set. Google Trends results could also be used to improve the predictions.

A Glossary

Defines terms, acronyms, and abbreviations used in the FRD

Annex A

Define terms, acronyms, and abbreviations used in the FRD

Annex B

Define terms, acronyms, and abbreviations used in the FRD

Bibliography

1Darin Im, Minh Thao, Dang Nguyen, Predicting Movie Success in the U.S. market, Dept.Elect.Eng, Stanford Univ., California, December,2011

2Haiyi Zhang, Di Li Jodrey School of Computer Science Acadia University, Canada, NaÃ¯ve Bayes Text Classier (2007)

16