### Interim Report: Data Analytics Using R (Clustering)

Submitted to: Prof. R.K. Jena

Submitted by: Group 4 (ABCD)

Pragya Mishra, 2017111033

Abhinav Jain, 2017132131

Abhiroop Dey Sarkar, 2017133004

Kaira Dhanda, 2017141084

1. CLUSTERING:

Introduction

Clustering is a broad set of techniques for finding subgroups of observations within a data set. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Because there is no response variable, this is an unsupervised method: it seeks relationships between the n observations without being trained on a response variable. Clustering allows us to identify which observations are alike and, potentially, to categorize them accordingly.

Clusters have the following properties:

They are discovered during the analysis, and their number is not always fixed in advance.

They are groups of objects with similar characteristics.

Clustering is one of the most widespread descriptive methods of data analysis and data mining. We use it when data volume is large to find homogeneous subsets that we can process and analyze in different ways.

For example, a food product manufacturing company can categorize its customers on the basis of purchased items and cost of those items.

Applications of Clustering

Following are the main Clustering applications:

Marketing – In this field, clustering is useful for finding the customer profiles that make up the customer base. After detecting clusters, a business can develop a specific strategy for each cluster. We can also use clusters to keep track of customers over several months and detect how many customers moved from one cluster to another.

Retail – In the retail industry, we use clustering to divide all the stores of a particular company into groups of establishments on the basis of type of customer, turnover etc.

Medical Science – In medicine, we use clustering to discover groups of patients suitable for particular treatment protocols. Each group comprises all patients who react in the same way. These groups are formed on the basis of age, type of disease etc. We can also use clustering in the classification of protein sequences, CT scans etc.

Sociology – We use clustering to perform data mining operations here. We divide the population into groups of individuals who are homogeneous in terms of social demographics, lifestyle, expectations etc. We can then use the categorization for purposes such as polls, identifying criminals etc.

Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:

Rows are observations (individuals) and columns are variables

Any missing value in the data must be removed or estimated.

The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one.
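The preparation steps above can be sketched in base R as follows. This is only an illustration under an assumption: it uses the built-in `USArrests` data set as a stand-in, not the data analysed in this report.

```r
# Illustrative only: USArrests is a built-in R data set, not the report's data.
df <- USArrests            # rows are observations, columns are numeric variables

df <- na.omit(df)          # drop any rows that contain missing values
df_scaled <- scale(df)     # center each column to mean 0 and scale to sd 1

round(colMeans(df_scaled), 10)   # column means after scaling
apply(df_scaled, 2, sd)          # column standard deviations after scaling
```

Removing rows with `na.omit()` is the simplest option; estimating (imputing) missing values is the alternative mentioned above and is not shown here.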

Clustering Distance Measures

The classification of observations into groups requires a method for computing the distance or (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. There are many ways to calculate this distance information, and the choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated, and it influences the shape of the clusters.

In R, the dist() function uses the Euclidean distance by default to measure the dissimilarity between each pair of observations.

Euclidean Distance – It is the most common method used. It is the geometric measure of distance between objects in a multidimensional space.
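As a small illustration of the Euclidean distance, the sketch below computes it by hand and with `dist()` for two points. The coordinates are made up for the example.

```r
# Two made-up points in 3-dimensional space (illustrative values only).
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Euclidean distance written out: square root of the sum of squared differences.
d_manual <- sqrt(sum((x - y)^2))          # sqrt(9 + 16 + 0) = 5

# The same distance from dist(), whose default method is "euclidean".
d_dist <- as.numeric(dist(rbind(x, y)))

c(manual = d_manual, dist = d_dist)
```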

The total sum of squares, or inertia, is the weighted mean of the squares of the distances of each point from the center of gravity of the whole data set. It can be calculated by adding the between-cluster sum of squares to the within-cluster sum of squares:

Total Sum of Squares (I) = Between-Cluster Sum of Squares (IR) + Within-Cluster Sum of Squares (IA)

The between-cluster sum of squares is calculated by taking the squared distance between the center of gravity of each cluster and the overall center of gravity, weighting it by the size of the cluster, and adding these terms over all clusters. As it increases, the separation between clusters also increases, indicating satisfactory clustering.

The within-cluster sum of squares is calculated by taking the squared distance of each point from the center of gravity of its own cluster and adding these terms, first within each cluster and then over all clusters. As it diminishes, the clusters become more homogeneous and the clustering improves.
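The decomposition can be written out explicitly. The notation here is an assumption introduced for illustration: x_i are the n observations, g is the overall center of gravity, C_k is the kth of K clusters, n_k is its size and g_k is its center of gravity. The terms are written as plain sums; dividing every term by n gives the mean form.

$$
I = \sum_{i=1}^{n} \lVert x_i - g \rVert^2, \qquad
I_R = \sum_{k=1}^{K} n_k \lVert g_k - g \rVert^2, \qquad
I_A = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - g_k \rVert^2, \qquad
I = I_R + I_A
$$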

R² (RSQ) is the proportion of the sum of squares explained by the clusters (between-cluster sum of squares / total sum of squares). The nearer it is to 1, the better the clustering, but we should not aim to maximize it at all costs, because this would lead to the largest possible number of clusters: one cluster per individual. So we need an R² that is close to 1 without too many clusters. A good rule is that, if the last significant rise in R² occurs when we move from k to k + 1 clusters, the partition into k + 1 clusters is the appropriate one.
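A minimal sketch of how R² could be inspected for several values of k, using the `totss`, `betweenss` and `tot.withinss` components returned by `kmeans()`. The built-in `USArrests` data and the range of k are assumptions chosen for illustration, not the report's data.

```r
# Illustrative data: built-in USArrests, standardized (not the report's data).
x <- scale(USArrests)

set.seed(123)   # k-means starts from random centers, so fix the seed
ks <- 2:6       # candidate numbers of clusters (R² is 0 for a single cluster)

r2 <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  km$betweenss / km$totss   # proportion of the sum of squares explained
})

# R² for each k and the gain over the previous k.
data.frame(k = ks, R2 = round(r2, 3), gain = round(c(NA, diff(r2)), 3))

# Check the decomposition: total = between + within.
km3 <- kmeans(x, centers = 3, nstart = 25)
decomposition_ok <- isTRUE(all.equal(km3$totss, km3$betweenss + km3$tot.withinss))
decomposition_ok
```

Reading down the `gain` column shows where the rise in R² stops being substantial, which is the rule of thumb described above.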

The properties of Efficient Clustering:

Detection of the structures present in the data

Easy determination of optimal number of clusters

Yielding of clearly differentiated clusters

Yielding of clusters that remain stable with minor changes in data

Processing of large data volumes efficiently

Handling of all types of variables if required

2. CLUSTERING TECHNIQUES:

A. K-Means Clustering

K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. Several k-means algorithms are available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

$$
W = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

where x_i is a data point belonging to cluster C_k and μ_k is the mean of the points assigned to C_k.

K-means algorithm:

1. Specify the number of clusters (k) to be created (chosen by the analyst).

2. Randomly select k objects from the data set as the initial cluster centers (means).

3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.

4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in that cluster; p is the number of variables.

5. Iteratively minimize the total within-cluster sum of squares. That is, repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, R's kmeans() uses 10 as the maximum number of iterations.
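The steps above can be sketched directly in R. This is a teaching sketch of the plain assign-and-update iteration, not the Hartigan-Wong algorithm that `kmeans()` uses by default. The function name `simple_kmeans` and the use of the built-in `iris` measurements are assumptions made for illustration.

```r
# Hypothetical helper: a plain assign/update k-means loop for illustration.
simple_kmeans <- function(x, k, iter_max = 10) {
  x <- as.matrix(x)
  # Step 2: pick k distinct observations as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(iter_max)) {
    # Step 3: squared Euclidean distance from every point to every center,
    # then assign each point to the nearest center.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")

    # Step 5: stop once the assignments no longer change.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Step 4: recompute each centroid as the mean of its assigned points.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Usage on standardized built-in data (Step 1: the analyst chooses k = 3).
set.seed(42)
x <- scale(iris[, 1:4])
fit <- simple_kmeans(x, k = 3)
table(fit$cluster)
```

In practice, `kmeans(x, centers = 3, nstart = 25)` is preferable, since it tries several random starts and keeps the best solution.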

K-means clustering is a very simple and fast algorithm. Furthermore, it can efficiently deal with very large data sets. However, there are some weaknesses of the k-means approach. One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters.

B. Hierarchical Clustering

Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in the dataset. It does not require us to pre-specify the number of clusters to be generated as is required by the k-means approach. Furthermore, hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.

Hierarchical Clustering Algorithms

Hierarchical clustering can be divided into two main types: agglomerative and divisive.

Agglomerative clustering: It’s also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of just one single big cluster (root) (see figure below). The result is a tree which can be plotted as a dendrogram.

Divisive hierarchical clustering: It’s also known as DIANA (Divisive Analysis) and it works in a top-down manner. The algorithm proceeds in the inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each iteration, the most heterogeneous cluster is divided into two. The process is repeated until every object is in its own cluster (see figure below).

Agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at identifying large clusters.

How do we measure the dissimilarity between two clusters of observations? A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods are:

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.

Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.

We can see the differences between these approaches in the following dendrograms:

[Figure: dendrograms for the linkage methods above]
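A minimal sketch of how the linkage methods could be compared with base R's `hclust()`. The built-in `USArrests` data and the cut into 3 groups are assumptions for illustration. Centroid linkage is also available as `method = "centroid"` but is left out of this sketch.

```r
# Illustrative data: built-in USArrests, standardized (not the report's data).
x <- scale(USArrests)
d <- dist(x)   # Euclidean distance matrix

# hclust() method names for four of the linkages discussed above.
methods <- c(complete = "complete", single = "single",
             average = "average", ward = "ward.D2")
fits <- lapply(methods, function(m) hclust(d, method = m))

# Cut every tree into 3 groups and compare the resulting cluster sizes.
sizes <- sapply(fits, function(h) as.vector(table(cutree(h, k = 3))))
sizes

# Dendrogram for one linkage; repeat with other elements of `fits` to compare.
plot(fits$complete, cex = 0.6, main = "Complete linkage")
```

Comparing the columns of `sizes` gives a quick impression of how differently the linkage rules partition the same observations.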

Case Study Implementation:

i. Data Set Information:

This data set contains a list of mammals and the percentage composition of the constituents of their milk: water, protein, fat, lactose and ash.

Attribute information:

Name: the name of the animal (categorical).

Water: the percentage of water present in the milk.

Protein: the percentage of protein present in the milk.

Fat: the percentage of fat present in the milk.

Lactose: the percentage of lactose present in the milk.

Ash: the percentage of ash present in the milk.

ii. Objective of the study- The objective is to apply clustering algorithms and divide the mammals into clusters based on the protein and fat composition of their milk, using different methods.

iii. The R code is attached with the mail.

iv. Result Analysis and performance comparisons-

RESULT- According to the majority rule, the best number of clusters is 3.

Using k-means, the first cluster consists of 13 mammals, the second cluster of 9 mammals and the third cluster of 3 mammals, on the basis of the percentage composition of fat and protein present in their milk.

The centroids of the clusters (in standardized units) are:

| Cluster | Protein | Fat |
|---|---|---|
| 1 | -0.8414622 | -0.5408944619 |
| 2 | 0.8626544 | -0.0008532224 |
| 3 | 1.0583733 | 2.3464356686 |

The cluster plot obtained is shown here (fig-1).

Using hierarchical clustering, the mammals are again divided into three clusters: the first cluster consists of 18 mammals, the second cluster of 6 mammals and the third cluster of 3 mammals, on the basis of the percentage composition of fat and protein present in their milk.

The graphical representation of the same is shown in the dendrogram (fig-2).

Performance Comparisons-

Both methods of clustering give the same number of clusters (3), but the assignment of mammals to clusters differs somewhat. Thus, both clustering techniques perform well on this data set.
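One way such a comparison could be made is to cross-tabulate the two sets of cluster labels. The sketch below is not the code attached to this report. The helper name `cluster_milk`, the Ward linkage and the synthetic stand-in data are all assumptions made so that the example runs on its own.

```r
# Hypothetical helper: `milk` is assumed to be a data frame with numeric
# columns named Protein and Fat (names taken from the attribute list).
cluster_milk <- function(milk, k = 3) {
  x <- scale(milk[, c("Protein", "Fat")])      # standardize both variables
  km <- kmeans(x, centers = k, nstart = 25)    # partitioning approach
  hc <- hclust(dist(x), method = "ward.D2")    # linkage choice is an assumption
  list(kmeans = km$cluster,
       hclust = cutree(hc, k = k),
       centers = km$centers)
}

# Synthetic stand-in data so the sketch runs; these are NOT the report's values.
set.seed(1)
milk <- data.frame(Protein = runif(25, 1, 12), Fat = runif(25, 1, 45))

res <- cluster_milk(milk)

# Cross-tabulate the two partitions to see where they agree and differ.
table(kmeans = res$kmeans, hclust = res$hclust)
```

Cluster labels are arbitrary, so agreement shows up as each row of the table being concentrated in a single column rather than on the diagonal.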

[Fig-1 (Plot-1): cluster plot from k-means clustering]

[Fig-2: dendrogram from hierarchical clustering]