Analytics with K Nearest Neighbor Classification

23:46 Amoeba Technologies

K Nearest Neighbor Classification

K Nearest Neighbor Classification is a pattern recognition algorithm. It is a non-parametric method that can be used for both classification and regression. In both cases, the input consists of the K training examples closest to the new observation.

We can treat each characteristic in our data set as a different dimension in some space, and take the value an observation has for that characteristic as its coordinate in that dimension. This gives us a set of points in space. The similarity of two points is then the distance between them in this space under some appropriate metric. To decide which points from the training set are similar enough to count when predicting the class of a new observation, the algorithm picks the k data points closest to the new observation and takes the most common class among them. Hence the name k Nearest Neighbors.

The Algorithm

The algorithm can be summarized as:
1.    A positive integer k is specified, along with a new sample.
2.    We select the k entries in our database which are closest to the new sample.
3.    We find the most common classification of these entries.
4.    This is the classification we give to the new sample.
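
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `knn_classify`, the choice of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """Predict a label for `query` from `training`, a list of (features, label) pairs."""
    # Step 2: order the training entries by Euclidean distance to the new sample.
    by_distance = sorted(training, key=lambda entry: math.dist(entry[0], query))
    # Keep only the labels of the k closest entries.
    nearest_labels = [label for _, label in by_distance[:k]]
    # Steps 3 and 4: the most common label among them is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data, invented purely for illustration.
toy = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
    ((5.5, 6.5), "B"),
]
prediction = knn_classify(toy, (1.2, 1.4), k=3)
print(prediction)
```

With k=3 the three closest entries are two "A" points and one "B" point, so the sketch predicts "A". An odd k is a common way to avoid ties when there are two classes.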

Closeness can be measured with various distance metrics, so the metric we choose affects the final result.
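
To see how the choice of metric can change the outcome, here is a small sketch comparing Euclidean and Manhattan distance. Manhattan distance is used only as one example of an alternative metric, and the two candidate points are made up for the illustration.

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of the absolute differences.
    return sum(abs(a - b) for a, b in zip(p, q))

query = (0, 0)
candidates = {"a": (3, 3), "b": (0, 5)}  # hypothetical points

nearest_euclidean = min(candidates, key=lambda name: euclidean(candidates[name], query))
nearest_manhattan = min(candidates, key=lambda name: manhattan(candidates[name], query))
print(nearest_euclidean, nearest_manhattan)
```

Under Euclidean distance point "a" is closer (about 4.24 versus 5), while under Manhattan distance point "b" is closer (5 versus 6), so a 1-nearest-neighbor prediction could differ between the two metrics.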

Example:

Let us consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target.

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set with Default=Y.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y
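
The distance above can be checked with a short Python sketch. Only the two points quoted in the calculation are used; the variable names are chosen for this example.

```python
import math

unknown = (48, 142000)    # (Age, Loan) of the case to classify
neighbor = (33, 150000)   # (Age, Loan) of the nearest training case, Default=Y

d = math.dist(unknown, neighbor)
print(round(d, 2))

# Fraction of the squared distance contributed by the Age difference.
age_share = (unknown[0] - neighbor[0]) ** 2 / d ** 2
print(age_share)
```

The Age difference contributes only a tiny fraction of the squared distance here, because Loan is measured on a much larger scale. With raw values like these, the Loan variable effectively decides which neighbors are closest.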

With K=3, there are two Default=Y and one Default=N among the three closest neighbors, so the prediction for the unknown case is again Default=Y.

The K-Nearest Neighbor algorithm has a known weakness: with a high number of dimensions and a low number of training samples, the "nearest" neighbor might be very far away, and in high dimensions "nearest" becomes meaningless. On the other hand, it is an easy algorithm to understand, and missing values can be handled effectively by restricting the distance calculation to the subspace of dimensions that are actually observed.
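
One possible reading of "restrict distance calculation to subspace" is to compute the distance over only those dimensions where both points have a value. The sketch below assumes missing values are represented as `None`; the function name `subspace_distance` is hypothetical.

```python
import math

def subspace_distance(p, q):
    """Euclidean distance over the coordinates that are present in both points."""
    # Keep only the dimensions where neither point has a missing value.
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not pairs:
        raise ValueError("the points share no observed dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

# The second coordinate of the first point is missing, so only
# dimensions 0 and 2 enter the calculation.
d = subspace_distance((3.0, None, 4.0), (0.0, 7.0, 0.0))
print(d)
```

Distances computed over different numbers of dimensions are not directly comparable, so a practical version would likely need some form of rescaling. That choice is left open here.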
