Our next-door neighbor
We can often relate to our neighbors. They tend to have things in common with us. That’s partly a consequence of them living next door. They have a similar house, probably a similar income, and maybe a similar car. And sometimes we have even more in common, like sports clubs, schools the kids go to, and hobbies. Still, there is no uniformity. The odds are good that my neighbor’s neighbors are more different from me than my own neighbors, and the people at the end of the street are even more different.
Neighbors across the street
It can be more difficult to say something meaningful about the neighbors across the street or two streets over. Sometimes they live in different kinds of houses, which means they probably have different incomes and different lives, but there may still be similarities. Even when the houses are similarly sized, though, the similarities decrease as the distance increases.
This means that the characteristics of the people living in a random house can be estimated by looking at which part of town they live in. The characteristics of the residents are then compared to those of the nearest house whose characteristics we do know: the nearest neighbor.
Pattern recognition using nearest neighbor
Nearest neighbor is mostly used to recognize patterns. It's essentially clustering combined with a distance measurement that borrows elements from linear regression. To illustrate this, we'll first work through the simple example of the suburb, and then make the connection with image pattern recognition.
The problem with nearest neighbor
The key question in the introduction was how to describe the neighbor’s characteristics using those of the residents of a random, different house. We assume that the similarities between the residents of the different houses increase as the distance between them decreases. Common sense tells us that we can’t accurately describe every single characteristic, but it should certainly be possible to sort them into groups.
Depending on which group they are sorted into, we know the characteristics of the residents. Essentially, this means we’re clustering the residents of the suburb based on certain properties. However, we have to determine the groups based on the properties of a select few residents, sometimes based on data from other suburbs or neighborhoods.
Which factors are relevant?
The next step is determining the nearest neighbor of every house. In other words: which house is closest? The question becomes how to measure the distance and which factors are relevant. Is the measurable distance the only factor, or do other factors, like the street, play a role too? The term "distance" is thus open to some interpretation. The measurable distance is determined with the same method used in linear regression, and is supplemented with other factors when necessary.
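The measurable distance mentioned here is the ordinary straight-line (Euclidean) distance. As a minimal sketch, with invented coordinates for two houses on a street map:

```python
from math import sqrt

def euclidean_distance(p, q):
    """Straight-line ("ruler") distance between two points,
    in any number of dimensions."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical house positions, in meters.
house_1 = (0.0, 0.0)
house_2 = (30.0, 40.0)

print(euclidean_distance(house_1, house_2))  # 50.0
```

The same function works unchanged in three or more dimensions, which matters later when we discuss medical images.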
Schematizing nearest neighbor
This problem is easily schematized. Assume that we know the properties of several points (1 through 5). Each of these points falls into a certain group (grey scale) of indeterminate size based on its properties. The division of the (known) points across the groups is:
- Grey for points 1 and 2
- White for point 3
- Black for points 4 and 5
The three points a, b, and c have to be sorted into one of the three groups. Looking at this scheme, it makes sense to put point a in the white group and point b in the grey group. Point c seems to be a problem because it's equidistant from points 1, 3, and 5. Our gut tells us it should still be possible to place point c once a and b have been placed. The question is how such an intuitive solution can be turned into a formal one.
Summed up, solving this problem means finding the answers to two questions:
- On which properties is the division into groups based?
- How do we determine which group a random new element belongs to?
Note that the answers strongly depend on the data used and the desired properties. Also, this technique requires a priori information: something has to be known about the form of the eventual solution.
Solutions for nearest neighbor
The answer to the first question is a division of the five households into groups based on their specific properties. After clustering, the five points with known properties have been divided into three groups.
In this example, distance is simply the measurable distance in meters, as used in linear regression: the distance we could measure with a ruler. By measuring the distance between the unsorted points (a, b, and c) and the known points (1 through 5), it's possible to sort the unsorted points into groups. Based on the found distances, point a is closest to the white point and point b to the grey point. Point c is impossible to sort because it's equidistant from three separate known points (1, 3, and 5). There are two options regarding point c:
- The status stays undefined.
- Calculate the status again, this time also counting the distances to points a and b (which have already been sorted).
Which choice to make depends on the application and the data, and has to be decided on a case-by-case basis.
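The whole scheme fits in a few lines of code. The coordinates below are invented, but chosen so the distances match the example: points 1, 3, and 5 all lie exactly 5 units from point c, which triggers the tie.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical coordinates; group labels follow the example:
# grey for points 1 and 2, white for point 3, black for points 4 and 5.
known = [
    ((-3.0,  4.0), "grey"),   # point 1
    ((-4.0,  5.0), "grey"),   # point 2
    (( 5.0,  0.0), "white"),  # point 3
    ((-4.0, -5.0), "black"),  # point 4
    ((-3.0, -4.0), "black"),  # point 5
]

def classify(point):
    """Return the group of the nearest known point, or None when the
    nearest known points are equidistant but belong to different groups."""
    distances = sorted((dist(point, p), group) for p, group in known)
    closest = distances[0][0]
    groups = {group for d, group in distances if d == closest}
    return groups.pop() if len(groups) == 1 else None

print(classify((6.0, 1.0)))    # point a -> 'white'
print(classify((-4.0, 4.0)))   # point b -> 'grey'
print(classify((0.0, 0.0)))    # point c -> None (equidistant from 1, 3, 5)
```

The second option from the list, recalculating point c's status after a and b are sorted, amounts to appending a and b (with their newly found groups) to `known` and calling `classify` again.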
One of the best qualities of this method is the flexibility of the distance measurement. It's possible to take a shape or pattern into account when measuring distance. This is especially useful when processing image information, like facial recognition in security footage or detecting organs and bones in medical images. In the latter case, nearest neighbor translates easily into multiple dimensions.
The limitations of nearest neighbor
Nearest neighbor is a powerful pattern recognition technique, given the right settings. And because nearest neighbor is essentially a combination of two simple techniques, we're talking about the right settings for two techniques at the same time. In other words, the settings for clustering and for measuring distance both have to be correct. That means the method is very dependent on the quality of the a priori knowledge of the situation, and on the use of that knowledge.
Let’s say a company wants to sell more cars from the lower middle-class segment. To find potential customers in a certain suburb, the sales department wants to use nearest neighbor. As a starting point they use the customer file of a nearby suburb and assume that neighbors often have the same kind of car. This means that the following settings are used:
- Clustering: the customers who own a car from the lower middle-class segment.
- Distance: a house counts as near when it's no more than two houses away from a known house.
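These two settings can be sketched in code. The street layout and the house numbers of the known customers below are invented for illustration; "distance" here is measured in house positions along the street, not in meters:

```python
# Hypothetical street: houses numbered 1 through 20. The known customers
# (owners of a lower middle-class car) come from the nearby suburb's file.
known_customers = {3, 7, 15}

# A house qualifies when it stands no more than two houses
# from a known customer (and isn't a known customer itself).
candidates = {
    house
    for house in range(1, 21)
    for customer in known_customers
    if 0 < abs(house - customer) <= 2
}

print(sorted(candidates))  # [1, 2, 4, 5, 6, 8, 9, 13, 14, 16, 17]
```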
This analysis provides a series of addresses where the company sends invitations to visit showrooms, along with an enticing offer. To their surprise, the response is minimal.
Some obvious mistakes
Although the settings of the analysis seem good at first glance, there are some mistakes:
- Mistake 1: The motivation for the campaign is expanding the number of customers in the suburb in question. By definition, then, there are few existing customers in that suburb, so the coverage isn't sufficient: there aren't enough known points.
- Mistake 2: Neighboring houses don't necessarily share the same characteristics (consider the neighbors across the street).
- Mistake 3: The chosen distance measurement, combined with the limited coverage of the suburb, isn't sufficient to reach every possible customer.
This means that a successful campaign needs to incorporate adjustments, such as:
- Adjustment 1: Increasing the number of known points, which improves the clustering.
- Adjustment 2: Altering the distance measurement to limit the definition of nearest neighbor to next-door neighbors.
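Applied to the earlier sketch, the adjustments look like this (again with invented house numbers):

```python
# Adjustment 1 (hypothetical): a larger customer file means more known points.
known_customers = {3, 5, 7, 12, 15, 18}

# Adjustment 2: the stricter distance rule, next-door neighbors only.
candidates = {
    house
    for house in range(1, 21)
    for customer in known_customers
    if abs(house - customer) == 1
}

print(sorted(candidates))  # [2, 4, 6, 8, 11, 13, 14, 16, 17, 19]
```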
A successful campaign using nearest neighbor
After incorporating the adjustments, the campaign proves to be a success: the cluster analysis of the known information is better, and the distance definition more closely matches the concept of a neighbor. Further improvements can be made by incorporating more information into the clustering, such as brand loyalty.
Nearest neighbor versus neural networks
Given that its most important application is pattern recognition, nearest neighbor may seem like a competitor of the neural network. But despite the overlap in application, they're two totally different techniques.
- A neural network determines a certain connection based on a data set, which produces an answer along with a probability.
- Nearest neighbor estimates the properties of an unknown element by using the distance function.
Both methods employ a priori knowledge: the neural network to learn, the nearest neighbor for clustering. An important difference is that the neural network behaves as a black box, meaning the user has little to no influence on the process. Meanwhile, a wrong choice by the user in nearest neighbor has clear consequences. Which method is most suitable depends on the situation, and this is, again, the user’s choice.
Want to know more about nearest neighbor techniques?
The specialists of Passionned Group know the pros and cons of nearest neighbor and other predictive algorithms like no one else. Make a free appointment with our predictive analytics experts and learn what AI, machine learning, and Big Data can mean for your organization.