
[CS231n] Assignment 1 - KNN

jiwon152 2023. 5. 3. 14:57

- KNN (k-nearest neighbor)

Hyperparameters: k, and the choice of distance metric (L1 or L2).

Basically, you are trying to figure out which class a given point belongs to. You do this by computing the distance between the test point and every training point, taking the k nearest training points, and predicting whichever label is the majority among them.
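In the assignment this ends up being used roughly as follows. This is only a sketch: the KNearestNeighbor class, its train/predict methods, and the import path are recalled from the assignment code, and X_train, y_train, X_test, y_test are the arrays loaded in the notebook, so treat the exact names and signatures as approximate.

import numpy as np
from cs231n.classifiers import KNearestNeighbor  # import path as I recall it from the assignment

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)             # k-NN "training" just memorizes the training data
y_test_pred = classifier.predict(X_test, k=5)  # label each test point by its 5 nearest neighbors
accuracy = np.mean(y_test_pred == y_test)      # fraction of correctly predicted test labels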

 

There are two common ways to measure that distance: L1 and L2.
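As a quick illustration of the difference between the two, here is a small sketch with made-up vectors (not from the assignment):

import numpy as np

# two made-up feature vectors, just to compare the metrics
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 0.0, 3.0, 8.0])

l1 = np.sum(np.abs(a - b))          # L1 (Manhattan): sum of absolute differences -> 7.0
l2 = np.sqrt(np.sum((a - b) ** 2))  # L2 (Euclidean): square root of the sum of squares -> about 4.58

print(l1, l2)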

In the assignment, you implement the functions compute_distances_two_loops and predict_labels. The skeleton starts by setting up the distance matrix:

num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))

1. compute_distances_two_loops

In compute_distances_two_loops, you use the L2 distance: subtract self.X_train[j] from X[i], square the result element-wise, sum it, and take the square root.

dists[i][j] = np.sqrt(np.sum( (X[i]- self.X_train[j]) **2 ))
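For context, here is a sketch of how that line sits inside the full function; the surrounding loop structure follows the assignment skeleton as I remember it, so the details may differ slightly from your copy.

def compute_distances_two_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # L2 distance between the i-th test image and the j-th training image
            dists[i][j] = np.sqrt(np.sum((X[i] - self.X_train[j]) ** 2))
    return dists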

If you don't implement the predict_labels function, you get only 11.4% accuracy.

2. predict_labels

 closest_y = dists[i].argsort()[:k]
 
 # get the most frequent one in k closest dists.
 unique, counts = np.unique(closest_y, return_counts=True)
 index = np.argmax(counts)
 y_pred[i] = unique[index]

First, I made the mistake of saving the values of dists in closest_y. Then, of course, the accuracy was 0. So I tried printing the values in the ipynb file with

 print(y_test_pred)

Then the result looked like this.

You have to remember that what you save has to be label values (the kind of values in y_train), not distances. That is why you need the indices, and that is why argsort is used.

 

Also, you have to slice the sorted result with [:k], because you only want the k nearest neighbors. Then you look for the most frequent y value within closest_y. Done correctly, it looks like this:

num_test = dists.shape[0]
y_pred = np.zeros(num_test)
for i in range(num_test):
    # after argsort, the index of the smallest dist (closest to zero) comes first;
    # use those indices to grab the corresponding labels from y_train
    closest_y = self.y_train[dists[i].argsort()][:k]

    # pick the most frequent label among the k closest neighbors
    unique, counts = np.unique(closest_y, return_counts=True)
    index = np.argmax(counts)
    y_pred[i] = unique[index]

When this code is run from the ipynb file, it produces the desired result.
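As a small aside, this is how the majority vote behaves on a toy example (the labels below are made up). Because np.unique returns the labels in sorted order, np.argmax breaks ties in favor of the smaller label.

import numpy as np

closest_y = np.array([3, 1, 3, 1, 7])      # made-up labels of the 5 nearest neighbors
unique, counts = np.unique(closest_y, return_counts=True)
print(unique)                              # [1 3 7]
print(counts)                              # [2 2 1]
print(unique[np.argmax(counts)])           # 1 -> tie between 1 and 3, the smaller label wins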

3. Why the one-loop version didn't work (a.k.a. why the distance matrices were different)

 

            dists[i, :] = np.sqrt(np.sum( (X[i] - self.X_train) ** 2 ))

This code runs, but it is faulty. How do you know? Run the check in the notebook on Colab and it will tell you that the distance matrices are different, whereas they should be the same: only the implementation differs, and that should not affect the distance matrix.
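If I remember correctly, the notebook's check compares the two distance matrices through the Frobenius norm of their difference; a rough sketch (the variable names dists, dists_one, and classifier are assumptions here):

dists_one = classifier.compute_distances_one_loop(X_test)

# Frobenius norm of the element-wise difference; it should be (close to) zero
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('One loop difference was: %f' % difference)
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')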

So I set out to figure out what the problem with my code was. I ran some code with small NumPy arrays to visually see what was going on.

 

import numpy as np

# List1 = [[1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 7]]
List1 = [[1, 2, 3, 4, 5, 6]]
List2 = [[1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7, 8], [1,3,5,7,9,11]]

arr1 = np.array(List1)
arr2 = np.array(List2)

print(arr1 - arr2)

print(arr1**2)

print((arr1 - arr2)**2)

print(np.sum((arr1 - arr2)**2))

print(np.sqrt(np.sum((arr1 - arr2)**2)))

The results looked like this. I needed a distance matrix of shape (500, 5000), but all I was getting was (500, 1).

I needed to sum the elements of each row, but I was summing over the entire array.

Assignment code | Toy example
X[i] : (1, 3072) and self.X_train : (5000, 3072) | arr1 : (1, 6) and arr2 : (3, 6)
X[i] - self.X_train : (5000, 3072) | arr1 - arr2 : (3, 6)
(X[i] - self.X_train) ** 2 : (5000, 3072) | (arr1 - arr2) ** 2 : (3, 6)
np.sum((X[i] - self.X_train) ** 2) : a single scalar | np.sum((arr1 - arr2) ** 2) : a single scalar, want (1, 3)
np.sqrt(np.sum((X[i] - self.X_train) ** 2)) : a single scalar | np.sqrt(np.sum((arr1 - arr2) ** 2)) : a single scalar, want (1, 3)
desired : (1, 5000) | desired : (1, 3)

 

So, I need to compute the sum of the elements in each row and save the sums so that each row's sum occupies its own entry.

 

 dists[i, :] = np.sqrt(np.sum( (X[i] - self.X_train) ** 2 , axis = 1))

With axis=1, np.sum adds up each row and returns an array of row sums (one value per training image), which after the square root fills dists[i, :].
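A tiny example makes the axis behavior concrete (the array below is made up):

import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))          # 21 -> no axis: everything collapses into one scalar
print(np.sum(m, axis=1))  # [ 6 15] -> one sum per row, which is what we want here
print(np.sum(m, axis=0))  # [5 7 9] -> one sum per column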

4. compute_distances_no_loops

The main problem with the no-loop version is that the arrays no longer broadcast once the shapes are not (1, 3072) against (5000, 3072): the full (500, 3072) test array and the (5000, 3072) training array cannot simply be subtracted. So we need to expand the squared distance as (a - b)^2 = a^2 - 2ab + b^2 and build the distance matrix out of a matrix multiplication plus broadcast sums.

import numpy as np

List1 = [[0,0,0,0,0,0],[1,1,1,1,1,1]]
List2 = [[1, 2, 3, 4, 5, 6], [2,3,4,5,6,7], [3, 4, 5, 6, 7, 8]]

arr1 = np.array(List1)
arr2 = np.array(List2)

print(arr1[0] - arr2)
print(arr1[1] - arr2)
print("\n")

print((arr1[0] - arr2)**2)
print((arr1[1] - arr2)**2)
print("\n")

print(np.sum((arr1[0] - arr2)**2, axis = 1))
print(np.sum((arr1[1] - arr2)**2, axis = 1))
print("\n")

print(np.sqrt(np.sum((arr1[0] - arr2)**2, axis = 1)))
print(np.sqrt(np.sum((arr1[1] - arr2)**2, axis = 1)))
print("\n")


print(arr1 ** 2)
print(arr2 ** 2)
print("\n")

print(np.sum(np.square(arr1), axis = 1))
print(np.sum(np.square(arr1), axis = 1).reshape(arr1.shape[0], 1))
print("\n")

A_2 = np.sum(np.square(arr1), axis = 1)
B_2 = np.sum(np.square(arr2), axis = 1)
print(A_2)
print(B_2)
print("\n")

dot_product = np.dot(arr1, np.transpose(arr2))
print(dot_product)
print("\n")

dists = np.zeros((2, 3))
print(A_2.reshape(2, 1) + B_2 -2*dot_product)
print(dists - 2*dot_product)
print(dists - 2*dot_product + B_2)

print("\n")
print(-dot_product + B_2)
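Putting those pieces together, here is a minimal sketch of a fully vectorized compute_distances_no_loops, using the expansion (a - b)^2 = a^2 - 2ab + b^2. This is my own reconstruction under the same shape assumptions as before, not the assignment's reference solution.

def compute_distances_no_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]

    # squared L2 norm of every test row and every training row
    test_sq = np.sum(np.square(X), axis=1).reshape(num_test, 1)  # (num_test, 1)
    train_sq = np.sum(np.square(self.X_train), axis=1)           # (num_train,)

    # all cross terms at once with a single matrix multiplication
    cross = X.dot(self.X_train.T)                                # (num_test, num_train)

    # broadcasting: (num_test, 1) - (num_test, num_train) + (num_train,)
    dists = np.sqrt(test_sq - 2 * cross + train_sq)
    return dists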

 

 

 

- Cross Validation

For example, in 5-fold cross-validation, we would split the training data into 5 equal folds, use 4 of them for training, and 1 for validation. We would then iterate over which fold is the validation fold, evaluate the performance, and finally average the performance across the different folds.
"https://cs231n.github.io/classification/"
import numpy as np
num_folds = 5
test_arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
X_train_folds = []
X_train_folds = np.array_split(test_arr, num_folds)
print(X_train_folds)

for i in range(num_folds):
    # fold i acts as the validation fold; the remaining folds are concatenated as the training set
    print(X_train_folds[i])
    print(np.concatenate(X_train_folds[:i] + X_train_folds[i+1:]))
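Building on this, here is a sketch of how the actual cross-validation over k could look. The k_choices list, X_train, y_train, and the KNearestNeighbor class with its train/predict methods are assumed to match the notebook, so treat the details as approximate.

import numpy as np

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]  # candidate values of k (assumed list)

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

k_to_accuracies = {}

for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        # fold i is the validation fold; the rest are concatenated into the training set
        X_val, y_val = X_train_folds[i], y_train_folds[i]
        X_tr = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
        y_tr = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])

        classifier = KNearestNeighbor()
        classifier.train(X_tr, y_tr)
        y_val_pred = classifier.predict(X_val, k=k)

        # fraction of validation labels predicted correctly for this (k, fold) pair
        k_to_accuracies[k].append(np.mean(y_val_pred == y_val))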