HW1: Implement Decision Tree, KNN, Random Forest and AdaBoost in Python

Table of Contents

  1. Q1 Comparison of Classifiers
    1.1. Data Description
    1.2. Comparison of Classifiers
    1.3. Decision Tree
  2. Q2 Implementation of Adaboost

The full source code is available in the GitHub repository.

Q1 Comparison of Classifiers

Data Description

We use the letter dataset from Statlog. The statistics of the dataset are shown in Table 1. The data has 26 classes, one per letter.

| Dataset | Size  | Features |
| ------- | ----- | -------- |
| Train   | 15000 | 16       |
| Test    | 5000  | 16       |

Comparison of Classifiers

You are required to implement the following classifiers and measure the performance each of them achieves.

  • Decision Tree

    You should build decision trees on the dataset using the entropy and gini criteria. For each criterion, set the maximum depth to each value in [5, 10, 15, 20, 25] separately. You need to compare the performance (accuracy, precision, recall, F1 score and training time) and give a brief discussion.

  • KNN, Random Forest

    Apply two further classifiers, KNN and Random Forest, to the dataset. For each classifier, evaluate the performance (accuracy, precision, recall, F1 score and training time). You are required to compare the performance of the different classifiers and give a brief discussion.

```python
import pandas as pd
import numpy as np

def preprocess(path):
    """
    Preprocess the txt file.

    Input: the file path
    Output:
        X: a list containing all feature vectors
        y: a list of labels
    """
    X, y = list(), list()
    with open(path) as f:
        for line in f.readlines():
            data = line.strip().split(" ")
            vec = list()
            for i, item in enumerate(data):
                if i == 0:
                    y.append(item)  # the first item is the label
                else:
                    # the remaining items are "index:value" pairs for the 16 features
                    vec.append(float(item.split(":")[1]))
            X.append(vec)
    return X, y

X_train, y_train = preprocess("Q1_dataset/letter_train")
X_test, y_test = preprocess("Q1_dataset/letter_test")
```
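As a side note, the files appear to use the LIBSVM `label index:value` format, so if the feature indices are 1-based and sorted (an assumption worth checking against the raw files), scikit-learn's built-in loader could replace the manual parser:

```python
from sklearn.datasets import load_svmlight_file

# Returns a sparse feature matrix and a label vector in one call
X_train_sparse, y_train_alt = load_svmlight_file("Q1_dataset/letter_train", n_features=16)
X_test_sparse, y_test_alt = load_svmlight_file("Q1_dataset/letter_test", n_features=16)
X_train_alt = X_train_sparse.toarray()  # densify for classifiers that prefer arrays
```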
```python
from sklearn import metrics

def evaluate(y_predict, y_test):
    """
    Evaluate the classifier performance by accuracy, precision, recall and f1.

    Input: y_predict and the ground truth y_test
    Output: accuracy, precision, recall, f1
    """
    accuracy = metrics.accuracy_score(y_test, y_predict)
    # macro averaging: compute the metric per class, then take the unweighted mean
    precision = metrics.precision_score(y_test, y_predict, average="macro")
    recall = metrics.recall_score(y_test, y_predict, average="macro")
    f1 = metrics.f1_score(y_test, y_predict, average="macro")
    print("accuracy = ", accuracy)
    print("precision = ", precision)
    print("recall = ", recall)
    print("f1 = ", f1)

    return accuracy, precision, recall, f1
```
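If a per-class breakdown is ever needed, scikit-learn can also print precision, recall and F1 for each of the 26 letters in one call; a one-line sketch (the helper name is introduced here for illustration) taking the same arguments as evaluate:

```python
# Per-class precision/recall/F1 table, one row per letter
def evaluate_per_class(y_predict, y_test):
    print(metrics.classification_report(y_test, y_predict))
```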

Decision Tree

```python
import time
from sklearn.tree import DecisionTreeClassifier

criterion = ["gini", "entropy"]
max_depth = [5, 10, 15, 20, 25]

for c in criterion:
    for m in max_depth:
        t0 = time.time()
        DT = DecisionTreeClassifier(criterion=c, max_depth=m)
        DT.fit(X_train, y_train)
        t1 = time.time()
        y_predict = DT.predict(X_test)
        print("criterion = %s, max_depth = %s" % (c, m))
        evaluate(y_predict, y_test)
        print("training time = ", t1 - t0)
```

According to the evaluation output above, we can summarize the performance of the decision tree with different parameters in the following table (values kept to 3 significant digits).

| Criterion | Max Depth | Accuracy | Precision | Recall | F1    | Training Time |
| --------- | --------- | -------- | --------- | ------ | ----- | ------------- |
| gini      | 5         | 0.369    | 0.392     | 0.368  | 0.327 | 0.0942s       |
| gini      | 10        | 0.713    | 0.763     | 0.716  | 0.726 | 0.103s        |
| gini      | 15        | 0.832    | 0.842     | 0.833  | 0.835 | 0.125s        |
| gini      | 20        | 0.867    | 0.869     | 0.867  | 0.867 | 0.126s        |
| gini      | 25        | 0.872    | 0.872     | 0.872  | 0.872 | 0.126s        |
| entropy   | 5         | 0.501    | 0.561     | 0.501  | 0.498 | 0.086s        |
| entropy   | 10        | 0.799    | 0.804     | 0.799  | 0.800 | 0.118s        |
| entropy   | 15        | 0.874    | 0.875     | 0.875  | 0.875 | 0.132s        |
| entropy   | 20        | 0.871    | 0.871     | 0.872  | 0.871 | 0.151s        |
| entropy   | 25        | 0.874    | 0.874     | 0.874  | 0.874 | 0.157s        |

From this table we can see that, for the same max_depth, entropy consistently performs better than gini. A larger max_depth gives a better classifier, with higher accuracy, precision, recall and F1, but also takes longer to train; the gains level off beyond a depth of about 20. The best accuracy the classifier achieves is about 0.874.
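Incidentally, a table like the one above can be assembled programmatically. The sketch below is a hypothetical variant of the training loop that collects evaluate's return values into a pandas DataFrame (the rows list and the "training time (s)" column name are introduced here for illustration):

```python
import pandas as pd

rows = []
for c in criterion:
    for m in max_depth:
        t0 = time.time()
        DT = DecisionTreeClassifier(criterion=c, max_depth=m).fit(X_train, y_train)
        train_time = time.time() - t0
        acc, prec, rec, f1 = evaluate(DT.predict(X_test), y_test)
        rows.append({"criterion": c, "max_depth": m, "accuracy": acc,
                     "precision": prec, "recall": rec, "f1": f1,
                     "training time (s)": train_time})

print(pd.DataFrame(rows).round(3))  # one row per configuration, as in the table
```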

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

n_neighbors = [2, 5, 8]
for n in n_neighbors:
    t0 = time.time()
    neigh = KNeighborsClassifier(n_neighbors=n, algorithm="auto")
    neigh.fit(X_train, y_train)
    t1 = time.time()
    y_predict = neigh.predict(X_test)
    print("n_neighbors = ", n)
    evaluate(y_predict, y_test)
    print("training time = ", t1 - t0)

n_estimators = [50, 100, 150]
for n in n_estimators:
    t0 = time.time()
    RF = RandomForestClassifier(n_estimators=n, criterion="gini", max_depth=20)
    RF.fit(X_train, y_train)
    t1 = time.time()
    y_predict = RF.predict(X_test)
    print("n_estimators = ", n)
    evaluate(y_predict, y_test)
    print("training time = ", t1 - t0)
```

We summarize the performance of the different classifiers in the following table. Note that we tune n_neighbors (the number of neighbors used for k-nearest-neighbor queries) in KNN and n_estimators (the number of trees in the forest) in RandomForest, which gives 6 different classifiers.

| Classifier                        | Accuracy | Precision | Recall | F1    | Training Time |
| --------------------------------- | -------- | --------- | ------ | ----- | ------------- |
| KNN, n_neighbors = 2              | 0.947    | 0.948     | 0.949  | 0.947 | 0.198s        |
| KNN, n_neighbors = 5              | 0.952    | 0.952     | 0.952  | 0.952 | 0.189s        |
| KNN, n_neighbors = 8              | 0.948    | 0.949     | 0.948  | 0.948 | 0.187s        |
| RandomForest, n_estimators = 50   | 0.957    | 0.958     | 0.958  | 0.958 | 0.932s        |
| RandomForest, n_estimators = 100  | 0.960    | 0.961     | 0.961  | 0.961 | 1.813s        |
| RandomForest, n_estimators = 150  | 0.962    | 0.963     | 0.963  | 0.963 | 2.734s        |

Because RandomForest is an ensemble method that has to train many base classifiers and combine them, it takes much more time to train than KNN (2.734s is about 15x 0.187s). The more trees in the forest, the longer the training takes, but the better the performance. For KNN, by contrast, training time barely changes with n_neighbors: fitting a KNN model essentially just stores (and indexes) the training data, and the neighbor search is deferred to prediction time, as the timing sketch below illustrates. The best n_neighbors is 5, and the best accuracy KNN achieves is about 0.952, which is lower than RandomForest's best accuracy of 0.962.
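To make the lazy-learning point concrete, one can also time the prediction phase (a small sketch reusing the last fitted neigh and RF from the loops above); for KNN this is where the real work happens:

```python
# KNN defers the neighbor search to query time, so prediction is its
# expensive phase; the forest only has to walk its trees at query time
t0 = time.time()
neigh.predict(X_test)
print("KNN prediction time = ", time.time() - t0)

t0 = time.time()
RF.predict(X_test)
print("RF prediction time = ", time.time() - t0)
```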

In short, RandomForest reduces variance and performs better than KNN, but needs more time to train.

Q2 Implementation of Adaboost

The following table shows the training dataset. It consists of 10 samples with 2 possible labels.

| x | 0 | 1 | 2 | 3  | 4  | 5  | 6 | 7 | 8 | 9  |
|---|---|---|---|----|----|----|---|---|---|----|
| y | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 |

We assume each weak classifier has the form $x < v$ or $x > v$, where $v$ is a threshold chosen so that the classifier achieves the best accuracy on the dataset. You should implement the AdaBoost algorithm to learn a strong classifier. Notice that you CANNOT use an AdaBoost library; you need to implement it manually.

You should also report the final expression of the strong classifier, such as $C^*(x) = \operatorname{sign}[\alpha_1 C_1(x) + \alpha_2 C_2(x) + \alpha_3 C_3(x) + \cdots]$, where $C_i(x)$ is the base classifier and $\alpha_i$ is the weight of the base classifier. You are also required to describe each base classifier in detail.

For simplicity, the threshold $v$ should be the multiple of 0.5, i.e., $v\%0.5==0$. For example, you can set $v$ as 2, 2.5, or 3, but you cannot set $v$ as 2.1.
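For reference, the implementation below follows the standard AdaBoost updates. If $\epsilon_t$ is the weighted error rate of the base classifier $C_t$ chosen in round $t$, its weight and the sample-weight update are

$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad w_i \leftarrow \begin{cases} \dfrac{w_i}{2(1-\epsilon_t)}, & C_t(x_i) = y_i \\ \dfrac{w_i}{2\epsilon_t}, & C_t(x_i) \neq y_i \end{cases}$$

and this particular normalization keeps the sample weights summing to 1 after every round.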

```python
import math
import numpy as np

class BaseClassifier:
    """
    The base classifier class.

    Initial Parameters:
        v: the threshold of the base classifier
        name: (string) the name of the base classifier
        lower_v_label: (+1 or -1) the label predicted when x < v
    """

    def __init__(self, v, name, lower_v_label):
        self.v = v
        self.name = name
        self.lower_v_label = lower_v_label

    def predict(self, X):
        y_predict = list()
        for x in X:
            if x < self.v:
                y_predict.append(self.lower_v_label)
            else:
                y_predict.append(-self.lower_v_label)
        return y_predict

    def get_error_rate(self, X, y, weights):
        # weighted error rate: total weight of the misclassified samples,
        # normalized by the total weight of all samples
        weight_count = 0
        y_predict = self.predict(X)
        for i in range(len(y)):
            if y_predict[i] != y[i]:
                weight_count += weights[i]
        return weight_count / sum(weights)
```

We search for the best base classifiers from two directions. First, we consider classifiers of the form

$$C(x) = \begin{cases} +1, & x < v \\ -1, & x > v \end{cases}$$

With $v$ satisfying $v\%0.5==0$, it is easy to see that for $v=2.5$ or $v=8.5$ the classifier achieves the lowest error rate, 0.3, misclassifying only 3 samples. We take these two classifiers as $C_1$ and $C_2$.

Then we consider classifiers of the opposite form,

$$C(x) = \begin{cases} -1, & x < v \\ +1, & x > v \end{cases}$$

Similarly, we find that in this family the classifier with $v=5.5$ has the smallest error rate, 0.4. We denote it by $C_3$.

```python
# Training dataset
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

weights = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Initialize the three best base classifiers C1, C2, C3
# v = 2.5, lower_v_label = 1
C1 = BaseClassifier(v=2.5, name="C1", lower_v_label=1)
print("Classifier 1: x < 2.5 => y = +1, x >= 2.5 => y = -1")
print("prediction:", C1.predict(X))
print("error rate: ", C1.get_error_rate(X, y, weights))
print()

# v = 8.5, lower_v_label = 1
C2 = BaseClassifier(v=8.5, name="C2", lower_v_label=1)
print("Classifier 2: x < 8.5 => y = +1, x >= 8.5 => y = -1")
print("prediction:", C2.predict(X))
print("error rate: ", C2.get_error_rate(X, y, weights))
print()

# v = 5.5, lower_v_label = -1
C3 = BaseClassifier(v=5.5, name="C3", lower_v_label=-1)
print("Classifier 3: x < 5.5 => y = -1, x >= 5.5 => y = +1")
print("prediction:", C3.predict(X))
print("error rate: ", C3.get_error_rate(X, y, weights))
print()
```
```
Classifier 1: x < 2.5 => y = +1, x >= 2.5 => y = -1
prediction: [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]
error rate:  0.3

Classifier 2: x < 8.5 => y = +1, x >= 8.5 => y = -1
prediction: [1, 1, 1, 1, 1, 1, 1, 1, 1, -1]
error rate:  0.3

Classifier 3: x < 5.5 => y = -1, x >= 5.5 => y = +1
prediction: [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1]
error rate:  0.4
```
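As a sanity check on the thresholds chosen by inspection, one can also enumerate every midpoint threshold in both orientations and rank the candidates by error rate; a small sketch (the candidates list and the name format are introduced here for illustration) that should recover the two 0.3-error classifiers at $v=2.5$ and $v=8.5$:

```python
# Enumerate midpoint thresholds and both orientations, rank by unweighted error
candidates = []
for lower_v_label in (1, -1):
    for k in range(10):
        v = k + 0.5  # v in {0.5, 1.5, ..., 9.5}
        clf = BaseClassifier(v=v, name="v=%.1f, lower=%+d" % (v, lower_v_label),
                             lower_v_label=lower_v_label)
        candidates.append((clf.get_error_rate(X, y, [1] * len(y)), clf.name))

for err, name in sorted(candidates)[:2]:
    print(name, "error rate =", err)
```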
```python
class Adaboost:
    """
    AdaBoost classifier class.

    Initial Parameters:
        base_classifiers: [Classifier1, Classifier2, ...] the list of candidate base classifiers
        n_classifiers: the maximum number of classifiers at which boosting is terminated
    """

    def __init__(self, base_classifiers, n_classifiers):
        self.base_classifiers = base_classifiers
        self.n_classifiers = n_classifiers

    # train on data X -> y
    def fit(self, X, y):
        weights = [1 / len(X)] * len(X)  # initialize uniform sample weights
        self.alphas = list()
        self.classifiers = list()
        for n in range(self.n_classifiers):
            # pick the base classifier with the smallest weighted error rate
            min_error_rate = 1
            for cf in self.base_classifiers:
                e = cf.get_error_rate(X, y, weights)
                print(cf.name, "error rate = ", round(e, 4))
                if e < min_error_rate:
                    best_base_classifier = cf
                    min_error_rate = e

            # calculate the importance of the chosen classifier
            alpha = 1 / 2 * math.log((1 - min_error_rate) / min_error_rate)

            # save alpha and the best base classifier at each iteration
            self.alphas.append(alpha)
            self.classifiers.append(best_base_classifier)

            # update the sample weights (the total weight stays 1)
            y_predict = best_base_classifier.predict(X)
            for i in range(len(weights)):
                if y_predict[i] == y[i]:
                    weights[i] = weights[i] / (2 * (1 - min_error_rate))
                else:
                    weights[i] = weights[i] / (2 * min_error_rate)
            print("weights = ", [round(w, 4) for w in weights])
            print()

        # describe the final ensemble classifier
        info = "final ensemble classifier = " + " + ".join(
            "%.4f * %s" % (a, c.name) for a, c in zip(self.alphas, self.classifiers))
        print(info)

    def predict(self, X):
        # weighted vote of the selected base classifiers
        scores = np.array([0.0] * len(X))
        for i in range(len(self.alphas)):
            scores += self.alphas[i] * np.array(self.classifiers[i].predict(X))
        return np.sign(scores)

    def get_error_rate(self, X, y, weights):
        weight_count = 0
        y_predict = self.predict(X)
        for i in range(len(y)):
            if y_predict[i] != y[i]:
                weight_count += weights[i]
        return weight_count / sum(weights)
```
```python
ada = Adaboost(base_classifiers=[C1, C2, C3], n_classifiers=3)
ada.fit(X, y)
```
```
C1 error rate =  0.3
C2 error rate =  0.3
C3 error rate =  0.4
weights =  [0.0714, 0.0714, 0.0714, 0.0714, 0.0714, 0.0714, 0.1667, 0.1667, 0.1667, 0.0714]

C1 error rate =  0.5
C2 error rate =  0.2143
C3 error rate =  0.2857
weights =  [0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1061, 0.1061, 0.1061, 0.0455]

C1 error rate =  0.3182
C2 error rate =  0.5
C3 error rate =  0.1818
weights =  [0.125, 0.125, 0.125, 0.1019, 0.1019, 0.1019, 0.0648, 0.0648, 0.0648, 0.125]

final ensemble classifier = 0.4236 * C1 + 0.6496 * C2 + 0.7520 * C3
```
```python
ada.predict(X)
```
```
array([ 1.,  1.,  1., -1., -1., -1.,  1.,  1.,  1., -1.])
```
```python
ada.get_error_rate(X, y, weights)
```
```
0.0
```

In the end, we get a strong classifier

$$C^*(x) = \operatorname{sign}\big[0.4236\,C_1(x) + 0.6496\,C_2(x) + 0.7520\,C_3(x)\big]$$

that achieves zero classification error, where

$$C_1(x) = \begin{cases} +1, & x < 2.5 \\ -1, & x > 2.5 \end{cases} \qquad C_2(x) = \begin{cases} +1, & x < 8.5 \\ -1, & x > 8.5 \end{cases} \qquad C_3(x) = \begin{cases} -1, & x < 5.5 \\ +1, & x > 5.5 \end{cases}$$

The classification result of $C^*(x)$ is [1, 1, 1, -1, -1, -1, 1, 1, 1, -1], which matches the ground truth exactly.
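As a quick sanity check of the sign structure, take $x=0$ and $x=3$:

$$C^*(0) = \operatorname{sign}(0.4236 + 0.6496 - 0.7520) = \operatorname{sign}(0.3212) = +1, \qquad C^*(3) = \operatorname{sign}(-0.4236 + 0.6496 - 0.7520) = \operatorname{sign}(-0.5260) = -1,$$

both of which agree with the true labels.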