What Is a Decision Tree?

Concept

A decision tree is a tree-structured predictive model used for classification and regression.

1. Decision Trees for Classification

A model that classifies the input data and predicts a class label.

ex) Predict Cheat using the Refund, Marital Status, and Taxable Income attributes as the decision criteria.

Training Data

Decision Tree (several different trees can be built, depending on the criteria used to split on the attributes)

Test Data

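Follow the structure of the decision tree from the root down to predict Cheat. Both of the decision trees above predict No for Cheat.

2. Decision Trees for Regression

Predicts a continuous numeric value from the input data.

ex) Predict the expected price of a house from its floor area and number of rooms.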

Training Data

Decision Tree

                [Area < 90㎡ ?]
               /            \
            Yes              No
        [Rooms < 2 ?]       [Price = 5500]
        /      \
    [Price=2500] [Price=3500]
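The regression tree drawn above can be read as nested conditions. The sketch below is only an illustration: the function name `predict_price` is made up here, and it assumes the left child of each test is the "Yes" branch, which the diagram implies but does not label for the second test.

```python
def predict_price(area_m2: float, rooms: int) -> int:
    """Follow the regression tree drawn above and return the leaf's price."""
    if area_m2 < 90:        # root test: area < 90 square meters?
        if rooms < 2:       # second test, reached only on the "Yes" branch
            return 2500
        return 3500
    return 5500             # the "No" branch of the root is already a leaf


print(predict_price(60, 1))
print(predict_price(75, 3))
print(predict_price(120, 4))
```

Every input that reaches the same leaf gets the same price, which is why a regression tree predicts a piecewise-constant function.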

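How the Tree Is Built (Tree Induction)

1. Choose the splitting attribute: select the attribute that reduces impurity the most.

2. Split the data on the values of that attribute.

3. Repeat for each child node.

4. Stop when a stopping condition is reached (ex. a split yields almost no information gain).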

Choosing the Splitting Attribute

A split aims to produce child nodes that are as pure (homogeneous) as possible.

Pure means that impurity is low. Impurity can be measured in several ways.

Impurity Measures and Split Criteria

1. Entropy

2. Information Gain (the reduction in entropy produced by a split)

3. Gain Ratio (used by C4.5)

4. Gini Index (used by CART)
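The four quantities listed above can be sketched in a few lines of standard-library Python. The function names and the toy labels are assumptions made for illustration. Entropy uses log base 2, the Gini index is one minus the sum of squared class proportions, and gain ratio divides information gain by the split information (the entropy of the child sizes).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by split information (entropy of the child sizes)."""
    total = len(parent)
    split_info = -sum(
        len(ch) / total * math.log2(len(ch) / total) for ch in children if ch
    )
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


# Hypothetical labels: a 50/50 parent node split into two pure children.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]
print(entropy(parent), gini(parent))
print(information_gain(parent, children), gain_ratio(parent, children))
```

Entropy and the Gini index score a single node. Information gain and gain ratio score a candidate split by comparing the parent node with its children.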

Attribute Types

How a node is split also depends on the type of the attribute.
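As a rough illustration of why the attribute type matters, the sketch below splits a categorical attribute into one child per value and a continuous attribute into two children around a threshold. The helper names, the records, and the threshold of 80 are all hypothetical. Only the attribute names are borrowed from the earlier Cheat example.

```python
from collections import defaultdict


def split_categorical(rows, attr):
    """Multi-way split: one child per distinct value of a categorical attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row)
    return dict(groups)


def split_continuous(rows, attr, threshold):
    """Binary split: rows below the threshold go left, the rest go right."""
    left = [r for r in rows if r[attr] < threshold]
    right = [r for r in rows if r[attr] >= threshold]
    return left, right


# Hypothetical records, not the article's training data.
rows = [
    {"Refund": "Yes", "Taxable Income": 125},
    {"Refund": "No", "Taxable Income": 70},
    {"Refund": "No", "Taxable Income": 95},
]
by_refund = split_categorical(rows, "Refund")
low, high = split_continuous(rows, "Taxable Income", 80)
print({value: len(group) for value, group in by_refund.items()}, len(low), len(high))
```

For a continuous attribute the induction algorithm also has to choose the threshold itself, typically by scoring several candidate cut points with the same impurity measure.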

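Tree Construction Example

Using the samples above, let's build a decision tree that predicts whether a novel will be a hit.

step 1. Compute the entropy of the whole dataset before any split

step 2-1. Compute the entropy after splitting on Novelist

step 2-2. Compute the entropy after splitting on Novel Genre

step 3. Choose the attribute by Information Gain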

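The entropy after splitting on Novelist is 0.857, and the entropy after splitting on Novel Genre is 0.963.

Subtracting each from the entropy before the split (0.985) gives the Information Gain:

Novelist: 0.985 - 0.857 = 0.128

Novel Genre: 0.985 - 0.963 = 0.022

Novelist has the larger Information Gain, so it is chosen as the root node.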
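The selection step can be checked with a few lines of Python. This sketch only reuses the three entropy values quoted in the example. The variable names are made up for illustration.

```python
# Entropy values quoted in the worked example.
parent_entropy = 0.985
split_entropy = {"Novelist": 0.857, "Novel Genre": 0.963}

# Information gain = entropy before the split - entropy after the split.
gains = {attr: round(parent_entropy - h, 3) for attr, h in split_entropy.items()}

# The attribute with the largest gain becomes the root node.
root = max(gains, key=gains.get)

print(gains)
print(root)
```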

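Pros and Cons of Decision Trees

The advantages of decision trees are that the rules are intuitive, both continuous and categorical data can be handled, very little preprocessing is needed, and variable importance can be visualized.

On the other hand, tree induction is a greedy algorithm that picks the locally best split at each step, so the resulting tree is not always globally optimal. Decision trees are also prone to overfitting, and training can take a long time.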

Main Algorithms

ID3	Uses Entropy & Information Gain
C4.5	Improves on ID3, uses Gain Ratio
CART	Uses Gini Index, binary splits
CHAID	Based on the chi-square statistic

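Decision Stump

A decision stump is a decision tree of depth 1: a root node with a single level of leaf nodes.

Because it uses only one attribute, a stump performs poorly on its own, but combining tens or hundreds of decision stumps (for example, in a boosting ensemble) produces a strong model.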
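A minimal sketch of a single decision stump for categorical data follows. It picks the one attribute whose per-value majority vote makes the fewest training errors. That selection rule is a simplification chosen for brevity, not the entropy-based criterion used elsewhere in this article. The function names and the four-row dataset are hypothetical.

```python
from collections import Counter, defaultdict


def fit_stump(rows, attributes, target):
    """Return (attribute, {value: majority label}) with the fewest training errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(list)
        for row in rows:
            by_value[row[attr]].append(row[target])
        # One leaf per attribute value, labelled with that value's majority class.
        leaves = {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}
        errors = sum(row[target] != leaves[row[attr]] for row in rows)
        if best is None or errors < best[0]:
            best = (errors, attr, leaves)
    return best[1], best[2]


def predict_stump(stump, sample, default=None):
    """Look up the leaf for the sample's value of the stump's single attribute."""
    attr, leaves = stump
    return leaves.get(sample[attr], default)


# Hypothetical toy data, not taken from the article.
rows = [
    {"Income": "High", "Previous Customer": "No", "Outcome": "Not responded"},
    {"Income": "High", "Previous Customer": "Yes", "Outcome": "Not responded"},
    {"Income": "Low", "Previous Customer": "No", "Outcome": "responded"},
    {"Income": "Low", "Previous Customer": "Yes", "Outcome": "responded"},
]
stump = fit_stump(rows, ["Income", "Previous Customer"], "Outcome")
print(stump)
print(predict_stump(stump, {"Income": "Low", "Previous Customer": "No"}))
```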

Implementing a Decision Tree

Build a decision tree that predicts Outcome, then use it to make a prediction.

import numpy as np
import pandas as pd

# Create the dataset as a pandas DataFrame
data=[
    ["Suburban","Detached","High","No","Not responded"],
    ["Suburban","Detached","High","Yes","Not responded"],
    ["Rural","Detached","High","No","responded"],
    ["Urban","semi-Detached","High","No","responded"],
    ["Urban","semi-Detached","Low","No","responded"],
    ["Urban","semi-Detached","Low","Yes","Not responded"],
    ["Rural","semi-Detached","Low","Yes","responded"],
    ["Suburban","Terrace","High","No","Not responded"],
    ["Suburban","semi-Detached","Low","No","responded"],
    ["Urban","Terrace","Low","No","responded"],
    ["Suburban","Terrace","Low","Yes","responded"],
    ["Rural","Terrace","High","Yes","responded"],
    ["Rural","Detached","Low","No","responded"],
    ["Urban","Terrace","High","Yes","Not responded"]    
]

columns=["District","House Type","Income","Previous Customer","Outcome"]
df=pd.DataFrame(data,columns=columns)

# Function to calculate entropy of a column 
def entropy(target_col):
    values,counts=np.unique(target_col,return_counts=True)
    probabilities=counts/counts.sum()
    return -np.sum(probabilities*np.log2(probabilities))


# Function to calculate information gain of a split
def info_gain(data,split_attribute_name,target_name="Outcome"):
    total_entropy=entropy(data[target_name])
    values,counts=np.unique(data[split_attribute_name],return_counts=True)
    weighted_entropy=0
    for i in range(len(values)):
        subset=data[data[split_attribute_name]==values[i]]
        weight=counts[i]/np.sum(counts)
        weighted_entropy+=weight*entropy(subset[target_name])
    return total_entropy-weighted_entropy


# Function to find the best attribute to split on
def best_attribute(data,attributes,target_name="Outcome"):
    # Start below any possible gain so that an attribute is always returned,
    # even when no split improves purity (every gain is 0)
    best_gain=-1.0
    best_attr=None
    for attr in attributes:
        gain=info_gain(data,attr,target_name)
        if gain>best_gain:
            best_gain=gain
            best_attr=attr
    return best_attr

# Recursive function to build the decision tree
def build_tree(data,attributes,target_name="Outcome"):
    # If all target values are the same, return that value (leaf node)
    if len(np.unique(data[target_name]))==1:
        return np.unique(data[target_name])[0]
    # If no attributes are left to split on, return the most common class
    if len(attributes)==0:
        return data[target_name].mode()[0]

    # Select the best attribute based on information gain
    best_attr=best_attribute(data,attributes,target_name)
    tree={best_attr:{}}

    # For each unique value of the chosen attribute, build a subtree
    for val in np.unique(data[best_attr]):
        subset=data[data[best_attr]==val]
        # Exclude the used attribute and recursively build the tree
        subtree = build_tree(subset, [attr for attr in attributes if attr != best_attr], target_name)
        tree[best_attr][val]=subtree

    return tree

# Build the decision tree 
attributes=["District","House Type","Income","Previous Customer"]
tree=build_tree(df,attributes)


# Function to predict the outcome for a new sample using the decision tree
def predict(tree,sample):
    # If this is a leaf node, return the class label
    if not isinstance(tree,dict):
        return tree
    # Otherwise, follow the branch based on the sample's value for the current attribute
    attribute=next(iter(tree))
    value=sample.get(attribute)
    subtree=tree[attribute].get(value)
    # The sample has a value that was never seen on this branch during training
    if subtree is None:
        return "Unknown"
    return predict(subtree,sample)

# Predict the outcome for a new customer
sample_customer = {
    "District": "Suburban",
    "House Type": "Detached",
    "Income": "Low",
    "Previous Customer": "Yes"
}

result = predict(tree, sample_customer)

# Print the result
print("Decision Tree:", tree)
print("Prediction for sample:", result)
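The nested dictionary printed by the script is hard to read on one line. The helper below is a small sketch for displaying it as indented text. `format_tree` is a name introduced here. The `tree` literal was worked out by hand from the 14 rows and is assumed to match what `build_tree` returns (District at the root, then Income under Suburban and Previous Customer under Urban).

```python
def format_tree(node, indent=0):
    """Return the nested-dict tree as a list of indented text lines."""
    if not isinstance(node, dict):
        return [" " * indent + "-> " + str(node)]
    lines = []
    for attribute, branches in node.items():
        for value, subtree in branches.items():
            lines.append(" " * indent + f"{attribute} = {value}")
            lines.extend(format_tree(subtree, indent + 4))
    return lines


# Tree derived by hand from the dataset (assumed to equal build_tree's result).
tree = {
    "District": {
        "Rural": "responded",
        "Suburban": {"Income": {"High": "Not responded", "Low": "responded"}},
        "Urban": {"Previous Customer": {"No": "responded", "Yes": "Not responded"}},
    }
}
print("\n".join(format_tree(tree)))
```

Under this tree the sample customer (Suburban, Low income) follows District = Suburban, then Income = Low, and lands on the responded leaf.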

Concept