Hands-On Machine Learning Series: Random Forests

Posted by 昵稱32937624, 2019-02-08


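We will explore decision trees and then extend them to random forests. Unlike the linear and logistic regression models we saw earlier, these models have no weights, yet they are highly interpretable.

Overview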

[Figure: a sample decision tree that decides whether to play outside]

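Goal:
Given some data, choose features and decide how to split the data in order to make predictions.
Advantages:
Decision trees can be built as classification trees or regression trees.
They are highly interpretable.
They need very little data preprocessing.
Disadvantages:
They perform poorly when the training data is small relative to the number of classes.
Other:
A group of decision trees can form a random forest, and the prediction is then determined by all of the trees together.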

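Training

Consider the sample decision tree above, which decides whether the weather is suitable for playing outside. The data has three features (weather, humidity, and wind) and an outcome (yes or no).

Steps:

1. Split on each feature in turn (e.g. use each of the three features to decide whether the outcome is yes or no).
2. Compute the cost of each candidate split. Popular algorithms include CART, which uses the Gini index, and ID3, which uses entropy and information gain. Both essentially measure the impurity, or disorder, in the predicted values. See the blog post for a detailed, step-by-step information gain calculation.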

$$H(X) = -\sum_{c \in C} p(c) \log_2 p(c)$$

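H(X): entropy of the dataset X
C: the set of classes
p(c): proportion of all instances that belong to class c

For a binary classification task, if all samples belong to the same class, the entropy is 0. If the samples are split evenly between the two classes, the entropy is 1 (the worst case, equivalent to random guessing). Once we have the entropy, we compute the information gain (IG), i.e. how much the uncertainty drops after splitting the data X on feature F.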
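The two boundary cases described above can be checked numerically. The following is a minimal sketch, assuming class labels are given as a plain list; the helper name `entropy` is illustrative and not part of the original notebook.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p))
    return float(np.sum(p * np.log2(1.0 / p)))

pure = entropy(["yes", "yes", "yes", "yes"])   # every sample in one class
mixed = entropy(["yes", "no", "yes", "no"])    # even 50/50 split

print(pure, mixed)  # 0.0 1.0
```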

$$IG(F, X) = H(X) - \sum_{t \in T} p(t) H(t)$$

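IG(F, X): information gain from splitting the data X on feature F
H(X): entropy of the dataset X
T: the subsets produced by splitting on F
p(t): proportion of all instances that fall into subset t
H(t): entropy of subset t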
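The definitions above translate almost directly into code. This is a sketch under the assumption that the feature is categorical; the toy `play`, `wind`, and `humidity` values are hypothetical and chosen only to show one perfectly informative feature and one uninformative feature.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(feature_values, labels):
    """IG(F, X) = H(X) - sum over subsets t of p(t) * H(t)."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    weighted = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]          # subset t
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical toy data, not taken from the article
play = ["yes", "yes", "no", "no"]
wind = ["weak", "weak", "strong", "strong"]   # separates the classes perfectly
humidity = ["high", "low", "high", "low"]     # carries no information

ig_wind = information_gain(wind, play)
ig_humidity = information_gain(humidity, play)
print(ig_wind, ig_humidity)  # 1.0 0.0
```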

Note: for regression problems, you can use the reduction in standard deviation in place of information gain.
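For the regression case, one way to read this note is to swap entropy for the standard deviation of the target and keep the rest of the formula. The sketch below assumes that reading; `std_reduction` and the toy columns are hypothetical names used only for illustration.

```python
import numpy as np

def std_reduction(feature_values, targets):
    """Standard deviation of the targets minus the weighted std of each subset."""
    feature_values = np.asarray(feature_values)
    targets = np.asarray(targets, dtype=float)
    weighted = 0.0
    for value in np.unique(feature_values):
        subset = targets[feature_values == value]
        weighted += len(subset) / len(targets) * subset.std()
    return targets.std() - weighted

# Hypothetical toy data: minutes played outside
minutes = [10.0, 12.0, 30.0, 32.0]
weather = ["rain", "rain", "sun", "sun"]   # groups similar targets together
season = ["a", "b", "a", "b"]              # mixes low and high targets

sdr_weather = std_reduction(weather, minutes)
sdr_season = std_reduction(season, minutes)
print(sdr_weather > sdr_season)  # True
```

A larger reduction means the split leaves more homogeneous target values in each subset, so `weather` would be preferred here.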

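3. After evaluating a split on every feature, the split with the highest information gain becomes the first split (that is, the root of the decision tree).

4. After the first split, repeat the steps above on the remaining features. Eventually we reach the leaf nodes, where most of the samples belong to the same class.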
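The steps above can be sketched for the root node as follows. This is an illustration only: the six-row table is hypothetical, and the helpers `entropy` and `information_gain` are not part of the original notebook.

```python
import pandas as pd
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a pandas Series of class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(df, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    weighted = sum(
        len(subset) / len(df) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted

# Hypothetical play-outside data
data = pd.DataFrame({
    "weather":  ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"],
    "humidity": ["high", "low", "high", "low", "low", "high"],
    "wind":     ["weak", "strong", "weak", "strong", "weak", "strong"],
    "play":     ["yes", "yes", "no", "no", "yes", "no"],
})

# Steps 1-2: score a split on every feature
gains = {f: information_gain(data, f, "play") for f in ["weather", "humidity", "wind"]}

# Step 3: the feature with the highest gain becomes the root
root = max(gains, key=gains.get)
print(root)  # weather
```

Step 4 would repeat the same scoring inside each subset produced by the root split, using only the remaining features.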

Data

Load the libraries we will use

from argparse import Namespace  # used to hold the parameters
import matplotlib.pyplot as plt  # used for visualization
import numpy as np
import pandas as pd
import urllib

Set the parameters

# Parameters
args = Namespace(
    seed=1234,
    data_file='titanic.csv',
    train_size=0.75,
    test_size=0.25,
    num_epochs=100,
    max_depth=4,
    min_samples_leaf=5,
    n_estimators=10,  # number of decision trees in the random forest
)

# Set the random seed so the experiment is reproducible
np.random.seed(args.seed)

Read the file with pandas

# Read the CSV file into a DataFrame
df = pd.read_csv(args.data_file, header=0)
df.head()


from sklearn.tree import DecisionTreeClassifier

Preprocessing

# Preprocessing
def preprocess(df):
    # Drop rows that contain missing values
    df = df.dropna()

    # Drop text-based features (later lessons cover how to use them)
    features_to_drop = ['name', 'cabin', 'ticket']
    df = df.drop(features_to_drop, axis=1)

    # pclass, sex, and embarked are categorical variables.
    # We map the strings to integers instead of using the encoded
    # variables from the logistic regression lesson.
    df['sex'] = df['sex'].map({'female': 0, 'male': 1}).astype(int)
    df['embarked'] = df['embarked'].dropna().map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

    return df

Preprocessing result:

# Preprocess the data
df = preprocess(df)
df.head()


Split into training and test sets

# Split the data into a training set and a test set
mask = np.random.rand(len(df)) < args.train_size
train_df = df[mask]
test_df = df[~mask]
print('Train size: {0}, test size: {1}'.format(len(train_df), len(test_df)))

# Separate X and y
X_train = train_df.drop(['survived'], axis=1)
y_train = train_df['survived']
X_test = test_df.drop(['survived'], axis=1)
y_test = test_df['survived']
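The random-mask split only hits the requested ratio approximately, because each row is assigned independently. As a sketch of an alternative (not what the original notebook does), scikit-learn's `train_test_split` gives exact sizes and can stratify on the label. The small DataFrame below is a hypothetical stand-in for the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(1234)

# Hypothetical stand-in for the preprocessed DataFrame
df = pd.DataFrame({"x": np.arange(100), "survived": np.arange(100) % 2})

# Mask-based split, as in the notebook: sizes vary with the random draw
mask = np.random.rand(len(df)) < 0.75
train_mask, test_mask = df[mask], df[~mask]

# train_test_split: exact sizes, class balance preserved via stratify
train_tts, test_tts = train_test_split(
    df, train_size=0.75, random_state=1234, stratify=df["survived"]
)

print(len(train_mask), len(test_mask))
print(len(train_tts), len(test_tts))
```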

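Note: feel free to change max_depth and min_samples_leaf and watch how the decision tree's performance changes. How do we know when to stop splitting? If the dataset has many features, the decision tree can become very large, and if we keep splitting we will eventually overfit. Here are some ways to handle this:

Set a minimum number of samples per leaf node.
Set a maximum depth (the largest distance from the root to a leaf node).
Prune the decision tree by removing features that provide almost no information gain.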
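The effect of these limits can be seen on synthetic data. The sketch below is an illustration under assumptions: it uses `make_classification` instead of the Titanic data, and for pruning it uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which is a related but different technique from dropping low-gain features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed only for this illustration
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# No limits: the tree keeps splitting until the leaves are pure
full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)

# Limits from the notebook's args: max_depth=4, min_samples_leaf=5
limited = DecisionTreeClassifier(
    criterion="entropy", random_state=0, max_depth=4, min_samples_leaf=5
).fit(X_tr, y_tr)

# Cost-complexity pruning; the alpha value is an arbitrary choice
pruned = DecisionTreeClassifier(
    criterion="entropy", random_state=0, ccp_alpha=0.01
).fit(X_tr, y_tr)

for name, model in [("full", full), ("limited", limited), ("pruned", pruned)]:
    print(name, model.get_depth(), model.get_n_leaves(),
          round(model.score(X_tr, y_tr), 2), round(model.score(X_te, y_te), 2))
```

Comparing the train and test scores of the three trees shows how much of the unconstrained tree's training accuracy is due to overfitting.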

Create the model:

# Initialize the model
dtree = DecisionTreeClassifier(criterion='entropy', random_state=args.seed,
                               max_depth=args.max_depth,
                               min_samples_leaf=args.min_samples_leaf)

Train the model

# Training
dtree.fit(X_train, y_train)

Predict

# Predictions
pred_train = dtree.predict(X_train)
pred_test = dtree.predict(X_test)

Evaluate the model

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

Compute the evaluation metrics

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print('train acc: {0:.2f}, test acc: {1:.2f}'.format(train_acc, test_acc))

train acc: 0.82, test acc: 0.70

# Compute the other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average='binary')
print('precision: {0:.2f}. recall: {1:.2f}, F1: {2:.2f}'.format(precision, recall, F1))
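To make the metrics above concrete, here is a small sketch that computes them by hand from true positives, false positives, and false negatives. The six labels are hypothetical and are not predictions from the Titanic model.

```python
import numpy as np

# Hypothetical labels: 1 = survived, 0 = died
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted 1, actually 0
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted 0, actually 1

accuracy = float(np.mean(y_pred == y_true))
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)             # 3 1 1
print(precision, recall, f1)  # 0.75 0.75 0.75
```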

Interpretability

Install the required packages

# Install the required packages
!apt-get install graphviz
!pip install pydotplus

from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Interpretability
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                feature_names=list(train_df.drop(['survived'], axis=1)),
                class_names=['died', 'survived'],
                rounded=True, filled=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png(), width=500, height=300)

Plot the feature importances

# Feature importances
features = list(X_test.columns)
importances = dtree.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)

# Plot the feature importances of the tree
plt.figure()
plt.title('Feature importances')
plt.bar(range(num_features), importances[indices], color='g', align='center')
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)
plt.xlim([-1, num_features])
plt.show()

# Print the values
for i in indices:
    print('{0} - {1:.3f}'.format(features[i], importances[i]))

[Figure: bar chart of the decision tree's feature importances]

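Random forests

A random forest is built from a group, or ensemble, of decision trees. The idea is that a set of different trees will produce more accurate predictions than a single decision tree. But if we use the same data and the same split criterion (information gain, for example), how do we make sure the trees differ from each other? The solution is to build each tree in the forest from a different subset of the data, and even with different feature thresholds.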
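The idea can be sketched by hand with plain decision trees. This is a simplified illustration, not scikit-learn's implementation: each tree is fit on a bootstrap sample (rows drawn with replacement) and considers a random subset of features at each split, and the forest predicts by majority vote. The synthetic dataset and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1234)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

trees = []
for i in range(10):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(
        criterion="entropy",
        max_depth=4,
        max_features="sqrt",   # random subset of features at each split
        random_state=i,
    )
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across trees (ties go to class 1)
votes = np.stack([t.predict(X) for t in trees])     # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = float((majority == y).mean())
print(votes.shape, round(accuracy, 2))
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.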


Scikit-learn implementation

from sklearn.ensemble import RandomForestClassifier

Create the model

# Initialize the random forest
forest = RandomForestClassifier(
    n_estimators=args.n_estimators, criterion='entropy',
    max_depth=args.max_depth, min_samples_leaf=args.min_samples_leaf)

Train the model

# Training
forest.fit(X_train, y_train)

Predict

# Predictions
pred_train = forest.predict(X_train)
pred_test = forest.predict(X_test)

Evaluate the model

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print('train acc: {0:.2f}, test acc: {1:.2f}'.format(train_acc, test_acc))

# Compute the other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average='binary')
print('precision: {0:.2f}. recall: {1:.2f}, F1: {2:.2f}'.format(precision, recall, F1))

train acc: 0.80, test acc: 0.68
precision: 0.65. recall: 0.87, F1: 0.75

Interpretability

# Feature importances
features = list(X_test.columns)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
num_features = len(importances)

# Plot the feature importances of the forest
plt.figure()
plt.title('Feature importances')
plt.bar(range(num_features), importances[indices], yerr=std[indices],
        color='g', align='center')
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)
plt.xlim([-1, num_features])
plt.show()

# Print the values
for i in indices:
    print('{0} - {1:.3f}'.format(features[i], importances[i]))


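Hyperparameter search: grid search

A random forest has many hyperparameters (criterion, max_depth, n_estimators, and so on). How do we choose the values that give the best model? Common approaches include grid search and Bayesian search. Here we use scikit-learn's GridSearchCV to search for the hyperparameters.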

1. Import GridSearchCV

from sklearn.model_selection import GridSearchCV

2. Create the parameter grid

# Create the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(features)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50],  # number of trees
}
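It helps to know how much work this grid implies before running the search. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the combinations; `n_features = 7` is an assumed stand-in for `len(features)` and the fold count of 3 mirrors the `cv=3` setting used with GridSearchCV.

```python
from sklearn.model_selection import ParameterGrid

n_features = 7  # assumption: stand-in for len(features) in the notebook

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [n_features],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50],
}

combos = list(ParameterGrid(param_grid))   # every combination of values
n_cv_fits = len(combos) * 3                # one fit per combination per fold

print(len(combos), n_cv_fits)  # 54 162
```

Each extra value in any list multiplies the total, which is why grid search becomes expensive quickly as the grid grows.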

3. Initialize the random forest

# Initialize the random forest
forest = RandomForestClassifier()

4. Instantiate the grid search

# Instantiate the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

5. Inspect the best parameter combination

# Inspect the best parameter combination
grid_search.best_params_

# Train with the best parameters
# (best_estimator_ is already refit on the training set when refit=True, the default)
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)

6. Predict

# Predictions
pred_train = best_forest.predict(X_train)
pred_test = best_forest.predict(X_test)

7. Compute the accuracy

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print('train acc: {0:.2f}, test acc: {1:.2f}'.format(train_acc, test_acc))

# Compute the other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average='binary')
print('precision: {0:.2f}. recall: {1:.2f}, F1: {2:.2f}'.format(precision, recall, F1))

Result:

train acc: 0.90, test acc: 0.70
precision: 0.70. recall: 0.79, F1: 0.75
