In this post, we will implement several machine learning algorithms in Python using Scikit-learn, the most popular machine learning library for Python. We use a simple dataset to train a classifier to distinguish between different types of fruit. The purpose of this post is to identify the machine learning algorithm that is best suited to the problem at hand; to that end, we will compare several algorithms and select the best-performing one.

Data

The fruit dataset was created by Dr. Iain Murray from the University of Edinburgh. He bought a few dozen oranges, lemons and apples of different varieties, and recorded their measurements.

Let's take a look at the first few rows of the data.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

fruits = pd.read_table('fruit_data_with_colors.txt')
fruits.head()

Each row of the dataset represents one piece of fruit, described by several features in the columns.

We have 59 pieces of fruit and 7 features in the dataset:

print(fruits.shape)

(59, 7)

We have four types of fruit in the dataset:

print(fruits['fruit_name'].unique())

['apple' 'mandarin' 'orange' 'lemon']

The data is fairly balanced except for the mandarins. We will just have to go with it.

print(fruits.groupby('fruit_name').size())

import seaborn as sns
sns.countplot(fruits['fruit_name'], label='Count')
plt.show()

Visualization
fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2),
                                        sharex=False, sharey=False, figsize=(9,9),
                                        title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()
import pylab as pl
fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9,9))
pl.suptitle('Histogram for each numeric input variable')
plt.savefig('fruits_hist')
plt.show()
# pandas.tools.plotting and pd.scatter_matrix were removed in modern pandas;
# scatter_matrix now lives in pandas.plotting.
from pandas.plotting import scatter_matrix
from matplotlib import cm

feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

cmap = cm.get_cmap('gnuplot')
scatter = scatter_matrix(X, c=y, marker='o', s=40,
                         hist_kwds={'bins': 15}, figsize=(9,9), cmap=cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
Statistical Summary
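The summary table itself appears to have been lost from the page; a minimal way to reproduce it, using the fruits DataFrame loaded above, is pandas' describe():

# Count, mean, standard deviation, min/max and quartiles for each numeric column
fruits.describe()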
We can see that the numerical values do not have the same scale. We will need to apply scaling to the test set, using the scaling parameters computed on the training set.

Create Training and Test Sets and Apply Scaling

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Build Models

Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))
Accuracy of Logistic regression classifier on training set: 0.70
Accuracy of Logistic regression classifier on test set: 0.40

Decision Tree

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.73

K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
      .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
      .format(knn.score(X_test, y_test)))
Accuracy of K-NN classifier on training set: 0.95
Accuracy of K-NN classifier on test set: 1.00

Linear Discriminant Analysis

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
      .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
      .format(lda.score(X_test, y_test)))
Accuracy of LDA classifier on training set: 0.86
Accuracy of LDA classifier on test set: 0.67

Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
      .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
      .format(gnb.score(X_test, y_test)))
Accuracy of GNB classifier on training set: 0.86
Accuracy of GNB classifier on test set: 0.67

Support Vector Machine

from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
      .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
      .format(svm.score(X_test, y_test)))
Accuracy of SVM classifier on training set: 0.61
Accuracy of SVM classifier on test set: 0.33

The KNN algorithm was the most accurate model that we tried. The confusion matrix shows that no errors were made on the test set. However, the test set is very small.

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
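Because the test set is so small, a single train/test split can give a misleading estimate. One way to get a more stable picture, not part of the original post, is k-fold cross-validation; a minimal sketch using scikit-learn's cross_val_score with the unscaled X and y defined earlier:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Illustrative addition, not from the original post: the scaler is fit
# inside each fold's training portion, so no information leaks into the
# held-out fold.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
cv_scores = cross_val_score(pipe, X, y, cv=5)
print('Mean 5-fold CV accuracy: {:.2f} (+/- {:.2f})'.format(cv_scores.mean(), cv_scores.std()))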
Plot the Decision Boundary of the k-NN Classifier

import numpy as np
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap
from sklearn import neighbors

X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot_fruit_knn(X, y, n_neighbors, weights):
    # .as_matrix() was removed from pandas; .to_numpy() is the replacement
    X_mat = X[['height', 'width']].to_numpy()
    y_mat = y.to_numpy()

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])

    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y,
                cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='apple')
    patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
    patch2 = mpatches.Patch(color='#0000FF', label='orange')
    patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
    plt.legend(handles=[patch0, patch1, patch2, patch3])

    plt.xlabel('height (cm)')
    plt.ylabel('width (cm)')
    plt.title("4-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights))
    plt.show()

plot_fruit_knn(X_train, y_train, 5, 'uniform')
k_range = range(1, 20)
scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0, 5, 10, 15, 20])
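As a small addition (not in the original post), the best k can also be read off programmatically rather than from the plot:

import numpy as np

# Illustrative lines: index of the highest test accuracy, mapped back to its k
best_k = k_range[int(np.argmax(scores))]
print('Best k: {} with accuracy {:.2f}'.format(best_k, max(scores)))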
For this particular dataset, we obtain the highest accuracy when k = 5.

Summary

In this post, we focused on prediction accuracy. Our objective was to learn a model with good generalization performance; such a model maximizes prediction accuracy. We identified the machine learning algorithm that is best suited for the problem at hand (i.e. fruit type classification); to do so, we compared different algorithms and selected the best-performing one.