The Gaussian Mixture Model (GMM) is a popular unsupervised learning algorithm. It is similar to the K-Means clustering algorithm, but its richer model makes it more robust and therefore more useful. K-Means discovers clusters using the Euclidean distance to each centroid, which works well as long as the data is distributed in circular blobs around the centroids. But what if the data is non-linear? What if it has non-zero covariance? What if the clusters have different means and covariances? This is where the Gaussian Mixture Model comes in. GMM assumes the data is generated by a mixture of Gaussian distributions. Whereas K-Means makes hard assignments of data points to clusters (and assumes circular distributions around the centroids), GMM makes soft assignments: each point gets a probability of belonging to each cluster, which is a more informative description.
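To make the hard-versus-soft distinction concrete, here is a tiny sketch with made-up numbers (not part of the original walkthrough): a soft assignment keeps the full probability distribution over clusters, while a hard assignment keeps only the winning index.

import numpy as np

# Hypothetical responsibilities of one point for 3 clusters (GMM-style soft assignment)
soft = np.array([0.70, 0.25, 0.05])   # sums to 1: the point mostly belongs to cluster 0

# K-Means-style hard assignment keeps only the winner and discards the uncertainty
hard = np.argmax(soft)                # -> 0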
In short, GMM works better because: (A) it captures the uncertainty of data points belonging to different clusters through soft assignments, and (B) it has no bias toward circular clusters, so it works well even for non-linearly distributed data.

The objective of GMM is to maximize the likelihood of the data X, p(X), or equivalently the log-likelihood L (since the logarithm is a monotonically increasing function). By assuming the data is generated by a mixture of K Gaussians, we can write p(X) as a marginal probability, summing over the K clusters for every data point.

Likelihood:

p(X) = \prod_{i=1}^{N} \sum_{c=1}^{K} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)

Log-likelihood:

L = \sum_{i=1}^{N} \log \sum_{c=1}^{K} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)

Because the summation sits inside the logarithm, we cannot obtain an analytical solution. That looks nasty, but the problem has an elegant fix: the Expectation-Maximization (EM) algorithm.

Math

The EM algorithm is an iterative algorithm for finding the maximum likelihood estimate (MLE) of a model's parameters when they cannot be found directly. It consists of two steps:

1. Expectation step: compute the membership values r_ic, the probability that data point x_i belongs to cluster c:

r_{ic} = \frac{\pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}

2. Maximization step: compute a new quantity m_c = \sum_i r_{ic}, which measures the fraction of points assigned to cluster c, then update the parameters π, μ, Σ by computing the MLE for each cluster c:

\pi_c = \frac{m_c}{N}, \quad \mu_c = \frac{1}{m_c} \sum_{i=1}^{N} r_{ic}\, x_i, \quad \Sigma_c = \frac{1}{m_c} \sum_{i=1}^{N} r_{ic}\, (x_i - \mu_c)(x_i - \mu_c)^T

Repeat the E and M steps until the log-likelihood L converges.

Python Coding

Let's write a basic implementation of GMM from scratch in Python. Start by generating some 1D data:

import numpy as np

x = np.linspace(-5, 5, 20)
x1 = x*np.random.rand(20)
x2 = x*np.random.rand(20) + 10
x3 = x*np.random.rand(20) - 10
xt = np.hstack((x1, x2, x3))
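Before running EM it can help to eyeball the data. This quick sanity check (my addition, not part of the original walkthrough) plots a histogram of xt; three clumps around -10, 0, and 10 should be visible:

import matplotlib.pyplot as plt

plt.hist(xt, bins=30)
plt.xlabel('x')
plt.ylabel('count')
plt.show()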
Initialize the GMM parameters μ, π, Σ:

max_iterations = 10
pi = np.array([1/3, 1/3, 1/3])
mu = np.array([5, 6, -3])
var = np.array([1, 3, 9])
r = np.zeros((len(xt), 3))
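Since the objective is the log-likelihood above, it is useful to be able to evaluate it under the current parameters. This small helper is my addition (not part of the original code); it treats var as a variance, hence the sqrt when passing scipy's scale parameter, which expects a standard deviation:

from scipy.stats import norm

def log_likelihood(x, pi, mu, var):
    # L = sum_i log( sum_c pi_c * N(x_i | mu_c, var_c) )
    mixture = sum(p*norm(m, np.sqrt(v)).pdf(x) for p, m, v in zip(pi, mu, var))
    return np.sum(np.log(mixture))

print(log_likelihood(xt, pi, mu, var))  # this value should increase after every EM iteration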
Run the first iteration of the EM algorithm:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# scipy's norm takes the standard deviation, so pass sqrt(var)
gauss1 = norm(loc=mu[0], scale=np.sqrt(var[0]))
gauss2 = norm(loc=mu[1], scale=np.sqrt(var[1]))
gauss3 = norm(loc=mu[2], scale=np.sqrt(var[2]))

# E-Step: r[i, c] = pi_c * N(x_i | mu_c, var_c), then normalize over the clusters
for c, g, p in zip(range(3), [gauss1, gauss2, gauss3], pi):
    r[:, c] = p*g.pdf(xt[:])
for i in range(len(r)):
    r[i, :] /= np.sum(r[i, :])

# Plot the data points colored by their responsibilities, plus the three pdfs
fig = plt.figure(figsize=(10, 10))
ax0 = fig.add_subplot(111)
for i in range(len(r)):
    ax0.scatter(xt[i], 0, c=r[i, :], s=100)
for g, c in zip([gauss1.pdf(np.linspace(-15, 15)), gauss2.pdf(np.linspace(-15, 15)), gauss3.pdf(np.linspace(-15, 15))], ['r', 'g', 'b']):
    ax0.plot(np.linspace(-15, 15), g, c=c, zorder=0)
ax0.set_xlabel('X-axis')
ax0.set_ylabel('Gaussian pdf value')
ax0.legend(['Gaussian 1', 'Gaussian 2', 'Gaussian 3'])
plt.show()

# M-Step: update pi, mu, var with the weighted MLEs
mc = np.sum(r, axis=0)
pi = mc/len(xt)
mu = np.sum(r*np.vstack((xt, xt, xt)).T, axis=0)/mc
var = []
for c in range(len(pi)):
    # weighted variance: sum_i r_ic * (x_i - mu_c)^2 / m_c
    var.append(np.dot(r[:, c], (xt - mu[c])**2)/mc[c])
Now put this code inside a for loop and wrap it in a class:

class GMM1D:
    '''Apply GMM to 1D data'''

    def __init__(self, X, max_iterations):
        '''Initialize data and max_iterations'''
        self.X = X
        self.max_iterations = max_iterations

    def run(self):
        '''Initialize parameters mu, var, pi and alternate E and M steps'''
        self.pi = np.array([1/3, 1/3, 1/3])
        self.mu = np.array([5, 8, 1])
        self.var = np.array([5, 3, 1])
        r = np.zeros((len(self.X), 3))

        for itr in range(self.max_iterations):
            # scipy's norm takes the standard deviation, so pass sqrt(var)
            gauss1 = norm(loc=self.mu[0], scale=np.sqrt(self.var[0]))
            gauss2 = norm(loc=self.mu[1], scale=np.sqrt(self.var[1]))
            gauss3 = norm(loc=self.mu[2], scale=np.sqrt(self.var[2]))

            # E-Step
            for c, g, p in zip(range(3), [gauss1, gauss2, gauss3], self.pi):
                r[:, c] = p*g.pdf(self.X[:])
            for i in range(len(r)):
                r[i, :] /= np.sum(r[i, :])

            # Plot the current state of the model
            fig = plt.figure(figsize=(10, 10))
            ax0 = fig.add_subplot(111)
            for i in range(len(r)):
                ax0.scatter(self.X[i], 0, c=r[i, :], s=100)
            for g, c in zip([gauss1.pdf(np.linspace(-15, 15)), gauss2.pdf(np.linspace(-15, 15)), gauss3.pdf(np.linspace(-15, 15))], ['r', 'g', 'b']):
                ax0.plot(np.linspace(-15, 15), g, c=c, zorder=0)
            plt.show()

            # M-Step
            mc = np.sum(r, axis=0)
            self.pi = mc/len(self.X)
            self.mu = np.sum(r*np.vstack((self.X, self.X, self.X)).T, axis=0)/mc
            self.var = []
            for c in range(len(self.pi)):
                # weighted variance: sum_i r_ic * (x_i - mu_c)^2 / m_c
                self.var.append(np.dot(r[:, c], (self.X - self.mu[c])**2)/mc[c])

gmm = GMM1D(xt, 10)
gmm.run()
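One detail worth noting: run() always performs max_iterations EM updates, whereas the math section says to iterate until L converges. Here is a minimal sketch of that stopping rule, assuming the E and M updates above were factored into a hypothetical em_step(X, pi, mu, var) helper and reusing the log_likelihood helper sketched earlier:

def run_until_converged(X, pi, mu, var, max_iterations=100, tol=1e-6):
    # tol is an arbitrary choice; em_step is a hypothetical refactor of the E+M code above
    prev_L = -np.inf
    for _ in range(max_iterations):
        pi, mu, var = em_step(X, pi, mu, var)
        L = log_likelihood(X, pi, mu, var)
        if abs(L - prev_L) < tol:
            break
        prev_L = L
    return pi, mu, var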
We have now built and run a model on 1D data. The same principle applies in higher dimensions (≥ 2D); the only difference is that we use multivariate Gaussian distributions. Let's write the Python code for a 2D model, starting by generating some data:

from sklearn.datasets import make_blobs  # the samples_generator module was removed in newer sklearn
from scipy.stats import multivariate_normal

X, Y = make_blobs(cluster_std=1.5, random_state=20, n_samples=500, centers=3)
X = np.dot(X, np.random.RandomState(0).randn(2, 2))
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1])
plt.show()
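For reference, the density that multivariate_normal.pdf evaluates, and that replaces the 1D normal pdf in everything below, is:

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)

where d is the dimensionality of the data (d = 2 here), μ is the mean vector, and Σ is the covariance matrix.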
class GMM2D:
    '''Apply GMM to 2D data'''

    def __init__(self, num_clusters, max_iterations):
        '''Initialize num_clusters (K) and max_iterations for the model'''
        self.num_clusters = num_clusters
        self.max_iterations = max_iterations

    def run(self, X):
        '''Initialize parameters and run E and M steps, storing the log-likelihood value after every iteration'''
        self.X = X  # keep a reference so predict() can plot the training data
        self.pi = np.ones(self.num_clusters)/self.num_clusters
        self.mu = np.random.randint(min(X[:, 0]), max(X[:, 0]), size=(self.num_clusters, len(X[0])))
        self.cov = np.zeros((self.num_clusters, len(X[0]), len(X[0])))
        for n in range(len(self.cov)):
            np.fill_diagonal(self.cov[n], 5)

        # reg_cov is used for numerical stability, i.e. to avoid singularity issues in the covariance matrix
        self.reg_cov = 1e-6*np.identity(len(X[0]))

        x, y = np.meshgrid(np.sort(X[:, 0]), np.sort(X[:, 1]))
        self.XY = np.array([x.flatten(), y.flatten()]).T

        # Plot the data and the initial model
        fig0 = plt.figure(figsize=(10, 10))
        ax0 = fig0.add_subplot(111)
        ax0.scatter(X[:, 0], X[:, 1])
        ax0.set_title('Initial State')
        for m, c in zip(self.mu, self.cov):
            c = c + self.reg_cov
            multi_normal = multivariate_normal(mean=m, cov=c)
            ax0.contour(np.sort(X[:, 0]), np.sort(X[:, 1]), multi_normal.pdf(self.XY).reshape(len(X), len(X)), colors='black', alpha=0.3)
            ax0.scatter(m[0], m[1], c='grey', zorder=10, s=100)
        fig0.savefig('GMM2D Initial State.png')
        plt.show()

        self.log_likelihoods = []

        for iters in range(self.max_iterations):
            # E-Step
            self.ric = np.zeros((len(X), len(self.mu)))
            for pic, muc, covc, r in zip(self.pi, self.mu, self.cov, range(len(self.ric[0]))):
                covc = covc + self.reg_cov
                mn = multivariate_normal(mean=muc, cov=covc)
                self.ric[:, r] = pic*mn.pdf(X)
            for r in range(len(self.ric)):
                self.ric[r, :] = self.ric[r, :] / np.sum(self.ric[r, :])

            # M-Step
            self.mc = np.sum(self.ric, axis=0)
            self.pi = self.mc/np.sum(self.mc)
            self.mu = np.dot(self.ric.T, X) / self.mc.reshape(self.num_clusters, 1)
            self.cov = []
            for r in range(len(self.pi)):
                covc = 1/self.mc[r] * (np.dot((self.ric[:, r].reshape(len(X), 1)*(X - self.mu[r])).T, X - self.mu[r]) + self.reg_cov)
                self.cov.append(covc)
            self.cov = np.asarray(self.cov)

            # Log-likelihood: sum over the data points of the log of the mixture density
            likelihood = np.sum([self.pi[r]*multivariate_normal(self.mu[r], self.cov[r] + self.reg_cov).pdf(X) for r in range(len(self.pi))], axis=0)
            self.log_likelihoods.append(np.sum(np.log(likelihood)))

            fig1 = plt.figure(figsize=(10, 10))
            ax1 = fig1.add_subplot(111)
            ax1.scatter(X[:, 0], X[:, 1])
            ax1.set_title('Iteration ' + str(iters))
            for m, c in zip(self.mu, self.cov):
                c = c + self.reg_cov
                multi_normal = multivariate_normal(mean=m, cov=c)
                ax1.contour(np.sort(X[:, 0]), np.sort(X[:, 1]), multi_normal.pdf(self.XY).reshape(len(X), len(X)), colors='black', alpha=0.3)
                ax1.scatter(m[0], m[1], c='grey', zorder=10, s=100)
            fig1.savefig('GMM2D Iter ' + str(iters) + '.png')
            plt.show()

        # Plot the log-likelihood curve after training
        fig2 = plt.figure(figsize=(10, 10))
        ax2 = fig2.add_subplot(111)
        ax2.plot(range(len(self.log_likelihoods)), self.log_likelihoods)
        ax2.set_title('Log Likelihood Values')
        fig2.savefig('GMM2D Log Likelihood.png')
        plt.show()

    def predict(self, Y):
        '''Predict the cluster for the new samples in array Y'''
        predictions = []
        for pic, m, c in zip(self.pi, self.mu, self.cov):
            prob = pic*multivariate_normal(mean=m, cov=c).pdf(Y)
            predictions.append([prob])
        # each list entry holds the (unnormalized) probabilities of all samples for one cluster,
        # so transpose to get one row per sample before taking the argmax
        predictions = np.asarray(predictions).reshape(self.num_clusters, len(Y)).T
        predictions = np.argmax(predictions, axis=1)

        fig2 = plt.figure(figsize=(10, 10))
        ax2 = fig2.add_subplot(111)
        ax2.scatter(self.X[:, 0], self.X[:, 1], c='c')
        ax2.scatter(Y[:, 0], Y[:, 1], marker='*', c='k', s=150, label='New Data')
        ax2.set_title('Predictions on New Data')
        colors = ['r', 'b', 'g']
        for m, c, col, i in zip(self.mu, self.cov, colors, range(len(colors))):
            multi_normal = multivariate_normal(mean=m, cov=c)
            ax2.contour(np.sort(self.X[:, 0]), np.sort(self.X[:, 1]), multi_normal.pdf(self.XY).reshape(len(self.X), len(self.X)), colors='black', alpha=0.3)
            ax2.scatter(m[0], m[1], marker='o', c=col, zorder=10, s=150, label='Centroid ' + str(i+1))
        for i in range(len(Y)):
            ax2.scatter(Y[i, 0], Y[i, 1], marker='*', c=colors[predictions[i]], s=150)
        ax2.set_xlabel('X-axis')
        ax2.set_ylabel('Y-axis')
        ax2.legend()
        fig2.savefig('GMM2D Predictions.png')
        plt.show()
        return predictions
Let's make some predictions with this model:

y = np.random.randint(-10, 20, size=(12, 2))
gmm2d = GMM2D(num_clusters=3, max_iterations=10)
gmm2d.run(X)
gmm2d.predict(y)
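Note that predict returns hard labels via the argmax. If the soft posterior probabilities for the new points are wanted as well, a small sketch (my addition) would normalize the weighted densities before the argmax, mirroring the r_ic computation:

probs = np.asarray([pic*multivariate_normal(mean=m, cov=c).pdf(y)
                    for pic, m, c in zip(gmm2d.pi, gmm2d.mu, gmm2d.cov)]).T
probs /= probs.sum(axis=1, keepdims=True)  # each row now sums to 1, like r_ic for new data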
The same task can be done in a few lines of code with sklearn:

from sklearn.mixture import GaussianMixture

X, Y = make_blobs(cluster_std=1.5, random_state=20, n_samples=500, centers=3)
X = np.dot(X, np.random.RandomState(0).randn(2, 2))
GMM = GaussianMixture(n_components=3)
GMM.fit(X)
Y = np.random.randint(-10, 20, size=(1, 2))
print(GMM.means_, GMM.predict_proba(Y))

'''
Out:
[[ 19.88168663  17.47097164]
 [-12.83538784   4.89646199]
 [ 11.09673732  18.67548935]]
[[1.91500946e-17 9.30483496e-01 6.95165038e-02]]
'''
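sklearn also makes it easy to go a step further. When the number of clusters is not known in advance, a common recipe (sketched here with an assumed candidate range of 1 to 6) is to fit a model per candidate K and keep the one with the lowest Bayesian Information Criterion:

# Fit a GMM for each candidate K and compare BIC scores (lower is better)
bics = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gm.bic(X))
best_k = np.argmin(bics) + 1
print(best_k, bics)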
The GMM assigns the sample to the second cluster.

Conclusion

Implementing a Gaussian Mixture Model is not hard. Once the math is clear, EM finds the maximum likelihood estimates for the model, whether the data is one-dimensional or higher-dimensional. The method is robust and very useful for clustering tasks. Now that you are familiar with the Python implementation of GMM, you can take a dataset of your own and do some cool things with it.