機(jī)器學(xué)習(xí)

imelee 2018-03-01

展開全文

1. SVD

1.1 分解

如下圖，一個(gè)矩陣可以分解為兩個(gè)方陣和一個(gè)對(duì)角矩陣的乘積：

C = m * n；u = m * m；sigma = m * n；v' = n * n

1.2 奇異值

sigma是一個(gè)對(duì)角矩陣，但通常不是方陣。sigma的對(duì)角元素被稱為奇異值，與特征值類似。因此與PCA類似，我們可以取sigma中最大的k個(gè)，來簡(jiǎn)化數(shù)據(jù)：

u' = m * k；sigma' = k * k；v'' = k * v

1.3 重構(gòu)C矩陣

利用新的三個(gè)矩陣u'，sigma'，v''相乘仍然得到一個(gè)m * n的矩陣。如果你選擇的k個(gè)奇異值所占的所有奇異值比例足夠大，那么新得到的m * n的矩陣將與C非常接近。

2. SVD實(shí)踐 - 矩陣壓縮

[python] view plain copy

print?

# -*- coding: utf-8 -*-
"""
arr =
[[0, 0, 0, 2, 2],
[0, 0, 0, 3, 3],
[0, 0, 0, 1, 1],
[1, 1, 1, 0, 0],
[2, 2, 2, 0, 0],
[5, 5, 5, 0, 0],
[1, 1, 1, 0, 0]]
u = 7 * 7
sigma = 7 * 5, 只返回了對(duì)角元素, 其余0元素被省略
V = 5 * 5
"""
import numpy as np
arr = np.array([[0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1],
[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [5, 5, 5, 0, 0], [1, 1, 1, 0, 0]])
# 1. 分解
u, sigma, v = np.linalg.svd(arr)
# 2. 重構(gòu)
new_arr = np.mat(u[:, 0:2]) * np.mat(np.diag(sigma[0:2])) * np.mat(v[0:2, :])

new_arr與arr非常接近，幾乎相等。這其實(shí)是類似于圖像壓縮，只保留圖像分解后的兩個(gè)方陣和一個(gè)對(duì)角陣的對(duì)角元素，就可以恢復(fù)原圖像。

3. SVD實(shí)踐 - 數(shù)據(jù)降維

之所以能進(jìn)行數(shù)據(jù)降維，原理與PCA一樣，SVD計(jì)算出的三個(gè)矩陣對(duì)應(yīng)的是：

u：CC'的特征向量矩陣； sigma：奇異值矩陣，其中每個(gè)元素為特征值開方； v'：C'C的特征向量矩陣。

如這篇文章所述主成分分析，你會(huì)發(fā)現(xiàn)sigma與C'C恰好是主成分分析所需要的兩個(gè)量。因此SVD降維與PCA是一致的，尤其是事先對(duì)數(shù)據(jù)進(jìn)行了中心化，再奇異值分解，則PCA降維和SVD降維完全一樣。

利用SVD來實(shí)現(xiàn)PCA的代碼：

[python] view plain copy

print?

# -*- coding: utf-8 -*-
""" svd應(yīng)用2 - 降維 """
import numpy as np
import matplotlib.pyplot as plt
class PCA:
""" 通過SVD分解來實(shí)現(xiàn)PCA
1. 訓(xùn)練數(shù)據(jù)train_x必須一行代表一個(gè)樣本, 一列代表一個(gè)特征
2. 能夠同時(shí)壓縮train_x的行和列
3. 可以選擇在壓縮前, 是否對(duì)數(shù)據(jù)進(jìn)行中心化
"""
def __init__(self, dimension, centered=True, compression="cols"):
"""
dimension: 降維后的維度
centered: 是否事先對(duì)數(shù)據(jù)進(jìn)行中心化
compression: 壓縮行, 還是壓縮列
"""
self.dimension = dimension
self.centered = centered
self.compression = compression
def _centered(self, train_x):
""" 數(shù)據(jù)中心化 """
return train_x - np.mean(train_x, axis=0)
def _svd(self, train_x):
""" 奇異值分解 """
return np.linalg.svd(train_x)
def transform(self, train_x):
""" 數(shù)據(jù)轉(zhuǎn)化(降維)
train_x: 訓(xùn)練數(shù)據(jù), 一行代表一個(gè)樣本
u, sigma, v: 奇異值分解結(jié)果
result: 降維后的數(shù)據(jù)
"""
# 1. 數(shù)據(jù)中心化
if self.centered == True:
train_x = self._centered(train_x)
# 2. 奇異值分解
u, sigma, v = self._svd(train_x)
v = v.T
# 3. 降維
if self.compression == "cols":
result = np.dot(train_x, v[:, 0:self.dimension])
elif self.compression == "rows":
result = np.dot(u[:, 0:self.dimension], train_x[0:self.dimension, :])
else:
raise(Exception("parameter error."))
return result

3.1 壓縮行 - 壓縮記錄

SVD分解得到的三個(gè)矩陣分別稱為：左奇異向量，奇異值矩陣，右奇異向量。左奇異向量用于壓縮行，右奇異向量壓縮列。壓縮方法均是取奇異值較大的左奇異向量或者右奇異向量與原數(shù)據(jù)C相乘。

3.2 壓縮列 - 壓縮特征

[python] view plain copy

print?

def load_data():
with open("../SVD/data/Iris.txt", "r") as f:
iris = []
for line in f.readlines():
temp = line.strip().split(",")
if temp[4] == "Iris-setosa":
temp[4] = 0
elif temp[4] == "Iris-versicolor":
temp[4] = 1
elif temp[4] == "Iris-virginica":
temp[4] = 2
else:
raise(Exception("data error."))
iris.append(temp)
iris = np.array(iris, np.float)
return iris
def draw_result(new_trainX, iris):
"""
new_trainX: 降維后的數(shù)據(jù)
iris: 原數(shù)據(jù)
"""
plt.figure()
# Iris-setosa
setosa = new_trainX[iris[:, 4] == 0]
plt.scatter(setosa[:, 0], setosa[:, 1], color="red", label="Iris-setosa")
# Iris-versicolor
versicolor = new_trainX[iris[:, 4] == 1]
plt.scatter(versicolor[:, 0], versicolor[:, 1], color="orange", label="Iris-versicolor")
# Iris-virginica
virginica = new_trainX[iris[:, 4] == 2]
plt.scatter(virginica[:, 0], virginica[:, 1], color="blue", label="Iris-virginica")
plt.legend()
plt.show()
def main(dimension, centered, compression):
# 導(dǎo)入數(shù)據(jù)
iris = load_data()
# 降維
clf = PCA(2, centered, compression)
new_iris = clf.transform(iris[:, 0:4])
# 降維結(jié)果可視化
draw_result(new_iris, iris)

數(shù)據(jù)進(jìn)行中心化后降維的結(jié)果，與PCA一文結(jié)果一致：

數(shù)據(jù)不進(jìn)行中心化的結(jié)果為：

4. SVD實(shí)踐 - 協(xié)同過濾

協(xié)同過濾包含基于用戶的協(xié)同過濾，基于物品的協(xié)同過濾。這兩種方式本身是不需要SVD就可以進(jìn)行的；但是加入了SVD之后，可以減少計(jì)算量同時(shí)還能提高推薦效果。

4.1 基于用戶的協(xié)同過濾

比如補(bǔ)充下表當(dāng)中Jim對(duì)日式雞排，壽司的打分：

	鰻魚飯	日式炸雞排	壽司	烤牛肉	手撕豬肉
Jim	2	0	0	4	4
John	5	5	5	3	3
sally	2	4	2	1	2
Tom	1	1	1	5	5

可以直接計(jì)算Jim與其余三個(gè)用戶的相似度，然后選最相似的樣本來為Jim的兩個(gè)空位打分。但是這樣，如果一旦樣本、特征過多，計(jì)算量就猛增。而事實(shí)上，我們不一定需要那么多特征，因此可以使用SVD分解把樣本映射到低維空間。（事實(shí)上，容易能從數(shù)據(jù)中看出來映射2維空間，左邊三個(gè)和右邊兩個(gè)明顯不一樣）

[python] view plain copy

print?

food = np.mat([[2, 0, 0, 4, 4], [5, 5, 5, 3, 3], [2, 4, 2, 1, 2], [1, 1, 1, 5, 4]])
u, sigma, v = np.linalg.svd(food)
simple_food = np.mat(u[:, 0:2]) * np.mat(np.diag(sigma[0:2])) * np.mat(v[0:2, :])

5. SVD計(jì)算過程

假設(shè)原數(shù)據(jù)為X，一行代表一個(gè)樣本，列代表特征。

1）計(jì)算X'X，XX'；

2）對(duì)XX'進(jìn)行特征值分解，得到的特征向量組成u，lambda_u;

3）對(duì)X'X進(jìn)行特征值分解，得到的特征向量組成v，lambda_v;

4）lambda_u，lambda_v的重復(fù)元素開方組成對(duì)角矩陣sigma主對(duì)角線上的元素；

一個(gè)詳細(xì)的例子在這里：http://download.csdn.net/download/zk_j1994/9927957

參考文獻(xiàn)

http://www.cnblogs.com/pinard/p/6251584.html

本站是提供個(gè)人知識(shí)管理的網(wǎng)絡(luò)存儲(chǔ)空間，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點(diǎn)。請(qǐng)注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購買等信息，謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容，請(qǐng)點(diǎn)擊一鍵舉報(bào)。

轉(zhuǎn)藏 分享

QQ空間 QQ好友新浪微博微信

獻(xiàn)花（0） +1

來自： imelee > 《深度學(xué)習(xí)》

舉報(bào)/認(rèn)領(lǐng)