Clustering News Documents with Python (Latent Semantic Analysis)

東西二王 2019-05-04


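In this article I explain how to use latent semantic analysis (LSA) to cluster a collection of news articles and find similar documents among them.

LSA is an NLP technique for uncovering the hidden concepts, or topics, in a set of documents.

Reading the data

First, import the Python libraries we need: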

import sys
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
# nltk.download('stopwords')  # run once to fetch the stop-word list
from nltk.corpus import stopwords
# from bs4 import BeautifulSoup as Soup


My sample data: [screenshot in the original post]

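The following Python code loads the data and stores it in a list of strings. This part depends entirely on the format of your data. The rest of the code in this article continues inside the same function.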

def parseLog(file):
    # Each line of the input file is one JSON object
    with open(file) as f:
        data = [json.loads(line) for line in f if line.strip()]
    # print(data)

    # preprocessing ////////////////////////////////
    content_list = []
    for article in data:
        string_content = ''
        if 'contents' in article:
            for block in article['contents']:
                if 'content' in block:
                    # print(str(block['content']))
                    string_content = string_content + str(block['content'])
            content_list.append(string_content)


content_list now holds the full data set as a list of strings, one per article. If there are 45,000 articles, content_list has 45,000 strings.
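To make the expected input shape concrete, here is a small sketch of the same extraction logic run on two made-up records. Only the 'contents' and 'content' keys come from the loader above; the record values, the "id" and "type" fields, and the extract_text helper are hypothetical. Unlike the loader, this helper returns an empty string for a record with no 'contents' key instead of skipping it.

```python
import io
import json

# Two made-up records, one JSON object per line, each with a 'contents' list.
sample = io.StringIO(
    '{"id": "a1", "contents": [{"content": "Senate passes "}, '
    '{"type": "image"}, {"content": "budget bill"}]}\n'
    '{"id": "a2", "contents": [{"content": "Local team wins final"}]}\n'
)

def extract_text(record):
    """Concatenate every 'content' field found in record['contents']."""
    blocks = record.get("contents", [])
    return "".join(str(b["content"]) for b in blocks if "content" in b)

records = [json.loads(line) for line in sample if line.strip()]
content_list = [extract_text(r) for r in records]
print(content_list)
```

Blocks without a 'content' field, such as the image entry in the first record, are simply skipped.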

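Data preprocessing

Now we apply some preprocessing with pandas. First, we clean the text as much as possible. The idea is to remove punctuation, digits and special characters in one step with the regex replace('[^a-zA-Z#]', ' '), which replaces every character other than a letter or '#' with a space. Next we drop short words (three characters or fewer in the code below), since they usually carry little useful information. Finally, we convert all text to lowercase.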

    news_df = pd.DataFrame({'document': content_list})
    # replace everything except letters and '#' with a space
    # (regex=True is needed since pandas 2.0, where str.replace matches literally by default)
    news_df['clean_doc'] = news_df['document'].str.replace('[^a-zA-Z#]', ' ', regex=True)
    # removing null fields
    news_df = news_df[news_df['clean_doc'].notnull()]
    # removing short words
    news_df['clean_doc'] = news_df['clean_doc'].apply(
        lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
    # make all text lowercase
    news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())
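The effect of these three cleaning steps is easier to see on a single string. The sketch below applies the same regex, length filter and lowercasing with the standard re module instead of pandas; the clean helper and the sample sentence are made up for illustration.

```python
import re

def clean(text):
    # Replace every character that is not a letter or '#' with a space.
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Drop words of three characters or fewer, then lowercase the rest.
    return " ".join(w for w in text.split() if len(w) > 3).lower()

raw = "The U.S. added 250,000 jobs in 2018, the #BLS said."
cleaned = clean(raw)
print(cleaned)
```

Note that '#' counts toward a word's length, so a short hashtag such as "#BLS" survives the filter while "The" and "in" do not.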


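Next we remove stop words from the data. I start from NLTK's English stop-word list and extend it with a few tokens specific to this data set. Stop words are words such as "a", "the" or "in" that carry little meaning on their own.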

    stop_words = stopwords.words('english')
    # extra tokens left over from HTML markup and site boilerplate
    stop_words.extend(['span', 'class', 'spacing', 'href', 'html', 'http',
                       'title', 'stats', 'washingtonpost'])
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    # tokenization
    tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
    # remove stop-words
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_set])
    # print(tokenized_doc)
    # de-tokenization: join each token list back into one string per document
    detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]
    # print(detokenized_doc)
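As a quick illustration of the tokenize, filter and re-join steps, here is a sketch on two toy documents. The stop-word set is a tiny hand-picked stand-in for NLTK's English list, and the documents are made up.

```python
# A tiny hand-picked stop-word set standing in for NLTK's English list.
stop_words = {"this", "that", "with", "from", "have", "span", "href"}

docs = ["span this senate vote with budget", "that team have from final"]

tokenized = [d.split() for d in docs]                                  # tokenization
filtered = [[w for w in toks if w not in stop_words] for toks in tokenized]
detokenized = [" ".join(toks) for toks in filtered]                    # de-tokenization
print(detokenized)
```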


Building the document-term matrix with TF-IDF

Our data is now ready. We will use scikit-learn's TfidfVectorizer to build a document-term matrix with at most 10,000 terms (max_features=10000).

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tfidf vectorizer of scikit-learn
    vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=10000,
                                 max_df=0.5, use_idf=True, ngram_range=(1, 3))
    X = vectorizer.fit_transform(detokenized_doc)
    print(X.shape)  # check shape of the document-term matrix
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    # print(terms)
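For readers who want to see what a TF-IDF weight is, the following sketch computes one row of such a matrix by hand. It assumes the weighting that scikit-learn documents as its default (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). The toy documents and the tfidf helper are illustrative only, and n-grams, max_df and max_features are left out.

```python
import math
from collections import Counter

docs = ["senate vote budget", "senate election", "team wins final"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    counts = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

"senate" appears in two of the three documents, so it gets a lower weight than "vote" or "budget", which each appear in only one.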


ngram_range=(1, 3) tells the vectorizer to use unigrams, bigrams and trigrams as terms.
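A minimal sketch of what that range produces for one short phrase. The word_ngrams helper is hypothetical and only mimics the idea; it is not the vectorizer's own tokenizer.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("white house press secretary".split())
print(grams)
```

Four words yield four unigrams, three bigrams and two trigrams, which is why max_features is useful for keeping the vocabulary bounded.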

This document-term matrix is used both for LSA and as the input to k-means, which clusters the documents.

Clustering the text documents with k-means

In this step we cluster the text documents with the k-means algorithm.

    from sklearn.cluster import KMeans

    num_clusters = 10
    km = KMeans(n_clusters=num_clusters)
    km.fit(X)
    clusters = km.labels_.tolist()  # one cluster label per document


clusters will be used for plotting. It is a list with one label per document; the labels are integers from 0 to 9 and assign each document to one of the 10 clusters.
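To show where such labels come from, here is a bare-bones sketch of Lloyd's algorithm, the iteration behind k-means, on six 2-D points. It is a simplification: it starts from the first k points, whereas scikit-learn's KMeans uses a smarter initialization by default, and the kmeans function and the data are made up for illustration.

```python
def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns one label in 0..k-1 per point."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, k=2)
print(labels)
```

The labels start at 0, just like km.labels_ in the article's code.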

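Topic modeling

The next step is to represent every term and every document as a vector. We take the document-term matrix and decompose it into several matrices.

We will use scikit-learn's randomized_svd for the matrix decomposition. You need some familiarity with LSA and singular value decomposition (SVD) to follow the next part.

In the definition of SVD, the original matrix A ≈ UΣV*, where U and V have orthonormal columns and Σ is a diagonal matrix with non-negative entries.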
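These properties can be checked numerically on a toy matrix. The sketch below uses NumPy's exact np.linalg.svd rather than the randomized routine used in this article, and the matrix values are made up; it also shows the rank-k truncation that makes the relation approximate rather than exact.

```python
import numpy as np

# Toy 4x3 "document-term" matrix (values are made up).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 4.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# With all singular values kept, the product reproduces A up to rounding.
reconstructed = U @ np.diag(S) @ Vt

# Keeping only the k largest singular values gives the rank-k approximation
# that LSA works with.
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print(S)
print(np.round(A_k, 2))
```

The approximation error of A_k, measured in the Frobenius norm, equals the discarded singular value S[2].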

    from sklearn.decomposition import TruncatedSVD
    from sklearn.utils.extmath import randomized_svd

    # SVD represents documents and terms as vectors
    U, Sigma, VT = randomized_svd(X, n_components=10, n_iter=100, random_state=122)

    # Alternative using the estimator API:
    # svd_model = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=100, random_state=122)
    # svd_model.fit(X)
    # print(U.shape)

    # Print the seven highest-weighted terms of each concept
    for i, comp in enumerate(VT):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
        print('Concept ' + str(i) + ': ')
        for t in sorted_terms:
            print(t[0])
        print(' ')


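Here U, Sigma and VT are the three factors obtained by decomposing the matrix X. U is the document-concept matrix, VT is the concept-term matrix (one row per concept, one column per term), and Sigma holds the singular values that weight each concept; randomized_svd returns it as a 1-D array rather than a diagonal matrix.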
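The shapes and roles of the three factors are easier to see on a toy example. This sketch uses np.linalg.svd truncated to two concepts as a stand-in for randomized_svd; the terms and counts are made up. It ranks terms by absolute loading, a deliberate difference from the signed sort used in this article, because the sign of a singular vector is arbitrary.

```python
import numpy as np

terms = ["senate", "vote", "goal", "match"]
# Rows are documents, columns follow `terms` (toy counts).
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0],
              [0.0, 0.0, 3.0, 3.0]])

U, Sigma, VT = np.linalg.svd(X, full_matrices=False)
U, Sigma, VT = U[:, :2], Sigma[:2], VT[:2, :]  # keep 2 concepts

# documents x concepts, concepts, concepts x terms
print(U.shape, Sigma.shape, VT.shape)

top_terms = []
for i, comp in enumerate(VT):
    # Indices of the two terms with the largest absolute loading.
    top = [terms[j] for j in np.argsort(-np.abs(comp))[:2]]
    top_terms.append(top)
    print("Concept", i, top)
```

One concept groups the politics terms and the other the sports terms, which is the kind of structure LSA is meant to surface.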

The code above uses 10 concepts/topics (n_components=10) and prints the top terms of each one. Sample concepts: [screenshot in the original post]

Topic visualization

To see how distinct our topics are, we should visualize them. We cannot plot more than three dimensions, but techniques such as PCA and t-SNE can project high-dimensional data into a lower-dimensional space. Here we will use a relatively new technique called UMAP (Uniform Manifold Approximation and Projection).

    import umap

    # Scale each document's concept coordinates by the singular values
    X_topics = U * Sigma
    embedding = umap.UMAP(n_neighbors=100, min_dist=0.5, random_state=12).fit_transform(X_topics)

    plt.figure(figsize=(7, 5))
    plt.scatter(embedding[:, 0], embedding[:, 1],
                c=clusters,        # color each point by its k-means cluster
                s=10,              # marker size
                edgecolor='none')
    plt.show()


if __name__ == '__main__':
    parseLog(sys.argv[1])


Here I passed c=clusters, which colors each document's point according to its cluster.
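One detail of the plotting code that may not be obvious is the line X_topics = U * Sigma. Because Sigma is a 1-D array, NumPy broadcasts it across the rows of U, which scales each concept column by its singular value. The sketch below checks, with made-up numbers, that this matches the explicit product with a diagonal matrix.

```python
import numpy as np

# Made-up factors: 3 documents, 2 concepts.
U = np.array([[0.6, 0.8],
              [0.8, -0.6],
              [0.0, 0.0]])
Sigma = np.array([5.0, 2.0])

X_topics = U * Sigma          # broadcasting: column j is multiplied by Sigma[j]
same = U @ np.diag(Sigma)     # explicit product with diag(Sigma)

print(X_topics)
```

Each row of X_topics is a document's coordinates in concept space, and these rows are what UMAP projects to two dimensions.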

Output for 2,500 news articles: [scatter plot in the original post]

Output for 10,000 news articles: [scatter plot in the original post]

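Each point represents one document, and the colors mark the clusters found by k-means. Our LSA model appears to do a good job. Feel free to change the UMAP parameters to see how the shape of the plot changes.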
