Sunday 15 September 2013

python - sklearn decomposition top terms




Is there a way to determine the top features/terms for each cluster while the data is decomposed?

In the example from the sklearn documentation, the top terms are extracted by sorting the features and comparing them with the vectorizer's feature_names, both having the same number of features.

http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
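That non-decomposed approach can be sketched with toy data (the feature names and weights below are made up for illustration, not taken from the example):

```python
import numpy as np

# Toy stand-ins: 4 features, 2 clusters, made-up weights.
# In the real example these come from a fitted vectorizer and model.
feature_names = np.array(['alpha', 'beta', 'gamma', 'delta'])
centroids = np.array([[0.1, 0.9, 0.3, 0.2],
                      [0.4, 0.0, 0.8, 0.5]])

for i, row in enumerate(centroids):
    top2 = np.argsort(row)[-2:][::-1]   # indices of the 2 largest weights
    print(i, feature_names[top2])
```

This only works because the weight vector and feature_names have the same length, which is exactly what breaks once the data is decomposed.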

I don't know how to implement get_top_terms_per_cluster():

x = vectorizer.fit_transform(dataset)  # m features
x = lsa.fit_transform(x)               # cut down the number of features to m'
k_means.fit(x)
get_top_terms_per_cluster()            # out of the m features

Assuming lsa = TruncatedSVD(n_components=k) for some k, the obvious way to get term weights makes use of the fact that LSA/SVD is a linear transformation, i.e., each row of lsa.components_ is a weighted sum of the input terms, and you can multiply it with the cluster centroids from k-means.
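The linearity argument can be sanity-checked with plain NumPy, independent of sklearn (the sizes and random matrices here are toy stand-ins, not real LSA output):

```python
import numpy as np

m, k = 6, 2                            # m terms, k LSA components (toy sizes)
rng = np.random.default_rng(0)
components = rng.normal(size=(k, m))   # stand-in for lsa.components_
centroid = rng.normal(size=k)          # stand-in for one k-means centroid

# Mapping the centroid back to term space via the linear transform:
weights = centroid @ components        # shape (m,)

# Equivalent, term by term: the weight of term j is the centroid's dot
# product with column j of the components matrix.
manual = np.array([centroid @ components[:, j] for j in range(m)])

assert np.allclose(weights, manual)
```

So one matrix product per centroid recovers a weight for every original term.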

Let's set things up and train some models:

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import TruncatedSVD
>>> data = fetch_20newsgroups()
>>> vectorizer = TfidfVectorizer(min_df=3, max_df=.95, stop_words='english')
>>> lsa = TruncatedSVD(n_components=10)
>>> km = KMeans(n_clusters=3)
>>> x = vectorizer.fit_transform(data.data)
>>> x_lsa = lsa.fit_transform(x)
>>> km.fit(x_lsa)

Now multiply the LSA components and the k-means cluster centroids:

>>> import numpy as np
>>> x.shape
(11314, 38865)
>>> lsa.components_.shape
(10, 38865)
>>> km.cluster_centers_.shape
(3, 10)
>>> weights = np.dot(km.cluster_centers_, lsa.components_)
>>> weights.shape
(3, 38865)

Then print the top terms; we need to take the absolute values of the weights because of the sign indeterminacy in LSA:

>>> features = vectorizer.get_feature_names()
>>> weights = np.abs(weights)
>>> for i in range(km.n_clusters):
...     top5 = np.argsort(weights[i])[-5:]
...     print(zip([features[j] for j in top5], weights[i, top5]))
...
[(u'escrow', 0.042965734662740895), (u'chip', 0.07227072329320372), (u'encryption', 0.074855609122467345), (u'clipper', 0.075661844826553887), (u'key', 0.095064798549230306)]
[(u'posting', 0.012893125486957332), (u'article', 0.013105911161236845), (u'university', 0.0131617377000081), (u'com', 0.023016036009601809), (u'edu', 0.034532489348082958)]
[(u'don', 0.02087448155525683), (u'com', 0.024327099321009758), (u'people', 0.033365757270264217), (u'edu', 0.036318114826463417), (u'god', 0.042203130080860719)]

Mind you, you really need a stop word filter for this to work. The stop words tend to end up in every single component, and thus get a high weight in every cluster centroid.
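If a vectorizer was already fitted without stop word filtering, the stop words can also be masked after the fact; a minimal sketch, with a tiny illustrative stop list and made-up weights (not sklearn's built-in English list, and not real corpus weights):

```python
import numpy as np

stop_words = {'the', 'and', 'of', 'to', 'in'}   # tiny illustrative subset
features = ['the', 'key', 'and', 'encryption', 'clipper', 'of']
weights = np.array([0.9, 0.095, 0.8, 0.074, 0.075, 0.7])

# Drop stop words, then rank the remaining terms by weight.
kept = [(f, w) for f, w in zip(features, weights) if f not in stop_words]
kept.sort(key=lambda fw: fw[1], reverse=True)
print(kept)  # stop words dominate the raw weights, but are filtered out here
```

Passing stop_words='english' to the vectorizer up front, as in the session above, is the cleaner fix.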

python scikit-learn
