Thursday, 15 August 2013

python - Use sklearn to find string similarity between two texts with large group of documents




Given a big set of documents (book titles, for example), how can I compare two book titles that are not in the original set of documents, without recomputing the entire tf-idf matrix?

For example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
book_titles = ["the bluish eagle has landed",
               "i fly eagle moon",
               "this not how should fly",
               "fly me moon , allow me sing among stars",
               "how can fly eagle",
               "fixing cars , repairing stuff",
               "and bottle of rum"]
vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles)

To check the similarity between the first and second book titles, one would do

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

and so on. Here the tf-idf values are calculated with respect to all the entries in the matrix, so each weight depends on how often a token appears across the whole corpus.
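To make the dependence on the corpus concrete, the sketch below fits the same vectorizer and reads back the idf weight it learned for a few tokens. It only uses the fitted attributes `vocabulary_` and `idf_`; the `corpus` and `idf` names are illustrative and not part of the original question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same titles as in the question (renamed "corpus" here for illustration).
corpus = ["the bluish eagle has landed",
          "i fly eagle moon",
          "this not how should fly",
          "fly me moon , allow me sing among stars",
          "how can fly eagle",
          "fixing cars , repairing stuff",
          "and bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(corpus)

# vocabulary_ maps each token to its column; idf_ holds one weight per column.
idf = {token: vectorizer.idf_[col]
       for token, col in vectorizer.vocabulary_.items()}

# Tokens that occur in fewer titles receive a larger idf weight.
for token in ("fly", "eagle", "rum"):
    print(token, float(idf[token]))
```

Because these weights are stored on the fitted vectorizer, they stay fixed until `fit` or `fit_transform` is called again.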

Let's say two titles should be compared, title1 and title2, that are not in the original set of book titles. The two titles can be added to the book_titles collection and compared afterwards, so that the word "rum", for example, is counted together with its occurrence in the previous corpus:

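title1 = "the book of rum"
title2 = "fly safely bottle of rum"
# extend() adds both titles; append() accepts only a single item
book_titles.extend([title1, title2])
# refit on the enlarged corpus (this is the expensive step)
tfidf_matrix = vectorizer.fit_transform(book_titles)
# shape is an attribute, not a method
index = tfidf_matrix.shape[0]
# the two new titles are the last two rows of the matrix
cosine_similarity(tfidf_matrix[index-2:index-1], tfidf_matrix[index-1:index])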

This is impractical and slow if the set of documents grows very big or needs to be stored out of memory. What can be done in this case? If I fit on title1 and title2 alone, the previous corpus is not used.

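Why append them to the list and recompute everything? The fitted vectorizer already holds the vocabulary and idf weights learned from the corpus, so just do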

new_vectors = vectorizer.transform([title1, title2])
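Put together, the suggestion might look like the following minimal sketch. It fits once on the corpus, then calls `transform` on the two unseen titles and compares the resulting rows. The variable name `similarity` is illustrative. Note that tokens that were not seen during fitting (such as "book" or "safely" here) are ignored by `transform`, since they have no column in the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["the bluish eagle has landed",
               "i fly eagle moon",
               "this not how should fly",
               "fly me moon , allow me sing among stars",
               "how can fly eagle",
               "fixing cars , repairing stuff",
               "and bottle of rum"]

# Fit once: this learns the vocabulary and the idf weights from the corpus.
vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles)

title1 = "the book of rum"
title2 = "fly safely bottle of rum"

# transform() reuses the fitted vocabulary and idf weights; nothing is refit.
new_vectors = vectorizer.transform([title1, title2])

# Compare the two new rows directly; the result is a 1x1 array.
similarity = cosine_similarity(new_vectors[0:1], new_vectors[1:2])[0, 0]
print(similarity)
```

The original `tfidf_matrix` is left untouched, so the cost of comparing new titles does not grow with the size of the corpus.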

python scikit-learn
