python - Use sklearn to find string similarity between two texts within a large group of documents
Given a big set of documents (book titles, for example), how can I compare two book titles that are not in the original set of documents, without recomputing the entire tf-idf matrix?

For example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["the bluish eagle has landed",
               "i fly eagle moon",
               "this not how should fly",
               "fly me moon , allow me sing among stars",
               "how can fly eagle",
               "fixing cars , repairing stuff",
               "and bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles)
To check the similarity between the first and second book titles, one would do
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
and so on. Note that the tf-idf is calculated with respect to all entries in the matrix, so a term's weight depends on the number of times the token appears across the whole corpus.
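As an aside, cosine_similarity also accepts the whole matrix in a single call, producing every pairwise similarity at once. A minimal sketch using the tfidf_matrix defined above:

# Entry [i, j] is the cosine similarity between title i and title j.
all_similarities = cosine_similarity(tfidf_matrix)
print(all_similarities.shape)  # (7, 7) for the seven titles above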
Now suppose two titles, title1 and title2, that are not in the original set of book titles, should be compared. The two titles can be appended to the book_titles collection and compared afterwards, so that the word "rum", for example, is counted including the occurrence in the previous corpus:
title1="the book of rum" title2="fly safely bottle of rum" book_titles.append(title1, title2) tfidf_matrix = vectorizer.fit_transform(book_titles) index = tfidf_matrix.shape()[0] cosine_similarity(tfidf_matrix[index-3:index-2], tfidf_matrix[index-2:index-1])
This is impractical and slow if the set of documents grows large or needs to be stored out of memory. What can be done in this case? If I compare only title1 and title2 on their own, the previous corpus is not used.
Why append them to the list and recompute everything? Just do
new_vectors = vectorizer.transform([title1, title2])
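transform applies the vocabulary and idf weights already learned during fit, so nothing is recomputed and the corpus matrix is untouched. new_vectors can then be compared directly, or against the existing corpus matrix. A minimal sketch, assuming the vectorizer and tfidf_matrix from the question are still in scope:

# Similarity between the two new titles, using corpus-learned idf weights:
print(cosine_similarity(new_vectors[0:1], new_vectors[1:2]))

# Similarity between a new title and every title in the original corpus:
print(cosine_similarity(new_vectors[0:1], tfidf_matrix))

One caveat: tokens that never appeared in the fitted corpus (such as "book" in title1 here) have no learned idf weight, so transform simply drops them.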
Tags: python, scikit-learn