Breedlove: python - Use sklearn to find string similarity between two texts with large group of documents -

Thursday, 15 August 2013

Given a big set of documents (book titles, for example), how can I compare two book titles that are not in the original set of documents, without recomputing the entire tf-idf matrix?

For example,

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    book_titles = ["the bluish eagle has landed",
                   "i fly eagle moon",
                   "this not how should fly",
                   "fly me moon, allow me sing among stars",
                   "how can fly eagle",
                   "fixing cars, repairing stuff",
                   "and bottle of rum"]

    # Fit the vocabulary and idf weights on the corpus, then vectorize it
    vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
    tfidf_matrix = vectorizer.fit_transform(book_titles)

To check the similarity between the first and second book titles, one would do

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

and so on. This assumes that tf-idf is calculated with respect to all entries in the matrix, so each token's weight depends on how often that token appears across the whole corpus.
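To make the corpus-wide weighting concrete, here is a minimal sketch that prints the idf weight the vectorizer learns for each term. It assumes the same `book_titles` list and vectorizer settings as in the question. The name `idf_by_term` is only an illustrative helper and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

book_titles = ["the bluish eagle has landed",
               "i fly eagle moon",
               "this not how should fly",
               "fly me moon, allow me sing among stars",
               "how can fly eagle",
               "fixing cars, repairing stuff",
               "and bottle of rum"]

vectorizer = TfidfVectorizer(stop_words="english", norm="l2", sublinear_tf=True)
vectorizer.fit(book_titles)  # learns the vocabulary and the idf weights

# Map each vocabulary term to its learned idf weight (illustrative helper dict).
idf_by_term = {term: vectorizer.idf_[col]
               for term, col in vectorizer.vocabulary_.items()}

# Terms found in many titles (such as "fly") get a lower idf than rare ones
# (such as "rum"), so the weight shrinks as document frequency grows.
for term, idf in sorted(idf_by_term.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {idf:.3f}")
```

These weights are stored on the fitted vectorizer, so they stay available after fitting without touching the corpus again.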

Let's say two titles should now be compared, title1 and title2, that are not in the original set of book titles. The two titles can be added to the book_titles collection and compared afterwards, so that the word "rum", for example, is counted including the one in the previous corpus:

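    title1 = "the book of rum"
    title2 = "fly safely bottle of rum"

    # list.append takes a single item, so use extend to add both titles
    book_titles.extend([title1, title2])
    tfidf_matrix = vectorizer.fit_transform(book_titles)

    # shape is an attribute, not a method; the two new titles are the last two rows
    index = tfidf_matrix.shape[0]
    cosine_similarity(tfidf_matrix[index-2:index-1], tfidf_matrix[index-1:index])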

This is impractical and slow if the documents grow very big or need to be stored out of memory. What can be done in this case? If I compare only title1 and title2 with each other, the previous corpus is not used.

Why do you append them to the list and recompute everything? Just do

new_vectors = vectorizer.transform([title1, title2])
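Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes the vectorizer is fitted once on the original `book_titles`. The variable `similarity` is just an illustrative name. `transform` reuses the vocabulary and idf weights learned during fitting, so the original corpus still influences the weights, and words that were not seen during fitting (here "book" and "safely") are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["the bluish eagle has landed",
               "i fly eagle moon",
               "this not how should fly",
               "fly me moon, allow me sing among stars",
               "how can fly eagle",
               "fixing cars, repairing stuff",
               "and bottle of rum"]

# Fit once on the original corpus.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2", sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles)

title1 = "the book of rum"
title2 = "fly safely bottle of rum"

# Vectorize only the two new titles with the already-fitted vocabulary and idf.
new_vectors = vectorizer.transform([title1, title2])

# cosine_similarity returns a 1x1 array for two single-row inputs.
similarity = cosine_similarity(new_vectors[0:1], new_vectors[1:2])[0, 0]
print(similarity)
```

The new vectors have the same number of columns as `tfidf_matrix`, so they can also be compared against any row of the original matrix without refitting.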

python scikit-learn
