Breedlove: algorithm - Matching an element from a set of abstracts to an element in set of titles -

Monday, 15 April 2013

algorithm - Matching an element from a set of abstracts to an element in a set of titles

Suppose I have two sets,

A = {"this title", ...}, B = {"this short description of title a", ...}

What is the best way to find the best match in set B for an element in set A, or vice versa? The approach I tried was to create a TF-IDF bag-of-words vector space using the tokens of B, and then to find the cosine similarity. Given an element a of A, the pair (a, b) is selected if its cosine similarity is higher than that of a with any other element of B. However, it is not accurate.

Are there better methods for this? How can I improve the accuracy?

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# titles and abstracts are lists of strings
tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
vec = tfidf.fit_transform(abstracts)  # vocabulary and IDF weights come from the abstracts only

def predict(title):
    # Project the title into the abstracts' TF-IDF space and
    # return the index of the most similar abstract.
    titlevec = tfidf.transform([title])
    sim = cosine_similarity(titlevec, vec)
    return np.argmax(sim)

for title in titles:
    index = predict(title)
    print("title: {0}\nabstract: {1}".format(title, abstracts[index]))
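To make the matching rule described above concrete, here is a minimal standard-library sketch of the same idea: weights are fitted on the abstracts only, each title is projected into that space, and the pair with the highest cosine similarity wins. It also runs the "vice versa" direction. The weighting is a simplified smoothed TF-IDF and is not claimed to reproduce scikit-learn's output exactly; the helper names (tokenize, fit_idf, tfidf_vector, cosine, best_match) and the toy titles and abstracts are assumptions made up for illustration, not data from the question.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real tokenizer would also strip punctuation.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency, computed from the given documents only.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}


def tfidf_vector(text, idf):
    # Tokens unseen during fitting are dropped because they have no IDF weight.
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    return {tok: c * idf[tok] for tok, c in counts.items()}


def cosine(u, v):
    # Cosine similarity of two sparse vectors stored as dicts; 0.0 if either is empty.
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(query, candidates, idf):
    # Index of the candidate with the highest cosine similarity to the query.
    q = tfidf_vector(query, idf)
    sims = [cosine(q, tfidf_vector(c, idf)) for c in candidates]
    return max(range(len(candidates)), key=sims.__getitem__)


# Hypothetical toy data, not taken from the original question.
titles = ["graph search algorithms", "neural network training"]
abstracts = [
    "we study training of a deep neural network with gradient descent",
    "we compare breadth first and depth first search on a large graph",
]

idf = fit_idf(abstracts)  # fitted on the abstracts only, as in the question
title_to_abstract = [best_match(t, abstracts, idf) for t in titles]
abstract_to_title = [best_match(a, titles, idf) for a in abstracts]

print(title_to_abstract)  # [1, 0]
print(abstract_to_title)  # [1, 0]
```

Note that the title word "algorithms" is silently ignored here because it never occurs in an abstract. With short titles, losing even one token this way can change which abstract wins, which may be one source of the inaccuracy described above.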

algorithm machine-learning nlp information-retrieval tf-idf
