Monday 15 April 2013

algorithm - Matching an element from a set of abstracts to an element in a set of titles




Suppose I have two sets,

a = {"this title", ...}
b = {"this short description of title a", ...}

What is the best way to find the best match in set b for an element in set a, or vice versa? The approach I tried was to create a TF-IDF bag-of-words vector space using the tokens of b, and then compute the cosine similarity. Given an element a, the pair (a, b) is selected if its cosine similarity is higher than that of any other element of b. This is not accurate.

Are there better methods for this? How can I improve the accuracy?

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # titles and abstracts are arrays of strings
    tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
    # Learn the vocabulary and IDF weights from the abstracts only
    vec = tfidf.fit_transform(abstracts)

    def predict(title):
        # Project the title into the abstracts' TF-IDF space
        titlevec = tfidf.transform([title])
        sim = cosine_similarity(titlevec, vec)
        # Index of the abstract with the highest cosine similarity
        return np.argmax(sim)

    for title in titles:
        index = predict(title)
        print("title: {0}\nabstracts:{1}".format(title, abstracts[index]))
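The matching rule described in the question can also be computed for all titles in one call, which makes the full similarity matrix available for inspection. The following is only a minimal sketch of that same TF-IDF plus cosine-similarity approach, not a more accurate method; the function name `match_titles` and the toy titles and abstracts are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def match_titles(titles, abstracts):
    """Return (best, sim): best[i] is the index of the abstract most similar to titles[i]."""
    tfidf = TfidfVectorizer(stop_words="english", analyzer="word")
    # Vocabulary and IDF weights come from the abstracts, as in the question
    abstract_vecs = tfidf.fit_transform(abstracts)
    title_vecs = tfidf.transform(titles)
    # sim[i, j] is the cosine similarity between title i and abstract j
    sim = cosine_similarity(title_vecs, abstract_vecs)
    return np.argmax(sim, axis=1), sim


# Hypothetical toy data, for illustration only
titles = [
    "neural networks for image classification",
    "graph algorithms for shortest paths",
]
abstracts = [
    "we study shortest paths in weighted graphs using new graph algorithms",
    "a convolutional neural network is trained for image classification tasks",
]

best, sim = match_titles(titles, abstracts)
for title, index in zip(titles, best):
    print("title: {0}\nabstract: {1}\n".format(title, abstracts[index]))
```

Note that a title sharing no vocabulary with any abstract gets an all-zero similarity row, and `np.argmax` then returns index 0 by default, so inspecting `sim` for such rows can help explain some wrong matches.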

algorithm machine-learning nlp information-retrieval tf-idf
