algorithm - Matching an element from a set of abstracts to an element in a set of titles
Suppose I have 2 sets,
a = {"this title", ...}
b = {"this short description of title a", ...}
What is the best way to find the best match in set b for an element in set a, or vice versa? The approach I tried was to create a TF-IDF bag-of-words vector space using the tokens of b, and then compute the cosine similarity. Given a, a pair (a, b) is selected if its cosine similarity is higher than that of any other element in b. But this is not very accurate.
Are there better methods for this? How can I improve the accuracy?
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # titles and abstracts are lists of strings
    tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
    # the vocabulary and IDF weights are learned from the abstracts only
    vec = tfidf.fit_transform(abstracts)

    def predict(title):
        # project the title into the abstracts' TF-IDF space
        titlevec = tfidf.transform([title])
        sim = cosine_similarity(titlevec, vec)
        # index of the abstract with the highest cosine similarity
        return np.argmax(sim)

    for i, title in enumerate(titles):
        index = predict(title)
        print("title: {0}\nabstract: {1}".format(title, abstracts[index]))
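To make it easier to see what the matching step actually computes, here is a minimal standard-library sketch of the same idea: IDF weights learned from the abstracts only, L2-normalised TF-IDF vectors, and an argmax over cosine similarities. Everything in it is an assumption for illustration and not part of the original code: the helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`, `best_match`), the smoothed IDF formula, the toy titles and abstracts, and the lack of stop-word removal.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (no stop-word removal).
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(docs):
    # Smoothed IDF, ln((1 + n) / (1 + df)) + 1, learned from the given docs only.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(text, idf):
    # Sparse, L2-normalised TF-IDF vector; terms outside the vocabulary are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def best_match(title, abstract_vecs, idf):
    # Return (index, score) of the abstract most similar to the title.
    title_vec = tfidf_vector(title, idf)
    scores = [cosine(title_vec, av) for av in abstract_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy data, made up for this sketch.
titles = ["Graph neural networks for molecules", "Fast sorting on GPUs"]
abstracts = [
    "We present a parallel sorting algorithm tuned for GPUs.",
    "Molecules are modelled as graphs and fed to neural networks.",
]

idf = fit_idf(abstracts)
abstract_vecs = [tfidf_vector(a, idf) for a in abstracts]

for title in titles:
    index, score = best_match(title, abstract_vecs, idf)
    print("title: {0}\nabstract: {1}\nscore: {2:.3f}".format(title, abstracts[index], score))
```

Note that in the toy data "Graph" in the first title never matches "graphs" in its abstract, and any title word absent from the abstracts is simply dropped. Short titles leave very few terms to compare, which may be part of why the scores are unreliable.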
algorithm machine-learning nlp information-retrieval tf-idf