Wednesday, 15 May 2013

What's a good database for full text search on a large number of relatively small text documents? (C# backend)




I am designing a system that aims to ingest large numbers of documents. I want to support full text search on the document contents, as well as on other metadata (keyword/sentiment analysis). How the keyword/sentiment analysis is done is beyond the scope of this question, but it is worth considering that this sort of metadata needs to live alongside the searchable documents.

The main assumptions are:

- By large I mean a few 100,000 documents, with the goal of reaching millions
- The documents are 0-15 KB each
- The documents are text (UTF-8)
- I want to be able to full-text-search the document contents
- Hosted on a single machine, no cloud/distributed services
- New documents are inserted continuously (roughly 1-2 per second)
- Ad hoc text searches
- More complicated query use cases would be: show me all documents about 'widgets' that are positive within a date range
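To make the last use case concrete, here is a minimal in-memory sketch of a query that combines a full-text term with metadata filters. It is written in Python for brevity rather than C#, and it is only an illustration of the query shape, not how Lucene or any particular database implements search. Every name in it (`Doc`, `TinyIndex`, the `sentiment` labels) is hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: int
    text: str
    sentiment: str  # hypothetical label produced by a separate analysis step
    created: date


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def add(self, doc):
        # Store the document and register each lowercased word in the index.
        self.docs[doc.doc_id] = doc
        for term in re.findall(r"\w+", doc.text.lower()):
            self.postings[term].add(doc.doc_id)

    def search(self, term, sentiment=None, start=None, end=None):
        # Look up the term first, then filter the candidates on metadata.
        hits = []
        for doc_id in sorted(self.postings.get(term.lower(), ())):
            doc = self.docs[doc_id]
            if sentiment is not None and doc.sentiment != sentiment:
                continue
            if start is not None and doc.created < start:
                continue
            if end is not None and doc.created > end:
                continue
            hits.append(doc_id)
        return hits


index = TinyIndex()
index.add(Doc(1, "Our widgets shipped early", "positive", date(2013, 5, 1)))
index.add(Doc(2, "These widgets keep breaking", "negative", date(2013, 5, 2)))
index.add(Doc(3, "Great widgets, would buy again", "positive", date(2012, 1, 9)))

# "Show me all documents about 'widgets' that are positive within 2013."
result = index.search("widgets", sentiment="positive",
                      start=date(2013, 1, 1), end=date(2013, 12, 31))
```

The point of the sketch is that the sentiment label and the date sit next to the indexed text, so a single query can filter on all three. Whatever storage is chosen needs to support that combination.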

C# is the language of choice for fetching the documents, processing them, and storing and retrieving them from the database, so having C# bindings is a big plus, or at least an easy way to bridge the gap.

Naive approach

A naive approach would be to use MySQL along with Apache's Lucene, either storing the document contents as files with references to them in the database, or storing the document contents as a text field in the database.

I would then use one of the C# wrappers for Lucene, such as Lucene.Net.

My concern/question with this approach is whether the size of the data, and what I want to do with it, is too much for MySQL. I know it is silly to optimize prematurely, and that people often think they need a 'big data' solution when it turns out a regular SQL database is fine. My other main concern is that this approach could be clunky and cumbersome to develop compared to potential alternatives.

Alternatives

From doing some research, one alternative that looks promising is using CouchDB with Lucene. I have come across two libraries that address this:

- couchdb-lucene
- Divan

What I'm looking for:

I haven't done a whole lot with data of this size, so I wonder:

- Does the amount of data and the use case merit a non-relational database?
- Should the documents live in the database, or as files with references in the database?
- Is there a database/full-text-search technology particularly suited to this scenario that I haven't considered?

I suggest RavenDB. It uses Lucene and is 100% .NET. It has text analyzers for full text indexing and fuzzy searches.

c# database-design full-text-search sentiment-analysis keyword-search
