Wednesday 15 January 2014

scikit-learn - How to use multiple input features with associated extractors in a pipeline?


I am working on a classification task with scikit-learn. I have a data set in which each observation comprises two separate text fields. I want to set up a Pipeline in which each text field is passed, in parallel, through its own TfidfVectorizer, and the outputs of the TfidfVectorizer objects are passed to a classifier. My aim is to be able to optimize the parameters of the two TfidfVectorizer objects along with those of the classifier, using GridSearchCV.

The pipeline might be depicted as follows:

    text 1 -> TfidfVectorizer 1 --------|
                                        +---> classifier
    text 2 -> TfidfVectorizer 2 --------|

I understand how to do this without using a Pipeline (by just creating the TfidfVectorizer objects and working from there), but how do I set it up within a Pipeline?

Thanks for any help,

Rob.

Use the Pipeline and FeatureUnion classes. The code for your case would look something like:

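    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import FeatureUnion, Pipeline

    # ExtractText1 / ExtractText2 are not part of scikit-learn: they stand
    # for your own transformers that return the first / second text field.
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('c1', Pipeline([
                ('text1', ExtractText1()),
                ('tf_idf1', TfidfVectorizer())
            ])),
            ('c2', Pipeline([
                ('text2', ExtractText2()),
                ('tf_idf2', TfidfVectorizer())
            ]))
        ])),
        ('classifier', MultinomialNB())
    ])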
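The two extractor steps in the pipeline above are left undefined, so here is one way they might be written. This is only a sketch: the `FieldExtractor` class, its `index` parameter, the `build_pipeline` helper, and the toy data are all made up for illustration, and it assumes each observation is a tuple of two strings.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical helper: select one text field from each observation."""

    def __init__(self, index):
        self.index = index

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Return a list of strings, which is what TfidfVectorizer expects.
        return [row[self.index] for row in X]


def build_pipeline():
    # Same layout as the answer, with FieldExtractor standing in
    # for the undefined extractor steps.
    return Pipeline([
        ("features", FeatureUnion([
            ("c1", Pipeline([
                ("text1", FieldExtractor(0)),
                ("tf_idf1", TfidfVectorizer()),
            ])),
            ("c2", Pipeline([
                ("text2", FieldExtractor(1)),
                ("tf_idf2", TfidfVectorizer()),
            ])),
        ])),
        ("classifier", MultinomialNB()),
    ])


# Toy observations (invented): each one is a (text field 1, text field 2) pair.
X = [
    ("cheap pills online", "buy now"),
    ("meeting agenda attached", "see you monday"),
    ("win a free prize", "click here"),
    ("quarterly report draft", "please review"),
]
y = [1, 0, 1, 0]

model = build_pipeline().fit(X, y)
predictions = model.predict(X)
```

The FeatureUnion stacks the two TF-IDF matrices side by side, so the classifier sees one feature matrix whose columns come from both text fields.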

You can grid search over the entire construction, referring to parameters using the <estimator1>__<estimator2>__<parameter> syntax. For example, features__c1__tf_idf1__min_df refers to the min_df parameter of TfidfVectorizer 1 in the diagram.
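To show the double-underscore naming in use, here is a rough sketch of a grid search over both vectorizers and the classifier at once. The `FieldExtractor` helper, the toy data, and the particular parameter values are assumptions chosen for illustration, not something taken from the question.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical helper: select one text field from each observation."""

    def __init__(self, index):
        self.index = index

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.index] for row in X]


pipeline = Pipeline([
    ("features", FeatureUnion([
        ("c1", Pipeline([("text1", FieldExtractor(0)),
                         ("tf_idf1", TfidfVectorizer())])),
        ("c2", Pipeline([("text2", FieldExtractor(1)),
                         ("tf_idf2", TfidfVectorizer())])),
    ])),
    ("classifier", MultinomialNB()),
])

# Each key walks down the nesting: step name, then sub-step name(s),
# then the parameter of the final estimator.
param_grid = {
    "features__c1__tf_idf1__min_df": [1, 2],
    "features__c2__tf_idf2__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

# Toy observations (invented): (text field 1, text field 2) pairs.
X = [
    ("free pills online", "buy now"),
    ("free prize inside", "click here"),
    ("free holiday offer", "act fast"),
    ("free gift card", "claim today"),
    ("budget report attached", "see you monday"),
    ("report draft ready", "please review"),
    ("weekly report summary", "thanks team"),
    ("report meeting notes", "talk soon"),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
best_params = search.best_params_
```

Calling `pipeline.get_params().keys()` lists every valid parameter name, which is a handy way to check a key before putting it in the grid.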

