Tuesday 15 June 2010

python - How to load a previously saved model and expand the model with new training data using scikit-learn




I'm using scikit-learn, and I've saved a logistic regression model with unigrams as features from training set 1. Is it possible to load this model and then expand it with new data instances from a second training set (training set 2)? If yes, how can this be done? The reason for doing this is that I'm using two different approaches for each of the training sets (the first approach involves feature corruption/regularization, and the second approach involves self-training).

I've added some simple example code for clarity:

    from sklearn.linear_model import LogisticRegression as log
    from sklearn.feature_extraction.text import CountVectorizer as cv
    import pickle

    # Assumed to be defined elsewhere:
    # traintext1, trainlabel1: training set 1 text instances and labels
    # traintext2, trainlabel2: training set 2 text instances and labels

    clf = log()

    # count vectorizer used by the logistic regression classifier
    vec = cv()

    # fit the count vectorizer with the training text data from training set 1
    vec.fit(traintext1)

    # transform the text of training set 1 into vectors
    train1text1 = vec.transform(traintext1)

    # fit the logistic regression classifier on the vectorized training set 1
    clf.fit(train1text1, trainlabel1)

    # save the logistic regression model from training set 1
    with open('modelfromtrainingset1', 'wb') as modelfilesave:
        pickle.dump(clf, modelfilesave)

    # load the logistic regression model from training set 1
    with open('modelfromtrainingset1', 'rb') as modelfileload:
        clf = pickle.load(modelfileload)

    # I'm unsure how to continue from here....

LogisticRegression internally uses a liblinear solver that does not support incremental fitting. Instead, you could use SGDClassifier(loss='log'), which has a partial_fit method that can be used for this (the loss is named 'log_loss' from scikit-learn 1.1 onward). The other hyperparameters are different, though, so grid search carefully for their optimal values. Read the SGDClassifier documentation for the meaning of those hyperparameters.
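As a rough sketch of what this could look like, the snippet below trains an SGDClassifier on one batch, pickles it together with the vectorizer, then loads both and continues training with partial_fit. The toy texts, labels, and variable names are made up for illustration, and the hyperparameters are left at their defaults rather than tuned.

```python
import pickle

import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the two training sets (hypothetical data).
train_text_1 = ["good movie", "great film", "bad movie", "awful film"]
train_label_1 = [1, 1, 0, 0]
train_text_2 = ["great movie", "awful movie"]
train_label_2 = [1, 0]

# The logistic loss is named "log_loss" from scikit-learn 1.1 on, "log" before.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

# The vocabulary is fixed by training set 1.
vec = CountVectorizer().fit(train_text_1)
clf = SGDClassifier(loss=loss_name, random_state=0)

# The first partial_fit call must list every class that will ever appear.
clf.partial_fit(
    vec.transform(train_text_1), train_label_1, classes=np.array([0, 1])
)

# Persist the vectorizer and the classifier together.
saved = pickle.dumps((vec, clf))

# Later: load both and keep training on training set 2.
vec, clf = pickle.loads(saved)
clf.partial_fit(vec.transform(train_text_2), train_label_2)

predictions = clf.predict(vec.transform(["great film"]))
```

The sketch pickles to a bytes object to stay self-contained; pickle.dump and pickle.load with a file, as in the question, would work the same way. Saving the vectorizer alongside the classifier matters because the classifier's coefficients only make sense with the same feature columns.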

CountVectorizer does not support incremental fitting either. You would have to reuse the vectorizer fitted on training set #1 to transform set #2. That means any token from set #2 that was not already seen in set #1 will be ignored, which might not be what you expect.
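A small illustration of the ignored-token behaviour, using made-up sentences: the vectorizer is fitted on the first set only, so words that appear only in the second set get no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the two training sets.
train_text_1 = ["good movie", "bad movie"]
train_text_2 = ["terrible plot", "good plot"]

vec = CountVectorizer()
vec.fit(train_text_1)

# The vocabulary comes from set 1 only: bad, good, movie.
vocabulary = sorted(vec.vocabulary_)

# "terrible" and "plot" are unknown, so the first row is all zeros
# and the second row only counts "good".
X2 = vec.transform(train_text_2)
dense = X2.toarray().tolist()  # [[0, 0, 0], [0, 1, 0]]
```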

To mitigate this, you can use a HashingVectorizer, which is stateless, at the cost of not knowing what the features mean. Read the documentation for more details.
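For comparison, here is a minimal sketch of the stateless alternative, again with invented sentences. The n_features value is an arbitrary choice for the example; the number of columns is set up front and does not depend on any vocabulary.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy data standing in for the two training sets.
train_text_1 = ["good movie", "bad movie"]
train_text_2 = ["terrible plot", "good plot"]

# No fit step is needed: each token is hashed straight to a column index.
vec = HashingVectorizer(n_features=2**18)

X1 = vec.transform(train_text_1)
X2 = vec.transform(train_text_2)

# Both matrices share the same fixed width, so a model trained on X1
# can keep training on X2, including tokens that set 1 never contained.
same_width = X1.shape[1] == X2.shape[1]
```

Because nothing is learned from the data, there is no vocabulary_ attribute to map a column back to a token, which is the loss of interpretability the answer mentions.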

python machine-learning scikit-learn
