Breedlove: python - How to load previously saved model and expand the model with new training data using scikit-learn

Tuesday, 15 June 2010

I'm using scikit-learn, and I've saved a logistic regression model that uses unigrams as features, trained on training set 1. Is it possible to load this model and then expand it with new data instances from a second training set (training set 2)? If yes, how can this be done? The reason for doing this is that I'm using two different approaches for the two training sets (the first approach involves feature corruption/regularization, and the second approach involves self-training).

I've added some simple example code for clarity:

    from sklearn.linear_model import LogisticRegression as log
    from sklearn.feature_extraction.text import CountVectorizer as cv
    import pickle

    # Assumed to be defined elsewhere:
    #   traintext1   training set 1 text instances
    #   trainlabel1  training set 1 labels
    #   traintext2   training set 2 text instances
    #   trainlabel2  training set 2 labels

    clf = log()
    # count vectorizer used by the logistic regression classifier
    vec = cv()

    # fit the count vectorizer with the training text data from training set 1
    vec.fit(traintext1)

    # transform the text into vectors for training set 1
    train1text1 = vec.transform(traintext1)

    # fit the logistic regression classifier on the vectorized training set 1
    clf.fit(train1text1, trainlabel1)

    # save the logistic regression model from training set 1
    with open('modelfromtrainingset1', 'wb') as modelfilesave:
        pickle.dump(clf, modelfilesave)

    # load the logistic regression model from training set 1
    with open('modelfromtrainingset1', 'rb') as modelfileload:
        clf = pickle.load(modelfileload)

    # I'm unsure how to go on from here....

LogisticRegression uses the liblinear solver internally, which does not support incremental fitting. Instead you could use SGDClassifier(loss='log') (spelled loss='log_loss' in recent scikit-learn releases), whose partial_fit method can be used for this in practice. The other hyperparameters are different, so be careful to grid search for their optimal values. Read the SGDClassifier documentation for the meaning of those hyperparameters.
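A minimal sketch of that incremental approach, under some assumptions: the toy texts and labels are made up for illustration, the model is pickled to an in-memory byte string rather than a file, and the loss name is picked from the installed version because 'log' was renamed to 'log_loss' in scikit-learn 1.1. The vectorizer is pickled together with the classifier so that set 2 can be transformed into the same feature space.

```python
import pickle

import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data standing in for the two training sets.
traintext1 = ["good movie", "great film", "bad movie", "awful film"]
trainlabel1 = [1, 1, 0, 0]
traintext2 = ["good film", "bad film"]
trainlabel2 = [1, 0]

# Logistic loss is called 'log_loss' from scikit-learn 1.1 on, 'log' before.
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

# Fit the vectorizer on set 1 only and train on the resulting vectors.
vec = CountVectorizer()
X1 = vec.fit_transform(traintext1)

clf = SGDClassifier(loss=loss_name, random_state=0)
# The full list of classes must be passed on the first partial_fit call.
clf.partial_fit(X1, trainlabel1, classes=np.array([0, 1]))

# Save and reload the vectorizer and the classifier together.
saved = pickle.dumps((vec, clf))
vec, clf = pickle.loads(saved)

# Continue training on set 2, reusing the vocabulary learned from set 1.
coef_before = clf.coef_.copy()
X2 = vec.transform(traintext2)
clf.partial_fit(X2, trainlabel2)
```

Each partial_fit call performs a single pass over the data it is given, so on real data several passes (with shuffling) may be needed to reach a good fit.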

CountVectorizer does not support incremental fitting either. You have to reuse the vectorizer fitted on train set #1 to transform set #2. This means that any token from set #2 that was not already seen in set #1 will be ignored, which might not be what you expect.
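The effect can be seen on a tiny example. This is only a sketch with made-up sentences: the word "soundtrack" appears in the second set but not in the first, so it contributes nothing to the transformed vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical texts: "soundtrack" occurs only in the second set.
traintext1 = ["good movie", "bad movie"]
traintext2 = ["good soundtrack"]

vec = CountVectorizer()
vec.fit(traintext1)

# The vocabulary is frozen after fit; transform only counts known tokens.
X2 = vec.transform(traintext2)

known_tokens = sorted(vec.vocabulary_)
counted_tokens = int(X2.sum())
```

Here only "good" is counted for the second text, and the matrix keeps the column count learned from the first set.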

To mitigate this you can use a HashingVectorizer, which is stateless, at the cost of not knowing what the features mean. Read the documentation for more details.
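A short sketch of the stateless alternative, again with made-up sentences. The n_features value is an arbitrary choice for illustration; a larger value lowers the chance of two tokens hashing to the same column.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical texts: "soundtrack" occurs only in the second set.
traintext1 = ["good movie", "bad movie"]
traintext2 = ["soundtrack"]

# No fit step is needed: tokens are mapped to columns by a hash function.
hv = HashingVectorizer(n_features=2**18)

X1 = hv.transform(traintext1)
X2 = hv.transform(traintext2)
```

Both matrices have the same number of columns, and the previously unseen token still gets a nonzero entry, so the output can be fed to partial_fit batch after batch. The trade-off is that there is no vocabulary to map a column index back to a word.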

python machine-learning scikit-learn
