Wednesday 15 July 2015

python - random sampling with pandas dataframe




I'm relatively new to pandas (and Python... and programming), and I'm trying to run a Monte Carlo simulation, but I have not been able to find a solution that takes a reasonable amount of time.

The data, stored in a data frame called "ytdsales", has sales per day, per product:

    date        product_a  product_b  product_c  product_d  ...  product_xx
    01/01/2014       1000        300         70      34500  ...         780
    02/01/2014        400        400         70         20  ...          10
    03/01/2014       1110        400       1170         60  ...          50
    04/01/2014         20        320          0      71300  ...          10
    ...
    15/10/2014       1000        300         70      34500  ...        5000

I want to simulate different scenarios, using for the rest of the year (from Oct 15 to year end) the historical distribution each product had. For example, with the data presented, I would fill the rest of the year with sales between 20 and 1,100.

What I've done is the following:

    # create a range of "future" dates
    last_historical = ytdsales.index.max()
    year_end = dt.datetime(2014, 12, 30)
    dateseoy = pd.date_range(start=last_historical, end=year_end).shift(1)

    # function that obtains a random sales number per product, between min and max
    f = lambda x: np.random.randint(x.min(), x.max())

    # create the "future" dates and fill them with the output of f
    for i in dateseoy:
        ytdsales.loc[i] = ytdsales.apply(f)

The solution works, but it takes about 3 seconds, which is a lot if I plan to do 1,000 iterations... Is there a way to avoid iterating?

thanks

Use the size option for np.random.randint to get a sample of the needed size all at once. One approach you could consider briefly follows.

Allocate the space you'll need by making a new array that has the index values dateseoy, the columns from the original DataFrame, and all NaN values. Then concatenate it onto the original data.

Now that you know the length of each random sample you'll need, use the size keyword in numpy.random.randint to draw the whole sample at once, per column, instead of looping.

Overwrite the new NaN data with the batch samples.

Here's what it would look like:

    new_df = pandas.DataFrame(index=dateseoy, columns=ytdsales.columns)
    num_to_sample = len(new_df)

    # draw num_to_sample values per column, between that column's min and max
    f = lambda x: np.random.randint(x[1].min(), x[1].max(), num_to_sample)

    output = pandas.concat([ytdsales, new_df], axis=0)
    # list() keeps this working on Python 3, where map is lazy
    output[len(ytdsales):] = np.asarray(list(map(f, ytdsales.iteritems()))).T
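If the end goal is the 1,000-iteration Monte Carlo from the question, here is a minimal sketch of how the batch sampler above might be reused per iteration. It assumes you only need the simulated year-end total per product (not every simulated path), and it reuses f and num_to_sample from above; n_iter, totals, sim, and results are just illustrative names:

    n_iter = 1000
    totals = []
    for _ in range(n_iter):
        # one vectorized batch of "future" sales: shape (num_to_sample, n_products)
        sim = np.asarray(list(map(f, ytdsales.iteritems()))).T
        # historical total plus the simulated remainder of the year, per product
        totals.append(ytdsales.sum() + sim.sum(axis=0))

    results = pandas.DataFrame(totals)  # one row per iteration, one column per product

The outer loop is only over iterations; within each iteration the sampling is a single vectorized call per column, which is the part that made the original per-date loop slow. (On newer pandas versions, .iteritems() is spelled .items().)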

Along the way, this chooses to create a totally new DataFrame, concatenating the old one with the new "placeholder" one. That could be inefficient for big data.

Another way to approach it would be with setting-with-enlargement, as you've done in your for-loop solution.

I did not play around with that approach long enough to figure out how to "enlarge" by batches of indexes all at once. But if you figure that out, you can "enlarge" the original data frame with NaN values (at the index values dateseoy) and then apply the function to ytdsales directly, instead of creating output at all.
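For what it's worth, a minimal sketch of that idea, assuming ytdsales and dateseoy are defined as in the question, and using reindex (rather than true setting-with-enlargement) to add the NaN rows in one step; enlarged, lo, and hi are just illustrative names:

    # add all the "future" rows at once, filled with NaN
    enlarged = ytdsales.reindex(ytdsales.index.append(dateseoy))

    # batch-sample each column from its historical min/max and write into the new rows
    num_to_sample = len(dateseoy)
    for col in ytdsales.columns:
        lo, hi = ytdsales[col].min(), ytdsales[col].max()
        enlarged.loc[dateseoy, col] = np.random.randint(lo, hi, num_to_sample)

This still loops, but only over the (usually few) product columns rather than over every remaining date, and it avoids building a second full DataFrame just to concatenate.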

python performance pandas montecarlo
