Tuesday 15 March 2011

python - Multiprocessing on pandas dataframe. Confusing behavior based on input size -




I'm trying to implement a parallelized df.apply, with the function applied across chunks of the dataframe. I wrote the following test code to see how much I could gain (versus the cost of data copying, etc.):

from multiprocessing import Pool
from functools import partial
import pandas as pd
import numpy as np
import time

def df_apply(df, f):
    return df.apply(f, axis=1)

def apply_in_parallel(df, f, n=5):
    pool = Pool(n)
    df_chunks = np.array_split(df, n)
    apply_f = partial(df_apply, f=f)
    result_list = pool.map(apply_f, df_chunks)
    return pd.concat(result_list, axis=0)

def f(x):
    return x+1

if __name__ == '__main__':
    n = 10^8
    df = pd.DataFrame({"a": np.zeros(n), "b": np.zeros(n)})
    print "parallel"
    t0 = time.time()
    r = apply_in_parallel(df, f, n=5)
    print time.time() - t0
    print "single"
    t0 = time.time()
    r = df.apply(f, axis=1)
    print time.time() - t0

The weird behavior: with n=10^7 it works, but n=10^8 gives me this error:

Traceback (most recent call last):
  File "parallel_apply.py", line 27, in <module>
    r = apply_in_parallel(df, f, n=5)
  File "parallel_apply.py", line 14, in apply_in_parallel
    result_list = pool.map(apply_f, df_chunks)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
AttributeError: 'numpy.ndarray' object has no attribute 'apply'

Does anyone know what's going on here? I'd also appreciate feedback on this way of parallelizing. I'm expecting the actual functions to take more time than an increment or a sum per individual row, applied over millions of rows.

Thanks!

n = 10^8 results in 2, and n = 10^7 results in 13, because the operator ^ is XOR, not power. A dataframe of row length 2 cannot be split into 5 chunks. Use n = 10**4 and n = 10**5 instead; at these values you will see the difference in time. Be careful with values greater than n = 10**6 (at that value the parallel time is about 30 sec and the single time about 167 sec). Also, call pool.close() at the end (before the return) in apply_in_parallel() to automatically close the workers in the pool.
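A corrected version of the script along the lines of the answer above might look like the sketch below (updated to Python 3 syntax; the helper name inc is illustrative). It uses ** for exponentiation, adds pool.close()/pool.join(), and chunks via iloc slices instead of np.array_split so that every chunk stays a DataFrame rather than degrading to a numpy.ndarray:

```python
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

def df_apply(df, f):
    # Apply f to each row of one chunk.
    return df.apply(f, axis=1)

def inc(row):
    # Illustrative per-row function: add 1 to every value in the row.
    return row + 1

def apply_in_parallel(df, f, n=5):
    pool = Pool(n)
    try:
        # Slice with iloc so each chunk is a DataFrame (np.array_split can
        # hand back plain ndarrays, which have no .apply method).
        bounds = np.linspace(0, len(df), n + 1, dtype=int)
        chunks = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(n)]
        result_list = pool.map(partial(df_apply, f=f), chunks)
    finally:
        pool.close()  # shut the workers down once the map is done
        pool.join()
    return pd.concat(result_list, axis=0)

if __name__ == "__main__":
    n = 10 ** 4  # ** is exponentiation; 10 ^ 4 would be XOR (= 14)
    df = pd.DataFrame({"a": np.zeros(n), "b": np.zeros(n)})
    result = apply_in_parallel(df, inc, n=5)
    assert len(result) == n
```

Note that f must be a module-level function (not a lambda) so that multiprocessing can pickle it for the worker processes.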

python pandas multiprocessing python-multiprocessing
