Tuesday 15 March 2011

python - Multiprocessing on pandas dataframe. Confusing behavior based on input size -




I'm trying to implement a parallelized df.apply, with the function applied across chunks of the dataframe. I wrote the following test code to see how much I could gain (versus the cost of data copying, etc.):

from multiprocessing import Pool
from functools import partial
import pandas as pd
import numpy as np
import time

def df_apply(df, f):
    return df.apply(f, axis=1)

def apply_in_parallel(df, f, n=5):
    pool = Pool(n)
    df_chunks = np.array_split(df, n)
    apply_f = partial(df_apply, f=f)
    result_list = pool.map(apply_f, df_chunks)
    return pd.concat(result_list, axis=0)

def f(x):
    return x+1

if __name__ == '__main__':
    n = 10^8
    df = pd.DataFrame({"a": np.zeros(n), "b": np.zeros(n)})
    print "parallel"
    t0 = time.time()
    r = apply_in_parallel(df, f, n=5)
    print time.time() - t0
    print "single"
    t0 = time.time()
    r = df.apply(f, axis=1)
    print time.time() - t0

The weird behavior: with n=10^7 it works, but n=10^8 gives me this error:

Traceback (most recent call last):
  File "parallel_apply.py", line 27, in <module>
    r = apply_in_parallel(df, f, n=5)
  File "parallel_apply.py", line 14, in apply_in_parallel
    result_list = pool.map(apply_f, df_chunks)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
AttributeError: 'numpy.ndarray' object has no attribute 'apply'

Does anyone know what's going on here? I'd also appreciate feedback on this way of parallelizing. I'm expecting the actual functions to take more time than an increment or a sum per individual row, applied over millions of rows.

Thanks!

n = 10^8 results in 2, and n = 10^7 results in 13, because the operator ^ is XOR, not power. A dataframe of row length 2 cannot be split into 5 chunks. Use n = 10**4 and n = 10**5 instead; at these values you will see the difference in time. Be careful with values greater than n = 10**6 (at that value the parallel time is about 30 sec and the single time about 167 sec). Also, call pool.close() at the end (before the return) in apply_in_parallel() to automatically close the workers in the pool.
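A corrected version of the script along the lines of the answer above might look like the sketch below (updated to Python 3 syntax; the helper name inc is illustrative). It uses ** for exponentiation, adds pool.close()/pool.join(), and chunks via iloc slices instead of np.array_split so that every chunk stays a DataFrame rather than degrading to a numpy.ndarray:

```python
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

def df_apply(df, f):
    # Apply f to each row of one chunk.
    return df.apply(f, axis=1)

def inc(row):
    # Illustrative per-row function: add 1 to every value in the row.
    return row + 1

def apply_in_parallel(df, f, n=5):
    pool = Pool(n)
    try:
        # Slice with iloc so each chunk is a DataFrame (np.array_split can
        # hand back plain ndarrays, which have no .apply method).
        bounds = np.linspace(0, len(df), n + 1, dtype=int)
        chunks = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(n)]
        result_list = pool.map(partial(df_apply, f=f), chunks)
    finally:
        pool.close()  # shut the workers down once the map is done
        pool.join()
    return pd.concat(result_list, axis=0)

if __name__ == "__main__":
    n = 10 ** 4  # ** is exponentiation; 10 ^ 4 would be XOR (= 14)
    df = pd.DataFrame({"a": np.zeros(n), "b": np.zeros(n)})
    result = apply_in_parallel(df, inc, n=5)
    assert len(result) == n
```

Note that f must be a module-level function (not a lambda) so that multiprocessing can pickle it for the worker processes.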

python pandas multiprocessing python-multiprocessing
