Tuesday 15 June 2010

numpy - Is there a way to make this Python kNN function more efficient? -




After having trouble with MATLAB, I decided to try Python:

I wrote a function that calculates kNN when the samples are of my own class, using my own distance function:

def closestK(sample, otherSamples, distFunc, k):
    """Returns the closest k samples to sample based on distFunc"""
    n = len(otherSamples)
    d = [distFunc(sample, otherSamples[i]) for i in range(0, n)]
    idx = sorted(range(0, len(d)), key=lambda i: d[i])
    return idx[1:(k + 1)]

def kNN(samples, distFunc, k):
    return [[closestK(samples[i], samples, distFunc, k)] for i in range(len(samples))]

And the distance function:

@staticmethod
def distanceRepr(c1, c2):
    r1 = c1.repr
    r2 = c2.repr
    # because cdist needs a 2d array
    if r1.ndim == 1:
        r1 = np.vstack([r1, r1])
    if r2.ndim == 1:
        r2 = np.vstack([r2, r2])
    return scipy.spatial.distance.cdist(r1, r2, 'euclidean').min()

But it still runs amazingly slowly compared to a "normal" kNN function, even when using the "brute" algorithm. Am I doing something wrong?

Update

I'm adding the constructor of my class. The attribute repr contains a set of vectors (from 1 to whatever), and the distance is calculated as the minimal Euclidean distance between the two sets of repr.

class MyCluster:
    def __init__(self, index=-1, p=np.array([])):
        if index == -1:
            self.repr = np.array([])
            self.ids = np.array([])
            self.n = 0
            self.center = np.array([])
        else:
            self.repr = np.array(p)
            self.ids = np.array(index)
            self.n = 1
            self.center = np.array(p)

And the rest of the relevant code (X is a matrix whose rows are samples and whose columns are variables):

level = [MyCluster(i, X[i, :]) for i in range(0, n)]
kNN(level, MyCluster.distanceRepr, 3)

Update 2

I've made some measurements, and the line that takes most of the time is

d = [distFunc(sample, otherSamples[i]) for i in range(0, n)]

So there is something with distFunc. When I change it to return

np.linalg.norm(c1.repr - c2.repr)

i.e. a "normal" vector calculation with no sorting, the running time stays the same. So the problem lies in the calling of this function. Does it make sense that the use of classes changes the running time by a factor of 60?
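One way to see this per-call overhead directly is to time the same set of pairwise distances computed via a Python-level function call per pair versus a single broadcasted NumPy expression. This is a minimal sketch, not code from the post; the names `dist`, `loop`, `vec` and the 100×5 test matrix are illustrative:

```python
import timeit
import numpy as np

x = np.random.rand(100, 5)

def dist(u, v):
    # One Python-level call per pair: attribute lookups, argument
    # packing, and interpreter dispatch are paid on every call.
    return np.linalg.norm(u - v)

def loop():
    return [dist(x[i], x[j]) for i in range(100) for j in range(100)]

def vec():
    # Same distances, but the pairwise loop runs inside compiled NumPy code.
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

print(timeit.timeit(loop, number=10), timeit.timeit(vec, number=10))
```

On typical hardware the looped version is one to two orders of magnitude slower, even though both compute exactly the same 10,000 distances.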

You're running into the slowness of Python (or rather of the CPython interpreter, I should say). From Wikipedia:

NumPy targets the CPython reference implementation of Python, which is a non-optimizing bytecode compiler/interpreter. Mathematical algorithms written for this version of Python often run much slower than compiled equivalents. NumPy seeks to address this problem by providing multidimensional arrays and functions and operators that operate efficiently on arrays. Thus any algorithm that can be expressed as operations on arrays and matrices can run almost as fast as the equivalent C code.

And from the SciPy FAQ:

Python's lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending, and concatenation, and Python's list comprehensions make them easy to construct and manipulate. However, they have certain limitations: they don't support "vectorized" operations like elementwise addition and multiplication, and the fact that they can contain objects of differing types means that Python must store type information for every element, and must execute type dispatching code when operating on each element. This also means that very few list operations can be carried out by efficient C loops – each iteration would require type checks and other Python API bookkeeping.
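The FAQ's point can be seen with elementwise addition: on a Python list the interpreter loops over each element, while the NumPy equivalent is a single array operation. A small illustration (not from the original post):

```python
import numpy as np

# Plain Python lists: the interpreter loops element by element,
# type-checking each object as it goes.
a = list(range(1_000_000))
b = list(range(1_000_000))
c = [x + y for x, y in zip(a, b)]

# NumPy arrays: one homogeneous block of memory, one call,
# and the loop runs in compiled C.
a_np = np.arange(1_000_000)
b_np = np.arange(1_000_000)
c_np = a_np + b_np
```

Both produce the same numbers, but the array version avoids a million rounds of interpreter bookkeeping.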

Note that this doesn't only concern Python; for more background see e.g. this and this question on SO.

Due to the overhead of the dynamic type system and the interpreter, Python would be a lot less useful for high performance number crunching if it weren't able to tap into all sorts of compiled C and Fortran libraries (e.g. NumPy). Also, there are JIT compilers like Numba and PyPy that try to make Python code execute closer to the speeds of statically typed, compiled code.

Bottom line: you're doing too much work in plain Python relative to the work you're offloading to fast C code. I suppose you need to adopt a more "array oriented" coding style rather than an object oriented one to achieve good performance with NumPy (MATLAB is a similar story in this regard). On the other hand, if you were to use a more efficient algorithm (see the answer by Ara) then the slowness of Python might not be such an issue.
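As a minimal sketch of what "array oriented" could mean here: compute all pairwise distances with one `cdist` call and sort with `argsort`, instead of one Python-level distFunc call per pair. The function name `knn_vectorized` is mine, and it assumes each sample is a single row vector of X, not the asker's set-of-vectors repr (which would need a min over blocks of the distance matrix):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_vectorized(X, k):
    # All-pairs distances in one C-level call instead of n*n Python calls.
    D = cdist(X, X, 'euclidean')   # shape (n, n)
    idx = np.argsort(D, axis=1)    # each row sorted by distance
    return idx[:, 1:k + 1]         # drop column 0: the sample itself
```

The whole kNN computation then happens in three array operations, with no per-sample Python loop at all.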

python numpy machine-learning distance knn
