Sunday 15 July 2012

scala - Most efficient way to do a "not distinct" query -




I have a huge data set of key-value pairs, and I'm trying to figure out the best way to ignore keys that have only one value (the interesting ones are the ones with more than one value).

The simplest way is to use a PairRDD with the values grouped by key:

val interestingData = initialData.groupByKey().filter(_._2.size > 1)

However, this turns out to be problematic (there are keys with more than 6 million values, though the vast majority of keys have only one value).

One way would be:

val interestingKeys = initialData.mapValues(_ => 1).reduceByKey((a, b) => a + b).filter(_._2 > 1)
val interestingData = interestingKeys.join(initialData).mapValues(x => x._2)

This is more computation-intensive, but it doesn't put as much memory pressure on the reducers, so it is more likely to finish. Plus, if I want, I can add a second filter to remove keys with too many values.
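For concreteness, here is a minimal self-contained sketch of that two-pass version with the optional second filter. It assumes a reasonably recent Spark release (the org.apache.spark packages, where the pair-RDD operations are available without extra imports) and a local master. The names repeatedKeys and maxValues, the String/Int element types, and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object NotDistinctSketch {
  // Keep only the pairs whose key occurs more than once and at most
  // maxValues times. maxValues is a hypothetical cutoff for the
  // "second filter" mentioned above.
  def repeatedKeys(initialData: RDD[(String, Int)], maxValues: Long): RDD[(String, Int)] = {
    // Pass 1: count occurrences per key. reduceByKey combines counts
    // map-side, so no reducer has to hold all values of a heavy key.
    val interestingKeys = initialData
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .filter { case (_, count) => count > 1 && count <= maxValues }

    // Pass 2: join back to recover the values of the surviving keys,
    // then drop the count from the joined tuple.
    interestingKeys.join(initialData).mapValues { case (_, value) => value }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("not-distinct-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample data: "a" has one value, "b" has two, "c" has three.
      val initialData = sc.parallelize(
        Seq("a" -> 1, "b" -> 2, "b" -> 3, "c" -> 4, "c" -> 5, "c" -> 6))
      // With maxValues = 2 only the two "b" pairs survive.
      repeatedKeys(initialData, maxValues = 2L).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The upper bound is folded into the same filter as the lower bound, so it costs nothing extra: the counts are already there after reduceByKey.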

Is there a better way? Or are there any other suggestions to improve this query? (.cogroup after the filter?)

scala functional-programming apache-spark
