scala - Most efficient way to do a "not distinct" query
I have a huge data set of key-value pairs, and I'm trying to figure out the best way to ignore keys that have only one value (the interesting keys are the ones with more than one value).
The simplest way is a PairRDD with the values grouped by key:
val interestingdata = initialdata.groupByKey().filter(_._2.size > 1)
However, this turns out to be problematic (there are keys with more than 6 million values, even though the vast majority of keys have only one value).
Another way would be:
val interestingkeys = initialdata.mapValues(_ => 1).reduceByKey((a, b) => a + b).filter(_._2 > 1)
val interestingdata = interestingkeys.join(initialdata).mapValues(x => x._2)
This is more computation-intensive, but it doesn't put as much memory pressure on the reducers, so it's more likely to finish. Plus, if I want, I can add a second filter to remove keys with too many values.
Is there a better way? Or are there other suggestions to improve this query? (.cogroup after the filter?)
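For reference, here is a minimal, self-contained sketch of the count-then-join approach, including the optional second filter. The object name `NotDistinctSketch`, the helper `interestingPairs`, the `maxValues` parameter, the `String`/`Int` element types and the sample data are all made up for illustration; the real data set presumably has different types. It assumes a Spark version where the pair-RDD implicits are picked up automatically (older releases also need `import org.apache.spark.SparkContext._`).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object NotDistinctSketch {
  // Hypothetical helper: keep only the pairs whose key occurs more than once
  // and at most maxValues times (maxValues is the optional "second filter").
  def interestingPairs(initialdata: RDD[(String, Int)], maxValues: Long): RDD[(String, Int)] = {
    // Count occurrences per key; reduceByKey combines map-side, so no
    // reducer ever has to hold all values of a heavy key at once.
    val interestingkeys = initialdata
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .filter { case (_, n) => n > 1 && n <= maxValues }

    // Join back to recover the original values, then drop the count.
    interestingkeys.join(initialdata).mapValues(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Local master and toy data are assumptions for the sake of a runnable example.
    val sc = new SparkContext(new SparkConf().setAppName("not-distinct").setMaster("local[*]"))
    try {
      val initialdata = sc.parallelize(Seq(
        ("a", 1),
        ("b", 2), ("b", 3),
        ("c", 4), ("c", 5), ("c", 6), ("c", 7)))
      interestingPairs(initialdata, maxValues = 3L).collect().sorted.foreach(println)
    } finally sc.stop()
  }
}
```

With this toy input only the two `b` pairs survive: `a` has a single value and `c` has four values, which exceeds the assumed cap of 3.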
scala functional-programming apache-spark