Wednesday 15 September 2010

hadoop - Optimize join in HIVE query




I would like to know the best way to optimize a Hive (0.12) query joining 2 tables among these 3 possible candidates (and, if possible, understand why):

select * from a join b on (a.id = b.id) where b.dt = "2014-09-01";

or

select * from a join b on (a.id = b.id and b.dt = "2014-09-01");

or

select * from a join (select * from b where dt = "2014-09-01") c on a.id = c.id;

I have no control over how the tables are stored and partitioned, and my question is more about general best practices than a specific case. I know for sure that a.id = b.id is only possible when b.dt = '2014-09-01', and I want to restrict the data that can be joined to improve speed (b is a huge table).

Reading the Hive documentation, I understood that it is better to put the smallest table first, and b is the (very) big one; but I couldn't understand how the different queries shown above behave in terms of performance.

If there is any other way to do this, I would like to know as well.

I see that all 3 are the same in terms of the number of MR jobs, the mappers used, and the explain plan, taking care that the smaller table is small enough for the map-side join optimization to be used. Switching the position of the filter on table b has no effect on the number of mappers used to retrieve data from table b. The same is the case when table b is in a subquery.
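One way to check this on your own data is to compare the plans side by side. This is a minimal sketch, assuming tables a and b as in the question. The two property values are what I believe to be Hive's defaults and are shown only for illustration, not as tuned recommendations:

```sql
-- Allow Hive to convert the join into a map-side join when one side is small enough.
set hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as "small" (assumed default value).
set hive.mapjoin.smalltable.filesize=25000000;

-- Candidate 1: filter in the where clause.
explain
select * from a join b on (a.id = b.id) where b.dt = "2014-09-01";

-- Candidate 2: filter inside the join condition.
explain
select * from a join b on (a.id = b.id and b.dt = "2014-09-01");

-- Candidate 3: filter in a subquery on b.
explain
select * from a join (select * from b where dt = "2014-09-01") c on a.id = c.id;
```

If the three outputs list the same stages and each shows the dt filter applied directly on the scan of b, the queries should behave the same way at run time.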

The only optimization is partition pruning: if table b happens to be partitioned on column dt, you will see the number of mappers cut down by some factor compared with a total table scan otherwise.
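Whether pruning applies depends on how b is laid out, which you can inspect even without control over the storage. This is a rough sketch, again assuming the table names a and b and the column dt from the question:

```sql
-- Shows the table layout, including any partition columns.
describe formatted b;

-- Lists existing partitions; this fails if b is not partitioned at all.
show partitions b;

-- Reports the input tables and partitions the query would read.
explain dependency
select * from a join b on (a.id = b.id) where b.dt = "2014-09-01";
```

If b is partitioned on dt, the dependency output should list only the dt=2014-09-01 partition of b among the inputs; if it lists every partition, or b is not partitioned, the query will scan the whole table.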

join hadoop hive query-optimization
