hadoop - Optimize join in Hive query
I would like to know which is the best way to optimize a Hive (0.12) query joining two tables, among these three possible candidates (and possibly understand why):
select * from a join b on (a.id = b.id) where b.dt = "2014-09-01";
or
select * from a join b on (a.id = b.id and b.dt = "2014-09-01");
or
select * from a join (select * from b where dt = "2014-09-01") c on a.id = c.id;
I have no control over how the tables are stored and partitioned, and the question is more about general best practices than about a specific case. I know for sure that a.id = b.id is possible only when b.dt = '2014-09-01', and I would like to restrict the data that can be joined in order to improve speed (b is a huge table).
From reading the Hive documentation I understood how to improve the join when a is the smallest table and b is a (very) big one, but I couldn't understand how the different queries shown above behave in terms of performance.
If there is any other way to do it, I would like to know that as well.
I see all three as the same in terms of the number of MR jobs, the mappers used, and the explain plan. Take care that table a is small enough for the map-side join optimization to be used. Switching the position of the filter on table b has no effect on the number of mappers used to retrieve data from table b. The same is the case when table b is in a subquery.
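To check this on your own data, you can prefix each variant with EXPLAIN and compare the stage plans. The following is only a sketch, assuming the tables a and b from the question. The two set lines use Hive's properties for automatic map-join conversion; the threshold value is illustrative and should be adapted to the actual size of a.

```sql
-- Let Hive convert the join to a map-side join when the small table
-- is below the size threshold (in bytes). The value is illustrative.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- Variant 1: filter in the WHERE clause
explain
select * from a join b on (a.id = b.id) where b.dt = '2014-09-01';

-- Variant 2: filter in the ON clause
explain
select * from a join b on (a.id = b.id and b.dt = '2014-09-01');

-- Variant 3: filter in a subquery
explain
select * from a join (select * from b where dt = '2014-09-01') c on a.id = c.id;
```

If the plans really are equivalent, the filter on dt should appear directly after the TableScan of b in all three outputs, and a map-side join should show up as a Map Join Operator rather than a plain Join Operator in a reduce stage.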
The only optimization is partition pruning: if table b happens to be partitioned on column dt, you will see the number of mappers cut down considerably compared with the full table scan needed otherwise.
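Whether partition pruning applies can be checked without running the join. This is a sketch, assuming again the table names from the question; since the layout of b is not known here, the first statement may simply fail.

```sql
-- List the partitions of b; this returns an error if b is not partitioned.
show partitions b;

-- Show which tables and partitions the query would read.
explain dependency
select * from a join b on (a.id = b.id) where b.dt = '2014-09-01';
```

If b is partitioned on dt, the input partitions reported by explain dependency should contain only the dt=2014-09-01 partition of b, which is what reduces the number of mappers.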
join hadoop hive query-optimization