Wednesday 15 May 2013

Hive: Repeat SELECT for each Row




I'm not entirely sure the question title is accurate, so I'll explain as best I can in the body.

I'm working with the Million Song Dataset, http://labrosa.ee.columbia.edu/millionsong/

My ultimate goal is to create something along the lines of a "similar song" query, in which I take a song and find similar songs based on year, duration, etc.

I have the data in a Hive table set up as:

create table if not exists songs(genre string, artist string, danceability double, duration double, loudness double, similarartists string, hotness double, title string) partitioned by (year string) row format delimited fields terminated by '\t';

My problem is that Hive does not support inequality conditions in joins.

Ideally I'd have a query like:

select songs.artist, songs.title, t2.title from songs join songs t2 on songs.year > t2.year - 5 and songs.year < t2.year + 5;

However, this is not currently possible. I'm stumped as to the best way to attempt a similar-song query. It's possible for a single year in a select statement:

select title from songs where year < 2000 + 5 and year > 2000 - 5;

But I'm unsure how to run this on every row, taking the appropriate value from each song instead of a hard-coded year, i.e.:

select title from songs where year < song.year + 5 and year > song.year - 5;

Has anyone run into this situation, or does anyone have any general ideas to try?

You can cross join and subset in the where clause:

select songs.artist, songs.title, t2.title from songs cross join songs t2 where songs.year between (t2.year - 5) and (t2.year + 5);

Keep in mind that the above will match each song with itself. You'll need an extra restriction to remove those records if desired.
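One way to add that restriction is sketched below. The songs table has no explicit id column, so this assumes, purely for illustration, that the (artist, title) pair identifies a song. The aliases s and similar_title are made up for the example, and the casts to int are added because year is declared as a string.

```sql
-- Sketch only: assumes (artist, title) uniquely identifies a song,
-- because the songs table has no id column to compare on.
select s.artist, s.title, t2.title as similar_title
from songs s
cross join songs t2
where cast(s.year as int)
      between cast(t2.year as int) - 5 and cast(t2.year as int) + 5
  -- drop the pair in which a song is matched with itself
  and not (s.artist = t2.artist and s.title = t2.title);
```

Rows where artist or title is null are also dropped by the last condition, because the comparison evaluates to null. A cross join over the full dataset compares every song with every other song, so it may be worth trying on a small subset first.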

hive
