Thursday, 15 January 2015

awk to print unique latest date & time lines based on column fields




I would like to print unique lines based on the first field and the latest date & time in the 3rd field: keep the line with the latest date and time for each first-field value and remove the other duplicate occurrences. The file has around 50 million rows and is not sorted ...

input.csv

10,ab,15-sep-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-sep-14.07:09:35,abc,xxx,yyy,zzz

desired output:

10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz

I have attempted partial commands, but they are incomplete because of the file's date and time format and its unsorted order ...

awk -F, '!seen[$1,$3]++' input.csv

Looking for your suggestions ...

This awk command will do it for you:

awk -F, -v OFS=',' '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
                !($1 in b)||d>b[$1] {b[$1] =d; a[$1] = $0}
                END{for(x in a)print a[x]}' file

The first line transforms the original $3 into a valid date format string and gets the seconds since 1970 via the date command, so that the values can be compared later. Two arrays, a and b, hold the final result and the latest date (in seconds) for each first-field value. The END block prints all rows from a.

Test with the example data:

kent$ cat f
10,ab,15-sep-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-sep-14.07:09:35,abc,xxx,yyy,zzz

kent$ awk -F, '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
                !($1 in b)||d>b[$1] { b[$1] =d;a[$1] = $0 }
                END{for(x in a)print a[x]}' f
10 ab 25-sep-14 08:09:26 abc xxx yyy zzz
20 ab 23-sep-14 08:09:35 abc xxx yyy zzz
58 ab 22-jul-14 05:07:07 abc xxx yyy zzz
62 ab 12-sep-14 03:09:23 abc xxx yyy zzz
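The command above starts one date process per input row, which may be slow on a file with tens of millions of rows. As a rough sketch of an alternative (not part of the original answer), the timestamp can be turned into a string that sorts chronologically inside awk itself. The function name latest_per_key is hypothetical, and the sketch assumes every 3rd field looks exactly like dd-mon-yy.HH:MM:SS, with a zero-padded day, an English month abbreviation, and all years in the same century.

```shell
# Hypothetical helper: for each value of field 1, keep the row whose field 3
# holds the latest timestamp. Assumed format: dd-mon-yy.HH:MM:SS.
latest_per_key() {
  awk -F, '
    BEGIN {
      # Map month abbreviations to two-digit numbers: jan -> 01 ... dec -> 12
      n = split("jan feb mar apr may jun jul aug sep oct nov dec", m, " ")
      for (i = 1; i <= n; i++) mon[m[i]] = sprintf("%02d", i)
    }
    {
      # "25-sep-14.08:09:26" -> p[1]=25, p[2]=sep, p[3]=14, p[4]=08:09:26
      split($3, p, "[-.]")
      # Fixed-width key (yy mm dd HH:MM:SS) that compares correctly as a string
      key = p[3] mon[tolower(p[2])] p[1] p[4]
      if (!($1 in best) || key > best[$1]) { best[$1] = key; row[$1] = $0 }
    }
    END { for (k in row) print row[k] }   # output order is unspecified
  ' "$@"
}
```

It could be called as latest_per_key input.csv. Because $3 is never modified, each row is printed exactly as it appeared in the input, including the "." between date and time. The order of the printed rows is not guaranteed, so pipe the result through sort if a stable order matters.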

