awk to print unique latest date & time lines based on column fields
I want to print unique lines based on the first field, keeping for each key the occurrence with the latest date & time in the 3rd field and removing the other duplicate occurrences. The file has around 1000000 rows and is not sorted ...
input.csv
10,ab,15-sep-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-sep-14.07:09:35,abc,xxx,yyy,zzz
desired output:
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
I have attempted a partial command, but it is incomplete because of the date/time format of the file and its unsorted order ...

awk -F, '!seen[$1,$3]++' input.csv

This only keeps the first occurrence of each ($1,$3) pair; it does not keep the latest date per first field. Looking for suggestions ...
This awk command does it for you:

awk -F, -v OFS=',' '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d} !($1 in b)||d>b[$1] {b[$1] = d; a[$1] = $0} END{for(x in a)print a[x]}' file
The first block transforms the original $3 into a valid date format string and converts it into seconds since 1970 via the external date command, for later comparison. Two arrays are used: a holds the final result row for each key, and b holds its latest date (in seconds). The END block prints the rows collected in a.
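Because the command above spawns one external date process per input row, it can be slow on a file of around 1000000 rows. A pure-awk sketch (an alternative, assuming the month abbreviations are lowercase English and the two-digit years all share a century, as in the sample) builds a lexically sortable key without shelling out:

```shell
# Sample input from the question.
input='10,ab,15-sep-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-sep-14.07:09:35,abc,xxx,yyy,zzz'

out=$(printf '%s\n' "$input" | awk -F, '
BEGIN {
    # Map lowercase month abbreviations to two-digit numbers.
    split("jan feb mar apr may jun jul aug sep oct nov dec", m, " ")
    for (i = 1; i <= 12; i++) mon[m[i]] = sprintf("%02d", i)
}
{
    # $3 is day-mon-yy.HH:MM:SS, e.g. "25-sep-14.08:09:26".
    split($3, dt, /[-.]/)
    # Reorder as yy mm dd . time so plain string comparison works.
    key = dt[3] mon[dt[2]] dt[1] "." dt[4]   # e.g. "140925.08:09:26"
    if (!($1 in best) || key > best[$1]) { best[$1] = key; row[$1] = $0 }
}
END { for (k in row) print row[k] }')
printf '%s\n' "$out"
```

Since $0 is never modified, the output lines keep their original comma-separated form.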
Test with your illustration data:

kent$ cat f
10,ab,15-sep-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-sep-14.07:09:35,abc,xxx,yyy,zzz
kent$ awk -F, '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d} !($1 in b)||d>b[$1] { b[$1] = d; a[$1] = $0 } END{for(x in a)print a[x]}' f
10 ab 25-sep-14 08:09:26 abc xxx yyy zzz
20 ab 23-sep-14 08:09:35 abc xxx yyy zzz
58 ab 22-jul-14 05:07:07 abc xxx yyy zzz
62 ab 12-sep-14 03:09:23 abc xxx yyy zzz
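One caveat with the cmd | getline approach on a large input: each distinct date command string opens its own pipe, and awk has a limited number of open file descriptors, so the pipe should be close()d after reading from it. A sketch of that variant (same sample data; it also leaves $0 untouched, so the output stays comma-separated without needing OFS), assuming GNU date with the -d option:

```shell
input='10,ab,15-sep-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-sep-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-sep-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-sep-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-jul-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-sep-14.07:09:35,abc,xxx,yyy,zzz'

out=$(printf '%s\n' "$input" | awk -F, '{
    d3 = $3; sub(/[.]/, " ", d3)            # "25-sep-14 08:09:26" for GNU date
    cmd = "date -d \"" d3 "\" +%s"
    cmd | getline secs
    close(cmd)                              # release the pipe before the next row
    if (!($1 in best) || secs + 0 > best[$1]) { best[$1] = secs + 0; row[$1] = $0 }
}
END { for (k in row) print row[k] }')
printf '%s\n' "$out"
```

Copying $3 into d3 before the sub() is what keeps awk from rebuilding $0 with the default output separator.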
awk