openrefine - Keep newest duplicate row depending on multiple Columns -


i seem have workflow problem open refine (google refine 2.5 [r2407]) sophisticated duplicate row cleaning. have found far how delete duplicate rows based on single column.

my aim delete duplicate rows based on multiple columns, @ best, in specific hierarchy.

example

given following dummy data in refine

+----+---------+---------+--------+------------+------+-----------------------------------+ | id | timeago |  title  | author |    date    | val1 |    [after refine, keep record]    | +----+---------+---------+--------+------------+------+-----------------------------------+ |  1 |      10 | faust   | mr.  | 2014-01-15 |   10 | ->b, older entry                  | |  2 |      11 | faust   | mr.  | 2014-01-21 |   10 | (because of date)               | |  3 |       8 | faust   | mr.  | 2014-01-15 |   10 | b                                 | |  4 |       8 | redhead | mr. b  | 2014-01-21 |   34 | ->d, older entry                  | |  5 |       7 | redhead | mr. b  | 2014-01-21 |   34 | ->d, same time ago, lower id  | |  6 |       7 | redhead | mr.  | 2014-01-01 |   13 | c (because of author, date, val1) | |  7 |       7 | redhead | mr. b  | 2014-01-21 |   34 | d                                 | +----+---------+---------+--------+------------+------+-----------------------------------+ 

i want kill duplicate rows based on following logic. if

  • title && auther && date && val1 same, than
  • keep newest (least timeago) row, if there multiple, than
  • keep 1 highest id

the result be:

+---------+----+---------+---------+--------+------------+------+ | refined | id | timeago |  title  | author |    date    | val1 | +---------+----+---------+---------+--------+------------+------+ |       |  2 |      10 | faust   | mr.  | 2014-01-21 |   10 | | b       |  3 |       8 | faust   | mr.  | 2014-01-15 |   10 | | c       |  6 |       7 | redhead | mr.  | 2014-01-01 |   13 | | d       |  7 |       7 | redhead | mr. b  | 2014-01-21 |   34 | +---------+----+---------+---------+--------+------------+------+ 

easy approach?

if there no other solution, thankfully take scripting/grel one.

but done refines famous workflow "recording" achieve above logic, extracted , applied other same format datasets?

my motivation behind enable employees work more thoughtfully data (beyond excel) without confronting them right away full blown scripting language.

that sounds straightforward sorting problem.

  1. sort records title, author, time ago, , id
  2. re-order rows permanently (important - won't work if forget step)
  3. blank down on title & author
  4. move 2 columns 2 left positions
  5. join multivalued cells on remaining columns
  6. transform columns step 5 using value.split(',')[0] extract first value (which should value record want if sorted them in right order

Comments

Popular posts from this blog

how to proxy from https to http with lighttpd -

android - Automated my builds -

python - Flask migration error -