openrefine - Keep newest duplicate row depending on multiple Columns -
i seem have workflow problem open refine (google refine 2.5 [r2407]) sophisticated duplicate row cleaning. have found far how delete duplicate rows based on single column.
my aim delete duplicate rows based on multiple columns, @ best, in specific hierarchy.
example
given following dummy data in refine
+----+---------+---------+--------+------------+------+-----------------------------------+ | id | timeago | title | author | date | val1 | [after refine, keep record] | +----+---------+---------+--------+------------+------+-----------------------------------+ | 1 | 10 | faust | mr. | 2014-01-15 | 10 | ->b, older entry | | 2 | 11 | faust | mr. | 2014-01-21 | 10 | (because of date) | | 3 | 8 | faust | mr. | 2014-01-15 | 10 | b | | 4 | 8 | redhead | mr. b | 2014-01-21 | 34 | ->d, older entry | | 5 | 7 | redhead | mr. b | 2014-01-21 | 34 | ->d, same time ago, lower id | | 6 | 7 | redhead | mr. | 2014-01-01 | 13 | c (because of author, date, val1) | | 7 | 7 | redhead | mr. b | 2014-01-21 | 34 | d | +----+---------+---------+--------+------------+------+-----------------------------------+
i want kill duplicate rows based on following logic. if
- title && auther && date && val1 same, than
- keep newest (least timeago) row, if there multiple, than
- keep 1 highest id
the result be:
+---------+----+---------+---------+--------+------------+------+ | refined | id | timeago | title | author | date | val1 | +---------+----+---------+---------+--------+------------+------+ | | 2 | 10 | faust | mr. | 2014-01-21 | 10 | | b | 3 | 8 | faust | mr. | 2014-01-15 | 10 | | c | 6 | 7 | redhead | mr. | 2014-01-01 | 13 | | d | 7 | 7 | redhead | mr. b | 2014-01-21 | 34 | +---------+----+---------+---------+--------+------------+------+
easy approach?
if there no other solution, thankfully take scripting/grel one.
but done refines famous workflow "recording" achieve above logic, extracted , applied other same format datasets?
my motivation behind enable employees work more thoughtfully data (beyond excel) without confronting them right away full blown scripting language.
that sounds straightforward sorting problem.
- sort records title, author, time ago, , id
- re-order rows permanently (important - won't work if forget step)
- blank down on title & author
- move 2 columns 2 left positions
- join multivalued cells on remaining columns
- transform columns step 5 using
value.split(',')[0]
extract first value (which should value record want if sorted them in right order
Comments
Post a Comment