java - Removing consecutive duplicate words from text using regex and displaying the new text
Hi,
I have the following code:
    import java.io.*;
    import java.util.ArrayList;
    import java.util.Scanner;
    import java.util.regex.*;

    public class RegexSimple4 {
        public static void main(String[] args) {
            try {
                Scanner myfis = new Scanner(new File("d:\\myfis32.txt"));
                ArrayList<String> foundaz = new ArrayList<String>();
                ArrayList<String> noduplicates = new ArrayList<String>();

                // Collect every space-separated token that the regex accepts.
                while (myfis.hasNext()) {
                    String line = myfis.nextLine();
                    String delim = " ";
                    String[] words = line.split(delim);
                    for (String s : words) {
                        if (!s.isEmpty() && s != null) {
                            Pattern pi = Pattern.compile("[aA-zZ]*");
                            Matcher ma = pi.matcher(s);
                            if (ma.find()) {
                                foundaz.add(s);
                            }
                        }
                    }
                }

                if (foundaz.isEmpty()) {
                    System.out.println("no words have been found");
                }

                // Copy the words, skipping one that equals the previous entry.
                if (!foundaz.isEmpty()) {
                    int n = foundaz.size();
                    String plus = foundaz.get(0);
                    noduplicates.add(plus);
                    for (int i = 1; i < n; i++) {
                        if (!noduplicates.get(i - 1).equalsIgnoreCase(foundaz.get(i))) {
                            noduplicates.add(foundaz.get(i));
                        }
                    }
                    // System.out.print("cuvantul/cuvintele \n" + i);
                }

                if (!foundaz.isEmpty()) {
                    System.out.print("original text \n");
                    for (String s : foundaz) {
                        System.out.println(s);
                    }
                }

                if (!noduplicates.isEmpty()) {
                    System.out.print("remove duplicates\n");
                    for (String s : noduplicates) {
                        System.out.println(s);
                    }
                }
            } catch (Exception ex) {
                System.out.println(ex);
            }
        }
    }
The purpose is to remove consecutive duplicate words. The code works only for a column of strings, not for full-length phrases.
For example, the input should be:
blah blah dog cat mice. cat mice dog dog.
and the output:
blah dog cat mice. cat mice dog.
Sincerely,
First of all, the regex [aA-zZ]* doesn't do what you think it does. It means "match zero or more a's, or characters in the range between ASCII A and ASCII z (which also includes [, ], \, and others), or Z's". It therefore also matches the empty string.
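A small sketch can make this concrete. It assumes the class in the question was meant as [aA-zZ]*, and it compares that class with [A-Za-z]+, which is only one plausible reading of what was intended. The class name CharClassDemo and the helper matchesWhole are made up for illustration.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // The character class from the question: 'a', the range A..z, and 'Z'.
    static final Pattern QUESTION_CLASS = Pattern.compile("[aA-zZ]*");
    // Assumed intent: one or more ASCII letters and nothing else.
    static final Pattern LETTERS_ONLY = Pattern.compile("[A-Za-z]+");

    // True if the whole string (not just a part of it) matches the pattern.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The range A..z also covers the punctuation between 'Z' and 'a'.
        System.out.println(matchesWhole(QUESTION_CLASS, "[]\\")); // true
        System.out.println(matchesWhole(LETTERS_ONLY, "[]\\"));   // false

        // Because of '*', the empty string matches, so find() succeeds
        // on any input at all, even one without a single letter.
        System.out.println(matchesWhole(QUESTION_CLASS, ""));     // true
        System.out.println(QUESTION_CLASS.matcher("123").find()); // true
        System.out.println(LETTERS_ONLY.matcher("123").find());   // false
    }
}
```

Since find() always succeeds with the original class, the check in the question's loop never filters anything out.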
Assuming you are looking for duplicate words that consist solely of ASCII letters, compared case-insensitively, and want to keep the first word (which means you wouldn't want to match "it's it's" or "olé olé!"), you can do it in a single regex operation:
    String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
which will change
hello hello hello there there past pastures
into
hello there past pastures
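For anyone who wants to try this directly, here is a minimal runnable sketch of the same replacement. The class name DedupeDemo and the method removeConsecutiveDuplicates are hypothetical names chosen for this example. The pattern is precompiled only so it can be reused across calls.

```java
import java.util.regex.Pattern;

public class DedupeDemo {
    // Same regex as above: a word followed by one or more repeats of itself.
    private static final Pattern REPEATED_WORD =
            Pattern.compile("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+");

    // Replaces each run of repeated words with its first occurrence.
    static String removeConsecutiveDuplicates(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints "hello there past pastures"
        System.out.println(removeConsecutiveDuplicates(
                "hello hello hello there there past pastures"));
        // prints "blah dog cat mice. cat mice dog."
        System.out.println(removeConsecutiveDuplicates(
                "blah blah dog cat mice. cat mice dog dog."));
    }
}
```

To process a file as in the question, the same method could be applied to each line read by the Scanner. Repeats that span a line break would then not be detected, because each line is handled separately.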
Explanation:
    (?i)      # Mode: case-insensitive
    \b        # Match the start of a word
    ([a-z]+)  # Match one ASCII "word", capture it in group 1
    \b        # Match the end of a word
    (?:       # Start of non-capturing group:
     \s+      # Match at least one whitespace character
     \1       # Match the same word as captured before (case-insensitively)
     \b       # and make sure it ends there.
    )+        # Repeat that as often as possible
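The commented layout above can also be kept in the Java source itself. The sketch below uses the Pattern.COMMENTS flag, which makes the regex engine ignore whitespace and treat "#" as the start of a comment running to the end of the line. The class name VerboseRegexDemo and the method dedupe are hypothetical, and each fragment ends in "\n" so that its comment stops there.

```java
import java.util.regex.Pattern;

public class VerboseRegexDemo {
    // The same pattern, written out piece by piece with inline comments.
    static final Pattern REPEATED_WORD = Pattern.compile(
            "\\b         # start of a word\n"
          + "([a-z]+)    # one ASCII word, captured in group 1\n"
          + "\\b         # end of that word\n"
          + "(?:         # non-capturing group:\n"
          + "  \\s+      #   at least one whitespace character\n"
          + "  \\1       #   the same word again\n"
          + "  \\b       #   ending at a word boundary\n"
          + ")+          # one or more repeats\n",
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    // Replaces each run of repeated words with its first occurrence.
    static String dedupe(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints "hello there past pastures"
        System.out.println(dedupe("hello hello hello there there past pastures"));
    }
}
```

The flags argument replaces the inline (?i), so the behavior should be the same as the one-line version. Only the readability of the source changes.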
See it live on regex101.com.