java - Removing consecutive duplicates words out of text using Regex and displaying the new text -

Hi,

I have the following code:

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Scanner;
    import java.util.regex.*;

    public class regexsimple4 {
        public static void main(String[] args) {
            try {
                Scanner myfis = new Scanner(new File("d:\\myfis32.txt"));
                ArrayList<String> foundaz = new ArrayList<String>();
                ArrayList<String> noduplicates = new ArrayList<String>();

                while (myfis.hasNext()) {
                    String line = myfis.nextLine();
                    String delim = " ";
                    String[] words = line.split(delim);
                    for (String s : words) {
                        if (!s.isEmpty() && s != null) {
                            Pattern pi = Pattern.compile("[aA-zZ]*");
                            Matcher ma = pi.matcher(s);
                            if (ma.find()) {
                                foundaz.add(s);
                            }
                        }
                    }
                }

                if (foundaz.isEmpty()) {
                    System.out.println("no words have been found");
                }

                if (!foundaz.isEmpty()) {
                    int n = foundaz.size();
                    String plus = foundaz.get(0);
                    noduplicates.add(plus);
                    for (int i = 1; i < n; i++) {
                        if (!noduplicates.get(i - 1).equalsIgnoreCase(foundaz.get(i))) {
                            noduplicates.add(foundaz.get(i));
                        }
                    }
                    //System.out.print("cuvantul/cuvintele \n"+i);
                }

                if (!foundaz.isEmpty()) {
                    System.out.print("original text \n");
                    for (String s : foundaz) {
                        System.out.println(s);
                    }
                }

                if (!noduplicates.isEmpty()) {
                    System.out.print("remove duplicates\n");
                    for (String s : noduplicates) {
                        System.out.println(s);
                    }
                }
            } catch (Exception ex) {
                System.out.println(ex);
            }
        }
    }

The purpose is to remove consecutive duplicate words from phrases. My code works only for a column of strings, not for full-length phrases.

For example, my input should be:

blah blah dog cat mice. cat mice dog dog.

and the output:

blah dog cat mice. cat mice dog.

Sincerely,

First of all, the regex [aA-zZ]* doesn't do what you think it does. It means "match 0 or more as, or characters in the range between ASCII A and ASCII z (which also includes [, ], \ and others), or Zs". It therefore also matches the empty string.
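To see this for yourself, here is a small sketch (the class name `CharClassDemo` and the helper `matchesQuestionClass` are made up for illustration) that tests the character class from the question against a few inputs:

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns true if the whole input matches the character class from the question.
    static boolean matchesQuestionClass(String input) {
        return Pattern.matches("[aA-zZ]*", input);
    }

    public static void main(String[] args) {
        // The * quantifier allows zero repetitions, so the empty string matches.
        System.out.println(matchesQuestionClass(""));        // true
        // [ \ ] ^ _ all sit between 'Z' and 'a' in ASCII, so the A-z range includes them.
        System.out.println(matchesQuestionClass("[\\]^_"));  // true
        // Ordinary words match as well, which is why the bug is easy to miss.
        System.out.println(matchesQuestionClass("dog"));     // true
        // The class that was probably intended rejects the punctuation.
        System.out.println(Pattern.matches("[a-zA-Z]+", "[\\]^_")); // false
    }
}
```

If the intent was "ASCII letters only, at least one", `[a-zA-Z]+` is probably what you wanted.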

Assuming that you are only looking for duplicate words that consist solely of ASCII letters, case-insensitively, keeping the first word (which means that you wouldn't want to match "it's it's" or "olé olé!"), you can do that in a single regex operation:

    String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");

which will change

hello hello hello there there past pastures

into

hello there past pastures
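As a runnable sketch of that one-liner (the class name `DedupDemo` and the method `removeConsecutiveDuplicates` are hypothetical names of mine, not part of any library), applied to the sample above and to the input from the question:

```java
class DedupDemo {
    // Collapses each run of the same ASCII-letter word (compared case-insensitively)
    // down to its first occurrence.
    static String removeConsecutiveDuplicates(String subject) {
        return subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        // prints: hello there past pastures
        System.out.println(removeConsecutiveDuplicates("hello hello hello there there past pastures"));
        // prints: blah dog cat mice. cat mice dog.
        System.out.println(removeConsecutiveDuplicates("blah blah dog cat mice. cat mice dog dog."));
    }
}
```

Note that "past pastures" is left alone because the trailing `\b` requires the repeated word to end right there.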

Explanation:

    (?i)     # Mode: case-insensitive
    \b       # Match the start of a word
    ([a-z]+) # Match one ASCII "word", capture it in group 1
    \b       # Match the end of a word
    (?:      # Start of non-capturing group:
     \s+     # Match at least one whitespace character
     \1      # Match the same word as captured before (case-insensitively)
     \b      # and make sure it ends there.
    )+       # Repeat that as often as possible

See it live on regex101.com.
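If you would like to keep comments like these inside the Java source, one option is the `Pattern.COMMENTS` flag, which makes the regex engine ignore whitespace and `#` comments in the pattern. A minimal sketch, with `CommentedRegexDemo`, `DUPLICATE_WORDS` and `collapse` as illustrative names:

```java
import java.util.regex.Pattern;

class CommentedRegexDemo {
    // Same pattern as in the answer, compiled with COMMENTS so that whitespace
    // and everything from # to the end of each line is ignored by the engine.
    static final Pattern DUPLICATE_WORDS = Pattern.compile(
        "\\b([a-z]+)\\b   # one ASCII word, captured in group 1\n" +
        "(?:              # non-capturing group:\n" +
        "  \\s+\\1\\b     # whitespace, then the same word again, ending at a word boundary\n" +
        ")+               # repeated as often as possible\n",
        Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    // Replaces each run of repeated words with the first word of the run.
    static String collapse(String text) {
        return DUPLICATE_WORDS.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: hello there past pastures
        System.out.println(collapse("hello hello hello there there past pastures"));
    }
}
```

Compiling the pattern once into a constant also avoids recompiling it on every call, which `String.replaceAll` would do.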
