python - Why isn't this regexp working
I have the source code of a webpage formatted like this:
<span class="l r positive-icon"> turkish </span> <span> the.mist[2007]dvdrip[eng]-axxo </span> <span class="l r neutral-icon"> vietnamese </span> <span> the.mist.2007.720p.bluray.x264.yify </span>
As you can see, there are spans with a class of either "l r positive-icon" or "l r neutral-icon". I want the languages, which sit inside the spans that have a class. I use this regexp, but it gives me an empty list:
alllanguages = re.findall('<span class=".*">\s(.*)\s</span>', alllanguagestags)
alllanguagestags contains the source code shown above. Can anyone give me a hint?
Don't use regular expressions. Use an actual HTML parser. I recommend you use BeautifulSoup instead:
from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml)
# class_=True matches only the spans that have a class attribute
languages = [s.get_text().strip() for s in soup.find_all('span', class_=True)]
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <span class="l r positive-icon">
... turkish
... </span>
... <span>
... the.mist[2007]dvdrip[eng]-axxo
... </span>
... <span class="l r neutral-icon">
... vietnamese
... </span>
... <span>
... the.mist.2007.720p.bluray.x264.yify
... </span>
... ''')
>>> [s.get_text().strip() for s in soup.find_all('span', class_=True)]
[u'turkish', u'vietnamese']
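If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the answer above uses. The class name ClassedSpanText and the html variable are made up for illustration, and the sketch assumes the spans are never nested inside each other:

```python
from html.parser import HTMLParser


class ClassedSpanText(HTMLParser):
    """Collect the stripped text of every <span> that has a class attribute.

    Hypothetical helper for illustration; assumes spans are not nested.
    """

    def __init__(self):
        super().__init__()
        self.languages = []   # text of each matching span, in document order
        self._buffer = None   # list of text chunks while inside a matching span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; only spans with a class count
        if tag == "span" and any(name == "class" for name, _ in attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buffer is not None:
            self.languages.append("".join(self._buffer).strip())
            self._buffer = None


# Sample markup copied from the question
html = (
    '<span class="l r positive-icon"> turkish </span> '
    '<span> the.mist[2007]dvdrip[eng]-axxo </span> '
    '<span class="l r neutral-icon"> vietnamese </span> '
    '<span> the.mist.2007.720p.bluray.x264.yify </span>'
)

parser = ClassedSpanText()
parser.feed(html)
parser.close()
languages = parser.languages
print(languages)
```

For anything beyond a flat list of spans like this, BeautifulSoup is likely the less error-prone choice, since it builds the whole tree and handles nesting for you.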