parsing - urllib2 returning nothing in Python
I am confused! Can anyone tell me what the problem is? This code used to work, but it started returning nothing since yesterday. I did not make any changes to it. Does anyone have an idea?
import re
from re import sub
import time
import cookielib
from cookielib import CookieJar
import urllib2
from urllib2 import urlopen
import difflib
import requests

def twitparser():
    try:
        # Open the page with a cookie-aware opener
        cj = CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        res = opener.open('https://twitter.com/haberturk')
        html = res.read()
        # Grab the inner HTML of every tweet paragraph
        splitsource = re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>', html)
        print len(splitsource)
        for item in splitsource:
            # Strip any remaining tags from the tweet text
            atweet = re.sub(r'<.*?>', '', item)
            print atweet
    except Exception, e:
        print str(e)
        print 'error in main try'

twitparser()
If your code did not change, probably something else did:
This tag does not exist anymore:
<p class="js-tweet-text tweet-text">
Instead, the class attribute now looks something like this:
profiletweet-text js-tweet-text u-dir
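To see why the old pattern now finds nothing, here is a small sketch that runs the regexp from the question against two fragments. Both fragments are made up for illustration (they are not copied from the live page), and the names OLD_PATTERN, old_markup and new_markup are mine.

```python
import re

# The pattern from the question: it only matches this exact class attribute.
OLD_PATTERN = r'<p class="js-tweet-text tweet-text">(.*?)</p>'

# Hypothetical fragments, for illustration only.
old_markup = '<p class="js-tweet-text tweet-text">hello</p>'
new_markup = '<p class="profiletweet-text js-tweet-text u-dir">hello</p>'

old_matches = re.findall(OLD_PATTERN, old_markup)
new_matches = re.findall(OLD_PATTERN, new_markup)

print(len(old_matches))  # one match on the old markup
print(len(new_matches))  # no match once the class list changes
```

Because the regexp is tied to the literal attribute string, any change in the class list makes findall return an empty list without raising an error, which matches the "returning nothing" symptom.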
Although it is possible to do what you want using regexps, you should not use them for this; use an XML/HTML parser instead:
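from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
ptags = soup.find_all("p")
# p.get avoids a KeyError for <p> tags that have no class attribute
texts = [p.text for p in ptags if "js-tweet-text" in p.get("class", [])]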
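You should probably split the function into separate steps: first make sure you actually got the HTML, then check whether you find any p tags, then check whether any of them meet your criteria.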
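A rough sketch of those staged checks, under a few assumptions: it uses the standard library's HTMLParser in place of BeautifulSoup, only so that it runs without extra packages, and the names TweetTextParser and extract_tweets as well as the sample fragment are made up for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TweetTextParser(HTMLParser):
    """Collects the text of every <p> whose class list contains a marker."""

    def __init__(self, marker="js-tweet-text"):
        HTMLParser.__init__(self)
        self.marker = marker
        self.p_count = 0       # number of <p> tags seen
        self.texts = []        # text of the <p> tags that matched
        self._current = None   # text pieces of the <p> being collected

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        self.p_count += 1
        classes = (dict(attrs).get("class") or "").split()
        if self.marker in classes:
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.texts.append("".join(self._current))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def extract_tweets(html):
    # Step 1: did we get any HTML at all?
    if not html:
        raise ValueError("no HTML was returned")
    parser = TweetTextParser()
    parser.feed(html)
    parser.close()
    # Step 2: are there any <p> tags?
    if parser.p_count == 0:
        raise ValueError("the HTML contains no <p> tags")
    # Step 3: do any of them meet the criteria?
    if not parser.texts:
        raise ValueError("no <p> tag has the class js-tweet-text")
    return parser.texts


# Hypothetical fragment, for illustration only.
sample = ('<div><p class="profiletweet-text js-tweet-text u-dir">'
          'first <a href="#">link</a></p><p>other</p></div>')
tweets = extract_tweets(sample)
print(tweets)
```

Each step fails with its own message, so when the page layout changes again you can tell whether the request, the tag, or the class name is the part that broke.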
As Wooble said, you should use the Twitter API instead. These companies offer APIs so that you don't have to scrape, since scraping costs them resources.