21.04.2010
Sometimes you need to grab an HTML page and extract some content from it. You can do this with regular expressions, but it's much easier with Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen-scraping. One important feature is that it won't choke if you give it bad markup: it yields a parse tree that makes approximately as much sense as your original document.
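To see that tolerance in action, here is a minimal sketch. It assumes Python 3 and the newer bs4 package (Beautiful Soup 4) rather than the version 3 module used in the script later in this post, and the broken markup string is made up for illustration. The exact shape of the repaired tree depends on which underlying parser you pick.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module

# Deliberately broken markup with no closing tags at all (made-up sample).
broken = "<ul><li>First item<li>Second <b>item"

# "html.parser" is the parser bundled with Python; other parsers
# may repair the same markup slightly differently.
soup = BeautifulSoup(broken, "html.parser")

items = soup.find_all("li")
print(len(items))       # both <li> tags are still found
print(soup.get_text())  # the text survives despite the missing tags
```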
Download it from here and install it with python setup.py install.
For a live example, we'll use the HowTos page from the Linux Journal website. This page might change over time, so I'll attach a picture of the DOM structure as it is now. Our goal is to grab the title, author and short description of each article on that page.
Each article section is marked with the class attribute "node node-teaser node-type-story", and within it we find the title, author and short description. Using the findAll and find methods from BeautifulSoup, we grab links, headers and paragraphs. For more on this, check http://www.crummy.com/software/BeautifulSoup/documentation.html.
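Since the live page may change, the lookups can also be tried on a small inline snippet first. The sketch below is not the script from this post: it assumes Python 3 with the bs4 package (Beautiful Soup 4), the sample markup is hypothetical and only modeled on the class names described above, and extract_articles is a name I made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 on Python 3

# Hypothetical teaser markup, modeled on the class names described above.
SAMPLE = """
<div class="node node-teaser node-type-story">
  <h2 class="title"><a href="/content/example">Example HowTo</a></h2>
  <div class="submitted">
    by <a href="/users/jane" title="View user profile.">Jane Doe</a>
  </div>
  <div class="content"><p>A short description of the article.</p></div>
</div>
"""


def extract_articles(html):
    """Return a list of (title, author, description) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # class_ matches any element whose class list contains "node-teaser".
    for node in soup.find_all("div", class_="node-teaser"):
        title = node.find("h2", class_="title").find("a").get_text(strip=True)
        author = node.find("div", class_="submitted").find(
            "a", attrs={"title": "View user profile."}).get_text(strip=True)
        summary = node.find("div", class_="content").find("p").get_text(strip=True)
        articles.append((title, author, summary))
    return articles


for title, author, summary in extract_articles(SAMPLE):
    print("%s BY %s" % (title, author))
    print("DESCRIPTION: %s" % summary)
```

Searching directly on each matched node keeps the lookups scoped to one article without parsing its markup a second time.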
#!/usr/bin/env python
# -*- coding: utf8 -*-
import sys
from urllib2 import urlopen, URLError, HTTPError
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup


class GrabArticles:
    articles_url = 'http://www.linuxjournal.com/taxonomy/term/19'
    html = None  # stays None if the download fails

    def run(self):
        '''main function'''
        self._extract_page()
        self._process_page()

    def _extract_page(self):
        '''downloads the page into the self.html variable'''
        print "Extracting page %s" % self.articles_url
        try:
            response = urlopen(self.articles_url)
            self.html = response.read()
            response.close()
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        except URLError, e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason

    def _process_page(self):
        '''processes the html content and extracts the needed information'''
        try:
            if not self.html:
                raise Exception("No html content to be parsed")
            dom = BeautifulSoup(self.html)
            # every article teaser on the page carries this class attribute
            sections = dom.findAll(None, {"class": "node node-teaser node-type-story"})
            for sec in sections:
                # re-parse the section so that HTML entities get converted
                secdom = BeautifulSoup(str(sec), convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
                title = secdom.find('h2', {'class': 'title'}).find('a').text
                author = secdom.find('div', {'class': 'submitted'}).find('a', {'title': 'View user profile.'}).text
                content_short = secdom.find('div', {'class': 'content'}).find('p').text
                print "%s BY %s" % (title, author)
                print "DESCRIPTION: %s" % content_short[:-7]  # to remove ".more>>"
                print "--------------------------"
        except Exception, e:
            print e
            sys.exit(1)


if __name__ == "__main__":
    grabob = GrabArticles()
    grabob.run()
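The script above targets Python 2: urllib2, print statements and the "except HTTPError, e" syntax are not available in Python 3. As a rough sketch of how the download step might look there, using only the standard library (the fetch_page name and the timeout value are my own choices, not part of the original script):

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Return the page body as bytes, or None if the download fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:  # must come first: HTTPError subclasses URLError
        print("The server couldn't fulfill the request.")
        print("Error code:", e.code)
    except URLError as e:
        print("We failed to reach a server.")
        print("Reason:", e.reason)
    return None

# Example call (needs network access):
#   html = fetch_page("http://www.linuxjournal.com/taxonomy/term/19")
```

Returning None on failure lets the caller decide whether to stop or carry on, instead of relying on an attribute that may never have been set.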
This is just a small example, but I hope it helps. Cheers!