==== Grab and parse an HTML page using BeautifulSoup and Python ====
21.04.2010

Sometimes you need to grab an HTML page and extract some content from it. You could do this with regular expressions, but it is much easier with [[http://www.crummy.com/software/BeautifulSoup/|Beautiful Soup]].
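To see why the regex route gets brittle quickly, here is a small illustrative sketch; the markup snippets are made up for this example, not taken from any real page.

```python
import re

# A hypothetical snippet standing in for part of a fetched page.
html = '<h2 class="title"><a href="/node/1">First article</a></h2>'

# The regex works for exactly this markup...
title = re.search(r'<h2 class="title"><a[^>]+>([^<]+)</a></h2>', html).group(1)
print(title)  # prints: First article

# ...but a harmless change, such as an extra attribute on the <h2>,
# silently breaks the match. A real parser shrugs this off.
reordered = '<h2 id="top" class="title"><a href="/node/1">First article</a></h2>'
broken = re.search(r'<h2 class="title"><a[^>]+>([^<]+)</a></h2>', reordered)
print(broken)  # prints: None
```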

=== What's Beautiful Soup, anyway? ===

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. One important feature is that it won't choke if you give it bad markup: it still yields a parse tree that makes approximately as much sense as your original document. \\ 
Download it from [[http://www.crummy.com/software/BeautifulSoup/#Download|here]] and install it with //python setup.py install//.
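You can get a feel for that "don't choke" behaviour even without installing anything. This is not Beautiful Soup itself, just a sketch of the same tolerance using Python's standard-library parser (the module path shown is the Python 3 one; the article's own code targets Python 2):

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    '''Records every start tag it sees, errors or not.'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
# <b> is never closed and </i> was never opened -- a strict XML
# parser would reject this, but a lenient HTML parser keeps going.
p.feed('<p>bad <b>markup</i> here</p>')
print(p.tags)  # prints: ['p', 'b']
```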

=== Analyze the test page ===

To have a live example, we'll use the HowTos page from the [[http://www.linuxjournal.com/taxonomy/term/19|Linux Journal website]]. Over time this page might change, so I'll attach a picture of the DOM structure as it is now. Our goal is to grab the title, the author and the short description of each article on that page.

{{:python:domstructure.png?100|DOM structure}}

=== Make a pythonic soup ===

Each article section is marked with //class="node node-teaser node-type-story"//, and within it we find the title, the author and the short description. Using BeautifulSoup's //find// method, we grab the links, headers and paragraphs. For more on this, check http://www.crummy.com/software/BeautifulSoup/documentation.html.

<code python>
#!/usr/bin/env python
# -*- coding: utf8 -*-

import sys
from urllib2 import urlopen, URLError, HTTPError
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup


class GrabArticles:
    articles_url = 'http://www.linuxjournal.com/taxonomy/term/19'
    html = None  # filled in by _extract_page; stays None if the fetch fails

    def run(self):
        '''main entry point: fetch the page, then parse it'''
        self._extract_page()
        self._process_page()

    def _extract_page(self):
        '''fetches the page into the self.html variable'''
        print "Extracting page %s" % self.articles_url
        try:
            response = urlopen(self.articles_url)
            self.html = response.read()
            response.close()
        except HTTPError, e:
            print "The server couldn't fulfill the request."
            print 'Error code: ', e.code
        except URLError, e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason

    def _process_page(self):
        '''parses the html content and extracts the needed information'''
        try:
            if not self.html:
                raise Exception("No html content to be parsed")

            dom = BeautifulSoup(self.html)
            sections = dom.findAll(None, {"class": "node node-teaser node-type-story"})
            for sec in sections:
                # re-parse each section so HTML entities are converted to plain text
                secdom = BeautifulSoup(str(sec), convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
                title = secdom.find('h2', {'class': 'title'}).find('a').text
                author = secdom.find('div', {'class': 'submitted'}).find('a', {'title': 'View user profile.'}).text
                content_short = secdom.find('div', {'class': 'content'}).find('p').text
                print "%s BY %s" % (title, author)
                print "DESCRIPTION: %s" % content_short[:-7]  # drop the trailing ".more>>"
                print "--------------------------"
        except Exception, e:
            print e
            sys.exit(1)


if __name__ == "__main__":
    grabob = GrabArticles()
    grabob.run()
</code>
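For readers who want to try the extraction logic without installing BeautifulSoup, here is a rough standard-library sketch of the same idea (Python 3 module path). The teaser snippet below is made up to mirror the DOM structure described above; the title, author and markup are assumptions for illustration, not the real page.

```python
from html.parser import HTMLParser

# Hypothetical miniature of one article teaser, mirroring the DOM the
# article describes (real Linux Journal pages are larger and may differ).
SNIPPET = '''
<div class="node node-teaser node-type-story">
  <h2 class="title"><a href="/node/1">Fun with Pipes</a></h2>
  <div class="submitted">By <a title="View user profile." href="/user/1">Jane Doe</a></div>
  <div class="content"><p>A short teaser about shell pipes.</p></div>
</div>
'''

class TeaserParser(HTMLParser):
    '''Collects title, author and description, keyed off the same
    class/title attributes the BeautifulSoup version looks for.'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.context = None   # which field the next text chunk belongs to
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2' and attrs.get('class') == 'title':
            self.context = 'title'
        elif tag == 'a' and attrs.get('title') == 'View user profile.':
            self.context = 'author'
        elif tag == 'div' and attrs.get('class') == 'content':
            self.context = 'description'

    def handle_data(self, data):
        # store the first non-blank text chunk after a matching start tag
        if self.context and data.strip():
            self.fields[self.context] = data.strip()
            self.context = None

parser = TeaserParser()
parser.feed(SNIPPET)
print("%(title)s BY %(author)s" % parser.fields)  # prints: Fun with Pipes BY Jane Doe
```

This is far more manual than the soup version, which is exactly the article's point: BeautifulSoup's //find// calls replace all of this bookkeeping.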

//This is just a small example, but I hope it helps. Cheers!//
python/graburl.txt · Last modified: 2013/03/16 17:40 (external edit)