Grab and parse a HTML page using BeautifulSoup and Python

21.04.2010

Sometimes you need to grab an HTML page and extract some content from it. You could do this with regular expressions, but it's much easier with Beautiful Soup.

What's Beautiful Soup anyway?

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. One important feature is that Beautiful Soup won't choke if you give it bad markup: it yields a parse tree that makes approximately as much sense as your original document.
Download it from the Beautiful Soup website and install it with python setup.py install.
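To see the forgiving parser in action, feed it some deliberately sloppy markup. The article uses the 2010-era import "from BeautifulSoup import BeautifulSoup"; this quick check uses the modern bs4 package instead, but the behavior is the same:

```python
# Quick sketch: parsing bad markup. Uses the modern bs4 package
# (pip install beautifulsoup4) rather than the original BS3 import.
from bs4 import BeautifulSoup

broken = "<p>an <b>unclosed tag"   # note: neither tag is closed
soup = BeautifulSoup(broken, "html.parser")

# Beautiful Soup closes the dangling tags for us and builds a tree anyway.
print(soup.p.get_text())  # an unclosed tag
```

A regex-based scraper would typically miss or mangle content like this; the parse tree makes it a non-issue.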

Analyze the test page

To have a live example, we'll use the HowTos page from the Linux Journal website. Over time this page might change, so I've attached a picture of the DOM structure as it is now. Our goal is to grab the title, author, and short description of each article on that page.

(Figure: DOM structure of the article teasers)

Make a pythonic soup

Each article section is marked with class "node node-teaser node-type-story", and within it we find the title, author, and short description. Using BeautifulSoup's find method, we grab the links, headers, and paragraphs. For more on this, check http://www.crummy.com/software/BeautifulSoup/documentation.html.
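Before the full script, here is the extraction logic on its own, run against a miniature, made-up fragment: the class names below mirror the real page, but the titles and text are invented. The calls are shown with the modern bs4 package (find_all instead of BS3's findAll); the logic is identical to the script that follows.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the Linux Journal teaser structure.
html = """
<div class="node node-teaser node-type-story">
  <h2 class="title"><a href="/a1">Intro to Screen Scraping</a></h2>
  <div class="submitted">By <a title="View user profile.">alice</a></div>
  <div class="content"><p>A short teaser paragraph.</p></div>
</div>
"""

dom = BeautifulSoup(html, "html.parser")
# Matching the exact class string selects each article section.
for sec in dom.find_all("div", {"class": "node node-teaser node-type-story"}):
    title = sec.find("h2", {"class": "title"}).find("a").text
    author = sec.find("div", {"class": "submitted"}).find("a").text
    teaser = sec.find("div", {"class": "content"}).find("p").text
    print("%s BY %s" % (title, author))
    print("DESCRIPTION: %s" % teaser)
```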

#!/usr/bin/env python
# -*- coding: utf8 -*-
 
import sys
from urllib2 import urlopen, URLError, HTTPError
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
 
 
 
class GrabArticles:
    articles_url = 'http://www.linuxjournal.com/taxonomy/term/19'
    html = None  # filled in by _extract_page()
 
    def run(self):
        '''main function'''
        self._extract_page()
        self._process_page()
 
    def _extract_page(self):
        '''fetch the page into self.html'''
        print "Extracting page %s" % self.articles_url
        try:
            response = urlopen(self.articles_url)
            self.html = response.read()
            response.close()
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        except URLError, e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
 
 
    def _process_page(self):
        '''process the html content and extract needed information'''
        try:
            if not self.html:
                raise Exception("No html content to be parsed")
 
            dom = BeautifulSoup(self.html)
            sections = dom.findAll(None, {"class":"node node-teaser node-type-story"})
            for sec in sections:
                secdom = BeautifulSoup(str(sec), convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
                title = secdom.find('h2', { 'class':'title' }).find('a').text
                author = secdom.find('div', { 'class':'submitted' }).find('a', {'title':'View user profile.'}).text
                content_short = secdom.find('div', { 'class':'content' }).find('p').text
                print "%s BY %s" % (title, author)
                print "DESCRIPTION: %s" % content_short[:-7] # to remove ".more>>"
                print "--------------------------"
        except Exception, e:
            print e
            sys.exit(1)
 
 
if __name__ == "__main__":
    grabob = GrabArticles()
    grabob.run()
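One fragile spot: the content_short[:-7] slice assumes every teaser ends with the literal ".more>>" marker. A slightly safer sketch (independent of the scraper above, stdlib only) strips the marker only when it is actually present:

```python
import re

def strip_more(text):
    """Remove a trailing '.more>>' marker, if any."""
    return re.sub(r'\.?\s*more>>\s*$', '', text).rstrip()

print(strip_more("A short teaser.more>>"))  # A short teaser
print(strip_more("No marker here."))        # No marker here.
```

If the site ever drops the marker, the blind slice would silently eat the last seven characters of the description; the regex version leaves such teasers untouched.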

This is just a small example, but I hope it helps. Cheers!

python/graburl.txt · Last modified: 2013/03/16 17:40 (external edit)