
Extracting meaningful content from raw HTML

Parsing HTML is easy. Libraries like Beautiful Soup give you a compact and straightforward interface to process websites in your preferred programming language. But this is only the first step. The interesting question is: how do you extract the meaningful content of an HTML page?

I tried to find an answer to this question during the last couple of days – and here’s what I found.

Arc90 Readability

My favorite solution is the so-called Arc90 Readability algorithm. It was developed by Arc90 Labs to make websites more comfortable to read (e.g. on mobile devices). You can find it – for example – as a Google Chrome browser plugin. The whole project is also on Google Code, but more interesting is the actual algorithm, ported to Python by Nirmal Patel. Here you can find his original source code.

The algorithm is based on two lists of HTML ID names and HTML class names. One list contains IDs and classes with a positive meaning, the other contains IDs and classes with a negative meaning. If a tag has a positive ID or class, it will get additional points; if it has a negative ID or class, it will lose points. When we calculate these points for all tags in the HTML document, we can simply render the tags with the most points to get the main content in the end. Here’s an example:
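
A minimal sketch of such a snippet (the surrounding markup is illustrative; the id and class values match the description that follows):

```html
<div id="post"><p>This is the actual blog post.</p></div>
<div class="footer">Copyright – Imprint – Contact</div>
```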

The first div-tag has a very positive ID (id="post"), so it will probably contain the actual post. The div-tag in the second line, however, has a very negative class (class="footer"), which tells us that it seems to contain the footer of the page and not any meaningful content. With this knowledge, we do the following:

  1. get all paragraphs (p-tags) from the HTML source
  2. for each paragraph:
    1. add the parent of the paragraph to a list (if it's not already added)
    2. initialize the score of the parent with 0
    3. if the parent has a positive attribute, add points!
    4. if the parent has a negative attribute, subtract points!
    5. optional: check additional rules, e.g. a minimum length
  3. find the parent with the most points (the so-called top-parent)
  4. render the textual content of the top-parent
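
The steps above can be sketched in Python with Beautiful Soup. Note that the word lists and score values here are illustrative assumptions, not Arc90's exact weights:

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Illustrative word lists -- the real Arc90 lists are longer.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|header|menu|nav|sidebar|widget", re.I)

def extract_main_content(html):
    soup = BeautifulSoup(html, "html.parser")
    parents, scores = {}, {}  # keyed by id(), since Tag equality compares content
    for p in soup.find_all("p"):              # step 1: all paragraphs
        parent = p.parent
        key = id(parent)
        if key not in scores:                 # steps 2.1 + 2.2: register parent, score 0
            parents[key], scores[key] = parent, 0
            attrs = " ".join([parent.get("id", "")] + (parent.get("class") or []))
            if POSITIVE.search(attrs):        # step 2.3: positive attribute, add points
                scores[key] += 25
            if NEGATIVE.search(attrs):        # step 2.4: negative attribute, subtract points
                scores[key] -= 25
        if len(p.get_text(strip=True)) > 20:  # step 2.5: optional minimum-length rule
            scores[key] += 1
    if not scores:
        return ""
    top = max(scores, key=scores.get)         # step 3: find the top-parent
    return parents[top].get_text(separator="\n", strip=True)  # step 4: render it
```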

Here’s my code, which is based very much on the code of Nirmal Patel, which you can find here. The main thing I changed is some additional cleaning before the actual algorithm. This will produce easy-to-interpret HTML without scripts, images and so on, but still with all textual content.

Whitespace Rendering

The idea behind this technique is quite simple and goes like this: you go through your raw HTML string and replace every tag (everything between < and >) with white spaces. When you render the content, all textual blocks should still be "blocks", whereas the rest of the page will be scattered words with a lot of white space in between. The only thing you have to do now is to get the blocks of text and throw away the rest. Here's a quick implementation:
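
A minimal sketch of the idea; the gap width and minimum block length are arbitrary values that need per-site tuning:

```python
import re

def text_blocks(html, gap=16, min_length=50):
    # Replace every tag with spaces of the same length, so the
    # textual layout of the rendered page is preserved.
    rendered = re.sub(r"<[^>]*>", lambda m: " " * len(m.group(0)), html)
    # Long runs of whitespace now separate scattered words from real text blocks.
    blocks = re.split(r"\s{%d,}" % gap, rendered)
    # Keep only blocks long enough to look like actual content.
    return [b.strip() for b in blocks if len(b.strip()) >= min_length]
```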

The problem with this solution is that it's not very generic. You have to do a lot of fine-tuning to find a good length of white space at which to split the string. Websites with a lot of markup will produce far more white space than simple pages. On the other hand, this is a quite simple approach – and simple is mostly good.

Libraries

As always: don't reinvent the wheel! There are a lot of libraries out there that deal with this problem. One of my favorite libraries is Boilerpipe. You can find it online as a web service at http://boilerpipe-web.appspot.com/ and as a Java project at https://code.google.com/p/boilerpipe/. It does a really good job, but compared to the two algorithms I explained above, it's much more complicated inside. However, using it as a black box might be a good solution to find your content.

Best regards,
Thomas Uhrig

  • Interesting reading. For parsing HTML in Java I found the HtmlCleaner project very useful: http://htmlcleaner.sourceforge.net/ It gracefully converts most (malformed) HTML into well-formed XHTML and provides a nice visitor-pattern-based API to access the nodes.