Visualizing KML files in Google Maps

The question is easy: How can I visualize the track of a KML file on Google Maps via JavaScript?

Last evening, I spent about four hours looking for a solution to this (pretty trivial) question. In the end, the code was simple and very short, but it was hard to find good and clear resources on the topic. Google's documentation of the Maps API is very good – but in my opinion it lacks some simple examples to start with. Therefore, here is a super simple example to copy and paste and start right away.

Example

<html>

    <script src="http://code.jquery.com/jquery-1.10.1.min.js"></script>
    <script src="https://maps.googleapis.com/maps/api/js?v=3.exp&sensor=false"></script>
    <script src="http://geoxml3.googlecode.com/svn/branches/polys/geoxml3.js"></script>
    <script src="http://geoxml3.googlecode.com/svn/trunk/ProjectedOverlay.js"></script>

    <script>
	
        function initialize() {
	
            var options = {
                zoom: 8,    // initial zoom level (the Maps API expects both zoom and center)
                center: new google.maps.LatLng(-34.397, 150.644),
                mapTypeId: google.maps.MapTypeId.ROADMAP
            };
		    
            var map = new google.maps.Map(document.getElementById("canvas"), options);
            var parser = new geoXML3.parser({map: map, processStyles: true});
            parser.parse("test.kml");
        }
	    
        $(document).ready(initialize);
	    
    </script>
	
    <div id="canvas" style="width:500px; height:500px"></div>
	
</html>

Run it

You can just copy and paste the example. You only have to do two things:

  1. Get a KML file, call it test.kml and put it beside your HTML file in a folder.
  2. Start a web-server in that folder (see below).
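
If you have no KML file at hand, you can generate a small one for testing. The following is only a minimal sketch: the function names build_kml and write_kml and the sample coordinates are made up for illustration, and the track is a single LineString placemark. Note that KML stores coordinates as longitude,latitude, not the other way round.

```python
def build_kml(points, name="Test track"):
    """Return a minimal KML document with one LineString.

    points is a list of (latitude, longitude) tuples.
    """
    # KML expects "longitude,latitude,altitude" triples separated by spaces
    coordinates = " ".join("%f,%f,0" % (lon, lat) for lat, lon in points)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        '  <Document>\n'
        '    <Placemark>\n'
        '      <name>' + name + '</name>\n'
        '      <LineString><coordinates>' + coordinates + '</coordinates></LineString>\n'
        '    </Placemark>\n'
        '  </Document>\n'
        '</kml>\n'
    )


def write_kml(path, points):
    """Write the generated KML document to the given path."""
    with open(path, "w") as kml_file:
        kml_file.write(build_kml(points))
```

Calling write_kml("test.kml", [(-34.397, 150.644), (-34.40, 150.66), (-34.41, 150.68)]) would create a short track close to the center used in the example above.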

In order to run the example, you have to start a web server. Otherwise the browser can't load the KML file, because a page opened directly from the file system is not allowed to request it (cross-origin restrictions). A very easy way to do that is with Python. Just open a command line in the folder where your HTML document is and start a server like this:

python -m SimpleHTTPServer 8080

Then go to http://localhost:8080/ and see your map. With Python 3, the equivalent command is python -m http.server 8080.

Explanation

The script only does a few simple things. First of all, it initializes a map object using the Google Maps API. This object represents the actual map drawn in the div with the id canvas. Then the script creates a parser object from the geoxml3 library. This library offers a very convenient way to display KML files on Google Maps. However, the support for polylines (tracks on the map) is pretty new, so you have to use the polys branch of the library. Otherwise you won’t see any lines, just your starting point. The library can also parse KML from a plain string. Check their wiki for more information.

Best regards,
Thomas Uhrig

Image optimization for websites

Optimizing the images of a web page is easy and the best way to speed up the site. Based on the well-known book “Even Faster Web Sites” by Steve Souders, Annette Landmesser and I made a small presentation for a university course called Development of Rich Media Systems, taught by Jakob Schröter. The presentation contains some general patterns and practices for lossy and lossless optimization. Enjoy.

I will also write a paper about image optimization for websites for my university course in a couple of weeks. The paper will be in German.

Best regards,
Thomas Uhrig

Extracting meaningful content from raw HTML

Parsing HTML is easy. Libraries like Beautiful Soup give you a compact and straightforward interface to process websites in your preferred programming language. But this is only the first step. The interesting question is: How do you extract the meaningful content from HTML?

I tried to find an answer to this question during the last couple of days – and here’s what I found.

Arc90 Readability

My favorite solution is the so called Arc90 Readability algorithm. It was developed by the Arc90 Labs to make websites more comfortable to read (e.g. on mobile devices). You can find it – for example – as a Google Chrome browser plugin. The whole project is also on Google Code, but more interesting is the actual algorithm, ported to Python by Nirmal Patel. Here you can find his original source code.

The algorithm is based on two lists of HTML-ID-names and HTML-CLASS-names. One list contains IDs and CLASSes with a positive meaning, the other list contains IDs and CLASSes with a negative meaning. If a tag has a positive ID or CLASS, it will get additional points; if it has a negative ID or CLASS, it will lose points. When we have calculated these points for all tags in the HTML document, we can just render the tag with the most points to get the main content. Here’s an example:

<div id="post"><h1>My post</h1><p>...</p></div>
<div class="footer"><a...>Contact</a></div>

The first div-tag has a very positive ID (id="post"), so it will probably contain the actual post. However, the div-tag in the second line has a very negative class (class="footer"), which tells us that it seems to contain the footer of the page and not any meaningful content. With this knowledge, we do the following:

  1. get all paragraphs (p-tags) from the HTML source
  2. for each paragraph:
    1. add the parent of the paragraph to a list (if it's not already added)
    2. initialize the score of the parent with 0
    3. if the parent has a positive attribute, add points!
    4. if the parent has a negative attribute, subtract points!
    5. optional: check additional rules, e.g. a minimum length
  3. find parent with most points (the so called top-parent)
  4. render the textual content of the top-parent

Here’s my code, which is based heavily on the code of Nirmal Patel (you can find it here). The main thing I changed is some additional cleaning before the actual algorithm. This produces easy-to-interpret HTML without scripts, images and so on, but still with all textual content.

import re
from bs4 import BeautifulSoup
from bs4 import Comment
from bs4 import Tag

NEGATIVE = re.compile(".*comment.*|.*meta.*|.*footer.*|.*foot.*|.*cloud.*|.*head.*")
POSITIVE = re.compile(".*post.*|.*hentry.*|.*entry.*|.*content.*|.*text.*|.*body.*")
BR = re.compile("<br */? *>[ \r\n]*<br */? *>")

def extract_content_with_Arc90(html):

    # name the parser explicitly, otherwise bs4 picks one and prints a warning
    soup = BeautifulSoup( re.sub(BR, "</p><p>", html), "html.parser" )
    soup = simplify_html_before(soup)

    topParent = None
    parents = []
    for paragraph in soup.findAll("p"):
        
        parent = paragraph.parent
        
        if (parent not in parents):
            parents.append(parent)
            parent.score = 0

            # has_key() is deprecated in bs4, has_attr() is its replacement
            if (parent.has_attr("class")):
                if (NEGATIVE.match(str(parent["class"]))):
                    parent.score -= 50
                elif (POSITIVE.match(str(parent["class"]))):
                    parent.score += 25

            if (parent.has_attr("id")):
                if (NEGATIVE.match(str(parent["id"]))):
                    parent.score -= 50
                elif (POSITIVE.match(str(parent["id"]))):
                    parent.score += 25

        if (len( paragraph.renderContents() ) > 10):
            parent.score += 1

        # you can add more rules here!

    # without any paragraph there is no candidate, and max() would fail
    if not parents:
        return ""

    topParent = max(parents, key=lambda x: x.score)
    simplify_html_after(topParent)
    return topParent.text

def simplify_html_after(soup):

    for element in soup.findAll(True):
        element.attrs = {}    
        if( len( element.renderContents().strip() ) == 0 ):
            element.extract()
    return soup

def simplify_html_before(soup):

    # remove all HTML comments
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()

    # you can add more rules here!

    # Plain loops instead of map(): in Python 3, map() is lazy and would do nothing here.

    # replace these tags by their plain text
    for name in ("li", "em", "tt", "b"):
        for element in soup.findAll(name):
            element.replaceWith(element.text.strip())

    replace_by_paragraph(soup, 'blockquote')
    replace_by_paragraph(soup, 'quote')

    # delete these tags together with their content
    for name in ("code", "style", "script", "link"):
        for element in soup.findAll(name):
            element.extract()
    
    delete_if_no_text(soup, "td")
    delete_if_no_text(soup, "tr")
    delete_if_no_text(soup, "div")

    delete_by_min_size(soup, "td", 10, 2)
    delete_by_min_size(soup, "tr", 10, 2)
    delete_by_min_size(soup, "div", 10, 2)
    delete_by_min_size(soup, "table", 10, 2)
    delete_by_min_size(soup, "p", 50, 2)

    return soup

def delete_if_no_text(soup, tag):
    
    for p in soup.findAll(tag):
        if(len(p.renderContents().strip()) == 0):
            p.extract()

def delete_by_min_size(soup, tag, length, children):
    
    for p in soup.findAll(tag):
        if(len(p.text) < length and len(p) <= children):
            p.extract()

def replace_by_paragraph(soup, tag):
    
    for t in soup.findAll(tag):
        t.name = "p"
        t.attrs = {}  
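
To try the scoring idea in isolation, without Beautiful Soup, the point system can be reduced to a few lines. This is only a sketch under my own assumptions: the function names attribute_score and score_parent are made up, the word lists are shortened versions of the patterns above, and a parent is described simply by its id, its class and the lengths of its paragraphs.

```python
import re

# shortened word lists, searched anywhere in the attribute value
NEGATIVE_WORDS = re.compile("comment|meta|foot|cloud|head")
POSITIVE_WORDS = re.compile("post|entry|content|text|body")


def attribute_score(value):
    """Score a single id or class value: -50, +25 or 0 points."""
    if NEGATIVE_WORDS.search(value):
        return -50
    if POSITIVE_WORDS.search(value):
        return 25
    return 0


def score_parent(element_id, element_class, paragraph_lengths):
    """Add up the attribute points plus one point per paragraph longer than 10 characters."""
    score = attribute_score(element_id) + attribute_score(element_class)
    score += sum(1 for length in paragraph_lengths if length > 10)
    return score
```

With the two div-tags from the example, score_parent("post", "", [120]) gives the first one a clearly positive score, whereas score_parent("", "footer", [30]) ends up far below zero.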

Whitespace Rendering

The idea behind this technique is quite simple and goes like this: You go through your raw HTML string and replace every tag (everything between < and >) with white spaces. When you render the content, all textual blocks should still be "blocks", whereas the rest of the page should be scattered words with a lot of white space between them. The only thing left to do is to keep the blocks of text and throw away the rest. Here's a quick implementation:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
import re

from bs4 import BeautifulSoup
from bs4 import Comment

if __name__ == "__main__":

    html_string = requests.get('http://www.zdnet.com/windows-8-microsofts-new-coke-moment-7000014779/').text

    # pass the unicode text directly; str() can fail on non-ASCII characters in Python 2
    soup = BeautifulSoup(html_string, "html.parser")

    # remove tags whose content is never part of the article text
    for name in ("code", "script", "pre", "style", "embed"):
        for element in soup.findAll(name):
            element.extract()

    # remove all HTML comments
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()

    # replace every tag (everything between < and >) with white spaces
    white_string = ""
    is_in_tag = False

    for character in soup.prettify():

        if character == "<":
            is_in_tag = True

        if is_in_tag:
            white_string += " "
        else:
            white_string += character

        if character == ">":
            is_in_tag = False

    for string in white_string.split("           "):    # tune here!

        # collapse line breaks and runs of spaces into single spaces
        p = re.sub(r'\s+', ' ', string).strip()

        if len(p) > 50:
            print(p)

The problem with this solution is that it's not very generic. You have to do a lot of fine tuning to find a good length of white space to split the string. Websites with a lot of markup will produce much more white space than simple pages. On the other hand, this is a quite simple approach, and simple is mostly good.
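
To experiment with the tuning, it helps to have the technique as a small function without any network access. The following is a minimal sketch, not the script from above: it uses a regular expression instead of the character loop, and the function name whitespace_render and the parameters gap and min_length are my own names for the two values you have to tune.

```python
import re


def whitespace_render(html, gap=10, min_length=50):
    """Return the text blocks that remain after blanking out all tags.

    gap:        minimum run of spaces that separates two blocks
    min_length: blocks with fewer characters are thrown away
    """
    # replace every tag with as many spaces as it has characters
    blanked = re.sub(r"<[^>]*>", lambda match: " " * len(match.group(0)), html)

    blocks = []
    for chunk in re.split(" {%d,}" % gap, blanked):
        # collapse line breaks and runs of spaces into single spaces
        text = re.sub(r"\s+", " ", chunk).strip()
        if len(text) > min_length:
            blocks.append(text)
    return blocks
```

Calling it with different values for gap on the same page shows quickly how sensitive the result is to this parameter.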

Libraries

As always: don't reinvent the wheel! There are a lot of libraries out there that deal with this problem. One of my favorite libraries is Boilerpipe. You can find it online as a web-service on http://boilerpipe-web.appspot.com/ and as a Java project on https://code.google.com/p/boilerpipe/. It does a really good job, but compared to the two algorithms I explained above, it's much more complicated inside. However, using it as a black box might be a good solution to find your content.

Best regards,
Thomas Uhrig

Writing an online scraper on Google App Engine (Python)

Sometimes you need to collect data – for visualization, data-mining, research or whatever you want. But collecting data takes time, especially when the data should be collected over a long period.
Typically you would use a dedicated machine (e.g. a server) to do this, rather than using your own laptop or PC to crawl the internet for weeks. But setting up a server can be complicated and time-consuming – nothing you would do for a small private project.

A good and free alternative is the Google App Engine (GAE). GAE is a web-hosting service from Google which offers a platform for Java and Python applications. It comes with its own user authentication system and its own database. If you already have a Google account, you can upload up to ten applications for free. However, the free version has some limitations, e.g. you only have a 1 GB database with a maximum of 50,000 write operations per day (more details).

One big advantage of GAE is the possibility to create cron-jobs. A cron-job is a task that is executed at fixed points in time, e.g. every 10 minutes. Exactly what you need to build a scraper!

But let’s do it step by step:

1. Registration

First of all, you need a Google account and you must be registered with GAE. After your registration, you can create a new application (go to https://appengine.google.com and click on Create Application).

[Screenshot: 01_gae_registration]

Choose the name for your application wisely; you can’t change it later on!

2. Install Python, GAE SDK and Google Eclipse plugin

To start programming for GAE, you need to set up a few simple things. Since we want to develop an application in Python, Python (v. 2.7) must be installed on your computer. You also need to install the GAE SDK for Python. Optionally, you can install the Google plugin for Eclipse together with PyDev, which I would recommend, because it makes life much easier.

3. Create your application

Now you can start developing your application! Open Eclipse and create a new PyDev Google App Engine Project. To make a GAE application, we need at least two files: a main Python script and the app.yaml (a configuration file). Since we want to create a cron-job, too, we need a third file (cron.yaml) to define this job. To read an RSS feed we also use a third-party library called feedparser. Just download the ZIP-file and unpack the file feedparser.py to your project folder (this is ok for the beginning). A very simple scrawler could look like this:

Scrawler.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext import db

import feedparser 
import time

class Item(db.Model): 
    title = db.StringProperty(required=False)
    link = db.StringProperty(required=False)
    date = db.StringProperty(required=False)

class Scrawler(webapp.RequestHandler):
    
    def get(self):
        self.read_feed()      
        self.response.out.write(self.print_items())
        
    def read_feed(self):
        
        feeds = feedparser.parse( "http://www.techrepublic.com/search?t=14&o=1&mode=rss" )
        
        for feed in feeds[ "items" ]:
            # store an item only if its link is not in the datastore yet
            query = Item.gql("WHERE link = :1", feed[ "link" ])
            if(query.count() == 0):
                item = Item()
                item.title = feed[ "title" ]
                item.link = feed[ "link" ]
                item.date = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(time.time()))
                item.put()
    
    def print_items(self):
        s = "All items:<br>"
        for item in Item.all():
            s += item.date + " - <a href='" + item.link + "'>" + item.title + "</a><br>"
        return s

application = webapp.WSGIApplication([('/', Scrawler)], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main() 
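
The duplicate check in read_feed is the part that keeps the database clean when the cron-job runs again and again, so it is worth trying it outside of GAE first. The following is only a sketch under my own assumptions: a plain dictionary stands in for the datastore, the function name store_new_items is made up, and the entries are simple dictionaries with the same keys the scrawler reads from the feed.

```python
import time


def store_new_items(entries, store):
    """Add all entries whose link is not in store yet and return how many were added."""
    added = 0
    for entry in entries:
        # the link identifies an item, just like in the GQL query of the scrawler
        if entry["link"] not in store:
            store[entry["link"]] = {
                "title": entry["title"],
                "date": time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime()),
            }
            added += 1
    return added
```

Running it twice with the same entries adds them only once, which is exactly the behaviour you want from a job that reads the same feed every few minutes.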

app.yaml

application: tech-rep-scrawler
version: 1
runtime: python27
api_version: 1
threadsafe: false

handlers:
- url: /
  script: Scrawler.py

Note: The application name must be the same as the one you registered on Google in the first step!

cron.yaml

cron:
- description: scrawler
  url: /
  schedule: every 15 mins
 

Done! Your project should look like this now (including feedparser.py):

[Screenshot: 02_gae_project]

4. Test it on your own machine

Before we deploy the application on GAE, we want to test it locally to see if it is really working. To do so, we have to create a new run configuration in Eclipse. Click on the small arrow next to the green run button and choose “Run configurations…”. Then, create a new “Google App Engine” configuration and fill in the following parameters (see the pictures):

Name:
GAE (you can choose anything as name)

Project:
TechRepScrawler (your project in your Eclipse workspace)

Main Module:
C:\Program Files (x86)\Google\google_appengine\dev_appserver.py (dev_appserver.py in your GAE installation folder)

Program Arguments:
--admin_port=9000 "C:\Users\Thomas\workspace_python\TechRepScrawler"

[Screenshot: 04_gae_run_config_1]

[Screenshot: 04_gae_run_config_2]

After starting GAE locally on your computer using the run configuration, just open your browser and go to http://localhost:8080/ to see the running application. You can also open the admin console on http://localhost:9000/ to see, e.g., some information about your data.

5. Deploy your application to GAE

The next – and last – step is to deploy the application on GAE. Using the Google Eclipse plugin, this is as easy as it can be. Just right-click on your project, go to PyDev: Google App Engine and click upload. Now your app will be uploaded to GAE! The first time, you will be asked for your credentials, that’s all.

[Screenshot: 05_gae_upload]

[Screenshot: 06_gae_deployed]

Now your app is online and available to everyone! The cron-job will request it every 15 minutes, as configured in cron.yaml (which just means it visits your site like any other user would). Here’s how it should look:

[Screenshot: 07_last]

Best regards,
Thomas Uhrig

Some impressions of Aix-en-Provence

I spent the last few weeks in Aix-en-Provence, a small town in the very south of France close to Marseille. My girlfriend is currently studying there as a part of her master's degree. Since the weather is not so bad there (especially compared to Germany and Sweden), I got the chance to shoot some good photos. Enjoy.

Best regards,
Thomas Uhrig

ERASMUS report

Now that I am back in Germany, it is time for the obligatory ERASMUS report. Attached you will find the PDF for my university, along with some personal opinions on the courses I took. I hope it helps you.

TDDB68 Concurrent programming and Operating Systems (Bachelor)

This course was my only bachelor-level course. It is listed as a prerequisite for the course “Multicore and GPU Programming” (which formally does not matter for ERASMUS students). Still, I wanted to get into the topic and into C programming a bit.
The course is well-founded and closely follows the book “Operating System Concepts” by Silberschatz. I highly recommend the book; you can buy it used really cheaply.
The really interesting part of the course, however, is the labs. Here you work on Pintos, a small operating system that runs in a VM. Pintos is very simple and comprises only a few thousand lines of C code. In the labs, this system is extended. For example, you have to implement system calls or improve something in the process scheduler.
The labs are well documented and fun. At the beginning you need some time to find your way around, but I think it is worth it. Those who pass the labs on time also get bonus points for the exam. You should definitely be able to program and know some C, though.

TDDB84 Design Patterns (Master)

This course covers design patterns in a fairly classic way. Accordingly, everything follows the book “Design Patterns” by the GoF. The course goes through each pattern individually and discusses advantages and disadvantages as well as examples. In the labs, a few selected patterns are implemented in Java with small examples.
The course was fun and also quite easy. The labs and the exam are a breeze. The professor is also a phenomenon. You will be well entertained and will certainly get to hear some stories from Silicon Valley or NASA.

TDDB44 Compiler Construction (Master)

I have mixed feelings about this course. The lecture itself is relatively poor. It follows the book “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho and covers the topic piece by piece. Automata theory, regex, languages, alphabets, parsers… blah blah. Everything is relatively trivial, but at the same time badly explained. The lecture never gets beyond the theoretical foundations. If you are interested in concepts such as object orientation, garbage collection, optimization or the like, this is the wrong place.
On the other hand, there are the labs. They are incredibly well documented (with about 130 pages) and fun throughout. They were the best labs I have had in my studies so far. The tasks are clear and the solutions are mostly difficult, but just difficult enough that you can work them out yourself and do not give up in frustration.
If you are interested in the topic, you should take the course, because the theoretical part will probably be the same everywhere. If you just want to get a taste, you will be disappointed.

TDDD56 Multicore and GPU Programming (Master)

The course takes up a very current topic in computer science, but only in one half, because it is split into two blocks. The first block deals with basic thread programming. It covers synchronization, communication and, for example, non-blocking data structures. Everything would be relatively simple if it were not explained in C++.
The second half then deals with GPU programming. The topic is well explained and exciting, but unfortunately a bit short. All in all, you really talk about GPUs for about 3 weeks. For that, you could just as well read a few tutorials. Still, the lecture is well done, mainly thanks to the two professors.
The labs, on the other hand, are simply annoying. The code templates are huge (for tiny examples) and full of bugs. There are also umpteen scripts for generating plots that are guaranteed not to work. Some of the tasks are not clearly described, and it really is no fun.
I was disappointed by this course, but on the other hand I am quite glad to have heard something about GPU programming. It is a pity that more was not made of it.

Swedish for Foreign Students (any level)

A classic. Learning Swedish in Sweden is simply part of the experience. My tip, however, is to take a beginner's course in Germany beforehand. Swedish is easy (especially for Germans), but not that easy. If you already know some Swedish, you can coast along here and improve your skills a little. Otherwise you will surely spend a few afternoons adding “a”, “ar” and “en” to verbs and nouns until you no longer know what is singular and what is plural.
The course is fun, though, and you get to know many other exchange students. A must.

Download: ERASMUS Bericht.pdf

I hope my assessments and comments help you a little. Overall, I was very satisfied with my studies and, for all the criticism, with my courses as well.

Best regards,
Thomas Uhrig

Hej då Sverige!

After almost 5 months at Linköping University in Sweden, it’s time for me to leave and go home again. The tickets for the ferry are already booked as well as a cheap hostel in the middle of Hamburg – I think this stop will be fun.

Here are some last pictures from Lappland, Västerås and Linköping itself. Enjoy them, I certainly did.

Best regards,
Thomas Uhrig