Media Night Winter Semester 2013/2014

opinion-trends-poster

During the last summer semester, two friends of mine and I made a student project called TechTrends. TechTrends was a web application that let you search for articles and trends in the field of computer science. Based on posts from Reddit and Hackernews, it provided an intelligent search on a growing number of articles and blogs.

During this winter semester I continued the project and implemented a sentiment analysis for TechTrends. Based on the existing infrastructure such as our database and our crawler, I added an automated categorization of articles according to their comments on Hackernews and Reddit.

You can find the old and stable version of our project at http://techtrends.mi.hdm-stuttgart.de/. The up-to-date development version is available at http://opiniontrends.mi.hdm-stuttgart.de/.

media_night_ws13

I will present the project at the Media Night at our university next week. It’s open to everybody and free of charge. It will start around 6 pm, but you can come whenever you want to; there is no schedule. Every project has its own booth, where it is presented and where you can ask questions and get in touch with the people behind it.

You can find the program and information about all projects on http://www.hdm-stuttgart.de/medianight.

What? – Media Night Winter 2013
When? – 16th January 2014, from 6 pm to 10 pm
Where? – Hochschule der Medien, Nobelstraße 10, 70569 Stuttgart

Best regards,
Thomas

TechTrends – Searching trends on HN and Reddit

It’s done! Last Friday (26th July 2013) was the final presentation of our semester project TechTrends. Together with Raphael Brand and Hannes Pernpeintner I worked on this project for the last 5 months – and we are really happy about it.

What is TechTrends?

TechTrends is basically a search engine for HN and Reddit (just the programming subreddit). You can type in a keyword and you will find a bunch of similar articles. But our main concern was not just to find articles, but also to find trends. Therefore, you will not only get a plain list of articles, you will also get a chart showing when articles have been found for your query. For each article we calculate a popularity score indicating how important that article is. Based on these popularity scores, we draw the trend chart for your search.
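How exactly the popularity is computed is not described here, so the weighting below is purely illustrative; the sketch just shows how such scores could be summed per calendar week to get the data points of the trend chart:

```python
from collections import defaultdict

def popularity(upvotes, num_comments):
    # Purely illustrative weighting; the real TechTrends formula may differ.
    return upvotes + 2 * num_comments

def trend_chart(articles):
    # articles: list of dicts with 'published' (a datetime), 'upvotes', 'num_comments'
    buckets = defaultdict(float)
    for article in articles:
        week = article["published"].strftime("%Y-%W")  # group articles by calendar week
        buckets[week] += popularity(article["upvotes"], article["num_comments"])
    return sorted(buckets.items())  # [(week, summed popularity), ...] for the chart
```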


reddit_logo

hn_logo

How does it work?

TechTrends has six major parts (see the graphic below). First of all, we crawl HN and Reddit every 15 minutes to get the latest links. In the second part we fetch the actual content from each link and store it in our database. Then we preprocess this pure text content to remove stop-words, digits and so on. After that, we use the great Gensim Server from Radim Řehůřek to build an index of all documents. The next (and last) part on the server is a JSON-based web API to access all of our data (here is its documentation). On top of this API we built our user interface – the actual website.

techtrends-components
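The Gensim server takes care of the indexing and querying for us; conceptually, though, it boils down to something like the following plain Gensim sketch (the tokenization, stop-word list and example documents are simplified stand-ins, not our real pipeline):

```python
from gensim import corpora, models, similarities

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "for"}  # simplified list

def preprocess(text):
    # lowercase, drop stop-words and digits (a simplified version of our preprocessing)
    return [t for t in text.lower().split() if t not in STOPWORDS and not t.isdigit()]

# A few made-up article texts standing in for the crawled content.
documents = [
    "Bitcoin price hits a new high",
    "A new JavaScript framework for single page apps",
    "Why Bitcoin mining is getting harder",
]
tokens = [preprocess(d) for d in documents]

dictionary = corpora.Dictionary(tokens)               # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in tokens]      # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                     # TF-IDF weighting
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

query = tfidf[dictionary.doc2bow(preprocess("bitcoin"))]
print(sorted(enumerate(index[query]), key=lambda x: -x[1]))  # most similar articles first
```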

Presentation

Here is the video of our final presentation about TechTrends at our university (on 26th July 2013). Our presentation is about 70 minutes long and we explain a lot of details about our project. The video is available on http://events.mi.hdm-stuttgart.de/2013-07-25-vortr%C3%A4ge-programming-intelligent-applications#techtrends or below. It also contains the presentations of the other groups, but we are the first in the video. You can find the slides under the video.

Video

Thanks to Stephan Soller and his team, our presentation has been recorded on video. You can see the video below or on http://events.mi.hdm-stuttgart.de/.

Slides

Here are the slides on my speaker deck account.

What’s next?

We have a lot of ideas for the future of TechTrends. We are thinking, for example, about a mobile version or a reporting tool. But in my opinion, the most important step is to make TechTrends easier to customize. Currently, we are focused on HN and Reddit. However, everything but the actual crawlers is independent of the underlying source. With a little bit of work, you can implement a new crawler for the data source of your choice. Making this customization more comfortable and easier is our next goal for the project.
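Just to illustrate the idea (none of these names are the actual TechTrends classes), a new crawler could be as small as a class that returns links in a common format, which the rest of the pipeline then processes as usual:

```python
import feedparser  # used here only for illustration

class Crawler(object):
    """Hypothetical base class: every data source implements fetch_links()."""
    def fetch_links(self):
        raise NotImplementedError

class LobstersCrawler(Crawler):
    """Hypothetical example of a new source (the lobste.rs RSS feed)."""
    def fetch_links(self):
        feed = feedparser.parse("https://lobste.rs/rss")
        return [{"title": e.title, "url": e.link, "published": e.get("published", "")}
                for e in feed.entries]
```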

More

There is plenty more! Here you go:

Best regards,
Thomas Uhrig

TechTrends Final Presentation

Tomorrow morning is the final presentation of our semester project TechTrends. I have posted several articles about TechTrends (see here) and I will definitely post one more after tomorrow. But for now, here’s our final presentation.



The presentation shows how we built TechTrends and covers different aspects of the development process. It talks about crawling Hackernews and Reddit, preprocessing and learning a model to query. We also describe problems, further ideas and much more. The presentation will take about 60 to 70 minutes (and everybody is still welcome tomorrow).

The presentation will also be streamed live on http://events.mi.hdm-stuttgart.de.

Best regards,
Thomas Uhrig

TechTrends Presentation

Next Friday (July 26th 2013) the final presentation of TechTrends will take place at our university. The presentation will take about 60 minutes and will cover topics like architecture, crawling, data storage and front-end design. Everybody interested is welcome (please send me an email beforehand). Here’s the whole schedule for next week:

09.00h-10.10h Tech Trends (by Brand, Uhrig, Pernpeintner)
10.15h-11.25h Newsline (by Förder, Golpashin, Wetzel, Keller)
11.45h-12.35h Intelligent Filters for Image Search (by Landmesser, Mussin)
12.40h-13.50h Nao Face Recognition (by Sandrock, Schneider, Müller)
13.55h-14.35h GPU-driven deep CNN (by Schröder)

The presentations will take place in room 056 (Aquarium). I will upload the presentation to my speakerdeck account at the end of next week.

Best regards,
Thomas Uhrig

TechTrends at the Media Night 2013 of the Media University Stuttgart

During this summer semester, two friends of mine and I made a student project called TechTrends. TechTrends is a web application that lets you search for articles and trends in the field of computer science. We crawl posts from Reddit and Hackernews and provide an intelligent search on them. You can type in a keyword (e.g. bitcoin) and get a timeline showing when articles on this topic have been published.

techtrends_medianight

We will present our project at the Media Night at our university next week. It’s open to everybody and free of charge. It will start around 6 pm, but you can come whenever you want to. There is no schedule. Every project has its own booth, where it is presented and where you can ask questions and get in touch with the people behind it.

You can find the program and information about all projects on http://www.hdm-stuttgart.de/medianight.

What? – Media Night 2013
When? – 27th June 2013, from 6 pm to 10 pm
Where? – Hochschule der Medien, Nobelstraße 10, 70569 Stuttgart

Our project is online at http://techtrends.mi.hdm-stuttgart.de/.

Best regards,
Thomas Uhrig

Extracting meaningful content from raw HTML

Parsing HTML is easy. Libraries like Beautiful Soup give you a compact and straightforward interface to process websites in your preferred programming language. But this is only the first step. The interesting question is: how do you extract the meaningful content from HTML?

I tried to find an answer to this question during the last couple of days – and here’s what I found.

Arc90 Readability

My favorite solution is the so-called Arc90 Readability algorithm. It was developed by Arc90 Labs to make websites more comfortable to read (e.g. on mobile devices). You can find it – for example – as a Google Chrome browser plugin. The whole project is also on Google Code, but more interesting is the actual algorithm, ported to Python by Nirmal Patel. Here you can find his original source code.

The algorithm is based on two lists of HTML ID names and HTML class names. One list contains IDs and classes with a positive meaning, the other list contains IDs and classes with a negative meaning. If a tag has a positive ID or class, it gets additional points; if it has a negative ID or class, it loses points. Once we have calculated these points for all tags in the HTML document, we can simply render the tags with the most points to get the main content. Here’s an example: imagine a document containing a div-tag with the ID "post" and another div-tag with the class "footer".

The first div-tag has a very positive ID (id="post"), so it will probably contain the actual post. The second div-tag, however, has a very negative class (class="footer"), which tells us that it probably contains the footer of the page and not any meaningful content. With this knowledge, we do the following:

  1. get all paragraphs (p-tags) from the HTML source
  2. for each paragraph:
    1. add the parent of the paragraph to a list (if it's not already added)
    2. initialize the score of the parent with 0
    3. if the parent has a positive attribute, add points!
    4. if the parent has a negative attribute, subtract points!
    5. optional: check additional rules, e.g. a minimum length
  3. find the parent with the most points (the so-called top-parent)
  4. render the textual content of the top-parent
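
A minimal sketch of these steps, assuming Beautiful Soup and deliberately short positive/negative keyword lists (the real lists are much longer), could look like this:

```python
from bs4 import BeautifulSoup

POSITIVE = ["post", "entry", "content", "article", "text", "body"]   # shortened lists,
NEGATIVE = ["footer", "comment", "sidebar", "nav", "menu", "ad"]     # just for illustration

def extract_content(html, min_paragraph_length=25):
    soup = BeautifulSoup(html, "html.parser")
    scores, tags = {}, {}

    # steps 1 and 2: collect the parents of all paragraphs and score them
    for p in soup.find_all("p"):
        parent = p.parent
        key = id(parent)
        if key not in scores:
            scores[key], tags[key] = 0, parent
            attributes = " ".join([parent.get("id", "")] + (parent.get("class") or []))
            if any(word in attributes for word in POSITIVE):
                scores[key] += 25      # positive id/class: add points
            if any(word in attributes for word in NEGATIVE):
                scores[key] -= 50      # negative id/class: subtract points
        if len(p.get_text(strip=True)) > min_paragraph_length:
            scores[key] += 1           # optional rule: reward longer paragraphs

    if not scores:
        return ""

    # steps 3 and 4: render the textual content of the top-parent
    top_parent = tags[max(scores, key=scores.get)]
    return top_parent.get_text(separator="\n", strip=True)
```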

Here’s my code, which is based very much on the code of Nirmal Patel which you can find here. The main thing I changed is some additional cleaning before the actual algorithm runs. This produces an easy-to-interpret HTML document without scripts, images and so on, but still with all textual content.

Whitespace Rendering

The idea behind this technique is quite simple and goes like this: you go through your raw HTML string and replace every tag (everything between < and >) with white spaces. When you render the result, all textual blocks should still be "blocks", whereas the rest of the page becomes scattered words with a lot of white space. The only thing you have to do then is to extract the blocks of text and throw away the rest. Here's a quick implementation:
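
A possible sketch of this idea (with purely illustrative thresholds for the block length and the whitespace gap):

```python
import re

def text_blocks(html, min_block_length=100, gap=40):
    # Replace every tag (everything between < and >) with spaces of the
    # same length, so the layout of the remaining text is preserved.
    rendered = re.sub(r"<[^>]*>", lambda m: " " * len(m.group(0)), html)

    # Dense regions are text blocks; markup-heavy regions turn into long
    # runs of whitespace. Split on those runs and keep the long blocks.
    blocks = re.split(r"\s{%d,}" % gap, rendered)
    return [b.strip() for b in blocks if len(b.strip()) >= min_block_length]
```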

The problem with this solution is that it's not very generic. You have to do a lot of fine-tuning to find a good length of white space to split the string on. Websites with a lot of markup will produce much more white space than simple pages. On the other hand, it is a quite simple approach – and simple is mostly good.

Libraries

As always: don't reinvent the wheel! There are a lot of libraries out there dealing with this problem. One of my favorite libraries is Boilerpipe. You can find it online as a web service on http://boilerpipe-web.appspot.com/ and as a Java project on https://code.google.com/p/boilerpipe/. It does a really good job, but compared to the two algorithms I explained above, it's much more complicated inside. However, using it as a black box might be a good way to find your content.

Best regards,
Thomas Uhrig

Writing an online scraper on Google App Engine (Python)

Sometimes you need to collect data – for visualization, data mining, research or whatever you want. But collecting data takes time, especially when time is a major concern and data should be collected over a long period. Typically you would use a dedicated machine (e.g. a server) to do this, rather than using your own laptop or PC to crawl the internet for weeks. But setting up a server can be complicated and time-consuming – nothing you would do for a small private project.

A good and free alternative is the Google App Engine (GAE). The GAE is a web-hosting service from Google which offers a platform for Java and Python applications. It comes with its own user authentication system and its own database. If you already have a Google account, you can upload up to ten applications for free. However, the free version has some limitations, e.g. you only get a 1 GB database with a maximum of 50,000 write operations per day (more details).

One big advantage of the GAE is the possibility to create cron-jobs. A cron-job is a task that is executed at fixed points in time, e.g. every 10 minutes. Exactly what you need to build a scraper!

But let’s do it step by step:

1. Registration

First of all, you need a Google account and you must be registered with the GAE. After your registration, you can create a new application (go to https://appengine.google.com and click on Create Application).

01_gae_registration

Choose the name for your application wisely – you can’t change it later on!

2. Install Python, GAE SDK and Google Eclipse plugin

To start programming for the GAE, you need to set up a few simple things. Since we want to develop an application in Python, Python (v. 2.7) must be installed on your computer. Also, you need to install the GAE SDK for Python. Optionally, you can also install the Google plugin for Eclipse together with PyDev, which I would recommend because it makes life much easier.

3. Create your application

Now you can start and develop your application! Open Eclipse and create a new PyDev Google App Engine Project. To make a GAE application, we need at least two files: a main Python script and the app.yaml (a configuration file). Since we want to create a cron-job, too, we need a third file (cron.yaml) to define this job. For reading an RSS stream we also use a third-party library called feedparser.py. Just download the ZIP file and unpack the file feedparser.py into your project folder (this is OK for the beginning). A very simple scrawler could look like this:

Scrawler.py
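
A minimal sketch of such a handler on the Python 2.7 runtime could look like this (the feed URL and the datastore model are just placeholders):

```python
import webapp2
import feedparser
from google.appengine.ext import db

FEED_URL = "http://example.com/rss"  # placeholder: put the feed of your choice here

class Article(db.Model):
    title = db.StringProperty()
    link = db.StringProperty()
    fetched = db.DateTimeProperty(auto_now_add=True)

class MainPage(webapp2.RequestHandler):
    def get(self):
        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            # store every entry we have not seen yet
            if not Article.gql("WHERE link = :1", entry.link).get():
                Article(title=entry.title, link=entry.link).put()
        self.response.write("%d entries in the datastore" % Article.all().count())

app = webapp2.WSGIApplication([("/", MainPage)], debug=True)
```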

app.yaml
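
A matching app.yaml could look roughly like this (the application id below is just a guess derived from the project name; it has to be the id you registered, see the note below):

```yaml
application: techrepscrawler   # must match the application id registered on GAE
version: 1
runtime: python27
api_version: 1
threadsafe: true

libraries:
- name: webapp2
  version: latest

handlers:
- url: /.*
  script: Scrawler.app
```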

Note: The application must have the same name as the one you registered on Google in the first step!

cron.yaml
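
And a cron.yaml that calls the handler every 10 minutes could look like this:

```yaml
cron:
- description: scrape the feed
  url: /
  schedule: every 10 minutes
```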

Done! Your project should look like this now (including feedparser.py):

02_gae_project

4. Test it on your own machine

Before we deploy the application on the GAE, we want to test it locally to see if it really works. To do so, we have to create a new run configuration in Eclipse. Click on the small arrow next to the small green run button and choose “Run configurations…”. Then, create a new “Google App Engine” configuration and fill in the following parameters (see the pictures):

Name:
GAE (you can choose anything as name)

Project:
TechRepScrawler (your project in your Eclipse workspace)

Main Module:
C:\Program Files (x86)\Google\google_appengine\dev_appserver.py (dev_appserver.py in your GAE installation folder)

Program Arguments:
--admin_port=9000 "C:\Users\Thomas\workspace_python\TechRepScrawler"

04_gae_run_config_1

04_gae_run_config_2

After starting the GAE locally on your computer using the run configuration, just open your browser and go to http://localhost:8080/ to see the running application. You can also open an admin perspective on http://localhost:9000/ to see, for example, some information about your data.

5. Deploy your application to GAE

The next – and last! – step is to deploy the application to the GAE. Using the Google Eclipse plugin, this is as easy as it can be. Just right-click on your project, go to PyDev: Google App Engine and click upload. Now your app will be uploaded to the GAE! The first time, you will be asked for your credentials – that’s all.

05_gae_upload

06_gae_deployed

Now your app is online and available for everyone! The cron-job will refresh it every 10 minutes (which just means it will visit your site like any other user would). Here’s how it should look:

07_last

Best regards,
Thomas Uhrig

Read: Grundkurs Künstliche Intelligenz

One of the biggest research areas in current software development is probably artificial intelligence and data mining. It is about intelligent solutions, recognizing trends and solving tasks “in which humans currently still have the edge” – at least that is how Elaine Rich describes AI.

With his book “Grundkurs Künstliche Intelligenz”, Wolfgang Ertel gives a well-founded introduction to the basics of modern AI. And one thing up front: this book is a success and highly recommendable.

On roughly 330 pages, Ertel covers almost all subfields of AI, starting with propositional and predicate logic, moving on through search problems and probability theory, up to neural networks and their applications. Every chapter contains several (practical) examples, figures and explanations. In addition, Ertel has put together a small collection of exercises with solutions for each chapter.

Somewhat problematic is the book’s thoroughly mathematical approach. You by no means have to study software engineering or something similar to enjoy this work – mathematics will do just as well. Every approach is rigorously derived and proven. While this is a nice way of doing things in itself, it does not always make reading and understanding easier. I am not sure whether it really serves the actual goal of understanding the basics of AI.

The exercise collections often follow a similar pattern: theorems and formulas have to be proven, rules derived or claims refuted. There are some real exercises on applications of AI as well, but rather few. Moreover, some exercises are simply references to other books that you cannot just work through on the side:

Get hold of the theorem prover E [Sch02] …

Such a hint in turn leads you to the attached and well-stocked bibliography and to the reference:

S. Schulze. E A Brainiac Theorem Prover. Journal of AI Communications 15 (2002)

…which unfortunately is of rather little help to me on a Sunday afternoon while preparing for the exam.

In my opinion, the book is also not well suited for self-study. The entire field of AI is described well and comprehensively, but mostly only briefly and with mathematical precision. This book is therefore best recommended as a companion to a lecture.

All in all, “Grundkurs Künstliche Intelligenz” (ISBN: 978-3-8348-0783-0) by Wolfgang Ertel is a thoroughly recommendable, very solid piece of work. You can tell that this book is well thought out and was written with a lot of care. It is definitely worth buying. There is also another good way to get this book for free: the publisher Vieweg und Teubner offers students, with access from their university, the possibility to download the complete book as a PDF. A very fair deal. Information about this is usually available from your local library or a similar institution.

Best regards, Thomas Uhrig.