When modularity comes down to OSGi

Last week I attended a tech talk by Christian Schlichtherle, organized by Euro Staff in Berlin. The talk was about modularity and its different forms in software development. Christian talked about packages, libraries, APIs, dependency management and – as always when it comes down to modularity – about OSGi.

Every time somebody talks about OSGi I get this tiny pain in my stomach. OSGi is a really great idea and a lot of people have been working with it for many years now. That’s cool! But I still have some questions about it…

Who wants to install code during run time on a server?

Let’s start with my biggest point of criticism. OSGi has two main selling points:

  • OSGi (maybe) solves the class path hell (we will come to this later)
  • OSGi introduces modularity during run time for Java software

But really – who builds software in a highly static language such as Java and then installs code at run time on their production environment!? Maybe I just don’t know the use case, but every company I have seen so far is looking for a stable server architecture. They want to deploy something to their test environment, then go to staging and finally do a blue-green deployment. They don’t want to throw new software in and let it install itself. And if they did, I would wonder how they deal with open user sessions, with buggy updates which break the system, with security and so on. In my opinion there are only very few cases where you really need and want modularity on the server.

OSGi is solving the class path hell!

It’s often said that OSGi solves the class path hell. And whoever says this usually comes up with the diamond problem:

Some package A needs the packages B and C. B and C both need a package D, but in two different versions. And you are screwed!

OSGi solves this problem because every bundle gets its own class loader and every Java class is identified not only by its name but also by its class loader. This makes it possible to load the same class in different versions for the packages B and C. Problem solved!

But what happens when the packages B and C return objects from package D? Let’s make a simple example: package D contains a model: Person.class and Address.class. The packages B and C are different factory strategies for this model. Package B reads data from a web service and package C from a local file. And package A is using both of them:
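
A sketch of what the code in package A could look like (all class and method names are made up for this example):

    // Sketch of package A - WebServiceFactory lives in bundle B,
    // FileFactory in bundle C and Person in bundle D (all names made up).
    package a;

    import b.WebServiceFactory;
    import c.FileFactory;
    import d.Person;

    public class Main {

        public static void main(String[] args) {
            Person fromWebService = WebServiceFactory.createPerson();
            Person fromFile = FileFactory.createPerson();

            System.out.println(fromWebService);
            System.out.println(fromFile);
        }
    }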

The problem with this code is that it looks perfectly fine and compiles, too. But it will fail at run time with a class cast exception! Why? Because Person.class from package B was loaded with a different class loader than Person.class from package C. So for Java, those are two different classes!

To solve this problem in OSGi you would write a service with an interface in the packages B and C, describe it in XML and wire it in package A. You would expose those services explicitly and create a bunch of XML. Then you would package everything with an OSGi-compatible manifest, assign a version number to each bundle, re-pack third-party libraries as OSGi bundles too, define a start-up order for the bundles, load an OSGi run-time environment and then – a couple of hours later – you have really solved the diamond problem.
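
Just to give an impression, a bundle manifest for package B could roughly look like this (a made-up sketch, not taken from a real project):

    Bundle-ManifestVersion: 2
    Bundle-SymbolicName: com.example.bundle.b
    Bundle-Version: 1.0.0
    Import-Package: d;version="[1.5,1.6)"
    Export-Package: b;version="1.0.0"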

IMHO

I think solving the class path hell with OSGi means introducing a class loader and bundle hell instead. In pure Java it’s simple: if something is on the class path, you can use it. In OSGi, this is more complicated, and if you fail to strictly use OSGi services to decouple your software, you will end up with a highly coupled bunch of bundles which might throw class cast exceptions because you still cannot use version 1.5 and version 2 at the same time.

Best regards,
Thomas

Install OpinionTrends with nginx and memcached

OpinionTrends is built with Python and Flask. Therefore you can run it without any additional server right out of the box. Batteries included! However, it is much more common and much more efficient to run it with a web server and an application server. A widely used combination for Python web applications is nginx together with uWSGI. In this tutorial I want to show how to set up these two tools to run TechTrends and OpinionTrends. The tutorial starts at the very beginning and the only thing I assume is that you run a Linux machine with Python and easy_install.

Note: OpinionTrends is the new version of TechTrends. However, it is just a code name. So when I talk about OpinionTrends or TechTrends I mean the same thing.

OpinionTrends

Base folder

The first thing we have to do is to create the base directory for TechTrends. By convention a web application is located in /var/www. All of our code will go into this folder.
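
For example:

    sudo mkdir -p /var/www/techtrends
    # make the folder writable for the deployment user
    sudo chown $USER /var/www/techtrends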

Clone & Update

Now we can start to set up the application from scratch. OpinionTrends is deployed with git. So we just check out the code and switch to the new opinion branch:
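
The repository URL below is only a placeholder – use the real BitBucket URL of the project:

    cd /var/www
    git clone https://bitbucket.org/<user>/techtrends.git techtrends
    cd techtrends
    git checkout opinion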

After the first checkout an update is super easy. Just do a pull for the latest code:
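
    cd /var/www/techtrends
    git pull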

Configuration

Now we have to make some simple settings for TechTrends. To do this, we create a new file called config.py in the folder Configuration in our checked-out project. There is also a file called config_template.py in this folder which is an empty but well documented template for the configuration. The file contains all individual settings for the application. It should look like this:
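
The exact content depends on your setup; the only value this tutorial relies on later is the DEBUG flag, so a minimal sketch could be:

    # /var/www/techtrends/Configuration/config.py - minimal sketch,
    # see config_template.py for all documented settings
    DEBUG = False   # set to True to run without memcache (see below)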

Dependencies

Now we have to install the libraries needed by TechTrends. To do this we use easy_install. All dependencies are in a file called requirements.txt in the root folder of TechTrends. We have to install all dependencies listed in this file:
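
If pip is available this is a one-liner; with plain easy_install you can feed it the entries of the file via xargs:

    cd /var/www/techtrends
    pip install -r requirements.txt
    # or, with easy_install only (assuming one plain package name per line):
    xargs easy_install < requirements.txt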

Note: As you see, the installation of the dependencies is very easy in theory. However, some dependencies such as scikit-learn are sometimes hard to install since they are not pure Python and use some C bindings. If you have problems installing the whole requirements.txt at once, try to install every dependency manually on its own.

nginx


Install

Installing nginx is not a big thing:
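
On a Debian or Ubuntu machine, for example:

    sudo apt-get install nginx
    sudo service nginx start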

After we have installed and started nginx we can verify that it is running correctly by pointing our favorite browser at our machine.

By the way, stopping nginx is also very simple:
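
    sudo service nginx stop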

Configuration

Now we have to configure nginx to point to TechTrends instead of the default welcome page. To do this we first remove the default configuration:
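
    # on Debian/Ubuntu the default site is just a symlink in sites-enabled
    sudo rm /etc/nginx/sites-enabled/default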

Now we create our own configuration in /var/www/techtrends/nginx.conf. It should look like this:
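
A minimal sketch could look like this (the port is an assumption and has to match the uWSGI configuration below):

    # /var/www/techtrends/nginx.conf - minimal sketch
    server {
        listen 80;

        location / {
            include uwsgi_params;
            uwsgi_pass 127.0.0.1:8001;
        }
    }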

We link this file into the nginx configuration and restart nginx:
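
    sudo ln -s /var/www/techtrends/nginx.conf /etc/nginx/sites-enabled/techtrends
    sudo service nginx restart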

Now we should get a Bad Gateway error. Perfect! This tells us that nginx found our configuration and that everything looks good – except for the missing uWSGI!

uWSGI


Install

uWSGI is the application server that sits between our Python application and nginx; the uwsgi protocol is their way of communication. First we have to install it:
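
    # needs a C compiler and the Python headers to build
    sudo easy_install uwsgi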

Configuration

Now we create a configuration file in /var/www/techtrends/uwsgi.ini. It should look like this:
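
A minimal sketch could be (module and callable are placeholders and have to match the Flask application object in the project):

    # /var/www/techtrends/uwsgi.ini - minimal sketch
    [uwsgi]
    chdir = /var/www/techtrends
    # "module" and "callable" are placeholders for the real Flask app
    module = app
    callable = app
    # the socket has to match the uwsgi_pass address in nginx.conf
    socket = 127.0.0.1:8001
    master = true
    processes = 4
    pidfile = /var/www/techtrends/uwsgi.pid
    daemonize = /var/www/techtrends/uwsgi.log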

Now we can start uWSGI as a daemon in the background:
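
    cd /var/www/techtrends
    # daemonize is set in the ini file, so this returns immediately
    uwsgi --ini uwsgi.ini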

Done! Nginx is serving TechTrends now. However, we should get an error again since memcached is still missing. If we want to use TechTrends without memcache, we have to change a value in the config.py in the Configuration folder of the TechTrends base directory: we have to set DEBUG = True to disable memcache.

memcached


One of the biggest performance improvements you can make (in general, I guess, but especially for TechTrends) is to use memcached. Memcached is a key-value in-memory store to cache frequently requested data. In TechTrends we use it to store whole pages and JSON responses from our API.
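
The idea looks roughly like this with the python-memcached client (not the actual TechTrends code):

    # Sketch of the caching idea: look up a page in memcached and only
    # render it if the cache does not have it yet.
    import memcache

    cache = memcache.Client(["127.0.0.1:11211"])

    def cached(key, render, seconds=60 * 30):
        page = cache.get(key)
        if page is None:
            page = render()
            cache.set(key, page, time=seconds)
        return page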

Install

Install memcached first:
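
    sudo apt-get install memcached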

And that’s it! Memcached is installed and running now. You can restart memcached (e.g. to clear it) like this:
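
    sudo service memcached restart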

Congratulations

Congratulations! TechTrends should run in a stable production mode now. We installed nginx, uWSGI and memcached, and we configured them to work together. Great! But there are still some open points to take care of.

crontab

TechTrends/OpinionTrends has two regularly scheduled jobs. One job is the crawler, which crawls posts from Reddit and Hackernews, and the other job is the training and restart of the application. To execute these jobs we set up a crontab. First we create two files which execute these jobs. The first file is called crawl.sh and looks like this:
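
The exact command depends on how the crawler is started in your checkout; as a sketch it is just a small wrapper:

    #!/bin/sh
    # sketch only - "python crawler.py" is a placeholder for however
    # the crawler is started in your checkout
    cd /var/www/techtrends
    python crawler.py >> /var/www/techtrends/crawl.log 2>&1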

The second file is called restart.sh and looks like this:
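
Again only a sketch – the training step is a placeholder:

    #!/bin/sh
    # sketch only - re-train the model, then reload uWSGI
    cd /var/www/techtrends
    python train.py >> /var/www/techtrends/train.log 2>&1
    uwsgi --reload /var/www/techtrends/uwsgi.pid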

Both files should be in the root folder (/var/www/techtrends/) of TechTrends. Now we add those two files to our local crontab. We can edit it like this:
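
    crontab -e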

It should look like this:
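
    # run the crawler every 30 minutes
    */30 * * * * /var/www/techtrends/crawl.sh
    # re-train and restart the application once a day at 3 o'clock
    0 3 * * * /var/www/techtrends/restart.sh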

This crontab will run the crawler every 30 minutes, and once a day at 3 o’clock it will trigger the training and restart of the whole application. So the data will grow every 30 minutes and will be indexed once a day. That’s it.

Best regards,
Thomas

Opinion Mining on Hackernews and Reddit

TechTrends

Last semester two of my friends and I made some sort of a search engine for Hackernews and Reddit. The idea was to collect all articles published on those two platforms and search them for trends. It should be possible to type in a certain keyword such as “Bitcoin” and retrieve a trend chart showing when and how many articles have been published about “Bitcoin”.

The result was TechTrends. Based on Radim Řehůřek’s Gensim (a Python framework for topic modelling and document similarity) we built a web application which crawls Hackernews and Reddit continuously and offers a simple web interface to search for trends in these posts. You can find more posts about TechTrends here.

OpinionTrends

This winter semester I started to implement a new feature for TechTrends. I wanted to build opinion mining and sentiment analysis for all posts. Based on the comments for each post on Hackernews and Reddit, I wanted to classify all posts into one of three categories:

  • Positive, which means that most of the comments are praising the article.
  • Negative, which means that most of the comments are criticizing the article.
  • Neutral, which means that there is a balance between positive and negative comments.

I gave it the code name OpinionTrends.

The basic idea

The basic idea was to train a supervised classifier to categorize each comment and therefore each post. This should work similarly to a spam filter in an email application: each email marked as ‘spam’ trains a classifier which can then categorize emails on its own into good or bad (which actually means spam). I wanted to do the same, but with comments instead of emails and with three instead of two categories.

Training

A classifier is only as good as its training data. In the case of a spam filter, the training data are the emails marked as ‘spam’ by the user. This makes the training data very good and very individual to each user. In my case I decided to use Amazon product reviews.

Amazon Product Reviews

Amazon product reviews are a great way to retrieve training data. They are rated with one to five stars, they are available in many languages and for many domains, and you can crawl them very easily. The only thing you have to do is to sign up for a free developer account on Amazon and get started with your favorite language (there are libraries for most common languages out there).

Once the classifier is trained, it can be saved to a file and used further on. It doesn’t need to be updated anymore. However, the performance of the application depends completely on the classifier. Therefore it should be trained and tested carefully.
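
As a sketch with scikit-learn, this train-save-load cycle could look like this (not the actual OpinionTrends code; the example reviews and labels are made up):

    # Sketch: train a text classifier on labelled reviews, persist it to a
    # file and load it later to categorize new comments.
    import pickle

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    reviews = [
        "great product, works perfectly",
        "broken after two days, do not buy",
        "does its job, nothing special",
    ]
    labels = ["positive", "negative", "neutral"]

    classifier = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svm", LinearSVC()),
    ])
    classifier.fit(reviews, labels)

    # save the trained classifier to a file ...
    with open("classifier.pickle", "wb") as f:
        pickle.dump(classifier, f)

    # ... and load it later to categorize new comments
    with open("classifier.pickle", "rb") as f:
        classifier = pickle.load(f)
    print(classifier.predict(["I love your website but I hate the color"]))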

Validation

I tested different classifiers and different pre-processing steps for the Amazon reviews. Below you can see a comparison between a Bayes classifier and an SVM. The SVM beats the Bayes classifier by 10% or more. However, its performance also depends dramatically on the pre-processing of the raw reviews.

Type of classifier    No pre-processing    Bi-grams        Only adjectives    No stop words
Bayes Classifier      71% accuracy         65% accuracy    70% accuracy       72% accuracy
SVM                   85% accuracy         86% accuracy    78% accuracy       84% accuracy

Problems with validation

All tests were made with a 10-fold cross-validation. The only problem with those tests is that I trained and tested with reviews from Amazon, but my final data were comments on blog posts, which is not exactly the same domain. Since Hackernews and Reddit are both about computer science, I used reviews of SSDs, Microsoft and Apple software and computer games to be as close as possible to my final domain. However, I can’t really validate my final results. This has two reasons:

  • I don’t have a huge number of tagged blog posts and comments to compare them with the results of the classifier.
  • Comments are very subjective. In many cases you cannot decide for sure whether a comment is positive or negative. A few comments are very clear and easy (“I hate your article.”), but a lot of comments are something in between (“I love your website but I hate the color.”). Even I as a human being cannot decide if they are positive or negative, and if I could, a friend of mine might argue with me.

My final result

OpinionTrends has been in its final steps for a couple of days now. Next week I will present it at the Media Night of my university (a fair for student projects). You can read more about it here.

OpinionTrends is also online and in some kind of stable state. However, it is still under development: http://opiniontrends.mi.hdm-stuttgart.de/

This is how it works

OpinionTrends works the same way as TechTrends. You go to the website, type in a keyword and get the results. The only really new thing is a brand new tab on the result page.

op_trends_new_tab

When you click on it you will see a new chart similar to the blue one on the Timeline tab. The chart has three colors and is very easy to read. The green bars represent the positive articles, the red bars represent the negative articles and the light yellow bars represent the neutral articles. The neutral articles are visualized with the same amount on both the positive and the negative scale.

op_trends_nsa_chart

Above you see the result for NSA, which is actually a very good example since the overwhelming opinion about the NSA is very negative which you can see perfectly in the chart.

You can click on each bar to see a pop-up showing the articles behind the bar. You can jump directly to the article or open the discussion with all comments on this article.

op_trends_pop_up

Examples

Here are some good examples to show you how OpinionTrends works. The best one I found so far is a search for NSA. The opinion is very negative, as everyone would expect.

op_trends_nsa_chart

The opinion on Git is much more balanced. It has not only a nearly equal number of positive and negative articles, it also has a larger number of neutral posts.

op_trend_git

The opinion on Python is much better. It has a lot of neutral posts, but besides them, Python has far more positive than negative posts.

op_trend_python

More…

OpinionTrends has some more features, such as individual settings to adjust each search. However, I think this is too much for this post. You can get a lot more information directly on the project site. TechTrends/OpinionTrends is also open source, so you can check out the source code from BitBucket. OpinionTrends is in its own branch!

I hope you enjoy it and I would be really happy about some feedback.

Best regards,
Thomas

Media Night Winter Semester 2013/2014

opinion-trends-poster

During the last summer semester, two friends of mine and I made a student project called TechTrends. TechTrends was a web application that let you search for articles and trends in the field of computer science. Based on posts from Reddit and Hackernews, it provided an intelligent search on a growing number of articles and blogs.

During this winter semester I continued the project and implemented a sentiment analysis for TechTrends. Based on the existing infrastructure, such as our database and our crawler, I added an automated categorization of articles according to their comments on Hackernews and Reddit.

You can find the old and stable version of our project under http://techtrends.mi.hdm-stuttgart.de/. The up-to-date development version is available under http://opiniontrends.mi.hdm-stuttgart.de/.

media_night_ws13

I will present the project at the Media Night at our university next week. It’s open for everybody and for free. It will start around 6 pm, but you can come whenever you want to, there is no schedule. Every project has its own booth, where it is presented and where you can ask questions and get in touch with the people behind it.

You can find the program and information about all projects on http://www.hdm-stuttgart.de/medianight.

What? – Media Night Winter 2013
When? – 16th January 2014, from 6 pm to 10 pm
Where? – Hochschule der Medien, Nobelstraße 10, 70569 Stuttgart

Best regards,
Thomas

TechTrends – Searching trends on HN and Reddit

It’s done! Last Friday (26th July 2013) was the final presentation of our semester project TechTrends. Together with Raphael Brand and Hannes Pernpeintner I worked on this project for the last five months – and we are really happy about it.

What is TechTrends?

TechTrends is basically a search engine for HN and Reddit (just the programming sub-reddit). You can type in a keyword and you will find a bunch of similar articles. But our main concern was not just to find articles, but also to find trends. Therefore, you will not only get a pure list of articles, you will also get a chart showing when articles have been found for your query. For each article we calculate a popularity, indicating how important that article is. Based on these popularities, we draw the trend chart for your search.



How does it work?

TechTrends has six major parts (see the graphic below). First of all, we crawl HN and Reddit every 15 minutes to get the latest links. In the second part we get the actual content from each link and store it in our database. Then we do a preprocessing on this pure text content to remove stop words, digits and so on. After that, we use the great Gensim server from Radim Řehůřek to build an index of all documents. The next (and last) part on the server is a JSON-based web API to access all of our data (here is its documentation). On top of this API we built our user interface – the actual website.
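
In production we use the Gensim server for this, but the indexing and querying idea can be sketched with plain Gensim (the example documents are made up):

    # Illustrative sketch of the indexing idea with plain Gensim.
    from gensim import corpora, models, similarities

    documents = [
        "bitcoin hits a new all time high",
        "new python release improves unicode handling",
        "why bitcoin mining is getting harder",
    ]

    texts = [document.lower().split() for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    tfidf = models.TfidfModel(corpus)
    index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

    # query the index for a keyword and print the documents by similarity
    query = dictionary.doc2bow("bitcoin".split())
    for doc_id, score in sorted(enumerate(index[tfidf[query]]), key=lambda hit: -hit[1]):
        print(documents[doc_id], score)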

techtrends-components

Presentation

Here is the video of our final presentation about TechTrends at our university (on 26th July 2013). Our presentation is about 70 minutes long and we explain a lot of details about our project. The video is available on http://events.mi.hdm-stuttgart.de/2013-07-25-vortr%C3%A4ge-programming-intelligent-applications#techtrends or below. It also contains the presentations of the other groups, but we are the first in the video. You can find the slides below the video.

Video

Thanks to Stephan Soller and his team, our presentation has been recorded on video. You can see the video below or on http://events.mi.hdm-stuttgart.de/.

Slides

Here are the slides on my speaker deck account.

What’s next?

We have a lot of ideas for the future of TechTrends. We are thinking, for example, about a mobile version or a reporting tool. But in my opinion, the most important step is to make TechTrends easier to customize. Currently, we are focused on HN and Reddit. However, everything but the actual crawlers is independent of the underlying source. With a little bit of work, you can easily implement a new crawler for the data source of your choice. Making this customization more comfortable and easy is our next goal for the project.

More

There is plenty more! Here you go:

Best regards,
Thomas Uhrig

TechTrends Final Presentation

Tomorrow morning is the final presentation of our semester project TechTrends. I posted several articles about TechTrends (see here) and I will definitely post one more after tomorrow. But for now, here’s our final presentation.



The presentation shows how we built TechTrends and covers different aspects of the development process. It talks about crawling Hackernews and Reddit, preprocessing and learning a model to query. We also describe problems, further ideas and much more. The presentation will take about 60 to 70 minutes (and everybody is still welcome tomorrow).

The presentation will also be streamed live on http://events.mi.hdm-stuttgart.de.

Best regards,
Thomas Uhrig

TechTrends Presentation

Next Friday (July 26th 2013) the final presentation of TechTrends will take place at our university. The presentation will take about 60 minutes and will cover topics like architecture, crawling, data storage and front end design. Everybody interested is welcome (please send me an email beforehand). Here’s the whole schedule for next week:

09.00h-10.10h Tech Trends (by Brand, Uhrig, Pernpeintner)
10.15h-11.25h Newsline (by Förder, Golpashin, Wetzel, Keller)
11.45h-12.35h Intelligent Filters for Image Search (by Landmesser, Mussin)
12.40h-13.50h Nao Face Recognition (by Sandrock, Schneider, Müller)
13.55h-14.35h GPU-driven deep CNN (by Schröder)

The presentations will take place in room 056 (Aquarium). I will upload the presentation to my speakerdeck account at the end of next week.

Best regards,
Thomas Uhrig

Media Night Review

Raphael Brand, Hannes Pernpeintner and I presented our semester project yesterday at the Media Night of our university. It was very nice to meet some interested people, to answer their questions and to show what we did during the last four months. Thanks.

Here are some impressions of last night. Our project is still online on http://techtrends.mi.hdm-stuttgart.de/. We will run it as long as we can.

Best regards,
Thomas Uhrig

TechTrends at the Media Night 2013 of the Media University Stuttgart

During this summer semester, two friends of mine and I made a student project called TechTrends. TechTrends is a web application that lets you search for articles and trends in the field of computer science. We crawl posts from Reddit and Hackernews and provide an intelligent search on them. You can type in a key-word (e.g. bitcoin) and get a timeline showing you when articles for this topic have been published.

techtrends_medianight

We will present our project at the Media Night at our university next week. It’s open for everybody and for free. It will start around 6 pm, but you can come whenever you want to. There is no schedule. Every project has its own booth, where it is presented and where you can ask questions and get in touch with the people behind it.

You can find the program and information about all projects on http://www.hdm-stuttgart.de/medianight.

What? – Media Night 2013
When? – 27th June 2013, from 6 pm to 10 pm
Where? – Hochschule der Medien, Nobelstraße 10, 70569 Stuttgart

Our project is online on http://techtrends.mi.hdm-stuttgart.de/.

Best regards,
Thomas Uhrig

Extracting meaningful content from raw HTML

Parsing HTML is easy. Libraries like Beautiful Soup give you a compact and straightforward interface to process websites in your preferred programming language. But this is only the first step. The interesting question is: how do you extract the meaningful content from HTML?

I tried to find an answer to this question during the last couple of days – and here’s what I found.

Arc90 Readability

My favorite solution is the so-called Arc90 Readability algorithm. It was developed by Arc90 Labs to make websites more comfortable to read (e.g. on mobile devices). You can find it – for example – as a Google Chrome browser plugin. The whole project is also on Google Code, but more interesting is the actual algorithm, ported to Python by Nirmal Patel. Here you can find his original source code.

The algorithm is based on two lists of HTML ID names and HTML CLASS names. One list contains IDs and CLASSes with a positive meaning, the other list contains IDs and CLASSes with a negative meaning. If a tag has a positive ID or CLASS, it will get additional points; if it has a negative ID or CLASS, it will lose points. When we calculate these points for all tags in the HTML document, we can just render the tags with the most points to get the main content in the end. Here’s a made-up example:
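
    <div id="post">Lorem ipsum – this looks like the actual content of the page …</div>
    <div class="footer">Copyright 2013 | Imprint | Contact</div>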

The first div-tag has a very positive ID (id="post"), so it will probably contain the actual post. However, the div-tag in the second line has a very negative class (class="footer"), which tells us that it seems to contain the footer of the page and not any meaningful content. With this knowledge, we do the following:

  1. get all paragraphs (p-tags) from the HTML source
  2. for each paragraph:
    1. add the parent of the paragraph to a list (if it's not already added)
    2. initialize the score of the parent with 0
    3. if the parent has a positive attribute, add points!
    4. if the parent has a negative attribute, subtract points!
    5. optional: check additional rules, e.g. a minimum length
  3. find parent with most points (the so called top-parent)
  4. render the textual content of the top-parent

Here’s my code, which is based very much on the code of Nirmal Patel which you can find here. The main thing I changed is some more cleaning before the actual algorithm. This produces an easy-to-interpret HTML document without scripts, images and so on, but still with all textual content.
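
A shortened sketch of the scoring part (using Beautiful Soup and with the positive and negative lists trimmed down to a few entries) could look like this:

    # Shortened sketch of the Arc90-style scoring: clean the document,
    # score the parents of all <p> tags and render the best one.
    import re

    from bs4 import BeautifulSoup

    POSITIVE = re.compile("article|body|content|entry|main|post|text")
    NEGATIVE = re.compile("combx|comment|contact|foot|footer|masthead|meta|sidebar")

    def extract_main_content(html):
        soup = BeautifulSoup(html, "html.parser")

        # cleaning: drop tags that never contain meaningful text
        for tag in soup(["script", "style", "img"]):
            tag.decompose()

        scores = {}   # id of the parent tag -> score
        parents = {}  # id of the parent tag -> the tag itself
        for paragraph in soup.find_all("p"):
            parent = paragraph.parent
            key = id(parent)
            if key not in scores:
                parents[key] = parent
                scores[key] = 0
                attributes = " ".join([parent.get("id", "")] + parent.get("class", []))
                if POSITIVE.search(attributes):
                    scores[key] += 25
                if NEGATIVE.search(attributes):
                    scores[key] -= 25
            # optional additional rule: long paragraphs are a good sign
            if len(paragraph.get_text()) > 100:
                scores[key] += 1

        if not scores:
            return ""
        top_parent = parents[max(scores, key=scores.get)]
        return top_parent.get_text(separator=" ", strip=True)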

Whitespace Rendering

The idea behind this technique is quite simple and goes like this: you go through your raw HTML string and replace every tag (everything between < and >) with white spaces. When you render the content, all textual blocks should still be "blocks", whereas the rest of the page should be scattered words with a lot of white space. The only thing you have to do now is to get the blocks of text and throw away the rest. Here's a quick sketch of an implementation:
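
    # Quick sketch of the idea: replace every tag with white space of the
    # same length, then keep only the large connected blocks of text.
    import re

    def extract_text_blocks(html, min_whitespace=10, min_length=100):
        # replace every tag (everything between < and >) with spaces
        rendered = re.sub(r"<[^>]*>", lambda match: " " * len(match.group()), html)
        # split on long runs of white space - this is the tuning knob
        blocks = re.split(r"\s{%d,}" % min_whitespace, rendered)
        return [block.strip() for block in blocks if len(block.strip()) >= min_length]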

The problem with this solution is that it's not very generic. You have to do a lot of fine-tuning to find a good length of white space to split the string. Websites with a lot of markup will produce far more white space than simple pages. On the other hand, this is a quite simple approach – and simple is mostly good.

Libraries

As always: don't reinvent the wheel! There are a lot of libraries out there that are dealing with this problem. On of my favorite libraries is Boilerpipe. You can find it online as a web-service on http://boilerpipe-web.appspot.com/ and as a Java project on https://code.google.com/p/boilerpipe/. It's doing a real good job, but compared to the two algorithms I explained above, it's much more complicated inside. However, using it as a black-box might be a good solution to find your content.

Best regards,
Thomas Uhrig