Hello Berlin!

The last months have been an interesting and busy time for me (that’s why I didn’t write as much as I had planned, but I already have some new drafts in my WordPress!). After I finished my master’s thesis at Informatica in Stuttgart (greetings to the whole team!), I moved to Berlin for my first “real” job. I started at a company called Westernacher Products and Services, which builds a lot of web applications and does consulting for legal institutions in Germany. Based on an up-to-date technology stack with Spring, Gradle, ExtJS and AngularJS, Westernacher implements document management and all kinds of business applications. Being part of the development team, I am looking forward to a bunch of interesting projects and a lot of new stuff to learn. Nice to be here!

Best regards,
Thomas

Who is using OSGi?

Who is using OSGi? If you take a look at http://www.osgi.org/About/Members you will see more than a hundred members of the OSGi Alliance and a lot of big players such as IBM or Oracle. But let’s do a little investigation with Google Trends on our own.

Google Trends

Google Trends is a service where you can search for a particular keyword and get a timeline. The timeline shows you when the keyword was searched for and how many requests were made. That is a great way to estimate how popular a certain technology is at a given time. Google will also show you where the search requests came from, and that is where we start.

OSGi in Google Trends

If we search for “OSGi” on Google Trends we will see a chart just like the one shown below. As we can see, OSGi is past its peak and the interest in the technology has been decreasing for a couple of years.

But even more interesting is the map which shows where the search requests came from. As we can see, most requests came from China. Germany is in fourth place.

If we take a closer look at Germany, we see that most requests come from the south of the country.

But we can even get more specific and view the exact cities. It is a little bit hard to see in the chart, but you can click on the link at the bottom to see the full report. You will see Walldorf, Karlsruhe and Stuttgart on top. So what? Well, in Walldorf, there is one big player who is not on the list of the OSGi Alliance: SAP.

We can do the very same with the USA and we will end up in California and Redwood City, where companies like Oracle and Informatica are located.

Best regards,
Thomas

Flatten a Docker container or image

Docker containers and images can become fairly large. I recently worked with a Docker image which was over 7 GB in size. However, it is pretty easy to flatten an image in the end.

Difference between save and export

As I described in my last post (http://tuhrig.de/difference-between-save-and-export-in-docker), there are two ways to persist a Docker image or container:

  • A Docker image can be saved to a tarball and loaded back again. This will preserve the history of the image.

  • A Docker container can be exported to a tarball and imported back again. This will not preserve the history of the container.

No history

We can use this mechanism to flatten and shrink a Docker container. If we save an image to the disk, its whole history will be preserved, but if we export a container, its history gets lost and the resulting tarball will be much smaller.

We can see the history of an image by running docker history <IMAGE NAME>.

So if we export a container (either an already running one or a new one started from an image), it will lose its history and all previous layers. This makes it impossible to roll back to a certain layer, but it also shrinks the image. My >7 GB image is now >3 GB in size, which saves more than 50% of disk space.

Flatten a Docker container

So it is only possible to “flatten” a Docker container, not an image. We therefore need to start a container from an image first. Then we can export the container and pipe the resulting tarball straight into docker import, all in one line.
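The export/import pipe can also be scripted. Below is a minimal Python sketch, assuming the docker command line client is installed and on the PATH. The function names and the idea of wrapping the pipe in Python are my own illustration and not part of the original one-liner; only docker export and docker import - are the actual Docker commands.

```python
import subprocess


def flatten_commands(container_id, image_name):
    """Build the two commands of the export | import pipe.

    Both arguments are placeholders supplied by the caller.
    """
    export_cmd = ["docker", "export", container_id]
    # "-" tells docker import to read the tarball from stdin.
    import_cmd = ["docker", "import", "-", image_name]
    return export_cmd, import_cmd


def flatten(container_id, image_name):
    """Export a container and import it as a new, history-free image."""
    export_cmd, import_cmd = flatten_commands(container_id, image_name)
    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=exporter.stdout)
    # Close our copy of the pipe so the exporter notices if the importer dies.
    exporter.stdout.close()
    importer.wait()
    exporter.wait()
    return exporter.returncode == 0 and importer.returncode == 0
```

Calling flatten("<CONTAINER ID>", "my-flat-image:latest") would then create a new single-layer image; the image name here is just an example.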

What else?

You can use some common Linux tricks to shrink Docker images. One simple trick is to clear the cache of the package manager. Depending on which base image you use, this can be as simple as running apt-get clean on an Ubuntu/Debian system.

Best regards,
Thomas

Install OpinionTrends with nginx and memcached

OpinionTrends is built with Python and Flask. Therefore you can run it without any additional server right out of the box. Batteries included! However, it is much more common and much more efficient to run it with a web server and an application server. A widely used combination for Python web applications is nginx together with uWSGI. In this tutorial I want to show how we set up these two tools to run TechTrends and OpinionTrends. The tutorial starts at the very beginning and the only thing I assume is that you run a Linux machine with Python and easy_install.

Note: OpinionTrends is the new version of TechTrends. However, it is just a code name. So when I talk about OpinionTrends or TechTrends I mean the same thing.

OpinionTrends

Base folder

The first thing we have to do is to create the base directory for TechTrends. By convention a web application is located in /var/www. All of our code will go into this folder.

Clone & Update

Now we can start to set up the application from scratch. OpinionTrends is deployed with git, so we just check out the code and switch to the new opinion branch:

After the first checkout an update is super easy. Just do a pull for the latest code:

Configuration

Now we have to make some simple settings for TechTrends. To do this, we create a new file called config.py in the folder Configuration in our checked-out project. There is also a file called config_template.py in this folder which is an empty but well documented template for the configuration. The file contains all individual settings for the application. It should look like this:

Dependencies

Now we have to install the libraries needed by TechTrends. To do this we use easy_install. All dependencies are in a file called requirements.txt in the root folder of TechTrends. We have to install all dependencies in this file:

Note: As you see, the installation of the dependencies is very easy in theory. However, some dependencies such as scikit-learn are sometimes hard to install since they are not pure Python and use some C bindings. If you have problems installing the whole requirements.txt at once, try to install every dependency manually on its own.
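Installing every dependency on its own can be automated. The following is a minimal sketch, assuming a simple requirements.txt with one requirement per line; the function names are hypothetical and the script is not part of TechTrends. By default it only prints the commands (dry run), so you can check them before anything is installed.

```python
import subprocess


def parse_requirements(text):
    """Return the requirement lines, skipping blank lines and comments.

    Assumes a simple file with one requirement per line; lines that
    contain URLs with "#" fragments are not handled by this sketch.
    """
    requirements = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return requirements


def install_one_by_one(requirements, installer="easy_install", dry_run=True):
    """Install each requirement separately and return the ones that failed."""
    failed = []
    for requirement in requirements:
        command = [installer, requirement]
        if dry_run:
            print(" ".join(command))
            continue
        if subprocess.call(command) != 0:
            failed.append(requirement)
    return failed


if __name__ == "__main__":
    with open("requirements.txt") as handle:
        install_one_by_one(parse_requirements(handle.read()))
```

The list returned by install_one_by_one tells you which packages need special care, for example a compiler or system headers for the C bindings.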

nginx

Install

Installing nginx is not a big thing:

After we have installed and started nginx, we can verify that it is running correctly by pointing our favorite browser at our machine.

By the way, stopping nginx is also very simple:

Configuration

Now we have to configure nginx to point to TechTrends instead of the default welcome page. To do this we first remove the default configuration:

Now we create our own configuration in /var/www/techtrends/nginx.conf. It should look like this:

We link this file to nginx and restart it:

Now we should get a Bad Gateway error. Perfect! This tells us that nginx found our configuration and that everything looks good, except for the missing uWSGI!

uWSGI

Install

uWSGI is the application server that runs our Python application and talks to nginx. It is their way of communication. First we have to install it:
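To make clear what uWSGI actually talks to: it loads a WSGI callable from our code and calls it for every request that nginx passes on. A Flask application object is such a callable. The sketch below shows the bare interface without Flask; it is an illustration only and not the TechTrends entry point, and the response text is made up.

```python
def application(environ, start_response):
    """A minimal WSGI callable of the kind an application server loads.

    environ holds the request data, start_response is a callback for
    the status line and the response headers.
    """
    body = b"Hello from a WSGI application\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # The return value is an iterable of byte strings.
    return [body]
```

Whatever the uWSGI configuration names as module and callable has to resolve to an object with exactly this signature.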

Configuration

Now we create a configuration file in /var/www/techtrends/uwsgi.ini. It should look like this:

Now we can start uWSGI as a daemon in the background:

Done! nginx is serving TechTrends now. However, we should get an exception again since memcached is still missing. If we want to use TechTrends without memcached, we have to change a value in config.py in the Configuration folder of the TechTrends base directory: setting DEBUG = True disables memcached.

memcached

One of the biggest performance improvements you can make (I guess in general, but especially for TechTrends) is to use memcached. Memcached is an in-memory key-value store for caching frequently requested data. In TechTrends we use it to store whole pages and JSON responses from our API.
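The caching idea can be sketched in a few lines. This is not the TechTrends code: the decorator, the key prefix and the render function are hypothetical, and a small in-memory class stands in for a real memcached client so that the example runs without a server. I assume a client that offers get(key) and set(key, value, expiry), which common Python memcached clients do.

```python
import functools


class InMemoryCache:
    """Stand-in for a memcached client with get/set (illustration only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        # Like a memcached client, return None on a cache miss.
        return self._store.get(key)

    def set(self, key, value, time=0):
        # The expiry time is accepted but ignored by this stand-in.
        self._store[key] = value
        return True


def cached(cache, key_prefix, ttl_seconds=300):
    """Cache the result of a page-rendering function per query."""
    def decorator(render):
        @functools.wraps(render)
        def wrapper(query):
            # Real memcached keys must not contain whitespace;
            # this sketch assumes single-word queries.
            key = "%s:%s" % (key_prefix, query)
            page = cache.get(key)
            if page is None:
                page = render(query)
                cache.set(key, page, ttl_seconds)
            return page
        return wrapper
    return decorator


cache = InMemoryCache()
calls = []


@cached(cache, "search")
def render_search_page(query):
    calls.append(query)  # record how often the expensive work really runs
    return "<html>results for %s</html>" % query
```

Calling render_search_page("bitcoin") twice renders the page only once; the second call is answered from the cache.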

Install

Install memcached first:

And that’s it! Memcached is installed and running now. You can restart memcached (e.g. to clear it) like this:

Congratulations

Congratulations! TechTrends should now run in a stable production mode. We installed nginx, uWSGI and memcached and configured them to work together. Great! But there are still some open points to take care of.

crontab

TechTrends/OpinionTrends has two regularly scheduled jobs. One job is the crawler which crawls posts from Reddit and Hackernews, and the other job is the training and restart of the application. To execute these jobs we set up a crontab. First we create two files which execute these jobs. The first file is called crawl.sh and looks like this:

The second file is called restart.sh and looks like this:

Both files should be in the root folder (/var/www/techtrends/) of TechTrends. Now we add those two files to our local crontab. We can edit it like this:

It should look like this:

This crontab will run the crawler every 30 minutes, and once a day at 3 o’clock it will trigger the training and restart of the whole application. So the data will grow every 30 minutes and will be indexed once a day. That’s it.
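To double-check such a schedule, here is a small Python sketch. The two crontab expressions are the standard way to write "every 30 minutes" and "daily at 3:00"; that they match the original crontab exactly is an assumption on my side. The function due_jobs is hypothetical and only mirrors what cron would do with these two entries.

```python
from datetime import datetime

CRAWL_JOB = "/var/www/techtrends/crawl.sh"
RESTART_JOB = "/var/www/techtrends/restart.sh"

# Standard crontab expressions for the schedule described above (assumed).
SCHEDULE = {
    CRAWL_JOB: "*/30 * * * *",   # minute 0 and 30 of every hour
    RESTART_JOB: "0 3 * * *",    # every day at 03:00
}


def due_jobs(moment):
    """Return the jobs cron would start at the given minute."""
    jobs = []
    if moment.minute % 30 == 0:
        jobs.append(CRAWL_JOB)
    if moment.hour == 3 and moment.minute == 0:
        jobs.append(RESTART_JOB)
    return jobs
```

Note that at 03:00 both jobs start at the same time, so the crawler and the training run in parallel once a day.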

Best regards,
Thomas

Opinion Mining on Hackernews and Reddit

TechTrends

Last semester two of my friends and I made some sort of a search engine for Hackernews and Reddit. The idea was to collect all articles published on those two platforms and search them for trends. It should be possible to type in a certain keyword such as “Bitcoin” and retrieve a trend chart showing when and how many articles have been published about “Bitcoin”.

The result was TechTrends. Based on Radim Řehůřek’s Gensim (a Python library for topic modelling and document similarity) we built a web application which crawls Hackernews and Reddit continuously and offers a simple web interface to search trends in these posts. You can find more posts about TechTrends here.

OpinionTrends

This winter semester I started to implement a new feature for TechTrends. I wanted to build an opinion mining and sentiment analysis for all posts. Based on the comments for each post on Hackernews and Reddit, I wanted to classify all posts into one of three categories:

  • Positive, which means that most of the comments are praising the article.
  • Negative, which means that most of the comments are criticizing the article.
  • Neutral, which means that there is a balance between positive and negative comments.

I gave it the code name OpinionTrends.

The basic idea

The basic idea was to train a supervised classifier to categorize each comment and therefore each post. This should work similarly to a spam filter in an email application: each email marked as ‘spam’ trains a classifier which can then categorize emails on its own as good or bad (which actually means spam). I wanted to do the same, but with comments instead of emails and with three instead of two categories.
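The step from comments to posts can be sketched as follows. This is only an illustration of the three categories described above, not the actual OpinionTrends code: the function name, the label strings and the margin that decides when positive and negative comments count as “balanced” are all assumptions.

```python
from collections import Counter


def classify_post(comment_labels, margin=0.1):
    """Derive a post label from the labels of its comments.

    comment_labels is a list of "positive", "negative" or "neutral".
    margin (assumed value) is how far the positive share must differ
    from the negative share before the post counts as one-sided.
    """
    counts = Counter(comment_labels)
    positive = counts["positive"]
    negative = counts["negative"]
    decided = positive + negative
    if decided == 0:
        # No opinionated comments at all.
        return "neutral"
    difference = (positive - negative) / decided
    if difference > margin:
        return "positive"
    if difference < -margin:
        return "negative"
    return "neutral"
```

With this rule a post with two praising comments and one critical comment is positive, while a post with one of each is neutral.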

Training

A classifier is only as good as its training data. In case of a spam filter the training data are emails marked as ‘spam’ by the user. This makes the training data very good and very individual to each user. In my case I decided to use Amazon product reviews.

Amazon Product Reviews

Amazon product reviews are a great way to retrieve training data. They are rated with one to five stars, they are available in many languages and for many domains, and you can crawl them very easily. The only thing you have to do is sign up for a free developer account on Amazon and get started with your favorite language (there are libraries for most common languages out there).
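Turning star ratings into training labels could look like the sketch below. The exact mapping is an assumption for illustration (one and two stars as negative, three as neutral, four and five as positive); the post does not state which thresholds were really used, and the function names are hypothetical.

```python
def label_from_stars(stars):
    """Map a 1-5 star rating to one of three labels (assumed thresholds)."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def build_training_set(reviews):
    """Turn (text, stars) pairs into (text, label) pairs for training."""
    return [(text, label_from_stars(stars)) for text, stars in reviews]
```

For example, build_training_set([("Great SSD", 5), ("Died after a week", 1)]) yields one positive and one negative training example.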

Once the classifier is trained, it can be saved to a file and used further on. It doesn’t need to be updated anymore. However, the performance of the application depends completely on the classifier. Therefore it should be trained and tested carefully.

Validation

I tested different classifiers and different pre-processing steps of the Amazon reviews. Below you can see a comparison between a Bayes Classifier and an SVM. The SVM beats the Bayes Classifier in every configuration, by 8 to 21 percentage points. However, its performance also depends strongly on the pre-processing of the raw reviews.

Type of classifier | No pre-processing | Bi-grams     | Only adjectives | No stop words
Bayes Classifier   | 71% accuracy      | 65% accuracy | 70% accuracy    | 72% accuracy
SVM                | 85% accuracy      | 86% accuracy | 78% accuracy    | 84% accuracy
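Two of the pre-processing steps from the comparison are easy to sketch in plain Python. The stop word list here is a tiny made-up sample, the tokenizer is deliberately simple, and none of this is the code used for the measurements. The “only adjectives” variant needs a part-of-speech tagger and is left out.

```python
import re

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = frozenset(["a", "an", "the", "is", "it", "this", "and", "of", "to"])


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little sentiment."""
    return [token for token in tokens if token not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor, e.g. to keep 'not good' together."""
    return list(zip(tokens, tokens[1:]))
```

The resulting tokens or bi-grams are what the classifier sees as features instead of the raw review text.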

Problems with validation

All tests were made with a 10-fold cross-validation. The only problem with those tests is that I trained and tested with reviews from Amazon, but my final data were comments on blog posts, which is not exactly the same domain. Since Hackernews and Reddit are both about computer science, I used reviews of SSDs, Microsoft and Apple software and computer games to be as close as possible to my final domain. However, I can’t really validate my final results. This has two reasons:

  • I don’t have a large number of tagged blog posts and comments to compare with the results of the classifier.
  • Comments are very subjective. In many cases you cannot decide for sure whether a comment is positive or negative. A few comments are very clear and easy (I hate your article.), but a lot of comments are something in between (I love your website but I hate the color.). Even I as a human being cannot decide whether they are positive or negative, and if I could, my friend would argue with me.
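The 10-fold cross-validation mentioned above works like this: the data is split into ten parts, and each part is used once for testing while the other nine are used for training. Below is a minimal sketch in plain Python. The function names are hypothetical, the folds are built without shuffling to keep the example short, and train_fn and predict_fn stand for whatever classifier is being tested.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Interleaved folds; real experiments should shuffle the data first.
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i in range(k):
        test = folds[i]
        train = sorted(
            index
            for j, fold in enumerate(folds) if j != i
            for index in fold
        )
        yield train, test


def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
        )
        correct = sum(
            1 for i in test_idx if predict_fn(model, samples[i]) == labels[i]
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

The accuracy values in the table are averages of this kind. They only say how well the classifier works on Amazon reviews, which is exactly the limitation described above.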

My final result

OpinionTrends has been in its final stage for a couple of days now. Next week I will present it at the Media Night of my university (a fair for student projects). You can read more about it here.

OpinionTrends is also online and in some kind of a stable-state. However, it is still under development: http://opiniontrends.mi.hdm-stuttgart.de/

This is how it works

OpinionTrends works the same as TechTrends. You go to the website, type in a keyword and get the results. The only really new thing is a brand new tab on the result page.

(Screenshot: the new tab on the result page)

When you click on it you will see a new chart similar to the blue one on the Timeline tab. The chart has three colors and is very easy to read. The green bars represent the positive articles, the red bars represent the negative articles and the light yellow bars represent the neutral articles. The neutral articles are drawn with the same amount on both the positive and the negative scale.

(Chart: OpinionTrends result for the search term NSA)

Above you see the result for NSA, which is actually a very good example: the overwhelming opinion about the NSA is very negative, and you can see that perfectly in the chart.

You can click on each bar to see a pop-up showing the articles behind the bar. You can jump directly to the article or open the discussion with all comments on this article.

(Screenshot: pop-up listing the articles behind a bar)

Examples

Here are some good examples to show you how OpinionTrends works. The best one I found so far is a search for NSA. The opinion is very negative, as everyone would expect.

(Chart: OpinionTrends result for NSA)

The opinion on Git is much more balanced. It not only has a nearly equal number of positive and negative articles, it also has a larger number of neutral posts.

(Chart: OpinionTrends result for Git)

The opinion on Python is much better. It has a lot of neutral posts, but besides them, Python has far more positive than negative posts.

(Chart: OpinionTrends result for Python)

More…

OpinionTrends has some more features such as individual settings to adjust each search. However, I think this is too much for this post. You can get a lot more information directly on the project site. TechTrends/OpinionTrends is also open source, so you can check out the source code from BitBucket. OpinionTrends is in its own branch!

I hope you enjoy it and I would be really happy about some feedback.

Best regards,
Thomas

When your Acer Aspire 5560G doesn’t shut down anymore

Last year I bought an Acer Aspire 5560G as my second laptop (yes, sometimes two are better than one). Half a year later, I was working a lot with this laptop and decided to upgrade it with an SSD. So I bought a brand new Samsung SSD 840 PRO with 256 GB. I was very happy. Well, was…

After installing the SSD and reinstalling Windows 7 there was something strange: Windows didn’t shut down anymore. Whenever I tried to shut down my laptop (or restart it or put it into hibernation) Windows went off, but the laptop was still on. The fan was running, the power was on and the lights at the front of my laptop were blinking. So what?

So I tried to fix it. I reinstalled Windows (both Windows 7 and Windows 8.1). I installed all kinds of drivers, even for devices I don’t have. I updated, I formatted, I searched the internet and so on. I even flashed my BIOS to be sure. But nothing worked for me.

Last night I was searching the web again and I was really desperate. I found a forum post by some guy who had the same problem. And his solution was pretty simple: he installed Windows from a USB drive rather than from a DVD. I thought what the fuck, he’s joking, forget about it. But when I woke up this morning I was still desperate and I thought, well, let’s try it. I mean, why not?

So I put my Windows 7 DVD into my drive and copied it to a USB drive using some dubious freeware that tried to install one browser toolbar after another. But ten minutes later I had my Windows copy on my bootable USB stick. Restart, format, install, waiting… During the installation the laptop restarted several times and I took that as a good sign.

After approximately 30 minutes the big moment came and I pressed “shutdown” in my brand new Windows 7 installation. Aaaannnddd baaammm it went off!

I have studied computer science for 11 semesters now. I have installed all kinds of Windows (98, XP, Vista, 7, 8), some Linux distributions and other stuff. But that it makes a difference whether you install Windows from a DVD or from a USB stick doesn’t make any f***ing sense to me. Thank you Microsoft.

Best regards,
Thomas Uhrig

Posting Speaker Deck Presentations in WordPress

I’ve recently tried to post a Speaker Deck presentation on my WordPress blog. But I ran into a problem with Google’s Chrome browser: Chrome throws a scripting exception when I post the embed code.

The exception:

The embedded code:

Chrome seems to refuse to execute the script, perhaps to prevent something like cross-site scripting. But the solution is pretty simple. First, install a plugin in your WordPress blog to embed Speaker Deck presentations. Then fix it by changing the method to:

Here’s the post for the fix. Thanks!

Plugin: http://wordpress.org/extend/plugins/speakerdeck-embed

Fix: http://wordpress.org/support/topic/plugin-speaker-deck-embed-add-https-support

Best regards,
Thomas Uhrig

A Degree in Numbers

At the end of the current semester, winter semester 2011/12, I am finishing my bachelor’s studies. A Bachelor of Science included. And apart from that? Numbers:

Money:

  • 1260 € for public transport in Stuttgart
  • 10500 € rent (WOW!) + 3000 € for electricity and… oh dear, I’d rather not finish that calculation…
  • 4340 € tuition and semester fees

Studies:

  • 1277 days of studying
  • 115 semester hours per week (Semesterwochenstunden), which means about 1380 hours of pure lecture time at the HdM
  • 39 exams
    • 28 written exams
    • 9 projects, presentations or term papers
    • 1 internship phase
    • 1 thesis
  • 20 computer science books, 62 that had nothing to do with computer science
  • almost 70 live broadcasts of “Der Neue Morgen” on HoRadS

On the road:

  • 13300 km on the A81, which comes to about 1300 € for fuel
  • 6000 km on foot (that makes three pairs of shoes)
  • 5000 km by bike
  • 24200 minutes on the S-Bahn, which is about 400 hours or just under 17 full days

What really matters:

  • 10200 hours slept

Phew, now I can only hope that I didn’t miscalculate anywhere…

Best regards, Thomas Uhrig.