Install OpinionTrends with nginx and memcached

OpinionTrends is built with Python and Flask. Therefore you can run it without any additional server right out of the box. Batteries included! However, it is much more common and much more efficient to run it behind a web server and an application server. A widely used combination for Python web applications is nginx together with uWSGI. In this tutorial I want to show how we set up these two tools to run TechTrends and OpinionTrends. The tutorial starts at the very beginning and the only thing I assume is that you run a Linux machine with Python and easy_install.

Note: OpinionTrends is the new version of TechTrends. However, it is just a code name. So when I talk about OpinionTrends or TechTrends I mean the same thing.

OpinionTrends

Base folder

The first thing we have to do is to create the base directory for TechTrends. By convention a web application is located in /var/www. All of our code will go into this folder.

Clone & Update

Now we can start to set up the application from scratch. OpinionTrends is deployed with git. So we just check out the code and switch to the new opinion branch:

After the first checkout an update is super easy. Just do a pull for the latest code:

Configuration

Now we have to make some simple settings for TechTrends. To do this, we create a new file called config.py in the folder Configuration in our checked-out project. There is also a file called config_template.py in this folder which is an empty but well documented template for the configuration. The file contains all individual settings for the application. It should look like this:

Dependencies

Now we have to install the libraries needed by TechTrends. To do this we use easy_install. All dependencies are in a file called requirements.txt in the root folder of TechTrends. We have to install all dependencies in this file:

Note: As you see, the installation of the dependencies is very easy in theory. However, some dependencies such as scikit-learn are sometimes hard to install since they are not pure Python and use some C bindings. If you have problems installing the whole requirements.txt at once, try to install every dependency manually on its own.

nginx

Install

Installing nginx is not a big thing:

After we have installed and started nginx, we can verify that it is running correctly by opening the address of our machine in our favorite browser.

By the way, stopping nginx is also very simple:

Configuration

Now we have to configure nginx to point to TechTrends instead of the default welcome page. To do this we first remove the default configuration:

Now we create our own configuration in /var/www/techtrends/nginx.conf. It should look like this:

We link this file to nginx and restart it:

Now we should get a 502 Bad Gateway error. Perfect! This tells us that nginx found our configuration and that everything looks good, except for the missing uWSGI!

uWSGI

Install

uWSGI is the application server that runs our Python application and connects it to nginx. It is their way of communicating. First we have to install it:

Configuration

Now we create a configuration file in /var/www/techtrends/uwsgi.ini. It should look like this:

Now we can start uWSGI as a daemon in the background:

Done! Nginx is serving TechTrends now. However, we should get an exception again since memcached is still missing. If we want to use TechTrends without memcached, we have to change a value in config.py in the Configuration folder of the TechTrends base directory: setting DEBUG = True disables the use of memcached.
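To make the role of uWSGI in this setup more concrete: an application server loads a WSGI callable from the Python application and invokes it once per request, and Flask applications expose such a callable. The following is only a minimal sketch using the standard library. The name application, the helper call_once and the response text are assumptions for illustration and are not taken from the TechTrends code.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable; an application server calls this once per request."""
    body = b"Hello from a WSGI app\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_once(app, path="/"):
    """Invoke the callable the way a server would, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(response_headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling call_once(application) returns the status line, the headers and the body, which is roughly what the application server hands back to nginx for every request.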

memcached

One of the biggest performance improvements you can make (I guess in general, but especially for TechTrends) is to use memcached. Memcached is an in-memory key-value store for caching frequently requested data. In TechTrends we use it to store whole pages and JSON responses from our API.
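The caching idea can be sketched as follows. This is not the TechTrends implementation: InMemoryCache is a hypothetical stand-in for a memcached client, and cached_response, its ttl parameter and its debug flag are names made up for this example. The debug flag only mirrors the idea of the DEBUG setting mentioned earlier, which switches caching off.

```python
import time


class InMemoryCache:
    """Stand-in for a memcached client (hypothetical); only get/set with expiry."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl else None
        self._store[key] = (value, expires_at)


def cached_response(cache, key, render, ttl=300, debug=False):
    """Return the cached response for key, rendering and storing it on a miss."""
    if debug:  # mirrors the idea of DEBUG = True: bypass the cache entirely
        return render()
    value = cache.get(key)
    if value is None:
        value = render()
        cache.set(key, value, ttl)
    return value
```

The expensive render function runs only on a cache miss. Every later request for the same key is answered from memory until the entry expires.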

Install

Install memcached first:

And that’s it! Memcached is installed and running now. You can restart memcached (e.g. to clear it) like this:

Congratulations

Congratulations! TechTrends should now run in a stable production mode. We installed nginx, uWSGI and memcached. We also configured them to work together. Great! But there are still some open points to take care of.

crontab

TechTrends/OpinionTrends has two regularly scheduled jobs. One job is the crawler which crawls posts from Reddit and Hackernews and the other job is the training and restart of the application. To execute these jobs we set up a crontab. First we create two files which execute these jobs. The first file is called crawl.sh and looks like this:

The second file is called restart.sh and looks like this:

Both files should be in the root folder (/var/www/techtrends/) of TechTrends. Now we add those two files to our local crontab. We can edit it like this:

It should look like this:

This crontab will run the crawler every 30 minutes, and once a day at 3 o’clock it will trigger the training and restart of the whole application. So the data will grow every 30 minutes and will be indexed once a day. That’s it.
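The schedule described above can be expressed as a small check. This is only an illustration of the timing logic, not the crontab itself. The function name jobs_due is made up, and I assume that 3 o’clock means 03:00 at night. In cron syntax the two rules would usually be written as */30 * * * * and 0 3 * * *.

```python
from datetime import datetime


def jobs_due(moment):
    """Return the script names that the described schedule would start at moment."""
    due = []
    # Every 30 minutes, i.e. at minute 0 and 30 (cron: */30 * * * *)
    if moment.minute % 30 == 0:
        due.append("crawl.sh")
    # Once a day at 3 o'clock, assumed to be 03:00 (cron: 0 3 * * *)
    if moment.hour == 3 and moment.minute == 0:
        due.append("restart.sh")
    return due
```

At 03:00 both scripts are due, at 03:30 only the crawler runs, and at 03:15 nothing happens.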

Best regards,
Thomas

Working with GerritHub.io

This semester I attended a university course called System Engineering and Management taught by Prof. Walter Kriha. Together with Jan Müller, I built a continuous integration environment for the course. As we were looking for a code review system we found a platform called GerritHub.io which offered a Gerrit system for GitHub as a service. Here is our experience with this service.

Gerrit

Gerrit is an open-source tool for doing code reviews with Git. At its core it encapsulates a Git implementation (JGit in this case) and works exactly like Git for a developer. You can clone a Gerrit repository, commit and push changes to it. The only difference is that code must be reviewed by somebody else before it is merged into the real repository. Gerrit acts like a layer between the developer and the repository. It is used to enforce a certain review policy in (big) teams.

GerritHub.io

A good idea

GerritHub.io offers a Gerrit system for GitHub as a service. This is very nice, because setting up Gerrit on your own can be very annoying. Gerrit is written in Java and comes with a Maven pom file. This should make things easy, but it doesn’t. Some dependencies needed by Gerrit are not in public Maven repositories, so you first have to get these libraries yourself. If you need a certain plugin for Gerrit (e.g. to connect easily to GitHub) it gets even worse. Most plugins depend on a certain version of Gerrit, and mostly not on the current one. The versioning of the plugins also differs from the versioning of Gerrit. Therefore, GerritHub.io is a really good idea.

Let’s begin with something positive: it is easy!

Using GerritHub.io is really easy. You have to create a developer account and connect this account with your GitHub account. Then you can clone your GitHub repository to GerritHub.io. From this point on you only push and pull to GerritHub.io, not to GitHub any more. GerritHub.io lets you adjust some settings and provides the code review. After a review it will automatically merge the code back to GitHub so both repositories stay synchronized. It’s really easy.

IMHO, it’s very buggy

[Image: conversation with GerritHub]

GerritHub.io is a really good idea and it is really easy to use, but, and this is my own humble opinion, it is really, really buggy.

We used GerritHub.io for no longer than two months in a small team of two people. We pushed maybe 20 commits. In this time we had approximately 5 problems with the platform which we couldn’t fix on our own and which blocked us completely. There were no error messages and, as far as I understand, nothing that we as users did wrong. So we wrote an email to the support. And another one. And another one…

The support of GerritHub.io was really nice and really fast. More importantly, they always fixed our problems. Sometimes it was something with their file system, sometimes a problem with OAuth, sometimes something else. Still, this should not happen. I know that GerritHub.io is free and maybe it is a very new project. But it is backed by a company called GerritForge.com which tries to make money with such products, and this doesn’t cast a positive light on the company. Besides the blocking bugs on GerritHub.io, there were some minor bugs, such as the search/filter function which didn’t find any projects or the user interface which sometimes just stopped loading anything. IMHO, a project should not go public like this.

If GerritHub.io fixes the problems, the platform could be a really great product. And maybe the problems are already fixed when you read this post. But right now, I cannot recommend this platform. It’s crap.

Best regards,
Thomas

Media Night Apps

This year there is something new at the Media Night, the student fair at our university.

The booklet is dead!

In the past, the Media Night had a printed booklet with its program and timetable in it. The booklet was produced by one of our printing faculties for which the HdM is pretty famous. It was distributed for free during the evening of the fair.

[Image: the old booklet]

Long live the booklet!

However, as of this semester the good old booklet is history! Instead, the Media Night has a brand new app with all projects, students and even indoor navigation. You can find it in Apple’s App Store and Google’s Play Store.

Most innovative projects

But why have only one app if you could have two? In case you have an Apple iPad or iPhone you can get an app about the 10 most innovative projects of this Media Night. And TechTrends is one of them!

[Image: app with the most innovative projects]

I hope you enjoy the apps, the projects and the evening.

Best regards from the Media Night,
Thomas

Continuous Integration Development Workflow

This semester I attended a university course called System Engineering and Management taught by Prof. Walter Kriha. The course has a slightly different topic every year and is made up of presentations from students, research assistants and other lecturers. This year’s topic was continuous integration and software deployment. Together with Jan Müller, I set up a continuous integration environment to develop a Python web app.

Our demo app

We made a simple demo app to demonstrate our workflow and to show the purpose of the components in our CI environment. We built our app with Python and Flask on the server, simple template rendering for the client and a MongoDB at the back-end. The app itself was a simple CRUD application to add, update and delete users from a list.

The CI environment

[Image: overview of the CI environment]

GitHub

We used GitHub as our central repository and to communicate with the rest of the world. The idea (for a real world team) was to use GitHub as a repository, as a wiki, as an issue tracker and as a community tool, e.g. for pull requests. However, we did not push any code directly to GitHub! Here is why:

GerritHub

We used GerritHub as a review tool and as our central repository for all development tasks. To do this, GerritHub cloned the original repository from GitHub and automatically synchronized both of them. As developers we checked out code from GerritHub (not from GitHub!) and also pushed our changes to GerritHub (and, guess what, not to GitHub directly!). After a new change was pushed to GerritHub, all developers of the project received an email notification about the change. The change itself had to be reviewed and approved in GerritHub by at least one other developer. Only after a change was reviewed and approved did it go into our code base on GitHub (GerritHub merged it into the code base). This enforced a certain policy in our development process where every piece of code must be reviewed by somebody else.
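The review policy described above boils down to a very small rule. The following sketch only illustrates that rule and is not how Gerrit implements it. The function name can_merge and the way approvals are passed in are assumptions for this example.

```python
def can_merge(author, approvals):
    """A change may be merged once at least one developer other than the author approved it."""
    return any(reviewer != author for reviewer in approvals)
```

With this rule, can_merge("thomas", ["jan"]) is true, while a change that was approved only by its own author or by nobody stays blocked.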

BuildBot

After the code was pushed to GerritHub, approved there and merged into GitHub, it was time for our build system. We set up a so-called post-commit hook in GitHub, which informed our build system about code changes. So each time new code arrived from GerritHub in our GitHub repository, our build system started a new build. Since our app was made in Python we decided to use BuildBot (which itself is made in Python) as our build system. However, what we did is possible with any other build system, too. BuildBot did a set of tasks for us:

  1. Check out the latest code from GitHub
  2. Execute all unit tests
  3. Run static code analysis such as PEP8
  4. Build an artifact to download
  5. Deploy the app to our demo machine
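The important property of such a task list is that the steps run in order and the build stops at the first failure, so a broken build is never deployed. The sketch below shows this idea in plain Python. It does not use BuildBot’s API, and run_pipeline as well as the step names are made up for this example.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order; stop at the first step that returns False."""
    completed = []
    for name, step in steps:
        if not step():
            return completed, name  # name of the failing step
        completed.append(name)
    return completed, None
```

A usage example with dummy steps: if the unit tests fail, the later steps (static analysis, artifact, deployment) are never executed.

```python
steps = [
    ("checkout", lambda: True),
    ("unit tests", lambda: False),  # simulate a failing test run
    ("deploy", lambda: True),
]
completed, failed = run_pipeline(steps)
```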

Demo machine

We set up a demo machine where our BuildBot deployed every (successful) build. With this automated deployment we always had an up-to-date and running instance of our app. This is great for a team since everybody on the team can see the latest features and show them to potential customers.

Demo videos

I made a short (~10 minutes) video to demonstrate the development workflow in this environment. The video covers a full development cycle of a new feature from the first check-out of the project to the final result on the demo machine. I recorded a voice-over in English and in German; both should say (more or less) the same. The video is cut to make it as short as possible, so you won’t have to wait for my Eclipse to start…

What I do in the video is the following:

  1. Check out the project and make sure it is in a stable state
  2. Implement a new feature in a test-driven way
  3. Push the new feature
  4. Watch BuildBot doing its work
  5. Enjoy the final result on our demo machine

Demo Video (English): CI Development Workflow

Video: http://www.youtube.com/watch?v=ZYiwhvRDRak



Demo Video (German): CI Development Workflow

Video: http://www.youtube.com/watch?v=tza03NG1xwo



Best regards,
Thomas

Opinion Mining on Hackernews and Reddit

TechTrends

Last semester two of my friends and I made some sort of a search engine for Hackernews and Reddit. The idea was to collect all articles published on those two platforms and search them for trends. It should be possible to type in a certain keyword such as “Bitcoin” and retrieve a trend chart showing when and how many articles have been published about “Bitcoin”.
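The data behind such a trend chart is essentially a count of matching articles per time bucket. The sketch below uses a plain case-insensitive substring match on the title and monthly buckets. Both are simplifying assumptions for illustration, and the function name trend is made up; this is not how TechTrends searches its articles.

```python
from collections import Counter
from datetime import date


def trend(posts, keyword):
    """Count posts per month whose title mentions keyword (case-insensitive)."""
    counts = Counter()
    needle = keyword.lower()
    for published, title in posts:
        if needle in title.lower():
            counts[(published.year, published.month)] += 1
    return sorted(counts.items())
```

Each entry of the result is a ((year, month), count) pair, which maps directly to one bar of a timeline chart.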

The result was TechTrends. Based on Radim Řehůřek’s Gensim (a Python framework for text classification) we built a web application which crawls Hackernews and Reddit continuously and offers a simple web interface to search trends in these posts. You can find more posts about TechTrends here.

OpinionTrends

This winter semester I started to implement a new feature for TechTrends. I wanted to build an opinion mining and sentiment analysis for all posts. Based on the comments for each post on Hackernews and Reddit I wanted to classify all posts into one of three categories:

  • Positive, which means that most of the comments are praising the article.
  • Negative, which means that most of the comments are criticizing the article.
  • Neutral, which means that there is a balance between positive and negative comments.

I gave it the code name OpinionTrends.

The basic idea

The basic idea was to train a supervised classifier to categorize each comment and therefore each post. This should work similarly to a spam filter in an email application: Each email marked as ‘spam’ will train a classifier which can categorize emails on its own as good or bad (which actually means spam). I wanted to do the same but with comments instead of emails and with three instead of two categories.
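The step from classified comments to a classified post can be sketched as a simple vote. This is an assumption about how the aggregation could work, not the actual OpinionTrends code. The function name classify_post and the label strings are made up, and neutral comments are simply ignored in the vote.

```python
from collections import Counter


def classify_post(comment_labels):
    """Aggregate per-comment labels into one post label by simple majority."""
    counts = Counter(comment_labels)
    positive, negative = counts["positive"], counts["negative"]
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # balance between positive and negative (or no comments)
```

A post with two positive comments and one negative comment ends up as positive, while a tie is treated as neutral.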

Training

A classifier is only as good as its training data. In the case of a spam filter the training data are emails marked as ‘spam’ by the user. This makes the training data very good and very individual to each user. In my case I decided to use Amazon product reviews.

Amazon Product Reviews

Amazon product reviews are a great way to retrieve training data. They are rated with 1 to 5 stars, they are available in many languages and for many domains, and you can crawl them very easily. The only thing you have to do is to sign up for a free developer account on Amazon and get started with your favorite language (there are libraries for most common languages out there).
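To use star ratings as training data for three categories, the five star levels have to be mapped to labels. The thresholds in the following sketch (1 to 2 stars negative, 3 stars neutral, 4 to 5 stars positive) are an assumption for illustration; the mapping that was actually used is not stated here. The function names are made up as well.

```python
def label_from_stars(stars):
    """Map a 1-5 star rating to a training label (thresholds are an assumption)."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def build_training_set(reviews):
    """Turn (stars, text) pairs into (text, label) pairs for a supervised classifier."""
    return [(text, label_from_stars(stars)) for stars, text in reviews]
```

The resulting (text, label) pairs are the kind of input a supervised classifier is trained on.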

Once the classifier is trained, it can be saved to a file and used further on. It doesn’t need to be updated anymore. However, the performance of the application depends completely on the classifier. Therefore it should be trained and tested carefully.

Validation

I tested different classifiers and different pre-processing steps of the Amazon reviews. Below you can see a comparison between a Bayes Classifier and an SVM. The SVM beats the Bayes Classifier by 8 to 21 percentage points, depending on the pre-processing. However, its performance also depends dramatically on the pre-processing of the raw reviews.

Type of classifier | No pre-processing | Bi-grams     | Only adjectives | No stop words
Bayes Classifier   | 71% accuracy      | 65% accuracy | 70% accuracy    | 72% accuracy
SVM                | 85% accuracy      | 86% accuracy | 78% accuracy    | 84% accuracy
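Two of the pre-processing steps from the table, removing stop words and building bi-grams, are easy to sketch in plain Python. The tiny stop-word list and the function names are assumptions for this example, not the ones used for the tests above. The "only adjectives" variant needs a part-of-speech tagger and is not shown.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "i", "of", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little sentiment."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor, so that 'not good' stays together."""
    return list(zip(tokens, tokens[1:]))
```

Bi-grams keep pairs like ('not', 'good') together, which single words cannot express. Stop-word removal shrinks the feature space.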

Problems with validation

All tests were made with a 10-fold cross-validation. The only problem with those tests is that I trained and tested with reviews from Amazon, but my final data were comments on blog posts, which is not exactly the same domain. Since Hackernews and Reddit are both about computer science I used reviews of SSDs, Microsoft and Apple software or computer games to be as close as possible to my final domain. However, I can’t really validate my final results. This has two reasons:

  • I don’t have a huge number of tagged blog posts and comments to compare with the results of the classifier.
  • Comments are very subjective. In many cases you cannot decide for sure whether a comment is positive or negative. A few comments are very clear and easy (I hate your article.), but a lot of comments are something in between (I love your website but I hate the color.). Even I as a human being cannot decide if they are positive or negative, and if I could decide, my friend would argue with me.
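The 10-fold cross-validation mentioned above splits the labeled reviews into ten parts, trains on nine of them and tests on the remaining one, ten times in total. The following is a minimal sketch of this procedure in plain Python. The function names and the train_fn interface are assumptions for this example, and the shuffling of the data, which you would normally do first, is left out.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k roughly equal, contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, train_fn, k=10):
    """Average accuracy over k folds; train_fn(samples, labels) returns a predict function."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        predict = train_fn([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(samples[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)
```

Every sample is used for testing exactly once, so the averaged accuracy is less dependent on one lucky or unlucky split. It still only measures performance on reviews, not on comments.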

My final result

OpinionTrends has been in its final stages for a couple of days. Next week I will present it at the Media Night of my university (a fair for student projects). You can read more about it here.

OpinionTrends is also online and in a reasonably stable state. However, it is still under development: http://opiniontrends.mi.hdm-stuttgart.de/

This is how it works

OpinionTrends works the same as TechTrends. You go to the website, type in a keyword and get the results. The only really new thing is a brand new tab on the result page.

[Image: the new tab on the result page]

When you click on it you will see a new chart similar to the blue one on the Timeline tab. The chart has three colors and is very easy to read. The green bars represent the positive articles, the red bars represent the negative articles and the light yellow bars represent the neutral articles. The neutral articles are visualized on the positive scale and on the negative scale with the same amount.

[Image: OpinionTrends chart for the search term NSA]

Above you see the result for NSA, which is actually a very good example: the overwhelming opinion about the NSA is very negative, and you can see this perfectly in the chart.

You can click on each bar to see a pop-up showing the articles behind the bar. You can jump directly to the article or open the discussion with all comments on this article.

[Image: pop-up showing the articles behind a bar]

Examples

Here are some good examples to show you how OpinionTrends works. The best one I found so far is a search for NSA. The opinion is very negative, as everyone would expect.

[Image: OpinionTrends chart for the search term NSA]

The opinion on Git is much more balanced. It has not only a nearly equal number of positive and negative articles, it also has a larger number of neutral posts.

[Image: OpinionTrends chart for the search term Git]

The opinion on Python is much better. It has a lot of neutral posts, but besides them, Python has far more positive than negative posts.

[Image: OpinionTrends chart for the search term Python]

More…

OpinionTrends has some more features such as individual settings to adjust each search. However, I think this is too much for this post. You can get a lot more information directly on the project site. TechTrends/OpinionTrends is also open source, so you can check out the source code from BitBucket. OpinionTrends is in its own branch!

I hope you enjoy it and I would be really happy about some feedback.

Best regards,
Thomas

Media Night Winter Semester 2013/2014

[Image: OpinionTrends poster]

During the last summer semester, two friends of mine and I made a student project called TechTrends. TechTrends was a web application that let you search for articles and trends in the field of computer science. Based on posts from Reddit and Hackernews, it provided an intelligent search on a growing number of articles and blogs.

During this winter semester I continued the project and implemented a sentiment analysis for TechTrends. Based on the existing infrastructure such as our database and our crawler, I added an automated categorization of articles according to their comments on Hackernews and Reddit.

You can find the old and stable version of our project under http://techtrends.mi.hdm-stuttgart.de/. The up-to-date development version is available under http://opiniontrends.mi.hdm-stuttgart.de/.

[Image: Media Night winter semester 2013/2014]

I will present the project at the Media Night at our university next week. It’s open to everybody and free. It will start around 6 pm, but you can come whenever you want to, there is no schedule. Every project has its own booth, where it is presented and where you can ask questions and get in touch with the people behind it.

You can find the program and information about all projects on http://www.hdm-stuttgart.de/medianight.

What? – Media Night Winter 2013
When? – 16th January 2014, from 6 pm to 10 pm
Where? – Hochschule der Medien, Nobelstraße 10, 70569 Stuttgart

Best regards,
Thomas