Wednesday, December 19, 2012

3rd Generation Collaborative Filtering - Sparse Case [or] Can Matrix Factoriztion be used for Classification?

Following my gensgd implementation which supports dense feature matrices, I was asked by my mega collaborator Justin Yan to implement a sparse version that supports libsvm format.
sparse_gensgd is exactly the same algorithm (high dimensional matrix factorization), but for the sparse case. Perhaps a bit surprising, I will show below the sparse_gensgd algorithm can be used for classification with a very nice performance. As a case study I will discuss KDD CUP 2010.

Case study: ACM KDD CUP 2010

In this case study I will show you how you can get state-of-the-art performance from GraphChi CF toolkit for solving a recent KDD CUP 2010 task. Here is a text from the contest website describing the task:
This year's challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting. 
I have used libsvm data respository which converted the task into a binary classification. The problem is moderate in size: around 20M samples for training, 750K samples for testing, with 29M sparse features.

The winning team was NTU, and here is their winning paper. Here is a graph depicting their single algorithm improvement:
As you can see, prediction RMSE around 0.2815 is the best result obtained by a single model.

The data is sparse in the sense that any number of features can appear in one sample. For example, here are the first 3 lines of the data in libsvm format:

1 2:1 103:1 104:1 105:1 106:0.301 107:0.301 32913:1 32917:1 2990385:1 2990386:1 2990387:1 2990388:1 2990389:0.301 2990390:1 2990391:1 2990392:1 2990393:1 2990394:1 2990395:1 2990396:1 2990397:1 2990398:1 2990399:1 2990400:1 2990401:1
0 2:1 92:1 115:1 116:1 117:1 32913:1 32917:1 2990387:1 2990388:1 2990389:0.477 2990390:1 2990391:1 2990393:1 2990394:1 2990396:1 2990398:1 2990399:1 2990402:1 2990403:1 2990404:1 2990405:1 2990406:1
0 2:1 100:1 143:1 144:1 145:1 12235:1 32913:1 32917:1 2990387:1 2990388:1 2990389:0.477 2990390:1 2990391:1 2990393:1 2990394:1 2990396:1 2990398:1 2990399:1 2990407:1 2990408:1 2990409:1 2990410:1 2990411:1

The target is either 1 or 0, this is the parameter we would like to predict as part of the matrix factorization procedure. The rest of the features are integer ids, where most of them are binary (1) but some of them are doubles (0.477). Given the validation dataset where only the features are given we would like to predict the target - either 0 or 1.


Now we run sparse_gensgd and we get:

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/sparse_gensgd --training=kddb --cutoff=0.5 --calc_error=1 --quiet=1 --gensgd_mult_dec=0.99999 --max_iter=100 --validation=kddb.t --gensgd_rate3=1e-4 --D=20 --gensgd_regw=1e-4 --gensgd_regv=1e-4 --gensgd_rate1=1e-4 --gensgd_rate2=1e-4 --gensgd_reg0=1e-3
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
   253.313) Iteration:   0 Training RMSE:   0.316999 Train err:   0.134567  Validation RMSE:   0.301573 Validation Err:    0.11255
   302.651) Iteration:   1 Training RMSE:   0.311927 Train err:   0.131657  Validation RMSE:   0.299295 Validation Err:   0.111737
   350.719) Iteration:   2 Training RMSE:   0.309526 Train err:   0.130117  Validation RMSE:   0.298312 Validation Err:   0.111191
   399.433) Iteration:   3 Training RMSE:   0.307839 Train err:   0.128916  Validation RMSE:   0.297752 Validation Err:   0.111072
...

   1598.31) Iteration:  27 Training RMSE:   0.293562 Train err:   0.117877  Validation RMSE:   0.288989 Validation Err:   0.109334
   1647.51) Iteration:  28 Training RMSE:   0.293217 Train err:   0.117608  Validation RMSE:   0.288871 Validation Err:   0.109252
   1696.24) Iteration:  29 Training RMSE:   0.292884 Train err:   0.117357  Validation RMSE:   0.288908 Validation Err:   0.109398
   1745.18) Iteration:  30 Training RMSE:   0.292555 Train err:   0.117106  Validation RMSE:   0.289125 Validation Err:   0.109435


As you can see, we got a nice validation RMSE of 0.289 while the best customized solution for this task get an RMSE of 0.281. So for a general purpose solver the obtained performance is quite nice.

Some explanation about the run time parameters:
--training - training input file name
--validation - validation input file name
--D - width of feature vector
--calc_error - calculates the classification error.
--cutoff - threshold value used for binary classification. Since the data contains 0/1 labels I set the cutoff threshold to be 0.5
--max_iter - number of iterations
--gensgd_rate1/2/3 - learning rates (gradient step sizes)
--gensgd_regw/v/0 - regularization rates
--quiet - runs in less verbose mode

Instructions:
1) Install GraphChi as instructed here (steps 1-3).
2) Download the datasets kddb (training) and kddb.t (validation) and put them in the root GraphChi folder. (Tip: use bunzip2 to open those files).
3) Create a file named kddb\:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 19264097
4) Create a file named kddb.t\:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 748400
5) Run as instructed.






2 comments: