Monday, March 21, 2011

Large scale matrix factorization - Yahoo! KDD Cup

Recently I learned from Sean Owen about this year's Yahoo! KDD Cup.

This is a great opportunity to get large-scale data for testing matrix factorization algorithms. (Specifically, I am using GraphLab's matrix factorization library.)

In track1, the task is to predict user music ratings. There are about 1M users, 600K music titles, and about 252M non-zero ratings in the training data. (For comparison, Netflix had about 480K users, 18K movies, and about 100M non-zero ratings.) Unlike Netflix, the ratings range from 0 to 100 (and not 1-5). In a matrix factorization context, one needs to think about how to represent a rating of zero vs. a non-existing rating.
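One simple workaround (my own sketch, not something from the contest or GraphLab; all helper names below are hypothetical) is to shift ratings by +1 before storing them, so an explicit rating of 0 is stored as a real non-zero entry and can never be confused with an absent one:

```python
# Sketch: distinguish an explicit 0 rating from a missing entry by
# shifting stored ratings into the range 1..101.
def shift_rating(r):
    """Map a 0-100 rating to 1-101 so a stored 0 can't mean 'absent'."""
    assert 0 <= r <= 100
    return r + 1

def unshift_rating(s):
    """Map a stored 1-101 value back to the original 0-100 rating."""
    return s - 1

ratings = {}                          # (user, item) -> shifted rating
ratings[(7, 42)] = shift_rating(0)    # an explicit zero rating

def lookup(user, item):
    s = ratings.get((user, item), 0)  # 0 now unambiguously means 'missing'
    return None if s == 0 else unshift_rating(s)

print(lookup(7, 42))   # explicit zero rating -> 0
print(lookup(7, 43))   # never rated -> None
```

The same trick works with any sparse matrix format whose implicit default value is zero: shift on write, unshift on read.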

For a start, I tested GraphLab's matrix factorization code on the track1 dataset. On an 8-core machine (2x4-core Intel Xeon 2.67 GHz), each alternating least squares iteration takes around 150 seconds. After 10 iterations I get a training RMSE of 21.7 and a validation RMSE of 22.5.
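For readers unfamiliar with alternating least squares: each iteration fixes one factor matrix and solves a ridge regression for the other, then swaps. A minimal dense sketch of the idea (toy-sized and fully observed, unlike the real 252M-edge sparse problem GraphLab handles; `lam` plays the role of the `--lambda` regularizer and `D` the latent dimension):

```python
import numpy as np

# Toy ALS: factor a rating matrix R into U @ V.T with ridge regularization.
rng = np.random.default_rng(0)
n_users, n_items, D, lam = 20, 15, 5, 2.0
true_U = rng.normal(size=(n_users, D))
true_V = rng.normal(size=(n_items, D))
R = true_U @ true_V.T                  # fully observed toy ratings

U = rng.normal(size=(n_users, D))
V = rng.normal(size=(n_items, D))
I = np.eye(D)
for it in range(10):
    # Fix V: each user row solves (V'V + lam*I) u = V' r_u, and symmetrically for V.
    U = np.linalg.solve(V.T @ V + lam * I, V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * I, U.T @ R).T
    rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(round(rmse, 4))
```

Each half-step is a closed-form least-squares solve, which is why ALS parallelizes so well: every user (and every item) can be solved independently on a separate core.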

Following readers' requests, I have posted more detailed instructions on how to run GraphLab on the KDD Cup data here.

Here is the output of the GraphLab program:

<40|0>bickson@bigbro6:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf kddcup 0 --ncpus=8 --float=true --zero=true --scheduler="round_robin(max_iterations=10)" --lambda=2 --D=25
Setting run mode ALS_MATRIX
INFO   :pmf.cpp(main:1236): PMF starting

loading data file kddcup
Loading kddcup TRAINING
Matrix size is: 1000990 624961 1
Creating 252800275 edges...
....................loading data file kddcupe
Loading kddcupe VALIDATION
Matrix size is: 1000990 624961 1
Creating 4003960 edges...
.....................loading data file kddcupt
Loading kddcupt TEST
Matrix size is: 1000990 624961 6649
Creating 6005940 edges...
...............................setting regularization weight to 2
PTF_ALS for matrix (1000990, 624961, 6649):252800275.  D=25
pU=2, pV=2, pT=1, muT=1, D=25
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=25, beta=1, W=1, WT=1 BURN_IN=10
complete. Obj=4.85306e+11, TRAIN RMSE=61.9632 TEST RMSE=75.7155.
INFO   :asynchronous_engine.hpp(run:79): Worker 0 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 1 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 2 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 3 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 4 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 5 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 6 started.
INFO   :asynchronous_engine.hpp(run:79): Worker 7 started.
Entering last iter with 1
161.488) Iter ALS 1  Obj=4.63853e+11, TRAIN RMSE=60.5707 TEST RMSE=26.6540.
Entering last iter with 2
301.86) Iter ALS 2  Obj=9.7782e+10, TRAIN RMSE=27.7971 TEST RMSE=26.0930.
Entering last iter with 3
445.925) Iter ALS 3  Obj=8.91077e+10, TRAIN RMSE=26.5305 TEST RMSE=24.1940.
Entering last iter with 4
585.759) Iter ALS 4  Obj=7.09672e+10, TRAIN RMSE=23.6696 TEST RMSE=23.0499.
Entering last iter with 5
727.565) Iter ALS 5  Obj=6.42765e+10, TRAIN RMSE=22.5229 TEST RMSE=22.6883.
Entering last iter with 6
870.943) Iter ALS 6  Obj=6.17103e+10, TRAIN RMSE=22.0672 TEST RMSE=22.5748.
Entering last iter with 7
1011.14) Iter ALS 7  Obj=6.06398e+10, TRAIN RMSE=21.8743 TEST RMSE=22.5319.
Entering last iter with 8
1153.29) Iter ALS 8  Obj=6.01234e+10, TRAIN RMSE=21.7807 TEST RMSE=22.5117.
Entering last iter with 9
1292.27) Iter ALS 9  Obj=5.98422e+10, TRAIN RMSE=21.7295 TEST RMSE=22.5004.
INFO   :asynchronous_engine.hpp(run:89): Worker 5 finished.
Entering last iter with 10
INFO   :asynchronous_engine.hpp(run:89): Worker 0 finished.
INFO   :asynchronous_engine.hpp(run:89): Worker 1 finished.
INFO   :asynchronous_engine.hpp(run:89): Worker 4 finished.
INFO   :asynchronous_engine.hpp(run:89): Worker 7 finished.
INFO   :asynchronous_engine.hpp(run:89): Worker 2 finished.
INFO   :asynchronous_engine.hpp(run:89): Worker 6 finished.
1430.98) Iter ALS 10  Obj=5.96712e+10, TRAIN RMSE=21.6983 TEST RMSE=22.4935.
INFO   :asynchronous_engine.hpp(run:89): Worker 3 finished.
Final result. Obj=5.96743e+10, TRAIN RMSE= 21.6989 TEST RMSE= 22.4935.
Exporting KDD cup test graph: kddcupt.kdd.out
**Completed successfully (mean prediction: 64.919045)**
Finished in 1437.142338 
Counters are: 0) EDGE_TRAVERSAL, 6038.63
Counters are: 1) BPTF_SAMPLE_STEP, 0
Counters are: 2) CALC_RMSE_Q, 0.044609
Counters are: 3) ALS_LEAST_SQUARES, 3258.77
Counters are: 4) NA, 0
Counters are: 5) BPTF_TIME_EDGES, 0
Counters are: 6) BPTF_LEAST_SQUARES, 0
Counters are: 7) CALC_OBJ, 3.51506
Counters are: 8) NA, 0
Counters are: 9) BPTF_MVN_RNDEX, 0
Counters are: 10) BPTF_LEAST_SQUARES2, 0

 === REPORT FOR core ===
[Numeric]
ncpus:          8
[Other]
affinities:     false
compile_flags:  -O3 -Wall -g -mfpmath=sse -msse2 -funroll-loops -fprefetch-loop-arrays -fopenmp
engine: async
scheduler:      round_robin
schedyield:     true
scope:  edge

 === REPORT FOR engine ===
[Numeric]
num_edges:              2.528e+08
num_syncs:              0
num_vertices:           1.62595e+06
updatecount:            1.62595e+07     (count: 8, min: 1.97158e+06, max: 2.11913e+06, avg: 2.03244e+06)
[Timings]
runtime:                1412.8 s
[Other]
termination_reason:     task depletion (natural)
[Numeric]
updatecount_vector:             1.62595e+07     (count: 8, min: 1.97158e+06, max: 2.11913e+06, avg: 2.03244e+06)
updatecount_vector.values:              1.99981e+06,2.01133e+06,1.97158e+06,2.08794e+06,2.00821e+06,2.11913e+06,2.02736e+06,2.03415e+06,


I wrote a simple Matlab script to parse the text input files into a sparse Matlab matrix. Here it is:
%Written by Danny Bickson, CMU, March 2011
%This script reads KDD cup track1 format and saves
%it into a sparse Matlab matrix
nUsers=1000990;
nItems=624961;
nRatings=262810175;
nTrainRatings=252800275;
nProbeRatings=4003960;
nTestRatings=6005940;


runmode=1; % 1 for training, 2 for probe, 3 for test

switch runmode
    case 1
        filename='/path/to/trainIdx1.txt';
        ratings=nTrainRatings;
        outfile='/path/to/train1.mat';
    case 2
        filename='/path/to/validationIdx1.txt';
        ratings=nProbeRatings;
        outfile='/path/to/probe1.mat';
    case 3
        filename='/path/to/testIdx1.txt';
        ratings=nTestRatings;
        outfile='/path/to/test1.mat';
end
rows=zeros(ratings,1);
cols=rows;
vals=rows;
try

ff=fopen(filename,'r');
cnt=1;
for j=1:nUsers
  % read user id and number of ratings
  [a,num]=fscanf(ff,'%d|%d',2);
  assert(num==2);

  user=a(1);  
  numofratings=a(2);
  if (mod(j,1000)==0)
      disp(['user: ', num2str(user),' ratings: ', num2str(numofratings)]);
  end
  if (runmode==3)
      assert(numofratings==6); % 6 ratings per each user in test file
  elseif (runmode == 2)
      assert(numofratings==4); % 4 ratings per user in probe file
  end


  for i=1:numofratings % for each ratings
    b=-100;
    if (runmode<=2)
        [b,num]=fscanf(ff,'%d %d %d %d:%d:%d',6);
        assert(num==6);
    else
        [b,num]=fscanf(ff,'%d %d %d:%d:%d',5);
        assert(num==5);
    end
    rows(cnt)=user+1;  % user
    assert(rows(cnt)>=1 && rows(cnt)<=nUsers);
    cols(cnt)=b(1)+1; % item
    assert(cols(cnt)>=1 && cols(cnt)<=nItems);

    if (runmode<=2)
        vals(cnt)=b(2); % rating
        assert(vals(cnt)>=0 && vals(cnt)<=100);
    else
        vals(cnt)=1; % placeholder for missing rating
    end
    cnt=cnt+1;
  end
end

assert(cnt==ratings+1);

catch ex
    disp(['exception on ', num2str(i), ' ', num2str(j)]);
    ex
end

% close the input file, then construct and save a sparse Matlab matrix
% (pass explicit dimensions so the matrix is full size even if the
% last rows/columns carry no ratings)
fclose(ff);
A=sparse(double(rows), double(cols), double(vals), nUsers, nItems);
save(outfile,'-v7.3','A');
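For readers without Matlab, here is an equivalent parsing sketch in Python. The field layout is inferred from the fscanf formats above: a header line `user|num_ratings`, followed by rating lines carrying item, rating, date, and hh:mm:ss (the test file omits the rating). The sample lines below are purely illustrative, not real KDD Cup records:

```python
# Sketch: parse KDD Cup track1-style text into (user, item, rating) triples.
def parse_track1(lines, has_rating=True):
    """Yield (user, item, rating) triples; rating is None for test files."""
    it = iter(lines)
    for header in it:
        user, n = (int(x) for x in header.split('|'))
        for _ in range(n):
            fields = next(it).split()
            item = int(fields[0])
            rating = int(fields[1]) if has_rating else None
            yield user, item, rating

# Illustrative (made-up) sample in the train/validation layout:
sample = ["0|2",
          "199808\t100\t4877\t12:30:00",
          "30311\t80\t4878\t09:15:00"]
print(list(parse_track1(sample)))
# -> [(0, 199808, 100), (0, 30311, 80)]
```

From these triples one can fill a `scipy.sparse` matrix the same way the Matlab script fills `A`.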

6 comments:

  1. I wanted to comment in your kdd-cup group post but gave me an error :(
    - thanks for the script ;) I'm gonna try it as soon as download the data at home, can I ask u if you're using matlab only to do the whole game or are u using also mexfiles with c++ , openMP ... ?

  2. Hi!
    Actually I am using Graphlab, and I am getting an RMSE of 22.3 for the validation data using alternating least squares. Write me if you like to get directions on how to setup GraphLab to run on KDDCUP.

    Best,

    DB

  3. Hi,

    Did you submit your prediction result for the Test data of KDDCup 2011? Is the Test result much different from Validation result?

    Bests,
    Thanh

  4. Yes, the test is about +2 points higher in RMSE.
    For alternating least squares I got 23.97. Now I have implemented SVD++ and I will try it soon. I heard from people that SVD++ got RMSE 23.5.

  5. connat 23.9727
    Idealists 23.9768
    TEST 23.9768

    There are 3 contestants with this score. Which one is you? Or is it BiXolVer? I'm also curious about SVD++ score.

  6. Hi,
    Yes, it is TEST.
    I tried SVD++, with number of features = 20, without any tuning for different params, for 15 iterations, on 8 cores machine and got 24.6
    but I am pretty sure this can be improved when tuning learning rate, step size, etc.

    One of the groups combined my ALS score with their SVM and get -0.2 RMSE (total 23.2)
