forked from treder/MVPA-Light
train_logreg.m
370 lines (326 loc) · 13.3 KB
function cf = train_logreg(cfg,X,clabel)
% Trains a logistic regression classifier with L2 regularisation.
%
% NOTE: Due to the exponential term in the cost function, it is recommended
% that X (the data) is z-scored to reduce the probability of numerical
% issues due to round-off errors.
%
% Usage:
% cf = train_logreg(cfg,X,clabel)
%
% Parameters:
% X - [samples x features] matrix of training samples
% clabel - [samples x 1] vector of class labels
%
% cfg - struct with hyperparameters:
% .lambda - regularisation hyperparameter controlling the magnitude
% of regularisation. If a single value is given, it is
% used for regularisation. If a vector of values is given,
% 5-fold cross-validation is used to test all the values
% in the vector and the best one is selected
% Note: lambda is reciprocally related to the cost
% parameter C used in LIBSVM/LIBLINEAR, i.e. roughly
% C = 1/lambda
% .prob - if 1, decision values are returned as probabilities. If
% 0, the decision values are simply the distance to the
% hyperplane (default 0)
% .correct_bias - if the number of samples in the two classes is not the
% same, logistic regression is biased towards the majority
% class. If correct_bias is 1, this is corrected for by
% adjusting the bias term
%
% Further parameters (that usually do not need to be changed):
% bias - if >0 augments the data with a bias term equal to the
% value of bias: X <- [X; bias], and augments the weight
% vector with the bias variable w <- [w; b].
% If 0, the bias is assumed to be 0. By default, it is set
% to a large value (bias=100). This prevents b from being
% penalised by the regularisation term.
% k - the number of folds in the k-fold cross-validation for
% the lambda search
% plot - if a lambda search is performed, produces diagnostic
% plots including the regularisation path and
% cross-validated accuracy as a function of lambda
%
% BACKGROUND:
% Logistic regression introduces a non-linearity over the linear regression
% term f(x) = w * x + b by means of the sigmoid function s(x) = 1/(1+e^-x),
% hence: s(f(x)) = 1 / ( 1 + e^-f(x) )
% and fits the sigmoid function to the data. The log-odds is a linear
% function of the features, so logistic regression directly models class
% probabilities. The negative log-likelihood
% including a L2 regularisation term can be arranged as
%
% L(w,lambda) = SUM log(1+exp(-yi*w*xi)) + lambda/2 * ||w||^2
%
% where w is the coefficient vector and lambda is the regularisation
% strength, yi = {-1,+1} are the class labels, and xi the samples. This is
% a convex optimisation problem that can be solved by unconstrained
% minimisation.
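%
% For reference, the gradient and Hessian used by the optimiser follow
% directly from L above (writing s() for the sigmoid and zi = yi*w*xi;
% this matches lr_gradient_and_hessian_tanh below):
%
%   g(w) = 1/N * SUM yi*xi*s(zi)  -  1/N * SUM yi*xi  +  lambda*w
%   H(w) = lambda*I + 1/N * X'*D*X,   where D = diag( s(zi)*(1-s(zi)) )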
%
% IMPLEMENTATION DETAILS:
% A Trust Region Dogleg algorithm (TrustRegionDoglegGN.m) is used to
% optimise w.
% Hyperparameter optimisation is very costly, since the classifier
% has to be trained for each value of the hyperparameter. To reduce the
% number of iterations, warm starts are used wherein the next w is
% initialised (w_init) as follows:
% - in iteration 1, w_init is initialised as the zero vector
% - in iterations 2 and 3, w_init is initialised by the previous w's (w_1
% and w_2)
% - in iterations 4+, if predict_regularisation_path=1, then w_init is
% initialised by a predicted w: for the k-th iteration, a quadratic
% polynomial is fit through w_k-3, w_k-2 and w_k-1 as a function of
% lambda. The polynomial is then evaluated at lambda_k to obtain the
% prediction. Simulations show that this approach substantially reduces
% the number of iterations to convergence and hence computation time
%
% Output:
% cf - struct specifying the classifier with the following fields:
% w - projection vector (normal to the hyperplane)
% b - bias term, setting the threshold
%
% References:
% Lin, Weng & Keerthi (2008). Trust Region Newton Method for Large-Scale
% Logistic Regression. Journal of Machine Learning Research, 9, 627-650
% (c) Matthias Treder 2017
[N, nFeat] = size(X);
X0 = X;
cf = [];
% Make sure labels come as column vector
clabel = double(clabel(:));
% Need class labels 1 and -1 here
clabel(clabel == 2) = -1;
% Stack labels in diagonal matrix for matrix multiplication during
% optimisation
Y = diag(clabel);
% Take vector connecting the class means as initial guess [it does not seem
% to speed up convergence though so we keep w0 = 0 for now]
% w0 = double(mean(X0(clabel==1,:))) - double(mean(X0(clabel==-1,:)));
% w0 = w0' / norm(w0);
% Augment X with bias
if cfg.bias > 0
    X0 = cat(2, X0, ones(N,1) * cfg.bias );
    nFeat = nFeat + 1;
end
I = eye(nFeat);
% Initialise w with zeros
w0 = zeros(nFeat,1);
% logfun = @(w) lr_objective_tanh(w);
% logfun = @(w) lr_objective(w);
% logfun = @(w) lr_gradient(w);
logfun = @(w) lr_gradient_and_hessian_tanh(w);
%% Automatic regularisation
if ischar(cfg.lambda) && strcmp(cfg.lambda,'auto')
    cfg.lambda = logspace(-4,3,10);
end
%% Find best lambda using (nested) cross-validation
if numel(cfg.lambda)>1
    % Perform inner cross-validation by again partitioning the training
    % data into folds. Then, cycle through all the lambda's, calculate
    % the classifier, and validate it on the test set. The lambda giving
    % the best result is then taken forward and a model is trained on the
    % full data using the best lambda.
    CV = cvpartition(N,'KFold',cfg.k);
    ws = zeros(nFeat, numel(cfg.lambda));
    acc = zeros(numel(cfg.lambda),1);
    if cfg.plot
        C = zeros(numel(cfg.lambda));
        iter_tmp = zeros(numel(cfg.lambda),1);
        delta_tmp = zeros(numel(cfg.lambda),1);
        iter = zeros(numel(cfg.lambda),1);
        delta = zeros(numel(cfg.lambda),1);
        wspred = ws;
    end
    if cfg.predict_regularisation_path
        % Create predictor matrix for the polynomial approximation
        % of the regularisation path by taking the lambda's to the powers
        % up to the polyorder.
        polyvec = 0:cfg.polyorder;
        % Use the log of the lambda's to get a better conditioned matrix
        qpred = cell2mat( arrayfun(@(n) log(cfg.lambda(:)).^n, polyvec,'Un',0));
    end
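% To illustrate the warm start below (for polyorder=2): each row of qpred
% is [1 log(lambda) log(lambda)^2], so the least-squares solve
% qpred(ll-3:ll-1,:)\ws(:,ll-3:ll-1)' fits a quadratic in log(lambda)
% through the three preceding solutions, which is then evaluated at
% log(lambda_ll) to predict the next w.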
    % --- Start cross-validation ---
    for ff=1:cfg.k
        % Training data
        X = X0(CV.training(ff),:);
        YX = Y(CV.training(ff),CV.training(ff))*X;
        % Sum of samples needed for the gradient
        sumyxN = sum(YX)'/N;
        % --- Loop through lambdas ---
        for ll=1:numel(cfg.lambda)
            lambda = cfg.lambda(ll);
            % Warm-starting the initial w: wstart
            if ll==1
                wstart = w0;
            elseif cfg.predict_regularisation_path ...
                    && ll>cfg.polyorder+1 % make sure that enough terms have been calculated
                % Fit polynomial to regularisation path
                % and predict next w(lambda_k)
                quad = qpred(ll-cfg.polyorder-1:ll-1,:)\(ws(:,ll-cfg.polyorder-1:ll-1)');
                wstart = ( repmat(log(lambda),[1,numel(polyvec)]).^polyvec * quad)';
            else
                % Use the result obtained in the previous step lambda_k-1
                wstart = ws(:,ll-1);
            end
            if cfg.plot
                wspred(:,ll) = wstart;
                [ws(:,ll),iter_tmp(ll),delta_tmp(ll)] = TrustRegionDoglegGN(logfun, wstart, cfg.tolerance, cfg.max_iter,ll);
            else
                ws(:,ll) = TrustRegionDoglegGN(logfun, wstart, cfg.tolerance, cfg.max_iter,ll);
            end
        end
        if cfg.plot
            delta = delta + delta_tmp;
            iter = iter + iter_tmp;
            C = C + corr(ws);
        end
        cl = clabel(CV.test(ff));
        % Calculate classification accuracy by multiplying decision values
        % with the class label
        acc = acc + sum( (X0(CV.test(ff),:) * ws) .* repmat(cl(:),[1,numel(cfg.lambda)]) > 0)' / CV.TestSize(ff);
    end
    acc = acc / cfg.k;
    [~, best_idx] = max(acc);
    % Diagnostic plots if requested
    if cfg.plot
        figure,
        nCol=3; nRow=1;
        subplot(nRow,nCol,1),imagesc(C/cfg.k); title({'Mean correlation' 'between w''s'}),xlabel('lambda#')
        subplot(nRow,nCol,2),plot(delta/cfg.k),title({'Mean trust region' 'size at termination'}),xlabel('lambda#')
        subplot(nRow,nCol,3),plot(iter/cfg.k),hold all,
        title({'Mean number' 'of iterations (across folds)'}),xlabel('lambda#')
        % Plot regularisation path (for the last training fold)
        figure
        for ii=1:nFeat, semilogx(cfg.lambda,ws(ii,:),'o-','MarkerFaceColor','w'), hold all, end
        plot(xlim,[0,0],'k-'),title('Regularisation path for last training fold'),xlabel('Lambda')
        % Plot cross-validated classification performance
        figure
        semilogx(cfg.lambda,acc)
        title([num2str(cfg.k) '-fold cross-validation performance'])
        hold all
        plot([cfg.lambda(best_idx), cfg.lambda(best_idx)],ylim,'r--'),plot(cfg.lambda(best_idx), acc(best_idx),'ro')
        xlabel('Lambda'),ylabel('Accuracy'),grid on
        % Plot first and last feature dimensions
        figure
        plot(ws(1,:),ws(end,:),'ko-')
        hold all, plot(wspred(1,2:end),wspred(end,2:end),'+')
        title('Regularisation path for 2 features')
        legend({'w' 'predicted w'})
    end
else
    % there is just one lambda: no grid search
    best_idx = 1;
end
lambda = cfg.lambda(best_idx);
%% Train classifier on the full training data (using the best lambda)
YX = Y*X0;
sumyxN = sum(YX)'/N;
X = X0;
w = TrustRegionDoglegGN(logfun, w0, cfg.tolerance, cfg.max_iter, 1);
%% Set up classifier struct
if cfg.bias > 0
    cf.w = w(1:end-1);
    cf.b = w(end);
    % Bias term needs correct scaling
    cf.b = cf.b * cfg.bias;
    if cfg.correct_bias
        % Correct the bias term such that it lies midway between the class means
        o = X(:, 1:end-1) * cf.w;
        cf.b = - ( mean(o(clabel==1)) + mean(o(clabel==-1)) )/2;
    end
else
    cf.w = w;
    cf.b = 0;
end
cf.lambda = lambda;
%%
%%% Logistic regression objective function. Given w, data X and
%%% regularisation parameter lambda, returns
%%% f: function value at point w
%%% g: value of gradient at point w
%%% h: value of the Hessian at point w
%%%
%%% Based on formulas (2)-(4) in:
%%% Lin C Weng R Keerthi S (2007). Trust region Newton methods for
%%% large-scale logistic regression. Proceedings of the 24th international
%%% conference on Machine learning - ICML '07. pp: 561-568
% function [f,g,h] = lr_objective(w)
% % Evaluate exponential for all samples
% sigma = logreg_fix(YX*w); %1./(1+exp(-YX*w));
%
% % Function value (loss)
% f = sum(-log(sigma))/N + lambda * 0.5 * (w'*w);
%
% % Gradient
% if nargout>1
% g = YX'*sigma/N - sumyxN + lambda * w;
% end
%
% % Hessian
% if nargout>2
% h = lambda * I + (X' .* (sigma .* (1 - sigma))') * X/N;
% end
% end
%
% function [f,g,H] = lr_objective_tanh(w)
% % Evaluate exponential for all samples
% sigma = 0.5 + 0.5 * tanh(YX*w/2);
%
% % Function value (loss)
% f = sum(-log(sigma))/N + lambda * 0.5 * (w'*w);
%
% % Gradient
% if nargout>1
% g = YX'*sigma/N - sumyxN + lambda * w;
% end
%
% % Hessian
% if nargout>2
% H = lambda * I + (X'.*(sigma .* (1 - sigma))') * X/N;
% end
% end
% function [g,h] = lr_gradient(w)
% % Logistic gradient and Hessian
%
% sigma = logreg_fix(YX*w);
%
% % Gradient
% g = YX'*sigma/N - sumyxN + lambda * w;
%
% % Hessian of loss function (serves here as gradient)
% if nargout>1
% h = lambda * I + (X .* (sigma .* (1 - sigma)))' * X/N; % faster
% end
% end
function [g,h] = lr_gradient_and_hessian_tanh(w)
    % Logistic gradient and Hessian expressed using the hyperbolic
    % tangent, via the identity 1 / (1 + exp(-x)) = 0.5 + 0.5 * tanh(x/2).
    % tanh is bounded in [-1,1], so this form cannot overflow for large |x|.
    sigma = 0.5 + 0.5 * tanh(YX*w/2);
    % Gradient
    g = (sigma' * YX)'/N - sumyxN + lambda * w;
    % Hessian
    if nargout>1
        h = lambda * I + bsxfun(@times, X, sigma .* (1 - sigma))' * X/N; % faster to first multiply X by sigma(1-sigma)
    end
end
% function xo = log_logreg_fix(x)
% % This is a fix to the LOG logistic loss function found on
% % http://fa.bianp.net/blog/2013/numerical-optimizers-for-logistic-regression/
% % also used in the LIBLINEAR code. It prevents exp from overflow.
% xo = x;
% xo(x>=0) = log(1+exp(-x(x>=0)));
% xo(x<0) = log(1+exp(x(x<0))) - x(x<0);
% end
function xo = logreg_fix(x)
    % This is a fix to the logistic function found on
    % http://fa.bianp.net/blog/2013/numerical-optimizers-for-logistic-regression/
    % and used in the LIBLINEAR code. It prevents exp from overflowing.
    % Logistic function: 1./(1+exp(-x))
    xo = x;
    xo(x>=0) = 1./(1+exp(-x(x>=0)));
    xo(x<0) = exp(x(x<0))./(1+exp(x(x<0)));
end
end