Tuning sklearn Models with hyperopt

by Kehang Han on 2020-02-02 | tags: automl

Hyper-parameter tuning has long been one of the least loved tasks for a data scientist. In the past, people had to fall back on either a grid search or a random search strategy. Recently, there's been a trend towards Bayesian optimization strategies, e.g., in AutoML. I've been wanting to try out the python library hyperopt, and in this post I'll share how I'd use it in two scenarios.

1. Base Scenario

I'd like to build a sklearn classifier that gives the best test performance. To make it more concrete, I have narrowed my classifier type down to KNeighborsClassifier, and I'd like to find the best values of its hyper-parameters under the constraints below:

- n_neighbors: an integer between 3 and 10
- algorithm: either ball_tree or kd_tree
- leaf_size: an integer between 1 and 49
- metric: one of euclidean, manhattan, chebyshev, minkowski

1.1 Prepare Data

For demo purposes, let's choose the well-known iris dataset from sklearn.

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

1.2 Translate Constraints

Now let's translate the constraints on the hyper-parameters into hyperopt's search space.

from hyperopt import hp

space = {'n_neighbors': hp.choice('n_neighbors', range(3, 11)),
         'algorithm': hp.choice('algorithm', ['ball_tree', 'kd_tree']),
         'leaf_size': hp.choice('leaf_size', range(1, 50)),
         'metric': hp.choice('metric', ['euclidean', 'manhattan', 'chebyshev', 'minkowski'])
        }
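
As a quick sanity check, hyperopt can draw a random sample from this space via its pyll sampler. This is optional, but it's a handy way to confirm the space is wired up correctly:

from hyperopt.pyll.stochastic import sample

## draw one random hyper-parameter combination from the space
## (output will vary from run to run)
print(sample(space))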

1.3 Define Objective

Our goal is to find the best hyper-parameters that achieve the lowest test error, which can be translated into the following code.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error

def objective_func(space_sample):

    ## parse the hyper-parameter sample
    n_neighbors = space_sample['n_neighbors']
    algorithm = space_sample['algorithm']
    leaf_size = space_sample['leaf_size']
    metric = space_sample['metric']

    ## build the classifier based on the hyper-parameters
    clf = KNeighborsClassifier(n_neighbors=n_neighbors,
                               algorithm=algorithm,
                               leaf_size=leaf_size,
                               metric=metric,
                               )

    ## train the classifier
    clf.fit(x_train, y_train)

    ## evaluate test performance
    y_pred_test = clf.predict(x_test)
    loss = mean_squared_error(y_test, y_pred_test)
    return loss
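
Before handing objective_func to hyperopt, it can be handy to smoke-test it once by hand. The sample below is an arbitrary pick from the search space, purely for illustration:

## a hand-picked sample, only to verify the objective runs end to end
test_sample = {'n_neighbors': 5,
               'algorithm': 'ball_tree',
               'leaf_size': 30,
               'metric': 'euclidean'}
print(objective_func(test_sample))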

1.4 Start Tuning

Now that we have the data, constraints and objective ready, it's time to start tuning for the best parameters.

from hyperopt import fmin, tpe

best_classifier = fmin(objective_func, space, algo=tpe.suggest, max_evals=100)
print(best_classifier)

After 100 evaluations in less than 10 seconds, hyperopt comes back with a pretty good set of hyper-parameters. In my experiment, it ends up with {n_neighbors: 6, algorithm: 1, leaf_size: 13, metric: 0}. Note these are indices into the hp.choice lists, equivalent to {n_neighbors: 9, algorithm: kd_tree, leaf_size: 14, metric: euclidean}.
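
Rather than decoding the indices by hand, hyperopt ships a helper called space_eval that maps the index-based result back onto the actual values:

from hyperopt import space_eval

## convert the index-based result into actual hyper-parameter values
print(space_eval(space, best_classifier))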

2. Distributed Scenario

All the previous steps are done on a single machine. When each evaluation is computationally expensive and many evaluations are required, such sequential tuning doesn't scale well. The great thing about hyperopt is that it allows tuning in a distributed fashion.

The additional component we need is a work distributor/broker, for which hyperopt uses MongoDB. The idea is that the main program (which executes fmin) spawns training jobs (one job per set of hyper-parameters) and registers them in MongoDB. On the other side, an array of workers (called hyperopt-mongo-worker) can be launched; each connects to MongoDB to fetch and execute training jobs.

2.1 Setup MongoDB

You only need to install MongoDB and launch it. Guidelines on installing MongoDB can be found in the official MongoDB documentation.
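
Once installed, launching a local instance is a one-liner. A minimal sketch (the data directory name is just an example):

## create a data directory and start MongoDB
## on its default port 27017
mkdir -p mongo_data
mongod --dbpath mongo_data --port 27017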

2.2 Start Tuning

All the data, constraints and objective are the same as in the base scenario. The only modification needed is to pass an object called trials to the call of fmin. The trials object allows the main program to register jobs in a designated database and collection in MongoDB (here, the iris database and its jobs collection).

from hyperopt.mongoexp import MongoTrials

trials = MongoTrials('mongo://localhost:27017/iris/jobs', exp_key='exp1')
best_classifier = fmin(objective_func, space, trials=trials, algo=tpe.suggest, max_evals=100)
print(best_classifier)

2.3 Launch Worker

Since the training jobs are only registered at this point, we need a bunch of workers to do the real work. Note that hyperopt serializes objective_func and ships it to the workers, so each worker needs an environment where it can be deserialized: the same library versions, and any modules the function references importable. Launching a worker is also made easy.

## create a working directory 
## for the worker
mkdir worker
cd worker

## ideally use the same python environment
## as the main program 
## use a conda environment as an example (a pip environment is also fine)
conda activate tuning_env
hyperopt-mongo-worker --mongo=localhost:27017/iris --poll-interval=0.1
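
To actually parallelize the search, launch several workers. A small sketch (the worker count of 4 is an arbitrary example):

## launch several workers in the background;
## each polls MongoDB for pending jobs
for i in 1 2 3 4; do
    hyperopt-mongo-worker --mongo=localhost:27017/iris --poll-interval=0.1 &
done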

At the end, you should get the same best hyper-parameters in a shorter period of time.

Concluding Words

hyperopt has made machine learning model tuning easy and efficient. For a data scientist who likes to experiment, remember the four steps: prepare data, translate constraints, define objective, and start tuning.