condvis

Example: Blog comments data

The blog comments data can be found on the UCI repository. The original prediction task is for the number of comments on a blog over the next 24 hours, but we will tackle the simpler task of predicting if there will be any comments over the next 24 hours, recasting the problem as a binary classification. The main purpose of this example is to demonstrate the use of condvis when model predict methods go slightly off the beaten track, as happens with the xgboost package. We will train an ensemble of 5 boosted tree classifiers, in the form of a cross-validated committee, where we only attempt to optimise one parameter, the number of boosting iterations -- that is

randomly split the data into 5 folds
train 5 models, holding out one of the folds each time for validation
choose the number of boosting iterations for each model which minimises the validation loss

Read in the data

We can start with the basics and read in the data using data.table.

library(data.table)
train <- fread("data/blogData_train.csv", data.table = FALSE,
  stringsAsFactors = TRUE)
colnames(train)[ncol(train)] <- "comments"

This gives us a dataframe with 52,397 observations on 281 variables. The last column is the number of comments, so we name that one for clarity. Next, we'll do a bit of tidying and set up our binary target (or response).

## Remove predictors that have only one unique value.
removevars <- which(vapply(train, function(x) length(
  unique(x)), integer(1L)) <= 1)
train <- train[, -removevars]

## Create binary target, and remove 'comments' column.
train$target <- as.integer(train$comments > 0)
train$comments <- NULL

Train the ensemble

## Define the parameters we will use for the boosted tree
## classifiers.
classifier.params <- list(booster = "gbtree", objective =
  "binary:logistic", eta = 0.03, gamma = 0.1, max_depth = 6,
  min_child_weight = 1, colsample_bytree = 0.7, eval_metric
  = "logloss")

## Formula to create model matrix.
f <- formula(paste0("~ 0+", paste(setdiff(colnames(train),
  c("comments", "target")), collapse = "+")))

## Function to train an XGBoost model, stopping when further
## iterations increase the loss on the validation set.
trainXGB <- function(params, formula, data, trainindex,
  nrounds = 1500){
  x.train <- xgb.DMatrix(model.matrix(formula, data =
    data[trainindex, ]), label = target[trainindex])
  x.validate <- xgb.DMatrix(model.matrix(formula, data
    = data[-trainindex, ]), label = target[-trainindex])
  suppressWarnings(
    structure(list(model = xgb.train(nrounds = nrounds,
      params = params, data = x.train, watchlist = list(
      "validate" = x.validate, "train" = x.train),
      print.every.n = 20, nthread = 4, early.stop.round =
      200), formula = formula), class = "xgbpred")
  )
}

Visualise the ensemble

library("condvis")
library("xgboost")

train$target <- jitter(train$target, amount = 0.05)

ceplot(train, classifiers, "target", varimp$Feature[1],
  varimp$Feature[2:10], xsplotpar = list(xlim = c(0,
  50)), threshold = 0.6)