Conditional Visualisation for
Statistical Models
The blog comments data can be found on the UCI
repository. The original prediction task is for the number of
comments on a blog over the next 24 hours, but we will tackle the
simpler task of predicting if there will be any comments over
the next 24 hours, recasting the problem as a binary classification.
The main purpose of this example is to demonstrate the use of
condvis when model predict
methods go slightly
off the beaten track, as happens with the xgboost
package. We will train an ensemble of 5 boosted tree classifiers, in
the form of a cross-validated committee, where we only attempt to
optimise one parameter, the number of boosting iterations -- that is
We can start with the basics and read in the data using data.table.
library(data.table)
train <- fread("data/blogData_train.csv", data.table = FALSE,
stringsAsFactors = TRUE)
colnames(train)[ncol(train)] <- "comments"
This gives us a dataframe with 52,397 observations on 281 variables. The last column is the number of comments, so we name that one for clarity. Next, we'll do a bit of tidying and set up our binary target (or response).
## Remove predictors that have only one unique value.
removevars <- which(vapply(train, function(x) length(
unique(x)), integer(1L)) <= 1)
train <- train[, -removevars]
## Create binary target, and remove 'comments' column.
train$target <- as.integer(train$comments > 0)
train$comments <- NULL
## Define the parameters we will use for the boosted tree
## classifiers.
classifier.params <- list(booster = "gbtree", objective =
"binary:logistic", eta = 0.03, gamma = 0.1, max_depth = 6,
min_child_weight = 1, colsample_bytree = 0.7, eval_metric
= "logloss")
## Formula to create model matrix.
f <- formula(paste0("~ 0+", paste(setdiff(colnames(train),
c("comments", "target")), collapse = "+")))
## Function to train an XGBoost model, stopping when further
## iterations increase the loss on the validation set.
trainXGB <- function(params, formula, data, trainindex,
nrounds = 1500){
x.train <- xgb.DMatrix(model.matrix(formula, data =
data[trainindex, ]), label = target[trainindex])
x.validate <- xgb.DMatrix(model.matrix(formula, data
= data[-trainindex, ]), label = target[-trainindex])
suppressWarnings(
structure(list(model = xgb.train(nrounds = nrounds,
params = params, data = x.train, watchlist = list(
"validate" = x.validate, "train" = x.train),
print.every.n = 20, nthread = 4, early.stop.round =
200), formula = formula), class = "xgbpred")
)
}
library("condvis")
library("xgboost")
train$target <- jitter(train$target, amount = 0.05)
ceplot(train, classifiers, "target", varimp$Feature[1],
varimp$Feature[2:10], xsplotpar = list(xlim = c(0,
50)), threshold = 0.6)