Chapter 22 Split data into train and test subsets
This chapter shows several simple ways to split data into train and test subsets, so that a model can be fitted on one subset and evaluated on the other.
We want to use 80% of the initial data to train the model and keep the remaining 20% for testing.
Data: datasets::iris.
- The first approach is to create a vector of randomly selected row indices and use it to split the data.
inTrain = sample(nrow(iris), nrow(iris)*0.8)
# split data
train = iris[inTrain, ]
test = iris[-inTrain, ]
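Because sample() draws different rows on every run, the split above changes each time the code is executed. A minimal sketch of a repeatable version follows, assuming an arbitrary seed value of 42 and a helper variable n_train (both are illustrations, not part of the original example):

```r
# Fix the random number generator so the same rows are drawn on every run.
# The seed value 42 is arbitrary (an assumption for illustration).
set.seed(42)

n_train <- floor(0.8 * nrow(iris))        # 120 of the 150 rows
inTrain <- sample(nrow(iris), n_train)    # row indices, drawn without replacement

train <- iris[inTrain, ]
test  <- iris[-inTrain, ]

# Sanity checks: the sizes add up and no row appears in both subsets
dim(train)
dim(test)
length(intersect(rownames(train), rownames(test)))
```

Setting the seed once at the top of a script is usually enough; the same idea applies to the other approaches in this chapter.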
- The same idea as before, this time using the caret package. The advantage is that the createDataPartition function can split the data many times, and the resulting subsets can be used to estimate the parameters of the model.
library(caret)
trainIndex <- createDataPartition(iris$Species, p=.8,
                                  list = FALSE, # if FALSE - create a vector/matrix, if TRUE - create a list
                                  times = 1) # how many subsets
# split data
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]
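To illustrate the "many times" option, here is a small sketch that asks createDataPartition for three splits and loops over them. The seed, the choice of three splits, and the names splits and test_sizes are assumptions made for this illustration:

```r
library(caret)

set.seed(1)  # arbitrary seed, used only to make the sketch repeatable

# With list = FALSE and times = 3 the result is a matrix:
# one column of training row indices per split
splits <- createDataPartition(iris$Species, p = 0.8, list = FALSE, times = 3)
dim(splits)

# Each column defines a separate train/test pair
test_sizes <- apply(splits, 2, function(idx) {
  train_i <- iris[idx, ]    # rows used for fitting in this split
  test_i  <- iris[-idx, ]   # held-out rows in this split
  nrow(test_i)
})
test_sizes
```

In a real workflow the body of the function would fit a model on train_i and score it on test_i; here it only returns the size of the held-out set. Note also that, for a factor outcome such as iris$Species, createDataPartition samples within each class, so the class proportions are kept in both subsets.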
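- Another approach is to create a logical vector of randomly distributed TRUE/FALSE values and use it to subset the data. Because every row is assigned independently with probability 0.8, the train subset contains roughly, not exactly, 80% of the rows.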
inTrain = sample(c(TRUE, FALSE), nrow(iris), replace = TRUE, prob = c(0.8,0.2))
# select data
train = iris[inTrain, ]
test = iris[!inTrain, ]
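Drawing each flag independently gives a train share that only approximates 0.8. If an exact 80/20 split is needed with a logical vector, one possible variant (a sketch, not the original author's method) is to shuffle a fixed pool of TRUE and FALSE values. The names approx_mask, exact_mask and n_train and the seed are assumptions for this illustration:

```r
set.seed(7)  # arbitrary seed for repeatability

n <- nrow(iris)

# Independent draws: each row lands in train with probability 0.8,
# so the realised share varies around 0.8
approx_mask <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2))
mean(approx_mask)

# Shuffling a fixed pool of TRUE/FALSE values gives an exact split
n_train <- floor(0.8 * n)
exact_mask <- sample(rep(c(TRUE, FALSE), times = c(n_train, n - n_train)))
sum(exact_mask)   # always equal to n_train

train <- iris[exact_mask, ]
test  <- iris[!exact_mask, ]
```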
- Using the caTools package.
library(caTools)
# sample.split expects a vector of labels (one per row), not a data frame
inTrain = sample.split(iris$Species, SplitRatio = .8)
train = subset(iris, inTrain == TRUE)
test = subset(iris, inTrain == FALSE)
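sample.split works on a vector of labels, so it matters whether it receives a column or the whole data frame. The sketch below shows the difference in length and builds the split from iris$Species; the seed is an arbitrary assumption:

```r
library(caTools)

set.seed(3)  # arbitrary seed for repeatability

# A data frame is a list of columns, so its length is the number of columns
length(iris)           # 5
length(iris$Species)   # 150

# Passing the label vector yields one TRUE/FALSE flag per row
inTrain <- sample.split(iris$Species, SplitRatio = 0.8)
length(inTrain)

train <- iris[inTrain, ]
test  <- iris[!inTrain, ]

# Class counts in the train subset
table(train$Species)
```

A flag vector with one entry per column would be recycled across the rows when used for subsetting, which is why the label column is passed here.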
- Using the dplyr package.
library(dplyr)
iris$id <- 1:nrow(iris)
train <- iris %>% dplyr::sample_frac(.8)
test <- dplyr::anti_join(iris, train, by = 'id')
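The id column is what lets anti_join tell the rows apart: iris contains a fully duplicated row, so matching on the measured values alone could remove more rows than intended. A hedged variant of the same split is sketched below, using slice_sample (the newer dplyr verb that supersedes sample_frac) and working on a copy so that iris itself stays unchanged. The name iris_id and the seed are assumptions for this illustration:

```r
library(dplyr)

set.seed(11)  # arbitrary seed for repeatability

# TRUE: at least one row of iris is an exact copy of another row
anyDuplicated(iris) > 0

# Work on a copy with a row identifier (iris_id is a hypothetical name)
iris_id <- mutate(iris, id = row_number())

train <- slice_sample(iris_id, prop = 0.8)      # random 80% of the rows
test  <- anti_join(iris_id, train, by = "id")   # everything not in train

nrow(train)
nrow(test)
```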