Train Test Split

Description

The data is split randomly into train data and test data. Common train-to-test ratios are 70:30 and 80:20.

Why to use

To evaluate the accuracy of the model on data that it has not seen during training.

When to use

The dataset contains a large number of rows.

When not to use

Limited data is available.

Prerequisites


Input

Any dataset that contains any form of data – Textual, Categorical, Date, Numerical data.

Output

Dataset split into two parts – Train data and Test data.

Statistical Methods used

  • Confusion Matrix

  • F Score

  • Adjusted R Square

  • R Square

  • Root Mean Square Error

Limitations

If the data is limited, then there is a possibility of high bias.

Train Test Split is located under Model Studio, under Sampling in Data Preparation, in the left task pane. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Train-Test Split.

The train-test split is a technique for evaluating the accuracy of a model. It is suited to large datasets and is appropriate when a good, quick estimate of model performance is required.

In this technique, the input dataset is divided into two datasets, train and test. The train dataset, for which the expected output is known, is used to fit (train) the model. The test dataset is held back from training. The trained model makes predictions on it, and those predictions are compared with the known outputs to evaluate how the model performs on new data.

The train-test split is used when sufficiently large data is available. The train and test sets should each be representative of the problem, with enough records to cover both common and uncommon cases. If the dataset is too small, the model may overfit or underfit.
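
The workflow described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the implementation used by the Train Test Split task. The function names, the toy threshold "model", and the 150-row dataset are all assumptions made for the example.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=None):
    """Randomly split records into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)   # private generator; global random state is untouched
    shuffled = list(records)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Toy model: place a cut-off halfway between the two classes seen in training."""
    lows = [x for x, label in train if label == 0]
    highs = [x for x, label in train if label == 1]
    return (max(lows) + min(highs)) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose predicted label (x >= threshold) matches the known label."""
    correct = sum(int(x >= threshold) == label for x, label in rows)
    return correct / len(rows)


# Hypothetical dataset: 150 (feature, label) pairs; label is 1 when feature >= 75.
records = [(x, int(x >= 75)) for x in range(150)]

train, test = split_train_test(records, test_fraction=0.2, seed=42)
threshold = fit_threshold(train)   # the model sees only the train data

print("train rows:", len(train), "test rows:", len(test))
print("train accuracy:", accuracy(threshold, train))
print("test accuracy:", accuracy(threshold, test))
```

The key point is that the model is fitted on the train rows only, so the accuracy measured on the test rows reflects how it behaves on data it has not seen.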

Properties of Train-Test Split

The available properties of Train Test Split are as shown in the figure given below.

The table given below describes the fields available in the properties of Train-Test Split.

| Field | Description | Remark |
|---|---|---|
| Task Name | It is the name of the task selected on the workbook canvas. | You can click the text field to edit or modify the name of the task as required. |
| Test Percentage | It is the proportion of the input data assigned to test data. The remaining data is train data. | The default value is 0.2. It indicates that the dataset is split into 20% test data and the remaining 80% as train data. |
| Random Seed | It is the value used to seed the random split. Using the same seed ensures that the data is split in the same way every time the task is rerun. | |
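
The effect of Test Percentage and Random Seed can be illustrated with a small standard-library sketch. The function `split_indices` and its parameter names are hypothetical and are not taken from the product. The sketch only shows the general idea that a fixed seed reproduces the same split.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=None):
    """Return (train_indices, test_indices) for a dataset with n_rows rows."""
    rng = random.Random(seed)          # the seed fixes the shuffle order
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Same seed twice, then a different seed.
train_a, test_a = split_indices(150, test_fraction=0.2, seed=7)
train_b, test_b = split_indices(150, test_fraction=0.2, seed=7)
train_c, test_c = split_indices(150, test_fraction=0.2, seed=8)

print("rows in train/test:", len(train_a), len(test_a))
print("same seed gives the same test rows:", test_a == test_b)
print("different seed gives the same test rows:", test_a == test_c)
```

With a test fraction of 0.2 and 150 rows, 30 rows go to test and 120 to train. Rerunning with the same seed selects the same rows, while a different seed will almost certainly select different ones.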

Interpretation of Train-Test Split

The data is split into the train dataset and test dataset.

The split percentage is selected considering the points mentioned below.

  • The train set represents the dataset sufficiently
  • The test set represents the dataset sufficiently
  • The computational cost of evaluating the model
  • The computational cost of training the model

Common split percentages are:

  • Train: 80%, Test: 20%
  • Train: 70%, Test: 30%
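
To make the trade-off concrete, the sketch below computes how many rows each common split leaves for training and testing on a 150-record dataset (the size of the flower dataset used in the example in this document). The helper `split_sizes` is hypothetical, and the use of integer division for rounding is an assumption. The tool may round differently.

```python
def split_sizes(n_rows, test_percent):
    """Return (n_train, n_test) for a whole-number test percentage."""
    n_test = n_rows * test_percent // 100   # integer arithmetic avoids float rounding
    return n_rows - n_test, n_test


for test_percent in (20, 30):
    n_train, n_test = split_sizes(150, test_percent)
    print(f"Train: {100 - test_percent}%, Test: {test_percent}% "
          f"-> {n_train} train rows, {n_test} test rows")
```

A larger test share gives a steadier performance estimate but leaves fewer rows to train on, which matters most when the dataset is small.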

Example of Train-Test Split

Consider a flower dataset with 150 records. A snippet of input data is shown in the figure given below.

We apply Train Test Split to the input data. The input dataset is split randomly into train records and test records based on the Test Percentage parameter set in the properties.

The segmentation of records into train and test is displayed in the data column trainTestTagIntern, as shown in the figure below.
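
The idea of recording the split in an extra column can be sketched as follows. Only the column name trainTestTagIntern comes from this document. The tag values "train" and "test", the helper `add_split_tag`, and the sample rows are assumptions made for illustration, since the actual values stored by the tool are not shown here.

```python
import random
from collections import Counter


def add_split_tag(rows, test_fraction=0.2, seed=None, column="trainTestTagIntern"):
    """Return copies of rows with an extra column marking each row as train or test."""
    rng = random.Random(seed)
    n_test = round(len(rows) * test_fraction)
    # Pick the positions of the test rows at random, without replacement.
    test_positions = set(rng.sample(range(len(rows)), n_test))
    return [
        {**row, column: "test" if i in test_positions else "train"}
        for i, row in enumerate(rows)
    ]


# Hypothetical stand-in for the 150-record flower dataset.
rows = [{"id": i, "petal_length": round(1.0 + 0.03 * i, 2)} for i in range(150)]

tagged = add_split_tag(rows, test_fraction=0.2, seed=1)
print(Counter(row["trainTestTagIntern"] for row in tagged))
print(tagged[0])
```

Keeping all rows in one dataset and tagging them, rather than producing two separate files, lets later tasks filter on the tag column to pick the train or test portion.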

Next, we apply the AdaBoost classification model to the split data.

The result for the train data is displayed in the figure given below.

As shown in the figure above, the AdaBoost model's accuracy on the train data is 0.9417.

The result for the test data is displayed in the figure given below.

As shown in the figure above, the AdaBoost model's accuracy on the test data is 0.8667.
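
Accuracy is the number of correctly classified records divided by the total number of records. The sketch below reproduces the two reported figures under the assumption that the default Test Percentage of 0.2 was used, giving 120 train and 30 test records. The counts of correct predictions (113 and 26) are not stated in this document. They are inferred as the whole numbers that yield the reported accuracies under that assumption.

```python
def accuracy(n_correct, n_total):
    """Share of records whose predicted class matches the actual class."""
    return n_correct / n_total


# ASSUMPTION: 150 records split 80/20 into 120 train and 30 test records.
# ASSUMPTION: 113 and 26 correct predictions, inferred from the reported accuracies.
train_acc = accuracy(113, 120)
test_acc = accuracy(26, 30)

print(round(train_acc, 4))             # 0.9417
print(round(test_acc, 4))              # 0.8667
print(round(train_acc - test_acc, 4))  # 0.075
```

A test accuracy somewhat below the train accuracy is expected, because the model has already seen the train records. A much larger gap would suggest overfitting.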

Example of Train-Test Split for image dataset

Now, we apply Train Test Split to an image dataset. Consider an image dataset with 100 images. The dataset can be viewed as follows.

We apply Train Test Split to the input data. The input dataset is split randomly into train records and test records based on the Test Percentage parameter set in the properties.

The segmentation of records into train and test is displayed in the data column TrainTestSplit, as shown in the figure below.

Similarly, you can use Train Test Split to test the performance of any other Classification or Regression model.
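
For a Regression model, the held-out test data is typically scored with measures such as Root Mean Square Error and R Square, both of which are listed among the statistical methods above. The following is a minimal sketch of those two formulas. The test values and predictions are invented for illustration and do not come from any model in this document.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """R Square: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical target values from a test set and a model's predictions for them.
y_test = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

print("RMSE:", round(rmse(y_test, y_pred), 4))
print("R Square:", round(r_squared(y_test, y_pred), 4))
```

As with classification accuracy, these measures should be computed on the test rows only, so that they describe performance on data the model was not trained on.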
