Data partitions

Summary

Goal

The goal of this script is to split the available dataset into a training set (for fitting predictive models) and a test set (for validating them and calculating performance metrics).

Remarks
  • Due to the small size of the dataset, a two-partition approach (train and test) will be used. Approaches with more partitions (e.g., train, validation, and test) could be considered, as they can produce less biased estimates of test error; a sketch of such a split is shown after this list.
  • It is important to ensure that the randomness in creating the partitions does not lead to differing distributions of the response variable between train and test. The partitions will therefore be created by stratifying on the salary variable, using percentile bins.
  • All models to be fitted will use the same partitions.
  • An 80/20 split was chosen: 80% for train and 20% for test.
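
For reference, here is a minimal sketch of a three-way split using two chained calls to train_test_split. The 60/20/20 proportions and the reuse of the same percentile stratification are illustrative assumptions, not part of this script:

# Hypothetical three-way split: 60% train, 20% validation, 20% test
import pandas as pd
from sklearn.model_selection import train_test_split

datos = pd.read_csv('../data/clean_data.csv')
intervalos = pd.qcut(x = datos['salary'], q = 5, labels = False, duplicates = 'drop')

# First split: 60% train, 40% temporary holdout, stratified on salary bins
train, holdout = train_test_split(datos, test_size = 0.4, random_state = 89, stratify = intervalos)

# Second split: divide the holdout evenly into validation and test,
# re-stratifying on the salary bins of the holdout rows
validation, test = train_test_split(holdout, test_size = 0.5, random_state = 89,
                                    stratify = intervalos.loc[holdout.index])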

Libraries and modules

import pandas as pd
from sklearn.model_selection import train_test_split

Create partitions

# Load data
datos = pd.read_csv('../data/clean_data.csv')

# Use percentiles (quintiles) so that the distribution of salary is preserved in both sets
intervalos = pd.qcut(x = datos['salary'], q = 5, labels = False, duplicates = 'drop')

# Split
train, test = train_test_split(datos, test_size = 0.2, random_state = 89, stratify = intervalos)

# Save
train.to_csv('../data/train_set.csv', index = False)
test.to_csv('../data/test_set.csv', index = False)
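
As a quick sanity check, the share of each salary bin should be nearly identical in both partitions. A minimal sketch, assuming the partitions keep the original row index (train_test_split preserves it by default):

# Compare the proportion of each salary quintile in train vs. test
print(intervalos.loc[train.index].value_counts(normalize = True).sort_index())
print(intervalos.loc[test.index].value_counts(normalize = True).sort_index())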

Checking that the salary distribution is similar in both sets

train['salary'].describe()
count       298.000000
mean     100486.577181
std       48077.674467
min       30000.000000
25%       55000.000000
50%       95000.000000
75%      140000.000000
max      250000.000000
Name: salary, dtype: float64

test['salary'].describe()
count        75.000000
mean     101400.000000
std       48403.986881
min       35000.000000
25%       57500.000000
50%       95000.000000
75%      140000.000000
max      220000.000000
Name: salary, dtype: float64
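
The summary statistics are close, which suggests the stratification worked. For a more formal comparison, one could run a two-sample Kolmogorov-Smirnov test; a minimal sketch, assuming scipy is available (it is not a dependency of this script):

from scipy.stats import ks_2samp

# Two-sample KS test on salary: a large p-value means there is no
# evidence that the train and test distributions differ
stat, p_value = ks_2samp(train['salary'], test['salary'])
print(f'KS statistic = {stat:.4f}, p-value = {p_value:.4f}')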