Data partitions
Summary
Goal
The goal of this script is to split the available dataset into a training set (for fitting predictive models) and a test set (for validating them and calculating performance metrics).
Remarks
- Due to the small size of the dataset, a two-partition approach (train and test) will be used. Approaches with more partitions (e.g., train, validation, and test) could be considered, as they may offer advantages in producing unbiased estimates of the test error; see the sketch after this list.
- It is important to ensure that the randomness in creating the partitions does not lead to differing distributions of the response variable between train and test. Therefore, the partitions will be created by stratifying based on the salary variable, using percentiles.
- All models to be fitted will use the same partitions.
- An 80/20 split was chosen, reserving 80% for train and 20% for test.
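For reference, a three-way split could be built by chaining two calls to train_test_split. This is only a minimal sketch, assuming the datos and intervalos objects defined in the next section; the names resto and validacion are illustrative and not part of the original script.

# Sketch of a 60/20/20 train/validation/test split (hypothetical, not used here).
# First carve out the 20% test set, then split the remaining 80% again;
# test_size = 0.25 of that remainder yields 20% of the original data.
resto, test = train_test_split(datos, test_size = 0.2, random_state = 89, stratify = intervalos)
train, validacion = train_test_split(resto, test_size = 0.25, random_state = 89,
                                     stratify = intervalos.loc[resto.index])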
Libraries and modules
import pandas as pd
from sklearn.model_selection import train_test_split
Create partitions
# Load data
datos = pd.read_csv('../data/clean_data.csv')

# Use percentiles so that the distribution of salary is preserved in both sets
intervalos = pd.qcut(x = datos['salary'], q = 5, labels = False, duplicates = 'drop')

# Split
train, test = train_test_split(datos, test_size = 0.2, random_state = 89, stratify = intervalos)

# Save
train.to_csv('../data/train_set.csv', index = False)
test.to_csv('../data/test_set.csv', index = False)
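As an optional sanity check, not part of the original script, the stratum proportions can be compared directly. Since train_test_split preserves the original row index, intervalos can be aligned with each split; this is a sketch under that assumption.

# Share of each salary quintile in train and test; the splits keep the index
# of `datos`, so `intervalos` can be aligned to each split by index.
print(intervalos.loc[train.index].value_counts(normalize = True).sort_index())
print(intervalos.loc[test.index].value_counts(normalize = True).sort_index())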
Checking that salary has a similar distribution in both sets
train['salary'].describe()
count 298.000000
mean 100486.577181
std 48077.674467
min 30000.000000
25% 55000.000000
50% 95000.000000
75% 140000.000000
max 250000.000000
Name: salary, dtype: float64
test['salary'].describe()
count 75.000000
mean 101400.000000
std 48403.986881
min 35000.000000
25% 57500.000000
50% 95000.000000
75% 140000.000000
max 220000.000000
Name: salary, dtype: float64
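The summary statistics above look very close between the two sets. As a complementary check, not part of the original script, SciPy's two-sample Kolmogorov-Smirnov test could be used to compare the full distributions; a minimal sketch, assuming SciPy is available:

from scipy import stats

# Two-sample Kolmogorov-Smirnov test: the null hypothesis is that the train
# and test salaries come from the same distribution, so a large p-value
# indicates no evidence of a mismatch between the two sets.
resultado = stats.ks_2samp(train['salary'], test['salary'])
print(f'KS statistic: {resultado.statistic:.4f}, p-value: {resultado.pvalue:.4f}')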