import sys
import os
"code"))
sys.path.append(os.path.join(os.getcwd(), from modulos import metrics, analyze_residuals, scatter_plot_real_vs_pred
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import root_mean_squared_error, r2_score
import joblib
# Load the train/test partitions
train_df = pd.read_csv('../data/train_set.csv')
test_df = pd.read_csv('../data/test_set.csv')
Salary prediction using a linear regression model
Summary
The goal of this script is to fit a multiple linear regression model to predict the response variable salary based on the remaining available variables. Additionally, statistical inference methodologies are applied to estimate and quantify the level of association between the predictors and the response, measuring the degree of confidence in the conclusions. The technical conditions that the model must meet to ensure valid conclusions are assessed, and finally, the model is used to make predictions on the test data to evaluate its performance.
Why Use Regression
- Although classic and simple, regression models are highly useful and flexible, offering clearly interpretable effects of the predictors on the response variable, along with probabilistic conclusions.
- This model serves as a baseline for comparison; any more advanced technique should outperform it to be considered.
Performance in Predicting Salary
- The model shows a very good performance when predicting salaries with new data, with \(R^2: 0.8912\), which indicates that 89.12% of the variance in the response (salary) is explained by the predictors in the model.
- On average, the salary predictions made by the model with new data differ from the actual values by USD 15,819.82.
- Similar values of these metrics on the train and test data indicate a low risk of overfitting.

Set | \(R^2\) | RMSE (USD) |
---|---|---|
Train | 0.9231 | 13,372.68 |
Test | 0.8912 | 15,819.82 |
Results
The overall model is highly significant (\(F\)-stat: 428.9, p-value < 0.0001).
All predictors make significant contributions, as supported by ANOVA results.
No major violations of model assumptions were detected.
Key effects. On average, and holding all other factors constant:
- A Master’s degree increases salary by $18,390, and a PhD increases it by $23,180 compared to individuals with a Bachelor’s degree.
- Being in a Leadership role increases salary by $14,050, and being a Senior employee increases it by $13,590, compared to being a Junior.
- Each additional year of age increases salary by $2,540.94, and each additional year of experience increases it by $2,497.27.
- Men earn $7,998.69 more than women.
Methodology
Multiple linear regression is a statistical method used to model the relationship between a numerical response variable \(Y\) and multiple explanatory variables \(X_1, X_2, \ldots, X_k\). The general form of the model is expressed as:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k + \epsilon\]
where \(\beta_0\) represents the intercept, \(\beta_1, \beta_2, \ldots, \beta_k\) are the regression coefficients associated with each explanatory variable, and \(\epsilon\) is the error term, which captures the variability in the response variable that is not explained by the predictors.
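As a toy illustration with synthetic data (not the salary dataset), the coefficients can be estimated directly from the normal equations \(\hat{\beta} = (X^TX)^{-1}X^Ty\):

import numpy as np

# Synthetic design matrix: an intercept column plus two predictors
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([5.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)  # response with Gaussian error

# Solve (X'X) beta = X'y instead of inverting X'X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [5, 2, -1]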
This model is widely used for both predictive and inferential purposes. Inferentially, the coefficients \(\beta_i\) provide information about the strength and direction of the relationship between each explanatory variable and the response variable. Hypothesis testing can be performed to evaluate whether specific predictors have a statistically significant effect on the response, and confidence intervals can also be constructed to quantify the uncertainty around the estimated parameters.
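Concretely, the significance of a single coefficient is assessed with a \(t\)-test of \(H_0: \beta_j = 0\), and the 95% confidence interval is built from the same standard error:
\[t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \qquad \hat{\beta}_j \pm t_{n-k-1,\,0.975}\cdot\mathrm{SE}(\hat{\beta}_j)\]
where \(n-k-1\) are the residual degrees of freedom.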
To ensure the validity of the model, certain technical conditions must be verified. These include linearity (the relationship between predictors and the response is linear), independence of errors, homoscedasticity (constant variance of the errors), and normality of the error terms. Diagnostic tools such as residual plots, normality tests, and quantile-quantile (Q-Q) plots are commonly used to assess these assumptions.
If the assumptions are not met, the model can be improved by proposing a more sophisticated structure (e.g., variable transformations, variance structures that do not assume independence or homoscedasticity, interactions between predictors, etc.). The model can also be estimated using techniques such as LASSO or Ridge, which generally provide estimators with lower variance.
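As an illustration only (this script does not use it), a Ridge fit on a numeric design matrix with scikit-learn could look like the following sketch; the data here are synthetic:

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Synthetic stand-in for a numeric feature matrix and response
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Standardize, then shrink coefficients toward zero (alpha sets the penalty)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(ridge.named_steps['ridge'].coef_)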
In this illustrative case, only a basic linear regression model is fitted, which still achieves excellent performance.
Libraries and data
The libraries, the helper functions from the local modulos package, and the train/test partitions are loaded in the code at the top of this script.
Training
The model shows a very good performance, with \(R^2: 0.923\), which indicates that 92.3% of the variance in the response (salary) is explained by the predictors in the model.
On average, the salary predictions made by the model differ from the actual values by USD 13,372.68.
The model underestimates the response for outlier cases with very high real salaries.
# Model fitting
modelo = smf.ols('salary ~ age + gender + educ + title_cat + exp', data=train_df).fit()
modelo.summary()
Dep. Variable: | salary | R-squared: | 0.923 |
Model: | OLS | Adj. R-squared: | 0.921 |
Method: | Least Squares | F-statistic: | 428.9 |
Date: | Wed, 29 Jan 2025 | Prob (F-statistic): | 2.12e-154 |
Time: | 19:01:25 | Log-Likelihood: | -3221.4 |
No. Observations: | 295 | AIC: | 6461. |
Df Residuals: | 286 | BIC: | 6494. |
Df Model: | 8 | | |
Covariance Type: | nonrobust | | |
 | coef | std err | t | P>|t| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | -4.022e+04 | 1.54e+04 | -2.616 | 0.009 | -7.05e+04 | -9962.527 |
gender[T.Male] | 7998.6856 | 1607.752 | 4.975 | 0.000 | 4834.158 | 1.12e+04 |
educ[T.Master's] | 1.839e+04 | 2115.086 | 8.695 | 0.000 | 1.42e+04 | 2.26e+04 |
educ[T.PhD] | 2.318e+04 | 2873.911 | 8.066 | 0.000 | 1.75e+04 | 2.88e+04 |
title_cat[T.Leadership] | 1.405e+04 | 3501.546 | 4.012 | 0.000 | 7155.489 | 2.09e+04 |
title_cat[T.Other] | 375.2237 | 2746.080 | 0.137 | 0.891 | -5029.868 | 5780.315 |
title_cat[T.Senior] | 1.359e+04 | 2640.661 | 5.146 | 0.000 | 8390.865 | 1.88e+04 |
age | 2540.9370 | 565.687 | 4.492 | 0.000 | 1427.499 | 3654.375 |
exp | 2497.2653 | 641.125 | 3.895 | 0.000 | 1235.344 | 3759.187 |
Omnibus: | 56.291 | Durbin-Watson: | 1.626 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 289.301 |
Skew: | 0.642 | Prob(JB): | 1.51e-63 |
Kurtosis: | 7.678 | Cond. No. | 774. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
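Note [1] above points out that the standard errors assume a correctly specified error covariance. If heteroscedasticity were a concern, the model could be refit with robust standard errors; a minimal sketch using statsmodels' HC3 covariance (modelo_robusto is a hypothetical name; point estimates stay the same, only standard errors and p-values change):

# Refit with heteroscedasticity-robust (HC3) standard errors
modelo_robusto = smf.ols('salary ~ age + gender + educ + title_cat + exp',
                         data=train_df).fit(cov_type='HC3')
modelo_robusto.summary()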
# Save predictions in train data
train_df['pred_salary'] = modelo.predict(train_df)

# Performance metrics
# predictions may be NaN for rows with missing values in the predictors
train_sin_na_pred = train_df.dropna(subset='pred_salary')
metrics(train_sin_na_pred['salary'], train_sin_na_pred['pred_salary'], "train")
Metrics for train:
- R2: 0.9231
- RMSE: 13372.6827
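metrics is a helper from the local modulos package; its actual implementation is not shown here, but a minimal sketch consistent with the output above, using the sklearn functions imported at the top, would be:

from sklearn.metrics import root_mean_squared_error, r2_score

def metrics(y_true, y_pred, set_name):
    """Print R2 and RMSE for one data partition."""
    print(f"Metrics for {set_name}:")
    print(f"- R2: {r2_score(y_true, y_pred):.4f}")
    print(f"- RMSE: {root_mean_squared_error(y_true, y_pred):.4f}")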
scatter_plot_real_vs_pred(train_df, 'salary', 'pred_salary', "Training data", label=" - Salary (USD)")
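scatter_plot_real_vs_pred also comes from modulos; the plot itself is not reproduced here, but a plausible sketch of such a helper (assumed, not the actual implementation) is:

import matplotlib.pyplot as plt

def scatter_plot_real_vs_pred(df, real_col, pred_col, title, label=""):
    """Scatter of actual vs. predicted values with a 45-degree reference line."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(df[real_col], df[pred_col], alpha=0.5)
    lims = [df[real_col].min(), df[real_col].max()]
    ax.plot(lims, lims, color='red', linestyle='--')  # perfect-prediction line
    ax.set(xlabel=f'Actual{label}', ylabel=f'Predicted{label}', title=title)
    plt.show()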
Inference
The overall model is highly significant, as indicated by the F-test (F-stat: 428.9, p-value < 0.0001), which means the predictors collectively explain the dependent variable significantly better than a model without them.
Additionally, every predictor makes a significant contribution to the model, as indicated by the F-tests summarized in the ANOVA table below (sequential Type I tests).
From the estimated coefficients of the model, it can be concluded with 95% confidence that, holding other factors constant:
- Each additional year of age increases salary by an amount between USD 1,427 and USD 3,654.
- Each additional year of experience increases salary by an amount between USD 1,235 and USD 3,759.
- Men earn between USD 4,834 and USD 11,200 more than women.
- Compared to having a Bachelor’s degree, a Master’s degree increases salary by an amount between USD 14,200 and USD 22,600, and a PhD increases it by an amount between USD 17,500 and USD 28,800.
- Compared to “Junior” positions, salaries increase by between USD 8,390 and USD 18,800 for “Senior” and between USD 7,155 and USD 20,900 for “Leadership”. Other job positions do not differ significantly from “Junior”.
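These intervals can be read directly from the fitted model with statsmodels' conf_int:

# 95% confidence intervals for the coefficients (matches the summary table)
ci = modelo.conf_int(alpha=0.05)
ci.columns = ['2.5%', '97.5%']
print(ci)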
print(f"Test Global F: {modelo.summary().tables[0].data[0][3]}")
anova_lm(modelo)
Test Global F: 0.923
 | df | sum_sq | mean_sq | F | PR(>F) |
---|---|---|---|---|---|
gender | 1.0 | 3.739986e+09 | 3.739986e+09 | 20.275749 | 9.773076e-06 |
educ | 2.0 | 3.227025e+11 | 1.613512e+11 | 874.740489 | 1.321158e-122 |
title_cat | 3.0 | 1.718697e+11 | 5.728990e+10 | 310.588252 | 1.254207e-89 |
age | 1.0 | 1.317438e+11 | 1.317438e+11 | 714.228383 | 9.819006e-80 |
exp | 1.0 | 2.798577e+09 | 2.798577e+09 | 15.172047 | 1.222542e-04 |
Residual | 286.0 | 5.275445e+10 | 1.844561e+08 | NaN | NaN |
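anova_lm defaults to sequential (Type I) tests, where each term is adjusted only for the terms listed before it. Type II tests, which condition each factor on all the others, are also available and can serve as a cross-check:

# Type II ANOVA: each factor tested conditional on all the others
anova_lm(modelo, typ=2)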
Evaluation of technical conditions
An exploratory residual analysis was conducted, and no significant violations of the model assumptions were detected, allowing the inferential conclusions to be considered valid. The residual distribution appears to have slightly heavy tails, but this is minor.
analyze_residuals(modelo)
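analyze_residuals is a helper from modulos and its plots are not reproduced here; the following sketch shows the kind of checks such a helper typically performs (assumed, not the actual implementation):

import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

def residual_checks(fitted_model):
    """Residuals-vs-fitted plot, Q-Q plot, and a normality test."""
    resid = fitted_model.resid
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(fitted_model.fittedvalues, resid, alpha=0.5)
    axes[0].axhline(0, color='red', linestyle='--')
    axes[0].set(xlabel='Fitted values', ylabel='Residuals',
                title='Residuals vs. fitted')
    sm.qqplot(resid, line='45', fit=True, ax=axes[1])
    axes[1].set_title('Normal Q-Q plot')
    plt.tight_layout()
    plt.show()
    jb = stats.jarque_bera(resid)  # normality test on the residuals
    print(f"Jarque-Bera: {jb.statistic:.2f} (p = {jb.pvalue:.4g})")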
Prediction in test data
The model shows a very good performance when predicting salaries with new data, with \(R^2: 0.8912\), which indicates that 89.12% of the variance in the response (salary) is explained by the predictors in the model.
On average, the salary predictions made by the model with new data differ from the actual values by USD 15,819.82.
As expected, these metrics are slightly worse than the ones for the train data, with no indication of overfitting.
# Metrics for the test set
test_df['pred_salary'] = modelo.predict(test_df)
test_sin_na_pred = test_df.dropna(subset='pred_salary')
metrics(test_sin_na_pred['salary'], test_sin_na_pred['pred_salary'], "test")
Metrics for test:
- R2: 0.8912
- RMSE: 15819.8234
scatter_plot_real_vs_pred(test_df, 'salary', 'pred_salary', "Testing data", label=" - Salary (USD)")
Saving the trained model
joblib.dump(modelo, '../model_outputs/mod_reg_lin.pkl')
['../model_outputs/mod_reg_lin.pkl']
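To reuse the model later, it can be loaded back with joblib and applied to new data. A minimal sketch; the observation below is hypothetical, and its columns must match the training predictors:

import joblib
import pandas as pd

modelo_cargado = joblib.load('../model_outputs/mod_reg_lin.pkl')

# Hypothetical new observation (column names and category levels as in training)
nuevo = pd.DataFrame({'age': [35], 'gender': ['Female'], 'educ': ["Master's"],
                      'title_cat': ['Senior'], 'exp': [8]})
print(modelo_cargado.predict(nuevo))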