A Data Science Sample Project
The goal of this problem is to train a model that successfully predicts job salary from a set of predictor variables, including age and years of experience.
For this sample project, two supervised learning methods were chosen: a linear regression model (LRM) and a Transformer-based large language model (LLM), DistilBERT.
Before training the models, the data was explored and preprocessed: missing values were addressed, inconsistent observations were corrected, and the number of levels in categorical variables was reduced, among other steps.
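A minimal sketch of what those steps could look like in pandas; the file name and the column names (`salary`, `age`, `experience`, `job_title`) are assumptions, not the actual dataset schema:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("salary_data.csv")

# Drop rows with a missing target; impute numeric predictors with the median
df = df.dropna(subset=["salary"])
for col in ["age", "experience"]:
    df[col] = df[col].fillna(df[col].median())

# Reduce the number of levels in a categorical variable: keep the most
# frequent job titles and group the remaining ones under "Other"
top_titles = df["job_title"].value_counts().nlargest(10).index
df["job_title"] = df["job_title"].where(df["job_title"].isin(top_titles), "Other")
```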
The overall linear regression model is highly significant, and all predictors make significant contributions, as supported by the corresponding statistical tests.
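For illustration, a sketch of how those tests could be run with statsmodels, reusing the hypothetical columns from the preprocessing sketch; `ols.summary()` reports the overall F-test and the per-coefficient t-tests:

```python
import pandas as pd
import statsmodels.api as sm

# One-hot encode the categorical predictor and add an intercept term
X = sm.add_constant(pd.get_dummies(df[["age", "experience", "job_title"]],
                                   drop_first=True, dtype=float))
y = df["salary"]

ols = sm.OLS(y, X).fit()
print(ols.summary())  # overall F-test plus per-coefficient t-tests and p-values
```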
The model performs well when predicting salaries for new data; similar metric values on the train and test sets indicate a low risk of overfitting:
| Set | \(R^2\) | RMSE (USD) |
|---|---|---|
| Train | 0.9231 | 13372.68 |
| Test | 0.8912 | 15819.82 |
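For reference, a sketch of how these metrics could be computed with scikit-learn; `y_train_pred` and `y_test_pred` are placeholders for the model's predictions on each split:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# R-squared and root mean squared error on each split
r2_train = r2_score(y_train, y_train_pred)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_test = r2_score(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print(f"Train: R2={r2_train:.4f}, RMSE={rmse_train:.2f}")
print(f"Test:  R2={r2_test:.4f}, RMSE={rmse_test:.2f}")
```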
Based on the model estimates, on average, each additional year of age increases the predicted salary by USD 2,540.94 and each additional year of experience increases it by USD 2,497.27, holding the other predictors fixed.
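These effect sizes can be read directly off the fitted model's coefficients, as in this sketch (the coefficient names are tied to the hypothetical columns above):

```python
# Per-year effects from the fitted OLS model of the earlier sketch
print(ols.params["age"])         # ~ 2540.94: salary change per year of age
print(ols.params["experience"])  # ~ 2497.27: salary change per year of experience

# e.g., five more years of experience shift the prediction by roughly
# 5 * 2497.27 ~ USD 12,486, holding the other predictors fixed
```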
Using an LLM could be a good fit for this problem because it can capture complex patterns and relationships in text data, which may contain valuable signal for predicting salary beyond traditional numerical features.
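A minimal fine-tuning sketch with the Hugging Face `transformers` and `datasets` libraries, assuming each observation is serialized into a single text string; the serialization format and column names are assumptions, not the project's actual pipeline:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

# Serialize the tabular features into text (hypothetical format)
df["text"] = ("age: " + df["age"].astype(str) + "; experience: "
              + df["experience"].astype(str) + "; title: " + df["job_title"])
df["label"] = df["salary"].astype(float)  # float labels for regression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=1,               # single regression output
    problem_type="regression",  # MSE loss instead of cross-entropy
)

ds = Dataset.from_pandas(df[["text", "label"]].reset_index(drop=True))
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=64),
            batched=True)
ds = ds.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(output_dir="salary-distilbert", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
```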
The fine-tuned model also performs well when predicting salaries for new data; similar metric values on the train and test sets again indicate a low risk of overfitting:
| Set | \(R^2\) | RMSE (USD) |
|---|---|---|
| Train | 0.9094 | 14445.76 |
| Test | 0.8836 | 16404.05 |
Both trained models showed similar performance, with the linear regression model (LRM) slightly outperforming the DistilBERT model (LLM).
| \(R^2\) | LRM | LLM |
|---|---|---|
| Train | 0.9231 | 0.9094 |
| Test | 0.8912 | 0.8836 |

| RMSE (USD) | LRM | LLM |
|---|---|---|
| Train | 13372.68 | 14445.76 |
| Test | 15819.82 | 16404.05 |
Hyperparameter tuning for the LLM (learning rate, batch size, number of epochs, etc.) was not performed, and training was run only briefly on a personal computer; exploring these settings could significantly improve its performance, as in the sketch below.
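A hedged sketch of what such a sweep could look like, reusing the dataset split and model setup from the fine-tuning sketch above (the grid values are illustrative):

```python
from transformers import (AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Small grid over learning rate and epochs; select by evaluation loss (MSE)
best = (None, float("inf"))
for lr in (2e-5, 5e-5):
    for epochs in (2, 4):
        args = TrainingArguments(output_dir=f"tune-lr{lr}-ep{epochs}",
                                 learning_rate=lr, num_train_epochs=epochs,
                                 per_device_train_batch_size=16)
        trainer = Trainer(
            model=AutoModelForSequenceClassification.from_pretrained(
                "distilbert-base-uncased", num_labels=1,
                problem_type="regression"),
            args=args, train_dataset=ds["train"], eval_dataset=ds["test"])
        trainer.train()
        loss = trainer.evaluate()["eval_loss"]
        if loss < best[1]:
            best = ((lr, epochs), loss)

print("best (learning rate, epochs):", best[0])
```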
Other machine learning techniques (discriminant analysis, random forests, SVM, etc.) and aggregation methods (bagging, boosting, etc.) could also be evaluated, with their hyperparameters tuned by cross-validation, as in the sketch below.
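As one example, a sketch of tuning a random forest by 5-fold cross-validation with scikit-learn; the grid values are illustrative and the feature columns reuse the hypothetical names above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Same hypothetical features as the earlier sketches
X = pd.get_dummies(df[["age", "experience", "job_title"]], drop_first=True)
y = df["salary"]

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, "CV RMSE:", -grid.best_score_)
```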
Choose either of these models to make predictions for new data in this app.