Default Detector
• By Arman Drismir • 1 minute read

I built a Svelte/Flask app around the ML model I made for a CPSC 330 class project.
Dataset & Metrics
This dataset is imbalanced: 22% of clients default while 78% do not. For this reason, F1 score and recall are the metrics to pay the most attention to. F1 gives a good indication of overall performance, and recall tells us how well the model performs on the clients who actually do default.
The Kaggle page has descriptions of what the values in each column represent. Link to dataset.
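With a 78/22 split, accuracy alone is misleading: a model that mostly predicts PAID can still score above the dummy baseline. A minimal sketch (hypothetical labels, assuming scikit-learn is available) showing how accuracy can look fine while recall on the DEFAULT class is poor:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels mimicking the 78/22 imbalance: a model that
# predicts PAID (0) for almost everyone still looks accurate.
y_true = [0] * 78 + [1] * 22
y_pred = [0] * 78 + [0] * 18 + [1] * 4  # catches only 4 of 22 defaults

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")  # → accuracy: 0.82
print(classification_report(y_true, y_pred, target_names=["PAID", "DEFAULT"]))
# recall for DEFAULT is only 4/22 ≈ 0.18 despite the 0.82 accuracy
```

This is exactly why the per-class recall row in the table below matters more than the headline accuracy.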
Best Model
I was able to get a test accuracy of 82% using CatBoost with hyperparameters found via RandomizedSearch. The result was less impressive than I expected, since a dummy classifier already scores 78%. Running RandomizedSearch on a more powerful computer would probably have squeezed out a few more points of accuracy. The overall F1, recall, and precision scores look good, but for the DEFAULT class specifically they are poor. It would be interesting to train a new model that sacrifices accuracy for recall on the DEFAULT class.
| | precision | recall | f1-score |
|---|---|---|---|
| PAID | 0.84 | 0.95 | 0.89 |
| DEFAULT | 0.68 | 0.37 | 0.48 |
| average | 0.81 | 0.82 | 0.80 |
All Models
Logistic Regression
Train score: 0.8105, improved to 0.8106 with HPO (+0.0001)
SVC
Train score: 0.8177, improved to 0.8183 with HPO (+0.0006)
Gradient Boosted Trees
Train score: 0.8190, decreased to 0.8189 with HPO (-0.0001)
CatBoost
Train score: 0.8175, improved to 0.8214 with HPO (+0.0039)
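The model comparison above could be reproduced with a simple cross-validation loop. A sketch using scikit-learn stand-ins on synthetic data (the original notebook isn't shown, and CatBoost is omitted since it may not be installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same 78/22 imbalance as the real data.
X, y = make_classification(n_samples=1000, weights=[0.78], random_state=330)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Gradient Boosted Trees": GradientBoostingClassifier(),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.4f}")
```

Scaling matters for the linear model and SVC but not for the tree ensembles, hence the pipelines.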