Heart Attack Risk Prediction Using XGBoost

Objective:

Developed a machine learning model to predict the likelihood of a heart attack based on patient lifestyle and health factors such as age, sedentary hours, exercise frequency, and medical history.

Key Steps:

Data Preprocessing:

Challenge: The dataset contained missing values and categorical data.
Solution: Removed rows with missing values using dropna() and converted two-level categorical variables (e.g., gender, family history) to 0/1 values with binary encoding, so that the model could process every feature.
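
A minimal sketch of this preprocessing step, using pandas. The toy rows, column names, and category labels below are hypothetical and only illustrate the pattern; they are not the project's actual schema.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [54, 61, None, 47],
    "sedentary_hours": [6.5, 9.0, 4.0, None],
    "gender": ["Male", "Female", "Female", "Male"],
    "family_history": ["Yes", "No", "Yes", "No"],
    "heart_attack_risk": [1, 0, 0, 1],
})

# Drop every row that contains at least one missing value.
clean = df.dropna().copy()

# Binary-encode the two-level categorical columns as 0/1.
clean["gender"] = clean["gender"].map({"Male": 0, "Female": 1})
clean["family_history"] = clean["family_history"].map({"No": 0, "Yes": 1})

print(clean)
```

Dropping rows is simple but discards data; whether that trade-off is acceptable depends on how many rows are affected.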

Handling Class Imbalance:

Challenge: The dataset was imbalanced, with far fewer instances of heart attack risk (minority class) than non-risk cases.
Solution: Applied SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority-class samples by interpolating between existing ones, to balance the dataset. This improved the model’s ability to identify heart attack risks without being biased toward the majority class.

Feature Scaling:

Challenge: Features like age and sedentary hours had varying scales, which could skew the model’s performance.
Solution: Used StandardScaler to standardize each feature to zero mean and unit variance, putting all inputs on a comparable scale.
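
A minimal sketch of the scaling step with scikit-learn. The two columns stand in for age and sedentary hours, and the values are made up for illustration. Fitting the scaler on the training data only and reusing it for the test data is an assumption about the workflow, not something the write-up specifies.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 = age (years), column 1 = sedentary hours per day.
X_train = np.array([[25, 2.0], [40, 6.5], [58, 9.0], [67, 11.5]], dtype=float)
X_test = np.array([[33, 4.0], [72, 10.0]], dtype=float)

scaler = StandardScaler()

# Learn per-feature mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```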

Modelling with XGBoost:

Implementation: Chose XGBoost, a gradient-boosted decision tree algorithm, for its strong performance on structured/tabular data and its ability to capture nonlinear relationships and interactions between features. Tuned hyperparameters such as learning_rate and n_estimators to optimize model performance.

Cross-Validation:

Challenge: Ensuring that the model generalized across different subsets of the data.
Solution: Used two-fold cross-validation to evaluate the model on multiple splits of the data, which helped detect overfitting and confirm that the model performed consistently across different samples.

Model Evaluation:

Challenge: Initial models showed poor precision and recall, particularly in detecting heart attack risks (the minority class).
Solution: Investigated the impact of class imbalance and feature scaling; classification performance improved after oversampling and scaling were applied. Evaluated the model with accuracy, precision, recall, F1-score, and ROC AUC to assess how well it distinguished between high-risk and low-risk individuals.
Outcome: After addressing class imbalance and tuning hyperparameters, the model achieved improved accuracy and more balanced precision and recall scores. Visualized results using ROC curves to highlight model performance.
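
A minimal sketch of how the listed metrics can be computed with scikit-learn. The labels and predicted probabilities are made-up values chosen for illustration, not results from the project, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)

# Made-up ground truth (1 = at risk) and predicted risk probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6, 0.3, 0.2, 0.7, 0.05])

# Assumed threshold: probabilities of 0.5 or higher count as "at risk".
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC AUC is threshold-free, so it uses the probabilities directly.
auc = roc_auc_score(y_true, y_prob)

# Points of the ROC curve, ready to pass to a plotting library.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

print(accuracy, precision, recall, f1, auc)
```

Precision and recall on the at-risk class are the more informative numbers here, since accuracy alone can look high on imbalanced data even when most at-risk cases are missed.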