PythonPlaza.com

Gradient Boosting

Gradient Boosting is a machine learning method that builds an ensemble by combining many weak prediction models. The weak models are usually shallow decision trees trained sequentially, so that each new tree corrects the errors left by the trees before it. By combining many decision tree regressors or decision tree classifiers in this way, gradient boosting can capture complex, non-linear relationships between the features and the target.
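The sequential idea can be illustrated with a minimal from-scratch sketch. It assumes squared-error loss, a single numeric feature, and one-split trees (stumps). The helper names `fit_stump` and `gradient_boost` and the toy data are hypothetical, and this is not how scikit-learn implements boosting internally.

```python
def fit_stump(x, residuals):
    """Fit a one-split tree: pick the threshold that minimizes squared error."""
    best = None
    # Skip the largest value so that both sides of the split are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda value: left_mean if value <= threshold else right_mean


def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda value: base + learning_rate * sum(s(value) for s in stumps)


def mean_squared_error(y_true, y_pred):
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up for illustration): a step from about 1 to about 3.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = gradient_boost(x, y)
baseline_mse = mean_squared_error(y, [sum(y) / len(y)] * len(y))
boosted_mse = mean_squared_error(y, [model(xi) for xi in x])
print("MSE of mean-only model:", baseline_mse)
print("MSE after boosting:", boosted_mse)
```

The parameters `n_estimators` and `learning_rate` play the same role here as in scikit-learn's gradient boosting estimators: more rounds and larger steps fit the training data more closely.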

One of gradient boosting's main advantages is that it iteratively minimizes a loss function: each new tree is fit to reduce the loss of the current ensemble. A common loss function for regression is Mean Squared Error (MSE), which measures how well a model's predictions match the actual data. MSE is the mean of the squared differences between the observed and predicted values.
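As a small worked example of the MSE definition, the following sketch computes it by hand on four made-up values. The function name `mse` and the data are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(y_true, y_pred)) / len(y_true)


observed = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Residuals are what the next boosting round would try to predict
# (for squared-error loss they equal the negative gradient, up to a constant factor).
residuals = [o - p for o, p in zip(observed, predicted)]

print(residuals)                 # [1.0, 0.0, -2.0, -1.0]
print(mse(observed, predicted))  # (1 + 0 + 4 + 1) / 4 = 1.5
```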

Mean Absolute Error (MAE) is another metric used with Gradient Boosting Regression; it quantifies the average magnitude of the errors.

Mean Absolute Error
It calculates the average absolute difference between a dataset's actual and predicted values. Because it ignores the direction of each error, it shows only how far predictions deviate from the actual values.
1. Computed from absolute differences
2. Easy to calculate and understand
3. Weights every error in proportion to its size
4. Less sensitive to large errors (outliers) than MSE
5. Frequently used to evaluate regression models
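The difference in sensitivity to large errors can be seen in a short sketch that computes both metrics by hand. The function names `mae` and `mse` and the numbers are made up for illustration.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


actual = [10, 12, 11, 13, 12]
close_predictions = [11, 11, 12, 12, 13]    # every prediction is off by 1
one_large_error = [11, 11, 12, 12, 22]      # the last prediction is off by 10

print(mae(actual, close_predictions), mse(actual, close_predictions))  # 1.0 1.0
print(mae(actual, one_large_error), mse(actual, one_large_error))      # 2.8 20.8
```

A single large error raises MAE from 1.0 to 2.8 but raises MSE from 1.0 to 20.8, because squaring amplifies large deviations.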

USE CASE 1: Use Gradient Boosting with scikit-learn to predict whether a loan will default.
Dependent variable: Default (0 = No Default, 1 = Default)
Independent variables (3):
1. Income (monthly income, e.g., 1000–10000)
2. CreditScore (300–850)
3. LoanAmount (1000–50000)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# -----------------------------------
# 1. Load data from Excel
# -----------------------------------
df = pd.read_excel("loan_data.xlsx")  # read_excel already returns a DataFrame
print("Dataset Preview:")
print(df.head())

# -----------------------------------
# 2. Define features and target
# -----------------------------------
features = ["Income", "CreditScore", "LoanAmount"]
X = df[features]
y = df["Default"]

# -----------------------------------
# 3. Split into training and testing
# -----------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------------
# 4. Train the Gradient Boosting model
# -----------------------------------
gb_model = GradientBoostingClassifier(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # step size
    max_depth=3,        # tree depth
    random_state=42
)
gb_model.fit(X_train, y_train)

# -----------------------------------
# 5. Evaluate on the test set
# -----------------------------------
y_pred = gb_model.predict(X_test)
y_prob = gb_model.predict_proba(X_test)[:, 1]  # probability of class 1 (Default)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_prob))

# -----------------------------------
# 6. Predict default for a new customer
# -----------------------------------
# Use a DataFrame with the training column names: Income, CreditScore, LoanAmount
new_customer = pd.DataFrame([[4500, 620, 16000]], columns=features)
default_prediction = gb_model.predict(new_customer)
default_probability = gb_model.predict_proba(new_customer)[0][1]

print("Default Prediction:", default_prediction[0])
print("Probability of Default:", default_probability)

USE CASE 2: Customer churn example using Gradient Boosting with scikit-learn in Python. We'll assume 4 independent variables:
1. Tenure (months with company): 1–60 months
2. MonthlyCharges (amount billed per month): 30–120
3. ContractType (0 = Month-to-month, 1 = One-year, 2 = Two-year)
4. SupportCalls (number of calls to support): 0–10
The dependent variable is Churn (0 = Stay, 1 = Churn).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# -----------------------------------
# 1. Load data from Excel
# -----------------------------------
# Sample data can be exported to Excel from the URL
# https://www.pythonplaza.com/categorical_customer_churn_1_or_0.html
df = pd.read_excel("customer_data.xlsx")
print("Dataset Preview:")
print(df.head())

# -----------------------------------
# 2. Define features and target
# -----------------------------------
features = ["Tenure", "MonthlyCharges", "ContractType", "SupportCalls"]
X = df[features]
y = df["Churn"]

# -----------------------------------
# 3. Split into training and testing
# -----------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------------
# 4. Train the Gradient Boosting model
# -----------------------------------
gb_model = GradientBoostingClassifier(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # step size
    max_depth=3,        # tree depth
    random_state=42
)
gb_model.fit(X_train, y_train)

# -----------------------------------
# 5. Evaluate on the test set
# -----------------------------------
y_pred = gb_model.predict(X_test)
y_prob = gb_model.predict_proba(X_test)[:, 1]  # probability of class 1 (Churn)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_prob))

# -----------------------------------
# 6. Predict churn for a new customer
# -----------------------------------
# Column order: Tenure, MonthlyCharges, ContractType, SupportCalls
new_customer = pd.DataFrame([[8, 92, 0, 5]], columns=features)
churn_prediction = gb_model.predict(new_customer)
churn_probability = gb_model.predict_proba(new_customer)[0][1]

print("Churn Prediction:", churn_prediction[0])
print("Probability of Churn:", churn_probability)

# Interpreting the results (business view)
#   1 → High risk of churn ⚠️
#   0 → Likely to stay ✅
#   Use the probability (e.g., churn > 0.6) to trigger retention offers
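The probability-threshold rule from the business view can be sketched as a small helper. The function name `flag_for_retention`, the customer IDs, and the probabilities are hypothetical. In practice the probabilities would come from `gb_model.predict_proba(...)[:, 1]`, and 0.6 is only the example cutoff mentioned in the text.

```python
def flag_for_retention(churn_probabilities, threshold=0.6):
    """Return the IDs of customers whose churn probability exceeds the threshold."""
    return [
        customer_id
        for customer_id, probability in churn_probabilities.items()
        if probability > threshold
    ]


# Made-up churn probabilities keyed by a hypothetical customer ID.
churn_probabilities = {"C001": 0.82, "C002": 0.35, "C003": 0.61, "C004": 0.60}

at_risk = flag_for_retention(churn_probabilities)
print(at_risk)  # ['C001', 'C003']
```

Lowering the threshold flags more customers at the cost of more offers sent to customers who would have stayed anyway, so the cutoff is a business decision rather than a fixed property of the model.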

USE CASE 3: Learning-style classification using Gradient Boosting with scikit-learn.
Independent variables (4): prefers_diagrams, prefers_lectures, prefers_notes, prefers_hands_on
Dependent variable: learning_style (categorical label, e.g., Visual, Auditory, Reading/Writing, Kinesthetic)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# -----------------------------------
# 1. Load data from Excel
# -----------------------------------
# Sample data can be exported to Excel via the
# "Get the Categorical learning Styles data in Excel" link
df = pd.read_excel("Categorical_learning_Styles.xlsx")
print("Dataset Preview:")
print(df.head())

# -----------------------------------
# 2. Define the data
# -----------------------------------
features = ["prefers_diagrams", "prefers_lectures", "prefers_notes", "prefers_hands_on"]
X = df[features]
y = df["learning_style"]

# Encode categorical target labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# -----------------------------------
# 3. Split into training and testing
# -----------------------------------
# Split on the encoded labels so predictions can be decoded with le later
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.3, random_state=42
)

# -----------------------------------
# 4. Train the Gradient Boosting model
# -----------------------------------
gb_model = GradientBoostingClassifier(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # step size
    max_depth=3,        # tree depth
    random_state=42
)
gb_model.fit(X_train, y_train)

# -----------------------------------
# 5. Evaluate on the test set
# -----------------------------------
y_pred = gb_model.predict(X_test)
y_prob = gb_model.predict_proba(X_test)  # one probability column per class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Multiclass ROC-AUC needs the full probability matrix and multi_class="ovr";
# this assumes every class appears in the test set
print("\nROC-AUC Score:", roc_auc_score(y_test, y_prob, multi_class="ovr"))

# -----------------------------------
# 6. Predict with sample data
# -----------------------------------
new_students = pd.DataFrame([
    [5, 1, 2, 1],  # Likely Visual
    [1, 5, 3, 2],  # Likely Auditory
    [2, 1, 5, 2],  # Likely Reading/Writing
    [1, 2, 1, 5],  # Likely Kinesthetic
], columns=features)

# Predict encoded labels
predictions_encoded = gb_model.predict(new_students)

# Convert numeric predictions back to original labels
predictions = le.inverse_transform(predictions_encoded)

print("Predicted Learning Styles:")
print(predictions)

USE CASE 4: Disease classification using Gradient Boosting with scikit-learn.
Independent variables (4): Age, BloodPressure, Cholesterol, FamilyHistory
Dependent variable: Disease

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# -----------------------------------
# 1. Load data from Excel
# -----------------------------------
# Sample data can be exported to Excel via the
# "Get Disease Classification in Excel" link
df = pd.read_excel("patient_dosage_response.xlsx")
print("Dataset Preview:")
print(df.head())

# -----------------------------------
# 2. Separate features and target
# -----------------------------------
features = ["Age", "BloodPressure", "Cholesterol", "FamilyHistory"]
X = df[features]
y = df["Disease"]

# -----------------------------------
# 3. Split into training and testing
# -----------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------------
# 4. Train the Gradient Boosting model
# -----------------------------------
gb_model = GradientBoostingClassifier(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # step size
    max_depth=3,        # tree depth
    random_state=42
)
gb_model.fit(X_train, y_train)

# -----------------------------------
# 5. Evaluate on the test set
# -----------------------------------
y_pred = gb_model.predict(X_test)
# Second column = probability of the positive class (assumes a binary Disease label)
y_prob = gb_model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_prob))

# -----------------------------------
# 6. Predict for new patients
# -----------------------------------
new_patients = pd.DataFrame([
    [45, 150, 230, 1],  # High risk
    [28, 118, 175, 0],  # Low risk
], columns=features)

predictions = gb_model.predict(new_patients)
print("Disease Predictions:")
print(predictions)
