ML - Payer Prediction
Introduction
This document describes the process of loading and using an existing machine learning model to identify credit payments and subsequently determine the insurance payer names. The pipeline combines a RandomForestClassifier with TF-IDF vectorization and cosine similarity for processing and prediction. The process covers loading the model and vectorizer, filtering transactions, normalizing narrations, vectorizing the text, and making predictions.
Import Required Libraries
The necessary libraries are imported to handle data manipulation, model loading, and predictions.
import pandas as pd
import re
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
Explanation:
pandas: Used for data manipulation and analysis.
re: Provides support for regular expressions.
pickle: Used for loading and saving the machine learning model and vectorizer.
sklearn: Contains various machine learning algorithms and utilities.
RandomForestClassifier: Used for classification tasks.
TfidfVectorizer: Converts text to a matrix of TF-IDF features.
cosine_similarity: Computes the cosine similarity between two vectors.
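As a quick standalone illustration of how TfidfVectorizer and cosine_similarity work together, the toy snippet below (not part of the pipeline; demo_vectorizer, demo_matrix, and the sample texts are illustrative) scores the similarity of two short strings.
# Toy example: fit TF-IDF on two short texts and score their similarity
demo_vectorizer = TfidfVectorizer(lowercase=True)
demo_matrix = demo_vectorizer.fit_transform(["uiic chennai claim", "uiic claim settlement"])
# Prints a 1x1 array; the entry lies between 0 and 1, and higher means more similar
print(cosine_similarity(demo_matrix[0], demo_matrix[1]))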
Load the Pre-Trained Model and Vectorizer
The previously saved model and TF-IDF vectorizer are loaded from their respective pickle files.
with open('BT_2_Credit.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)

with open('tfidf_vectorizer_1.pkl', 'rb') as vectorizer_file:
    loaded_vectorizer = pickle.load(vectorizer_file)
BT_2_Credit.pkl: The file containing the pre-trained RandomForestClassifier model.
tfidf_vectorizer_1.pkl: The file containing the pre-trained TF-IDF vectorizer.
Load the Dataset
The bank transaction data is loaded from a CSV file.
# Load the dataset
file_path = 'BankTransactionStaging.csv'
df = pd.read_csv(file_path)
Filter Transactions with Deposit Value > 0
Only the transactions with deposit values greater than zero are considered.
# Filter transactions with deposit value > 0
df_filtered = df[df['Deposit'] > 0].copy()
Process Narration Text
A function is defined to process the 'Narration' column, removing specific unwanted characters and normalizing the text.
# Function to process Narration
def process_narration(text):
    # Remove leading and trailing whitespace
    cleaned_text = text.strip()
    # Remove a leading "NEFT" prefix (case insensitive)
    if cleaned_text.lower().startswith("neft"):
        cleaned_text = cleaned_text[4:].strip()
    # Define symbols to remove
    symbols_to_remove = '*-.,?/()\'":'
    # Remove the specified symbols
    cleaned_text = ''.join([char for char in cleaned_text if char not in symbols_to_remove])
    return cleaned_text.strip().lower()
# Process the Narration column in the new dataset
df_filtered['normalized_narration'] = df_filtered['Narration'].apply(process_narration)
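For example, with a hypothetical narration (not taken from the dataset), process_narration("NEFT*TATA AIG-GIC LTD.") returns "tata aiggic ltd": the NEFT prefix and the listed symbols are removed and the text is lowercased.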
Perform TF-IDF Vectorization
The normalized_narration column is transformed with the pre-trained TF-IDF vectorizer loaded earlier, so the features match the space the classifier was trained on.
# Transform the normalized narration using the loaded TF-IDF vectorizer
X = loaded_vectorizer.transform(df_filtered['normalized_narration'])
Making Predictions
The pre-trained RandomForestClassifier model predicts whether each transaction is an insurance payment based on the TF-IDF matrix.
# Make predictions using the loaded model
y_pred = loaded_model.predict(X)
Filtering Predicted Insurance Transactions
Keeps only the transactions predicted as insurance payments (is_insurance_pred == 1).
df_filtered['is_insurance_pred'] = y_pred
df_filtered = df_filtered[df_filtered['is_insurance_pred'] == 1]
Cleaning Narration
clean_narration: Further cleans and normalizes the narrations by removing specific patterns and replacing them with standard insurance company names.
# Function to clean NEFT descriptions
def clean_narration(bank_narra):
    bank_narra = str(bank_narra)
    # Remove "NEFT" prefix
    if bank_narra.startswith("NEFT"):
        bank_narra = bank_narra[4:].strip()
    bank_narra = re.sub(r'(?i)agarwal.*', '', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?UIIC.*', 'UIIC', bank_narra)
    bank_narra = re.sub(r'(?i).*?BHEL.*', 'BHEL', bank_narra)
    bank_narra = re.sub(r'(?i).*?ABHICL.*', 'ABHICL', bank_narra)
    bank_narra = re.sub(r'(?i).*?TATA AIG.*', 'TATA AIG Health Insurance', bank_narra)
    bank_narra = re.sub(r'(?i).*?SBI GENE.*', 'SBI GENERAL INSURANCE CO LTD', bank_narra)
    bank_narra = re.sub(r'(?i).*?SBIGEN.*', 'SBI GENERAL INSURANCE CO LTD', bank_narra)
    bank_narra = re.sub(r'(?i).*?NATIONAL INSU.*', 'NATIONAL INSURANCE COMPANY', bank_narra)
    bank_narra = re.sub(r'(?i).*?ACKO.*', 'ACKO GENERAL', bank_narra)
    bank_narra = re.sub(r'(?i).*?BHARATSANCHARNIGAMLD.*', 'BSNL', bank_narra)
    bank_narra = re.sub(r'(?i).*?GoDigitG*', 'GoDigitG', bank_narra)
    bank_narra = re.sub(r'(?i).*?ISRO.*', 'Indian Space Research Organisation(ISRO)', bank_narra)
    bank_narra = re.sub(r'(?i).*?PAY AND ACC.*', 'CGHS', bank_narra)
    bank_narra = re.sub(r'(?i).*?AYUSHMAN.*', 'AYUSHMAN', bank_narra)
    bank_narra = re.sub(r'(?i).*?DIRECTORATE OF HEALTH SERVICE.*', 'Anishi Andaman', bank_narra)
    bank_narra = re.sub(r'(?i).*?MHFW.*', 'CGHS', bank_narra)
    bank_narra = re.sub(r'(?i).*?MMTC.*', 'MMTC', bank_narra)
    bank_narra = re.sub(r'(?i).*?R C F.*', 'RCF', bank_narra)
    bank_narra = re.sub(r'(?i).*?ORIENTAL INSURANC.*', 'THE ORIENTAL INSURANCE', bank_narra)
    bank_narra = re.sub(r'(?i).*?UNITED INDIA INSUR.*', 'UIIC', bank_narra)
    bank_narra = re.sub(r'(?i).*?MEDI ASSIST.*', 'Mediassist India TPA', bank_narra)
    bank_narra = re.sub(r'(?i).*?TREASURY OFFICE SECRETRIATE JAIPUR.*', 'RGHS (Rajasthan Govt Health Scheme)', bank_narra)
    bank_narra = re.sub(r'(?i).*?ARMY.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?MINISTRY OF.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?MIN OF DEF.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?CDA CHENN.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?PCDA.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?AAO KOLKAT.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?PCDARD.*', 'ECHS', bank_narra).strip()
    bank_narra = re.sub(r'(?i).*?MMFW.*', 'CGHS', bank_narra)
    bank_narra = re.sub(r'(?i).*?UTIITS.*', 'CGHS', bank_narra)
    bank_narra = re.sub(r'(?i).*?BAJAJ ALLIA.*', 'BAJAJ ALLIANZ GENERAL INSURANCE', bank_narra)
    bank_narra = re.sub(r'(?i).*?ICICI.*', 'ICICI Lombard General Insurance', bank_narra)
    bank_narra = re.sub(r'(?i).*?AIRPORTS AUTHO.*', 'AIRPORT AUTHORITY OF INDIA', bank_narra)
    bank_narra = re.sub(r'(?i)DR .*', '', bank_narra)
    return bank_narra.strip().upper()
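For instance, with a hypothetical narration, clean_narration("NEFT CR-UIIC CHENNAI CLAIMS") returns "UIIC", since the UIIC pattern collapses the whole narration to the standard payer name.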
Preparing for TF-IDF Vectorization
Applies clean_narration to the normalized narrations and extracts the result as a list for further processing.
# Apply clean_narration and prepare the data for TF-IDF Vectorization
df_filtered['cleaned_normalized_narration'] = df_filtered['normalized_narration'].apply(clean_narration)
narrations = df_filtered['cleaned_normalized_narration'].tolist()
Reference Insurance Company Names
A list of known insurance company names used to identify insurance payers.
# Reference insurance company names
insurance_companies = [
    "Health India Insurance TPA Services Pvt Ltd", "ZUNO GENERAL INSURANCE", "MAHANADI COAL FIELDS",
    "JAWAHARLAL NEHRU PORT AUTHORITY", "PARADIP PORT TRUST", "V.O.C CHIDAMARANAR PORT TRUST", "BHEL", "YASHASVINI",
    "Go Digit General Insurance", "UNIVERSAL SOMPO GENERAL INSURANCE", "UIIC", "ABHICL", "SVAAS WELLNESS LIMIT",
    "ORDINANCE FACTORY", "MMTC", "RCF", "TPA Receipts"
]
Ensure all elements in both lists are strings
Converts all elements in narrations and insurance_companies lists to strings, ensuring uniform data type for further processing.
# Ensure all elements in both lists are strings
narrations = [str(desc) for desc in narrations]
insurance_companies = [str(company) for company in insurance_companies]
print(len(narrations),len(insurance_companies))
TF-IDF Vectorization
Uses TF-IDF vectorization to transform the cleaned narrations and the reference insurance company names into numerical representations for the similarity calculations. A new vectorizer is fit on both lists together so that the two sets share a common feature space.
# Perform TF-IDF Vectorization
vectorizer = TfidfVectorizer(lowercase=True)
tfidf_matrix = vectorizer.fit_transform(narrations + insurance_companies)
Cosine Similarity Between Narrations and Insurance Companies
Calculates the cosine similarity between each NEFT narration and every insurance company name based on their TF-IDF representations. The company with the highest similarity score is assigned to each narration, and the scores are converted to percentages to document the strength of each match.
# Compute cosine similarity between NEFT descriptions and insurance companies
cosine_similarities = cosine_similarity(tfidf_matrix[:len(narrations)], tfidf_matrix[len(narrations):])
# Assign the closest matching insurance company to each NEFT description
matched_indices = cosine_similarities.argmax(axis=1)
matched_insurance_companies = [insurance_companies[idx] for idx in matched_indices]
# Extract similarity percentages
similarity_percentages = [cosine_similarities[i, idx] for i, idx in enumerate(matched_indices)]
# Convert similarity percentages to a percentage format
similarity_percentages = [similarity * 100 for similarity in similarity_percentages]
Initialize the List for Matched Insurance Companies and Iterate over Cosine Similarities
A list is initialized to store the matched insurance company for each narration, or "TPA Receipts" when no match is strong enough.
Each row of the cosine similarity matrix is scanned: if the maximum similarity is below 0.1, the narration is labelled "TPA Receipts"; otherwise the most similar insurance company is assigned. This threshold-based pass replaces the plain argmax assignment above.
# Initialize the list to store the matched insurance companies or "TPA Receipts"
matched_insurance_companies = []
# Iterate over each row in the cosine similarities matrix
for similarities in cosine_similarities:
    max_similarity = np.max(similarities)
    if max_similarity < 0.1:
        matched_insurance_companies.append("TPA Receipts")
    else:
        matched_insurance_companies.append(insurance_companies[similarities.argmax()])
Ensure Consistent Length with DataFrame
Ensures that the matched_insurance_companies list has the same length as the filtered DataFrame df_filtered.
# Ensure matched_insurance_companies has the same length as df_filtered
num_rows = len(df_filtered)
matched_insurance_companies = matched_insurance_companies[:num_rows]
Assign Matched Insurance Companies and Matched % to DataFrame
Assigns the matched insurance company names and their respective similarity percentages to the DataFrame df_filtered.
# Assign to the DataFrame
df_filtered['Matched Insurance Company'] = matched_insurance_companies
df_filtered['Matched %'] = similarity_percentages
Label Encoding of Target Variable
Encodes the categorical variable 'Matched Insurance Company' using LabelEncoder to prepare it for model training.
# Label encode the target variable
label_encoder = LabelEncoder()
df_filtered['Encoded'] = label_encoder.fit_transform(df_filtered['Matched Insurance Company'])
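As a rough illustration with hypothetical values (the actual integers depend on the company names present in the data), the encoder maps each distinct name to an integer and can map it back later:
# Hypothetical illustration of LabelEncoder behaviour
# label_encoder.classes_                -> e.g. array(['ABHICL', 'BHEL', 'UIIC', ...])
# label_encoder.transform(['UIIC'])     -> e.g. array([2])
# label_encoder.inverse_transform([2])  -> e.g. array(['UIIC'])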
Split Data into Training and Testing Sets
Splits the TF-IDF matrix and encoded labels into training and testing sets for model training and evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix[:len(narrations)], df_filtered['Encoded'], test_size=0.2, random_state=42)
Define Models for Evaluation
Defines a dictionary of candidate models to evaluate: a RandomForestClassifier and an SVM (SVC, imported above).
models = {'Random Forest': RandomForestClassifier(), 'SVM': SVC()}
Train and Evaluate Models
Trains each model, evaluates its accuracy, and generates a classification report on the test set.
# Train and evaluate models
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    class_report = classification_report(y_test, y_pred)
    results[model_name] = {
        'accuracy': accuracy,
        'classification_report': class_report
    }
    print(f"Model: {model_name}")
    print(f"Accuracy: {accuracy}%")
Identify the Best Performing Model
Identifies the best performing model based on accuracy score.
best_model_name = max(results, key=lambda name: results[name]['accuracy'])
best_model = models[best_model_name]
Predictions and Comparison
Uses the best performing model to make predictions on the test set, decodes predicted and actual labels, and creates a DataFrame to compare them.
# Use the best model to make predictions on the test set
best_model.fit(X_train, y_train)
y_pred_best = best_model.predict(X_test)
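A minimal sketch of the decoding and comparison step described above, assuming the label_encoder fitted earlier; comparison_df is an illustrative name:
# Decode the predicted and actual labels and compare them side by side
comparison_df = pd.DataFrame({
    'Actual': label_encoder.inverse_transform(y_test),
    'Predicted': label_encoder.inverse_transform(y_pred_best)
})
print(comparison_df.head())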
Save Model Artifacts
Saves the best performing model, TF-IDF vectorizer, and label encoder using pickle for future use.
# Save the best model, TF-IDF vectorizer, and label encoder using pickle
with open(f"{best_model_name}.pkl", 'wb') as model_file:
    pickle.dump(best_model, model_file)

with open('vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)

with open('label_encoder.pkl', 'wb') as encoder_file:
    pickle.dump(label_encoder, encoder_file)
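For later use, the artifacts can be reloaded the same way the credit model was loaded at the start of this document, for example (illustrative variable names; filenames follow the dump calls above):
# Illustrative reload of the saved artifacts
with open(f"{best_model_name}.pkl", 'rb') as model_file:
    payer_model = pickle.load(model_file)
with open('vectorizer.pkl', 'rb') as vectorizer_file:
    payer_vectorizer = pickle.load(vectorizer_file)
with open('label_encoder.pkl', 'rb') as encoder_file:
    payer_label_encoder = pickle.load(encoder_file)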
Conclusion
This document has outlined the process of using a pre-trained machine learning model to identify credit payments and determine insurance payer names from bank transaction narrations. The approach combines a RandomForestClassifier trained on TF-IDF vectorized narrations with cosine similarity matching of cleaned NEFT descriptions against known insurance company names.