ML - Finding Credit Payments

Introduction

The purpose of this project is to develop a machine learning model that filters credit payments from bank transactions. The model processes a dataset containing bank transaction records, identifying transactions that are credit payments based on specific criteria. This automated process improves efficiency and accuracy in financial operations, particularly for identifying insurance-related transactions.

Data Collection

The dataset used for this project is a CSV file named BankTransactionStaging.csv, containing bank transaction records. The dataset includes the following columns relevant to this task:

Based On: Indicates the basis of the transaction (e.g. ClaimBook, Settlement Advice, OR file, Manual).
Narration: Free-text description of the transaction (the column the preprocessing code reads).
Deposit: The deposit value of the transaction.

Transactions with a "Based On" value of "ClaimBook," "Settlement Advice," "OR file," or "Manual" are already considered credit payments.

# Load data
import pandas as pd

data = pd.read_csv('BankTransactionStaging.csv', low_memory=False)

Data Preprocessing

Filtering Transactions: Transactions with a deposit value greater than 0 were selected for further processing.
Creating Target Variable: A new binary column is_insurance was created to indicate whether a transaction is insurance-related or not. This is based on the presence of a non-empty string in the "Based On" column.
Processing Descriptions: The process_narration function was defined to:
- Remove leading and trailing whitespace.
- Remove a leading "NEFT" prefix (case-insensitive); occurrences elsewhere in the text are left alone.
- Remove specified symbols (*-.,?/()'":).
- Convert the text to lowercase.

# Keep transactions with a deposit value > 0 (both insurance-related and other rows)
filtered_data = data[data['Deposit'] > 0].copy()

# Create a new binary column for the target variable
filtered_data['is_insurance'] = filtered_data['Based On'].apply(lambda x: 0 if not isinstance(x, str) or x.strip() == '' else 1)

# Function to process a narration string
def process_narration(text):
    # Treat missing or non-string values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    # Remove leading and trailing whitespace
    cleaned_text = text.strip()
    # Remove a leading "NEFT" prefix (case-insensitive)
    if cleaned_text.lower().startswith("neft"):
        cleaned_text = cleaned_text[4:].strip()
    # Define symbols to remove
    symbols_to_remove = '*-.,?/()\'":'
    # Remove specified symbols
    cleaned_text = ''.join(char for char in cleaned_text if char not in symbols_to_remove)

    return cleaned_text.strip().lower()

filtered_data['normalized_narration'] = filtered_data['Narration'].apply(process_narration)
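
To make the cleaning steps concrete, the snippet below restates the string-cleaning logic so it runs on its own and applies it to a few narrations. The sample narrations are invented for illustration and are not taken from the dataset.

```python
def process_narration(text):
    """Restates the string-cleaning steps described above."""
    cleaned = text.strip()
    # Drop a leading "NEFT" prefix, whatever its case
    if cleaned.lower().startswith("neft"):
        cleaned = cleaned[4:].strip()
    symbols_to_remove = '*-.,?/()\'":'
    cleaned = ''.join(ch for ch in cleaned if ch not in symbols_to_remove)
    return cleaned.strip().lower()


# Hypothetical narrations, invented for illustration only
samples = [
    "  NEFT-ACME INSURANCE CO. LTD/CLAIM:12345 ",
    "neft *Star Health (Settlement)",
    "UPI/Refund/9876",
]

cleaned_samples = [process_narration(s) for s in samples]
for original, cleaned in zip(samples, cleaned_samples):
    print(repr(original), "->", repr(cleaned))
```

Note that symbols are deleted rather than replaced with spaces, so "LTD/CLAIM:12345" becomes the single token "ltdclaim12345". Whether that merging is desirable depends on the narrations in the real data.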

Feature Extraction

The text data in the normalized_narration column was vectorized using the TfidfVectorizer from scikit-learn, transforming the text descriptions into numerical features suitable for machine learning.

# Prepare features and target variable
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(filtered_data['normalized_narration'])
y = filtered_data['is_insurance']
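
As a small illustration of what the vectorizer produces, the sketch below fits TfidfVectorizer with default settings on three made-up narrations (assumptions, not rows from the dataset) and then transforms an unseen one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-normalized narrations
docs = [
    "star health settlement",
    "acme insurance claim",
    "salary credit acme",
]

vec = TfidfVectorizer()
X_demo = vec.fit_transform(docs)

# One row per narration, one column per distinct word seen during fitting
print(X_demo.shape)
print(sorted(vec.vocabulary_))

# Words that were not seen during fitting ("refund" here) are silently ignored
unseen = vec.transform(["acme refund"])
print(unseen.nnz)
```

The result is a sparse matrix, which RandomForestClassifier accepts directly. Because unseen words contribute nothing, narrations made up entirely of new vocabulary become all-zero rows.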

Model Selection

The model chosen for this task is a Random Forest Classifier, implemented using the scikit-learn library. This model was selected for its simplicity and effectiveness in handling classification tasks.

# Define the model to be trained
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
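
The model above is created with default settings, so repeated training runs can give slightly different forests. As a possible variation (not part of the original code), passing random_state makes training repeatable. The sketch uses synthetic data from make_classification purely as a stand-in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the bank transaction features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)


def fit_and_predict(seed):
    """Train a forest with a fixed seed and return its class probabilities."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_demo, y_demo)
    return clf.predict_proba(X_demo)


# Two runs with the same seed give identical probabilities
same = bool((fit_and_predict(42) == fit_and_predict(42)).all())
print(same)
```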

Training the Model

Splitting Data: The dataset was split into training and testing sets with an 80-20 split ratio.
Training: The Random Forest Classifier was trained on the training set.
Evaluation: The model's performance was evaluated using accuracy score and classification report on the test set.

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
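
If the two classes are unevenly represented, a plain random split can leave the test set with a different class mix than the full data. One option, shown here as a sketch on made-up labels and not something the original code does, is the stratify argument of train_test_split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives out of 100 rows
X_demo = list(range(100))
y_demo = [1] * 10 + [0] * 90

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the 10% positive rate of the full data
print(len(y_tr), sum(y_tr))
print(len(y_te), sum(y_te))
```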

Model Evaluation

The model's performance was evaluated using the accuracy score and a detailed classification report, which provides precision, recall, and F1-score for each class.

# Train and evaluate the model
from sklearn.metrics import accuracy_score, classification_report

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred) * 100
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}%")
print(f"Classification Report:\n{class_report}")
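
A confusion matrix can complement the report by showing the raw counts behind the per-class scores. The sketch below uses small hand-written label lists (hypothetical, not model output) to show how scikit-learn lays the matrix out.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = insurance-related)
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In the real script the same call would take y_test and y_pred.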

Results

Accuracy: The model achieved an accuracy of 95%.

Classification Report: Detailed classification metrics are provided for each class (insurance-related and not insurance-related).
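
For readers unfamiliar with the report's columns, the following worked example shows how precision, recall, and F1 are derived for one class. The counts are invented for illustration and are not the project's results.

```python
# Hypothetical counts for the insurance-related class
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these made-up counts, accuracy is high while recall is noticeably lower, which is why the per-class figures are worth reading alongside the headline accuracy.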

Deployment

The trained model and the TF-IDF vectorizer were saved using the pickle module, allowing for future use without retraining.

# Save the trained model
import pickle

with open('BT_2_Credit.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

# Save the TF-IDF vectorizer
with open('tfidf_vectorizer_1.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)
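
Using the saved files later might look like the sketch below. To stay runnable on its own it trains a toy model on made-up narrations and writes to a temporary directory; the file names inside it, the toy data, and the variable names are all assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the real training data (hypothetical)
texts = ["acme insurance claim", "star health settlement",
         "salary credit", "atm cash deposit"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    vec_path = os.path.join(tmp, "vectorizer.pkl")

    # Save both objects, as in the deployment step
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Load them back, as a later scoring job would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

# New narrations must be cleaned the same way as the training text,
# then transformed (not re-fitted) with the loaded vectorizer
new_narrations = ["acme insurance claim", "salary credit"]
features = loaded_vectorizer.transform(new_narrations)
predictions = loaded_model.predict(features)
print(list(predictions))
```

The vectorizer has to be saved together with the model because the model's input columns are defined by the vocabulary learned at training time. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.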

Conclusion

The model identifies insurance-related credit payments among bank transactions and reached 95% accuracy on the held-out test set. Text preprocessing, TF-IDF features, and a Random Forest Classifier were enough to achieve this result.