ML - Finding Credit Payments
Introduction
The purpose of this project is to develop a machine learning model that filters credit payments from bank transactions. The model processes a dataset containing bank transaction records, identifying transactions that are credit payments based on specific criteria. This automated process improves efficiency and accuracy in financial operations, particularly for identifying insurance-related transactions.
Data Collection
The dataset used for this project is a CSV file named BankTransactionStaging.csv, containing bank transaction records. The dataset includes the following columns relevant to this task:
Based On: Indicates the basis of the transaction (e.g. ClaimBook, Settlement Advice, OR file, Manual).
Narration: Detailed description of the transaction.
Deposit: The deposit value of the transaction.
Transactions with a "Based On" value of "ClaimBook," "Settlement Advice," "OR file," or "Manual" are already considered credit payments.
# Load data
import pandas as pd
data = pd.read_csv('BankTransactionStaging.csv', low_memory=False)
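The rule stated just before the loading snippet, that the four listed "Based On" values mark a transaction as a credit payment, can be sketched with the standard library alone. This is only an illustration: the sample rows are invented, and the names KNOWN_BASES, SAMPLE_CSV and is_known_credit_payment are hypothetical and do not appear in the project code.

```python
import csv
import io

# The four "Based On" values the document lists as credit payments.
KNOWN_BASES = {"ClaimBook", "Settlement Advice", "OR file", "Manual"}

# Invented sample rows; the real input is BankTransactionStaging.csv.
SAMPLE_CSV = """Based On,Narration,Deposit
ClaimBook,NEFT-STAR HEALTH INS,1500.00
,SALARY CREDIT,42000.00
Manual,CHEQUE DEPOSIT,300.00
"""

def is_known_credit_payment(row):
    """Return True when the row's 'Based On' value is one of the listed bases."""
    based_on = (row.get("Based On") or "").strip()
    return based_on in KNOWN_BASES

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
flags = [is_known_credit_payment(row) for row in rows]
print(flags)
```

Note that this check is stricter than a plain "non-empty string" test: a row with an unexpected "Based On" value would be flagged False here but would count as positive under a non-empty test.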
Data Preprocessing
Filtering Transactions: Transactions with a deposit value greater than 0 were selected for further processing.
Creating Target Variable: A new binary column is_insurance was created to indicate whether a transaction is insurance-related or not. This is based on the presence of a non-empty string in the "Based On" column.
Processing Descriptions: The process_narration function was defined to:
Remove leading and trailing whitespace.
Remove a leading "NEFT" prefix (case insensitive).
Remove specified symbols (*-.,?/()'":).
Convert the text to lowercase.
# Filter transactions with deposit value > 0 and include both "Insurance Pattern" and other patterns
filtered_data = data[data['Deposit'] > 0].copy()
# Create a new binary column for the target variable
filtered_data['is_insurance'] = filtered_data['Based On'].apply(lambda x: 0 if not isinstance(x, str) or x.strip() == '' else 1)
# Function to process description
def process_narration(text):
    # Treat missing or non-string values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    # Remove leading and trailing whitespace
    cleaned_text = text.strip()
    # Remove a leading "NEFT" prefix (case insensitive)
    if cleaned_text.lower().startswith("neft"):
        cleaned_text = cleaned_text[4:].strip()
    # Define symbols to remove
    symbols_to_remove = '*-.,?/()\'":'
    # Remove specified symbols
    cleaned_text = ''.join(char for char in cleaned_text if char not in symbols_to_remove)
    return cleaned_text.strip().lower()

filtered_data['normalized_narration'] = filtered_data['Narration'].apply(process_narration)
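To see what the cleanup steps do to realistic text, the following sketch applies the same steps to a few invented narrations. It re-implements the steps with str.translate purely for compactness; clean_narration and the sample strings are hypothetical and are not part of the project code.

```python
# Symbols the document lists for removal.
SYMBOLS_TO_REMOVE = '*-.,?/()\'":'
_DELETE_SYMBOLS = str.maketrans('', '', SYMBOLS_TO_REMOVE)

def clean_narration(text):
    """Strip whitespace, drop a leading NEFT prefix, delete symbols, lowercase."""
    if not isinstance(text, str):
        return ''
    cleaned = text.strip()
    if cleaned.lower().startswith('neft'):
        cleaned = cleaned[4:].strip()
    return cleaned.translate(_DELETE_SYMBOLS).strip().lower()

# Invented narrations for illustration only.
samples = [
    '  NEFT-HDFC/Star Health Ins. Co.  ',
    'neft ICICI LOMBARD (CLAIM)',
    'Salary Credit: March',
]
cleaned_samples = [clean_narration(s) for s in samples]
for original, cleaned in zip(samples, cleaned_samples):
    print(repr(original), '->', repr(cleaned))
```

Because symbols are deleted rather than replaced by spaces, "HDFC/Star" collapses into the single token "hdfcstar". Whether that merging is desirable depends on the narration formats in the data; replacing symbols with a space would be an alternative design.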
Feature Extraction
The text data in the normalized_narration column was vectorized using the TfidfVectorizer from scikit-learn, transforming the text descriptions into numerical features suitable for machine learning.
# Prepare features and target variable
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(filtered_data['normalized_narration'])
y = filtered_data['is_insurance']
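For intuition about the numbers the vectorizer produces, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes scikit-learn's documented default settings (raw term counts, smoothed IDF, L2 row normalization, tokens of at least two word characters) and is not the library's implementation. The function tfidf_rows and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# Tokens of two or more word characters (assumed default tokenization).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Invented, already-normalized narrations.
docs = ["star health claim", "star health premium", "salary credit"]
rows = tfidf_rows(docs)
print({term: round(weight, 3) for term, weight in rows[0].items()})
```

In the first document, "claim" occurs in only one narration while "star" and "health" occur in two, so "claim" receives the larger weight. Terms that distinguish a narration get more influence on the classifier than terms shared by many narrations.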
Model Selection
The model chosen for this task is a Random Forest Classifier, implemented using the scikit-learn library. This model was selected for its simplicity and effectiveness in handling classification tasks.
# Define the model to be trained
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
Training the Model
Splitting Data: The dataset was split into training and testing sets with an 80-20 split ratio.
Training: The Random Forest Classifier was trained on the training set.
Evaluation: The model's performance was evaluated using accuracy score and classification report on the test set.
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
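The idea behind a seeded 80-20 split can be illustrated with a small standard-library sketch. It does not reproduce the exact rows scikit-learn would select; split_indices, positive_rate and the label list are hypothetical and serve only to show how to check that both parts contain a similar share of positive examples.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out the last test_size share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[:-n_test], indices[-n_test:]

def positive_rate(labels, indices):
    """Share of rows labeled 1 among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)

# Invented, imbalanced labels: 5 positives out of 20 rows.
labels = [1] * 5 + [0] * 15
train_idx, test_idx = split_indices(len(labels))
print(len(train_idx), len(test_idx))
print(positive_rate(labels, train_idx), positive_rate(labels, test_idx))
```

If the positive rates of the two parts differ markedly, passing stratify=y to train_test_split keeps the class proportions the same in both sets.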
Model Evaluation
The model's performance was evaluated using the accuracy score and a detailed classification report, which provides precision, recall, and F1-score for each class.
# Train and evaluate the model
from sklearn.metrics import accuracy_score, classification_report

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred) * 100
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}%")
print(f"Classification Report:\n{class_report}")
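The metrics in the classification report can be recomputed by hand, which helps when interpreting them. The sketch below uses invented labels and a hypothetical helper, binary_metrics. It also scores a baseline that predicts "not insurance" for every row, to show why accuracy alone can look good on imbalanced data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels: 3 insurance-related rows out of 10.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true_demo, y_pred_demo)
baseline = binary_metrics(y_true_demo, [0] * len(y_true_demo))
print(metrics)
print(baseline)
```

In this toy data the always-negative baseline still reaches 70% accuracy while its recall for the insurance class is zero, so the per-class precision and recall are the numbers to watch.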
Results
Accuracy: The model achieved an accuracy of 95% on the test set.
Classification Report: Detailed classification metrics are provided for each class (insurance-related and not insurance-related).
Deployment
The trained model and the TF-IDF vectorizer were saved using the pickle module, allowing for future use without retraining.
import pickle

# Save the trained model
with open('BT_2_Credit.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
# Save the TF-IDF vectorizer
with open('tfidf_vectorizer_1.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)
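Using the saved files later requires loading both objects and applying the fitted vectorizer with transform rather than fit_transform, so that new text is mapped onto the vocabulary learned during training. The following is a minimal sketch under that assumption; load_artifacts and classify_narrations are hypothetical helper names, and the default clean function is only a placeholder for the project's own narration cleanup.

```python
import pickle

def load_artifacts(model_path, vectorizer_path):
    """Unpickle the classifier and the fitted TF-IDF vectorizer."""
    with open(model_path, 'rb') as model_file:
        model = pickle.load(model_file)
    with open(vectorizer_path, 'rb') as vectorizer_file:
        vectorizer = pickle.load(vectorizer_file)
    return model, vectorizer

def classify_narrations(narrations, model, vectorizer, clean=str.lower):
    """Clean each narration, vectorize with the fitted vectorizer, and predict."""
    cleaned = [clean(text) for text in narrations]
    features = vectorizer.transform(cleaned)  # transform only, never refit here
    return list(model.predict(features))

# Example usage (needs the two .pkl files and scikit-learn installed):
# model, vectorizer = load_artifacts('BT_2_Credit.pkl', 'tfidf_vectorizer_1.pkl')
# labels = classify_narrations(['NEFT-STAR HEALTH INS'], model, vectorizer,
#                              clean=process_narration)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. A model pickled under one scikit-learn version is also not guaranteed to load under another, so recording the library version alongside the files is a sensible precaution.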
Conclusion
The model identifies credit payments among bank transactions, focusing on insurance-related deposits, and reached 95% accuracy on the held-out test set. Text preprocessing of the narration, TF-IDF features, and a Random Forest Classifier were sufficient to obtain this result.