In this tutorial, we will learn how to use Python and deep learning on a CSV dataset, and how to use the SMOTE method to convert imbalanced data into balanced data.
Credit Card Fraud Detection using Logistic Regression, Naive Bayes, Decision Tree, Random Forest, an Artificial Neural Network, and a Convolutional Neural Network in Python.
In Machine Learning and Data Science, we often come across the term Imbalanced Data Distribution. It occurs when the number of observations in one class is much higher or lower than in the other classes, as in Fraud Detection, Anomaly Detection, etc.
WHAT IS SMOTE?
There are two main methods for converting an imbalanced dataset into a balanced one: SMOTE and the Near Miss algorithm. Here we use SMOTE (Synthetic Minority Oversampling Technique). It balances the class distribution by generating additional samples of the minority class until both classes have the same number of observations. For example, if class 1 has 10,000 samples and class 2 is the minority class with only 200 samples, then after applying SMOTE both classes will have 10,000 samples each.
If you want to learn more, the link below is a great resource; it also covers how to use SMOTE for both regression and classification.
Dataset = https://drive.google.com/file/d/18CESTWFo6l1vTlWPGvL2ZmrqDUS_ICoD/view?usp=sharing
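Since SMOTE only appears later in the pipeline, here is a minimal, self-contained sketch of what it does on a small synthetic dataset. The toy data, class weights, and random_state here are illustrative assumptions and are not part of this project's code:

# A minimal SMOTE illustration on made-up data (not the credit card dataset).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Create an imbalanced toy dataset: roughly 95% class 0 and 5% class 1.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=42)
print("Before SMOTE:", Counter(y_toy))

# SMOTE generates new minority-class samples until both classes are equal in size.
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_toy, y_toy)
print("After SMOTE:", Counter(y_res))   # both classes now have the same count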
# The code below is applied directly to the dataset; we have not applied SMOTE yet,
# because we first want to see how the models behave on the hugely imbalanced data.
import pandas as pd

train_df = pd.read_csv("/kaggle/input/creditcard.csv")
X = train_df.drop(columns=['Class'])
y = train_df['Class']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)   # reuse the scaler fitted on the training set; do not refit on the test set

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
The above code simply builds our independent variables (X_train, X_test) and our dependent variable (y_train, y_test). The dependent variable is the 'Class' column, which contains 0 for a normal (non-fraud) transaction and 1 for a fraudulent transaction.
# To see how the columns of the dataset are correlated with each other
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20, 14))
corr = X.corr()
sns.heatmap(corr)
# Fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
import seaborn as sns
sns.heatmap(cm, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('logistic regression:', accuracy_score(y_test, y_pred))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred))
print('precision_score:', precision_score(y_test, y_pred))
print('recall_score:', recall_score(y_test, y_pred))
# Fitting the Naive Bayes classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred2 = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_test, y_pred2)
import seaborn as sns
sns.heatmap(cm2, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('naive bayes:', accuracy_score(y_test, y_pred2))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred2))
print('precision_score:', precision_score(y_test, y_pred2))
print('recall_score:', recall_score(y_test, y_pred2))
# Fitting the Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred3 = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm3 = confusion_matrix(y_test, y_pred3)
import seaborn as sns
sns.heatmap(cm3, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('decision tree:', accuracy_score(y_test, y_pred3))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred3))
print('precision_score:', precision_score(y_test, y_pred3))
print('recall_score:', recall_score(y_test, y_pred3))
# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred4 = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm4 = confusion_matrix(y_test, y_pred4)
import seaborn as sns
sns.heatmap(cm4, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('random forest:', accuracy_score(y_test, y_pred4))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred4))
print('precision_score:', precision_score(y_test, y_pred4))
print('recall_score:', recall_score(y_test, y_pred4))
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()
classifier.add(Dense(10, activation='relu', input_dim=30))
classifier.add(Dense(10, activation='relu'))
classifier.add(Dense(1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.summary()

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size=1000, epochs=20)

# Predicting the test set results
y_pred5 = classifier.predict(X_test).round()

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm5 = confusion_matrix(y_test, y_pred5)
sns.heatmap(cm5, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('ANN:', accuracy_score(y_test, y_pred5))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred5))
print('precision_score:', precision_score(y_test, y_pred5))
print('recall_score:', recall_score(y_test, y_pred5))
Logistic regression accuracy: 0.9991924440855307
precision_score: 0.8767123287671232
recall_score: 0.6336633663366337
confusion matrix: [[56852     9]
 [   37    64]]

Naive Bayes accuracy: 0.9787402127734279
precision_score: 0.06708268330733229
recall_score: 0.8514851485148515
confusion matrix: [[55665  1196]
 [   15    86]]

Decision tree accuracy: 0.9991222218320986
precision_score: 0.7383177570093458
recall_score: 0.7821782178217822
confusion matrix: [[56833    28]
 [   22    79]]

Random forest accuracy: 0.9994557775359011
precision_score: 0.9166666666666666
recall_score: 0.7623762376237624
confusion matrix: [[56854     7]
 [   24    77]]

ANN accuracy: 0.973203110845827
precision_score: 0.845360824742268
recall_score: 0.8118811881188119
confusion matrix: [[56846    15]
 [   19    82]]
Now we are going to apply SMOTE and then rerun all of the above classification methods.
y.value_counts()
OUTPUT:-
0 284315
1 492
Name: Class, dtype: int64
We clearly have imbalanced data, which is very common when dealing with fraud detection.
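As a quick optional check (not part of the original code), you can express the class counts as percentages to quantify how skewed the data is:

# Show each class as a percentage of all transactions.
print(y.value_counts(normalize=True) * 100)
# Class 0 (normal) is about 99.83% of the rows; class 1 (fraud) is only about 0.17% (492 of 284807).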
fraud = train_df[train_df['Class'] == 1]
valid = train_df[train_df['Class'] == 0]

print("Fraud transaction statistics")
print(fraud["Amount"].describe())
print("\nNormal transaction statistics")
print(valid["Amount"].describe())
OUTPUT:-
Fraud transaction statistics
count 492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
Normal transaction statistics
count 284315.000000
mean 88.291022
std 250.105092
min 0.000000
25% 5.650000
50% 22.000000
75% 77.050000
max 25691.160000
Name: Amount, dtype: float64
As you can see in the above output, the mean amount of the fraud transactions is higher than the mean amount of the normal transactions.
print("before applying smote:",format(sum(y_train == 1))) print("before applying smote:",format(sum(y_train == 0))) # import SMOTE module from imblearn library from imblearn.over_sampling import SMOTE sm = SMOTE(random_state=2) X_train, y_train = sm.fit_sample(X_train, y_train) print('After applying smote X_train: {}\n'.format(X_train.shape)) print('After applying smote y_train: {}\n'.format(y_train.shape)) print("After applying smote label '1': {}\n".format(sum(y_train == 1))) print("After applying smote label '0': {}\n".format(sum(y_train == 0)))
OUTPUT:-
before applying smote, label '1': 391
before applying smote, label '0': 227454
After applying smote X_train: (454908, 30)
After applying smote y_train: (454908,)
After applying smote label '1': 227454
After applying smote label '0': 227454
As the above output shows, before applying SMOTE the fraud class (label = 1) had only 391 training samples; after applying SMOTE (which is an oversampling method), the fraud and normal classes contain the same number of samples. SMOTE balances the class distribution by generating additional minority-class examples.
Below are the outputs of all of the above classification methods after applying SMOTE:
SMOTE + logistic regression accuracy: 0.975667989185773
precision_score: 0.6440677966101695
recall_score: 0.9405940594059405
confusion matrix: [[53513  1403]
 [    1    97]]

SMOTE + naive bayes accuracy: 0.9764931006636003
precision_score: 0.6285310734463277
recall_score: 0.8811881188118812
confusion matrix: [[56000  1303]
 [   12    89]]

SMOTE + decision tree accuracy: 0.9964713317650363
precision_score: 0.31343283582089554
recall_score: 0.8316831683168316
confusion matrix: [[57000  1802]
 [   17    84]]

SMOTE + random forest accuracy: 0.999420666409185
precision_score: 0.8469387755102041
recall_score: 0.8217821782178217
confusion matrix: [[57001    18]
 [   18    83]]

SMOTE + ANN accuracy: 0.9982362206383203
precision_score: 0.31386861313868614
recall_score: 0.8514851485148515
confusion matrix: [[57004  1902]
 [   15    86]]
OUTPUT of the ANN (after applying SMOTE):-
Epoch 1/20  455/455 [==============================] - 2s 3ms/step - loss: 0.2695 - accuracy: 0.8751
Epoch 2/20  455/455 [==============================] - 2s 4ms/step - loss: 0.0938 - accuracy: 0.9611
Epoch 3/20  455/455 [==============================] - 2s 3ms/step - loss: 0.0606 - accuracy: 0.9770
Epoch 4/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0434 - accuracy: 0.9852
Epoch 5/20  455/455 [==============================] - 2s 3ms/step - loss: 0.0325 - accuracy: 0.9898
Epoch 6/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0258 - accuracy: 0.9928
Epoch 7/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0215 - accuracy: 0.9942
Epoch 8/20  455/455 [==============================] - 2s 4ms/step - loss: 0.0184 - accuracy: 0.9952
Epoch 9/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0163 - accuracy: 0.9958
Epoch 10/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0147 - accuracy: 0.9963
Epoch 11/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0135 - accuracy: 0.9967
Epoch 12/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0125 - accuracy: 0.9971
Epoch 13/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0117 - accuracy: 0.9973
Epoch 14/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0111 - accuracy: 0.9975
Epoch 15/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0105 - accuracy: 0.9977
Epoch 16/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0100 - accuracy: 0.9979
Epoch 17/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0096 - accuracy: 0.9979
Epoch 18/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0092 - accuracy: 0.9980
Epoch 19/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0088 - accuracy: 0.9981
Epoch 20/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0085 - accuracy: 0.9982
SMOTE+ANN: 0.9982362206383203
classification_report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56861
           1       0.06      0.94      0.12       101

    accuracy                           0.98     56962
   macro avg       0.53      0.96      0.55     56962
weighted avg       1.00      0.98      0.99     56962
If you compare the outputs with and without SMOTE, the accuracies are quite similar (the accuracy may even decrease slightly after applying SMOTE), but the recall is much better after applying SMOTE than before.
NOTE:-
high recall + high precision: the class is perfectly handled by the model
low recall + high precision: the model cannot detect the class well, but when it does, it is highly trustworthy
high recall + low precision: the class is well detected, but the model also includes points from other classes in it
low recall + low precision: the class is poorly handled by the model
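To make the note above concrete, here is a minimal sketch (using made-up predictions, not this project's results) showing how precision and recall are read off a confusion matrix:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for 10 transactions (1 = fraud, 0 = normal).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)                      # TP: 3 FP: 2 FN: 1 TN: 4

# precision = TP / (TP + FP): how trustworthy a "fraud" prediction is.
print("precision:", tp / (tp + fp), precision_score(y_true, y_hat))    # 0.6
# recall = TP / (TP + FN): how many real frauds the model actually catches.
print("recall:", tp / (tp + fn), recall_score(y_true, y_hat))          # 0.75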
Now I am going to apply a convolutional neural network, for which we first have to reshape the data, because Conv1D expects a 3D input of shape (samples, timesteps, channels).
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
X_train.shape, X_test.shape
OUTPUT:-
((454908, 30, 1), (56962, 30, 1))
MODEL:-
Here I have used several layer types, namely Conv1D, BatchNormalization, Dropout, Flatten, and Dense, together with activation functions and an optimizer.
1) Conv1D:- This layer creates a convolution kernel that is convolved with the layer input over a single dimension to produce a tensor of outputs.
2) BatchNormalization:- This layer standardizes its inputs so that they have a mean of 0 and a standard deviation of 1.
3) Dropout:- This layer randomly switches off a fraction of the neurons during training to prevent overfitting.
4) Flatten:- This layer converts the 2D matrix of features into a vector so that it can be fed into the Dense layers.
5) Dense:- This layer is a fully connected layer.
6) ReLU (rectified linear unit):- An activation function defined as y = max(0, x).
7) Sigmoid:- The sigmoid output lies between 0 and 1, so it is used for binary outputs; for a multi-class classification problem the softmax function is used instead (see the short sketch after this list).
8) Adam:- Adam is an optimizer used to update the weights during training.
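As a small illustration of point 7 (a hedged sketch, not part of the original tutorial; the number of classes is an arbitrary example), the output layer changes depending on the problem:

from tensorflow.keras.layers import Dense

# Binary classification (fraud vs. normal): one unit with sigmoid, trained with binary_crossentropy.
binary_output = Dense(1, activation='sigmoid')

# Multi-class classification (e.g. a hypothetical 5-class problem): one unit per class with softmax,
# trained with categorical_crossentropy (or sparse_categorical_crossentropy for integer labels).
multiclass_output = Dense(5, activation='softmax')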
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# Initialising the CNN
classifier = tf.keras.models.Sequential()
classifier.add(tf.keras.layers.Convolution1D(32, 2, activation='relu', input_shape=X_train[0].shape))
classifier.add(tf.keras.layers.BatchNormalization())
classifier.add(tf.keras.layers.Dropout(0.2))
classifier.add(tf.keras.layers.Convolution1D(64, 2, activation='relu'))
classifier.add(tf.keras.layers.BatchNormalization())
classifier.add(tf.keras.layers.Dropout(0.2))
classifier.add(tf.keras.layers.Convolution1D(128, 2, activation='relu'))
classifier.add(tf.keras.layers.BatchNormalization())
classifier.add(tf.keras.layers.Dropout(0.2))
classifier.add(tf.keras.layers.Flatten())
classifier.add(tf.keras.layers.Dense(units=256, activation='relu'))
classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

# 'lr' is deprecated in recent TensorFlow versions; use 'learning_rate' instead
classifier.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])
classifier.summary()

history = classifier.fit(X_train, y_train, batch_size=100, epochs=10,
                         validation_data=(X_test, y_test), verbose=1)

# Predicting the test set results
y_pred = classifier.predict(X_test).flatten().round()

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
import seaborn as sns
sns.heatmap(cm, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('CNN:', accuracy_score(y_test, y_pred))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
print('classification_report:', classification_report(y_test, y_pred))
print('f1_score:', f1_score(y_test, y_pred))
print('precision_score:', precision_score(y_test, y_pred))
print('recall_score:', recall_score(y_test, y_pred))
OUTPUT:-
Epoch 1/10  4550/4550 [==============================] - 26s 6ms/step - loss: 0.0029 - accuracy: 0.9991 - val_loss: 0.0105 - val_accuracy: 0.9988
Epoch 2/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0100 - val_accuracy: 0.9991
Epoch 3/10  4550/4550 [==============================] - 25s 6ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0110 - val_accuracy: 0.9989
Epoch 4/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0096 - val_accuracy: 0.9992
Epoch 5/10  4550/4550 [==============================] - 25s 6ms/step - loss: 0.0018 - accuracy: 0.9995 - val_loss: 0.0105 - val_accuracy: 0.9990
Epoch 6/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.0107 - val_accuracy: 0.9989
Epoch 7/10  4550/4550 [==============================] - 26s 6ms/step - loss: 0.0016 - accuracy: 0.9995 - val_loss: 0.0102 - val_accuracy: 0.9993
Epoch 8/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0013 - accuracy: 0.9996 - val_loss: 0.0109 - val_accuracy: 0.9990
Epoch 9/10  4550/4550 [==============================] - 29s 6ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 0.0119 - val_accuracy: 0.9989
Epoch 10/10 4550/4550 [==============================] - 27s 6ms/step - loss: 0.0013 - accuracy: 0.9996 - val_loss: 0.0103 - val_accuracy: 0.9993
classification_report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56861
           1       0.78      0.85      0.82       101

    accuracy                           1.00     56962
   macro avg       0.89      0.93      0.91     56962
weighted avg       1.00      1.00      1.00     56962
CNN + SMOTE: 0.9993153330290369
precision_score: 0.7818181818181819
recall_score: 0.8514851485148515
confusion matrix: [[56864     0]
 [   64    34]]
CONCLUSION:- In this project, we have learned how to apply SMOTE to an imbalanced dataset and how to train an artificial neural network and a convolutional neural network on it. The artificial neural network and the convolutional neural network fit the dataset best.
Submitted by Rahul Makwana (rahulmakwana)