In this tutorial, we will learn how to use Python and deep learning on a CSV dataset, and how to use the SMOTE method to convert imbalanced data into balanced data.
Credit Card Fraud Detection using Logistic Regression, Naive Bayes, Decision Tree, Random Forest, an Artificial Neural Network, and a Convolutional Neural Network in Python.
In Machine Learning and Data Science, we often come across the term Imbalanced Data Distribution. It occurs when the number of observations in one class is much higher or lower than in the other classes, as in Fraud Detection, Anomaly Detection, etc.
WHAT IS SMOTE?
There are two main methods for converting an imbalanced dataset into a balanced one: SMOTE and the Near Miss algorithm. Here we use SMOTE (Synthetic Minority Oversampling Technique). It balances the class distribution by generating additional samples of the minority class until both classes have the same number of observations. For example, if class 1 has 10,000 samples and class 2 is the minority class with only 200 samples, then after applying SMOTE both classes will have 10,000 samples each.
If you want to learn more, the link below is a great resource; it also covers how to use SMOTE for both regression and classification.
Dataset = https://drive.google.com/file/d/18CESTWFo6l1vTlWPGvL2ZmrqDUS_ICoD/view?usp=sharing
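Since SMOTE only appears later in the pipeline, here is a minimal, self-contained sketch of what it does on a small synthetic dataset. The toy data, class weights, and random_state here are illustrative assumptions and are not part of this project's code:

# A minimal SMOTE illustration on made-up data (not the credit card dataset).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Create an imbalanced toy dataset: roughly 95% class 0 and 5% class 1.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=42)
print("Before SMOTE:", Counter(y_toy))

# SMOTE generates new minority-class samples until both classes are equal in size.
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_toy, y_toy)
print("After SMOTE:", Counter(y_res))   # both classes now have the same count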
# The code below is applied directly to the dataset; we have not applied SMOTE yet,
# because we first want to see how the models behave on the hugely imbalanced data.
import pandas as pd

train_df = pd.read_csv("/kaggle/input/creditcard.csv")
X = train_df.drop(columns=['Class'])
y = train_df['Class']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)   # reuse the scaler fitted on the training set; do not refit on the test set

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
The above code simply builds our independent variables (X_train, X_test) and our dependent variable (y_train, y_test). The dependent variable is the 'Class' column, which contains 0 for a normal (non-fraud) transaction and 1 for a fraudulent transaction.
# To see how the columns of the dataset are correlated with each other
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20, 14))
corr = X.corr()
sns.heatmap(corr)
# Fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
import seaborn as sns
sns.heatmap(cm, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('logistic regression:', accuracy_score(y_test, y_pred))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred))
print('precision_score:', precision_score(y_test, y_pred))
print('recall_score:', recall_score(y_test, y_pred))
# Fitting the Naive Bayes classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred2 = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_test, y_pred2)
import seaborn as sns
sns.heatmap(cm2, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('naive bayes:', accuracy_score(y_test, y_pred2))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred2))
print('precision_score:', precision_score(y_test, y_pred2))
print('recall_score:', recall_score(y_test, y_pred2))
# Fitting the Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred3 = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm3 = confusion_matrix(y_test, y_pred3)
import seaborn as sns
sns.heatmap(cm3, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('decision tree:', accuracy_score(y_test, y_pred3))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred3))
print('precision_score:', precision_score(y_test, y_pred3))
print('recall_score:', recall_score(y_test, y_pred3))
# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set results
y_pred4 = classifier.predict(X_test)

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm4 = confusion_matrix(y_test, y_pred4)
import seaborn as sns
sns.heatmap(cm4, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('random forest:', accuracy_score(y_test, y_pred4))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred4))
print('precision_score:', precision_score(y_test, y_pred4))
print('recall_score:', recall_score(y_test, y_pred4))
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()
classifier.add(Dense(10, activation='relu', input_dim=30))
classifier.add(Dense(10, activation='relu'))
classifier.add(Dense(1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.summary()

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size=1000, epochs=20)

# Predicting the test set results
y_pred5 = classifier.predict(X_test).round()

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm5 = confusion_matrix(y_test, y_pred5)
sns.heatmap(cm5, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('ANN:', accuracy_score(y_test, y_pred5))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score
print('f1_score:', f1_score(y_test, y_pred5))
print('precision_score:', precision_score(y_test, y_pred5))
print('recall_score:', recall_score(y_test, y_pred5))
Logistic regression accuracy: 0.9991924440855307
precision_score: 0.8767123287671232
recall_score: 0.6336633663366337
confusion matrix: [[56852     9]
 [   37    64]]

Naive Bayes accuracy: 0.9787402127734279
precision_score: 0.06708268330733229
recall_score: 0.8514851485148515
confusion matrix: [[55665  1196]
 [   15    86]]

Decision tree accuracy: 0.9991222218320986
precision_score: 0.7383177570093458
recall_score: 0.7821782178217822
confusion matrix: [[56833    28]
 [   22    79]]

Random forest accuracy: 0.9994557775359011
precision_score: 0.9166666666666666
recall_score: 0.7623762376237624
confusion matrix: [[56854     7]
 [   24    77]]

ANN accuracy: 0.973203110845827
precision_score: 0.845360824742268
recall_score: 0.8118811881188119
confusion matrix: [[56846    15]
 [   19    82]]
Now we are going to apply SMOTE and then rerun all of the above classification methods.
y.value_counts()
OUTPUT:-
0 284315
1 492
Name: Class, dtype: int64
We clearly have imbalanced data, which is very common when dealing with fraud detection.
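As a quick optional check (not part of the original code), you can express the class counts as percentages to quantify how skewed the data is:

# Show each class as a percentage of all transactions.
print(y.value_counts(normalize=True) * 100)
# Class 0 (normal) is about 99.83% of the rows; class 1 (fraud) is only about 0.17% (492 of 284807).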
fraud = train_df[train_df['Class'] == 1]
valid = train_df[train_df['Class'] == 0]

print("Fraud transaction statistics")
print(fraud["Amount"].describe())
print("\nNormal transaction statistics")
print(valid["Amount"].describe())
OUTPUT:-
Fraud transaction statistics
count 492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
Normal transaction statistics
count 284315.000000
mean 88.291022
std 250.105092
min 0.000000
25% 5.650000
50% 22.000000
75% 77.050000
max 25691.160000
Name: Amount, dtype: float64
As you can see in the above output, the mean amount of the fraud transactions is higher than the mean amount of the normal transactions.
print("before applying smote:",format(sum(y_train == 1))) print("before applying smote:",format(sum(y_train == 0))) # import SMOTE module from imblearn library from imblearn.over_sampling import SMOTE sm = SMOTE(random_state=2) X_train, y_train = sm.fit_sample(X_train, y_train) print('After applying smote X_train: {}\n'.format(X_train.shape)) print('After applying smote y_train: {}\n'.format(y_train.shape)) print("After applying smote label '1': {}\n".format(sum(y_train == 1))) print("After applying smote label '0': {}\n".format(sum(y_train == 0)))
OUTPUT:-
before applying smote, label '1': 391
before applying smote, label '0': 227454
After applying smote X_train: (454908, 30)
After applying smote y_train: (454908,)
After applying smote label '1': 227454
After applying smote label '0': 227454
As the above output shows, before applying SMOTE the fraud class (label = 1) had only 391 training samples; after applying SMOTE (which is an oversampling method), the fraud and normal classes contain the same number of samples. SMOTE balances the class distribution by generating additional minority-class examples.
Below are the outputs of all of the above classification methods after applying SMOTE:
SMOTE + logistic regression accuracy: 0.975667989185773
precision_score: 0.6440677966101695
recall_score: 0.9405940594059405
confusion matrix: [[53513  1403]
 [    1    97]]

SMOTE + naive bayes accuracy: 0.9764931006636003
precision_score: 0.6285310734463277
recall_score: 0.8811881188118812
confusion matrix: [[56000  1303]
 [   12    89]]

SMOTE + decision tree accuracy: 0.9964713317650363
precision_score: 0.31343283582089554
recall_score: 0.8316831683168316
confusion matrix: [[57000  1802]
 [   17    84]]

SMOTE + random forest accuracy: 0.999420666409185
precision_score: 0.8469387755102041
recall_score: 0.8217821782178217
confusion matrix: [[57001    18]
 [   18    83]]

SMOTE + ANN accuracy: 0.9982362206383203
precision_score: 0.31386861313868614
recall_score: 0.8514851485148515
confusion matrix: [[57004  1902]
 [   15    86]]
OUTPUT of the ANN (after applying SMOTE):-
Epoch 1/20  455/455 [==============================] - 2s 3ms/step - loss: 0.2695 - accuracy: 0.8751
Epoch 2/20  455/455 [==============================] - 2s 4ms/step - loss: 0.0938 - accuracy: 0.9611
Epoch 3/20  455/455 [==============================] - 2s 3ms/step - loss: 0.0606 - accuracy: 0.9770
Epoch 4/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0434 - accuracy: 0.9852
Epoch 5/20  455/455 [==============================] - 2s 3ms/step - loss: 0.0325 - accuracy: 0.9898
Epoch 6/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0258 - accuracy: 0.9928
Epoch 7/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0215 - accuracy: 0.9942
Epoch 8/20  455/455 [==============================] - 2s 4ms/step - loss: 0.0184 - accuracy: 0.9952
Epoch 9/20  455/455 [==============================] - 1s 3ms/step - loss: 0.0163 - accuracy: 0.9958
Epoch 10/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0147 - accuracy: 0.9963
Epoch 11/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0135 - accuracy: 0.9967
Epoch 12/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0125 - accuracy: 0.9971
Epoch 13/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0117 - accuracy: 0.9973
Epoch 14/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0111 - accuracy: 0.9975
Epoch 15/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0105 - accuracy: 0.9977
Epoch 16/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0100 - accuracy: 0.9979
Epoch 17/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0096 - accuracy: 0.9979
Epoch 18/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0092 - accuracy: 0.9980
Epoch 19/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0088 - accuracy: 0.9981
Epoch 20/20 455/455 [==============================] - 1s 3ms/step - loss: 0.0085 - accuracy: 0.9982
SMOTE+ANN: 0.9982362206383203
classification_report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56861
           1       0.06      0.94      0.12       101

    accuracy                           0.98     56962
   macro avg       0.53      0.96      0.55     56962
weighted avg       1.00      0.98      0.99     56962
If you compare the outputs with and without SMOTE, the accuracies are quite similar (the accuracy may even decrease slightly after applying SMOTE), but the recall is much better after applying SMOTE than before.
NOTE:-
high recall + high precision: the class is perfectly handled by the model
low recall + high precision: the model cannot detect the class well, but when it does, it is highly trustworthy
high recall + low precision: the class is well detected, but the model also includes points from other classes in it
low recall + low precision: the class is poorly handled by the model
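To make the note above concrete, here is a minimal sketch (using made-up predictions, not this project's results) showing how precision and recall are read off a confusion matrix:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for 10 transactions (1 = fraud, 0 = normal).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)                      # TP: 3 FP: 2 FN: 1 TN: 4

# precision = TP / (TP + FP): how trustworthy a "fraud" prediction is.
print("precision:", tp / (tp + fp), precision_score(y_true, y_hat))    # 0.6
# recall = TP / (TP + FN): how many real frauds the model actually catches.
print("recall:", tp / (tp + fn), recall_score(y_true, y_hat))          # 0.75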
Now I am going to apply a convolutional neural network, for which we first have to reshape the data, because Conv1D expects a 3D input of shape (samples, timesteps, channels).
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
X_train.shape, X_test.shape
OUTPUT:-
((454908, 30, 1), (56962, 30, 1))
MODEL:-
Here I have used several layer types, namely Conv1D, BatchNormalization, Dropout, Flatten, and Dense, together with activation functions and an optimizer.
1) Conv1D:- This layer creates a convolution kernel that is convolved with the layer input over a single dimension to produce a tensor of outputs.
2) BatchNormalization:- This layer standardizes its inputs so that they have a mean of 0 and a standard deviation of 1.
3) Dropout:- This layer randomly switches off a fraction of the neurons during training to prevent overfitting.
4) Flatten:- This layer converts the 2D matrix of features into a vector so that it can be fed into the Dense layers.
5) Dense:- This layer is a fully connected layer.
6) ReLU (rectified linear unit):- An activation function defined as y = max(0, x).
7) Sigmoid:- The sigmoid output lies between 0 and 1, so it is used for binary outputs; for a multi-class classification problem the softmax function is used instead (see the short sketch after this list).
8) Adam:- Adam is an optimizer used to update the weights during training.
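As a small illustration of point 7 (a hedged sketch, not part of the original tutorial; the number of classes is an arbitrary example), the output layer changes depending on the problem:

from tensorflow.keras.layers import Dense

# Binary classification (fraud vs. normal): one unit with sigmoid, trained with binary_crossentropy.
binary_output = Dense(1, activation='sigmoid')

# Multi-class classification (e.g. a hypothetical 5-class problem): one unit per class with softmax,
# trained with categorical_crossentropy (or sparse_categorical_crossentropy for integer labels).
multiclass_output = Dense(5, activation='softmax')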
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# Initialising the CNN
classifier = tf.keras.models.Sequential()
classifier.add(tf.keras.layers.Convolution1D(32, 2, activation='relu', input_shape=X_train[0].shape))
classifier.add(tf.keras.layers.BatchNormalization())
classifier.add(tf.keras.layers.Dropout(0.2))
classifier.add(tf.keras.layers.Convolution1D(64, 2, activation='relu'))
classifier.add(tf.keras.layers.BatchNormalization())
classifier.add(tf.keras.layers.Dropout(0.2))
classifier.add(tf.keras.layers.Convolution1D(128, 2, activation='relu'))
classifier.add(tf.keras.layers.BatchNormalization())
classifier.add(tf.keras.layers.Dropout(0.2))
classifier.add(tf.keras.layers.Flatten())
classifier.add(tf.keras.layers.Dense(units=256, activation='relu'))
classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

# 'lr' is deprecated in recent TensorFlow versions; use 'learning_rate' instead
classifier.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])
classifier.summary()

history = classifier.fit(X_train, y_train, batch_size=100, epochs=10,
                         validation_data=(X_test, y_test), verbose=1)

# Predicting the test set results
y_pred = classifier.predict(X_test).flatten().round()

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
import seaborn as sns
sns.heatmap(cm, annot=True)

# Finding the accuracy
from sklearn.metrics import accuracy_score
print('CNN:', accuracy_score(y_test, y_pred))

# Classification metrics
from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
print('classification_report:', classification_report(y_test, y_pred))
print('f1_score:', f1_score(y_test, y_pred))
print('precision_score:', precision_score(y_test, y_pred))
print('recall_score:', recall_score(y_test, y_pred))
OUTPUT:-
Epoch 1/10  4550/4550 [==============================] - 26s 6ms/step - loss: 0.0029 - accuracy: 0.9991 - val_loss: 0.0105 - val_accuracy: 0.9988
Epoch 2/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0100 - val_accuracy: 0.9991
Epoch 3/10  4550/4550 [==============================] - 25s 6ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0110 - val_accuracy: 0.9989
Epoch 4/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0022 - accuracy: 0.9994 - val_loss: 0.0096 - val_accuracy: 0.9992
Epoch 5/10  4550/4550 [==============================] - 25s 6ms/step - loss: 0.0018 - accuracy: 0.9995 - val_loss: 0.0105 - val_accuracy: 0.9990
Epoch 6/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.0107 - val_accuracy: 0.9989
Epoch 7/10  4550/4550 [==============================] - 26s 6ms/step - loss: 0.0016 - accuracy: 0.9995 - val_loss: 0.0102 - val_accuracy: 0.9993
Epoch 8/10  4550/4550 [==============================] - 27s 6ms/step - loss: 0.0013 - accuracy: 0.9996 - val_loss: 0.0109 - val_accuracy: 0.9990
Epoch 9/10  4550/4550 [==============================] - 29s 6ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 0.0119 - val_accuracy: 0.9989
Epoch 10/10 4550/4550 [==============================] - 27s 6ms/step - loss: 0.0013 - accuracy: 0.9996 - val_loss: 0.0103 - val_accuracy: 0.9993
classification_report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56861
           1       0.78      0.85      0.82       101

    accuracy                           1.00     56962
   macro avg       0.89      0.93      0.91     56962
weighted avg       1.00      1.00      1.00     56962
CNN + SMOTE: 0.9993153330290369
precision_score: 0.7818181818181819
recall_score: 0.8514851485148515
confusion matrix: [[56864     0]
 [   64    34]]
CONCLUSION:- In this project, we have learned how to apply SMOTE to an imbalanced dataset and how to train an artificial neural network and a convolutional neural network on it. The artificial neural network and the convolutional neural network fit the dataset best.
Submitted by Rahul Makwana (rahulmakwana)