How to Implement Machine Learning for Fraud Detection

In financial services, and particularly in the emerging field of mobile money, publicly available data sets are scarce. Financial data sets matter to many researchers, especially those of us who work on fraud detection, but the private nature of financial transactions keeps most of them out of public reach. To work around this, we used a synthetic dataset from Kaggle generated by a simulator called PaySim. PaySim uses aggregated statistics from private datasets to generate synthetic data that resembles normal transaction activity, then injects malicious behavior so that the performance of a fraud detection method can be evaluated against it.

Data Source: Kaggle
Feature Selection
From the raw data, we can see that transactions fall into five types, and each record is labeled as either fraudulent or non-fraudulent. Common sense suggests that some transaction types are almost impossible to exploit for fraud. Since this research is aimed at fraud detection and the data is heavily imbalanced, we can bring the data closer to balance by dropping the transaction types in which fraud never occurs.
In this data, only Cash Out and Transfer transactions carry any fraud risk, so we keep just those two types for further study.
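As a minimal sketch of this filtering step, assuming the `type` and `isFraud` column names from the Kaggle PaySim file (the toy rows below are illustrative, not real data):

```python
import pandas as pd

# Toy stand-in for the PaySim data; the real Kaggle file has a `type`
# column with values CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER.
df = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "TRANSFER", "PAYMENT", "DEBIT", "CASH_OUT"],
    "amount": [100.0, 250.0, 900.0, 40.0, 15.0, 610.0],
    "isFraud": [0, 0, 1, 0, 0, 1],
})

# Keep only the two transaction types in which fraud actually occurs.
df = df[df["type"].isin(["CASH_OUT", "TRANSFER"])].reset_index(drop=True)
print(df["type"].unique())  # ['CASH_OUT' 'TRANSFER']
```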
From the heat map above, we can see that several features carry duplicated information. I therefore merged the original-account and destination-account balance columns into two net-change features. After this step, `step`, `type`, `isFraud`, `DestAmount`, and `OrigAmount` capture all the valid information in the original table.
The correlation matrix also shows that `step` correlates only weakly with fraud, so the most informative features for detection are the two combined ones, `DestAmount` and `OrigAmount`. Finally, we need to split the data into Cash Out and Transfer cases and handle them separately.
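The merge-and-split described above might look like the sketch below, assuming the PaySim balance column names (`oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`) and illustrative toy rows:

```python
import pandas as pd

# Toy rows mimicking the PaySim balance columns.
df = pd.DataFrame({
    "step": [1, 2],
    "type": ["CASH_OUT", "TRANSFER"],
    "oldbalanceOrg": [1000.0, 500.0],
    "newbalanceOrig": [0.0, 0.0],
    "oldbalanceDest": [200.0, 0.0],
    "newbalanceDest": [1200.0, 0.0],
    "isFraud": [1, 1],
})

# Collapse the four balance columns into two net-change features.
df["OrigAmount"] = df["oldbalanceOrg"] - df["newbalanceOrig"]
df["DestAmount"] = df["newbalanceDest"] - df["oldbalanceDest"]
df = df[["step", "type", "isFraud", "OrigAmount", "DestAmount"]]

# Study the two transaction types separately.
cash_out = df[df["type"] == "CASH_OUT"]
transfer = df[df["type"] == "TRANSFER"]
```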
Oversampling - SMOTE

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance. The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important. One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
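To make the interpolation idea concrete, here is a from-scratch sketch of the core SMOTE step in NumPy (the data below is synthetic; in practice the `SMOTE` class from the imbalanced-learn library does this with far more care):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Brute-force pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                    # pick a minority sample
        nb = X_min[rng.choice(neighbours[j])]  # one of its neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

# 10 minority points oversampled to 100 by adding 90 synthetic ones.
rng = np.random.default_rng(0)
X_min = rng.normal(5.0, 0.5, size=(10, 2))
X_new = smote(X_min, n_new=90, rng=0)
X_balanced = np.vstack([X_min, X_new])
print(X_balanced.shape)  # (100, 2)
```

Because each synthetic point is a convex combination of two real minority points, it always lies inside the region the minority class already occupies, rather than merely duplicating existing rows.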

After oversampling with SMOTE, the classes are nearly balanced. This avoids many over-fitting problems, as well as the difficulty that KNN and similar methods have in fitting heavily unbalanced data.
Moreover, because fraudulent records are few and their values unusual, statistical models easily flag them as outliers. I therefore kept all of the data, believing this was the best way to make full use of the raw records and prevent information loss.
Cash-out Fraud Detection
Given the shape of the earlier scatter plot, I expected KNN to perform well, and Logistic Regression also has certain advantages on data like this. Below are the reasons the two models were selected, as well as the confusion matrices for their predictions.
To keep the evaluation honest, we test on raw data that was never used in training. However, because fraud is so rare relative to non-fraud in the original data, the headline numbers are hard to read intuitively, which is why the figures below are all macro averages. I also evaluate the models again on fraud-only data.

KNN

- The theory is mature and the idea simple; it can be used for both classification and regression
- It handles nonlinear classification
- Its training time complexity is lower than that of algorithms such as support vector machines
- It makes no assumptions about the data, offers high accuracy, and is insensitive to outliers

Logistic Regression

- Simple to implement, easy to understand, and widely used on industrial problems
- Very low computational cost and very fast
- Multicollinearity is not a fatal problem and can be handled by combining it with L2 regularization

Macro Average of Evaluation Factors

Because we only care about detecting fraudulent transactions, and Logistic Regression makes fewer errors here, we use it for Cash Out fraud.
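The model comparison described above can be sketched as follows; the two well-separated clusters below are synthetic stand-ins for the engineered `OrigAmount`/`DestAmount` features, not the real PaySim data, and the macro-averaged F1 mirrors the evaluation used here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data: fraud and non-fraud form separate clusters.
rng = np.random.default_rng(0)
X_fraud = rng.normal([5.0, 5.0], 0.8, size=(300, 2))
X_legit = rng.normal([1.0, 1.0], 0.8, size=(300, 2))
X = np.vstack([X_fraud, X_legit])
y = np.array([1] * 300 + [0] * 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Fit both candidate models and compare their macro-averaged F1 scores.
scores = {}
for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                    ("LogReg", LogisticRegression())]:
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te), average="macro")
    print(f"{name}: macro F1 = {scores[name]:.3f}")
```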
Transfer Fraud Detection
The scatter plot for transfers shows that this data has the same issue as cash-out: the important records risk being flagged as outliers. So I again skipped outlier removal, swapped in the transfer subset, and processed it with the same methods.

Macro Average of Evaluation Factors

Logistic regression does not perform badly in this case either, but not as well as KNN. Although LR is slightly better at classifying non-fraud transactions, KNN is far better on the fraud class.
Conclusion

Anti-fraud has always been one of the hottest and most challenging problem areas. Anti-fraud projects often involve clients who cannot say precisely what counts as fraud and what does not; in other words, the definition of fraud is vague. At first glance anti-fraud may look like binary classification, but on reflection it is really multi-class, because each distinct type of fraud can be treated as its own class. Fraud almost never comes in a single form, and its methods change from day to day. Even industries that have dealt with fraud for years, such as banks and insurance companies, must constantly update their detection methods.
Based on this data, my approach was to simplify the data and extract the most relevant features, because in the data science field, more features are not always better.
After pre-processing, I found that the balance changes of the original and destination accounts were highly related to fraudulent transactions.
I found that different transaction types follow specific patterns:

1. CASH-IN: Non-fraud
2. DEBIT: Non-fraud
3. PAYMENT: Non-fraud
4. CASH-OUT: when the original account balance decreases and the destination account balance increases, the transaction has a high risk of being fraud.
5. TRANSFER: when the original account balance decreases and the destination account balance stays the same, the transaction has a high risk of being fraud.

That is why I excluded all transaction types other than CASH-OUT and TRANSFER. The plots also suggested that KNN should classify fraudulent transactions easily and accurately, and the results bore that out. In the future, as more transaction data arrives, I would suggest the facility follow these steps to detect fraud:

1. Only extract the CASH-OUT and the TRANSFER transaction.
2. Apply my trained Logistic Regression model to detect CASH-OUT fraud.
3. Apply my trained KNN model to detect TRANSFER fraud.
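The three-step routine above can be sketched as a small dispatch helper; `predict_fraud`, the `txn` dict, and the two model arguments are hypothetical names for illustration, standing in for the fitted Logistic Regression and KNN classifiers:

```python
def predict_fraud(txn, cash_out_model, transfer_model):
    """Route a transaction to the model trained for its type and
    return 1 for predicted fraud, 0 otherwise."""
    if txn["type"] == "CASH_OUT":
        model = cash_out_model          # Logistic Regression in this study
    elif txn["type"] == "TRANSFER":
        model = transfer_model          # KNN in this study
    else:
        return 0  # other types were never fraudulent in this data
    return int(model.predict([[txn["OrigAmount"], txn["DestAmount"]]])[0])
```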

Future Improvement

Given more time, I could try more models, such as XGBoost, Multilayer Perceptron, Decision Tree, and Random Forest, and carry out systematic parameter tuning.

By comparing the Confusion Matrix of multiple models, we can more intuitively choose the optimal model and the optimal solution under different conditions.

Combining different data augmentation techniques with model ensembling may yield better results in more directions.

Given enough time, this interface could offer more interactive features, as well as a more concise and attractive layout.

It may also be possible to connect the interface to a trained model, so that users can enter feature values and check whether a transaction carries potential fraud risk.
