Credit Card Fraud Detection Techniques
Credit card fraud leaves digital evidence: structured and unstructured records and data that can be key to identifying fraud patterns and should be exploited. Examples include multiple transactions of different amounts from a single origin, the geographical location of the transactions, queries on multiple accounts from the same origin, and the granting of credit or loans to users with a low credit profile. The following recommendations help make use of that evidence:
1. Know the organization. Understand the company's normal operation: its usual day-to-day models and processes. For example, the number of transactions and the behaviors that fall within the normal range.
2. Integrate all the information in a platform that is flexible rather than rigid, with multiple connectors and the ability to expand and to gather information. Bring in as much of the information processed within the organization as possible.
3. Have defined security parameters within the organization. A fraud prevention stance implies developing a security protocol manual for credit card fraud detection within the business. For example, a process raises alerts A and B; if the intermediate process fails, a fallback process C is performed.
4. Have a user profile. This allows you to identify trends and purchase amounts: whether the user buys more in certain periods, whether the purchase amounts are normal for that user, and so on. This improves alertness and response to fraud, with broader coverage.
5. Technology has to adapt to the business, not the business to technology. Organizations today have to adapt to their customers to perform better economically; if it works the other way around, control is lost. Technological prevention processes must be established so that action is efficient.
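As an illustration of the user profile in recommendation 4, the following minimal sketch flags a purchase amount that is far from a user's own history. The function name `is_unusual_amount`, the z-score approach, and the threshold of 3 standard deviations are assumptions chosen for the example, not a method described in this article.

```python
from statistics import mean, stdev


def is_unusual_amount(history, amount, threshold=3.0):
    """Return True when `amount` is more than `threshold` standard
    deviations away from the mean of the user's past purchase amounts.

    Hypothetical helper: the z-score rule and the default threshold
    are illustrative assumptions.
    """
    if len(history) < 2:
        return False  # not enough history to build a profile
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases were identical: any other amount is unusual.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Toy purchase history for one user (made-up values).
past_purchases = [42.0, 38.5, 51.0, 47.25, 40.0, 45.0]

print(is_unusual_amount(past_purchases, 44.0))   # close to the usual range
print(is_unusual_amount(past_purchases, 900.0))  # far outside the usual range
```

In practice a profile would likely also cover purchase timing and location, as the recommendations above suggest; the amount is used here only because it is the simplest signal to show.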
Identity fraud in credit card fraud detection:
According to the AEECF (Spanish Association of Companies Against Fraud), online fraud is growing exponentially due to the proliferation of electronic devices. Phishing attacks (a technique that tries to trick the user into providing confidential information such as personal data, credit card numbers, keys, or passwords) have grown by 87% in recent years.
Among the most common types of credit card fraud, the AEECF considers identity fraud, in which the fraudster impersonates a third party for personal gain, to be the most common attack in its organizations. In second place is application fraud, which can be first-person fraud or third-party fraud. In first-person fraud, offenders alter their own data to deceive the entity or distort the credit decision, while third-party fraud involves identity theft or the use of fictitious identities.
Generally, identity and application fraud are hidden among the losses attributed to bad payers, that is, customers who are late in payment. For this reason, analysts do not identify it as fraud (except in rare cases), and the company never recovers the losses.
This context, together with the increasing availability of data on customers, their interaction with companies, their behavior habits, etc., as well as the advancement of technology, has made the application of artificial intelligence techniques key to detecting and fighting fraud.
Solution proposed to fight fraud
Until now, companies have had conventional fraud management systems based on rule engines. Such systems understand known frauds and have rules built to detect them, which can be based on defaults, existing debt, delinquent lists, lists of former bad customers, etc. But these systems are not agile and are always one step behind the constant innovations in fraud techniques. Even so, they provide valuable information that can be used with new technologies based on artificial intelligence. In this way, the knowledge that companies have acquired over the years about credit card fraud detection is not lost.
Therefore, a possible solution for detecting identity or application fraud is to combine both techniques, the conventional rule engines and machine learning, so that the decisions coming from the rule engines are used as input variables for the machine learning algorithms.
In addition, the results from the rule engines will be used as a reference to subsequently evaluate the results of the machine learning algorithm and thus, see what this algorithm contributes to fraud detection compared to conventional rule engines.
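The combination described above can be sketched as follows: the rule engine outputs become extra columns in each feature row, and the overall rule decision is kept as a baseline against which a model is later compared. Everything named here is hypothetical (the two rules, the field names such as `on_delinquent_list`, and the toy label and prediction lists); the article does not specify the actual rules or model.

```python
def rule_flags(application):
    """Hypothetical rule engine: each rule returns 1 if it fires, else 0."""
    return {
        "rule_on_delinquent_list": int(application["on_delinquent_list"]),
        "rule_existing_debt": int(application["existing_debt"] > 0),
    }


def build_features(application):
    """Build one feature row: original variables plus rule engine outputs."""
    flags = rule_flags(application)
    features = {
        "age": application["age"],
        "requested_amount": application["requested_amount"],
    }
    features.update(flags)
    # Overall rule decision, kept as the baseline reference.
    features["rule_rejected"] = int(any(flags.values()))
    return features


def fraud_detection_rate(y_true, y_pred):
    """Share of actual fraud cases (label 1) that were flagged (prediction 1)."""
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    total_fraud = sum(y_true)
    return caught / total_fraud if total_fraud else 0.0


# Made-up application record.
application = {
    "age": 34,
    "requested_amount": 1200.0,
    "on_delinquent_list": False,
    "existing_debt": 350.0,
}
row = build_features(application)
print(row)

# Made-up labels and predictions, only to show how the comparison is computed.
y_true = [1, 1, 1, 0, 0, 0]
rule_pred = [1, 0, 0, 0, 1, 0]   # baseline: rule engine rejections
model_pred = [1, 1, 0, 0, 0, 0]  # hypothetical model output
print(fraud_detection_rate(y_true, rule_pred))
print(fraud_detection_rate(y_true, model_pred))
```

The rows produced by `build_features` would be the input to whatever classifier is chosen; the comparison of detection rates is one simple way to see what the model adds over the rules alone.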
Taking all this into account and following CRISP-DM, the usual methodology for machine learning projects, the steps to follow are: business/problem understanding, data understanding, data preparation, modeling, evaluation, and deployment, as can be seen in the following diagrams:
1. Data and Descriptive Analysis
The first thing to do is a detailed descriptive analysis of the data in order to understand, select, clean, and transform it before feeding it to any algorithm. The starting point is a supervised machine learning problem. Specifically, it is a classification problem: a fraud detection model must be built that is capable of predicting a category (for example, true or false, or in this case “fraud” / “no fraud”).
Due to the nature of the problem, the available data has class imbalance, that is, it does not contain the same proportion of fraudulent records or observations as non-fraudulent ones. In this type of problem, it is not surprising for the data to have 99% non-fraud cases against 1% fraud cases. This makes it more difficult for the algorithm to find patterns in the latter. Therefore, certain class balancing techniques must be applied, which will be detailed later.
It must also be kept in mind that, for the algorithm to deliver real value, it must detect as many frauds as possible (true positives) while wrongly flagging as few legitimate customers as possible (false positives). For companies, false positives are potentially good customers that the algorithm is going to discard and, therefore, potential revenue that will never come in. This adds to the losses caused by fraudulent individuals that the algorithm does not detect as such (false negatives). Therefore, the algorithm must be as good as possible at identifying both fraud and non-fraud.
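To make the imbalance concrete, here is a minimal sketch of one common balancing option, random oversampling of the minority class. This technique is an assumption chosen for illustration; the article does not say here which balancing techniques it uses. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have
    the same number of records. Returns new lists; inputs are not modified.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    extra_needed = majority_count - len(minority)
    if not minority or extra_needed <= 0:
        return list(rows), list(labels)  # nothing to balance
    extra = [rng.choice(minority) for _ in range(extra_needed)]
    return list(rows) + extra, list(labels) + [minority_label] * extra_needed


# Toy data with the 99% / 1% ratio mentioned above (made-up records).
labels = [0] * 99 + [1]
rows = [{"id": i} for i in range(100)]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels))           # class counts before balancing
print(Counter(balanced_labels))  # class counts after balancing
```

Balancing of this kind should normally be applied to the training split only, so that evaluation still reflects the real class ratio.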
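The trade-off between true positives, false positives, and false negatives can be quantified with precision and recall. The sketch below counts the four outcomes for binary labels (1 = fraud, 0 = no fraud); the function names and the toy label lists are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # fraud correctly flagged
        elif t == 0 and p == 1:
            fp += 1  # good customer wrongly flagged
        elif t == 0 and p == 0:
            tn += 1  # good customer correctly accepted
        else:
            fn += 1  # fraud that went undetected
    return tp, fp, tn, fn


def precision_recall(y_true, y_pred):
    """Precision: share of flagged cases that are fraud.
    Recall: share of fraud cases that are flagged."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predictions for ten applications.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

Low precision corresponds to discarded good customers, and low recall to undetected fraud, which are the two costs described above.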
The data available is made up of a series of variables that we can call “original” or “natural”, which are mostly the data that the applicant fills out when requesting registration for the service and/or product. Said data are, as mentioned above, liable to be false or belong to a third party and not to the person filling out the request. These cases, together with the potential fraud detected by the rule engines, are the ones that the algorithm must detect. Therefore, such data should have a special treatment that will be detailed later.
In addition to these “original” variables, the data is also made up of a set of “synthetic variables.” Some are created from the original variables, and others are generated by conventional rule engines. The latter, therefore, are the ones that will give the algorithm extra knowledge.
Besides serving as input variables for the machine learning algorithm, the variables generated by the rule engines can also be used to quantify the results of the rule engines and to obtain a “rejection rate vs. fraud” table. In this way, there is a reference against which to evaluate what the machine learning algorithm contributes to the solution of the problem.
The previous scheme can be read as follows:
- 96.82% of the records that are not fraud (0) are not rejected (0).
- 3.18% of the records that are not fraud (0) are rejected (1). Note that the rule engines may reject them for other reasons.
- 87% of the records that are fraud (1) are not rejected (0). These are fraudulent registrations that the conventional rule engines do not reject.
- 13% of the records that are fraud (1) are rejected (1). These are fraudulent registrations that the conventional rule engines do reject.
The goal of applying machine learning algorithms to this case is to detect the 87% of fraud cases that the rule engines are not able to detect.
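A “rejection rate vs. fraud” table like the one summarized above can be computed as a cross-tabulation normalized within each fraud class. The sketch below assumes that reading of the percentages (each pair sums to 100% within fraud and within non-fraud). The function name and the record counts are made up, chosen only so that the toy data reproduces the percentages quoted above.

```python
from collections import Counter


def rejection_vs_fraud(rejected, fraud):
    """Return {fraud_class: {rejected_class: percentage}}, where the
    percentages are computed within each fraud class (rows sum to 100)."""
    counts = Counter(zip(fraud, rejected))
    table = {}
    for f in (0, 1):
        total = counts[(f, 0)] + counts[(f, 1)]
        table[f] = {
            r: (100.0 * counts[(f, r)] / total if total else 0.0)
            for r in (0, 1)
        }
    return table


# Toy data: 10,000 non-fraud records (318 rejected) and 100 fraud records
# (13 rejected). These counts are illustrative assumptions.
fraud = [0] * 10000 + [1] * 100
rejected = [1] * 318 + [0] * 9682 + [1] * 13 + [0] * 87

table = rejection_vs_fraud(rejected, fraud)
for fraud_class, row in table.items():
    print(f"fraud={fraud_class}: not rejected {row[0]:.2f}%, rejected {row[1]:.2f}%")
```

The row for `fraud=1` is the baseline of interest: the share of fraud that the rules do not reject is what a machine learning model would be expected to reduce.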
2. Model Construction
2.1 Data enrichment and creation of synthetic variables
Enriching data using external data sources and creating other variables (synthetic variables) from given variables are common techniques in building machine learning models for credit card fraud detection.
In problems of this type it is common for the available data to contain many variables that, in raw form, provide no information to a model. They can, however, be used to obtain more valuable data from external sources or to derive other variables that do provide valid information.
For this use case, a very important piece of information is whether the data the applicant enters when filling out the application may be false. These are personal data such as name, DNI (national identity document), address, email, telephone, etc.
To check this, the data are cross-checked against external data sources such as:
- Validation of individual data (name and ID) by checking whether they appear on public web pages. (NOTE: This validation is not implemented in the fraud detection model explained in this article, but its potential has been evaluated and it is considered very interesting as a proposed improvement to the model described here.)
Likewise, different techniques are used to validate and quantify the veracity and/or reliability of variables such as email address or telephone number, and synthetic variables are created from the results of these validations. In addition, unsupervised learning techniques (clustering) are used to create another synthetic variable from a segmentation of the records based on different variables. In this way, groups of requests with similar characteristics are created, which help the supervised learning algorithm detect credit card fraud.
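The two kinds of synthetic variables described above can be sketched as follows. The article does not say which validation checks or which clustering algorithm are used, so everything here is an assumption made for illustration: the simple email format pattern, the 9 to 15 digit phone check, the plain k-means loop with naive initialization, and all function names and toy values.

```python
import re

# Deliberately simple format check (assumption): one "@", no spaces,
# and at least one dot in the domain part.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def contact_features(email, phone):
    """Synthetic 0/1 variables from basic plausibility checks."""
    digits = re.sub(r"\D", "", phone)  # keep digits only
    return {
        "email_format_ok": int(bool(EMAIL_PATTERN.match(email))),
        "phone_digit_count_ok": int(9 <= len(digits) <= 15),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_segments(points, k, iterations=10):
    """Plain k-means: return a segment label per point. The first k points
    are used as initial centroids (naive, deterministic choice)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


print(contact_features("ana@example.com", "+34 600 123 456"))
print(contact_features("not-an-email", "12345"))

# Toy two-dimensional records (made-up, already on a comparable scale).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
segments = assign_segments(points, k=2)
print(segments)  # one segment label per record, usable as a synthetic variable
```

Both outputs would be appended to the feature rows given to the supervised model. Real variables would need scaling before clustering, and a production setup would more likely rely on an established library than on this hand-written loop.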