Building the Data Science Use Case, with Pivigo
Ravelin is a product that protects online retailers against fraudulent customers. It takes a new approach to fraud prevention by focusing on user behaviour, using holistic customer data and machine learning. Ravelin’s efforts focus on Card Not Present (CNP) fraud, a major issue for merchants because they are typically liable for reimbursing cardholders who fall victim to fraud.
The S2DS project addressed the problem of protecting online merchants from fraudsters. Fraud schemes are becoming more complex and harder to identify. To address this, the S2DS Data Science team’s objective was to develop a model that could effectively detect fraudulent customers based on generic features, such as devices, orders, transactions, locations, and payment methods, so that it could be easily applied across a wide range of industries.
The team achieved this objective by analysing historical data from the taxi-booking app Hailo, which was accessed and manipulated directly using PostgreSQL. Data preparation, modelling and scaling were done in Python, using Pandas, Matplotlib and a variety of methods from the Scikit-learn machine learning library. Because of the large scale of the database and the need for in-memory analysis, computers needed 16GB of RAM to work effectively.
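To make the data preparation step more concrete, here is a minimal sketch of how per-customer features might be derived from order records with Pandas. The table layout, the column names (`customer_id`, `card_id`, `order_value`) and the helper `build_customer_features` are hypothetical and chosen for illustration only; they are not taken from the Hailo data or from the team's code.

```python
import pandas as pd

# Hypothetical order records. In the project the data lived in PostgreSQL;
# here a small in-memory table stands in for it.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "card_id": ["a", "a", "b", "b", "c", "c"],
    "order_value": [10.0, 14.0, 55.0, 8.0, 9.0, 13.0],
})


def build_customer_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate order rows into one feature row per customer (illustrative)."""
    # Order count and mean order value per customer.
    features = orders.groupby("customer_id").agg(
        n_orders=("order_value", "size"),
        mean_order_value=("order_value", "mean"),
    )

    # Number of distinct customers seen on each card.
    customers_per_card = orders.groupby("card_id")["customer_id"].nunique()

    # For each customer, the largest number of customers sharing any of their cards.
    shared = orders.assign(
        customers_on_card=orders["card_id"].map(customers_per_card)
    )
    features["max_customers_on_card"] = (
        shared.groupby("customer_id")["customers_on_card"].max()
    )
    return features


features = build_customer_features(orders)
print(features)
```

Aggregating to one row per customer keeps the resulting table small enough for in-memory analysis even when the underlying order history is large.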
We enjoyed being a part of S2DS. It led to a real improvement in the core part of the product, allowed us to see our data problems from a fresh perspective and led to us offering to hire one of the participants for a full-time role. (Stephen Whitworth, Head of Data Science at Ravelin)
The Data Science team engineered ~40 features from the data to describe each customer, and trained several machine learning models on these features to assess feature usefulness and model performance. The most useful features included the time between user registration and card registration, the number of customers using a single card, the mean value of the customer’s orders, their location, and the number of orders they made. The best-performing model correctly identified ~80% of fraudsters when trained on appropriately weighted data with optimal parameters, with a false positive ratio about two to six times better than current industry-standard rates.
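As a rough illustration of training on weighted data and measuring the two quantities mentioned above, the following sketch uses Scikit-learn on synthetic data. The source does not say which model family performed best, so the random forest, the `class_weight="balanced"` setting, the 5% fraud rate and all parameter values are assumptions. The false positive rate is computed here as false positives divided by all genuine customers, which is one possible reading of "false positive ratio".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer feature table: 40 features,
# with fraud (class 1) as a rare class. None of this is project data.
X, y = make_classification(
    n_samples=2000,
    n_features=40,
    n_informative=8,
    weights=[0.95, 0.05],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the rare fraud class is not swamped by genuine customers.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(
    y_test, model.predict(X_test), labels=[0, 1]
).ravel()
recall = tp / (tp + fn)               # share of fraudsters correctly identified
false_positive_rate = fp / (fp + tn)  # share of genuine customers flagged

# Indices of the five features the model found most useful.
top_features = model.feature_importances_.argsort()[::-1][:5]

print(f"recall={recall:.2f}, false positive rate={false_positive_rate:.3f}")
print("top feature indices:", top_features)
```

Ranking features by importance in this way is one common method of assessing feature usefulness; the numbers produced on synthetic data say nothing about the results reported for the project.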