How to do Data Science on an evolving Website

Why UX design plays a key role in dataset consistency and how ensemble modeling can be used to deal with data coming from different time periods.

Massimo Belloni
Data Science @ HousingAnywhere
9 min read · Nov 30, 2018


At HousingAnywhere.com trust is our core business. Like all marketplaces, our added value is directly connected to the quality of the suppliers on our platform. To ensure a seamless experience and a trusted environment, we put a lot of effort into detecting and excluding scammers as early as possible, ideally before they are able to interact with any of the searchers. The ever-increasing number of listings published and the availability of data collected over the past three years have prompted us to investigate automation more and more, and to see how artificial intelligence can help our Customer Solutions (CS) team with scammer and fraud prevention.

The encouraging results of the first offline tests led us to deploy our model into daily operations, providing a semi-automated tool fed with live data. This added complexity to the problem setting: coping with data in a real-time environment made us reconsider the testing infrastructure, since we expected to provide reliable results well beyond a mere research perspective. The first experiments with live data showed a dramatic decrease in performance, leading us to challenge our entire set of assumptions. We eventually concluded that we had to start fresh, redesigning our pipeline from scratch and making new assumptions about what we were looking for.

Dealing with temporal data

In a competition-like classification setting, the performance of a model is often assessed with a randomized hold-out: the data is shuffled, the majority of it is randomly selected for training the model and a small subset (say 10%) is reserved to test its performance. The test data is therefore sampled over the entire time interval rather than from a specific one.

It soon became clear, however, that this approach would prove inadequate when bringing this research to production. The purpose of our research is practical rather than academic: the resulting product needs to bring measurable improvements to the UX. Relying purely on an on-paper performance measurement would prove limiting, if not misleading, once brought to production.

This led us to conclude that it would most likely be pointless to fit the model and assess its performance on data sampled from randomized time periods when, in a live deployment, it will only ever see future data that may show different dynamics.
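
To make the difference concrete, here is a minimal sketch of the two evaluation schemes, assuming a pandas DataFrame called `listings` with a `created_at` timestamp, some numeric feature columns and an `is_scam` label. The names, the model and the split sizes are illustrative, not our actual setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative schema: `listings` has `created_at`, feature columns, `is_scam`.
features = [c for c in listings.columns if c not in ("created_at", "is_scam")]

# 1) Competition-style randomized hold-out: test rows are sampled over the
#    whole time interval, so train and test share the same distribution.
X_train, X_test, y_train, y_test = train_test_split(
    listings[features], listings["is_scam"], test_size=0.1, random_state=42
)
random_auc = roc_auc_score(
    y_test,
    RandomForestClassifier(n_estimators=200, random_state=42)
    .fit(X_train, y_train)
    .predict_proba(X_test)[:, 1],
)

# 2) Temporal hold-out: train on the past, test on the most recent 10%,
#    mimicking what the model will actually face once deployed.
ordered = listings.sort_values("created_at")
cutoff = int(len(ordered) * 0.9)
past, recent = ordered.iloc[:cutoff], ordered.iloc[cutoff:]
temporal_auc = roc_auc_score(
    recent["is_scam"],
    RandomForestClassifier(n_estimators=200, random_state=42)
    .fit(past[features], past["is_scam"])
    .predict_proba(recent[features])[:, 1],
)

print(f"randomized hold-out AUC: {random_auc:.3f}")
print(f"temporal hold-out AUC:   {temporal_auc:.3f}")
```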

This paradigm shift might feel like a needless complication of the data pipeline but, in our case, it was a crucial adjustment to our approach. The result was a sharp decline in model accuracy compared to a randomized 10% test set, as shown below.

Fig. 1 — On the left, the performance obtained with a randomized 10-fold cross-validation. On the right, the performance obtained with a hold-out on the most recent data.

What was the reason for the drop in performance? It took us some time to figure it out, and we had to run some interviews with the members of our CS team. What became clear to us is that not only has the behavior of scammers evolved (and adapted) over time, but so has the underlying HousingAnywhere platform.

Data is a moving target

Based on this discovery, we had to go back to the whiteboard and revise our assumptions. Looking at the first aspect, while it is true that some smart scammers have changed their behavior over time, making it harder to detect them automatically, it is also true that some patterns have remained the same. We had a new riddle at hand: how was it possible that the same features that were so relevant for detecting scammers in 2017 were not in 2018?

After some research, we decided to investigate whether our dataset suffered from drifting. The machine learning paradigm is grounded in the assumption that the provided data is drawn from the same statistical distribution: in the end, the learning process comes down to finding a way to best approximate the parameters governing that distribution. Cross-validation and, more broadly speaking, randomized hold-outs often provide reliable results precisely because training and test sets have been generated from the same distribution. When this assumption doesn't hold, training a meaningful model and measuring reliable performance becomes challenging. In these cases, going with a classic randomized split feels a bit like sweeping the dust under the carpet (at least, to us): it samples data from most of the underlying distributions and averages out all the possible errors, but it says nothing about a live environment in which the learned model has to prove its effectiveness on fresh data.

Beyond a theoretical description of what drifting is, we were still lacking a tool to check for it in an easy and practical way. Attaching a timestamp to each row of the dataset was enough to design a new classifier that tries to guess, from the features alone, the time period in which each record was generated. If the behavior of the variables is so time-dependent, and has changed so much over the last years, that it is possible to predict, for example, the moment a listing was created simply from its features, then the dataset very likely suffers from drift and those features add little value for the purpose of automating scammer detection. The accuracy of such a model was pretty high (81%), meaning that some of the independent variables were good estimators of the target variable (the time period). Using a tree-based model for this classification also allowed us to get some insights into the most relevant features and their behavior, which led us to discover some interesting aspects of our platform.
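
A rough sketch of this check, under the same illustrative schema as before. The grouping into quarters and the choice of a random forest are assumptions made for the example, not a description of our exact setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = [c for c in listings.columns if c not in ("created_at", "is_scam")]

# Target = the period (here, the quarter) in which each record was generated.
period = pd.to_datetime(listings["created_at"]).dt.to_period("Q").astype(str)

# If the features alone are enough to recover the period, the dataset drifts.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
accuracy = cross_val_score(clf, listings[features], period, cv=5).mean()
print(f"time-period accuracy: {accuracy:.2f}")

# A tree-based model also tells us *which* features carry the temporal signal.
clf.fit(listings[features], period)
drifting_features = (
    pd.Series(clf.feature_importances_, index=features)
    .sort_values(ascending=False)
    .head(10)
)
print(drifting_features)
```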

Fig. 2 — Evolution over time of a binary attribute

Figure 2 shows, as an example, the evolution over time of a binary attribute that doesn't cause any dataset shift: its behavior is seasonal over a year-long period, but its mean is more or less constant and its evolution quite predictable. Any correlation with scammers' behavior discovered in past years can be directly leveraged for the upcoming ones.

Fig. 3 — Evolution over the years of an attribute strongly related to the constraints the UX imposes on advertisers during the onboarding process.

Other attributes, on the other hand, changed their natural behavior in such a way that they proved irrelevant to our analysis. The average time needed to create a new listing was particularly affected: it seemed to have a strong influence on the detection of scammers, but in a context where it changes (increases or decreases) not because of the users' behavior but because of the platform's UX (i.e. user flow and features), how can we rely on it in production? A behavior that looks suspicious with respect to those characteristics today might be completely fine in a couple of months, and the same concern applies to the whole training procedure.

In other terms, we were stuck: we aimed to automate the detection of possible scammers, but the features that correctly identify a scammer today might not work tomorrow, because the underlying platform keeps changing.

All our hypotheses were verified directly with data: applying a temporal hold-out with test data coming from the weeks immediately following the training set, a reduced training window led to much higher performance than training on year-long behaviors (a rough sketch of this experiment follows below). The most interesting cases, though, are those in which new features and constraints added to the platform changed users' online behavior in such a way that it created new dynamics useful for scammer detection, as Figures 4 and 5 show.
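
The sketch below compresses the training-window experiment, again under the illustrative schema used above: fix a test slice made of the weeks right after the training data, then compare models trained on progressively shorter histories. The window lengths and the PR-AUC metric are arbitrary choices for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

ordered = listings.assign(created_at=pd.to_datetime(listings["created_at"]))
features = [c for c in ordered.columns if c not in ("created_at", "is_scam")]

# Test slice: the last four weeks of data, i.e. the weeks right after training.
test_start = ordered["created_at"].max() - pd.Timedelta(weeks=4)
test = ordered[ordered["created_at"] >= test_start]

# Compare progressively shorter training histories (lengths in weeks).
for window in (104, 52, 26, 13):
    train = ordered[
        (ordered["created_at"] < test_start)
        & (ordered["created_at"] >= test_start - pd.Timedelta(weeks=window))
    ]
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(train[features], train["is_scam"])
    score = average_precision_score(
        test["is_scam"], model.predict_proba(test[features])[:, 1]
    )
    print(f"{window:>3} weeks of history -> PR-AUC {score:.3f}")
```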

Fig. 4

Figure 4 shows the average profile completeness score for each month over the last two years. As can be seen, this average has increased significantly since August 2017, thanks to the introduction of new flows during user registration.

Figure 5 helps visualize the stark distinction between the average value of this feature for scammers and non-scammers: imposing new constraints in the UX has both increased the overall quality of the data collected on our platform and helped the classifier distinguish between scammers and non-scammers.

Fig. 5

Finally, another limitation we faced was the constant improvement of our internal scammer-prevention processes. In 2018, for instance, we developed Theia, an ML-based service that notifies our support team about every picture containing contact information shared by suspicious advertisers on the platform. Clearly, such features are only meaningful for a subset of the training dataset, but they are fundamental for all new incoming data.

The real challenge has been to extract the full potential from our dataset, leveraging both the knowledge contained in old historical behaviors and the information coming from the newly created triggers. A fixed playbook for dealing with drifting does not exist: completely removing the variables that cause the shift might look like an easy option, but it would hamper the accuracy of future detections.

Learning behaviors over time

Dozens of experiments were performed, trying different approaches and subsets of the dataset. The best performance was obtained by putting together an ensemble of four different classifiers trained for the same target but on slightly different datasets, in terms of both the features used and the time period from which the samples were extracted.

An ensemble is a statistical model composed of different stand-alone classifiers that performs better than any of the underlying models taken alone. Each of them returns a probability that the input listing is a scam; the goal is to find the best possible way to weight each individual decision and combine them into the final result. The weights associated with each model were chosen to obtain the best possible performance on the most recent data, and they describe how much each past dynamic still explains (on average) scammers' current behavior.
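
Our exact weighting scheme is not spelled out here, but the sketch below shows one possible way to fit such weights: minimize the log-loss of the weighted average on a recent hold-out. The four model names, the per-model feature matrices and the optimizer choice are all placeholders for the example.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

# Assumed: four already-fitted classifiers (different feature subsets and
# training windows) and, for each of them, the feature matrix it expects,
# built from the most recent hold-out (X_recent_per_model, labels y_recent).
models = [historical_model, all_features_model, latest_model, triggers_model]

def ensemble_proba(weights, X_per_model):
    """Weighted average of each model's scam probability."""
    probs = [m.predict_proba(X)[:, 1] for m, X in zip(models, X_per_model)]
    return np.average(probs, axis=0, weights=weights)

def loss(weights):
    return log_loss(y_recent, ensemble_proba(weights, X_recent_per_model))

# Fit the weights on the most recent data, so that they reflect how much each
# past dynamic still explains current scammer behavior.
result = minimize(
    loss,
    x0=np.full(len(models), 1.0 / len(models)),   # start from equal weights
    bounds=[(0.0, 1.0)] * len(models),
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
)
weights = result.x

# Live scoring: the same weighted average applied to a newly created listing.
scam_probability = ensemble_proba(weights, X_new_per_model)
```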

The resulting architecture is scalable and upgradable, both by adjusting the weights to adapt to new behaviors (increasing the importance of the decisions taken by the most recent model, for example) and by adding new models to the pool, trained on new features and triggers.

Fig. 6 — Final performance obtained by models trained on different datasets. "Historical" uses the largest subset of consistent data available since 2016; "All" uses all the features available since 2016, without any analysis of their meaning and evolution; "Latest" uses the largest subset of data that includes all the available features.

Reasoning at this level of abstraction proved to be key to the success of this research: each of the trained models was able to detect scammers that the others missed, and none of them completely outperformed the other three. The model trained on the largest available subset of consistent features had the best performance when taken individually, even better than the model fed with the richest dataset available, that is, the full historical behaviors. The latter still came second, but not by enough to be considered reliable on its own. The ensemble, on the other hand, shows the strongest performance of all.

The results clearly show that the final model is consistent and that it has learned dynamics of our platform that are not easily translatable into fixed rules or static filters. Moving the threshold along the precision-recall curve lets us tune the classifier either to work as an automatic detector for the most trivial cases or as a reliable monitoring tool that sends notifications for the most suspicious behaviors. In both cases, the designed architecture attributes to each newly created listing a meaningful probability score that is actually correlated with the likelihood of that listing being a scam. Even if a final manual check of newly inserted listings will always be needed to achieve 100% detection (our company's target), these probability scores will be a decisive tool for our support representatives in their fight against fraud and scams.
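
As a closing sketch, this is roughly how the two operating modes can be derived from the same precision-recall curve. The input names and the target precision/recall values are illustrative, not our production settings.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Assumed inputs: `scores` are the ensemble's scam probabilities on a labelled
# validation set, `y_true` the corresponding ground truth.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# "Automatic detector" mode: lowest threshold that still keeps precision very
# high, so the most trivial cases can be handled without a human in the loop.
auto_threshold = thresholds[np.flatnonzero(precision[:-1] >= 0.99)[0]]

# "Monitoring tool" mode: highest threshold that still keeps recall high, so
# the CS team gets notified about (almost) every suspicious listing.
monitor_threshold = thresholds[np.flatnonzero(recall[:-1] >= 0.95)[-1]]

auto_blocked = scores >= auto_threshold     # acted upon automatically
needs_review = scores >= monitor_threshold  # sent to the CS team for review
```

The two thresholds are simply two points on the same curve: the underlying probability score is always available to the support representatives, whichever operating mode is active.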
