
How can algorithms help us to improve data quality?

In this article, we’d like to explore fields of application where algorithms can help us improve our data quality. Before getting to some concrete examples, we have to outline an important aspect to keep in mind when approaching data quality this way.

Quality of algorithms = quality of data

Insights should guide our actions by giving them a structure. And insights follow the structure of the underlying data. By definition, structures are stable and withstand perturbations. This is exactly why we believe in the value of high data quality. If you establish stable business routines that are based on flawed data or insights, the poor quality will persist in your actions. Data has a high longevity, and therefore its quality should be regarded as an asset that keeps paying off in the future.

A very good example of the longevity of data is training samples for algorithms. Any bias in the training data will be reproduced over and over again and possibly amplified by the algorithm. We have seen many scary examples of such machine biases in the past and are just beginning to understand the implications. (By the way, have you ever thought about talking to a data collector like Norstat about training samples for your machine learning projects?)

Our point here is that algorithms must be of high quality themselves if they are to improve data quality. Conversely, if algorithms are flawed, the data quality may become even worse. And regardless of how well algorithms may work someday in the future, they will only be able to help us reduce the loss of quality while processing data, never to turn crappy input into high-value output.

This being said, let’s dive into some areas where such algorithms may be applied in survey research.

Panel Recruitment

Recruiting to an Online Access Panel should be seen as the first stage of the sampling process for your project. If you don’t recruit with the highest standards to the panel, you’ll end up with a biased source for drawing project samples. It needs no further explanation that you cannot draw an unbiased sample from a biased panel. This is why we are so meticulous in panel recruitment. But how could algorithms help us to improve the recruitment quality?

  1. Keeping a panel in shape requires very complex decisions which may involve trade-offs between different parameters. For example, we have to keep an eye on the composition of the panel and replace unsubscribes. At the same time, we have to forecast the panel sizes required to cope with all requests in the near future. And we are restricted by the available budget and the feasible recruitment volume during a certain timeframe. So how should we allocate our resources? Algorithms may support our considerations by pointing out the most important demographics and recruitment channels to focus on right now, helping us build a balanced panel with less effort.
  2. Once people subscribe to the panel, their identity needs to be verified, simply because we need to make sure that these people are who they say they are. If we recruit them via telephone, we can be quite sure that we are actually speaking to a real person. While even telephone verification may not be as easy as it seems, verifying the identity of online users is definitely not something you can conclude during the first contact. Instead, it has to be seen as a process in which you keep gaining confidence about a member’s identity after having ensured that some basic requirements are met right at the beginning. Algorithms can help us speed up that process by including more data points in a much more complex analysis. Such algorithms can also reveal whether two different persons are sharing the same email address, computer or panel account.
  3. Verifying users goes hand in hand with checking for duplicates. At a very superficial level, this is done by comparing personally identifiable information of different members, such as names, email or IP addresses. But it is always worth having a deeper look at similar profiles, similar response patterns and possible connections between suspicious profiles or devices. Again, as finding the needle in the haystack can be very time-consuming and complex, automation can increase the frequency and sophistication of such quality checks (see the sketch after this list).
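
To make the duplicate check from item 3 more concrete, here is a minimal sketch in Python. All member records, fields and thresholds are hypothetical and chosen purely for illustration; a production system would combine far more signals.

```python
from itertools import combinations

# Hypothetical panel member records (illustrative only).
members = [
    {"id": 1, "email": "anna@example.com", "ip": "10.0.0.5", "profile": [34, 2, 1, 0, 3]},
    {"id": 2, "email": "bert@example.com", "ip": "10.0.0.5", "profile": [34, 2, 1, 0, 3]},
    {"id": 3, "email": "carl@example.com", "ip": "10.0.9.1", "profile": [51, 1, 0, 2, 1]},
]

def profile_similarity(a, b):
    """Share of identical answers across profile variables."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Flag pairs that share contact data AND answer profile questions alike.
suspicious = []
for m1, m2 in combinations(members, 2):
    same_contact = m1["email"] == m2["email"] or m1["ip"] == m2["ip"]
    very_similar = profile_similarity(m1["profile"], m2["profile"]) >= 0.9
    if same_contact and very_similar:
        suspicious.append((m1["id"], m2["id"]))

print(suspicious)  # [(1, 2)] -> queue this pair for manual review
```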

Recently, there have been reports about professional survey farms, where fake members are subscribed to panels in order to claim incentives at large scale. This phenomenon matches our experience that online panels repeatedly become a target for fraudsters. We don’t want to reveal any details, but we do have automated algorithmic routines in place that prevent fraudulent subscriptions to our panel, flag anomalies in our users’ behavior and report suspicious attempts to redeem incentives.

Panel Profiling

Many of our panel members joined over a decade ago, and their lives have changed during all these years, of course. All of them have become older. Some got married, others divorced. Some had children, while the children of others may already have left the family. Some got promoted, some retired. Some moved to a new home, in some cases even to another city. They may have bought new cars and new domestic appliances. They may have changed their banks, insurers and phone providers. Whatever happened in the lives of our panelists, having updated profile information allows us to draw more accurate samples.

We already prompt our panel members to update all their profile variables regularly, so there is no need for a more sophisticated algorithm here. However, with over 500 data points for most of our panelists, some of the information may still not be accurate, and we basically have to look for outliers. While the univariate method is quite simple (“show me all members whose age is higher than 120 years”), multivariate approaches are statistically much more complex (“show me all members whose combination of different variables is unusual”). For example, if you have a 16-year-old with an annual income of 50,000 Euros, age and income are each within the range of normal values. However, the combination will be a visible outlier on a scatter plot. Algorithms can help to identify and flag these outliers.
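
One possible approach to such a multivariate check is the Mahalanobis distance, which scores how unusual a combination of values is given the joint distribution of the data. Here is a minimal sketch with made-up age/income records and an arbitrary cut-off:

```python
import numpy as np

# Hypothetical (age, income) records; the last one repeats the example above:
# each value is plausible on its own, but the combination is not.
data = np.array([
    [25, 29_000], [27, 31_000], [29, 33_000], [34, 40_000],
    [38, 45_000], [42, 49_000], [45, 54_000], [48, 56_000],
    [52, 62_000], [55, 64_000], [61, 71_000],
    [16, 50_000],
], dtype=float)

mean = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))

# Mahalanobis distance of each record from the joint distribution.
diff = data - mean
dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

THRESHOLD = 2.5  # arbitrary cut-off for this illustration
for (age, income), d in zip(data, dist):
    if d > THRESHOLD:
        print(f"flag for review: age={age:.0f}, income={income:.0f}, distance={d:.1f}")
```

Only the 16-year-old with 50,000 Euros gets flagged here, although both of those values would pass any univariate check.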

Algorithms can also help to estimate the probability of certain missing values. For example, if we’d like to specifically target panel members with a high income for a study, but encounter a large number of panelists who did not answer this profile question, we need to estimate their income based on other questions. We may, for example, invite those who own a house, have more than one car in their household or travel very often. Analogously, we could calculate the probability of any other missing variable, given its known correlations with the data we have. This would allow us to draw our samples more precisely.
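
As a rough sketch of this kind of estimation, one could train a simple classifier on the members who did answer the income question and score those who skipped it. The features and training data below are hypothetical:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: members who DID answer the income question.
# Features per member: [owns_house, cars_in_household, trips_per_year]
X = [
    [1, 2, 8], [1, 1, 5], [0, 0, 1], [0, 1, 2],
    [1, 2, 6], [0, 0, 0], [1, 1, 4], [0, 1, 1],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = high income, 0 = not

model = LogisticRegression().fit(X, y)

# Members who skipped the income question get a probability estimate instead.
skipped = [[1, 2, 7], [0, 0, 1]]
for features, p in zip(skipped, model.predict_proba(skipped)[:, 1]):
    print(features, f"-> P(high income) = {p:.2f}")
```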

But careful! This is one of the cases we had in mind when writing our disclaimer in the introduction. We have to make sure the algorithm doesn’t harm the general quality of our sample. For example, if we actually invite frequent travelers instead of people with a high income, we may introduce a bias: our data will then show, as a pure artifact of the sampling, that most respondents with a high income travel frequently. Therefore, we’d have to make sure that the quality of our predictive model is good enough to improve the overall quality of our research.

Panel Maintenance

We are convinced that there is a strong link between the motivation of our panel members and the quality of their responses. In our next examples, algorithms support our efforts to give panelists a better membership experience and in this way make a contribution to data quality.

The purpose of participating in a panel is taking surveys. Everything that increases the likelihood of participating in surveys also contributes to a positive membership experience. An important factor in boosting response rates is the right timing of invitations. On a Monday morning, when your email inbox is overflowing, you’d probably ignore a survey invitation in order to cope with the more urgent stuff. In contrast, right after lunch, you might still be in the mood for a break, so a diversion may be very welcome. Generally speaking, algorithms could help us identify the right time of day for each panelist and postpone notifications to moments when they are likely to receive more attention.
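
A very simple version of such a timing algorithm could just pick, for each panelist, the hour of day with the best historical response rate. A sketch with a made-up invitation log:

```python
from collections import defaultdict

# Hypothetical invitation log: (member_id, hour_sent, responded)
log = [
    (7, 9, False), (7, 13, True), (7, 13, True), (7, 9, False),
    (7, 20, True), (8, 9, True), (8, 13, False), (8, 9, True),
]

# Aggregate responses and invitations per (member, hour of day).
stats = defaultdict(lambda: [0, 0])
for member, hour, responded in log:
    stats[(member, hour)][0] += responded
    stats[(member, hour)][1] += 1

def best_hour(member):
    """Hour with the highest observed response rate for this member."""
    rates = {h: r / n for (m, h), (r, n) in stats.items() if m == member}
    return max(rates, key=rates.get)

print(best_hour(7))  # 13 -> schedule this member's invitations after lunch
```

In practice one would smooth these estimates (for example with a prior over all members), so that a single lucky invitation does not dominate the schedule.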

This technique can go far beyond merely using the time of day and also include other data, such as usage patterns from the panel app (e.g. geolocation, gyroscope). For example, if panel members randomly twist their phones in their hands while at home, they may be experiencing downtime and be more likely to respond to push notifications in that moment.
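
Purely as a toy illustration of this idea (the sensor readings, the home flag and the threshold are all invented), such a heuristic might look like this:

```python
import statistics

def likely_downtime(gyro_magnitudes, at_home, spin_threshold=1.5):
    """Heuristic: high rotation variance while at home suggests idle fiddling."""
    return at_home and statistics.pstdev(gyro_magnitudes) > spin_threshold

# Hypothetical sensor window: rotation magnitudes (rad/s) over the last minute.
readings = [0.1, 3.5, 0.2, 4.1, 0.3, 3.8, 0.1]
if likely_downtime(readings, at_home=True):
    print("good moment for a push notification")
```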

Sampling

Closely related to this is sampling automation. Few things are more frustrating for panelists than being invited to a survey that has already been closed, either partially for a particular quota or completely. For this reason, you typically send smaller and smaller samples as the fieldwork progresses in order to approach the desired number of completes without overshooting quotas. For obvious reasons, this is quite labour-intensive and can become quite complex the more quotas you have. Automated sampling can help to minimise the loss of sample by sending survey invitations in smaller and more frequent batches than any human sampler could. This is a technique we already apply for sample definitions that are not overly complex. In addition, statistically estimated profile information may be used in the future, as long as such algorithms do not become a new source of flaws (see above).
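
The core of such an automated sampler is a feedback loop: estimate the completion rate from the fieldwork so far and invite just enough people to close the remaining gap, damped so quotas fill up gradually instead of overflowing. A minimal sketch with invented numbers:

```python
import math

def next_batch_size(target_completes, current_completes, completion_rate, damping=0.8):
    """Invitations for the next batch, erring on the side of undershooting.

    completion_rate: observed share of invitees who complete the survey.
    damping: < 1.0 so the quota is approached in several small steps.
    """
    remaining = target_completes - current_completes
    if remaining <= 0 or completion_rate <= 0:
        return 0
    return max(1, math.floor(remaining / completion_rate * damping))

# Example: 37 of 50 completes reached, 12% of invitees complete so far.
print(next_batch_size(50, 37, 0.12))  # -> 86 invitations in the next batch
```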

Another technique to reduce the negative experience of screen-outs and quota fails is routing. There are two fundamental ways to go about it. The somewhat dumb way, which we’ve probably all seen somewhere in the past, is to keep respondents in an endless flow of survey screeners until they qualify. After reaching the end page of one survey, you immediately get the chance to qualify for another questionnaire. We are quite skeptical about this approach, as it may compromise the motivation of respondents and encourage speeding and other satisficing response behaviour.

However, there is a smarter way of thinking about routing. You invite panel members in the old-fashioned way and tell them that a new survey is available for them. Once they click on the link in the invitation, they get routed to an open survey that best matches their profile. Even if the study to which they were originally assigned is closed, they will be allowed to participate in another survey. With this method of routing, the risk of compromising sample quality is considerably lower, as only a small overflow (from automated sampling) will be redirected. Beyond that, respondents will not encounter endless sequences of survey screeners, but actually respond to only one survey at a time. In any case, you need a smart algorithm in place that keeps track of all members who haven’t responded yet and of the target group definitions of all available studies, and then makes the best possible match. In this way, you would improve the motivation of panelists to participate.
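
A bare-bones version of that matching step might look like the following sketch; the study definitions, the quota field and the tie-breaking rule are all hypothetical:

```python
# Hypothetical open studies with simple target-group definitions.
studies = [
    {"id": "A", "target": {"gender": "f", "age_group": "30-39"}, "open_slots": 0},
    {"id": "B", "target": {"age_group": "30-39"}, "open_slots": 12},
    {"id": "C", "target": {"gender": "m"}, "open_slots": 4},
]

def route(member):
    """Return the best-matching open study for this member, if any."""
    matching = [
        s for s in studies
        if s["open_slots"] > 0
        and all(member.get(k) == v for k, v in s["target"].items())
    ]
    # Tie-break: prefer the most specific target definition.
    return max(matching, key=lambda s: len(s["target"]), default=None)

member = {"id": 42, "gender": "f", "age_group": "30-39"}
print(route(member)["id"])  # "B": study A matches but is full, C does not match
```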

During the Interview

Every study is unique. This makes it really hard to define general measures of quality control that fit all cases. However, algorithms can help to benchmark the response quality of an interview against all previous ones. Does a respondent move considerably faster through the questionnaire than others? Are responses in text boxes shorter, or do they contain nonsense? And what about the variance in grid questions? Together, these indicators may draw a bigger picture and trigger different actions if a certain threshold is reached. You could let the algorithm flag the interview for manual inspection, display a warning to the respondent, insert a red herring question to screen out inattentive respondents, or remove the whole interview from the database right away.
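
To illustrate, a minimal scoring routine along these lines might look as follows; the indicators and thresholds are invented for the example:

```python
import statistics

def quality_flags(interview, benchmark_durations):
    """Compare one interview against benchmarks from previous interviews."""
    flags = []
    typical = statistics.median(benchmark_durations)
    if interview["duration_s"] < 0.5 * typical:
        flags.append("speeding: much faster than typical respondents")
    if statistics.pvariance(interview["grid_answers"]) == 0:
        flags.append("straightlining: no variance in grid question")
    if len(interview["open_text"].strip()) < 5:
        flags.append("open answer suspiciously short")
    return flags

interview = {"duration_s": 180, "grid_answers": [3, 3, 3, 3, 3], "open_text": "ok"}
print(quality_flags(interview, benchmark_durations=[540, 610, 480, 700, 590]))
```

Which action a flag triggers (a warning, a red herring question or removal) would then depend on how many indicators fire at once.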

Another technique is the deliberate priming of respondents to subconsciously boost their response quality. Here, an intermediate page with snackable content is presented before relevant questions to get the respondent into the right mindset for the upcoming task. As this technique isn’t equally effective for all respondents and may bloat the length of an interview, algorithms can help to present the right primers only to the right people at exactly the right moment. Again, these techniques have to be applied carefully with regard to the overall quality, as they may also do some harm.

Until now, we have only spoken about online research, where the potential of algorithms needs no further explanation. However, other methods of data collection are also subject to digitalization and may benefit from algorithms. Think about telephone interviews, for example. Algorithms could analyze the voice of the respondent and perform a sentiment analysis during the interview. This information may not only be helpful to contextualize the data when analyzing it afterwards, but also give valuable feedback to the interviewer while talking to the respondent. However, as stated, it’s really hard to define measures that fit every study.

Data Processing

After all data has been collected, usually a few more steps need to be taken before it can be analyzed. The first step consists of cleansing the data, i.e. removing cases that cannot be used for analysis. Given all the steps above, this shouldn’t take too much time and effort anymore. The next step is coding all unstructured data, especially open answers from text boxes. Algorithms may recognize whether an existing code plan applies (e.g. a list of brands in a certain category) or be trained to learn and apply a new code plan. Different languages may be automatically recognized and translated. Finally, all data may be weighted to adjust for smaller discrepancies in the sample composition or to match it to a different base unit (e.g. making the feedback representative of all inhabitants or of all households).
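
As a small illustration of the weighting step, a basic post-stratification scheme scales each cell of the sample to its known population share. The shares below are made up:

```python
# Achieved sample composition vs. known population distribution (hypothetical).
sample_share = {"male": 0.44, "female": 0.56}
population_share = {"male": 0.49, "female": 0.51}

# One weight per cell: respondents in under-represented cells count for more.
weights = {cell: population_share[cell] / sample_share[cell] for cell in sample_share}
print(weights)  # {'male': 1.11..., 'female': 0.91...}

respondents = [{"id": 1, "gender": "male"}, {"id": 2, "gender": "female"}]
for r in respondents:
    r["weight"] = weights[r["gender"]]
print(respondents)
```

Real weighting schemes usually balance several variables at once (e.g. by raking), but the principle is the same.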

So what?

Some of the techniques described in this article are already in place, others are still to be developed. And in addition to this “low-hanging fruit”, there are plenty of other areas of application where algorithms may facilitate the way we work with data.

Whatever we do, we strive for the best possible quality and hesitate to implement methods that may compromise our high standards. We’d love to hear from you if you’d like to learn more or have a question.
