At Quantifind, we use Statistics and Machine Learning to discover social media data patterns that impact our customers' KPIs. A major emphasis is to ensure that our algorithmic insights are representative of our customers' user base, and not just those users for which we can gather data. In this blog post, we discuss selection bias, whereby non-uniformly sampled data can induce bias on the inferred results, and briefly survey selection bias mitigation techniques.
Suppose that a biologist wants to estimate the average size of fish in a pond. She collects data for her experiment with a fishing net: scoop out a few specimens, measure their length, and compute their average. The fish caught are samples from the population of fish in the pond and, by the law of large numbers, we know that the average of the samples should be a good estimate of the expected fish size in the pond's population.
But what if the fishing net's mesh size is too large? What if the biologist did not know of a species small enough to go through the fishing net before it is scooped out of the water? What if different species are located on different depths of the pond? In any such case, the sampled specimens are no longer representative of the overall population: we say we have incurred selection bias.
Selection bias is not specific to life sciences research. It appears in economics, sociology and in the analysis of social media or Internet data in general. Not only is its presence universal, it often goes undetected; since we only have samples from the caught specimens, how do we know what the rest of the fish look like?
The first step towards mitigating the effect of selection bias is to assess whether our sampling procedure is likely to pick certain samples over others. In order to translate this concept to specific numbers, we will use a probabilistic model of sampling. We represent the samples randomly1 chosen from our population as a random variable U, and the samples obtained via our (possibly biased) sampling procedure by Û. One way to model the relationship between these two random variables is to assume that they are generated by the laws
where S is a nuisance binary random variable that indicates whether a given sample U is selected or not. In the context of social media data analysis, we can think of U as users sampled from the global U.S. population, and Û as those sampled from the subset of U.S. citizens who are social media users, see Figure 2 for a diagram.
Under this probabilistic model, detecting selection bias can be formulated as a question about the two generating probability laws PU, PU|S: If they are "equal" (which is equivalent to assuming that the random variable S is independent of U, i.e. the uniform sampling assumption), then there is no selection bias and hence samples from Û are as good as samples from U. Unfortunately, while there are a number of statistical tools to answer such a question (e.g. Kullback-Leibler divergence, Mutual Information), they all rely on the ability to gather samples to (partially) estimate PU, which is typically inaccessible (otherwise we wouldn't have a selection bias problem to begin with!). How to escape then from this seemingly chicken-and-egg problem?
The first step is to realize that, in most occasions, we only care to do inference about certain features of the population: In our biology example, our main concern is to get a representative sample as far as size is concerned. Getting more or less specimens of a given age group, for example, is irrelevant so long as the relationship between size and age is statistically independent. Mathematically, we can express this observation by defining the feature of interest as a function ƒ(U) of the original random variable U. A sufficient condition for our inference on ƒ(U) from samples of Û to be unbiased is that:
Notice that, while seemingly trivial, this observation is actually important: It allows us to transform the problem from one of comparing the distributions of arbitrarily complex random variables to one of controlling whether the answers to the questions we care about (represented by the values taken by the function ƒ(U)) are being impacted by our sampling procedure or not.
Fig. 1. Venn diagram describing the relationship among the overall population, a brand's customers, and the users that publish data on social media.
The second step is to leverage the existence of known invariants. For example, if we know from the US census that about 40% of citizens are over age of 45, we observe a problem if we have only 10% of our samples in that age group. Other, subtler, examples of invariant properties are the frequency of certain language features across different datasets, or the distribution of activity levels across different subsamples of the same dataset (see Figure 2 for examples). We can again formulate this observation using the same mathematical language: For each of the I invariant properties that we know about, we define a function gi(U) that takes a sample U and returns 1 if the property is verified, 0 otherwise. Then we can gather evidence on the presence of selection bias by comparing
The left hand side of the above expression can be estimated from our (possibly biased) sample, while the right hand side is typically estimated from an unbiased reference dataset (e.g. U.S. census for demographic features). A statistical justification for these comparisons can be obtained by posing the problem as that of multiple hypothesis testing, where the null hypothesis is that the population frequencies of the allegedly invariant features are the same across the sample and the reference dataset. From that perspective, it is clear that such comparisons only provide evidence on the presence of selection bias but not on the absence of it; that is, we can detect bias if the null hypothesis is rejected but cannot confirm there is none otherwise.
There are two strategies to fight selection bias: prevention and mitigation. Prevention strategies consist in adjusting the sample acquisition procedure so as to avoid introducing bias. The most common form of it is stratified sampling and its variants. In this blog post, we are interested in situations where stratified sampling is too expensive to implement or where the data has already been sampled in a possibly biased manner. We will hence focus on strategies for mitigation.
Fig. 2. Left: Example of age distribution bias. The age buckets corresponding to population 45 and older are severely under-expressed in social media users (census data adatped from http://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf). Right: Example of activity level bias. The sample subset has a significantly different activity level profile than the social media subset obtained via uniform sampling of activity levels. Here activity is defined as number of social media posts for a given user.
Not surprisingly, there has been a lot of research work in the Statistics and Machine Learning community on bias-correcting strategies (see references at the end of this article). In Machine Learning, the problem is usually restricted to the covariate shift problem, whereby we learn a predictive model from training features whose distribution is potentially different from the test data on which the model will be typically evaluated , , . Another line of work focuses on density estimation problems . Most of this research builds upon the following observation, inspired by the Importance Sampling literature :
The quantity PU(U)/PU|S(U) is called the importance weight, which we will assume is defined for all U where ƒ(U) is non-zero. Since we have samples Û ∼ PU|S, we can estimate the right hand side of this equation as a weighted empirical average:
In layman terms, equation (2) justifies the common sense approach of down weighting samples that are overly represented relative to the population and up weighting those that are underrepresented.
Notice that a large variety of Statistical Inference and Machine Learning problems can be posed as operations on the expectation of a functional: For instance, in supervised learning ƒ(U;Θ) is the loss function associated to a given sample and a choice of parameters Θ, and we want to minimize the empirical risk with respect to Θ ,. Other inference problems that don't fit within the framework of (1), for instance posterior sampling, can sometimes be tackled with Rejection Sampling, described in Algorithm 3.
Fig. 3. Simple Rejection Sampling approach to generate samples from PU from samples of the auxiliary distribution, which in our setting is PU|S. The finite constant M is chosen so that PU < MPU|S. Note that existence of M is a more strict condition than that for existence of the Importance Sampling weights. See  for more information regarding its properties and its relationship to more sophisticated techniques.
The reader will have noticed a common theme in both the application of (2) and Algorithm 3: They both rely on having access to the importance weights,
These weights are not available in practice and have to be estimated. The plug-in estimator (that is, replacing PU and PU|S by their empirical frequencies) is often not practical since non-parametric density estimates typically require a very large number of samples (a need that grows with the ambient dimension of the outcome space of U). Researchers in Machine Learning have proposed several alternative approaches to the naive plug-in estimator. In , the authors proposed Kernel Mean Matching, which leverages a property of a class of high-dimensional feature mappings that enables the comparison of distributions via the expectation they define on the feature space. This theoretical property enables the estimation of β(U) via a Quadratic Program, of dimension proportional to the number of samples available. Another interesting approach, with computational complexity less heavily linked to sample size is proposed in , where they estimate the weights β(U) by minimizing the sample estimate of the Kullback-Leibler divergence between PU and β(U)PU|S.
Finally, a most recent line of work revolves around finding β(U) that minimizes a measure of discrepancy for the loss function at hand, relative to that loss evaluated under the unbiased sample distribution , . The key observation is that the choice of β needs to be good as far as the model cost function is concerned, and not necessarily a good approximation of PU(U)/PU|S(U) (in fact,  shows that the wide range of values that PU(U)/PU|S(U) can take can harm the performance of the learning algorithm, even if those values are obtained from the "true" distributions).
The problem of correcting for selection bias is as important as it is complicated. It is important because the advent of larger and larger datasets does not imply that those datasets are obtained by sampling uniformly at random from the underlying population. On the contrary, larger datasets may render the detection of present bias harder since there are more dimensions to control for.
On the other hand, large datasets require compromises in terms of the computational complexity of applicable bias-mitigation approaches, hence there is a need for approaches that allow us to clearly trade-off accuracy and computational cost. The techniques reviewed in this blog post, while not exhaustive, illustrate some of the current research trends as well as some of the more mature methodology regarding selection bias mitigation.
1By "randomly," here we mean uniformly at random.
 Steffen Bickel, Michael Brückner, and Tobia Scheffer. Discriminative learning under corvariate shift. The Journal of Machine Learning Research, 10:2137-2155, 2009.
 Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103-126, 2014.
 Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Algorithmic learning theory, pages 38-53. Springer, 2008.
 Miroslac Dudík, Steven J. Phillips, and Robert E. Schapire. Correcting sample selection bias in maximum entropy density estimation. In Advances in neural information processing systems, pages 323-330, 2005.
 Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601-608, 2006.
 Art B. Owen. Monte Carlo theory, methods and examples. 2013.
 Hidetoshi Shimodaira. Improving predictive interference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227-244, 2000.
 Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1422-1440, 2008.
 Biana Zadrozny. Learning and evaluating classifiers under sample selection bias. In Processings of the twenty-first international conference on Machine learning, page 114. ACM, 2004.