In the age of big data, companies are collecting massive volumes of information faster than ever before. But as businesses rely more heavily on data-driven insights, a critical question emerges: Can we trust the results if our data isn’t truly representative? In this article, we explore whether the classical statistical assumptions that almost everybody (knowingly or unknowingly) relies on still hold for modern data sources, such as web or social media data, and what the implications are for decision-makers.
In classical statistics, the concept of “representativity” is fundamental. It refers to how well a sample of data reflects the broader population it aims to represent. For example, if a company wants to understand consumer preferences across Germany, it draws a random sample of the German population so that the surveyed group mirrors the country’s demographics. The sample data is analyzed, and the results are extrapolated to the population. This is commonly referred to as inference (Deville & Särndal, 1992).
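To make this concrete, here is a minimal sketch of the classical setting in Python. It uses entirely synthetic data and a hypothetical “preference score”; the point is only to show the mechanics of extrapolating from a random sample to a population.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic "population": a hypothetical preference score for one million people.
population = rng.normal(loc=50.0, scale=12.0, size=1_000_000)

# Classical setting: draw a simple random sample from that population.
sample = rng.choice(population, size=1_000, replace=False)

# Point estimate and 95% confidence interval (normal approximation),
# extrapolating from the sample to the population.
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"estimate: {mean:.2f}, 95% CI: [{mean - 1.96 * se:.2f}, {mean + 1.96 * se:.2f}]")
print(f"true population mean: {population.mean():.2f}")
```

Because the sample is drawn at random, the estimate lands close to the true value, and the confidence interval quantifies the remaining uncertainty. With big data, it is exactly this random-draw assumption that typically breaks down.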
Figure 1: Basic setting of inferential statistics for parameter estimation
However, with big data, samples are often collected passively and without clear structure – think of social media posts or website access data. This lack of clarity around who the data actually represents introduces uncertainty and potential bias. Yet, there’s no universal agreement on what exactly makes data “representative,” making it a difficult concept to define or enforce.
Challenges in data collection and analysis
Traditional surveys remain a gold standard due to their controlled design, but they are costly and time-intensive. Conversely, big data offers speed, volume, and affordability – but at a cost:
Unknown population boundaries: We often don’t know who is included in big data sets.
Selection bias: Certain groups are more likely to be captured than others. For example, younger users dominate social media platforms.
Duplicate entries and missing context: Individuals may be overrepresented or lack key demographic information.
Case studies illustrate these challenges clearly. Figure 2 depicts the share of each demographic cohort using selected social media platforms in the USA in 2023.
Figure 2: Social media usage by age group in the USA, 2023 (Statista, 2024)
It becomes evident that TikTok is disproportionately used by younger people, while LinkedIn is largely used by working-age adults. Although these platforms offer data on millions of people at a fraction of the cost of a classical survey, the captured group does not mirror the country’s demographics. The impact of these issues depends on the specific goals of the analysis. If an organization aims to analyze population characteristics, non-representative data can lead to significant errors. The effects are even more pronounced when analyzing characteristics that vary across demographic or behavioral subgroups, such as preferences and consumer decisions (Rivers, 2007).
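The following sketch simulates this effect with synthetic data: an outcome that depends on age is estimated from a “big data” source in which younger people are far more likely to be included. All numbers (the age-outcome relationship, the inclusion probabilities) are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200_000

# Synthetic population: age and an age-dependent outcome, e.g. interest
# in a product that skews young. All coefficients are made up.
age = rng.integers(18, 80, size=n)
outcome = 0.8 - 0.008 * (age - 18) + rng.normal(0.0, 0.1, size=n)

# Inclusion in the "big data" source also depends on age: younger people
# are far more likely to appear, mimicking the TikTok pattern in Figure 2.
p_include = np.clip(0.9 - 0.012 * (age - 18), 0.05, 0.9)
included = rng.random(n) < p_include

print(f"true population mean:    {outcome.mean():.3f}")
print(f"naive big data estimate: {outcome[included].mean():.3f}")  # biased upward
```

Despite having vastly more observations than a typical survey, the naive big data estimate is systematically off, because the people in the dataset are not a random draw from the population.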
Statistical solutions
While there’s no one-size-fits-all remedy, several correction methods can help mitigate bias in big data:
Propensity Scoring: This technique estimates the likelihood of an individual being included in the dataset and adjusts for under- or overrepresentation accordingly (Valliant & Dever, 2011). A minimal sketch follows this list.
Calibration and Weighting: Known population totals (like the age distribution) are used to reweight the data, making it more representative (Münnich et al., 2012). This is also sketched below.
Matching Techniques: Data from controlled surveys are used to “fill in the gaps” of big data samples (Kim & Fuller, 2004).
Imputation Methods: Missing data is statistically predicted using patterns from other sources (Kim et al., 2021).
Model-Based Approaches: Sophisticated algorithms attempt to account for selection bias by modeling its effects directly (Münnich et al., 2019).
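Continuing the synthetic example above (reusing age, outcome, and included), the sketch below illustrates the idea behind propensity scoring via inverse-probability weighting. For simplicity, the propensity model is fit on the full synthetic population, where inclusion is observed; in practice it would be estimated by combining the big data sample with a reference survey (Valliant & Dever, 2011).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Model the probability of appearing in the big data source as a function
# of age. Fitting on the full synthetic population is a simplification;
# real applications estimate this from a reference sample.
model = LogisticRegression(max_iter=1000).fit(age.reshape(-1, 1), included)
p_hat = model.predict_proba(age[included].reshape(-1, 1))[:, 1]

# Inverse-propensity weighting: units from overrepresented groups get
# small weights, units from underrepresented groups get large ones.
corrected = np.average(outcome[included], weights=1.0 / p_hat)
print(f"propensity-weighted estimate: {corrected:.3f}")  # much closer to the truth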
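```

Calibration can be illustrated in its simplest form, post-stratification: weights are chosen so that the weighted age distribution of the big data sample matches known population shares. Again, this continues the same synthetic example; in a real application the population shares would come from a census or official statistics, and the age group boundaries below are chosen arbitrarily.

```python
import numpy as np

# Known population shares per age group (taken here from the synthetic
# population; in reality supplied by official statistics).
bins = np.array([18, 30, 45, 60, 80])
pop_share = np.histogram(age, bins=bins)[0] / len(age)

# Age group of each unit in the big data sample, and the sample's shares.
stratum = np.digitize(age[included], bins[1:-1])
sample_share = np.bincount(stratum, minlength=4) / included.sum()

# Reweight so that the weighted age distribution matches the population.
w = (pop_share / sample_share)[stratum]
print(f"calibrated estimate: {np.average(outcome[included], weights=w):.3f}")
```

Both sketches correct most, but not all, of the bias, which mirrors the message of Figure 3 below: adjustment helps substantially, yet it depends on knowing or estimating how the data were selected.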
Each method requires careful application and expert judgment. Without clear metadata – the “data about the data” – these corrections can be challenging or even impossible. An example of such a correction is depicted in Figure 3.
Figure 3: Estimation of a population parameter under big data
The yellow vertical line marks the true value of the population parameter of interest. The pink vertical line depicts the estimate obtained from an uncorrected big data sample, and the solid pink curve is the corresponding probability density of the estimator. It is shifted significantly to the left and therefore systematically underestimates the population parameter. The blue vertical line marks the estimate under a propensity scoring approach, with the solid blue curve again showing the probability density of the estimator. We see that, although not perfect, the adjusted estimate is much closer to the true value, indicating a successful correction of the bias.
Conclusion
For businesses and public institutions, the key takeaway is nuanced: big data can be valuable, but only if its limitations are understood and addressed. Blind reliance on large datasets without regard for representativity may lead to faulty conclusions, wasted resources, or biased strategies. Organizations should involve both subject-matter experts and data methodologists to interpret findings responsibly. In many cases, blending traditional survey methods with big data sources offers the most robust results.
So, does representativity still matter in the era of big data? The answer might be a cautious “yes – but it depends.” While big data has revolutionized analytics, offering unprecedented access to real-time insights, the underlying principles of sound statistical inference still apply. Businesses must be aware that data quantity does not guarantee data quality. Representativity remains a critical – if complex – foundation for trustworthy analysis.
References
Deville, J.-C., Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, Vol. 87, No. 418, pp. 376-382.
Kim, J. K., Fuller, W. A. (2004). Inference procedures for hot deck imputation. Biometrika, Vol. 91, pp. 559-578.
Kim, J. K., Park, S., Chen, Y., et al. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society – Series A: Statistics in Society, Vol. 184, No. 3, pp. 941-963.
Münnich, R., Gabler, S., Ganninger, M. (2012). Stichprobenoptimierung und Schätzung im Zensus 2011. Statistik und Wissenschaft, Vol. 21.
Münnich, R., Burgard, J. P., Krause, J. (2019). Adjusting selection bias in German health insurance records for regional prevalence estimation. Population Health Metrics, Vol. 17, No. 1.
Rivers, D. (2007). Sampling for web surveys. Proceedings of the Survey Research Methods Section of the American Statistical Association, pp. 1-26.
Valliant, R., Dever, J. A. (2011). Estimating propensity adjustments for volunteers in web surveys. Sociological Methods & Research, Vol. 40, No. 1, pp. 105-137.