
Big Data

by Viktor Mayer-Schönberger and Kenneth Cukier

Explore how big data is revolutionizing our lives by transforming raw information into actionable insights. This book provides a comprehensive look at how individuals and companies are harnessing this power, the future implications of a data-rich society, and the ethical challenges it presents.

Big Data and the Revolution of How We Understand the World

Have you ever wondered how Google knows a flu outbreak is coming before doctors do—or how your smartphone can guess what you’re about to type before you finish? In Big Data, Viktor Mayer-Schönberger and Kenneth Cukier argue that a profound shift is underway: we are entering a world where massive datasets, instead of human intuition or small samples, reveal insights, correlations, and predictive patterns that transform how we live, work, and think.

For centuries, we’ve relied on small data and causation—trying to understand “why” things happen. Mayer-Schönberger and Cukier contend that in the age of digital abundance, we can often settle for “what,” because correlations can predict outcomes faster and more usefully than traditional causal models. The heart of their argument is simple: when data becomes massive, messy, and interconnected, the sheer quantity changes the quality of what can be known.

The Shift from Small to Big

The authors open with stories that make this transformation tangible. When Google engineers discovered that flu-related search queries mirrored Centers for Disease Control data, they realized they could track influenza activity week by week—without medical tests or lab reports. Similarly, computer scientist Oren Etzioni scanned billions of airline tickets to create Farecast, a service predicting whether fares would rise or fall so consumers could buy at the right time. These examples show how big data doesn’t rely on deep causal reasoning—it finds patterns, correlations, and probabilities based on vast amounts of information.

This represents more than technological progress: it’s a shift in how society interacts with knowledge. Once, data lived in static records—library catalogs, census tables, accounting ledgers. Now, it’s dynamic, streaming from sensors, financial transactions, search engines, and social media, creating an ever-growing ocean of information that can be mined in ways never imagined.

Three Transformations of the Big Data Era

Mayer-Schönberger and Cukier outline three monumental changes. First, we can analyze all the data instead of just a sample. Statistical sampling, born of necessity when we couldn’t handle full datasets, now gives way to “N=all,” where the entire dataset is examined. Second, big data lets us tolerate messiness. Instead of obsessing over perfect accuracy, we accept noise and inconsistency because large quantities compensate for imprecisions. Third, and most radically, we move from seeking causation to finding correlations. Knowing what often tells us enough to act; knowing why may no longer be required.

This philosophical shift parallels what astronomers experienced centuries ago when telescopes enlarged their view—it changed the nature of what could be known. (Note: In a similar way, Daniel Kahneman’s work in behavioral economics highlights how data challenges our intuitive cause-seeking mindset.) The authors emphasize that this isn’t about abandoning reason—it’s about embracing probability over certainty and recognizing that good enough insights, drawn from vast data, often outperform exact but narrow ones.

Why Big Data Matters to You

Big data affects everyone. It already determines the ads you see online, how your credit score is calculated, and even how hospitals detect disease outbreaks. But its influence reaches beyond business—it changes philosophy. If correlations replace causation, what happens to science, justice, and free will? Mayer-Schönberger and Cukier invite you to reflect on the implications: as algorithms make decisions that even their designers struggle to explain, society must balance data’s power with accountability and human values.

Ultimately, Big Data argues that our world is entering a new epistemological era. Data itself—once a passive record of reality—becomes an active agent shaping how reality is understood. The book doesn’t just describe a technical revolution; it presents a cultural and intellectual one. It’s about learning to live in a world where knowledge comes not from knowing all the details, but from listening to what the data tells us—even when we don’t fully understand why.


From Sampling to N=All

For most of human history, we’ve lived in a world of information scarcity. We couldn’t collect or process everything, so we relied on statistical sampling: using small, representative subsets to infer insights about the whole. Viktor Mayer-Schönberger and Kenneth Cukier explain that big data abolishes this constraint. With technology powerful enough to process all available information, we can shift from sampling to “N=all”—examining the complete dataset rather than fragments.

The Origins of Sampling

The authors trace sampling’s evolution through history. The ancient census existed largely as a brute-force tally, while seventeenth-century merchants used small records to estimate population sizes. John Graunt, a British haberdasher, pioneered early statistical inference during a plague, estimating London’s population without counting everyone. By the nineteenth century, government agencies like the U.S. Census Bureau refined sampling methods to make large-scale measurement manageable. Jerzy Neyman’s twentieth-century work on random sampling revolutionized the practice—showing that selecting as few as 1,100 random people could represent millions with a low margin of error.

These methods reflected the limits of their time: when data collection was slow and expensive, sampling was the best we could do. But sampling, the authors note, comes at a cost—inaccuracy and loss of granularity. As sample size shrinks, subtle trends vanish, outliers disappear, and small communities are blurred into statistical averages.
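Neyman's 1,100-person figure can be sanity-checked with the standard margin-of-error formula for a sampled proportion. The sketch below is a generic statistics illustration, not something from the book:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% confidence margin of error for a proportion estimated
    from a simple random sample of size n (worst case at p = 0.5)."""
    return z * math.sqrt(p * (1 - p) / n)

# A random sample of 1,100 people pins a population proportion down
# to within about three percentage points, no matter how many
# millions of people the full population contains.
print(f"{margin_of_error(1100):.3f}")  # 0.030
```

Note that the margin depends on the sample size, not the population size, which is exactly why a well-drawn sample of 1,100 can stand in for millions.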

The Revolution of N=All

Big data changes everything. Sensors, mobile phones, and digital platforms gather information automatically, making full datasets affordable to analyze. When Google studies billions of search queries to track flu outbreaks, it’s examining the entire population of online behavior, not a sample. When Oren Etzioni’s Farecast uses 200 billion airline prices to forecast fare changes, it’s analyzing “N=all.”

The benefits are twofold. First, you gain granularity—you can zoom into subgroups, locations, or moments without losing detail. Second, you can detect anomalies. For example, Cynthia Rudin’s study of New York City’s manhole explosions identified hidden correlations across tens of thousands of incidents by analyzing every record. Sampling would have missed the pattern entirely.

Why It Matters

The shift to N=all alters how knowledge is produced. In social science, traditional surveys and small studies give way to large-scale observational data. One researcher, Albert-László Barabási, analyzed mobile phone logs from an entire nation to map social networks—a feat impossible with sampling. This level of data richness redefines what we can study: instead of simplified hypotheses tested on samples, we now explore reality in all its messy detail. (In his work on network theory, Barabási finds that removing individuals who connect distant communities can collapse social structures—an insight visible only at large scale.)

For you, N=all means decisions—from marketing to medicine—can rely on full visibility, not partial guesses. It’s like moving from reading one paragraph of a story to seeing the entire book. In a world drowning in data, looking at everything isn’t just possible; it’s powerful.


Embracing Messiness Over Exactitude

Do you always trust precise numbers? Mayer-Schönberger and Cukier warn that in a big-data world, precision is not always the highest virtue. The second transformation they highlight is embracing messiness: allowing errors and imperfections in exchange for scale and speed. In small datasets, inaccuracy corrupts results; in massive ones, roughness doesn’t matter as much—and often leads to better insights.

When More Trumps Better

The authors show this through historical and modern examples. For centuries, scientists pursued perfect measurement. Lord Kelvin declared, “To measure is to know.” This obsession with precision worked when observations were few. But today, quantity beats perfection. Microsoft researchers Michele Banko and Eric Brill tested grammar-checking algorithms on corpora ranging from one million to one billion words. As they added data, accuracy climbed sharply, even for simple models. Google engineers went further, training translation software on a trillion words of messy, unfiltered web text. Despite the grammatical chaos, Google Translate outperformed older, rule-based systems built on smaller, curated data.

The lesson? With enough data, errors average out. Imperfection becomes acceptable because the larger picture is clearer. The authors quote Peter Norvig’s insight: “Simple models and a lot of data trump more elaborate models based on less data.”

Messiness Creates Flexibility

Messy data is also more adaptable. Instead of carefully structured databases, modern systems like Hadoop process information distributed across thousands of servers, tolerant of gaps, duplicates, and noise. For example, Visa cut its processing time for billions of transactions from a month to under fifteen minutes by accepting imperfect data and focusing on broad patterns, not surgical detail.

BP’s refinery sensors show the same principle in action: rather than counting only precise readings, engineers monitor overwhelming volumes of imperfect data. The overall pattern reveals which crude oils corrode pipes faster—knowledge invisible in smaller, precise samples. Messiness, in short, becomes a tradeoff for insight.

Why “Good Enough” Is Good Enough

Our culture still prizes exactitude—your bank balance or medical diagnosis must be right. But for many decisions, an approximate answer is sufficient. Databases, once rigidly organized, now evolve toward flexibility: “noSQL” systems accept variable formats, enabling broader use. Pat Helland, a leading database designer, concluded, “If you have too much data, then ‘good enough’ is good enough.” (Note: This echoes psychologist Herbert Simon’s concept of “satisficing”—choosing an adequate solution rather than an optimal one.)
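The flexibility the authors describe can be pictured with a tiny schema-free sketch. The records are invented; real noSQL document stores apply the same principle at vastly larger scale:

```python
import json

# Schema-flexible storage in the noSQL spirit: records need not
# share the same fields, unlike rows in a rigid relational table.
records = [
    {"user": "ana", "email": "ana@example.com"},
    {"user": "ben", "phone": "555-0100", "tags": ["beta"]},
]

# Queries tolerate missing fields instead of rejecting the record.
emails = [r.get("email") for r in records]
print(json.dumps(emails))  # ["ana@example.com", null]
```

A rigid schema would force every record into identical columns up front; the "good enough" approach accepts variation and copes with it at query time.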

For you, embracing messiness means learning to trust trends over details. It’s the difference between counting every penny in the cash register and estimating the economy by millions of transactions. Big data invites you to loosen perfectionism, because when the dataset grows, the truth lies not in the decimal points but in the direction of the curve.


The Rise of Correlation over Causation

Why is Amazon so good at recommending exactly what you’ll like? It’s not because it knows why you want it—it’s because it sees patterns in what you do. Mayer-Schönberger and Cukier identify the third great shift of big data: replacing causation with correlation. In a world of vast information, we don’t need to understand every cause; we can let patterns speak for themselves.

From “Why” to “What”

In 1998, Amazon engineer Greg Linden realized the company’s recommendation system didn’t need to compare customers—it only needed to find associations among products. By analyzing billions of item pairings, Amazon created “item-to-item” recommendations that predicted what books, movies, or gadgets you’d buy next. The algorithm doesn’t know why you want a Hemingway novel—it only knows that people who buy Fitzgerald often buy Hemingway too. Sales surged, and Amazon replaced its human editors—the “village geniuses” writing reviews—with data. In this model, causation (“why”) is secondary; correlation (“what”) drives results.
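A minimal co-occurrence sketch captures the idea behind item-to-item recommendation. The baskets here are invented, and Amazon's production algorithm is far more sophisticated, but the logic is the same: count what is bought together, then suggest the most frequent partner.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase baskets, invented for illustration.
baskets = [
    {"fitzgerald", "hemingway"},
    {"fitzgerald", "hemingway"},
    {"fitzgerald", "steinbeck"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

def recommend(item, k=1):
    """Items most frequently co-purchased with `item`."""
    scores = {b: c for (a, b), c in pair_counts.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("fitzgerald"))  # ['hemingway']
```

The recommender never asks why readers pair these authors; the co-purchase counts alone drive the suggestion.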

Predictive Power Without Understanding

Correlation finds relationships, not explanations. Walmart discovered that before a hurricane, sales of Pop-Tarts spike. The company doesn’t ask why families crave pastries during storms; it simply stocks more. Similarly, scoring algorithms predict who will take medication as prescribed from proxies such as car ownership, even though owning a car has no causal connection to following a prescription. The link is statistical, not logical, but useful. The authors call this a new empiricism: letting numbers predict behavior without insisting on theories to explain them.

Transforming Science and Society

This shift challenges centuries of tradition. Since Bacon and Newton, Western thought prized causal knowledge. Yet Mayer-Schönberger and Cukier argue that big data’s strength lies in prediction, not understanding. Even Google’s search engine doesn’t “know” meaning—it ranks results based on how pages link and how users click. Correlations drive most of what we consider intelligence today, from self-driving cars to predictive maintenance.

Sometimes, correlations reveal mysteries faster than causal science. Dr. Carolyn McGregor’s neonatal research found that premature babies with stable vital signs were actually more likely to develop infections—a discovery made entirely by analyzing correlations, counter to medical intuition. What mattered was prediction, not explanation. (Note: This resonates with Nassim Taleb’s argument in Black Swan—that we often invent causes after the fact rather than truly knowing them.)

For you, this means rethinking how you judge evidence. You don’t always need to know why your customers prefer one product or why your students learn better on certain days. Big data gives you another kind of knowledge—the ability to see patterns that let you act decisively, even when the deeper “why” remains unknown.


Datafication: Turning Life into Numbers

How does a 19th-century naval officer link to your smartphone? Through datafication—the process of turning everyday phenomena into measurable information. Mayer-Schönberger and Cukier trace this principle back to Commander Matthew Fontaine Maury, who transformed old ship logs into data to map winds, currents, and optimal sea routes. By cataloging over a million observations, he revolutionized navigation—making the oceans predictable through data.

From Maury to Modern Life

Maury’s work shows that datafication predates computers. It’s not digitization—turning analog into bits—but the act of quantifying the unquantified. Today, Professor Shigeomi Koshimizu in Japan applies this logic to car seats: analyzing 360 pressure points to identify drivers by their posture. The contours of your body become a data signature. Datafication means seeing everything—traffic, health, emotions—as measurable patterns.

Digitization vs. Datafication

Digitization simply stores information; datafication interprets it. When Google scanned 20 million books for its Books project, that was digitization. But when it applied optical character recognition to turn the page images into searchable, analyzable text, that was datafication. Suddenly, historians could track when terms like “freedom” or “democracy” first spiked in literature. Entire fields—like “culturomics”—emerged from this ability to study cultural evolution through numbers.
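At its core, the culturomics idea amounts to counting term frequencies over time. A toy sketch with an invented three-entry corpus (real culturomics runs over millions of scanned books):

```python
from collections import Counter

# Toy stand-in for a datafied corpus of (year, text) pairs.
corpus = [
    (1800, "liberty and order"),
    (1850, "democracy and liberty"),
    (1900, "democracy, democracy everywhere"),
]

def term_frequency(term):
    """Occurrences of `term` per year across the corpus."""
    counts = Counter()
    for year, text in corpus:
        counts[year] += text.lower().count(term)
    return dict(counts)

print(term_frequency("democracy"))  # {1800: 0, 1850: 1, 1900: 2}
```

Once the scanned pages become countable text, a rising curve like this is the "spike" historians can spot, something invisible in the raw page images.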

Expanding the Boundaries

Datafication now seeps into every domain. Facebook has datafied relationships, Twitter has datafied sentiments, and LinkedIn has datafied our professional experiences. Sandy Pentland’s research on reality mining uses mobile data to identify illness patterns before symptoms emerge. IBM’s patents for smart floors quantify human movement and even detect falls. Each of these turns the invisible into measurable insight—making the world more legible and predictable.

This mindset reshapes how value is created. Data becomes the new oil of the economy. Maury’s charts saved ships and time; today, data saves lives, optimizes cities, and enables prediction from health to crime. As the authors conclude, we are building a global infrastructure of comprehension, an Encyclopédie of human behavior expressed in numbers.

For you, datafication means awareness: the next innovation might arise not from new technology, but from seeing ordinary phenomena—how people sit, spend, or speak—as data waiting to be captured and reimagined.


The New Value of Data

In the industrial age, factories and land drove wealth; in the information age, data does. Mayer-Schönberger and Cukier explain that data’s value no longer stems from its primary use—like completing a transaction—but from its potential reuse. This concept of “option value” transforms business strategy.

From Captcha to ReCaptcha

To illustrate, the authors recount Luis von Ahn’s invention of Captcha: the squiggly letters that prove you’re human when you log in online. Millions typed these daily, generating valuable human effort that was discarded—until von Ahn’s upgraded version, ReCaptcha, reused this effort to decode scanned words that computers couldn’t read. Each login not only verified identity but helped digitize old books. The data, previously waste, became gold.

Basic Reuse and Merging

The authors show countless examples of data reuse. Google refines its search and spell-check from users’ typos—"data exhaust" turned into new products. Amazon uses purchase history for recommendations, while SWIFT, the interbank payment network, sells insights from global money flows to economists. Combining datasets multiplies value: the Danish Cancer Society merged mobile phone records, health statistics, and income data to test links between cell phone use and cancer, uncovering insights invisible in isolation.

Valuing the Priceless

Data’s economic worth is vast yet intangible. When Facebook went public in 2012, its physical assets totaled $6 billion, but its market valuation exceeded $100 billion—nearly all due to the data of its users. Gartner analysts estimated each piece of Facebook content as worth about five cents, translating human interactions into corporate value. The authors predict future balance sheets will list data beside cash and property—a new asset class of the digital economy.

For you, the takeaway is clear: what matters isn’t collecting data for one purpose, but recognizing its endless reuse. Like von Ahn’s ReCaptcha, your information streams may harbor hidden value waiting to be unlocked, reused, or recombined for entirely new ends.


Risks and Ethics in the Age of Big Data

If data is power, can it also oppress? Mayer-Schönberger and Cukier dedicate a sobering chapter to the dark side of big data—privacy violations, predictive punishment, and what they call the “dictatorship of data.” Information abundance, they warn, reshapes not just business but freedom itself.

The Death of Privacy

Personal data collection now exceeds anything imagined in Orwell’s 1984. From Facebook profiles to GPS trails, surveillance is embedded in everyday tools. Even anonymization fails: when AOL and Netflix released supposedly scrubbed datasets, researchers reidentified individuals by cross-analyzing searches and movie ratings. With enough data, identity reassembles itself. The authors note, “Perfect anonymization is impossible.”

Predictive Punishment

The greater threat, however, is predictive justice—punishing not actions but probabilities. Drawing on Philip K. Dick’s Minority Report, they warn that crime prediction models like those used in Memphis and Richmond risk penalizing people based on correlations. Statisticians such as Richard Berk claim they can forecast murders with 75% accuracy using parole data. But acting on such predictions erases free will. You’re no longer judged for what you did, but for what the data predicts you might do.

The Dictatorship of Data

Overreliance on data can blind decision makers. Robert McNamara, obsessed with body counts in Vietnam, mistook numbers for truth, driving futile strategies. Similar faith in metrics today can distort education, business, and governance. Google once tested 41 shades of blue to pick a link color, an emblem of how data-driven overanalysis can crowd out judgment. The authors caution that when “truth” becomes quantitative, we risk mistaking what can be measured for what truly matters.

For you, this chapter is a reminder: data gives power, but responsibility must grow in equal measure. Numbers can illuminate, but they can also deceive or confine. Big data demands not just computation, but conscience.


Controlling Big Data: Accountability and Human Agency

After diagnosing big data’s dangers, Mayer-Schönberger and Cukier propose safeguards—new principles for accountability, justice, and transparency. Just as the printing press birthed freedom of speech laws, the data revolution calls for a similar rethinking of governance.

From Consent to Accountability

Current privacy laws rely on “notice and consent”—you’re told how data will be used and must agree. But that fails when uses evolve. Instead, the authors advocate holding data users accountable for responsible reuse. Firms should assess risks, implement safeguards, and be legally liable for misuse. They also suggest data expiration dates to prevent “permanent memory,” ensuring people can outlive their digital footprints.

Protecting Free Will and Justice

To preserve human agency, society must judge people by actions, not algorithms. Governments should never punish based solely on predictive data. In private contexts—credit, hiring, or healthcare—the authors propose transparency: allowing individuals to see, challenge, or disprove algorithms affecting them. Big data must empower, not dictate—a principle echoing Amartya Sen’s arguments in Development as Freedom that freedom means capability to act, not prediction of behavior.

The Rise of the Algorithmist

Because algorithms may become too complex for ordinary oversight, Mayer-Schönberger and Cukier envision a new profession: algorithmists. These impartial auditors—like financial accountants or ombudsmen—would certify algorithms for fairness, accuracy, and bias. Serving inside companies or as external regulators, they’d ensure transparency in the black box of data-driven decisions.

Their proposal extends to competition, too. To avoid monopolistic “data barons” like Google or Facebook, antitrust principles from the industrial age must adapt to ensure open markets for data exchange. You benefit most from this world when accountability keeps pace with innovation—when we tame data’s power without dimming its potential.
