It's the Data, Stupid!

“Which large language model should we use?” “Which agent is smarter?” These are the questions that fly around corporate meeting rooms these days. Every time a new benchmark score lands, the industry shifts in its seat. But when you look at why AI projects actually fail in the field, the model’s intelligence is rarely the culprit. The problem is almost always the data. In the 1992 U.S. presidential race, Bill Clinton’s strategist James Carville posted a reminder on the campaign-office wall that is remembered today as “It’s the economy, stupid.” Today’s AI industry needs the same kind of sign on its own wall. It’s the data, stupid.

The e-discovery firm TCDI, which applied generative AI tools to its litigation work, reported that most of the disappointing results traced back to messy input data, and it used this same slogan as the title of a blog post. No matter how dazzling the AI features, duplicates and omissions left in the data are enough to ruin the output.

Andrew Ng has long argued that “most AI research is model-centric” and pushed for a shift toward data-centric AI. Academic groups, including ones at MIT, have shown that systematically improving the data often improves performance more than swapping in a new model. In one manufacturing defect-detection case, cycling through several state-of-the-art models produced no improvement, while unifying the labeling guide and gathering edge cases lifted accuracy by more than 16 percentage points. The conclusion, in Ng’s framing, is that the question should not be how to tune the code but how to systematically change the data. The research paradigm has to shift its center of gravity from building better models to figuring out how to produce better data.

The importance of data shows up along three axes. First, quality determines performance. In medical imaging and retail demand forecasting, small but well-curated datasets have repeatedly outperformed much larger but messier ones. In facial recognition, training data skewed toward particular ethnicities has produced serious misidentification, and the social fallout has been severe. Quantity cannot substitute for quality; that is no longer an opinion but an empirical finding. Second, without governance there is no trust. If you cannot trace where the data came from, what the labeling standards were, and where bias entered the pipeline, you have no basis for asking anyone to trust the model’s output. The EU AI Act’s data-governance requirements for high-risk AI systems follow exactly this logic. Third, data quality is a cost question. A 2024 Fivetran survey put the loss from AI built on low-quality data at an average of 6 percent of annual revenue, or about $406 million per company among the large enterprises surveyed. The hours spent comparing model licenses would have paid better returns if poured into data infrastructure instead.

Korea is no exception. Plenty of companies have set up AI centers, but a recurring criticism is that they lack any dedicated team for data integrity. Flashy agent frameworks get deployed while the internal documents those agents are supposed to reference are left with their duplicates, version mismatches, and missing metadata untouched. During the previous big-data wave, the belief that “if you just pile it up, value will appear” produced giant data lakes, and data piled up without a question to answer turned out to hold little value. Without an end-to-end design that locks the problem definition, the data, and the model together, the same failure keeps repeating.
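
The three document problems named above (duplicates, version mismatches, missing metadata) are the kind of thing a small audit script can surface before any agent is deployed. What follows is only a minimal sketch under stated assumptions: the record layout (`id`, `text`, `meta`), the `REQUIRED_FIELDS` list, and the function name `audit_documents` are all hypothetical, and a real corpus would likely need fuzzier duplicate detection than an exact hash of normalized text.

```python
import hashlib
from collections import defaultdict

# Hypothetical metadata fields every internal document is expected to carry.
REQUIRED_FIELDS = ("title", "version", "owner", "updated")


def audit_documents(docs):
    """Report duplicate texts, conflicting versions, and missing metadata."""
    ids_by_hash = defaultdict(list)
    versions_by_title = defaultdict(set)
    missing = {}

    for doc in docs:
        # Collapse whitespace and case so trivially different copies match.
        normalized = " ".join(doc["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        ids_by_hash[digest].append(doc["id"])

        meta = doc.get("meta", {})
        absent = [field for field in REQUIRED_FIELDS if not meta.get(field)]
        if absent:
            missing[doc["id"]] = absent

        if meta.get("title") and meta.get("version"):
            versions_by_title[meta["title"]].add(meta["version"])

    return {
        "duplicates": [ids for ids in ids_by_hash.values() if len(ids) > 1],
        "version_conflicts": {
            title: sorted(versions)
            for title, versions in versions_by_title.items()
            if len(versions) > 1
        },
        "missing_metadata": missing,
    }


# Made-up sample records, for illustration only.
SAMPLE_DOCS = [
    {"id": "a", "text": "Refund policy: 30 days.",
     "meta": {"title": "Refund policy", "version": "1.0",
              "owner": "cs", "updated": "2023-01-05"}},
    {"id": "b", "text": "refund  policy: 30 days.",
     "meta": {"title": "Refund policy", "version": "1.0",
              "owner": "cs", "updated": "2023-01-05"}},
    {"id": "c", "text": "Refund policy: 14 days.",
     "meta": {"title": "Refund policy", "version": "2.0", "owner": "cs"}},
]

report = audit_documents(SAMPLE_DOCS)
print(report)
```

On the sample records, documents `a` and `b` are flagged as duplicates, the title "Refund policy" shows two coexisting versions, and document `c` is missing its `updated` field. The point of the sketch is that these checks are cheap compared with the cost of an agent quoting the wrong version.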

The answer is the data-centric loop. Analyze the model’s weaknesses, engineer data that addresses them, retrain, and analyze again. A three-way collaboration sits at the heart of this loop: domain experts design the labeling standards, data engineers build the pipelines, and AI engineers train the models. The job doesn’t end with a one-time dataset; the process of continuously upgrading it has to be managed as an operational asset. How fast and how accurately a company can turn this loop will define the next round of AI competitiveness.
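
The loop described above can be sketched in a few lines. This is a toy illustration, not anyone's production method: the "model" is just a majority-label lookup, "engineering the data" is reduced to relabeling one slice with a unified guideline, and every name here (`train`, `error_rate_by_slice`, `relabel_slice`, `data_centric_loop`, the `guideline` dictionary standing in for the domain experts' labeling standard) is an assumption made for the example.

```python
from collections import Counter, defaultdict


def train(examples):
    """Toy model: predict the majority training label for each feature value."""
    label_counts = defaultdict(Counter)
    for ex in examples:
        label_counts[ex["feature"]][ex["label"]] += 1
    return {f: counts.most_common(1)[0][0] for f, counts in label_counts.items()}


def error_rate_by_slice(model, validation):
    """Analyze weaknesses: error rate per feature value on trusted validation data."""
    totals, wrong = Counter(), Counter()
    for ex in validation:
        totals[ex["feature"]] += 1
        if model.get(ex["feature"]) != ex["label"]:
            wrong[ex["feature"]] += 1
    return {f: wrong[f] / totals[f] for f in totals}


def relabel_slice(examples, feature, guideline):
    """Engineer data: apply the unified labeling guideline to one weak slice."""
    return [
        {**ex, "label": guideline[feature]} if ex["feature"] == feature else ex
        for ex in examples
    ]


def data_centric_loop(examples, validation, guideline, max_rounds=5):
    """Train, find the worst slice, fix its data, and retrain until clean."""
    history = []
    model = {}
    for _ in range(max_rounds):
        model = train(examples)
        errors = error_rate_by_slice(model, validation)
        correct = sum(model.get(ex["feature"]) == ex["label"] for ex in validation)
        history.append(correct / len(validation))
        worst = max(sorted(errors), key=errors.get)  # sorted() makes ties deterministic
        if errors[worst] == 0:
            break
        examples = relabel_slice(examples, worst, guideline)
    return model, history


# Made-up data: inconsistent training labels, a small trusted validation set.
TRAINING = (
    [{"feature": "scratch", "label": l} for l in ("defect", "ok", "ok")]
    + [{"feature": "dent", "label": l} for l in ("defect", "defect")]
    + [{"feature": "clean", "label": l} for l in ("ok", "ok", "defect", "defect", "defect")]
)
VALIDATION = [
    {"feature": "scratch", "label": "defect"},
    {"feature": "dent", "label": "defect"},
    {"feature": "clean", "label": "ok"},
]
GUIDELINE = {"scratch": "defect", "dent": "defect", "clean": "ok"}

final_model, accuracy_history = data_centric_loop(TRAINING, VALIDATION, GUIDELINE)
print(accuracy_history)
```

Note that the model code never changes between rounds; only the data does. In this toy run, accuracy on the validation set climbs with each round of relabeling, which mirrors the argument that the loop, rather than the model choice, is where the gains come from.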

If your next meeting is about “which model is winning,” pause for a moment and ask instead: “Do we have data we can trust?” Hand a top chef rotten ingredients and you won’t get a good dish; hand a top model dirty data and you won’t get a good result. It’s the data, stupid.