The Middle Path: A Practical Framework for AI Training and Copyright

What if AI companies could legally train on copyrighted data while creators still got paid? Here's a framework that might actually work for both sides.

You know that feeling when you’re watching two groups argue past each other, and you can see a solution that neither side is considering? That’s exactly where I found myself in the AI copyright debate.

On one side: “AI companies are stealing our work!” On the other: “We’re just teaching machines to read!” And here I am, an AI engineer, thinking: what if they’re both right?

The Problem That Broke My Brain

Last week, the Thomson Reuters v. Ross Intelligence decision dropped like a bomb in the AI community. The court essentially said: training an AI on copyrighted data without permission to build a competing product? Not fair use. Full stop.

My first reaction was panic. As someone who’s trained models, I know we’ve all used scraped web data. The entire field seemed built on shaky legal ground. But then I had a conversation that completely reframed how I see this issue.

The Reading Rights Paradox

Here’s where things get philosophically weird. When I train an LLM on text, what’s actually happening? The model isn’t storing copies of books in some digital filing cabinet. It’s extracting patterns, relationships, concepts - kind of like how your brain processes information when you read.

Think about it this way: If I read 1,000 cookbooks and then create my own recipes, have I “copied” those books? My brain has been fundamentally shaped by that knowledge, but I’m not reproducing pages verbatim.

But wait, you might be thinking, “That’s different! You’re human!”

And you’re right. The scale changes everything.

Where the Analogy Breaks Down

An LLM doesn’t just read one book - it ingests millions. And unlike me struggling to remember where I read that great pasta recipe, an LLM can effectively reconstruct passages with the right prompting. It’s less like a student reading textbooks and more like a photocopier with an attitude.

This is where my engineering brain kicked in. The problem isn’t that AI learns from copyrighted content. The problem is when it regurgitates that content without attribution or compensation.

The False Dilemma We’ve Created

The current debate feels like we’re stuck between two extremes:

  1. Ban all training on copyrighted work → Innovation grinds to a halt
  2. Allow unlimited training → Creators get screwed

But what if there’s a third path?

A Framework That Actually Makes Sense

After days of back-and-forth discussions (and yes, arguing with myself in the shower), here’s the framework I’ve landed on:

The Core Principle: “If You Profit, Contributors Profit”

It’s stupidly simple: If an AI company makes money using models trained on copyrighted data, the copyright holders should get a proportional cut. But here’s where it gets interesting…

The Two-Path System

Path 1: Pay and Play

  • AI companies can train on copyrighted content
  • They pay licensing fees proportional to their profits (a rough sketch of the math follows after this list)
  • They must implement verified safeguards against reproduction
  • They share revenue when their AI drives traffic away from original sources
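
To make “proportional” concrete, here’s a minimal sketch of what a Path 1 payout calculation might look like. Everything in it - the royalty rate, the revenue figure, the attribution weights, even the `license_payouts` name - is a hypothetical assumption for illustration, not a real licensing scheme:

```python
# Illustrative sketch of a Path 1 ("pay and play") revenue split.
# Every number, name, and weight here is a made-up assumption.

def license_payouts(ai_revenue, royalty_rate, attribution_weights):
    """Split a royalty pool across rights holders.

    ai_revenue: revenue attributable to the trained model
    royalty_rate: fraction of that revenue owed to rights holders
    attribution_weights: rights holder -> estimated share of the model's
        value derived from their content (weights should sum to 1)
    """
    pool = ai_revenue * royalty_rate
    return {holder: round(pool * weight, 2)
            for holder, weight in attribution_weights.items()}

payouts = license_payouts(
    ai_revenue=10_000_000,   # hypothetical annual model revenue
    royalty_rate=0.05,       # hypothetical 5% royalty pool
    attribution_weights={
        "news_publisher": 0.40,
        "book_publisher": 0.35,
        "web_forums": 0.25,
    },
)
print(payouts)
# {'news_publisher': 200000.0, 'book_publisher': 175000.0, 'web_forums': 125000.0}
```

The whole scheme stands or falls on those attribution weights - which is exactly the third hard question I get to below.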

Path 2: Open Source Everything

  • Can’t afford licensing? Open source your model
  • Full transparency about training data required
  • Community benefits from the technology
  • Creates competition that pushes even big companies toward openness

The Technical Safeguards (This Is Where It Gets Nerdy)

Here’s where my engineering background comes in handy. We can use reinforcement learning to actively punish models that reproduce copyrighted content. Think of it like training a dog - every time the model spits out something too close to its training data, it gets a negative reward signal.
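
Here’s a minimal sketch of what that penalty could look like, assuming an RLHF-style loop where we can shape the reward signal directly. The n-gram overlap heuristic, the 8-word shingle size, and the 0.2 threshold are all illustrative stand-ins - a real memorization detector would be far more sophisticated:

```python
# Sketch of a reward penalty for near-verbatim reproduction, assuming
# an RLHF-style loop where we can shape the reward signal directly.
# The shingle size (n=8) and threshold (0.2) are illustrative only.

def ngram_set(text, n=8):
    """All n-word shingles in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def reproduction_penalty(output, training_docs, n=8, threshold=0.2):
    """Fraction of the output's n-grams that also appear in training data.

    Returns 0.0 below the threshold, scaling toward 1.0 as the output
    approaches verbatim copying.
    """
    out_ngrams = ngram_set(output, n)
    if not out_ngrams:
        return 0.0
    train_ngrams = set().union(*(ngram_set(doc, n) for doc in training_docs))
    overlap = len(out_ngrams & train_ngrams) / len(out_ngrams)
    return overlap if overlap >= threshold else 0.0

def shaped_reward(base_reward, output, training_docs, penalty_weight=2.0):
    """The dog-training part: copying subtracts from the reward."""
    return base_reward - penalty_weight * reproduction_penalty(output, training_docs)
```

Notice how much policy is hiding in that hard-coded `threshold=0.2`.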

But who decides what “too close” means?

The Committee Solution

Imagine a government committee (stay with me here) that:

  • Has equal representation from tech companies and content creators
  • Sets standards based on peer-reviewed research
  • Updates policies as fast as AI develops
  • Reports to legislative bodies for accountability

“But committees are slow!” you’re thinking. That’s why this one would operate more like the FDA’s Emergency Use Authorization process - fast when needed, careful when it counts.

The Reality Check Moment

As I was developing this framework, I realized something that almost killed the whole idea: with modern agentic AI, this might all be moot.

Today’s AI can browse the web in real-time, read content directly, and provide summaries without ever “training” on the data. It’s like trying to regulate photocopiers when everyone has a camera phone.

But that’s exactly why we need this framework now - to establish principles before the technology outpaces any possible regulation.

Why This Could Actually Work

The beauty of this system is that it aligns incentives:

For AI companies:

  • Legal clarity instead of lawsuit roulette
  • Access to high-quality training data
  • Competitive pressure to innovate or open-source

For content creators:

  • Compensation for their work
  • Attribution and traffic back to sources
  • Protection against wholesale replacement

For society:

  • Continued AI innovation
  • Open-source alternatives for researchers
  • Knowledge remains accessible

The Philosophical Shift We Need

Here’s my hot take: we need to stop thinking about copyright in the digital age like we did in the print age. Knowledge wants to be free, but creators need to eat. This framework tries to square that circle.

Think about how academic computer science mostly works - papers go up on arXiv as free preprints, yet researchers still benefit through citations, reputation, and career advancement. What if we could create similar alternative reward systems for other types of content?

The Hard Questions This Raises

I’ll be honest - this framework isn’t perfect. Here are the thorny issues:

  1. Who sits on the committee? Too many tech people, and creators get screwed. Too many publishers, and innovation dies.

  2. What do “verified safeguards” really mean? If a model is 95% compliant but your article is in the 5%, are you just out of luck?

  3. How do we track contribution to profit? If ChatGPT helps someone write code, what percentage goes to Stack Overflow vs GitHub vs tutorial sites?

One Last Thought

The current copyright war feels like we’re trying to fit a square peg (AI capabilities) into a round hole (20th-century IP law). Instead of fighting over who’s right, maybe we need to redesign the hole.

This framework isn’t about choosing sides. It’s about recognizing that both innovation and creation have value, and our legal systems need to evolve to protect both.

What’s your take? Are we overthinking this, or is there something to this middle path approach? Have you seen other frameworks that could work better?

Drop a comment below - I’m genuinely curious how others in the field are thinking about this. Because if we don’t figure this out soon, the courts will do it for us, and nobody wants that.


P.S. - Yes, I used AI to help edit this post. Yes, I made sure it didn’t reproduce any copyrighted content. Yes, I see the irony. That’s exactly why we need to solve this.