All Your Base Are Belong to LLM

The output from an LLM is a derivative work of the data used to train the LLM.

If we fail to recognise this, or are unable to uphold it in law, copyright (and copyleft, which depends on it) is dead. Copyright will still be used against us by corporations, but its utility to FOSS in preserving freedom is gone.

LLMs can produce (and have produced) verbatim copies of significant and identifiable parts of their training set, demonstrating that what they produce is a derivative work. If that work is not held to be subject to the licences and copyrights of the training data, then we've lost. In a world where this kind of copyright washing is accepted, we may as well release our works into the public domain, as we no longer have any protection. Bye bye OSI - our licenses are useless!

If a human LLM operator is not aware of the copyrights of the data used to produce the output, and cannot compare how closely the output matches a particular input, they cannot comply with license or attribution requirements.

Either that is a violation of the license, or copyright is dead.

If we absolve them of their obligations, copyright is dead. I'm cool with that. Let's kill off patents and all other IP protections while we're at it and we're golden. Require that all source code be made available to anyone using any piece of software and I think we're done here.

However, until corporations can no longer use state violence to threaten individuals over copyrights, I'd quite like to keep the one thing it does for us: allow copyleft. It's a small thing, but I'm fond of it.

2024-04-25