LLM generated posts

Yellow Palace

Does the board have an official view on posts created wholly or substantially by use of large language models? I've just seen one such post, and while my instinct is to report it as having no more value (and possibly less) than a link to Google, I don't know that it's actually against any board policy.
 
From the thread that prompted @Yellow Palace to reply here:
<snip>
I do not think we should continue on this path, unless AI starts listing its sources explicitly.
In the linked reply, it is immediately clear that it is AI-generated gibberish. In other cases, it will be harder to judge.
It occurred to me that AI always listing its sources would lead to demands for payment from the sources' authors.
 
The demands might fall on deaf ears. It depends on how the AI company makes money. If using the AI is offered as a paid commodity, a judge could decide authors should share in the profits.

Cue lawyers, with individual authors against Big Tech.
 
Hi,
One of my biggest concerns about AI-generated posts is that I'm not convinced that current AI can, or will be able to, correctly differentiate between real-world information and "what-if", speculative, or even fictional information. It may end up conflating non-real-world information into a discussion of real-world items, without any indication that the information may have come from a "what-if", speculative, or even completely made-up source.

Regards

Pat
 
I agree. Use of such tools also breeds mental laziness - having to do your own research and reading forces one to put some effort into an endeavour and take the time to truly learn/appreciate the subject. Just using AI to generate something does not.
 
The first LLM post to spontaneously appear when I googled something said: "Last year the budget increased from $3.9Bn to $3.8Bn". The next one confused two similarly named, but different, drugs (one of which happens to be safety critical).

I'm not convinced they're quite at the level we'd really prefer just yet.
 
Quoting a source means that you are acknowledging their ownership of the material, and quoting them is explicitly called out in the law as fair use.

If the material is being used for commercial purposes - as almost any use of an LLM would be - it is far less likely to be found to be fair use by a court. The purpose and context of the use are central to the fair use doctrine. For example, use of content for:

- Nonprofit use
- Teaching or education
- Political / social commentary or satire (i.e. discussing the social impact of the protected work in the press)

would be things that favor a finding of fair use.

The nature of the protected work can be a factor. If you were to write a book about, say, a ship, and that book was filled with factual information, a case could be made that those facts - your presentation of them and your expression - are open to "fair use", even if researching, verifying, and publishing those facts was done at considerable expense. But if the book was mostly fiction or a "creative" work, it would be less likely to be open to claims of "fair use".

How the protected work is used - and how much of it - is also a factor. Quoting individual lines or paragraphs from a published work, with attribution, would generally be found to be fair use (though the author can make a case for permission or attribution, and can revoke an assumed permission). Using the entire protected work, however, is generally not going to be considered fair use.

And here is where we start running into problems!

LLMs are trained on corpora of data. They regurgitate outputs that are the product of statistical models built from that training. There are good arguments that the LLM is using the complete protected work, and most LLMs are doing so for a commercial purpose without the permission of the copyright holder. And in many cases, they are doing it at the expense of the copyright holder: LLM training software scours the internet for content, downloads it, and then effectively repackages and resells it as "output".

Let's say an LLM uses one of my posts on the forum as training data, and I know this - for example, if it quotes or attributes me. Through various means I can request or force the LLM owner to remove my content from their training data. Doing so, though, is very difficult for the LLM owner, and in most cases would mean re-training their LLM from scratch without my content.

That gets expensive fast.


And as far as citing sources goes, LLMs in general have no special handling for citations or quotes. They have no idea what an APA citation, footnote, link, etc. looks like, nor any context for it. Links are all treated the same: an LLM has no way to recognise that a link is some kind of source or attribution in context, nor any way to validate that the link is what it says it is. That includes its own output, something I have seen frequently with LLMs. Their citations/sources are Wikipedia-quality (i.e. made up).
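To make that concrete, here is a rough illustration using tiktoken, one publicly available tokenizer - the citation and URL below are invented, and whether any particular chatbot bolts extra citation handling on top of the model is system-specific. The point is just that, at the level the model actually sees, a citation is ordinary text.

```python
# Rough, hypothetical illustration: how one common subword tokenizer
# (tiktoken's cl100k_base encoding) sees a citation. The reference and
# URL are made up. Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

citation = 'See Smith (2021), https://example.com/paper.pdf'
token_ids = enc.encode(citation)

# The citation decomposes into ordinary text fragments, exactly as prose
# does. No token type says "this is a source" or "this link was verified".
print([enc.decode([t]) for t in token_ids])
```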
 
Right, but here I am assuming that the LLM is not trained using those sources but is instead trawling sources after you ask the question.

That would be fair use, it's the same as having your human research assistant do the work, just a lot faster.
 

Nope, LLMs do not trawl the sources after you have asked the question.

LLMs consume content first, train a model of the content, and that model is used to generate the output. The original content is in that model in some form.
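A toy sketch of that two-phase shape - this is a deliberately tiny bigram model, nothing like a production LLM, and the one-line corpus is invented - but it shows why the sources aren't consulted at question time:

```python
# Toy two-phase "language model": train first, generate later.
# A deliberately tiny bigram model, not a real LLM, but it shows the
# shape described above: generation never goes back to the sources,
# because the training text has already been baked into the model.
import random
from collections import defaultdict

corpus = "the ship sailed and the ship sank and the crew swam"  # invented stand-in for scraped content

# Phase 1: training. The content is consumed once, here, and reduced to
# statistics (which word follows which). The corpus is not needed afterwards.
model = defaultdict(list)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    model[prev].append(nxt)

# Phase 2: generation. Output comes purely from the trained model;
# no source is consulted at question time.
word = "the"
output = [word]
for _ in range(8):
    if word not in model:
        break
    word = random.choice(model[word])
    output.append(word)
print(" ".join(output))
```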
 
Okay, then I agree that there's a significant copyright issue with LLMs, one we do NOT need to be encouraging here.
 

LLMs are also fed quantity, not quality. They need a LOT of data to build a model, and the information they consume as training data is not curated for quality.

An LLM's output will only be as “good” as the most popular/prevalent information on a topic available in its training data.

So within a year or two, many LLMs will end up spitting out conspiracy theories they found on the internet.
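A crude sketch of why prevalence wins - the "corpus" below is invented, and real LLMs are vastly more complicated, but the failure mode is the same: a model that learns continuation frequencies prefers whatever appears most often, true or not.

```python
# Crude illustration of "only as good as the most prevalent information".
# The training data below is invented: the wrong claim simply outnumbers
# the right one, so a frequency-based model prefers the wrong answer.
from collections import Counter

training_data = (
    ["the moon landing was faked"] * 9   # popular but wrong
    + ["the moon landing was real"]      # correct but rare
)

# "Train": count how each sentence continues after a shared prefix.
prefix = "the moon landing was "
continuations = Counter(s[len(prefix):] for s in training_data)

# "Generate": a greedy model emits the most frequent continuation.
print(prefix + continuations.most_common(1)[0][0])  # -> "... was faked"
```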
 
