LLM generated posts

Yellow Palace

Does the board have an official view on posts created wholly or substantially by use of large language models? I've just seen one such post, and while my instinct is to report it as having no more value (and possibly less) than a link to Google, I don't know that it's actually against any board policy.
 
From the thread that prompted @Yellow Palace to reply here:
<snip>
I do not think we should continue on this path, unless AI starts listing its sources explicitly.
In the linked reply, it is immediately clear that it is AI-generated gibberish. In other cases, it will be harder to judge.
It occurred to me that AI always listing its sources would lead to demands for payment from the sources' authors.
 
The demands might fall on deaf ears. It depends on how the AI company makes money. If using the AI is offered as a paid commodity, a judge could decide authors should share in the profits.

Cue lawyers, with individual authors against Big Tech.
 
Hi,
One of my biggest concerns about AI-generated posts is that I'm not convinced that current AI can, or will be able to, correctly differentiate between real-world information and "what-if", speculative, or even fictional information. It may end up conflating non-real-world information into a discussion of real-world items, without any indication that the information may have come from a "what-if", speculative, or even completely made-up source.

Regards

Pat
 
I agree. Use of such tools also breeds mental laziness - having to do your own research and reading forces one to put some effort into an endeavour and take the time to truly learn/appreciate the subject. Just using AI to generate something does not.
 
The first LLM post to spontaneously appear when I googled something said: "Last year the budget increased from $3.9Bn to $3.8Bn". The next one confused two similarly named, but different, drugs (one of which happens to be safety critical).

I'm not convinced they're quite at the level we'd really prefer just yet.
 
Quoting a source means that you are acknowledging their ownership of the material, and quoting them is explicitly called out in the law as fair use.

If the material is being used for commercial purposes - as almost any use of an LLM would be - it is far less likely to be found to be fair use by a court. The purpose and context of the use are central to the fair use doctrine. For example, use of content for:

- Nonprofit use
- Teaching or education
- Political / social commentary or satire (i.e. discussing the social impact of the protected work in the press)

would be things that favor a finding of fair use.

The nature of the protected work can be a factor. If you were to write a book about, say, a ship, and that book was filled with factual information, a case could be made that those facts - your presentation of them and your expression - are open to "fair use", even if researching, verifying, and publishing those facts was done at considerable expense. But if the book was mostly fiction or a "creative" work, it would be less likely to be open to claims of "fair use".

How the protected work is used - and how much of it - is also a factor. Quoting individual lines or paragraphs from a published work, with attribution, would generally be found to be fair use (though the author can make a case for permission or attribution, and can revoke an assumed permission). Using the entire protected work, however, is generally not going to be considered fair use.

And here is where we start running into problems!

LLMs are trained on corpora of data. They regurgitate outputs that are the product of statistical models built from that training. There are good arguments that the LLM is using the complete protected work, and most LLMs are doing so for a commercial purpose without the permission of the copyright holder. And in many cases, they are doing it at the expense of the copyright holder: LLM training software scours the internet for content, downloads it, and then effectively repackages and resells it as "output".

Let's say an LLM uses one of my posts on the forum as training data, and I know this - for example, if it quotes or attributes me. Through various means I can request or force the LLM owner to remove my content from their training data. Doing so, though, is very difficult for the LLM owner, and in most cases would mean re-training their LLM from scratch without my content.

That gets expensive fast.


And as far as citing sources goes, LLMs in general have no special handling for citations or quotes. They have no idea what an APA citation, footnote, link, etc. looks like, nor any context for it. Links are all treated the same: an LLM has no way to recognise that a link is some kind of source or attribution in context, nor any way to validate that the link is what it says it is. That includes its own output, something I have seen frequently with LLMs. Their citations/sources are Wikipedia-quality (i.e. made up).
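To make that concrete, here is a rough illustration using tiktoken, one publicly available tokenizer - the citation and URL below are invented, and whether any particular chatbot bolts extra citation handling on top of the model is system-specific. The point is just that, at the level the model actually sees, a citation is ordinary text.

```python
# Rough, hypothetical illustration: how one common subword tokenizer
# (tiktoken's cl100k_base encoding) sees a citation. The reference and
# URL are made up. Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

citation = 'See Smith (2021), https://example.com/paper.pdf'
token_ids = enc.encode(citation)

# The citation decomposes into ordinary text fragments, exactly as prose
# does. No token type says "this is a source" or "this link was verified".
print([enc.decode([t]) for t in token_ids])
```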
 
Right, but here I am assuming that the LLM is not trained using those sources but is instead trawling sources after you ask the question.

That would be fair use, it's the same as having your human research assistant do the work, just a lot faster.
 

Nope, LLMs do not trawl the sources after you have asked the question.

LLMs consume content first, train a model of the content, and that model is used to generate the output. The original content is in that model in some form.
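A toy sketch of that two-phase shape - this is a deliberately tiny bigram model, nothing like a production LLM, and the one-line corpus is invented - but it shows why the sources aren't consulted at question time:

```python
# Toy two-phase "language model": train first, generate later.
# A deliberately tiny bigram model, not a real LLM, but it shows the
# shape described above: generation never goes back to the sources,
# because the training text has already been baked into the model.
import random
from collections import defaultdict

corpus = "the ship sailed and the ship sank and the crew swam"  # invented stand-in for scraped content

# Phase 1: training. The content is consumed once, here, and reduced to
# statistics (which word follows which). The corpus is not needed afterwards.
model = defaultdict(list)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    model[prev].append(nxt)

# Phase 2: generation. Output comes purely from the trained model;
# no source is consulted at question time.
word = "the"
output = [word]
for _ in range(8):
    if word not in model:
        break
    word = random.choice(model[word])
    output.append(word)
print(" ".join(output))
```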
 
Okay, then I agree that there's a significant copyright issue with LLMs, one we do NOT need to be encouraging here.
 

LLMs are also fed quantity, not quality. They need a LOT of data to build a model, and the information they consume as training data is not curated for quality.

An LLM's output will only be as “good” as the most popular/prevalent information on a topic available in its training data.

So within a year or two, many LLMs will end up spitting out conspiracy theories they found on the internet.
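A crude sketch of why prevalence wins - the "corpus" below is invented, and real LLMs are vastly more complicated, but the failure mode is the same: a model that learns continuation frequencies prefers whatever appears most often, true or not.

```python
# Crude illustration of "only as good as the most prevalent information".
# The training data below is invented: the wrong claim simply outnumbers
# the right one, so a frequency-based model prefers the wrong answer.
from collections import Counter

training_data = (
    ["the moon landing was faked"] * 9   # popular but wrong
    + ["the moon landing was real"]      # correct but rare
)

# "Train": count how each sentence continues after a shared prefix.
prefix = "the moon landing was "
continuations = Counter(s[len(prefix):] for s in training_data)

# "Generate": a greedy model emits the most frequent continuation.
print(prefix + continuations.most_common(1)[0][0])  # -> "... was faked"
```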
 
