Welcome to the MacNN Forums.

Demonhood · Apr 20, 2023, 01:36 PM

https://www.washingtonpost.com/techn...tbot-learning/

The gist: The Washington Post examined a very large dataset to see what some AI bots are being trained on.

The surprise? MacNN is a very high #50,517!! (out of 15 million)
AppleInsider is a completely nuts #804

Is your personal website on there?

Laminar · Apr 20, 2023, 04:52 PM

Does the AI prefer sources higher up in the dataset? Or is it just representative of how much data the source contains? If they ranked by "tokens," is the ranking indicative of quality or just quantity?

Training an AI based on a Reddit or forum comment chain would be extremely useful for learning how to direct online conversation.

Thorzdad · Apr 21, 2023, 09:24 AM

Demonhood, could you possibly post the direct link to the C4 dataset in the WaPo story? I'm getting paywalled trying to read the story.

Demonhood · Apr 21, 2023, 03:46 PM

This is what they link to: https://www.semanticscholar.org/pape...#citing-papers