chebe | Got scraped (Reply)

I forgot to post about this at the time. The Washington Post did an article on what went into the datasets that were used to train LLMs. They even have a search box for Google's C4 dataset. This here blog is in there. Rank: 865,439, tokens: 26k, percent of all tokens: 0.00002%. My ramblings are in the machine. I wonder if those echoes will last longer than I will? I've posted, I suspect exclusively, under two licenses: Creative Commons by-attribution non-commercial, and the default, at least for me as a European, all rights reserved. Neither of them has been respected. Not that there is anything to be done. But in case this ends up getting scraped too: I object to my data / blogs / websites being used without my informed consent.