New instances (and not just Lemmy instances, but Mastodon and other fediverse instances) are coming online all the time, so you need a way to let them through to start the federation process. There are thousands, so it needs to be automatic. You can’t require a new instance to send whitelisting requests to every server one of its users might want to interact with (instances aren’t linked unless a local user subscribes to something on a remote instance).
Given the AI bots seem to just be indiscriminately scraping web pages, I excluded API endpoints from blocking anyway. Another admin showed me a nice Cloudflare rule to do this. Media can still be a problem, though: it’s individual users on other instances that are loading it, so it’s hard to block scrapers without blocking users. That’s another way Cloudflare helps (static media files are easily cached by their CDN).
you need a way to let them through to start the federation process.
Isn’t this done via an API endpoint explicitly for that purpose, one that bots would not normally use?
And why not have a process by which admins from a new instance poke the admins of another instance - any other instance, so long as it’s already a part of the network - to do an initial manual whitelist that could cascade through the entire system?
Then there should be ways that the software itself can auth with other instances of itself, via a common encryption protocol. This would only work between like software, but the key point is that only a toehold is needed to start propagating.
The point being, there are options. Some of them quite simple.
Realistically, federation is not the main concern. You can leave all your API endpoints open to bots and not have a problem, because the bots are loading the web app rather than the API. Just block the web app for suspicious traffic.
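To make that concrete, here is a rough sketch of the decision in Python. It is not the actual Cloudflare rule mentioned earlier; the path prefixes are my assumptions about a typical Lemmy-style layout, and `suspicious` is a stand-in for whatever bot score or heuristic you already have.

```python
from dataclasses import dataclass

# Hypothetical path prefixes; adjust to whatever your instance actually serves.
OPEN_PREFIXES = ("/api/", "/inbox", "/.well-known/", "/nodeinfo/")
MEDIA_PREFIXES = ("/pictrs/",)


@dataclass
class Request:
    path: str
    suspicious: bool  # stand-in for a bot score or similar heuristic


def decide(req: Request) -> str:
    """Return 'allow', 'cache' or 'challenge' for a request."""
    # API and federation endpoints stay open so new instances can get through.
    if req.path.startswith(OPEN_PREFIXES) or req.path.endswith("/inbox"):
        return "allow"
    # Media is fetched by real users on other instances: serve it from cache
    # rather than trying to tell users and scrapers apart.
    if req.path.startswith(MEDIA_PREFIXES):
        return "cache"
    # Everything else is the web app: only challenge traffic that looks bad.
    return "challenge" if req.suspicious else "allow"
```

With this, `decide(Request("/api/v3/post/list", True))` is allowed even for a suspicious client, while the same client hitting `/post/123` gets challenged.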
ActivityPub already uses authentication to some extent with other instances. It’s the first contact where you have to have trust.
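A small sketch of what "first contact" looks like in practice, assuming the sending instance signs its deliveries with an HTTP `Signature` header carrying a `keyId` URL (as common fediverse software does). The function names and the `known_hosts` set are hypothetical, and this only parses the header; it does not verify the signature.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches name="value" pairs inside a Signature header.
_PARAM = re.compile(r'(\w+)="([^"]*)"')


def signature_key_host(header: str) -> Optional[str]:
    """Return the hostname in the header's keyId, or None if there is none."""
    params = dict(_PARAM.findall(header))
    key_id = params.get("keyId")
    if not key_id:
        return None
    return urlparse(key_id).hostname


def is_first_contact(header: str, known_hosts: set) -> bool:
    """True if the signing host is one we have not federated with before."""
    host = signature_key_host(header)
    return host is not None and host not in known_hosts
```

For a known host you can check the signature against a key you already hold. For a first contact you have to fetch the key from the very host making the claim, which is where the trust comes in.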
My main concern is still that media is loaded directly by users in most cases. The APIs are not a problem right now, as the bots aren’t specifically targeting Lemmy. There are ways to address this, but Lemmy (and other threadiverse services) don’t have full-time dev teams; they work on what they can or want to work on given the very low hourly rate.