• Dave@lemmy.nz
    12 upvotes · 3 days ago

    Cloudflare’s bot detection triggers the blocking because federation looks a lot like a bot (well, it is a bot).

    For example, Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It’s telling my instance about every post, comment, or vote. AI scrapers also send hundreds of thousands or millions of requests a day in a near steady stream.

    For all intents and purposes, federation is bot traffic and looks just like it. Typically I block by identifying high-traffic ASNs (an ASN is a group of IPs run by the same entity; blackhat AI scrapers use many IPs) and showing a Cloudflare challenge (which will typically have a 0% pass rate). If the traffic is from 1 IP then it’s probably a federated instance, but I typically see many IPs from the same area with requests spread evenly across them.
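The ASN heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the rule actually running on any instance: the function name, the `(asn, ip)` input shape, and all three thresholds are assumptions chosen for the example.

```python
from collections import defaultdict

def flag_suspicious_asns(requests, min_requests=1000, min_ips=20, max_share=0.2):
    """Return ASNs whose traffic is high-volume and spread evenly over many IPs.

    `requests` is an iterable of (asn, ip) pairs, one per request.
    All thresholds are illustrative assumptions, not tuned values.
    """
    per_asn = defaultdict(lambda: defaultdict(int))
    for asn, ip in requests:
        per_asn[asn][ip] += 1

    flagged = []
    for asn, ip_counts in per_asn.items():
        total = sum(ip_counts.values())
        if total < min_requests:
            continue  # low volume: not worth challenging
        if len(ip_counts) < min_ips:
            continue  # few IPs: more likely a federated instance
        # Even spread: no single IP accounts for a large share of the traffic.
        if max(ip_counts.values()) / total <= max_share:
            flagged.append(asn)
    return sorted(flagged)
```

A single busy IP (the likely shape of a federated instance) is skipped by the `min_ips` check, while many IPs with an even request spread are flagged for a challenge.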

    I also try to exclude federation/API endpoints, which can help stop false positives as scrapers are generally loading the web page.
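As a rough sketch of what excluding federation/API endpoints could look like, the check below decides whether a request should skip the bot challenge. The path prefixes and the `Accept` header test are assumptions for illustration and would need checking against the routes a given instance really exposes.

```python
# Path prefixes treated as federation/API traffic. These are assumptions
# for illustration; check them against your own instance's routes.
EXEMPT_PREFIXES = ("/api/", "/inbox", "/.well-known/", "/nodeinfo/")
ACTIVITYPUB_TYPES = ("application/activity+json", "application/ld+json")

def is_exempt(path, accept_header=""):
    """True if the request should bypass the bot challenge."""
    if path.startswith(EXEMPT_PREFIXES):
        return True
    # Per-actor inboxes such as /c/<community>/inbox or /u/<user>/inbox.
    if path.endswith("/inbox"):
        return True
    # Federation fetches ask for ActivityPub JSON rather than HTML.
    return any(t in accept_header for t in ACTIVITYPUB_TYPES)
```

The header check is easy for a scraper to fake, so it only helps while the scrapers are loading ordinary web pages and not targeting federation traffic specifically.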

    This is something Lemmy (and PieFed, Mbin) admins try to help each other with by sharing strategies, because one day a bot will find you and suddenly your instance is down because it is hammering you too hard.

    I bet if you are in China, Brazil, Singapore, Argentina, etc then you will see a lot of blocked content on Lemmy, as this is often where the bot traffic comes from (Google, Facebook, OpenAI, Amazon, etc will typically respect the robots.txt so US traffic is less of an issue).

    • rekabis@lemmy.ca
      2 upvotes · 2 days ago

      Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It’s telling my instance about every post, comment, or vote.

      And yet, federation means that each instance should know all the other domain names, yes? So do daily DNS lookups of all IP addresses associated with federation and auto-whitelist them.

      Sure, if you then have to configure Cloudflare with these IPs, it’ll require an API to do so automatically.

      But otherwise if you are running some sort of throttling protection on the actual box or VM the instance is sitting on, it should be rather trivial to update it directly, especially if said throttling software is doing Linux correctly and drawing its whitelist from a flat file.
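A minimal sketch of the daily lookup and flat-file idea, using only the Python standard library. The function names and the `IP  # domain` line format are assumptions; real throttling software will expect its own format. It resolves inbound addresses only, which may differ from the addresses an instance sends from when a CDN sits in front of it.

```python
import socket

def resolve_ips(domain):
    """Return the sorted set of IP addresses a domain currently resolves to."""
    try:
        infos = socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # unresolvable right now: skip it until the next run
    return sorted({info[4][0] for info in infos})

def build_allowlist(domains, resolver=resolve_ips):
    """Resolve each federated domain and return one allowlist line per IP."""
    lines = []
    for domain in sorted(set(domains)):
        for ip in resolver(domain):
            lines.append(f"{ip}  # {domain}")
    return lines

def write_allowlist(path, domains, resolver=resolve_ips):
    """Write one 'IP  # domain' entry per line to a flat file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(build_allowlist(domains, resolver)) + "\n")
```

Run from a daily cron job, `write_allowlist` would regenerate the file from the list of known federated domains; the `resolver` parameter exists so the logic can be tested without touching DNS.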

      • Dave@lemmy.nz
        1 upvote · 2 days ago

        New instances (and not just Lemmy instances, but Mastodon and other fediverse instances) are coming online all the time, so you need a way to let them through to start the federation process. There are thousands, so it needs to be automatic. You can’t require that a new instance send whitelisting requests to every server one of its users might want to interact with (instances aren’t linked unless a local user subscribes to something on a remote instance).

        Given the AI bots seem to just be indiscriminately scraping web pages, I excluded API endpoints from blocking anyway. Another admin showed me a nice Cloudflare rule to do this. Media can still be a problem, because it’s individual users on other instances that are loading it, so it’s hard to block scrapers without blocking users. That is another way Cloudflare helps (static media files are easily cached by their CDN).

        • rekabis@lemmy.ca
          2 upvotes · 2 days ago

          you need a way to let them through to start the federation process.

          This isn’t via an API endpoint explicitly for that purpose that bots would normally not utilize?

          And why not have a process by which admins from a new instance poke the admins of another instance - any other instance, so long as it’s already a part of the network - to do an initial manual whitelist that could cascade through the entire system?

          Then there should be ways that the software itself can auth with other instances of itself, via a common encryption protocol. While this would only work with like software, the key point being that only a toehold is needed to start propagating.

          The point being, there are options. Some of them quite simple.

          • Dave@lemmy.nz
            1 upvote · 2 days ago

            Realistically, federation is not the main concern. You can leave all your API endpoints open to bots and not have a problem because they are loading the web app. Just block the web app for suspicious traffic.

            ActivityPub already uses authentication to some extent with other instances; it’s the first contact where you have to have trust.

            My main concern is still that media is loaded directly by users in most cases. The APIs are not a problem right now as the bots aren’t specifically targeting Lemmy. There are ways to address this, but Lemmy (and other threadiverse services) don’t have full-time dev teams; they work on what they can or want to work on given the very low hourly rate.

    • Cooper8@feddit.online
      5 upvotes · 3 days ago

      The thing that confuses me is, wouldn’t a whitelist for federated instances and request frequency throttling at the account level solve this issue?

      I suppose this would require that the client not have a public front end that keeps full navigation functionality, but for a smaller instance that seems like an easy sacrifice to make in exchange for stability.

      “But then how will new instances get federated?” Maybe they have to actually talk to the admins of other instances to get vouched in to the whitelist. Just because the network is distributed doesn’t mean it needs to be fully inclusive by default, and in fact it explicitly isn’t.

      I’m assuming I’m missing something super basic that makes all this not enough, bots spoofing the requests with the credentials of a whitelisted instance maybe?

      Seems like maybe the instances should have encrypted keys that handshake each other with batch requests.

      Am I on to something or just wildly gesticulating?

      • Dave@lemmy.nz
        5 upvotes · 2 days ago

        There are thousands of instances and it’s not really about admins. If a Mastodon user wants to go and follow a Lemmy community, they can. They shouldn’t need to ask their admin to contact the admin of the Lemmy instance to be allowed to.

        However, there is something called Fediseer which allows a chain of trust. Some instances guarantee other instances who then guarantee others down a chain. If an instance turns out bad then their guarantor can revoke it and any instances lower in the chain (that the spammy instance guarantees) also lose their trusted status. It doesn’t share IPs to my knowledge though, and outbound IPs are different than the inbound one on the domain if there is a CDN like Cloudflare in the mix. The intent is actually to identify and block instances set up to spam (or other reasons to defederate).
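The cascading revocation described above can be illustrated with a toy model. This is not Fediseer's API or data model; the class, its methods, and the one-guarantor-per-instance rule are assumptions made purely to show how revoking one instance removes everything it guaranteed.

```python
class TrustChain:
    """Toy model of a guarantee chain: each instance has at most one guarantor."""

    def __init__(self, roots):
        self.guarantor = {root: None for root in roots}  # instance -> guarantor

    def guarantee(self, guarantor, instance):
        if guarantor not in self.guarantor:
            raise ValueError(f"{guarantor} is not trusted, so it cannot guarantee others")
        if instance in self.guarantor:
            raise ValueError(f"{instance} is already guaranteed")
        self.guarantor[instance] = guarantor

    def revoke(self, instance):
        """Remove an instance and everything below it in the chain."""
        removed = []
        stack = [instance]
        while stack:
            current = stack.pop()
            if current not in self.guarantor:
                continue
            del self.guarantor[current]
            removed.append(current)
            # Queue every instance that the removed one had guaranteed.
            stack.extend(i for i, g in self.guarantor.items() if g == current)
        return removed

    def is_trusted(self, instance):
        return instance in self.guarantor
```

Revoking a spammy instance drops it and its whole subtree, while instances guaranteed through other branches keep their trusted status.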

        I think the other part missing is that it’s not just instances. If you upload an image to Lemmy.world and then someone on feddit.online views it, the feddit.online user’s browser loads that image directly from Lemmy.world. That means if you block any IP that’s not an instance, people won’t be able to see content uploaded by your users. So you have to be able to tell what is a Brazil-hosted AI bot and what’s a Brazilian user viewing a meme your user uploaded.

        There are of course different parts that you can or can’t block, which is basically the idea: working out which endpoints can be blocked and which will break things for genuine users. Static images can basically be ignored because Cloudflare will cache them, but having thousands of post or feed loads in a hurry can bring down an instance.

        • Cooper8@feddit.online
          2 upvotes · 2 days ago

          Fediseer seems like a good solution, essentially a whitelist vouch system with vouching at second hand.

          Regarding the media hosting, again it seems like something that could rely on a method of identifying the user request directly with their user account before responding to the request. Cookies could be an option for this, though they are falling out of favor. Alternately, and more securely, it could be a cryptographic handshake where the user’s home instance and the instance hosting the post generate a public key using their two private keys for the user, and the user provides the public key when making pull requests from the federated instance. The keys could be batch generated when an instance first federates content with another and then assigned to user accounts the first time the user makes a pull request through a link from their home instance to the federated instance.
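One simple way to picture a per-user credential like this is a signed, short-lived token. The sketch below deliberately swaps in a plainer technique than the key scheme described above: an HMAC over the user name and an expiry time, using a secret the two instances are assumed to already share. The function names and token layout are hypothetical.

```python
import hashlib
import hmac

def issue_token(shared_secret, user, expires_at):
    """Home instance: sign 'user|expiry' with the secret shared by both instances."""
    message = f"{user}|{expires_at}".encode()
    signature = hmac.new(shared_secret, message, hashlib.sha256).hexdigest()
    return f"{user}|{expires_at}|{signature}"

def verify_token(shared_secret, token, now):
    """Hosting instance: accept only an unexpired token with a matching signature."""
    try:
        user, expires_at, signature = token.rsplit("|", 2)
        expiry = int(expires_at)
    except ValueError:
        return False  # wrong number of fields or non-numeric expiry
    message = f"{user}|{expires_at}".encode()
    expected = hmac.new(shared_secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expiry
```

The home instance would attach the token to media links it hands to its logged-in users, and the hosting instance would check it before serving. Logged-out readers would have no token, which is the trade-off the rest of the thread discusses.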

          Secure Scuttlebutt Protocol has already developed encryption methodology that could be cross-applied for a lot of this: https://ssbc.github.io/scuttlebutt-protocol-guide/ though I am of course not suggesting SSP be adopted whole cloth, and there are a bunch of other OS projects with encryption that could be used. This is just the one that comes to mind.

          (edit: also I am in favor of finding methodologies that work whether CloudFlare is used by the instance or not, obviously CloudFlare has advantages but as we have seen also is a vulnerability of the network.)

          • Dave@lemmy.nz
            2 upvotes · 2 days ago

            Regarding the media hosting, again it seems like something that could rely on a method of identifying the user request directly with their user account before responding to the request.

            Yeah, so far it works to just check for a JWT in the cookie (regardless of what it is) to allow logged in users to bypass the rules. This works on Lemmy because the bots aren’t specifically targeting Lemmy so they don’t try to fake this (although if they were, just make an instance and our instances will send you all the data lol).
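A presence-only check like the one described above might look roughly like this. The cookie name `jwt` is an assumption, and the token is deliberately not verified: the function only looks for something with the three dot-separated parts a JWT has.

```python
from http.cookies import CookieError, SimpleCookie

def has_jwt_cookie(cookie_header, cookie_name="jwt"):
    """True if the Cookie header carries something shaped like a JWT.

    The token is not verified; only its presence and three-part shape
    are checked. The cookie name "jwt" is an assumption.
    """
    jar = SimpleCookie()
    try:
        jar.load(cookie_header or "")
    except CookieError:
        return False
    morsel = jar.get(cookie_name)
    if morsel is None:
        return False
    parts = morsel.value.split(".")
    return len(parts) == 3 and all(parts)
```

Anything that fakes the cookie gets through, which is acceptable only while scrapers are not written with Lemmy in mind.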

            Alternately, and more securely, it could be a cryptographic handshake where the user’s home instance and the instance hosting the post generate a public key using their two private keys for the user, and the user provides the public key when making pull requests from the federated instance.

            This is already basically how ActivityPub works for communication between instances. But the activities are one thing, it’s the page loads that are the killer because of the database queries needed to compile a unique, sorted home page of subscriptions. You could block logged out users but that impacts many lurkers.

            For media, that’s difficult as media is often being loaded from a remote instance that doesn’t know who you are, along with the problem that the media provider is not technically part of Lemmy (it’s a separate service called pict-rs) so doesn’t know if you’re logged in. I’m not sure how that worked on PieFed or Mbin, but regardless you might not be logged in at all, and you should still be allowed to browse content.

            Lemmy has a proxy option where the instance can fetch content from the other servers to provide to the user, which does get around this issue for logged out users. But the proxy caches the media, and when this happens you are now the host of whatever media is in any post that made its way to your instance, along with all the legal risks that involves.

            (edit: also I am in favor of finding methodologies that work whether CloudFlare is used by the instance or not, obviously CloudFlare has advantages but as we have seen also is a vulnerability of the network.)

            All of the things being discussed around mitigations in Cloudflare are also possible to do without Cloudflare, but it just means setting it all up yourself. I’ll just wait for someone smarter than me to build a tool I can host myself that does all this automatically, then I’ll consider it 😅

            • Cooper8@feddit.online
              2 upvotes · 2 days ago

              “you could block logged out users but that would impact many lurkers”

              “regardless you might not be logged in at all, you should still be allowed to browse content”

              Fundamentally, what I’m suggesting is a fork in the road. Either an instance admin can set up to eliminate scrapers by making the instance private to only registered users,

              or they can maintain their instance as public and deal with more arcane methods to attempt to eliminate scraping.

              The issue is that if the infrastructure isn’t in place for the instance operator to decide to make their service private, then everyone is opted in to the Scrapers vs Countermeasures war with no alternative.

              Privacy and encryption just work, it seems like not building the infrastructure to enable the network to function with them in place is a mistake.

              To me, and to many users, what we want is fast load times, quick federation, and reliable service, all things that benefit from reducing traffic load to only registered users.

              • Dave@lemmy.nz
                1 upvote · 2 days ago

                Fundamentally, what I’m suggesting is a fork in the road. Either an instance admin can set up to eliminate scrapers by making the instance private to only registered users,

                Yeah, it would require perhaps more changes (since instances newly subscribed to a community need the ability to ad hoc fetch content), but even just not showing the website when someone isn’t logged in would probably make a big difference. That might be pretty easy, just redirect requests to load the web app (except the login page) to the login page, and exclude the API. Apps would still get logged out access but I doubt that’s much of a problem compared to the website, since the bots seem to just be indiscriminately scraping web pages.
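The redirect idea above could be sketched as a small routing decision. The `/api/` prefix and `/login` path are assumptions for illustration; a real deployment would also need to let through whatever static assets the login page itself loads.

```python
API_PREFIX = "/api/"    # assumed API prefix, left open to apps and federation
LOGIN_PATH = "/login"   # assumed location of the login page

def route_request(path, logged_in):
    """Decide what to do with a request on a login-only web front end.

    Returns ("serve", path) when the request may proceed, otherwise
    ("redirect", LOGIN_PATH).
    """
    if logged_in:
        return ("serve", path)
    if path.startswith(API_PREFIX) or path == LOGIN_PATH:
        return ("serve", path)
    return ("redirect", LOGIN_PATH)
```

Logged-out visitors only ever receive the login page, so the expensive feed and post pages are never rendered for them, while API clients keep working.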