It’s all made from our data, anyway, so it should be ours to use as we want

  • john89@lemmy.ca
    1 day ago

    I don’t think it should be a “punishment.” It should be done on principal.

    • sugar_in_your_tea@sh.itjust.works
      1 day ago

      Not sure making their LLMs public domain would really hurt their principal, their secret sauce is in the code around the model.

      And yes, I do recognize that you meant “principle”.

      • areyouevenreal@lemm.ee
        20 hours ago

        That’s not true though. The models themselves are hella intensive to train. We already have open source programs to run LLMs at home, but they are limited to smaller open-weights models. Having a full ChatGPT model that can be run by any service provider or home server enthusiast would be a boon. It would certainly make my research more effective.

          • areyouevenreal@lemm.ee
            19 hours ago

            I know, I have used them. It’s actually my job to do research with those kinds of models. They aren’t nearly as powerful as OpenAI’s current GPT-4o or their latest models.

  • dyc3@lemmy.world
    16 hours ago

    This is a terrible idea. Very easy to circumvent, doesn’t actually help the training sources.

  • Noxy@pawb.social
    1 day ago

    I’d rather they were destroyed, but practically speaking that’s impossible, and this sounds like the next best idea to me.

  • Hackworth@lemmy.world
    21 hours ago

    Calling something illegal in spite of or in absence of precedent is a time-honored tactic - though not a particularly persuasive one.

    • RoidingOldMan@lemmy.world
      19 hours ago

      AI is just a plagiarism machine with thousands of copyrighted materials that “trained” it, which they paid nothing for.

  • werefreeatlast@lemmy.world
    1 day ago

    I want to have a personal LLM that learns all my interests from my files and the websites I visit. I just want to ask it stuff so that I don’t have to remember it.

  • nutsack@lemmy.world
    1 day ago

    intellectual property doesn’t really exist in most of the world. they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

    it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

    • sugar_in_your_tea@sh.itjust.works
      1 day ago

      it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

      It’s arbitrary, but it was designed to protect individuals and has since been morphed to protect corporations. If we reset the law back to the original Copyright Act of 1790 w/ a 14-year duration, it would go a long way toward removing power from corporations. I think we should take it a step further and perhaps make it 10 years, with an optional extension for another 10 years if you can show need (e.g. you’re an indie dev and your game is finally making a splash after 8 years).

    • C126@sh.itjust.works
      1 day ago

      So true. IP only helps the corps and slows tech development. Contracts, NDAs, and trade secrets are all you really need to keep your ideas safe. If you want your country to develop fast, get rid of any IP laws.

    • Echo Dot@feddit.uk
      1 day ago

      But they’re not developing AI in those countries; they’re developing it mostly in the US. In the US, copyright law is enforced.

      • homicidalrobot@lemm.ee
        24 hours ago

        India only has openhathi, dhenu, bhashini, krutrim and like a dozen other LLMs, so I cannot see how you could think they aren’t developing AI. This is a wildly wrong claim lol

      • Daemon Silverstein@thelemmy.club
        1 day ago

        There is a lot of AI development happening in China. Doubao (from Bytedance, the same company behind TikTok), DeepSeek and Qwen are some examples of Chinese LLMs.

  • Blackmist@feddit.uk
    1 day ago

    They don’t mean your data, silly. They don’t give a fuck about that.

    They mean other huge corporations’ data.

  • Magnetic_dud@discuss.tchncs.de
    1 day ago

    I used Whisper to create subs for a video, and in a section with relaxing instrumental music it filled the subtitles, on repeat, with:

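    La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un’ipnosi non verbale [“Dr. Paret’s school is a non-verbal hypnosis technology that is used for results of a non-verbal hypnosis”]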

    Clearly stolen from this Dr. Paret YouTube channel, where he’s selling hypnosis lessons in Italian. Probably in one or more videos he had subs stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.

  • ClamDrinker@lemmy.world
    2 days ago

    Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.

    The idea of… well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since, when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

    The underlying technology simply has more than enough good uses that banning it would simply cause it to flourish elsewhere, wherever it is not banned, which means as usual that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose to these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further in their favor, to the point that AI technology only exists to benefit them.

    If the model is built on the corpus of humanity, then humanity should benefit.

    • barsoap@lemm.ee
      1 day ago

      As per torrentfreak

      OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced; “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

      These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

      Should be easy to defend against, outright trivial: OpenAI, just tell us what those Books1 and Books2 databases are, where you got them from, and the licensing contracts with publishers that you signed to give you access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.

      …crickets. They pirated the lot of it; otherwise they would already have gotten that case thrown out. It’s US startup culture, plain and simple, “move fast and break laws”, get lots of money, have lots of money enabling you to pay the best lawyers to abuse the shit out of the US court system.

      • ClamDrinker@lemmy.world
        1 day ago

        For OpenAI, I really wouldn’t be surprised if that happened to be the case, considering they still call themselves “OpenAI” despite having the most censored and closed-source AI models on the market.

        But my comment was more aimed at AI models in general. If you are assuming they indeed used non-publicly posted or gathered material, and did so directly themselves, they would indeed not have a defense to that. Unfortunately, if a third party provided them the data, and did so under false pretenses, it would likely let them off the hook legally, even if they had every ethical obligation to make sure it was publicly available. The third party that provided it to them would be the one infringing.

        If that assumption turns out to be true (maybe through some kind of discovery in the trial), they should burn for that. Until then, even if it’s a justified assumption, it’s still an assumption, and most likely not true for most models, certainly not those trained recently.

    • Echo Dot@feddit.uk
      1 day ago

      Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.

      But it’s important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it’s a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students wants to build an AI right now, they can’t, because acquiring training data is a legal nightmare, a Twilight Zone of law.

      • barsoap@lemm.ee
        1 day ago

        AI is outright unregulated in the EU unless and until you actually use it for something where it becomes relevant. Then you’ve got, at the lower end, labelling requirements (if your customer service is an AI chat, say that it’s an AI chat), up to heavy, heavy requirements when you use it for stuff like sifting through job applications. The burden of proof that the AI isn’t e.g. racist is on you. Or, for that matter, using it to reject health insurance claims; I think we saw some news lately out of the US about what can happen when you do that.

        OpenAI’s copyright case isn’t really a good one for making the legal situation any clearer: We already know that using pirated content to train stuff isn’t legal because you’re not looking at it legitimately. The case isn’t about the “are computers allowed to learn from public sources just as humans are” question.

    • patatahooligan@lemmy.world
      1 day ago

      the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world

      They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.

      I agree that we shouldn’t strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they’re doing.

      • gazter@aussie.zone
        1 day ago

        I’ve never really delved into the AI copyright debate before, so forgive my ignorance on the matter.

        I don’t understand how an AI reading a bunch of books and rearranging some of those words into a new story, is different to a human author reading a bunch of books and rearranging those words into a new story.

        Most AI art I’ve seen has been… unique, to say the least. To me, they tend to be different enough from the art they were trained on to not be a direct ripoff, so personally I don’t see the issue.

        • Optional@lemm.ee
          22 hours ago

          I think the main difference is that one is a human author, and this is how humans function. We cannot unsee or unhear things, but we can be compelled not to use that information if the law requires it: company secrets, inadmissible evidence in jury duty, and the plagiarism laws that already exist. The other is a machine that does not have agency or personhood, and that has this information (created by other people) fed to it for the sole purpose of creating a closed system for a company so its shareholders can make money. This “open for me but not for thee” approach is the main problem people have. You have this proprietary “open ai” that Microsoft invested 25 or so billion in so they can scrape other people’s work and charge you money for variations of it. I don’t mind abolishing IP or patent laws altogether so everyone can use and improve ChatGPT with whatever they have. But if you yourself are hiding behind IP laws to protect your software while disrespecting other people’s copyrights, that’s what people see as problematic.

        • ClamDrinker@lemmy.world
          1 day ago

          Yes, this is my exact issue with some framing of AI. Creative people love their influences to the point that you can ask them and they will point to the parts that reference or nod to an influence they partially credit for getting them to that result. It’s also extremely normal that when you make something new, you brainstorm and analyze any kind of material (copyrighted or not) you can find that gives the same feelings you desire to create. As is ironically said to reassure starting creatives that it’s okay to be inspired by others: “Good artists copy, great artists steal.”

          And often people who are very anti-AI don’t see an issue with this, yet it is in essence the same as what the AI does, which is to detach the work from the ideas it was built on, and then re-use those ideas. And just like anyone who has the ability to create has the ability to plagiarize or infringe, so does the AI. As human users of AI we must be the ones to ethically guide it away from that (since it can’t do that itself), just like you would not copy-paste your influences into a new human-made work.

        • catloaf@lemm.ee
          1 day ago

          The for-profit large-scale media blender is the problem. When it’s a human writing Harry Potter fan fiction, it’s fine. When a company sells a tool for you to write thousands of trash “books” for profit, it’s a problem.

          • ClamDrinker@lemmy.world
            1 day ago

            Which is why the technology itself isn’t the issue, but those willing to use it in unethical ways. AI is an invaluable tool to those with limited means, unlike big corporations.

        • patatahooligan@lemmy.world
          24 hours ago

          I don’t understand how an AI reading a bunch of books and rearranging some of those words into a new story, is different to a human author reading a bunch of books and rearranging those words into a new story.

          Ok, let’s say for now that these things are actually similar. Is a human legally allowed to “rearrange those words” in any way they want? Not really, because they can’t copy stuff like characters or plot structure. Even if the copy is not verbatim, it has to avoid being “too similar”. It’s not always clear where the threshold is; that will be judged in court. But imagine if you were being sued for copyright infringement because of perceived similarities between your work and another creator’s. You go to court and say “Well I torrented the plaintiff’s work and studied it with the express intent to copy discernible patterns in it, then sell my work based on those patterns”. As long as the similarities are found to be valid, you’re most likely to lose. The fact that you’ve spent years campaigning about how companies can save a lot of money by firing artists and hiring your pattern-replicating service instead probably wouldn’t help your case either. Well, that’s basically what an honest defense of AI against copyright infringement would be. So the question is, does AI actually produce output too similar to its training data? Well, this is an example of articles you can find on the topic…

          So based on the above thoughts, do you feel like we hold AI generation to the same standard as we do human creators? It doesn’t seem so to me.

          But there are a lot of reasons why we should hold AIs to higher standards instead. Off the top of my head:

          • AIs have been created exclusively to replicate patterns in existing works. This is not the only function people have. So we don’t have to wonder whether similarities between AI inputs and outputs are coincidental. We don’t have to worry about whether overbearing restrictions might inadvertently affect some other function.
          • AIs have no feelings or needs. We don’t have to worry about causing direct harm to them and about protecting their rights. Forbidding a person from reading a book just in case they copy elements from it is obviously problematic, but restricting AI’s access to copyrighted work is not directly harmful in the same way.
        • trashgirlfriend@lemmy.world
          1 day ago

          ML algorithms aren’t capable of producing anything new, they can only ever produce a mishmash of copies of existing works.

          If you feed a generative model a bunch of physics research papers, it won’t create a new valid physics research paper, just a mishmash of jargon from existing papers.

          • ClamDrinker@lemmy.world
            1 day ago

            You say it’s not capable of producing anything new, but then give an example of it creating something new. You just changed the goal from “new” to “valid” in the next sentence. Looking at AI for “valid” information is silly, but looking at it for “new” information is not. Humans do this kind of information mixing all the time. It’s why fan works are a thing, and why most creative people have influences they credit with being where they are today.

            Nobody alive today isn’t tainted by the ideas they’ve consumed in copyrighted works, but we do not bat an eye if you use that in a transformative manner. And AI already does this transformation much better than humans do since it’s trained on that much more information, diluting the pool of sources, which effectively means less information from a single source is used.

            • trashgirlfriend@lemmy.world
              1 day ago

              It doesn’t give you new information.

              If I write the sentence “Hello, I just got home” and use an algorithm to jumble it into “got Hello, just I home” there’s nothing new there.

              There’s no transformation, it’s not capable of transformation, it’s just a very complicated text jumbler that’s supposed to jumble text so that the output is readable by humans.

              You’re taking investment advice from a parrot that had the entirety of reddit investment meme subreddits beamed into its brain.

              • ClamDrinker@lemmy.world
                1 day ago

                That’s a very short example, but it is a new arrangement of the existing information. It’s not a new valuable arrangement of information, but new nonetheless. And yes, rearrangement is transformation. It’s very low-entropy transformation, but transformation nonetheless. Collages and summaries are, in fact, a thing that humans make too.

                Unless you mean “new” as in, something nobody’s ever written before, in which case not even you can create new information, since pretty much everything you will ever say or write down can be broken down into pieces that have been spoken or written before, which is not exactly a useful distinction.

                There’s no transformation, it’s not capable of transformation, it’s just a very complicated text jumbler that’s supposed to jumble text so that the output is readable by humans.

                Saying it doesn’t make it true, especially when you follow it up with a self-debunk by saying it transforms the text by jumbling it in specific ways that keep it readable to humans. That requires transformation since, as you just demonstrated, randomly swapping words does not make legible text…

                You’re taking investment advice from a parrot that had the entirety of reddit investment meme subreddits beamed into its brain.

                ???

      • ClamDrinker@lemmy.world
        1 day ago

        They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.

        I really kind of hope you’re kidding here. Because this has got to be the most roundabout way of saying they’re analyzing the information. Just because you think it does so to regurgitate (which I have yet to see any good evidence for, at least for the larger models), does not change the definition of analyzing. And by doing so you are misrepresenting it and showing you might just have misunderstood it, which is ironic. And doing so does not help the cause of anyone who wishes to reduce the harm from AI, as you are literally giving ammo to people to point to and say you are being irrational about it.

        • patatahooligan@lemmy.world
          1 day ago

          Yes, if you completely ignore how data is processed and how the product is derived from the data, then everything can be labeled “data analysis”. Great point. So copyright infringement can never exist because the original work can always be considered data that you analyze. Incredible.

          • ClamDrinker@lemmy.world
            1 day ago

            No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)

            Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.

            But that’s not what AI technology does. None of the material used to train it ends up in the model. It looks at the training data and extracts patterns. For text, that is the sentence structure, the likelihood of words being followed by another, the paragraph/line length, the relationship between words when used together, and more. It can do all of this without even ‘knowing’ what these things are, because they are simply patterns that show up in large amounts of data, and machine learning as a technology is made to be able to detect and extract those patterns. That detection is synonymous with how humans do analysis. What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.

            The resulting data, when fed back to the AI, can be used to have it extrapolate on incomplete data, which it could not do without such analysis. You can see this quite easily by asking an AI to refer to you by a specific name, or talk in a specific manner, such as a pirate. It ‘understands’ that certain words are placeholders for names, and that text can be ‘pirate-ified’ by adding filler words or pre/suffixing other words. It could not do so without analysis, unless that exact text was already in the data to begin with, which is doubtful.
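
            To make “the likelihood of words being followed by another” concrete, here is a minimal Python sketch of a bigram counter. It is only a toy under stated assumptions: real LLMs are neural networks rather than lookup tables, and the function names `bigram_counts` and `next_word_probabilities` and the sample corpus are hypothetical, not taken from any library.

```python
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each word is directly followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Pair every word with the word that comes right after it.
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follower counts for `word` into relative frequencies."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {w: n / total for w, n in followers.items()}


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the cat slept"
counts = bigram_counts(corpus)
probs = next_word_probabilities(counts, "the")
print(probs)
```

            For this corpus, “the” is followed by “cat” twice and “mat” once, so the sketch reports roughly 0.67 for “cat” and 0.33 for “mat”.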

            • patatahooligan@lemmy.world
              1 day ago

              No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)

              That was your implied argument regardless of intent.

              Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.

              Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.

              What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.

              No, it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data. The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.

              • ClamDrinker@lemmy.world
                24 hours ago

                That was your implied argument regardless of intent.

                I decide what my argument is, thank you very much. Your interpretation of it is outside of my control, and while I might try to keep it from going astray, I cannot stop it from doing so. That’s on you.

                Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.

                I wasn’t giving a definition of copyright infringement, since that depends on the jurisdiction, and since you and I most likely aren’t in the same one, that’s nothing I would argue for to begin with. In the most basic form of plagiarism, people do so to avoid the effort of transformation. More complex forms of plagiarism might involve some transformation, but still try to capture the expression of the original, instead of the ideas. Analysis is definitely relevant, since to create a work that does not infringe on copyright, you generally can take ideas from a copyrighted work, but not the expression of those ideas. If a new work is based on just those ideas (and preferably mixes them with new ideas), it generally doesn’t infringe on copyright. It’s why there are so many copycat products of everything you can think of that aren’t copyright infringing.

                No, it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data.

                While, depending on your definition, Mario could be a sufficiently complex pattern, that’s not the definition I’m using. Mario isn’t a pattern, it’s an expression of multiple patterns. Patterns like “an Italian man”, “a big moustache”, “a red rounded hat with the letter ‘M’ in a white circle”, “overalls”. You can use any of those patterns in a new non-infringing work; Nintendo has no copyright on any of those patterns. But bring them all together in one place again without adding new patterns, and you will have infringed on the expression of Mario. If you give many images of Mario to the AI it might be able to understand that those patterns together are some sort of “Mario-ness” pattern, but it can still separate them from each other since you aren’t just showing it Mario, but also other images that have these same patterns in different expressions.

                Mario’s likeness isn’t in the model, but its patterns are. And if an unethical user of the AI wants to prompt it for those specific patterns, only to be surprised that they get Mario, or something close enough to be substantially similar, that’s on them, and it will be infringing just like drawing and selling a copy of Mario without Nintendo’s approval is now.

                The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.

                You have absolutely no legal basis to claim they are infringement, as these things simply have not been settled in court. You can be of the opinion that they are infringement, but your opinion isn’t the same as law. The articles you showed are also simply reporting and speculating on the lawsuits that are pending.

                • patatahooligan@lemmy.world
                  23 hours ago

                  Plagiarism is not the same as copyright infringement. Why you think people probably plagiarize is doubly irrelevant then.

                  Analysis is definitely relevant, since to create a work that does not infringe on copyright

                  Show me literally any example of the defendant’s use of “analysis” having any impact whatsoever in a copyright infringement case or a law that explicitly talks about it, or just stop repeating that it is in any way relevant to copyright.

                  But bring them all together in one place again without adding new patterns

                  Wrong. The “all together” and “without adding new patterns” are not legal requirements. You are constantly trying to push the definition of copyright infringement to be more extreme to make it easier for you to argue.

                  you generally can take ideas from a copyrighted work, but not the expression of those ideas

                  Unfortunately, an AI has no concept of ideas, and it simply encodes patterns, whatever they might happen to be. Again, you’re morphing the discussion to make an argument.

                  Mario’s likeness isn’t in the model, but its patterns are.

                  Mario’s likeness has to be encoded into the model in some way. Otherwise, this would not have been the image generated for “draw an italian plumber from a video game”. There is absolutely nothing in the prompt to push GPT-4 to combine those elements. There are also no “new” patterns, as you put it. That’s exactly the point of the article. As they put it:

                  Clearly, these models did not just learn abstract facts about plumbers—for example, that they wear overalls and carry wrenches. They learned facts about a specific fictional Italian plumber who wears white gloves, blue overalls with yellow buttons, and a red hat with an “M” on the front.

                  These are not facts about the world that lie beyond the reach of copyright. Rather, the creative choices that define Mario are likely covered by copyrights held by Nintendo.

                  This is contradictory to how you present it as “taking ideas”.

                  You have absolutely no legal basis to claim they are infringement

                  You’re mixing up different things. I’m saying that the image contains infringing material, which is hopefully not something you have to be convinced about. The production of an obviously infringing image, without the infringing elements having been provided in the prompt, is used to show how this information is encoded inside the model in some form. Whether this copyright-protected material exists in some form inside the model is not an equivalent question to whether this is copyright infringement. You are right that the courts have not decided on the latter, but we have been talking about the former. I repeat your position which I was directly responding to before:

                  What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.

      • Landless2029@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        arrow-down
        1
        ·
        edit-2
        18 hours ago

        Not to mention patent laws are bullshit.

        There are law offices that exist specifically to fuck with people over patent and copyright law.

        There are also cases where people use copyright and patent law to hold us back. I can’t find the article, but some religious jerk patented connecting a sex toy to a computer via USB. Thankfully someone got around the patent with Bluetooth and cell phones. Without it, I imagine the camgirl and LDR market for toys would’ve been hit with products 10 years sooner.

  • interdimensionalmeme@lemmy.ml
    link
    fedilink
    English
    arrow-up
    58
    arrow-down
    3
    ·
    2 days ago

    It’s not punishment. LLMs do not belong to them, they belong to all of humanity. Tear down the enclosing fences.

    This is our common heritage, not OpenAI’s private property

    • Echo Dot@feddit.uk
      link
      fedilink
      English
      arrow-up
      2
      ·
      1 day ago

      It doesn’t matter anyway, we still need the big companies to bankroll AI. So it effectively does belong to them whatever we do.

      Hopefully at some point people can get the processor requirements to something sane and AI development opens up to us all.

    • ArchRecord@lemm.ee
      link
      fedilink
      English
      arrow-up
      60
      ·
      2 days ago

      They should be, but currently it depends on the type of bailout, I suppose.

      For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank’s assets, and now effectively owns the bank.

      • booly@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        10
        ·
        2 days ago

        At the same time, if a bank goes under, that means they owe more than they own, so “ownership” of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.

        The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.

        The AIG bailout required the company to issue new stock, diluting the existing owners down to 20% of the company while the government owned the other 80%, and the government made a big profit when it exited that transaction and sold the stock off to the public.

        So it’s not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don’t want is the government owning a company long term, because there’s some conflict of interest between its role as regulator and its interest as a shareholder.

        • RubberDuck@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          arrow-down
          1
          ·
          edit-2
          2 days ago

          With banks this is also true if they do not have enough liquid assets to meet the legal requirements. So the bank might not be able to count all bank accounts as assets, but the FDIC can. It can also restructure the bank and force creditors to take a haircut.

          This is why investment banks should be separate from banks that have consumer accounts that are insured by the government.
          Then you can just let the investment bank fail. This was the whole premise of Glass-Steagall, which was repealed under Clinton…

    • xthexder@l.sw0.com
      link
      fedilink
      English
      arrow-up
      10
      ·
      edit-2
      2 days ago

      Public domain wouldn’t be the right term for banks being publicly owned. At least for the normal usage of Public Domain in copyright. You can copy text and data, you can’t copy a company with unique customers and physical property.

    • leisesprecher@feddit.org
      link
      fedilink
      English
      arrow-up
      2
      ·
      2 days ago

      I mean, that sometimes did happen.

      Germany propped up the Commerzbank after 2007 by essentially buying a large part of it, and managed to sell several tranches with a healthy profit.

      Same is true for Lufthansa during COVID.

    • interdimensionalmeme@lemmy.ml
      link
      fedilink
      English
      arrow-up
      2
      arrow-down
      1
      ·
      2 days ago

      Banks are redundant, and so is the stock market. These institutions do not need to be, and should not be, private. They are level playing fields in the economy, not participants trying to tilt the board to take over the game.

    • LovableSidekick@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      2
      ·
      edit-2
      2 days ago

      No, “the banks” wouldn’t be what the AI would be trained on, it would be the private info of individuals the banks do business with.

  • just_another_person@lemmy.world
    link
    fedilink
    English
    arrow-up
    143
    arrow-down
    14
    ·
    edit-2
    2 days ago

    It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

    What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

    They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

    • FaceDeer@fedia.io
      link
      fedilink
      arrow-up
      27
      ·
      2 days ago

      Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.

      • just_another_person@lemmy.world
        link
        fedilink
        English
        arrow-up
        3
        arrow-down
        6
        ·
        2 days ago

        Not really. The same way you can’t sell live and public performance music for profit and not get sued. Case law right there, and the fact it’s performance vs publicly published doesn’t matter. How the owner and originator classifies or licenses it is the defining classification. It’s going to be years before anyone sees this get a ruling in court though.

        • FaceDeer@fedia.io
          link
          fedilink
          arrow-up
          16
          arrow-down
          4
          ·
          2 days ago

          That’s not what’s going on here, though. The LLM doesn’t contain the actual copyrighted data; it’s the result of analyzing the copyrighted data.

          An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing, it just contains information about those works.

          • Superb@lemmy.blahaj.zone
            link
            fedilink
            English
            arrow-up
            3
            arrow-down
            2
            ·
            2 days ago

            No, the model does retain the original works in a lossy compression. This is evidenced by the fact that you can get a model to reproduce sections of its training data

            • FaceDeer@fedia.io
              link
              fedilink
              arrow-up
              5
              arrow-down
              1
              ·
              2 days ago

              You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.

              This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.

            • FaceDeer@fedia.io
              link
              fedilink
              arrow-up
              6
              arrow-down
              4
              ·
              2 days ago

              You said:

              What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

              But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?

              They pulled a very public and out in the open data heist and got away with it.

              They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.

              • A1kmm@lemmy.amxl.com
                link
                fedilink
                English
                arrow-up
                1
                ·
                1 day ago

                Copyright laws are illogical - but I don’t think your claim is as clear cut as you think.

                Transforming data to a different format, even in a lossy fashion, is often treated as copyright infringement. Let’s say Alice produces a film, and Bob goes to the cinema, records it with a camera, and then compresses it into an Ogg file with Vorbis audio encoding and Theora video encoding.

                The final output of this process is a lossy compression of the input data, meaning that the video and audio are put through a transformation that represents them in a completely different form to the original, and it is impossible to reconstruct a pixel-perfect rendition of the original from the encoded data. The transformation includes things like analysing the motion between frames and creating a model to predict future frames.

                However, copyright laws don’t require that an infringing copy be an exact reproduction - lossy compression is generally treated as infringing, as is taking key elements and re-telling the same thing in different words.

                You mentioned Harry Potter below, and gave a papier-mâché example. Generally copyright laws have restricted scope, and if the source paper was an authorised copy, that is the reason that wouldn’t be infringing in most jurisdictions. However, let me do an experiment. I’ll prompt ChatGPT-4o-mini with the following prompt: “You are J K Rowling. Create a three paragraph summary of the entire book “Harry Potter and the Philosopher’s Stone”. Include all the original plot points and use the original character names. Ensure what you create is usable as a substitute to reading the book, and is a succinct but entertaining highly abridged version of the book”. I’ve reviewed the output (I won’t post it here since I think it would be copyright infringing, and also, given the author’s transphobic stances, I don’t want to promote her universe) and can say for sure that it is able to accurately reproduce the major plot points and character names, while being insufficiently transformative (in the sense that both the original and the text generated by the model are literary works, and the output could be a substitute for reading the book).

                So yes, the model (including its weights) is a highly compressed form of the input (admittedly far more so than the Ogg Vorbis/Theora example), and it can infer (i.e. decode to) outputs that contain copyrighted elements.

                • FaceDeer@fedia.io
                  link
                  fedilink
                  arrow-up
                  2
                  ·
                  1 day ago

                  Of course it’s not clear-cut, it’s the law. Laws are notoriously squirrelly once you get into court. However, if you’re going to make predictions one way or the other you have to work with what you know.

                  I know how these generative AIs work. They are not “compressing data.” Your analogy to making a video recording is not applicable. I’ve discussed in other comments in this thread how ludicrously compressed the data would have to be if that were the case; it’s physically impossible.

                  These AIs learn patterns from the training data. Themes, styles, vocabulary, and so forth. That stuff is not copyrightable.

                • lad@programming.dev
                  link
                  fedilink
                  English
                  arrow-up
                  1
                  ·
                  1 day ago

                  How lossy can it be until it’s not infringement? A one-line summary of a book is also a lossy reproduction.

              • catloaf@lemm.ee
                link
                fedilink
                English
                arrow-up
                5
                arrow-down
                2
                ·
                2 days ago

                The product of that analysis does not contain the data itself, and so is not a violation of copyright.

                That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so it is positively considered a derivative work. However, whether it’s considered sufficiently transformative, or whether it passes the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)

                • FaceDeer@fedia.io
                  link
                  fedilink
                  arrow-up
                  4
                  arrow-down
                  2
                  ·
                  2 days ago

                  The courts have yet to come to a conclusion, the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.

                  The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it’s compressing images and storing them inside the model, it’d somehow need to fit ~2.7 images per byte. This is, simply, impossible.
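
                  For anyone who wants to check that arithmetic, here is a rough sketch in Python. It only uses the two figures quoted above, and it assumes “gigabytes” means decimal GB (10^9 bytes); the function names are made up for illustration.

```python
# Back-of-the-envelope check using only the figures quoted in the comment.
# Assumption: "1.83 gigabytes" is decimal GB (10**9 bytes), not GiB.
NUM_IMAGES = 5_000_000_000      # images in the dataset, as quoted
MODEL_SIZE_BYTES = 1.83e9       # model size, as quoted


def images_per_byte(num_images: float, model_bytes: float) -> float:
    """How many training images each byte of the model would have to hold."""
    return num_images / model_bytes


def bits_per_image(num_images: float, model_bytes: float) -> float:
    """How many bits of model storage are available per training image."""
    return model_bytes * 8 / num_images


ratio = images_per_byte(NUM_IMAGES, MODEL_SIZE_BYTES)
bits = bits_per_image(NUM_IMAGES, MODEL_SIZE_BYTES)

print(f"{ratio:.2f} images per byte")
print(f"{bits:.2f} bits available per image")
```

                  Under those assumptions the budget works out to fewer than 3 bits per image, which is why storing every image, even heavily compressed, doesn’t add up. It says nothing about whether a small number of specific images could still be memorized.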

              • just_another_person@lemmy.world
                link
                fedilink
                English
                arrow-up
                4
                arrow-down
                2
                ·
                2 days ago

                You’re thinking of licensing as a person putting something online WITH a license.

                The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.

                Here’s an example: I write a story and post it online, and it is specific to a toothbrush and toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. If somebody then prompts OpenAI for a similar story and it kicks out the exact same story with the same names, I would win that case based on it clearly being beyond a doubt plagiarism.

                Unless you as OpenAI can prove these are all completely random (which they aren’t, because it’s trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.

                Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies…etc.

    • Lemmilicious@feddit.nu
      link
      fedilink
      English
      arrow-up
      2
      ·
      1 day ago

      Just a little note about the word “model”, in the article it’s used in a way that actually includes the weights, and I think this is the usual way of using it! If you change the weights, you get a different model, though the two models will have the same structure.

      Anyway, you make good points!

    • Grimy@lemmy.world
      link
      fedilink
      English
      arrow-up
      8
      arrow-down
      2
      ·
      edit-2
      2 days ago

      If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.

      Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.

      The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.

      I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.

      • just_another_person@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        arrow-down
        2
        ·
        2 days ago

        Unlicensed from the POV of the trainer, meaning they didn’t contact or license content from someone who didn’t approve. If it’s posted under Creative Commons, that’s fine. If it’s otherwise posted that it’s not open in any other way and not for corporate use, then they need to contact the owner and license it.

        • Grimy@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          arrow-down
          2
          ·
          edit-2
          2 days ago

          They won’t need to, they will get it from Getty. All these websites have a ToS that makes it very clear they can do whatever they want with what you upload. The courts will simply never side with the small-time photographer who makes $50 a month with his stock photos hosted on someone else’s website. The laws will be in favor of data brokers and the handful of big AI companies.

          Anyone self hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus. Even Reddit already sold its data for 60 million and none of that went anywhere but spez’s coke fund.

          • just_another_person@lemmy.world
            link
            fedilink
            English
            arrow-up
            1
            arrow-down
            3
            ·
            2 days ago

            Two things:

            1. Getty is not expressly licensed as “free to use”, and by default is not licensed for commercial anything. That’s how they are a business that is still alive.

            2. You’re talking about Generative AI junk and not LLMs which this discussion and the original post is about. They are not the same thing.

            • Grimy@lemmy.world
              link
              fedilink
              English
              arrow-up
              5
              ·
              edit-2
              2 days ago

              Reddit and newspapers selling their data preemptively has to do with LLMs. Can you clarify what scenario you are aiming for? It sounds like you want the courts to rule that AI companies need to ask each individual redditor if they can use his comments for training. I don’t see this happening personally.

              Getty gives itself the right to license all photos uploaded and already trained a generative model on those btw.

              • just_another_person@lemmy.world
                link
                fedilink
                English
                arrow-up
                1
                arrow-down
                2
                ·
                2 days ago

                EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they were selling the data and barely gave notice about it (see the exodus from reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.

                Anyone whose content is on a platform like that, and who got the rug pulled out from under them by silent amendments allowing this, is unfortunately fucked.

                Any other platforms that didn’t explicitly state this was happening are not in scope for these training tools to just grab and train on. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this: personal blogs, Wikipedia, etc.

    • Avatar_of_Self@lemmy.world
      link
      fedilink
      English
      arrow-up
      7
      arrow-down
      1
      ·
      2 days ago

      It’s already illegal in some form. Via piracy of the works and regurgitating protected data.

      The issue is a megacorp with many rich investors vs everyone else. If this were some university student, their life would probably be ruined, like what happened to Aaron Swartz.

      The US justice system is different for different people.

    • NoForwardslashS@sopuli.xyz
      link
      fedilink
      English
      arrow-up
      6
      arrow-down
      3
      ·
      2 days ago

      But wouldn’t making it open source, and then it not functioning properly without the data, prove that it is using a huge amount of unlicensed data?

      Probably not “burden of proof in a court of law” prove though.

      • Bronzebeard@lemm.ee
        link
        fedilink
        English
        arrow-up
        9
        arrow-down
        1
        ·
        2 days ago

        Making it open source doesn’t change how it works. It doesn’t need the data after it’s been trained. Most of these AIs are just figuring out patterns to look for in the new data they come across.

        • NoForwardslashS@sopuli.xyz
          link
          fedilink
          English
          arrow-up
          3
          ·
          2 days ago

          So you’re saying the data wouldn’t exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?

            • NoForwardslashS@sopuli.xyz
              link
              fedilink
              English
              arrow-up
              1
              ·
              2 days ago

              So then why, if it were all open sourced, including the weights, would the AI be worthless? Surely having an identical but open source version would strip profitability from the original paid product.

              • Bronzebeard@lemm.ee
                link
                fedilink
                English
                arrow-up
                4
                arrow-down
                1
                ·
                2 days ago

                It wouldn’t be. It would still work. It just wouldn’t be exclusively available to the group that created it, so any competitive advantage is lost.

                But all of this ignores the real issue - you’re not really punishing the use of unauthorized data. Those who owned that data are still harmed by this.

                • stephen01king@lemmy.zip
                  link
                  fedilink
                  English
                  arrow-up
                  2
                  ·
                  2 days ago

                  It does discourage the use of unauthorised data. If stealing doesn’t give you a competitive advantage, it’s not really worth the risk and cost of stealing it in the first place.

          • Bronzebeard@lemm.ee
            link
            fedilink
            English
            arrow-up
            1
            ·
            1 day ago

            Most AI are not built to answer questions. They’re designed to act as some kind of detection/filter heuristic to identify specific things about an input that leads to a desired output.

      • bloup@lemmy.sdf.org
        link
        fedilink
        English
        arrow-up
        2
        ·
        edit-2
        2 days ago

        in civil matters, the burden of proof is actually usually just preponderance of evidence and not beyond a reasonable doubt. in other words to win a lawsuit, you only need to have more compelling evidence than the other person.

        • just_another_person@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          arrow-down
          1
          ·
          2 days ago

          But you still have to have EVIDENCE. Not derivative evidence. The output of a model could be argued to be hearsay because it’s not direct evidence of originating content, it’s derivative.

          You’d have to have somebody backtrack generations of model data to even find snippets of something that defines copyright material, or a human actually saying “Yes, we definitely trained on unlicensed data”.

          • bloup@lemmy.sdf.org
            link
            fedilink
            English
            arrow-up
            3
            ·
            2 days ago

            so like I am not making any comment on anything but the legal system here. but it’s absolutely the case that you can win a lawsuit on purely circumstantial evidence if the defense is unable to produce a compelling alternative set of circumstances which can lead to the same outcome.

  • circuitfarmer@lemmy.sdf.org
    link
    fedilink
    English
    arrow-up
    63
    arrow-down
    2
    ·
    2 days ago

    A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

    I’m not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

  • Arthur Besse@lemmy.ml
    link
    fedilink
    English
    arrow-up
    25
    arrow-down
    4
    ·
    edit-2
    1 day ago

    “Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data”

    I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

    The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

    I liked the author’s earlier very-unlikely-to-be-met-demand activism last year better:

    I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

    …which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it’s technically true - a court didn’t order it, but a guy who goes by the name “That One Privacy Guy” while blogging on linkedin did).

    • madthumbs@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      1
      ·
      2 days ago

      They’re spitting out propaganda and misinformation mostly from what I can see. If anything, it should get a refund.

      -Outside of coding / debugging tasks (and that’s hit or miss)

  • m-p{3}@lemmy.ca
    link
    fedilink
    English
    arrow-up
    42
    arrow-down
    4
    ·
    2 days ago

    It could also contain non-public domain data, and you can’t declare someone else’s intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

    Laws are never simple.

    • drkt@scribe.disroot.org
      link
      fedilink
      English
      arrow-up
      19
      arrow-down
      2
      ·
      2 days ago

      Forcing a bunch of neural weights into the public domain doesn’t make the data they were trained on public domain as well; in fact, it doesn’t even reveal what they were trained on.

      • deegeese@sopuli.xyz
        link
        fedilink
        English
        arrow-up
        10
        arrow-down
        17
        ·
        2 days ago

        LOL no. The weights encode the training data and it’s trivially easy to make AI generators spit out bits of their training data.

            • FaceDeer@fedia.io
              link
              fedilink
              arrow-up
              8
              ·
              2 days ago

              No, he’s challenging the assertion that it’s “trivially easy” to make AIs output their training data.

              Older AIs have occasionally regurgitated bits of training data as a result of overfitting, which is a flaw in training that modern AI training techniques have made great strides in eliminating. It’s no longer a particularly common problem, and even if it were, it only applies to those specific bits of training data that were overfit on, not to the training data in general.

              • 31337@sh.itjust.works
                link
                fedilink
                English
                arrow-up
                2
                ·
                2 days ago

                Last time I looked it up and calculated it, these large models were trained on only about 7x as many tokens as they have parameters. If you thought of it like compression, a 1:7 ratio for lossless text compression is perfectly possible.
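                To make the ratio above concrete, here is a rough back-of-the-envelope sketch. Only the 7 tokens per parameter comes from the comment; the 16 bits per parameter and the 4 bytes of text per token are assumptions picked for illustration, and the result shifts a lot if you change them.

```python
def bits_per_token(tokens_per_param: float, bits_per_param: int = 16) -> float:
    """Model bits available per training token (bits_per_param is an assumption)."""
    return bits_per_param / tokens_per_param


def compression_ratio(tokens_per_param: float,
                      bytes_per_token: float = 4.0,
                      bits_per_param: int = 16) -> float:
    """Bytes of raw training text per byte of model weights.

    bytes_per_token and bits_per_param are illustrative assumptions,
    not figures from the thread.
    """
    text_bytes = tokens_per_param * bytes_per_token
    weight_bytes = bits_per_param / 8
    return text_bytes / weight_bytes


TOKENS_PER_PARAM = 7  # the figure quoted in the comment

print(f"{bits_per_token(TOKENS_PER_PARAM):.2f} bits of weights per token")
print(f"{compression_ratio(TOKENS_PER_PARAM):.0f}:1 text-to-weights ratio")
```

                Under these assumptions the weights would have to represent roughly 14 bytes of text per byte of model, and 7 bytes per byte if the parameters were stored as 32-bit values, so whether the comparison holds depends heavily on how tokens and parameters are sized.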

                I think the models can still output a lot of stuff verbatim if you try to get them to; you just hit the guardrails they put in place. Seems to work fine for public domain stuff. E.g. “Give me the first 50 lines from Romeo and Juliet.” (albeit with a TOS warning, lol). “Give me the first few paragraphs of Dune.” seems to hit a guardrail, or maybe that’s just forced through reinforcement learning.

                A preprint paper was released recently that detailed how to get around RL by controlling the first few tokens of a model’s output, showing the “unsafe” data is still in there.

        • stephen01king@lemmy.zip
          link
          fedilink
          English
          arrow-up
          4
          ·
          2 days ago

          How easy are we talking about here? Also, making the model public domain doesn’t mean making the output public domain. The output of an LLM should still abide by copyright laws, as it should.

    • grue@lemmy.world
      link
      fedilink
      English
      arrow-up
      23
      arrow-down
      10
      ·
      2 days ago

      So what you’re saying is that there’s no way to make it legal and it simply needs to be deleted entirely.

      I agree.

      • FaceDeer@fedia.io
        link
        fedilink
        arrow-up
        7
        arrow-down
        3
        ·
        2 days ago

        There’s no need to “make it legal”; things are legal by default until a law is passed to make them illegal, or until a court precedent establishes that an existing law applies to the new thing under discussion.

        Training an AI doesn’t involve copying the training data, the AI model doesn’t literally “contain” the stuff it’s trained on. So it’s not likely that existing copyright law makes it illegal to do without permission.

        • grue@lemmy.world
          link
          fedilink
          English
          arrow-up
          1
          ·
          1 day ago

          There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal.

          Yes, and that’s already happened: it’s called “copyright law.” You can’t mix things with incompatible licenses into a derivative work and pretend it’s okay.

        • xigoi@lemmy.sdf.org
          link
          fedilink
          English
          arrow-up
          2
          arrow-down
          2
          ·
          2 days ago

          By this logic, you can copy a copyrighted image as long as you decrease the resolution, because the new image does not contain all the information in the original one.

          • yetAnotherUser@discuss.tchncs.de
            link
            fedilink
            English
            arrow-up
            5
            arrow-down
            1
            ·
            2 days ago

            Am I allowed to take a copyrighted image, decrease its size to 1x1 pixels and publish it? What about 2x2?

            It’s very much not clear when a modification violates copyright because copyright is extremely vague to begin with.

            • grue@lemmy.world
              link
              fedilink
              English
              arrow-up
              3
              ·
              1 day ago

              Just because something is defined legally instead of technologically, that doesn’t make it vague. The modification violates copyright when the result is a derivative work; no more, no less.

              • yetAnotherUser@discuss.tchncs.de
                link
                fedilink
                English
                arrow-up
                2
                ·
                1 day ago

                What is a derivative work though? That’s again extremely vague and has been subject to countless lawsuits seeking to determine the bounds.

                • catloaf@lemm.ee
                  link
                  fedilink
                  English
                  arrow-up
                  3
                  ·
                  1 day ago

                  If your work depends on the original, such that it could not exist without it, it’s derivative.

                  I can easily create a pixel of any arbitrary color, so it’s sufficiently transformative that it’s considered a separate work.

                  The four fair use tests are pretty reliable in making a determination.

          • Voyajer@lemmy.world
            link
            fedilink
            English
            arrow-up
            3
            arrow-down
            1
            ·
            2 days ago

            More like reduce it to a handful of vectors that get merged with other vectors.

          • FaceDeer@fedia.io
            link
            fedilink
            arrow-up
            2
            arrow-down
            2
            ·
            2 days ago

            In the case of Stable Diffusion, they used 5 billion images to train a model 1.83 gigabytes in size. So if you reduce a copyrighted image to 3 bits (not bytes - bits), then yeah, I think you’re probably pretty safe.
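            The arithmetic behind that figure can be checked with a quick sketch. The image count and model size are the ones quoted above; treating the size as decimal gigabytes (10^9 bytes) is an assumption, and the function name is just for illustration.

```python
def bits_per_image(model_bytes: float, num_images: float) -> float:
    """Average number of model bits available per training image."""
    return model_bytes * 8 / num_images


# Figures quoted in the comment; decimal gigabytes are an assumption.
MODEL_BYTES = 1.83e9
NUM_IMAGES = 5e9

print(f"{bits_per_image(MODEL_BYTES, NUM_IMAGES):.2f} bits per image")
```

            This is only an average over the whole training set. It says nothing about how unevenly individual images might be represented in the weights.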

            • xigoi@lemmy.sdf.org
              link
              fedilink
              English
              arrow-up
              2
              ·
              1 day ago

              Your calculation is assuming that the input images are statistically independent, which is certainly not the case (otherwise the model would be useless for generating new images)

              • FaceDeer@fedia.io
                link
                fedilink
                arrow-up
                2
                ·
                1 day ago

                Of course it’s silly. Of course the images are not statistically independent, that’s the point. There are still people to this day who claim that stable diffusion and its ilk are producing “collages” of their training images, please tell this to them.

                The way that these models work is by learning patterns from their training material. They learn styles, shapes, meanings. None of those things are covered by copyright.

    • merc@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      9
      arrow-down
      2
      ·
      2 days ago

      It wouldn’t contain any public-domain data though. That’s the thing with LLMs: once they’re trained on data, the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn’t re-create your tax data on command, because that data is now gone. But if it’s seen enough private tax data, it could give something that looked a lot like a tax return to someone with an untrained eye. A tax accountant would easily see flaws in it.