Sites scramble to block ChatGPT web crawler after instructions emerge

UngodlyAudrey🏳️‍⚧️@beehaw.org · 1 year ago

dbilitated@aussie.zone · 1 year ago

I’d rather like it if they train it on stuff I say. I want the AI of tomorrow to reflect my thoughts.

seriously I would much prefer gold tier journalism and news sites let it crawl so when people use it to make choices in the future they’re guided to better choices.

it is honestly so hard to know what will happen though, it’s so complicated it’s virtually guaranteed we’re not correctly anticipating the consequences of any of this. I’m not really even talking about the AI, I’m talking about the effects on society which are a lot more complex.

Carion@lemmy.antemeridiem.xyz · 1 year ago

It’s just about the money really, they want their cut of the AI money cake.

raccoona_nongrata@beehaw.org · edit-2 1 year ago

deleted by creator

breaks.lol@lemmy.studio · 1 year ago

But for large website operators, the choice to block large language model (LLM) crawlers isn’t as easy as it may seem. Making some LLMs blind to certain website data will leave gaps of knowledge that could serve some sites very well (such as sites that don’t want to lose visitors if ChatGPT supplies their information for them), but it may also hurt others. For example, blocking content from future AI models could decrease a site’s or a brand’s cultural footprint if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business declaring that it didn’t want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.

Really curious how this will end up

axibzllmbo@beehaw.org · 1 year ago

That’s an interesting point that I hadn’t considered. The comparison to Google indexing in the early 2000s may prove to be very apt, given the number of people I’ve seen using ChatGPT as a search engine.

Tibert@compuverse.uk · 1 year ago

Like it is useful… OpenAI already got all the useful info out of the websites.

Tho maybe for the sites generating new content it may have a use. But all the content before that is already lost to chatgpt.

FaceDeer@kbin.social · 1 year ago

“Lost to ChatGPT” is a weird way of putting it. The content is still there, nothing’s happened to it.

m-p{3}@lemmy.ca · 1 year ago

Lemmy.ca added a block at the nginx level for it

https://lemmy.ca/comment/1999439

# curl -H 'User-agent: GPTBot' https://lemmy.ca/ -i
HTTP/2 403
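
For anyone wondering what a block like this amounts to, here is a minimal Python sketch of the idea: look at the User-Agent header and answer 403 when it matches a blocked crawler. This is not lemmy.ca's actual nginx config (that isn't shown here); the function name, the token list, and the case-insensitive substring match are all assumptions for illustration.

```python
# Hypothetical sketch of user-agent blocking; NOT lemmy.ca's real nginx rule.
BLOCKED_AGENT_TOKENS = ("gptbot",)  # assumed token list, compared in lowercase


def status_for_user_agent(user_agent: str) -> int:
    """Return 403 for a blocked crawler's User-Agent, 200 for anything else."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for_user_agent("GPTBot"))
    print(status_for_user_agent("Mozilla/5.0 (X11; Linux x86_64)"))
```

A check like this only stops crawlers that announce themselves honestly in the header, which is the same limitation the curl test above relies on.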

The Bard in Green@lemmy.starlightkel.xyz · 1 year ago

Hilariously, unless ALL lemmy instances do this, anyone that federates with you will have to block it too or any communities they sync with you will be available on their instances…

m-p{3}@lemmy.ca · 1 year ago

I know but at this point you do what you can.

ashtrix@lemmy.ca · 1 year ago

Yeah, it’s already too late. Why didn’t they provide this before they already scraped websites?

P03 Locke@lemmy.dbzer0.com · 1 year ago

You think Google thought about robots.txt before they developed their search engine? Nah, it’s all public Internet, and they scraped away.

A non-zero percentage of web sites will bother to follow these instructions, but it might as well be zero.
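
For reference, the instructions come down to a two-line robots.txt rule for the `GPTBot` user agent. Here is a rough sketch of how a well-behaved crawler would read it, using Python's standard `urllib.robotparser`. The `is_allowed` helper and the example.com URL are made up for illustration, and nothing here forces a crawler to actually obey the file.

```python
from urllib.robotparser import RobotFileParser

# robots.txt rule that disallows GPTBot site-wide; other agents are unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: would a compliant crawler fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

The check runs entirely on the crawler's side, so it is a request rather than an enforcement mechanism.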

Scrubbles@poptalk.scrubbles.tech · 1 year ago

Yeah, I always assumed robots.txt only told them to hide it from search results, but Google still scrapes everything they can from you. It just gives the illusion that they skipped over you.

On@kbin.social · 1 year ago

Is it possible that they offloaded the scraping to a different company to avoid direct litigation now that they’re out in the open? To say “we didn’t scrape your website, and you can’t prove it.”

Like how DDG, Ecosia, and Qwant use Bing for their data, or how feds buy data from data brokers. Outsource the dirty job like every tech company does and shift the blame if caught doing something unlawful.

It seems they are trying to garner some positive PR after they scraped through everything without anyone noticing.

abhibeckert@beehaw.org · 1 year ago

I’d bet sites blocking ChatGPT will regret it when (not if) Bing starts using it for search engine relevance.

acastcandream@beehaw.org · 1 year ago

Just because you block the GPT crawler doesn’t mean you are no longer indexed.

renard_roux@beehaw.org · 1 year ago

Serious question — you think any amount of AI will make people use Bing? 🤔

Sproux@lemmy.dbzer0.com · edit-2 1 year ago

I started using it this year because it’s actually been giving me decent results, unlike Google. We’re…in a dark timeline

renard_roux@beehaw.org · edit-2 1 year ago

Hmm. I have an incredibly strong aversion to everything Microsoft, so even giving Bing a chance is difficult. However, I must admit that I can recognize the part about Google not delivering. I even went so far as to tamper with the CSS recently just to make Google’s results slightly easier to parse.

Maybe it’s time to try something new 🤔 I just wish the only viable alternative wasn’t made by Microsoft 😓🤢

Dark timeline, indeed! 😔

fckgwrhqq2yxrkt@beehaw.org · 10 months ago

Check out Kagi, paid search is extremely worth it. Stop being a product to sell and start being a customer.