  • Cloudflare has caught Perplexity scraping websites that explicitly block AI crawlers.
  • Perplexity’s AI crawlers concealed their identity and even used undisclosed IP addresses.
  • The AI startup was caught doing so across tens of thousands of domains, making millions of requests per day.

Cloudflare has caught Perplexity red-handed scraping websites that do not want to be scraped by AI crawlers. Typically, AI answer engines like Perplexity or ChatGPT crawl websites across the internet and extract text, images, and other content to generate answers, often without obtaining permission.

Cloudflare has now published its research, claiming that Perplexity uses dubious tactics to circumvent restrictions, concealing its identity to scrape websites that have explicitly opted out.

Cloudflare CEO Matthew Prince blasted Perplexity on X, writing, “Some supposedly ‘reputable’ AI companies act more like North Korean hackers. Time to name, shame, and hard block them.”

Scraping of this kind hurts site traffic, which is why some websites have started using the ‘robots.txt’ file to curb AI’s free lunch. This file tells crawlers which pages a site wants crawled and which it doesn’t. But according to Cloudflare’s report, Perplexity seems to be completely disregarding the robots.txt standard.
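To make the robots.txt mechanism concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the `is_allowed` helper are illustrative assumptions, not taken from Cloudflare's report. The sketch also shows why the file only works when a crawler identifies itself honestly: the answer depends entirely on the user agent name the crawler chooses to present.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: block one named AI crawler,
# allow everyone else. Real sites will have their own rules.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    # A crawler that announces itself under the blocked name is told to stay out.
    print(is_allowed("PerplexityBot", page))
    # The same request under a generic browser-style name falls under the "*" rule.
    print(is_allowed("Mozilla/5.0", page))
```

Nothing in robots.txt enforces these answers. A well-behaved crawler checks the file and honors the result, while a crawler that presents a different name is simply matched against a different rule.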

How Perplexity Pulled Off the Grand Theft Data

Cloudflare published the report after it received several complaints from customers who claimed that Perplexity still had access to their websites’ content, despite having set restrictions in the robots.txt file and created Web Application Firewall (WAF) rules to prevent AI bots from scraping data.

In response to the complaints, Cloudflare created test domains with similar restrictions to observe Perplexity’s behavior. It found that Perplexity initially attempts to access sites using its declared crawlers, i.e., “PerplexityBot” or “Perplexity-User.” However, if the crawler encounters restrictions, it switches its user agent, the identifier that tells a website what kind of browser and device is making the request.

In Perplexity’s case, it masked itself as a Chrome browser on macOS. Moreover, Perplexity used “rotating” IP addresses that the company does not mention on its list of IP addresses used by its bots. Cloudflare’s report also mentions that Perplexity switches between autonomous system numbers (ASNs), which are unique identifiers used to distinguish large networks.
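The following Python sketch illustrates why a simple filter based on user agent and published IP ranges would miss this kind of traffic. It is not Cloudflare's detection method. The `classify` function, its labels, the IP range (a reserved documentation block, not Perplexity's real list), and the sample Chrome user agent string are all assumptions made for illustration.

```python
import ipaddress

# Crawler names mentioned in Cloudflare's report.
DECLARED_AGENTS = ("PerplexityBot", "Perplexity-User")

# Hypothetical "published" range for illustration only. 192.0.2.0/24 is a
# block reserved for documentation, not an address range used by any company.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify(user_agent: str, ip: str) -> str:
    """Label a request using only its declared user agent and source IP."""
    declared = any(name.lower() in user_agent.lower() for name in DECLARED_AGENTS)
    address = ipaddress.ip_address(ip)
    in_published = any(address in network for network in PUBLISHED_RANGES)

    if declared and in_published:
        return "declared-crawler"
    if declared:
        return "declared-agent-unlisted-ip"
    # A browser-like user agent from an unlisted IP looks like any other visitor.
    return "unidentified"


if __name__ == "__main__":
    # Illustrative Chrome-on-macOS style string, not copied from the report.
    chrome_on_mac = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    print(classify("PerplexityBot/1.0", "192.0.2.10"))
    print(classify(chrome_on_mac, "203.0.113.7"))
```

A request that presents a browser-style user agent from an address outside the published list comes back as "unidentified", so rules keyed on names and IP lists never fire. This is why identifying such traffic requires additional signals beyond what the request claims about itself.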

Cloudflare mentions in its post, “This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.”

Not Perplexity’s First Rodeo

Perplexity was caught doing the same thing in June last year, ignoring paywalls and robots.txt files on websites. Back then, the company’s CEO, Aravind Srinivas, blamed it all on third-party crawlers the company was relying on. But now, the situation is different, and the blame falls squarely on Perplexity itself.

It is also worth pointing out that Apple has been interested in buying Perplexity and was reportedly in early talks. However, following this report, the Cupertino giant may now reconsider its decision.


With over 4 years of experience under my belt, I cover all facets of consumer tech, from smartphones to other consumer electronics, our favorite social media apps, as well as the growing realm of AI and LLMs. As an Apps and AI writer at Beebom, I provide my expertise in all these areas, weaving stories that help you get familiar with the tech around you. In my free time, you will find me playing NYT daily puzzles.
