Amazon Is Investigating Perplexity Over Claims of Scraping Abuse

Sousa Brothers

Amazon Web Services (AWS) has launched an investigation into Perplexity AI, a $3 billion AI search startup, over allegations of scraping websites that have explicitly prohibited such actions using the Robots Exclusion Protocol.

This protocol, a long-standing web convention, involves placing a plaintext file named robots.txt on a domain to specify which pages automated crawlers should not access. While not legally binding, it is generally honored by well-behaved crawlers.
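To make the mechanism concrete, here is a minimal sketch of how a compliant crawler might consult such a file before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` and `OtherBot` user agents, the `example.com` URLs, and the helper names `build_parser` and `is_allowed` are all hypothetical and chosen for illustration; nothing here reflects how Perplexity's or any publisher's systems actually work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler from the whole site
# and blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A polite crawler checks before every request and skips disallowed URLs.
for agent, url in [
    ("ExampleBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/private/report"),
]:
    verdict = "allowed" if is_allowed(parser, agent, url) else "blocked"
    print(f"{agent} -> {url}: {verdict}")
```

Note that the check is entirely voluntary: the file only states the site owner's wishes, and nothing in the protocol stops a crawler that chooses not to run it.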

AWS requires its customers to adhere to the robots.txt standard when crawling websites, and it prohibits customers from using its services for illegal activities. The investigation was prompted by a Forbes report accusing Perplexity of stealing content and by subsequent WIRED investigations that uncovered evidence of scraping abuse and plagiarism associated with Perplexity’s AI-powered search chatbot.

Notably, Perplexity’s crawler was found accessing Condé Nast properties from an unpublished IP address, despite being blocked by a robots.txt file. The IP address was traced to an Elastic Compute Cloud (EC2) instance on AWS, raising concerns about unauthorized scraping.

Perplexity’s CEO, Aravind Srinivas, initially dismissed the concerns, attributing the scraping to a third-party company that provides web crawling services and declining to disclose further details because of a nondisclosure agreement. In response to Amazon’s inquiries, a Perplexity spokesperson stated that the company’s operations comply with the AWS Terms of Service and that its crawler respects robots.txt files.

However, the spokesperson acknowledged that PerplexityBot may bypass robots.txt when a user inputs a specific URL, a scenario described as uncommon but permitted by Perplexity’s system.

Digital Content Next, an industry trade association, expressed concerns over potential copyright violations by AI companies like Perplexity that may be disregarding terms of service and robots.txt directives.

Jason Kint, the association’s CEO, emphasized the importance of respecting publishers’ content rights and urged vigilance against improper practices within the AI industry.
