Internal Google Documents Regarding Search Ranking Appear Publicly, Igniting A Frenzy Over SEO
- Technology
- May 29, 2024
A treasure trove of documents that appear to explain how Google ranks search results has surfaced online, most likely the result of an internal bot's inadvertent disclosure.
The released documentation details an older version of Google's Content Warehouse API and offers an inside look at Google Search.
Around March 13, the internet behemoth's automated tooling appears to have unintentionally pushed the content to a publicly accessible Google repository on GitHub. The commit was automatically signed with an Apache 2.0 open source license, as is the norm for Google's public documentation. A follow-up commit on May 7 attempted to undo the leak.
By then, however, Erfan Azimi, CEO of search engine optimization (SEO) firm EA Digital Eagle, had noticed the content. Fellow SEO experts Rand Fishkin, CEO of SparkToro, and Michael King, CEO of iPullRank, publicized it on Sunday.
The leaked documentation includes numerous references to internal systems and projects, but contains no source code or the like. Instead, it explains how to use Google's Content Warehouse API, which is probably intended for internal use only. Although there is a publicly available Google Cloud API with a similar name, what landed on GitHub appears to go well beyond it.
The files are notable for what they disclose about the factors Google takes into account when ranking web pages for relevance. That information is of ongoing interest to anyone working in the SEO industry or running a website in the hope that Google will drive traffic to it.
The 2,500+ pages of documentation, compiled for easier reading here, describe almost 14,000 attributes accessible through or associated with the API. There is far less information, however, about how these signals are used and how much they matter, which makes it hard to determine the weight Google gives any particular factor when ordering search results.
Even so, because the documents deviate from public claims made by Google executives, SEO consultants believe they contain noteworthy facts.
“Many of [Azimi’s] claims [in an email describing the leak] directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more,” explained SparkToro’s Fishkin in a report.
In his post on the documents, iPullRank’s King cited a statement made by John Mueller, a Google search advocate, who claimed in a video that “we don’t have anything like a website authority score”—a metric that indicates whether Google thinks a site is authoritative and, thus, deserving of higher rankings for search results.
However, King points out that the documents show a "siteAuthority" score can be computed as part of the Compressed Quality Signals Google stores for documents.
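To make that concrete, here is a minimal sketch in Python of how a per-document record carrying such a score might look. Only the siteAuthority name comes from the leak; the structure, types, and comments below are illustrative assumptions, since the actual documentation describes API attributes rather than code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompressedQualitySignals:
    """Illustrative stand-in for the leaked Compressed Quality Signals module.

    Only `site_authority` mirrors a name (siteAuthority) from the leaked
    docs; the rest of this layout is a hypothetical sketch, not Google's schema.
    """
    site_authority: Optional[int] = None  # site-level authority score named in the leak
    # ... the real module lists many more attributes, omitted here ...

# A document's stored signals might then be read like this (values invented):
signals = CompressedQualitySignals(site_authority=73)
if signals.site_authority is not None:
    print(f"siteAuthority: {signals.site_authority}")
```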
Fishkin and King point to several other revelations. One is the significance of clicks, and of different kinds of clicks (long, bad, and so on), in determining a webpage's ranking. In the US v. Google antitrust trial, Google admitted [PDF] that click metrics are taken into account when determining a website's ranking.
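As a rough illustration of what "different kinds of clicks" could mean in practice, the sketch below models a per-result tally that separates long clicks (the user stayed on the page) from bad clicks (the user bounced straight back). The field names and the ratio heuristic are assumptions for illustration, not attribute names confirmed from the leak.

```python
from dataclasses import dataclass

@dataclass
class ClickSignals:
    """Hypothetical per-result click tallies; field names are assumptions."""
    total_clicks: int = 0
    long_clicks: int = 0   # user dwelled on the page after clicking
    bad_clicks: int = 0    # user quickly returned to the results page

    def long_click_ratio(self) -> float:
        """Toy quality heuristic: the share of clicks that were 'long'."""
        return self.long_clicks / self.total_clicks if self.total_clicks else 0.0

# Example: 100 clicks, 60 long, 25 bad -> ratio 0.6 (values invented)
print(ClickSignals(total_clicks=100, long_clicks=60, bad_clicks=25).long_click_ratio())
```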
Another is that Google uses views of websites from Chrome as a quality signal, indicated in the API by the ChromeInTotal field. "One of the modules related to page quality scores features a site-level measure of views from Chrome," said King.
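Extending the same hypothetical sketch, a "site-level measure of views from Chrome" could be represented as a simple aggregate over a site's pages. Again, only the ChromeInTotal name comes from the leak; the input format and the plain summation below are guesses for illustration.

```python
# Hypothetical: sum per-page Chrome view counts into one site-level figure,
# echoing the ChromeInTotal field King describes. The aggregation rule is
# an assumption, not something the leaked docs spell out.
def chrome_in_total(page_views: dict[str, int]) -> int:
    """Aggregate per-URL Chrome view counts into a site-level total."""
    return sum(page_views.values())

views = {"/": 1200, "/pricing": 340, "/blog/post-1": 95}  # invented numbers
print(chrome_in_total(views))  # 1635
```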
The docs also reveal that Google takes other elements into account, such as authorship, the freshness of content, whether a page is relevant to the core focus of its website, the match between a page's title and its text, and "the average weighted font size of a term in the doc body."
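That last quoted attribute hints at a straightforward computation: weight a term's occurrences by the font size at which each is rendered, then average. The sketch below is one plausible reading of the phrase, not the leaked definition.

```python
def avg_weighted_font_size(term: str, body: list[tuple[str, float]]) -> float:
    """Average font size across occurrences of `term` in the doc body.

    `body` is a list of (word, font_size_px) pairs. This simple averaging
    rule is a guess at what the leaked attribute measures, not a confirmed
    formula.
    """
    sizes = [size for word, size in body if word.lower() == term.lower()]
    return sum(sizes) / len(sizes) if sizes else 0.0

# Example: "python" appears at 18px in a heading and at 12px in body text.
body = [("Learn", 18.0), ("Python", 18.0), ("with", 12.0), ("python", 12.0)]
print(avg_weighted_font_size("python", body))  # 15.0
```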