Indexed by Google: Pages Checker in Python

To check whether a webpage has been indexed by Google, you can manually search for site:<webpage> in Google Search, for example: site:https://www.shellhacks.com/indexed-by-google-pages-checker-on-python.

If the page is indexed, its URL appears in the search results. Otherwise, Google shows the following message:

Your search – site:https://www.shellhacks.com/indexed-by-google-pages-checker-on-python – did not match any documents.
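
The same search can also be expressed as a URL, which is what a script would request. Below is a minimal sketch that builds it with the standard library only. The function name build_site_query_url is hypothetical (it is not part of any Google API), and the hl=en parameter is assumed to request the English interface, so that the message above is returned in English:

```python
from urllib.parse import urlencode


def build_site_query_url(page_url, lang="en"):
    """Return a Google Search URL for a `site:` query on page_url."""
    # urlencode() percent-encodes ":" and "/" so the page URL
    # cannot break the query string
    query = urlencode({"q": "site:" + page_url, "hl": lang})
    return "https://www.google.com/search?" + query


print(build_site_query_url(
    "https://www.shellhacks.com/indexed-by-google-pages-checker-on-python"
))
```

Encoding the query this way matters mostly for pages whose URLs contain characters such as "&" or "#", which would otherwise be interpreted as part of the search URL itself.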

If you need to check multiple pages, it is much more efficient to automate the check with a Python script.

This note shows a minimal Python script for checking whether a page is indexed by Google, which you can use as a base and extend to suit your needs.

Google Indexed Pages Checker in Python

Create a project directory, create and activate a virtual environment, and install the Python modules required for querying Google Search and parsing the results:

$ mkdir google-indexed-pages-checker
$ cd google-indexed-pages-checker
$ python3 -m venv venv
$ . venv/bin/activate
(venv) $ pip install requests beautifulsoup4

Here is the code of a minimal Python script that asks for a page's URL and checks whether it has been indexed by Google:

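import re
import requests
from bs4 import BeautifulSoup

print("Enter a page to check in Google index:")
url = input().strip()

# Let requests URL-encode the "site:" query instead of concatenating strings
params = {"q": "site:" + url, "hl": "en"}
# The CONSENT cookie is intended to bypass Google's cookie consent page
response = requests.get("https://www.google.com/search", params=params,
                        cookies={"CONSENT": "YES+1"}, timeout=10)
# Stop on HTTP errors (e.g. 429), so that a blocked request
# is not reported as an indexed page
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
# The message Google shows when the query returns no results
not_indexed = re.compile("did not match any documents")

if soup.find(string=not_indexed):
    print("This page is NOT indexed by Google.")
else:
    print("This page is indexed by Google.")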

Save this code to a file called check-indexed-pages.py and run it as follows:

(venv) $ python check-indexed-pages.py
- sample output -
Enter a page to check in Google index:
https://www.shellhacks.com/indexed-by-google-pages-checker-on-python/
This page is NOT indexed by Google.

(venv) $ python check-indexed-pages.py
- sample output -
Enter a page to check in Google index:
https://www.shellhacks.com/pip-show-python-package-dependencies/
This page is indexed by Google.

This basic Python script for checking the pages indexed by Google can be improved by taking the list of pages from a file, e.g. sitemap.xml.
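
As a rough sketch of that improvement, the page URLs can be extracted from a sitemap with the standard library. The example assumes a sitemap that follows the sitemaps.org 0.9 schema (a urlset of url/loc elements). The function name urls_from_sitemap and the example.com URLs are hypothetical:

```python
import xml.etree.ElementTree as ET

# XML namespace used by sitemaps that follow the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


# Inline sample, used here instead of a real sitemap.xml file
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/first-page/</loc></url>"
    "<url><loc>https://example.com/second-page/</loc></url>"
    "</urlset>"
)

for page in urls_from_sitemap(sample):
    print(page)
```

Each URL returned this way can then be passed to the check from the script above instead of being typed in by hand.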

To avoid being banned by Google for scraping its search result pages, you can try sending different “User-Agent” HTTP headers, using proxies, setting random delays between queries, etc.
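
Two of these ideas, rotating the “User-Agent” header and pausing for a random time between queries, can be sketched as follows. This is only an illustration and no guarantee against being blocked. The names random_headers and check_all, the delay values and the User-Agent strings are hypothetical, and check stands for whatever function performs the actual query:

```python
import random
import time

# Hypothetical placeholder values; substitute real browser User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def random_headers():
    """Pick a User-Agent at random for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def check_all(urls, check, min_delay=5.0, max_delay=15.0, sleep=time.sleep):
    """Call check(url, headers) for each URL with a random pause in between.

    Returns a dict that maps each URL to the result of check().
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait a random number of seconds before every query but the first
            sleep(random.uniform(min_delay, max_delay))
        results[url] = check(url, random_headers())
    return results
```

With requests, the check function could pass the headers on as requests.get(..., headers=headers). Proxies are not covered by this sketch.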
