Creating a deep learning dataset using Google Images API

Alright! As the title says, this post is aimed at machine/deep learning students and practitioners who would like to build their own toy dataset using the Google Images API.

Why Google Image API?
Whilst there are many other image APIs available, Google stands out for a few reasons:

  • It won’t kick us out: I tried the (trial) Bing Image Search (BIS) API for a while, since fast.ai asked us to use that service in Chapter 2. Halfway through the course, I wanted to update the dataset, but it wouldn’t work anymore because the trial period had expired. I then had to look for a different service, and settled on Google.
  • Better search engine: The obvious one: Google is a better search engine. The BIS API does the job very well too. The Unsplash API gave the worst results (probably because that API is built for something other than deep learning).
  • More requests: The Google Image API allows more “quality” requests (10000 to unlimited requests per day) than the Unsplash API (50 per hour on the demo version, and with poor results) and the BIS API (? trial) before the limit is hit. So we get slightly bigger datasets.
  • More query parameters: Google (and the BIS API) offer a lot of search customization. For example, results can be filtered to “face only” or “grayscale” images. More on that here. Queries generally follow the REST API style and return search results in JSON format.
  • It does not require a card to set up.

Setup
Setting up the API is pretty straightforward. What we are creating is what Google calls a “Programmable Search Engine”. It only requires you to activate a developer profile (which takes only a few clicks if you already have a Gmail account).

  • Start by going here.
  • In the “sites to search” field, type images.google.com or a website you would like to search.
  • Give a name for the search engine, and click Create.
  • Click on the Control Panel and store the Search engine ID. (For later: this is the cx parameter used by the API.)
  • On the same page, turn on Image Search and turn on Search the entire web.
  • Scroll to Programmatic Access and click on Get Started next to Custom Search JSON API. This will redirect you to a new page, where you click on Get a Key. (For later: this is the key parameter used by the API. Keep this secret.)

P.S. It is also possible to get unlimited requests if you tell your custom search engine which sites you would like to search (fewer than 10), which is a fair deal.

That’s it. Now we move on to creating the query URL.

Creating the URL

The REST URL we want is pretty basic: a base address followed by several key and value pairs. For example, a simple Google search for geoffrey hinton is formed by the URL https://www.google.com/search?q=geoffrey+hinton

We are only trying to mimic this behavior with a function that builds the URL for the RESTful API. When multiple key value pairs are passed, the function simply joins them with '&', like Google does. For example, if the search term is geoffrey hinton, we should pass a dictionary containing {'q': 'geoffrey hinton'}. For our use case, we also have to pass {'searchType': 'image'} to obtain image-only results, in addition to the key and cx parameters we obtained during Setup. Several other parameters can also be passed to fine-tune the results. Note that for unlimited queries, the variable url in the function has to be changed to https://www.googleapis.com/customsearch/v1/siterestrict? (more on that here).
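
The URL-building step described above could look roughly like the following minimal sketch. It uses urlencode from Python's standard library to join the key value pairs. The names build_url and BASE_URL, and the placeholder values for key and cx, are assumptions made for this illustration and are not necessarily what the original function looked like.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API. For the site-restricted variant,
# the post says to use .../customsearch/v1/siterestrict? instead.
BASE_URL = "https://www.googleapis.com/customsearch/v1?"


def build_url(params, base_url=BASE_URL):
    """Join the key/value pairs with '&' and append them to the endpoint.

    urlencode also escapes the values, e.g. a space becomes '+'.
    """
    return base_url + urlencode(params)


params = {
    "key": "YOUR_API_KEY",     # placeholder: the key obtained during Setup
    "cx": "YOUR_ENGINE_ID",    # placeholder: the Search engine ID from Setup
    "q": "geoffrey hinton",    # the search term
    "searchType": "image",     # ask for image-only results
    # "imgType": "face",       # optional filter, see the API docs for values
}

url = build_url(params)
print(url)
```

Printing url shows the search term encoded as q=geoffrey+hinton, the same way it appears in the Google search URL earlier in this section.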

Obtaining Image URLs

The function that fetches the results is our “search engine”, so to speak. ims is a list holding all the image URLs for the query cats. The outer loop in the function tries to read 10 pages of results (one page has 10 results). The inner loop then gets the URL for each result and stores it in a list, which is returned at the end. The URLs (ims) can then be passed to, for instance, the fast.ai helper function download_url() to download and store the images in a preferred path.
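
The two loops described above can be sketched as follows. This is an illustration under assumptions, not the original code: the names get_image_urls and fetch_json are made up for this example, the start parameter is used to step through the pages, and each result's image URL is read from the link field of the entries in items. The fetch argument exists only so the function can be tried without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.googleapis.com/customsearch/v1?"


def fetch_json(url):
    """Request the URL and parse the JSON body of the response."""
    with urlopen(url) as response:
        return json.load(response)


def get_image_urls(query, key, cx, pages=10, fetch=fetch_json):
    """Collect image URLs for a query, reading up to `pages` pages of 10."""
    ims = []
    for page in range(pages):
        params = {
            "key": key,
            "cx": cx,
            "q": query,
            "searchType": "image",
            "start": page * 10 + 1,  # 1-based index of the first result
        }
        result = fetch(BASE_URL + urlencode(params))
        items = result.get("items", [])
        if not items:
            break  # no more results for this query
        for item in items:
            ims.append(item["link"])  # URL of the image itself
    return ims
```

With a real key and cx, calling get_image_urls("cats", key, cx) would return the list ims used in the text. Stopping early when a page comes back empty is a design choice of this sketch, so that short result sets do not waste requests.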

Hope this was helpful, happy training!