Scrape Google Scholar

Google Scholar is a useful service. It links every publication to its authors and gives easy access to the scientific output of each researcher. Two important key indicators are the number of citations and the H-index. This short Python script shows how to extract (scrape) these two parameters.
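For readers who have not met the H-index before: it is the largest number h such that h of a researcher's papers have at least h citations each. A minimal sketch of that definition follows. The function name and the citation counts are made up for illustration and are not part of the scraping script.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for five papers, for illustration only.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))  # prints 4
```

Google Scholar already computes this value for every profile, so the script below only has to read it from the page.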

[Figure: H-index versus number of citations, scraped from Google Scholar]

To scrape Google Scholar, we first load the libraries needed for this task and define a function that scrapes the H-index from a Google Scholar profile. The function takes the link to the profile and, if the page contains an H-index, returns it.

import urllib.request  # Python 3 replacement for urllib2

import scholarly


def gethindex(search_address):
    """Return the H-index shown on a Google Scholar profile page, or None."""
    response = urllib.request.urlopen(search_address)
    html = response.read().decode('utf-8', errors='replace')
    marker = 'h-index</a></td><td class="gsc_rsb_std">'
    if marker not in html:
        return None  # this page has no H-index
    # the first table cell after the "h-index" label holds the value (as text)
    hindex = html.split(marker)[-1].split('</td><td class="gsc_rsb_std">')[0]
    return hindex
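The string-splitting step can be tried offline, without contacting Google Scholar. The sketch below applies the same idea to a hand-written HTML fragment. The fragment and its numbers are hypothetical stand-ins for the statistics table of a profile page, and the real markup may differ or change over time. As a small variation, the value is cut at the closing `</td>` instead of at the start of the next cell.

```python
MARKER = 'h-index</a></td><td class="gsc_rsb_std">'
CELL_END = '</td>'


def extract_hindex(html):
    """Return the text of the first table cell after the h-index label, or None."""
    if MARKER not in html:
        return None
    return html.split(MARKER)[-1].split(CELL_END)[0]


# Hand-written sample row (hypothetical values), mimicking the layout that
# gethindex expects: label cell, then two value cells.
sample = (
    '<tr><td class="gsc_rsb_sc1"><a href="#">h-index</a></td>'
    '<td class="gsc_rsb_std">42</td>'
    '<td class="gsc_rsb_std">30</td></tr>'
)

print(extract_hindex(sample))             # prints 42
print(extract_hindex('<p>no table</p>'))  # prints None
```

The returned value is a string, which is why it has to be converted to an integer before plotting.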

Use Scholarly to scrape Google Scholar

In the next step we use the Python module scholarly. It has several features; the most important is that it can search the Google Scholar database for names and return their number of citations and the direct link to the Google Scholar profile. Hence, we give it a list of scientists in the field of nanopores and use it to get the number of citations and the link to each Google Scholar profile. This link is then fed to the previously defined function to return the H-index.

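authorlist = ['Ulrich F. Keyser', 'Aleksandra Radenovic', 'Mahendran Radhakrishnan',
              'Vivek Thacker', 'Silvia Hernandez-Ainsa', 'Nicholas Bell', 'Fernando Moreno Herrero',
              'Cees Dekker', 'Stefano Pagliara', 'Ralph MM Smeets', 'Tim Liedl', 'Nadanai Laohakunakorn',
              'Meni Wanunu', 'Mark Akeson', 'Tim Albrecht', 'Dario Anselmetti', 'Rashid Bashir',
              'Andreas B. Dahlin', 'Joshua B. Edel', 'Adam R. Hall', 'Diego Krapf', 'MinJun Kim',
              'Jeremy Lee', 'Derek Stein', 'Giovanni Maglia', 'Michael Mayer', 'Aleksandr Noy', 'mark platt',
              'Jacob Rosenstein', 'Friedrich C. Simmel', 'Vincent Tabard-Cossa', 'Gregory Timp',
              'Anton Zilman', 'Michael Zwolak', 'Lorenz J. Steinbock']

# Assumption: the base URL is missing from the original post. url_citations is
# taken to be a relative path, so the Google Scholar host is prepended here.
SCHOLAR_HOST = 'https://scholar.google.com'

# hindexlist[0]: names, hindexlist[1]: H-indices, hindexlist[2]: citation counts
hindexlist = [[], [], []]
for name in authorlist:
    # first search hit for this name (module-level API of older scholarly releases)
    author = next(scholarly.search_author(name))
    link = SCHOLAR_HOST + author.url_citations
    hindex = gethindex(link)
    if hindex is None:
        continue  # profile page without an H-index
    hindexlist[0].append(name)
    hindexlist[1].append(hindex)
    hindexlist[2].append(author.citedby)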
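The loop above follows the interface the script was written against. Newer scholarly releases are used differently, as far as their documentation describes it: the package exposes a `scholarly` object, search hits are dictionaries, and `fill` loads further sections of a profile, including the H-index itself. The sketch below is written under that assumption. `lookup_metrics` and `FakeClient` are hypothetical names; the fake client only stands in for the real object so the example runs offline with made-up numbers.

```python
def lookup_metrics(name, client):
    """Return (name, hindex, citedby) for the first search hit, or None.

    `client` is assumed to behave like the `scholarly` object of newer
    releases: `search_author(name)` yields author dicts and
    `fill(author, sections=[...])` returns the dict with more fields.
    """
    first_hit = next(client.search_author(name), None)
    if first_hit is None:
        return None  # no profile found for this name
    author = client.fill(first_hit, sections=['basics', 'indices'])
    return author.get('name', name), author.get('hindex'), author.get('citedby')


class FakeClient:
    """Offline stand-in for the scholarly object, with made-up numbers."""

    def search_author(self, name):
        yield {'name': name}

    def fill(self, author, sections=None):
        return dict(author, hindex=12, citedby=345)


print(lookup_metrics('Jane Doe', FakeClient()))  # prints ('Jane Doe', 12, 345)
```

With the package installed, the real object would be passed instead, for example `lookup_metrics('Cees Dekker', scholarly)` after `from scholarly import scholarly`. In that setup the separate HTML scraping function would no longer be needed.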

The researcher names, H-indices and citation counts are collected in one list, and we plot the two integer parameters against each other.

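import matplotlib.pyplot as plt

names = hindexlist[0]
y = list(map(int, hindexlist[1]))  # H-index, scraped as text
x = hindexlist[2]                  # number of citations

fig, ax = plt.subplots(figsize=(8, 5))
ax.set(xscale="log", yscale="log")
ax.set_xlabel("Number of citations", fontsize=12)
ax.set_ylabel("H-Index", fontsize=12)
ax.scatter(x, y)  # draw the points so the axes scale to the data
for i, txt in enumerate(names):
    ax.annotate(txt, (x[i], y[i]))
plt.show()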


The result is a plot with the number of citations on the X-axis and the H-index on the Y-axis. From this we can deduce that the H-index grows with an increasing number of citations. Publications analysing citation behaviour in more detail can be found here.
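The trend read off the plot can also be expressed as a single number. On log-log axes a straight line corresponds to a power law, so the slope of a least-squares fit to the logarithms estimates the exponent. The sketch below is one simple way to do this with the standard library only. The function name is hypothetical and the input points are synthetic, chosen to follow h = sqrt(citations) exactly; they are not scraped data.

```python
import math


def loglog_slope(citations, hindices):
    """Least-squares slope of log10(h-index) against log10(citations)."""
    xs = [math.log10(c) for c in citations]
    ys = [math.log10(h) for h in hindices]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic points that follow h = sqrt(citations) exactly; not scraped data.
synthetic_citations = [100, 400, 2500, 10000]
synthetic_hindices = [10, 20, 50, 100]
print(round(loglog_slope(synthetic_citations, synthetic_hindices), 3))  # prints 0.5
```

Applied to the scraped values, the call would be `loglog_slope(x, y)` with the lists used for the plot, provided every entry is greater than zero. Since h papers with at least h citations each imply at least h squared citations in total, the H-index can never exceed the square root of the citation count.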
