Scraping BigData companies from websites

Scraping BigData companies from websites is not much different from scraping Google Scholar profiles. In this small example I want to show you how to scrape specific names on a webpage, whether or not they are tagged with a class. In the first example I wanted to scrape the names of companies on three pages of datamation.com. The pages only highlight the names by putting them in bold, which is done by wrapping the text in a <strong> tag in the HTML file. The solution was easy.

First I created a list of the three pages. Then I looped over each URL in the list, read out the HTML content and searched for <strong> tags. For each find, I then extracted only the text using the .text attribute. Here is the code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# The three pages of the Datamation article
urls = ["http://www.datamation.com/data-center/20-big-data-companies-leading-the-way-1.html",
        "http://www.datamation.com/data-center/20-big-data-companies-leading-the-way-2.html",
        "http://www.datamation.com/data-center/20-big-data-companies-leading-the-way-3.html"]

for url in urls:
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    # Company names are the only bolded text on these pages
    for tag in bsObj.findAll("strong"):
        print(tag.text)
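The same extraction can be tried offline before hitting live pages. Here is a small sketch on an invented HTML snippet (the snippet is made up for illustration; it just mimics the bold-name structure of the Datamation pages):

```python
from bs4 import BeautifulSoup

# Made-up snippet: company names highlighted with <strong>, as on the real pages
snippet = """
<p><strong>Cloudera</strong> offers a Hadoop distribution.</p>
<p><strong>Hortonworks</strong> is another well-known player.</p>
"""

soup = BeautifulSoup(snippet, "html.parser")
names = [tag.text for tag in soup.find_all("strong")]
print(names)  # ['Cloudera', 'Hortonworks']
```

This keeps the logic identical to the scraper above while making it easy to test without a network connection.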

This extracts the text inside the <strong> tags. Next I wanted to extract the companies from a blog post by Denny Britz, a fellow German. Here the situation was a bit different. The company names I was looking for sat inside <a> tags, which had the class "external". Using the findAll method I was able to extract only these elements and read out their text.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://blog.dennybritz.com/2015/10/13/deep-learning-startups-applications-and-acquisitions-a-summary/"
html = urlopen(url)
bsObj = BeautifulSoup(html, "html.parser")

# Only links with class="external" contain the company names
for link in bsObj.findAll("a", {"class": "external"}):
    print(link.text)
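As with the first scraper, the class-based lookup can be checked on an invented snippet first (the HTML below is made up; it only mimics the blog's markup, where company links carry class="external" and other links do not):

```python
from bs4 import BeautifulSoup

# Made-up snippet: only the company links have class="external"
snippet = """
<p><a class="external" href="http://example.com/a">DeepMind</a>
was acquired; see <a href="/internal">this post</a> and
<a class="external" href="http://example.com/b">MetaMind</a>.</p>
"""

soup = BeautifulSoup(snippet, "html.parser")
companies = [a.text for a in soup.find_all("a", {"class": "external"})]
print(companies)  # ['DeepMind', 'MetaMind']
```

The internal link is skipped because the class filter only matches tags that actually carry the "external" class.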

The result is a list of companies that all have something to do with BigData. Below is the list I got from the above Python scraping scripts plus some manual searching. If you like web scraping in Python, I recommend the book "Web Scraping with Python" by Ryan Mitchell.


  1. Twitter,
  2. Facebook,
  3. Google,
  4. Uber,
  5. Airbnb,
  6. Palantir,
  7. Neokami,
  8. Arcadia Data,
  9. Cazena,
  10. DataHero,
  11. DataTorrent,
  12. Enigma,
  13. Neokami,
  14. Altiscale,
  15. Amazon,
  16. AWS,
  17. Cloudera,
  18. Google,
  19. Hortonworks,
  20. IBM,
  21. Indyco,
  22. Informatica,
  23. Knime,
  24. MapR,
  25. Microsoft,
  26. Oracle,
  27. Qlik,
  28. RapidMiner,
  29. SAS Institute,
  30. SAP,
  31. Sinequa,
  32. Splunk,
  33. Tableau,
  34. Trillium Software,
  35. Dropbox,
  36. Lookflow,
  37. HyperVerge,
  38. Madbits,
  39. MetaMind,
  40. Nervana,
  41. Neon,
  42. Data Collective,
  43. Skymind,
  44. DeepMind,
  45. Deepomatic,
  46. Descartes Labs,
  47. Clarifai,
  48. Tractable,
  49. Affectiva,
  50. Alpaca,
  51. Labellio,
  52. Orbital Insight,
  53. AlchemyAPI,
  54. VocalIQ,
  55. Idibon,
  56. Indico,
  57. Semantria,
  58. Lexalytics,
  59. ParallelDots,
  60. Xyggy,
  61. Enlitic,
  62. Quantified Skin,
  63. Deep Genomics,
  64. StocksNeural,
  65. Analytical Flavor Systems,
  66. Artelnics.
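A few names come out of the scrapers more than once (Neokami and Google, for example, appear on both source sites). A minimal sketch for deduplicating the raw output while keeping the first-seen order (the short sample list here stands in for the full scraped output):

```python
# Deduplicate scraped names while preserving first-seen order.
# The sample list stands in for the full scraper output.
scraped = ["Twitter", "Google", "Neokami", "Google", "Neokami", "Cloudera"]
unique = list(dict.fromkeys(scraped))  # dicts preserve insertion order
print(unique)  # ['Twitter', 'Google', 'Neokami', 'Cloudera']
```

Using dict.fromkeys instead of set keeps the companies in the order they were scraped, which a plain set would not guarantee.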
