YouTube Scraping Using Python, Part 1: Overview and Installing Selenium

Hello guys! In this series, we will learn how to scrape YouTube. In this project, we will create a program that searches a list of categories such as Travel Blogs, Food, and Science & Technology, and scrapes the video ID, title, and description of each video.

By the end of this series, you should be able to scrape most of the publicly visible data on YouTube search and video pages.

SCRAPING A WEBSITE WITHOUT THE OWNER’S PERMISSION MAY VIOLATE ITS TERMS OF SERVICE AND CAN BE ILLEGAL. IN THIS TUTORIAL WE WILL SCRAPE ONLY A VERY TINY PORTION OF YOUTUBE, AROUND 40-50 VIDEOS PER CATEGORY, AND WILL USE IT FOR EDUCATIONAL PURPOSES ONLY.

YOUTUBE HAS ITS OWN API FOR THIS PURPOSE, SO USE THE YOUTUBE API FOR COMMERCIAL PURPOSES.

So let’s start.

There are basically two types of websites: those with finite scrolling (pagination) and those with infinite scrolling. The Google search results page is an example of finite scrolling: to go to the next page, you have to click on a page number.

The second type is websites like YouTube, which have infinite scrolling. When you search for a keyword, the results page has no page numbers; all you have to do is scroll down to load more videos.

Websites with finite scrolling, where you click on a number to go to the next page, are easier to scrape than websites with infinite scrolling. The page position is part of the URL, so a script can request several result pages directly, even at the same time.

For example, for the Google search page we can do something like this:

https://www.google.com/search?q=avengers&start=60
https://www.google.com/search?q=avengers&start=70
https://www.google.com/search?q=avengers&start=80

The links above fetch the search results for “avengers” at pages 7, 8, and 9, because start is the offset of the first result and each page shows 10 results by default. I think you get the pattern: by changing these parameters in a for loop, you can easily write a script that scrapes multiple pages of Google search results.
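As a minimal sketch of that idea, the snippet below only builds the URLs; it does not download anything. The function name google_search_urls and the constant RESULTS_PER_PAGE are made up for this example, and the value 10 is an assumption based on the links above.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: 10 results per page, as in the links above


def google_search_urls(query, first_page, last_page):
    """Return one search URL per page, from first_page to last_page (1-based)."""
    urls = []
    for page in range(first_page, last_page + 1):
        # Page 1 starts at result 0, page 2 at result 10, and so on.
        params = {"q": query, "start": (page - 1) * RESULTS_PER_PAGE}
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls


for url in google_search_urls("avengers", 7, 9):
    print(url)
```

Calling it with pages 7 to 9 reproduces the three links shown above.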

But there is nothing like this on YouTube. Its search pages use infinite scrolling, so we need a program that automatically scrolls down the search page to load more results and then returns the page source of the whole page.

To do this we use Selenium.

Selenium provides a WebDriver that can take control of your browser, open a YouTube search URL, scroll down automatically in front of you on your desktop or laptop screen, and then return the page source.
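To make this concrete, here is a rough sketch of the scroll-and-return idea. It is not necessarily the code used later in this series. The function names, max_scrolls, and pause are made up for illustration, and fetch_search_page assumes Selenium is installed and geckodriver can be found on your PATH.

```python
import time


def scroll_and_get_source(driver, max_scrolls=10, pause=2.0):
    """Scroll to the bottom repeatedly, then return the page source.

    `driver` is a Selenium WebDriver (or any object with the same
    `execute_script` method and `page_source` attribute).
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    for _ in range(max_scrolls):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    return driver.page_source


def fetch_search_page(query_url):
    """Open Firefox, scroll the results page, and return its HTML."""
    # Imported here so the helper above can be used without Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(query_url)
        return scroll_and_get_source(driver)
    finally:
        driver.quit()  # always close the browser window
```

The loop stops either after max_scrolls rounds or as soon as the page height stops growing, which is a simple way to guess that no more results are being loaded.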

Our approach will be to use Selenium to collect the required video IDs and then scrape the videos one by one, because the description and other details can only be extracted from the video page itself. That second step can be done with traditional scraping techniques.
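The first half of this approach, getting video IDs out of a page source, could look roughly like the sketch below. It uses a regular expression rather than Beautiful Soup, purely to keep the example short. The sample HTML and both IDs are made up. It also assumes that IDs appear after "watch?v=" and are 11 characters long; the real YouTube markup may differ.

```python
import re

# Assumption: an ID is 11 characters made of letters, digits, "-" and "_".
VIDEO_ID_PATTERN = re.compile(r"watch\?v=([A-Za-z0-9_-]{11})")


def extract_video_ids(page_source):
    """Return the unique video IDs in page_source, in first-seen order."""
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(VIDEO_ID_PATTERN.findall(page_source)))


def video_url(video_id):
    """Build the URL of the video page that will be scraped next."""
    return "https://www.youtube.com/watch?v=" + video_id


# Made-up page source with one repeated link.
sample = (
    '<a href="/watch?v=abcdefghijk">First</a>'
    '<a href="/watch?v=ABCDEFGH_-1">Second</a>'
    '<a href="/watch?v=abcdefghijk">First again</a>'
)
for video_id in extract_video_ids(sample):
    print(video_url(video_id))
```

Each printed URL is a video page that the second step would then download and parse for the title and description.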

I am using Mozilla Firefox for this purpose, but you can use any web browser that is supported by Selenium. Some of the supported browsers are Chrome, Opera, Safari, and Microsoft Edge.

To install Firefox, download it from the Mozilla website for your operating system.

I am using Firefox because I am using Chrome for Jupyter and I don’t want Selenium to interfere with it.

Now let’s install Selenium. The installation takes two steps.

Step 1: Use pip to install Selenium

pip install selenium

Step 2: Download the web driver for Selenium

To download the web driver for Firefox (geckodriver), go to the geckodriver releases page on GitHub and download the build for your OS.

If you are planning to use another browser, go to the Selenium downloads page and click on the browser you want to use. The page for its driver will open (for some browsers this is a Git repository); download the web driver for your OS from there.
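Once the driver is downloaded, it can help to check that Python can actually find and run it before starting Selenium. The helper below is only a sketch: driver_status is a made-up name, and the default name "geckodriver" assumes you are using Firefox. On Windows the executable check is less informative, so treat the result as a hint.

```python
import os
import shutil


def driver_status(path=None, name="geckodriver"):
    """Return "ok", "missing", "not a file", or "not executable".

    If `path` is None, look for `name` on the system PATH instead.
    """
    if path is None:
        path = shutil.which(name)  # returns None when nothing is found
        if path is None:
            return "missing"
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a file"  # for example, the download folder itself
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


# Check whether geckodriver can be found on the PATH.
print(driver_status())
```

You can also pass the full path of the downloaded file, for example driver_status("/path/to/geckodriver"), where the path is a placeholder for wherever you saved it.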

Now we are ready to rumble.

In the next part, we will scrape the video IDs with the help of Selenium and Beautiful Soup. If you have any doubts, suggestions, or concerns, please comment below.

Thanks for reading 😀

2 thoughts on “Youtube Scraping using python Part 1: Overview and installing Selenium”

  1. I opened YouTube in Firefox and ran the code given above in a Jupyter notebook in Chrome. I did the required downloads and installation and changed the driver path, but I am still getting an error like this:
    WebDriverException: Message: 'geckodriver-v0.24.0-win64' executable may have wrong permissions.
    Please tell me how to fix it. I have installed the web driver and ran it as administrator, which opens a command prompt.

    1. I think the path of the driver is not set correctly.

      The error is due to this line:
      driver = webdriver.Firefox(executable_path=r'/Users/pushkarsingh/Downloads/geckodriver')

      Check whether the path is set correctly.
