Hello guys! In this series, we will learn how to scrape YouTube. In this project, we will build a program that can search a list of categories, like Travel Blogs, Food, and Science & Technology, and scrape the video IDs, titles, and descriptions of the videos.
By the end of this tutorial, you will be able to scrape almost anything from YouTube.
SCRAPING A WEBSITE WITHOUT THE OWNER'S PERMISSION MAY BE ILLEGAL. IN THIS TUTORIAL WE WILL SCRAPE ONLY A VERY TINY PORTION OF YOUTUBE (AROUND 40-50 VIDEOS PER CATEGORY) AND USE IT FOR EDUCATIONAL PURPOSES ONLY.
YOUTUBE HAS AN OFFICIAL API FOR THIS, SO USE THE YOUTUBE API FOR ANY COMMERCIAL PURPOSE.
So let’s start.
There are basically two types of websites: those with finite scrolling and those with infinite scrolling. Finite scrolling works like the Google search page: if you want to see more results, you click on a page number and the next page loads. The second type loads more content automatically as you keep scrolling down.
Websites with finite scrolling, where we click on page numbers, are easier to scrape than websites with infinite scrolling, because for paginated sites we can write a script that loads multiple pages at the same time, each with a different page number.
For example, on the Google search page the page number is encoded right in the URL, so the URLs that fetch the search results for "Avengers" on pages 7, 8, and 9 differ only in one parameter. Once you spot the pattern, you can toggle that parameter, and with a for loop you can easily write a script to scrape multiple pages of Google search results.
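As a sketch of the idea: Google's search URL takes the query in the `q` parameter and a result offset in the `start` parameter (10 results per page, so page 7 starts at result 60). A small helper can generate the URLs for any range of pages:

```python
from urllib.parse import urlencode

def google_search_url(query, page, results_per_page=10):
    # Google's "start" parameter is a result offset: page 7 starts at result 60.
    params = {"q": query, "start": (page - 1) * results_per_page}
    return "https://www.google.com/search?" + urlencode(params)

# URLs for pages 7, 8, and 9 of a search for "Avengers":
for page in (7, 8, 9):
    print(google_search_url("Avengers", page))
```

With a list of URLs like this, a simple loop (or a pool of parallel requests) can fetch every page of a paginated site.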
But there is nothing like this on YouTube. YouTube search results use infinite scrolling, so we have to write a program that automatically scrolls down the search page to load more results and then returns the page source of the whole page.
To do this we use Selenium.
Selenium has a WebDriver which can take control of your browser: it opens a YouTube query link, scrolls down automatically right on your desktop/laptop screen, and then returns the page source.
Our approach will be to use Selenium to collect the required video IDs, and then scrape the videos one by one, because the description and the rest of the data can be extracted only from the video page itself; that part can be done with our traditional scraping techniques.
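The scrolling step described above can be sketched roughly like this. This is a minimal, hypothetical sketch, not the final script: it assumes geckodriver is on your PATH, and the function names (`youtube_search_url`, `fetch_scrolled_page`) and the scroll/pause counts are my own illustrative choices.

```python
import time
from urllib.parse import urlencode

def youtube_search_url(query):
    # YouTube's search results page takes the query in the "search_query" parameter.
    return "https://www.youtube.com/results?" + urlencode({"search_query": query})

def fetch_scrolled_page(query, scrolls=5, pause=2.0):
    """Open a YouTube search in Firefox, scroll to the bottom a few times to
    trigger the infinite scroll, and return the fully loaded page source."""
    # Imported lazily so the URL helper above works even without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes geckodriver is on your PATH
    try:
        driver.get(youtube_search_url(query))
        for _ in range(scrolls):
            # Each scroll to the bottom makes YouTube load another batch of results.
            driver.execute_script(
                "window.scrollTo(0, document.documentElement.scrollHeight);"
            )
            time.sleep(pause)  # give the new results time to load
        return driver.page_source
    finally:
        driver.quit()
```

The returned page source can then be handed to Beautiful Soup to pull out the video IDs, which is exactly what we will do in the next part.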
I am using Mozilla Firefox for this purpose, though you can use any web browser supported by Selenium. Some of the supported browsers are Chrome, Opera, Safari, Microsoft Edge, etc.
So to install Mozilla Firefox, you can click here and download it for your operating system.
I am using Firefox because I am already using Chrome for Jupyter, and I don't want any interference caused by Selenium.
Now let's install Selenium. The installation needs two steps.
Step 1 : Use pip to install selenium
pip install selenium
Step 2: Download Web driver for the Selenium
To download the web driver for Firefox, click here and then download it for your OS.
If you are planning to use any other browser, click here and then click on the browser you want to use; a Git repo will open, from which you can download the web driver for your OS.
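Once the driver is downloaded, a quick smoke test confirms everything is wired up. This is just a hypothetical sketch: the `GECKODRIVER` environment variable is my own convention for pointing Selenium at the driver binary when it is not on your PATH, not something Selenium requires.

```python
import os

def geckodriver_path(default="geckodriver"):
    """Where Selenium should find the Firefox driver.

    Hypothetical convention: honor a GECKODRIVER environment variable if set,
    otherwise assume the binary is on your PATH."""
    return os.environ.get("GECKODRIVER", default)

if __name__ == "__main__":
    # Smoke test: open Firefox, load YouTube, print the page title, close.
    from selenium import webdriver  # imported here so the helper works without Selenium

    driver = webdriver.Firefox()  # older Selenium: webdriver.Firefox(executable_path=geckodriver_path())
    driver.get("https://www.youtube.com")
    print(driver.title)
    driver.quit()
```

If a Firefox window pops up, loads YouTube, and prints the page title, the installation is working.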
Now we are ready to rumble.
In the next part, we will scrape the video IDs with the help of Selenium and Beautiful Soup. If you have any doubts, suggestions, or concerns, please comment below.
Thanks for reading 😀