Youtube Scraping using python Part 2: Getting Video IDs

Hello and welcome to part 2 of this tutorial series on YouTube scraping.

In part 1 we learned what scraping is and downloaded Selenium.

In this part, we will create a program that scrapes only the video IDs for each category and saves them to text files.

I am adding the full code here; if you want to understand a specific function or line, just jump to that part of the explanation below.

from selenium import webdriver
import time 
from bs4 import BeautifulSoup as bs
import os
import pickle

cwd=os.getcwd()
parent_folder=os.path.join(cwd,'Data')
pickle_folder=os.path.join(parent_folder,"Pickle")

if not os.path.exists(parent_folder):
    os.makedirs(parent_folder)
    
if not os.path.exists(pickle_folder):
    os.makedirs(pickle_folder)

queries=['science and technology', 'food', 'manufacturing', 'history', 'art and music', 'travel blogs']

base="https://www.youtube.com/results?search_query="


for query in queries:
   
    query1=query.replace(" ","+")
    
    link=base+query1
    
    driver = webdriver.Firefox(executable_path=r'/Users/pushkarsingh/Downloads/geckodriver')
    driver.get(link)

    time.sleep(5)
    
    for i in range(0,10):
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(3)
        print(i)

    soup=bs(driver.page_source, 'lxml')
    vids = soup.findAll('a',{"class":"yt-simple-endpoint style-scope ytd-video-renderer"})
    print(query)
    
    save_ids=os.path.join(parent_folder,'IDs')
    if not os.path.exists(save_ids):
        os.makedirs(save_ids)
    
    name=query+".txt"
    save_ids_link=os.path.join(save_ids,name)
    
    f= open(save_ids_link,"a+")
    vid_id_list=[]

    for v in vids:
        d=str(v)
        vid_id=d[(d.find("href"))+15:(d.find("id="))-2]
        print(vid_id)
        if (vid_id.find("imple"))==-1:
            vid_id_list.append(vid_id)
            
            f.write(vid_id)
            f.write("\n")
    
    # Store this query's IDs in the dictionary and tidy up before the next query.
    vid_id_dict.update({query:[ids for ids in vid_id_list]})
    f.close()
    driver.quit()

# Save the whole dictionary of video IDs so other programs can load it later.
vid_ids_dict_pickle_path=os.path.join(pickle_folder,"vid_ids_dict.pickle")
pickle_out=open(vid_ids_dict_pickle_path,"wb")
pickle.dump(vid_id_dict,pickle_out)
pickle_out.close()

Explanation

from selenium import webdriver
import time 
from bs4 import BeautifulSoup as bs
import os
import pickle

Here we are importing some libraries. Let me tell you why we need each of them.

We already have Selenium, as we installed it in part 1.

Beautiful Soup

To install Beautiful Soup just run the command

pip install beautifulsoup4

We will use it to parse the page source and extract the data we need from it. This is the backbone of scraping.
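As a quick illustration (a minimal sketch, not part of the tutorial code), this is the pattern we will use on the real page source: parse some HTML, find a tag by its class, and read its href attribute. Note that passing 'lxml' as the parser, as the main script does, also requires the lxml package (pip install lxml).

from bs4 import BeautifulSoup as bs

# A tiny, self-contained example of the pattern used later on the real page source.
html = '<a class="yt-simple-endpoint" href="/watch?v=s1QRkRxXiE8">Some video</a>'
soup = bs(html, 'lxml')
anchor = soup.find('a', {"class": "yt-simple-endpoint"})
print(anchor['href'])   # prints /watch?v=s1QRkRxXiE8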

Pickle

Pickle is part of the Python standard library, so there is nothing to install; you can simply import it.

We will use it to pickle the dictionary of video IDs so that we can load it directly in another program.

time

We will use the time library to introduce some delay between requests.

os

We will use os to create some folders so that everything stays neat, clean, and easily accessible.

cwd=os.getcwd()
parent_folder=os.path.join(cwd,'Data')
pickle_folder=os.path.join(parent_folder,"Pickle")

if not os.path.exists(parent_folder):
    os.makedirs(parent_folder)
    
if not os.path.exists(pickle_folder):
    os.makedirs(pickle_folder)

Here we are creating the folders: a Data folder and, inside it, a Pickle folder. The os.path.exists checks make sure we only create a folder when it does not already exist.
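As a side note (a small sketch, not required for the tutorial), os.makedirs also accepts an exist_ok=True flag, which lets you drop the os.path.exists checks because it silently does nothing when the folder is already there:

import os

# exist_ok=True means: do not raise an error if the folder already exists.
# makedirs also creates the intermediate 'Data' folder on the way to 'Pickle'.
cwd = os.getcwd()
pickle_folder = os.path.join(cwd, 'Data', 'Pickle')
os.makedirs(pickle_folder, exist_ok=True)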

queries=['science and technology', 'food', 'manufacturing', 'history', 'art and music', 'travel blogs']

base="https://www.youtube.com/results?search_query="

Here we specify the categories we want to scrape; you can add, delete, or change these categories. The idea is to create a list of queries that we want to search on YouTube and scrape.

Next, we save the YouTube search URL in base. Later we append a category to this base link to perform a YouTube search for that particular category.

for query in queries:
   
    query1=query.replace(" ","+")
    
    link=base+query1
    
    driver = webdriver.Firefox(executable_path=r'Path-to-webdriver-that-we-downloaded-in-part-1')
    driver.get(link)

    time.sleep(5)

Here, first of all, we iterate over the queries one by one and replace every space with a + sign, because that is how YouTube encodes search queries in its URLs. Not only YouTube: most search engines encode queries this way.

Then we append the encoded query to the base link, as discussed earlier. (A more general way to encode the query is sketched below.)
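The simple replace is enough for these queries, but if you ever search for something containing characters other than letters and spaces, the standard library's urllib.parse.quote_plus does the same job more robustly. A small sketch of that alternative:

from urllib.parse import quote_plus

base = "https://www.youtube.com/results?search_query="
query = "science and technology"

# quote_plus turns spaces into '+' and percent-encodes any other character
# that is not safe inside a URL query string.
link = base + quote_plus(query)
print(link)   # https://www.youtube.com/results?search_query=science+and+technology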

The next step is important. Here we are creating the webdriver. In the previous part we downloaded the webdriver; now, in this step, we have to provide the path to it.

Now Selenium has opened a browser window that our program controls. In the next step we open the link of the query page, which is

“https://www.youtube.com/results?search_query=science+and+technology”

After that we put the program to sleep for 5 seconds just to ensure that the webpage loads completely.
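One caveat: the executable_path argument shown here works with older Selenium releases, but in Selenium 4 it is deprecated in favour of a Service object. If you are on a newer version, the driver setup looks roughly like this sketch (the geckodriver path is the same one you downloaded in part 1):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Point the Service at the geckodriver binary downloaded in part 1.
service = Service(executable_path=r'Path-to-webdriver-that-we-downloaded-in-part-1')
driver = webdriver.Firefox(service=service)
driver.get("https://www.youtube.com/results?search_query=science+and+technology")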

for i in range(0,10):
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(3)
    print(i)

Here we are scrolling the page. This is also an important concept to understand.

Here we are using the driver.execute_script function. With its help we run a small piece of JavaScript that scrolls the window, i.e. window.scrollTo().

Note that window.scrollTo() takes two parameters: the horizontal and the vertical position to scroll to. We keep the horizontal position at 0 and set the vertical position to the height of the page.

On YouTube, when we scroll down to the end of the page, more videos start loading, and that is exactly what we rely on here: we want to scroll the webpage all the way to the end. But a question arises: how can we find the page height? The answer is that we can't directly, but the webdriver can ask the page for it.

Every webpage carries some information about itself, like the page width, the page height, etc. To get the page height we use document.documentElement.scrollHeight, which returns the document height; let's say the height is 4000. The webdriver then scrolls down to 4000 pixels, which is the end of the page, and more videos start loading.

We wait for 3 seconds for the new videos to load and then run the same command again. By now the page height will have increased; let's say it is now 8000.

Now the webdriver scrolls the webpage down to 8000, and so on; every time we run this loop we scroll the webpage to its current end. (A variant that keeps scrolling until the height stops growing is sketched below.)
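If you would rather not hard-code the number of scrolls, a common alternative is to keep scrolling until the page height stops growing. A rough sketch of that variant (it assumes the driver created above and the time import from the top of the script):

import time

# Keep scrolling until YouTube stops loading new results for this query.
last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(3)   # give the new batch of videos time to load
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break       # height did not change, so nothing new was loaded
    last_height = new_height

Keep in mind that YouTube can keep loading results for a very long time, so you may still want an upper bound on the number of iterations.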

soup=bs(driver.page_source, 'lxml')
vids = soup.findAll('a',{"class":"yt-simple-endpoint style-scope ytd-video-renderer"})
print(query)

Once we have finished scrolling, we grab the page source of the whole page so that we can start scraping.

Here driver.page_source holds the raw HTML, and we use Beautiful Soup with the lxml parser to turn it into a searchable object, which we save in soup. You can also print soup to view the page source.

Now comes the messy part, where you have to dive into this page source to extract the useful information. It is not that hard; let me guide you.

First of all, copy from the webpage itself a piece of the data you want to scrape. For example, I want to scrape the video IDs, so I click on any YouTube video in the search results, copy the video ID from its link, and search for it in the page source.

I got around 23 matches, but the one I want looks like this:

 <a aria-label="APPSC(GROUP-1,PRELIMS) SCIENCE AND TECHNOLOGY by SATYA ONLINE IAS ACADEMY 1 month ago 54 minutes 4,513 views" 
class="yt-simple-endpoint style-scope ytd-video-renderer" 
href="/watch?v=s1QRkRxXiE8" 
id="video-title" 
title="APPSC(GROUP-1,PRELIMS) SCIENCE AND TECHNOLOGY">
 APPSC(GROUP-1,PRELIMS) SCIENCE AND TECHNOLOGY
     </a>

Here, if you look closely, you will find href="/watch?v=s1QRkRxXiE8", which contains our video ID, together with a tag and its attributes.

If you are not familiar with HTML, let me explain.

HTML has many tags, such as the <a> anchor tag, the <div> container tag, the <style> style tag, the <h1> heading tag, the <span> span tag, the <p> paragraph tag, etc.

Along with tags there are attributes too, such as class, id, dir, title, href, etc.

So we look for data that is enclosed by a tag and identified by an attribute. Here the data is enclosed by an anchor tag that has the class attribute class="yt-simple-endpoint style-scope ytd-video-renderer".

Now, all we have to do is use Beautiful Soup to search for every anchor tag with that same class attribute and extract the href text, i.e. /watch?v=s1QRkRxXiE8.

We will extract the ID itself using string slices in a later step; for now we collect all of these matching tags and save them in a variable called vids.
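As a side note, because every element in vids is a Beautiful Soup tag, you can also read the href attribute directly instead of slicing the string representation. A small sketch of that alternative:

# Each 'v' in vids is a bs4 Tag, so the href attribute can be read directly.
for v in vids:
    href = v.get('href')                              # e.g. "/watch?v=s1QRkRxXiE8"
    if href and href.startswith("/watch?v="):
        vid_id = href.split("v=")[1].split("&")[0]    # drop any extra URL parameters
        print(vid_id)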

save_ids=os.path.join(parent_folder,'IDs')
if not os.path.exists(save_ids):
    os.makedirs(save_ids)

name=query+".txt"
save_ids_link=os.path.join(save_ids,name)

f= open(save_ids_link,"a+")

In this part we are just creating the folders and files needed to store the video IDs of each query in a separate text file.

vid_id_list=[]

for v in vids:
    d=str(v)
    vid_id=d[(d.find("href"))+15:(d.find("id="))-2]
    print(vid_id)
    if (vid_id.find("imple"))==-1:
        vid_id_list.append(vid_id)

        f.write(vid_id)
        f.write("\n")

Here we are using slices to extract the video IDs, as discussed above. This is a simple and common operation: take the string representation of each tag and cut out the part between two known positions. The if check on "imple" simply discards badly sliced strings (fragments of the class name yt-simple-endpoint) that show up when a matching anchor does not carry a normal /watch link.
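To see how the slice indices line up, here is the same operation applied to the sample anchor tag from above (the aria-label and title are shortened to keep the line readable):

# Worked example of the slicing on a shortened version of the sample tag.
d = ('<a aria-label="..." class="yt-simple-endpoint style-scope ytd-video-renderer" '
     'href="/watch?v=s1QRkRxXiE8" id="video-title" title="...">...</a>')

start = d.find("href") + 15   # skip the 15 characters of  href="/watch?v=
end = d.find("id=") - 2       # stop just before the quote that closes the href value
print(d[start:end])           # prints s1QRkRxXiE8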

    vid_id_dict.update({query:[ids for ids in vid_id_list]})
    f.close()
    driver.quit()

vid_ids_dict_pickle_path=os.path.join(pickle_folder,"vid_ids_dict.pickle")
pickle_out=open(vid_ids_dict_pickle_path,"wb")
pickle.dump(vid_id_dict,pickle_out)
pickle_out.close()

Here we add an entry to the vid_id_dict dictionary (the empty dictionary we created near the top of the script) with the query as the key and the list of video IDs for that query as the value.

Then we pickle this dictionary and save it in the Pickle folder we created at the start.
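In the next part (or any other script) the dictionary can be loaded back from that file. A small sketch, assuming the same Data/Pickle folder layout created above:

import os
import pickle

# Load the dictionary of video IDs that this script saved.
pickle_path = os.path.join(os.getcwd(), 'Data', 'Pickle', 'vid_ids_dict.pickle')
with open(pickle_path, 'rb') as f:
    vid_id_dict = pickle.load(f)

print(vid_id_dict.keys())           # the query categories
print(len(vid_id_dict['food']))     # how many IDs were collected for 'food'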

Now, at this point, we have the video IDs to work with.

In the next and last part we will create a program that opens each and every video page, extracts data like the title and description, and saves it in a CSV file for further analysis.

If you have any doubt, suggestion or concern then please comment below.

Thanks for reading. 😀

6 thoughts on “Youtube Scraping using python Part 2: Getting Video IDs”

  1. Why is the file stored with a .pickle extension? What is the use of creating a separate file for each category under the folder called Data? Are we using those text files further?

    This pickle file only contains the last category's video IDs (travel blogs), not all of the categories.

    1. I like to keep the data around for future possibilities, and that is the reason I organize it this way. It's the programming style that I follow; everyone has their own. So it's just a personal preference and nothing else.

  2. Does it scrape all the video IDs that come under that particular search query?

    How can I check the video count for a particular search?

    1. No, that is not the case. Actually, the YouTube search page doesn't contain all the videos that match the search keywords. I don't know the reason behind it, but YouTube has far more videos than what it shows on the search page. You can create a program that looks for channels in a particular category and checks their videos plus their suggestions; in this way you can dig deeper into YouTube.

  3. When I run the code a second time I get a limited number of video URLs. The first time I got over 500; now I'm getting only 200 video IDs. Can I know the reason behind this?

    1. Maybe the connection was broken.
      This program scrolls the YouTube search page 10 times, as you can see in for i in range(0,10):. You can increase that number to get more video IDs.

      for i in range(0,10):
          driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
          time.sleep(3)
          print(i)
