YouTube Scraping Using Python, Part 3: Scraping Title and Description

Hello everyone, and welcome to part 3 of this tutorial series on YouTube scraping.

In the previous part we learned how to load pages and scrape video IDs.

In this part we will scrape the video title and description for each video ID, one by one, and save them in per-video text files as well as in a combined CSV file.

So let’s start.

I am adding the full code here; if you want to understand a specific function or line, just jump to that part of the explanation below.

from bs4 import BeautifulSoup as bs
import requests
import pickle
import os
import csv

cwd = os.getcwd()
parent_folder = os.path.join(cwd, 'Data')

# Load the {category: [video IDs]} dictionary saved in the previous part.
pickle_path = os.path.join(parent_folder, 'Pickle', 'vid_ids_dict.pickle')
with open(pickle_path, 'rb') as pickle_in:
    vid_id_dict = pickle.load(pickle_in)

dataset_folder = os.path.join(parent_folder, 'Dataset')
if not os.path.exists(dataset_folder):
    os.makedirs(dataset_folder)

csv_file_path = os.path.join(parent_folder, 'main.csv')


base = "https://www.youtube.com/watch?v="
for key, values in vid_id_dict.items():
    # One sub-folder per category (key).
    query_dataset_folder = os.path.join(dataset_folder, key)
    if not os.path.exists(query_dataset_folder):
        os.makedirs(query_dataset_folder)

    for VidID in values:
        r = requests.get(base + VidID)
        soup = bs(r.text, 'html.parser')

        # One text file per video, named after its ID.
        save_description_link = os.path.join(query_dataset_folder, VidID + ".txt")

        description = ""
        with open(save_description_link, "a+") as f:
            for content in soup.findAll('p', attrs={'id': 'eow-description'}):
                description = content.text.strip()
                f.write(description)
                print(description)

        vid_title = ""
        for title in soup.findAll('span', attrs={'class': 'watch-title'}):
            vid_title = title.text.strip()
            print(vid_title)

        with open(csv_file_path, 'a+', newline='') as csvfile:
            fieldnames = ['Video id', 'Title', 'Description', 'Category']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow({'Video id': VidID, 'Title': vid_title,
                             'Description': description, 'Category': key})

Explanation

from bs4 import BeautifulSoup as bs
import requests
import pickle
import os
import csv

Here we are importing some libraries. We are already familiar with Beautiful Soup, pickle, and os, as we used them in the previous part.

requests

We use this library to open the links and download the page source.

csv

We use this library to write the data to our CSV file.

cwd = os.getcwd()
parent_folder = os.path.join(cwd, 'Data')

# Load the {category: [video IDs]} dictionary saved in the previous part.
pickle_path = os.path.join(parent_folder, 'Pickle', 'vid_ids_dict.pickle')
with open(pickle_path, 'rb') as pickle_in:
    vid_id_dict = pickle.load(pickle_in)

dataset_folder = os.path.join(parent_folder, 'Dataset')
if not os.path.exists(dataset_folder):
    os.makedirs(dataset_folder)

csv_file_path = os.path.join(parent_folder, 'main.csv')

Here we are loading the pickle that we created in the previous part. This pickle contains the dictionary in which each key is a category and each value is the list of video IDs associated with that category.
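For reference, the loaded dictionary has roughly this shape (the category names and video IDs below are made up for illustration):

# Illustrative only: the real keys and IDs come from the pickle saved in Part 2.
vid_id_dict = {
    'future technology': ['VidID_1', 'VidID_2'],
    'science': ['VidID_3'],
}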

In the next lines we create a folder that will store our descriptions. Later we could use this data to train a model that categorizes videos based on their descriptions, but that is not part of this tutorial.

In the end we build the path for the CSV file.
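Putting these paths together, the on-disk layout will end up looking like this (the category sub-folders are created later, inside the main loop):

Data/
├── Pickle/
│   └── vid_ids_dict.pickle
├── Dataset/
│   └── <one folder per category>/
│       └── <VidID>.txt
└── main.csv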

base = "https://www.youtube.com/watch?v="
for key, values in vid_id_dict.items():
    # One sub-folder per category (key).
    query_dataset_folder = os.path.join(dataset_folder, key)
    if not os.path.exists(query_dataset_folder):
        os.makedirs(query_dataset_folder)

First we declare the base YouTube watch URL; appending a video ID to it gives the address of that video's page.

Then we iterate over the key/value pairs of the dictionary and create a folder for each key (category).

We are doing all this just to keep the data easily accessible and manageable.
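To make the first point concrete, appending a (hypothetical) video ID to the base link gives the full watch URL:

base = "https://www.youtube.com/watch?v="
VidID = "abc123XYZ_0"  # hypothetical 11-character video ID
print(base + VidID)    # -> https://www.youtube.com/watch?v=abc123XYZ_0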

for VidID in values:
    r = requests.get(base + VidID)
    soup = bs(r.text, 'html.parser')

    # One text file per video, named after its ID.
    save_description_link = os.path.join(query_dataset_folder, VidID + ".txt")

Here we iterate over the video IDs of the current key (category). For each ID we request the page source of the video's webpage and parse it with Beautiful Soup.

After that we build the path of the text file that will store this video's description, again to keep the data organized.
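One optional guard worth adding here (my addition, not in the original code): if a request fails, we would end up parsing an error page. Inside the loop you could skip such videos:

r = requests.get(base + VidID)
if r.status_code != 200:  # anything other than 200 means the page didn't load normally
    print("Skipping", VidID, "- got HTTP", r.status_code)
    continue  # move on to the next video ID
soup = bs(r.text, 'html.parser')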

description = ""
with open(save_description_link, "a+") as f:
    for content in soup.findAll('p', attrs={'id': 'eow-description'}):
        description = content.text.strip()
        f.write(description)
        print(description)

Now here is the main part. As we discussed in the previous part, we need to find a tag and an attribute to narrow down the search. The fastest way to do this is to dive into the page source and search for the data we want; here I want the description, so I copy a few words from it and search for them in the page source.

Here is what I found. The element is very long, so pay attention to the opening tag and its attributes; that is the part you should look at while scraping.

<p class="" id="eow-description">Theories of technology often attempt to predict the future of technology based on the high technology and science of the time. As with all predictions of the future, however, technology's is uncertain.<br/><br/>Futurist Ray Kurzweil predicts that the future of technology will be mainly consist of an overlapping "GNR Revolution" of Genetics, Nanotechnology, and Robotics, with robotics being the most important of the three.<br/><br/>future technology<br/>technology in the future<br/>future tech<br/>technology of the future<br/>future of technology<br/>the future of technology<br/>upcoming technology<br/>recent technology<br/>future technology devices<br/>future technology predictions<br/>technology future<br/>technology in future<br/>the future technology<br/>future science<br/>technology for the future<br/>technology of future<br/>technology and the future<br/>future technology gadgets<br/>future technology innovations<br/>future of science and technology<br/>future it technologies<br/>future technology trends<br/>inventions of the future<br/>inventions for the future<br/>future technology 2014<br/>future in technology<br/>near future technology<br/>future technology inventions<br/>technology is the future<br/>technology for future<br/>the future in technology<br/>future science and technology<br/>future technology 2013<br/>future trends in technology<br/>future technology 2020<br/>upcoming technologies in it<br/>technology future predictions<br/>future technology today<br/>latest technology trends<br/>cool future technology<br/>technology and future<br/>the technology of the future<br/>future in tech<br/>future technology products<br/>latest technology<br/>upcoming it technologies<br/>emerging technologies<br/>future technology robots<br/>latest technology news<br/>the technology in the future<br/>future developments in technology<br/>the latest technology<br/>upcoming future technology<br/>amazing future technology<br/>future technological advances<br/>future computer technology<br/>future and technology<br/>tech news<br/>future science technology<br/>technology the future<br/>information technology<br/>future technologies in it<br/>future online technology<br/>future technology development<br/>future for technology<br/>future predictions technology<br/>future technology company<br/>latest technology inventions<br/>future world technology<br/>upcoming latest technology<br/>technology today</p></div> <div id="watch-description-extras">

As we can see, the tag is <p> (a paragraph), the class is empty, but the id is “eow-description”,

so here we use the id attribute.

Don’t worry about the <br/> tags; they are just line breaks, and Beautiful Soup takes care of them in the description = content.text.strip() line.
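You can verify this on a toy snippet (the HTML below is a cut-down stand-in for the real description element):

from bs4 import BeautifulSoup as bs

html = '<p class="" id="eow-description">First line<br/><br/>Second line</p>'
soup = bs(html, 'html.parser')
for content in soup.findAll('p', attrs={'id': 'eow-description'}):
    print(content.text.strip())  # the <br/> tags simply disappear from the text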

vid_title = ""
for title in soup.findAll('span', attrs={'class': 'watch-title'}):
    vid_title = title.text.strip()
    print(vid_title)

Now we use the same approach for the title: we search for the title text in the page source.

<span class="watch-title" dir="ltr" id="eow-title" title="Welcome To Future">
    Welcome To Future
  </span>

As we can see, the tag is <span> and the attributes are class, dir, id, and title.

The title attribute does not generalize, since we would end up searching for this exact title on every other video page too, so here we can use either class or id.

I am using the “class” attribute.

In the end “title.text.strip()” will give us the required title.
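Again, you can test the lookup on the exact snippet from the page source:

from bs4 import BeautifulSoup as bs

html = '''<span class="watch-title" dir="ltr" id="eow-title" title="Welcome To Future">
    Welcome To Future
  </span>'''
soup = bs(html, 'html.parser')
for title in soup.findAll('span', attrs={'class': 'watch-title'}):
    print(title.text.strip())  # -> Welcome To Future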

with open(csv_file_path, 'a+', newline='') as csvfile:
    fieldnames = ['Video id', 'Title', 'Description', 'Category']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'Video id': VidID, 'Title': vid_title,
                     'Description': description, 'Category': key})

In the end we append all the data to the CSV file.
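Note that this only appends data rows, so main.csv will have no header line. If you want one, a small optional tweak (my addition, not part of the original code) is to write the header only when the file is first created:

import os
import csv

write_header = not os.path.exists(csv_file_path)  # header only on the first run
with open(csv_file_path, 'a+', newline='') as csvfile:
    fieldnames = ['Video id', 'Title', 'Description', 'Category']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    writer.writerow({'Video id': VidID, 'Title': vid_title,
                     'Description': description, 'Category': key})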

You can also download the code and other material from my Git repo.

This is the end of this tutorial. I hope it helps you in some way.
If you have any doubts, suggestions, or concerns, please comment below.

Thanks for reading. 😀
