IMDB scraping

IMDB website scraping

In this project, we will find the movies that share the most actors with your favourite movie.

“https://github.com/edwardliu24/web-scraping”

Preparation

First, import the modules the project needs in order to run properly.

import scrapy
from scrapy.http import Request
import pandas as pd
from plotly import express as px
from plotly.io import write_html

Open a terminal in the directory where you want the project to live and run the command “scrapy startproject IMDB_scraper”. Then navigate into the new directory and add CLOSESPIDER_PAGECOUNT = 20 to the file “settings.py”, which stops the spider after 20 pages and keeps test runs short.
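
For reference, a sketch of how that line might look inside settings.py. The file location assumes the default layout created by scrapy startproject, and the comments are added here for illustration.

```python
# IMDB_scraper/IMDB_scraper/settings.py (excerpt, default project layout assumed)

# Close the spider once it has crawled 20 responses.
# Handy while testing; remove or raise the limit for a full crawl.
CLOSESPIDER_PAGECOUNT = 20
```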

Coding the spider

Create a file named IMDB_spider.py in the folder named spiders, and in that file define an ImdbSpider class.

class ImdbSpider(scrapy.Spider):
    ## Spider named "imdb_spider"; the crawl starts at "https://www.imdb.com/title/tt1533117/?ref_=fn_al_tt_1"
    name = 'imdb_spider'
    
    start_urls = ["https://www.imdb.com/title/tt1533117/?ref_=fn_al_tt_1"]

This is the parse method used on the initial page:

def parse(self, response):
    '''
    Scrape the initial page and follow the 'cast&crew' link
    Input: response, the html source code of the initial page
    Output: link to the cast page, and call parse_full_credits on that page
    
    '''
    ##Extract the link of "cast&crew"
    full_credits = response.css("li.ipc-inline-list__item a[href*=fullcredits]").attrib['href']
    
    ##Make the next url; urljoin resolves the href against the current page url
    cast_url = response.urljoin(full_credits)

    ##Request the parse_full_credits method on the next url
    yield Request(cast_url, callback = self.parse_full_credits)
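
Building the next URL by string concatenation only works if you know whether the href is relative or starts with a slash. A minimal sketch of a more forgiving approach, using urljoin from the standard library (Scrapy's response.urljoin resolves links against the page URL in a similar way). The actor ID nm0000000 is a placeholder, not a real page.

```python
from urllib.parse import urljoin

# The page the spider is currently on
page_url = "https://www.imdb.com/title/tt1533117/?ref_=fn_al_tt_1"

# A relative href is resolved against the page's own path
relative = urljoin(page_url, "fullcredits/")

# A root-relative href (leading slash) replaces the whole path
root_relative = urljoin(page_url, "/name/nm0000000/")

print(relative)
print(root_relative)
```

Either form of href produces a well-formed absolute URL, without a doubled slash after the domain.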

This is the parse method used on the cast & crew page:

def parse_full_credits(self, response):
    '''
    Find the page of every actor in the movie
    Input: response, the html source code of the cast page
    Output: a request for each actor page, handled by parse_actor_page
    '''
    ##Get the link of each actor
    actor_page = [a.attrib["href"] for a in response.css("td.primary_photo a")]
    
    ##Make a list of all the links to actor pages (urljoin avoids a doubled slash)
    actor_url = [response.urljoin(suffix) for suffix in actor_page]

    ##For each link, open the actor page and call the parse_actor_page method
    for url in actor_url:
        yield Request(url, callback = self.parse_actor_page)

This is the parse method used on the actor page:

def parse_actor_page(self, response):
    '''
    Scrape the actor page and collect every title the actor took part in
    Input: response, the html source code of the actor page
    Output: one dictionary per title, holding the actor name and the movie name
    '''
    ##Scrape the name of the actor
    name = response.css("h1.header span::text").get()
    
    ##Scrape all the movies that this actor was in
    for movie in response.css("div.filmo-row"):
        movies = movie.css("a::text").get()
        
        ##yield a dictionary
        yield {
            "actor" : name,
            "movies" : movies
        }

All three methods should be defined inside the ImdbSpider class.

Running the spider and analysing the results

After finishing the spider file, run “scrapy crawl imdb_spider -o results.csv” in the terminal. This writes the scraped results to a file named “results.csv”.

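Each row of the results is one actor-movie pair, so counting how often a movie appears gives the number of actors it shares with the starting movie. Count the rows per movie and tidy the counts into a DataFrame with named columns.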

result = pd.read_csv("results.csv")
df = result.value_counts(['movies'])
df = pd.DataFrame(df)
df = df.reset_index()
df.columns = ['movies','number of shared actors']
df.head()
                       movies  number of shared actors
0         Let the Bullets Fly                       34
1       Gone with the Bullets                       10
2          The Sun Also Rises                        9
3                  Hidden Man                        8
4  The Founding of a Republic                        8
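
The top row of the output is presumably the starting movie itself, since every scraped actor appears in it, so the actual recommendation is the next row. A minimal plain-Python sketch of the same count that skips the favourite movie. The rows are made up and only mimic the actor/movies shape of results.csv, and the function name rank_shared_movies is an illustrative assumption.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same shape as results.csv (made-up rows)
SAMPLE_CSV = """actor,movies
Actor A,Let the Bullets Fly
Actor A,Gone with the Bullets
Actor B,Let the Bullets Fly
Actor B,Gone with the Bullets
Actor B,Hidden Man
Actor C,Let the Bullets Fly
"""

def rank_shared_movies(csv_file, favourite):
    """Count rows per movie, skipping the favourite movie itself."""
    counts = Counter(
        row["movies"]
        for row in csv.DictReader(csv_file)
        if row["movies"] != favourite
    )
    # most_common() returns (movie, count) pairs, highest count first
    return counts.most_common()

ranking = rank_shared_movies(io.StringIO(SAMPLE_CSV), "Let the Bullets Fly")
print(ranking)
```

To run it on the real data, pass an open file handle for results.csv instead of the in-memory sample.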

Draw a scatter plot to visualize the results.

fig = px.scatter(data_frame = df, 
                 x = 'movies',
                 y = 'number of shared actors', 
                 width = 1000,
                 height = 700,
                 title = "Scatter plot of the counts of shared actors")
write_html(fig, "movie_recommend.html")
Written on April 28, 2022