（Data analysis with Power BI) The highest revenue moive and the highest rated movie in every decade

February 10, 2023

This dataset is publicly shared on:https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

Brief intro of the dataset:

"These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website."

Research purpose:

1. The highest revenue movie in every decade

2. The highest-rated movie in every decade

The final dashboard is here. You may check the analysis process in the following.

Data Validation

Data cleaning via Power Query

1) The data type of each column is directly shown here.

Change it if the data type is wrongly formatted by clicking the "Data Type” button.

1) Trim /clean related text data via "transform"

2) Remove duplicates via the "remove duplicates" function.

3）Check for the statistical data, null value, errors, and data range via the following clicks :

Eg. The "movies_metadata" table contains many wrong cells. We may use the "filter" function to remove them and leave the available ones. Rows that are not "False" or "True" are filtered out.

Process/Analysis

1.Data modeling:

The first step is to set up a data model to reshape the data for analysis, compress the data usage, and find the relationship between different tables.

There are many things like removing unnecessary columns, splitting columns to get the data formatted, etc. Operations like splitting columns, transforming data, and merging data can be done in Power Query. Normally I tend to remove unnecessary columns in the "data view" in Power BI since it will not modify the data uploaded to Power Query.

In this case, many unrelated columns in the fact table "movie_metadata" are removed such as 'subtitle', 'budget', 'orgin_country', etc. After that, the data type of each needed column was checked.

Then, I created a date table that included the year and decade information.

Date table with year = 
    ADDCOLUMNS(
        CALENDAR(
            DATE(YEAR(FIRSTDATE(movies_metadata[release_date])), 1, 1), 
            DATE(YEAR(LASTDATE(movies_metadata[release_date])), 12, 31)
        ), 
        "Year", YEAR([Date])
    )

And add a 'decade' column to indicate in which decade the movie was published.

Decade = SWITCH(TRUE(),
    [year] >=2010, "2010",
    [year] >=2000, "2000",
    [year] >=1990, "1990",
    [year] >=1980, "1980",
    [year] >=1970, "1970",
    [year] >=1960, "1960",
    [year] >=1950, "1950",
    [year] >=1940, "1940",
    [year] >=1930, "1930",
    [Year] >=1920, "1920", "1910")

The table created is like this ：

After the pre-processing of the dataset, I created a model based on the tables. The model of this dataset is as follows：

2. Now we need to create new columns or measures based on our research objects.

we need to create a new column called "decade" to specify which decade the movies were published.

a) we extract the year data out of the `release_date" column by creating a new column called "year".

year = 
year(movies_metadata[release_date])

3. For question 1, we need a new measure of the highest revenue of movies. So let's create one:

best sell movie = max(movies_metadata[revenue])

After that, we drag the "Decade" to the dashboard as a slicer, and drag the new measure and the "title" to the dashboard.

Then we may have this page.

When we click on the decade, we may see the highest-earning movie data among that decade.

4. For question 2, we need the average rating data based on the rating data for each movie.

Let's create a column called "ratings from users" to reveal the average ratings.

ratings from users = AVERAGEX(relatedtable(ratings),
                              'ratings'[rating])

Then we drag what data we need to the dashboard:

After the optimization, the final dashboard is shown at the beginning of this post.

Reflect:

Throughout the research, we may see the highest earning and highest rated movie every 10 years. However, this dataset is quite easy to analyze. There are more things to do in the practical business context, such as time intelligence, the optimization of the dashboard, performance optimization, etc.

Thanks for your reading.

Search This Blog

Sherwin's Data nest