Posted on

Let's explore a dataset from Flixable, which is a third-party Netflix search engine and IMDb and try to answer the following questions:

  1. What are the top movies on Netflix according to the IMDb rating?
  2. What are the top countries producing films?
  3. Who are the cast members who appear in the most Netflix Movies?
  4. What are the top movies added on Netflix between May 2018 and August 2018?
  5. What is the last movie from Steven Spielberg added on Netflix?

The initial data are stored in a csv: Initial csv data

After cleaning and transformation, the data are loaded into a triplestore for querying: Data in a triplestore

Then the questions are transformed into SPARQL-like queries and run against the triplestore:

1. What are the top movies on Netflix according to the IMDb rating?

Query:

SELECT ?movie, ?rating
FROM ?g
WHERE {
    ?movie "average_rating"@[] ?rating .
    ?movie "num_votes"@[] ?votes
}
ORDER BY ?rating DESC
HAVING ?votes > "5000"^^type:int64
LIMIT "10"^^type:int64;

Output:

?movie  ?rating
/movie<schindlers list> "8.9"^^type:float64
/movie<pulp fiction>    "8.9"^^type:float64
/movie<the lord of the rings the return of the king>    "8.9"^^type:float64
/movie<inception>       "8.8"^^type:float64
/movie<the lord of the rings the two towers>    "8.7"^^type:float64
/movie<the matrix>      "8.7"^^type:float64
/movie<city of god>     "8.6"^^type:float64
/movie<once upon a time in the west>    "8.5"^^type:float64
/movie<the departed>    "8.5"^^type:float64
/movie<senna>   "8.5"^^type:float64

[OK] 10 rows retrieved. BQL time: 80.1355ms. Display time: 50.3µs

2. What are the top countries producing the films?

Query:

SELECT ?country, COUNT(?movie) AS ?movies_count
FROM ?g
WHERE {
    ?movie "country"@[] ?country
}
GROUP BY ?country
ORDER BY ?movies_count DESC
LIMIT "10"^^type:int64;

Output:

?country        ?movies_count
"United States"^^type:text      "1304"^^type:int64
"India"^^type:text      "583"^^type:int64
"United Kingdom"^^type:text     "277"^^type:int64
"France"^^type:text     "151"^^type:int64
"Canada"^^type:text     "146"^^type:int64
"Germany"^^type:text    "103"^^type:int64
"Spain"^^type:text      "92"^^type:int64
"China"^^type:text      "76"^^type:int64
"Hong Kong"^^type:text  "76"^^type:int64
"Australia"^^type:text  "57"^^type:int64

[OK] 10 rows retrieved. BQL time: 30.8864ms. Display time: 43.1µs

3. Who are the cast members who appear in the most Netflix Movies?

Query:

SELECT ?person, COUNT(?movie) AS ?movies_count
FROM ?g
WHERE {
    ?person "played"@[] ?movie
}
GROUP BY ?person
ORDER BY ?movies_count DESC
LIMIT "10"^^type:int64;

Output:

?person ?movies_count
/person<Shah Rukh Khan> "23"^^type:int64
/person<Boman Irani>    "22"^^type:int64
/person<Akshay Kumar>   "21"^^type:int64
/person<Paresh Rawal>   "20"^^type:int64
/person<Anupam Kher>    "19"^^type:int64
/person<Naseeruddin Shah>       "18"^^type:int64
/person<Om Puri>        "17"^^type:int64
/person<Nicolas Cage>   "16"^^type:int64
/person<Gulshan Grover> "15"^^type:int64
/person<Kay Kay Menon>  "15"^^type:int64

[OK] 10 rows retrieved. BQL time: 229.6949ms. Display time: 48.9µs

4. What are the top movies added on Netflix between May 2018 and August 2018?

Query:

SELECT ?movie, ?time, ?rating
FROM ?g
WHERE {
    /platform<Netflix> "added"@[?time] ?movie .
    ?movie "average_rating"@[] ?rating .
    ?movie "num_votes"@[] ?votes
}
ORDER BY ?rating DESC
HAVING ?votes > "5000"^^type:int64
BETWEEN 2018-05-01T00:00:00Z, 2018-08-31T00:00:00Z
LIMIT "10"^^type:int64;

Output:

?movie  ?time   ?rating
/movie<rang de basanti> 2018-08-02T00:00:00Z    "8.2"^^type:float64
/movie<andaz apna apna> 2018-05-16T00:00:00Z    "8.2"^^type:float64
/movie<room>    2018-07-19T00:00:00Z    "8.1"^^type:float64
/movie<haider>  2018-08-02T00:00:00Z    "8.1"^^type:float64
/movie<inequality for all>      2018-08-29T00:00:00Z    "8"^^type:float64
/movie<her>     2018-07-29T00:00:00Z    "8"^^type:float64
/movie<the kings speech>        2018-06-02T00:00:00Z    "8"^^type:float64
/movie<thor ragnarok>   2018-06-05T00:00:00Z    "7.9"^^type:float64
/movie<pad man> 2018-08-21T00:00:00Z    "7.9"^^type:float64
/movie<amy>     2018-07-26T00:00:00Z    "7.8"^^type:float64

[OK] 10 rows retrieved. BQL time: 10.38ms. Display time: 58µs

5. What is the last movie from Steven Spielberg added on Netflix?

Query:

SELECT ?movie, ?time
FROM ?g
WHERE {
    /platform<Netflix> "added"@[?time] ?movie .
    /person<Steven Spielberg> "directed"@[] ?movie
}
ORDER BY ?time DESC
LIMIT "1"^^type:int64;

Output:

?movie  ?time
/movie<catch me if you can>     2020-01-01T00:00:00Z

[OK] 1 rows retrieved. BQL time: 18.1343ms. Display time: 35.1µs

Developed with Badwolf triplestore and G6 graph visualization engine.
The source code is available on Github.