/minera-imdb

Web mining for IMDB using Complex Network techniques

TODO

Lucas

  • Implement SVD DONE!
  • Implement Single-linkage IHAC DONE!
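
As a rough illustration of the single-linkage criterion named above, here is a minimal sketch of plain agglomerative clustering. It is not the project's implementation: the function names, the Euclidean metric, and the sample feature vectors are all assumptions made for the example.

```python
# Minimal single-linkage agglomerative clustering (illustrative sketch only).
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the smallest distance between
    any pair of their members (the single-linkage criterion).
    Returns a list of clusters, each a sorted list of point indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Hypothetical 2-D feature vectors (for instance, movies after a
# dimensionality reduction step); the values are made up.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
print(single_linkage(features, 3))
```

This naive version recomputes every pairwise distance at each merge, which is fine for a handful of points but far too slow for a full movie collection.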

Vilson

  • Create a database (SQLite?) DONE!
    • Movie title
    • Genre
    • Director
    • Actors (at least 4, 2 men, 2 women)
    • Plot
    • Keywords
    • Backlinks (more than 50 and fewer than 200)
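
The fields listed above could be laid out in SQLite roughly as follows. This is only a sketch: the table names, column names, and the reading of the backlink range as a CHECK constraint are assumptions, not the schema the project actually used.

```python
import sqlite3

# Hypothetical schema; every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE movie (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    genre     TEXT,
    director  TEXT,
    plot      TEXT,
    -- assumed reading of the backlink range in the list above
    backlinks INTEGER CHECK (backlinks > 50 AND backlinks < 200)
);
CREATE TABLE actor (
    movie_id INTEGER NOT NULL REFERENCES movie(id),
    name     TEXT NOT NULL,
    gender   TEXT CHECK (gender IN ('M', 'F'))
);
CREATE TABLE keyword (
    movie_id INTEGER NOT NULL REFERENCES movie(id),
    keyword  TEXT NOT NULL
);
"""


def create_db(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_db()
conn.execute(
    "INSERT INTO movie (title, genre, director, plot, backlinks) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Example Title", "Drama", "Example Director", "Example plot.", 120),
)
row = conn.execute("SELECT title, backlinks FROM movie").fetchone()
print(row)
```

Actors and keywords sit in their own tables because each movie has several of them; the "at least 4 actors, 2 men and 2 women" rule would have to be checked in application code, since a column constraint cannot express it.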

Installing

We use IMDbPY to download and access all IMDB information through an object-oriented model, which brings in a few dependencies.

Dependencies

Currently we only cover Ubuntu GNU/Linux 11.04. Install the following packages:

# MySQL
sudo apt-get install mysql-client mysql-server

# Python 2.6+ and some libs
sudo apt-get install python python-mysqldb

# IMDbPY (single file, so no recursive download is needed)
wget -c http://prdownloads.sourceforge.net/imdbpy/IMDbPY-4.9.tar.gz
tar -xvzf IMDbPY-4.9.tar.gz
cd IMDbPY-4.9/
sudo python setup.py install

Downloading the IMDB plain files

We used a local copy of the entire IMDB database (as of June 22, 2012). Here are the steps to get your own. The plain files are downloaded into your ~/tmp/imdb directory. The download is time consuming (around 1.1 GB of data), so go grab a coffee.

mkdir -p ~/tmp/imdb
cd ~/tmp/imdb
wget -rc ftp://ftp.fu-berlin.de/pub/misc/movies/database/
mv ftp.fu-berlin.de/pub/misc/movies/database/*.gz ./
rm -rf ftp.fu-berlin.de

You should now have about 1.1 GB of .list.gz files.
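
To confirm that the download landed where the next step expects it, a small check such as the following can count the .list.gz files and add up their sizes. It is a sketch only; the function name `summarize_lists` is made up for this example.

```python
import glob
import os


def summarize_lists(directory):
    """Return (file_count, total_bytes) for the *.list.gz files in directory."""
    pattern = os.path.join(os.path.expanduser(directory), "*.list.gz")
    paths = glob.glob(pattern)
    return len(paths), sum(os.path.getsize(p) for p in paths)


# ~/tmp/imdb is the download directory used in the steps above.
count, total = summarize_lists("~/tmp/imdb")
print("%d files, %.2f GB" % (count, total / 1e9))
```

If the count is zero, the mv step probably did not find the files under ftp.fu-berlin.de/pub/misc/movies/database/.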

Setting up a local SQL database

First, create a database:

mysqladmin -u root -p create imdb

With all the .list.gz files in ~/tmp/imdb, run this script from the bin directory of the IMDbPY source tree, replacing YOUR_PASSWORD with your MySQL root password:

cd IMDbPY-4.9/bin/
python imdbpy2sql.py -d ~/tmp/imdb/ -u mysql://root:YOUR_PASSWORD@localhost/imdb

This takes a long time (about 5 hours in our case).

Using
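
Once the database is populated, IMDbPY can query it through its 'sql' access system. The sketch below shows one possible way to do that; the helper names `build_uri` and `find_movies` are made up for this example, and `find_movies` needs IMDbPY, its SQL dependencies, and the local database from the previous section.

```python
def build_uri(user, password, host="localhost", database="imdb"):
    """Build a mysql:// URI in the same form passed to imdbpy2sql.py."""
    return "mysql://%s:%s@%s/%s" % (user, password, host, database)


def find_movies(title, uri):
    """Search the local database for a title; returns (movieID, title) pairs."""
    # Imported lazily so the URI helper works even without IMDbPY installed.
    from imdb import IMDb
    ia = IMDb("sql", uri=uri)
    return [(movie.movieID, movie["title"]) for movie in ia.search_movie(title)]


# "secret" is a placeholder password, not a real credential.
print(build_uri("root", "secret"))
```

Calling `find_movies("The Matrix", build_uri("root", "secret"))` should then return the matching titles from the local copy without contacting the IMDB website.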

Some interesting information

We downloaded 8.2 GB in 3h 32m 27s (676 KB/s).

Indexing the entire IMDB plain files database took 303 min:

# TIME TOTAL TIME TO INSERT/WRITE DATA : 258min, 17sec (wall) 111min, 21sec (user) 25min, 54sec (system)
building database indexes (this may take a while)
# TIME createIndexes() : 21min, 37sec (wall) 0min, 0sec (user) 0min, 0sec (system)
adding foreign keys (this may take a while)
# TIME createForeignKeys() : 23min, 7sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for movies... DONE! (restored 0 entries out of 0)
# TIME restore movies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for people... DONE! (restored 0 entries out of 0)
# TIME restore people : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for characters... DONE! (restored 0 entries out of 0)
# TIME restore characters : 0min, 3sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for companies... DONE! (restored 0 entries out of 0)
# TIME restore companies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME FINAL : 303min, 6sec (wall) 111min, 21sec (user) 25min, 54sec (system)

Total time: 212 min + 303 min = 515 min ≈ 8.6 h to download and index.
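
The figures above can be re-derived from the log itself. The following sketch parses the wall-clock value of each "# TIME" line, checks that the individual stages add up to the FINAL line, and adds the download time of 3h 32m 27s. The regular expression and the function name `wall_seconds` are assumptions made for this example.

```python
import re

# The "# TIME" lines copied from the log above.
LOG = """\
# TIME TOTAL TIME TO INSERT/WRITE DATA : 258min, 17sec (wall) 111min, 21sec (user) 25min, 54sec (system)
# TIME createIndexes() : 21min, 37sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME createForeignKeys() : 23min, 7sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME restore movies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME restore people : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME restore characters : 0min, 3sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME restore companies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME FINAL : 303min, 6sec (wall) 111min, 21sec (user) 25min, 54sec (system)
"""

WALL = re.compile(r"^# TIME (.+?) : (\d+)min, (\d+)sec \(wall\)", re.M)


def wall_seconds(log):
    """Map each stage name to its wall-clock duration in seconds."""
    return dict((name, int(m) * 60 + int(s)) for name, m, s in WALL.findall(log))


times = wall_seconds(LOG)
index_seconds = times["FINAL"]
stage_seconds = sum(v for k, v in times.items() if k != "FINAL")

download_seconds = 3 * 3600 + 32 * 60 + 27  # 3h 32m 27s
total_seconds = download_seconds + index_seconds

print("stages match FINAL:", stage_seconds == index_seconds)
print("total: %.2f h" % (total_seconds / 3600.0))
```

Working in seconds avoids the rounding that comes from truncating each part to whole minutes first.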

For more about the attributes available in the IMDB plain files database, see ftp://ftp.fu-berlin.de/pub/misc/movies/database/tools/movie-database-faq .

Authors

  • Lucas Rodrigues
  • Vilson Vieira

IFSC / University of São Paulo / 2012