lukaselmer/ethz-web-scale-data-mining-project

ETH Zurich - Web Scale Data Processing and Mining Project

This is the main repository for the web scale data mining project, which took place in summer 2014 as a research project.

Results

One of the results is a set of visualized topics, which were learned autonomously from terabytes of raw HTML data.

More Results: 100 Learned Topics

Directory Structure and Overview

└── src - the source code projects, see below
    ├── WSDA
    ├── combine_sequence_files
    ├── examples
    │   ├── spark_example
    │   └── word_count_1
    ├── html_to_text_conversion
    ├── remove_infrequent_words
    ├── results_display
    ├── scripts
    └── word_count

Repositories

This is the code repository
The runs and the raw results can be found in this repository
The hadoop config is here
The spark config is here

Project Management, Documentation

Source code projects

WSDA

The self-implemented LDA

@hany-abdelrahman: the WSDA directory should probably be renamed to something more meaningful 😉 TODO: add some more doc, references, etc.

Author: Hany Abdelrahman

combine_sequence_files

Combines sequence files from subdirectories into multiple sequence files. These sequence files have the same name as the subdirectories.

This way, it is possible to create a flat directory structure with a few large sequence files.
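The grouping step can be sketched in plain Python. This is only an illustration of the idea, not the project's implementation: it does not read or write Hadoop sequence files, and the function name `group_by_subdirectory` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_subdirectory(file_paths, root):
    """Group input files by the first-level subdirectory below `root`.

    Each key of the result would become the name of one combined
    output file (assumption: one output per first-level subdirectory).
    """
    root_path = PurePosixPath(root)
    groups = defaultdict(list)
    for path in file_paths:
        relative = PurePosixPath(path).relative_to(root_path)
        if len(relative.parts) < 2:
            # File lies directly in the root, so there is no subdirectory name.
            continue
        groups[relative.parts[0]].append(path)
    # Sort for a deterministic merge order.
    return {name: sorted(paths) for name, paths in groups.items()}


if __name__ == "__main__":
    inputs = ["/data/a/part-0", "/data/a/part-1", "/data/b/x/part-0"]
    for name, paths in group_by_subdirectory(inputs, "/data").items():
        print(name, "<-", paths)
```

Each group would then be written to a single output file named after its subdirectory, which yields the flat layout described above.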

Author: Lukas Elmer

examples

Contains a Spark example project and a simple word count application. Intended only for setting up the development environment.

Author: Lukas Elmer

html_to_text_conversion

Converts web archive records into sequence files, removing all HTML / JS tags using boilerplate and applying the following additional steps:

remove stopwords
remove words containing characters other than a-z
try to remove non-English documents
remove numbers
remove URLs
convert uppercase to lowercase characters
apply stemming (org.apache.lucene.analysis.en.EnglishAnalyzer)
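Most of the steps listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the project's code: the stopword list, the regular expressions, and the order of the steps are hypothetical, and both stemming (done with Lucene's `EnglishAnalyzer` in the project) and language detection are left out.

```python
import re

# Tiny illustrative stopword list (assumption; the project's list is not shown).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

# Simplified URL matcher (assumption): anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
WORD_PATTERN = re.compile(r"[a-z]+")


def clean_text(text):
    """Return the lowercase a-z tokens of `text` with URLs, numbers,
    and stopwords removed."""
    text = URL_PATTERN.sub(" ", text)           # remove URLs
    kept = []
    for token in text.lower().split():          # convert to lowercase
        token = token.strip(".,;:!?\"'()[]")    # drop surrounding punctuation
        if not WORD_PATTERN.fullmatch(token):   # drops numbers and non a-z words
            continue
        if token in STOPWORDS:                  # remove stopwords
            continue
        kept.append(token)
    return kept


if __name__ == "__main__":
    print(clean_text("The Quick brown fox visited http://example.com in 2014!"))
```

Lowercasing before the a-z filter is a choice made in this sketch so that capitalized words survive the filter; the source does not state the order in which the project applies the steps.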

See also:

Example of how to use it

Author: Lukas Elmer

remove_infrequent_words

Removes words that appear infrequently. Requires a word count dictionary as input.
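The filtering idea can be sketched as follows. The function name, the `min_count` parameter, and its default value are hypothetical; the threshold the project actually uses is not stated here.

```python
def remove_infrequent_words(tokens, word_counts, min_count=2):
    """Keep only tokens whose corpus-wide count is at least `min_count`.

    `word_counts` maps each word to its number of occurrences in the corpus;
    words missing from the dictionary are treated as having count 0.
    """
    return [token for token in tokens if word_counts.get(token, 0) >= min_count]


if __name__ == "__main__":
    counts = {"data": 120, "mining": 45, "zyzzyva": 1}
    print(remove_infrequent_words(["data", "zyzzyva", "mining", "unseen"], counts))
```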

Example of how to use it

Author: Lukas Elmer

results_display

A script that helps display the topics. It generates:

A readable text version
A tag cloud for each topic, each word size weighted by the probability of the word
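One way to weight word sizes by probability is a linear mapping from the probability range of a topic to a font-size range. The sketch below assumes this linear scheme; the function name and the `min_size` and `max_size` values are hypothetical, and the actual script may scale sizes differently.

```python
def font_sizes(word_probs, min_size=10, max_size=48):
    """Map each word's probability linearly onto [min_size, max_size].

    The least probable word gets `min_size`, the most probable `max_size`.
    """
    if not word_probs:
        return {}
    lowest = min(word_probs.values())
    highest = max(word_probs.values())
    span = highest - lowest
    sizes = {}
    for word, prob in word_probs.items():
        # If all probabilities are equal, render every word at the maximum size.
        scale = (prob - lowest) / span if span > 0 else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


if __name__ == "__main__":
    print(font_sizes({"data": 0.5, "web": 0.25, "mining": 0.0}))
```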

Author: Lukas Elmer

word_count

Simple word count for sequence files.
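The counting logic itself is small. The sketch below works on in-memory lists of tokens rather than on sequence files and uses no Hadoop or Spark API; the function name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(documents):
    """Count how often each word occurs across all documents.

    `documents` is an iterable of token lists, one list per document.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


if __name__ == "__main__":
    docs = [["web", "data", "mining"], ["data", "processing"]]
    for word, count in count_words(docs).most_common():
        print(word, count)
```

A dictionary produced this way has the shape that remove_infrequent_words expects as its word count input.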

Example of how to use it

Author: Lukas Elmer