whym/wikihadoop
Stream-based InputFormat for processing the compressed XML dumps of Wikipedia with Hadoop
Python
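Since wikihadoop supplies a stream-based Hadoop InputFormat for compressed Wikipedia XML dumps, it would typically be plugged into a Hadoop Streaming job. The sketch below is a hypothetical invocation: the jar name, the InputFormat class name (`org.wikimedia.wikihadoop.StreamWikiDumpInputFormat`), the dump path, and `mapper.py` are all assumptions for illustration, not confirmed by this listing.

```shell
# Hypothetical Hadoop Streaming job using wikihadoop's InputFormat.
# Jar path, class name, and input/output paths are placeholders; check the
# repository's build output and README for the actual names.
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming.jar \
  -libjars wikihadoop.jar \
  -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
  -input  /dumps/enwiki-latest-pages-meta-history.xml.bz2 \
  -output /out/diffs \
  -mapper mapper.py \
  -file   mapper.py
```

The point of the custom InputFormat is that Hadoop cannot split a compressed XML dump on arbitrary byte boundaries; wikihadoop splits the stream on page/revision boundaries so each mapper receives well-formed records.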
Issues
- compatibility with elastic mapreduce? (#11, opened by GabrielF00, 20 comments)
- Using with "current" dump (#7, opened by DataJunkie, 3 comments)
- Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z (#12, opened by ravisg, 2 comments)
- download link broken (#10, opened by GabrielF00, 4 comments)
- Using cloudera distribution (#8, opened by Fkawala, 2 comments)
- Missing revisions (#2, opened by whym, 0 comments)
- Non-uniform progress report (#6, opened by whym, 0 comments)
- Generalize the splitter for non-Wikipedia XMLs (#5, opened by whym, 0 comments)
- Connect to the Python differ (#4, opened by whym, 0 comments)
- Connect to the Python differ (#3, opened by whym)