Issues
Crawl Visualization
#243 opened by ianmilligan1 - 11
Dockerize Warcbase
#239 opened by ianmilligan1 - 12
Built-in Image URL building from wayback
#232 opened by greebie - 1
WARCRecord NotSerializableException when trying to get rid of duplicate pages
#260 opened by dportabella - 0
java.lang.OutOfMemoryError: Java heap space
#246 opened by dportabella - 0
Memory Issues on Large WARC Files
#254 opened by ianmilligan1 - 6
java.lang.NullPointerException on Collection
#251 opened by ianmilligan1 - 4
keepValidPages discards XHTML
#252 opened by anjackson - 7
use WET files from CommonCrawl
#250 opened by dportabella - 1
How to load an input from S3?
#247 opened by dportabella - 0
Add UDF for computing MD5 checksum
#211 opened by lintool - 10
Error handling for broken ARC/WARC files
#234 opened by ianmilligan1 - 7
Upgrade to Spark 1.6.1?
#231 opened by ianmilligan1 - 3
Break Warcbase up into sub-artifacts
#235 opened by lintool - 1
Trantor upgraded to CDH 5.7.1
#236 opened by lintool - 2
Loading ARC files produces record size error
#199 opened by jrwiebe - 15
java.lang.NegativeArraySizeException
#222 opened by ianmilligan1 - 17
Maven error
#233 opened by drjwbaker - 7
Dynamic PageRank Crashes
#209 opened by ianmilligan1 - 4
K-Means Clustering
#226 opened by ianmilligan1 - 4
Selecting Pages that Contain Certain Keywords
#202 opened by ianmilligan1 - 1
More robust tweet parsing
#225 opened by lintool - 2
Issues with serialization on persistence
#227 opened by bzz - 1
Non-Critical Error while Building Warcbase
#223 opened by ianmilligan1 - 13
Tweet URL Extraction: All Twitter Shortlinks
#216 opened by ianmilligan1 - 1
Documenting D3.js Link Visualization
#218 opened by ianmilligan1 - 3
Contributing Guidelines for Warcbase
#219 opened by ianmilligan1 - 0
Freeze on master until 15 April
#215 opened by ianmilligan1 - 1
Represent link structure as graph using GraphX
#201 opened by jrwiebe - 1
Example counting prevalence of tweeted images
#214 opened by lintool - 2
ExtractTopLevelDomain UDF misnamed
#208 opened by lintool - 1
UDF for extracting image links
#203 opened by lintool - 1
Add UDF for extracting stuff from tweets
#210 opened by lintool - 0
Add support for analyzing tweets
#204 opened by lintool - 4
Build issues on vagrant
#206 opened by ruebot - 3
Detect WARC or ARC format when loading Records
#195 opened by bitzl - 4
java.io.EOFException when working with WARCs
#198 opened by ianmilligan1 - 5
Fine-Tuned Link Extraction within Domains
#196 opened by ianmilligan1 - 4
Wildcard support in KeepUrls?
#197 opened by ianmilligan1 - 2
Write removePrefixWWW method
#192 opened by lintool - 5
URL for Warcbase Docs
#191 opened by ianmilligan1