This repository contains programs for applying Estnltk language processing to text content in parallel using Spark.

Contents:
  spark_estnltk : Python program that applies Estnltk processing to text content (or HTML pages) stored in SequenceFiles. Designed for parallel execution on Spark.
  text_to_hdfs : Java program for converting plain text into Hadoop SequenceFiles.
  warc_to_hdfs : NutchWAX-based program for converting WARC (Web ARChive) files into SequenceFiles.

Getting started:
0. See the readme file of each component for installation and usage instructions.
1. Obtain SequenceFiles for processing:
   1a. For text files, use text_to_hdfs to convert them into SequenceFiles.
   1b. For WARC files, use warc_to_hdfs to convert them into SequenceFiles.
2. Use spark_estnltk to process the SequenceFiles and obtain the language analysis results.
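
To illustrate the processing step (step 2), here is a minimal sketch of how a Spark job might read SequenceFiles, analyze each record, and write the results back. It is not the code of spark_estnltk; see that component's readme for the actual usage. The names strip_html, analyze, process_record and run are hypothetical, the (key, content) record layout is an assumption, and analyze is a whitespace-tokenizing placeholder standing in for the real Estnltk call.

```python
import json
import re


def strip_html(content):
    """Rough tag stripper; a stand-in for proper HTML cleaning (assumption)."""
    return re.sub(r"<[^>]+>", " ", content)


def analyze(text):
    """Placeholder analysis: whitespace tokenization.

    In a real pipeline this is where Estnltk would be called instead.
    """
    return {"words": text.split()}


def process_record(record):
    """Turn one (key, content) pair into (key, JSON-encoded analysis)."""
    key, content = record
    return key, json.dumps(analyze(strip_html(content)))


def run(input_path, output_path):
    """Apply process_record to every record of the SequenceFiles at input_path."""
    # Imported lazily so the helpers above can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="estnltk-sketch")
    try:
        (sc.sequenceFile(input_path)
           .map(process_record)
           .saveAsSequenceFile(output_path))
    finally:
        sc.stop()
```

Because process_record is a plain function, it can be checked on a single record before submitting anything to a cluster, for example process_record(("doc1", "<p>Tere maailm</p>")).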