Python-FrequencyAnalysis

A simple script for testing Zipf's law empirically on books from Project Gutenberg.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Usage

Step 1: Setup

Clone this repository, then create two directories named books and temp inside the src folder.
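If you would rather create the two directories from Python than by hand, a minimal sketch follows. The helper name create_data_dirs is hypothetical and is not part of this repository; it assumes you run it from the repository root.

```python
from pathlib import Path


def create_data_dirs(src_root):
    """Create the books and temp directories under src_root if they are missing."""
    created = []
    for name in ("books", "temp"):
        path = Path(src_root) / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    for path in create_data_dirs("src"):
        print(path)
```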

Step 2: Download

This step downloads books from Project Gutenberg in plain-text format and automatically strips the Project Gutenberg headers.

python Downloader.py

Specify the number of passes (one pass downloads roughly 20 to 50 books).

Press any key at any time to exit. Note: the program exits only once the current pass is complete.

The files will be downloaded to the temp directory.

Requires BeautifulSoup (published on PyPI as beautifulsoup4).
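To illustrate what header stripping involves, here is a rough sketch that keeps only the text between the START and END marker lines commonly found in Project Gutenberg plain-text files. It is not taken from Downloader.py, which may work differently; the function name and the exact marker patterns are assumptions.

```python
import re

# Marker lines commonly found in Project Gutenberg plain-text files (assumed patterns).
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    body_start = start.end() if start else 0
    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()
```

Files whose markers are worded differently pass through unchanged, so it is worth spot-checking a few downloads before analyzing them.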

Options

You can edit the Project Gutenberg offset in Downloader_Config.ini. The downloader updates this value automatically after it runs.
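If you want to inspect or change the offset from a script, the standard configparser module can read and write .ini files. The sketch below is only an illustration: the section name Downloader and the key name offset are hypothetical, so check the actual Downloader_Config.ini for the real names before using it.

```python
import configparser

# Hypothetical section and key names; check the real Downloader_Config.ini.
SECTION = "Downloader"
KEY = "offset"


def read_offset(path):
    """Return the stored offset, or 0 if the file, section, or key is missing."""
    parser = configparser.ConfigParser()
    parser.read(path)  # Missing files are silently ignored.
    return parser.getint(SECTION, KEY, fallback=0)


def write_offset(path, offset):
    """Store a new offset while keeping any other settings in the file."""
    parser = configparser.ConfigParser()
    parser.read(path)
    if not parser.has_section(SECTION):
        parser.add_section(SECTION)
    parser.set(SECTION, KEY, str(offset))
    with open(path, "w") as handle:
        parser.write(handle)
```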

Step 3: Analyze

This step generates a set of confidence intervals from regression lines fitted to the word-frequency data of sampled books.

python Analyze_Multicore.py

Make sure the books you want to analyze are in the books directory.

Output will be saved to conf.txt.
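For readers who want to see the idea behind the analysis, here is a minimal standard-library sketch, not the code in Analyze_Multicore.py. It ranks words by frequency, fits a least-squares line to log(count) against log(rank), and reports an approximate confidence interval for the slope. The function names are hypothetical, and the interval uses a normal approximation rather than the t-based interval a library such as statsmodels would report. Under Zipf's law the slope should be close to -1.

```python
import math
import re
from collections import Counter
from statistics import NormalDist


def rank_frequency(text, num_top_words=1000):
    """Return (rank, count) pairs for the most common words, rank starting at 1."""
    words = re.findall(r"[a-z']+", text.lower())
    limit = num_top_words if num_top_words > 0 else None  # -1 means all words
    counts = Counter(words).most_common(limit)
    return [(rank, count) for rank, (_, count) in enumerate(counts, start=1)]


def fit_log_log(pairs, alpha=0.05):
    """Fit log(count) = intercept + slope * log(rank) by ordinary least squares.

    Returns (slope, intercept, (low, high)), where (low, high) is an approximate
    confidence interval for the slope based on the normal distribution.
    Needs at least three pairs.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return slope, intercept, (slope - z * std_err, slope + z * std_err)
```

Working in log-log space turns the power-law relationship into a straight line, which is why a plain linear regression is enough here.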

Options

'''
Number of books to sample for one confidence interval.
Make sure value is lower than number of books in the directory.
'''
NUM_OF_SAMPLES = 300

'''
Number of words to include in the data set for generating the regression line.
Set to -1 to use all words (not recommended), 1000 works best.
'''
NUM_TOP_WORDS = 1000

'''
Number of confidence intervals to generate.
'''
NUM_INTERVALS = 100

'''
Number of processes.
'''
NUM_PROCESSES = 8

'''
Alpha value for confidence interval.
0.05 = 95% confidence
'''
ALPHA_VALUE = 0.05

Requires statsmodels, numpy, and matplotlib.
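One plausible reading of NUM_OF_SAMPLES and NUM_INTERVALS is that each confidence interval is computed from its own random sample of books. The sketch below shows that sampling step only; it is an assumption about how the options fit together, not the script's actual code, and draw_samples is a hypothetical name. It also shows why the sample size must not exceed the number of available books: sampling without replacement fails otherwise.

```python
import random


def draw_samples(book_paths, num_samples, num_intervals, seed=None):
    """Return num_intervals lists, each holding num_samples distinct books.

    random.Random.sample raises ValueError when num_samples is larger than
    the number of available books.
    """
    rng = random.Random(seed)
    books = list(book_paths)
    return [rng.sample(books, num_samples) for _ in range(num_intervals)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    books = [f"book_{i}.txt" for i in range(10)]
    for sample in draw_samples(books, num_samples=4, num_intervals=3, seed=1):
        print(sample)
```

Because the samples are independent of one another, they could be handed to separate worker processes, which is presumably what NUM_PROCESSES controls.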

Presentation

Google Slides