How to get the Letter Frequency in Python
How to get the letter frequency of documents and how to statistically compare letter frequency distributions
Letter Frequency
This walk-through shows how to get the letter frequency of a document, counting either every word in the document or only its unique words. Finally, we will compare our observed relative frequencies with the letter frequency of the English language.
The horizontal barplot above shows that the letter e is the most common in both English texts and dictionaries. Notice also that the distribution differs between texts and dictionaries.
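The difference between the two ways of counting can be sketched with a small example. This is only an illustration: the function name `letter_counts` and the sample sentence are hypothetical, and the reading that "texts" means counting every occurrence of a word while "dictionaries" means counting each distinct word once is an assumption based on the description above.

```python
import re
from collections import Counter

def letter_counts(text, unique_words=False):
    """Count the letters a-z in text.

    If unique_words is True, each distinct word contributes once
    (dictionary-style); otherwise every occurrence is counted.
    """
    # Split the lower-cased text into runs of the letters a-z
    words = re.findall(r"[a-z]+", text.lower())
    if unique_words:
        words = set(words)
    return Counter("".join(words))

# Hypothetical sample sentence, used only for illustration
sample = "the cat and the dog and the bird"

whole = letter_counts(sample)                     # every occurrence counts
vocab = letter_counts(sample, unique_words=True)  # each distinct word once

print(whole["t"], vocab["t"])
```

Because "the" appears three times, the letters t, h and e weigh more heavily in the whole-document count than in the unique-word count, which is one reason the two distributions can differ.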
Part A: Get the Letter Frequency in Documents
We will work with the text of Moby Dick and compute the frequency and the relative frequency of each letter. Finally, we will apply a chi-square test to check whether the distribution of the letters in Moby Dick matches the one we see in English texts.
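To make the idea of the chi-square test concrete before working with the real data, here is a minimal sketch of the Pearson chi-square statistic in plain Python. The function name `chi_square_statistic`, the three-letter alphabet, the counts and the expected proportions are all made up for illustration; they are not taken from Moby Dick or from any English frequency table.

```python
def chi_square_statistic(observed_counts, expected_props):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories.

    observed_counts: dict mapping letter -> observed count
    expected_props:  dict mapping letter -> expected proportion (sums to 1)
    """
    total = sum(observed_counts.values())
    stat = 0.0
    for letter, prop in expected_props.items():
        expected = total * prop              # expected count under the reference distribution
        observed = observed_counts.get(letter, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Illustrative toy data only (hypothetical values)
observed = {"a": 50, "b": 30, "c": 20}
expected = {"a": 0.50, "b": 0.25, "c": 0.25}

stat = chi_square_statistic(observed, expected)
print(stat)
```

With three categories there are 2 degrees of freedom, and the 5% critical value is about 5.99, so a statistic below that would not reject the hypothesis that the sample follows the reference distribution. In practice a p-value is usually obtained from a library routine such as `scipy.stats.chisquare`, which takes the observed and expected counts.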
import pandas as pd
import numpy as np
import re
from collections import Counter

with open('moby.txt', 'r') as f:
    file_name_data = f.read()

file_name_data = file_name_data.lower()

# convert to a list where each character
# is an element
letter_list = list(file_name_data)

# get the frequency of each letter
my_counter = Counter(letter_list)

# convert the Counter into Pandas data frame
df = pd.DataFrame.from_dict(my_counter, orient='index').reset_index()
df = df.rename(columns={'index':'letter', 0:'frequency'})

# keep only the 26 english…