
How to get the Letter Frequency in Python

How to get the letter frequency of documents and how to statistically compare letter frequency distributions

George Pipis
4 min read · Sep 22, 2020


Image by the Author: Letter Relative Distribution in our Document versus English Texts

Letter Frequency

We will walk through an example of how to get the letter frequency in a document, counting letters either across the whole document or across its unique words only. Finally, we will compare the observed relative frequencies with the letter frequency of the English language.
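The difference between counting over the whole document and counting over unique words can be sketched as follows. This is a minimal illustration using only the standard library; `sample` and the helper `letter_counts` are hypothetical names introduced here, not part of the original walk-through.

```python
from collections import Counter
import re
import string

def letter_counts(text):
    """Count the 26 English letters in `text`, ignoring case and all other characters."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)

# hypothetical sample text, standing in for a real document
sample = "The whale, the whale! The white whale."

# whole document: every occurrence of every word contributes
whole_doc = letter_counts(sample)

# unique words: each distinct word contributes once, as in a dictionary
unique_words = set(re.findall(r"[a-z]+", sample.lower()))
by_unique_words = letter_counts("".join(unique_words))

print(whole_doc["e"], by_unique_words["e"])  # prints: 7 3
```

Frequent words such as "the" inflate their letters in the whole-document count, which is one reason the two distributions can differ.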

The horizontal barplot above shows that the letter e is the most common in both English texts and dictionaries. Notice also that the distribution differs between texts and dictionaries.

Part A: Get the Letter Frequency in Documents

We will work with the text of Moby Dick and compute the frequency and the relative frequency of each letter. Finally, we will apply a chi-square test to check whether the distribution of the letters in Moby Dick is the same as the one we see in English texts.
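The chi-square comparison described above can be sketched as follows. The observed counts and reference proportions are made-up placeholders over a three-letter alphabet, not the Moby Dick or English figures; with real data the same hypothetical helper, `chi_square_statistic`, would take 26 entries. The sketch computes only the test statistic and the degrees of freedom. A p-value would additionally need a chi-square distribution function, for example from SciPy.

```python
def chi_square_statistic(observed, expected_props):
    """Pearson chi-square statistic for observed counts against
    reference proportions (which are assumed to sum to 1)."""
    total = sum(observed.values())
    stat = 0.0
    for letter, prop in expected_props.items():
        expected = prop * total          # count expected under the reference
        diff = observed.get(letter, 0) - expected
        stat += diff * diff / expected
    dof = len(expected_props) - 1        # number of categories minus one
    return stat, dof

# hypothetical toy data, for illustration only
observed = {"a": 30, "b": 50, "c": 20}
expected_props = {"a": 0.25, "b": 0.50, "c": 0.25}

stat, dof = chi_square_statistic(observed, expected_props)
print(stat, dof)
```

A large statistic relative to the degrees of freedom suggests that the observed letter distribution departs from the reference one.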

import pandas as pd
import numpy as np
import re
from collections import Counter

# read the whole book and convert it to lower case
with open('moby.txt', 'r') as f:
    file_name_data = f.read()
file_name_data = file_name_data.lower()

# convert to a list where each character
# is an element
letter_list = list(file_name_data)

# get the frequency of each letter
my_counter = Counter(letter_list)

# convert the Counter into Pandas data frame
df = pd.DataFrame.from_dict(my_counter, orient='index').reset_index()
df = df.rename(columns={'index':'letter', 0:'frequency'})

# keep only the 26 english
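The listing stops at the step that keeps only the 26 English letters. A minimal sketch of that step, together with the relative frequency mentioned at the start of Part A, might look like the following. It is an assumption about what comes next, not the author's code: it uses only the standard library rather than the data frame, and `sample_text` is a hypothetical stand-in for the contents of moby.txt.

```python
from collections import Counter
import string

# hypothetical stand-in for the contents of moby.txt
sample_text = "call me ishmael. some years ago..."

# count every character, as in the listing above
my_counter = Counter(sample_text.lower())

# keep only the 26 English letters (drops spaces, digits and punctuation)
letter_freq = {ch: my_counter.get(ch, 0) for ch in string.ascii_lowercase}

# relative frequency = count of the letter / total count of all letters
total = sum(letter_freq.values())
relative_freq = {ch: n / total for ch, n in letter_freq.items()}

# show the three most common letters with their share of all letters
top3 = sorted(letter_freq.items(), key=lambda kv: kv[1], reverse=True)[:3]
for ch, n in top3:
    print(ch, n, round(relative_freq[ch], 3))
```

Letters that never occur are kept with a count of zero, so the result always has 26 entries and lines up with a 26-entry reference table.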


George Pipis

Sr. Director, Data Scientist @ Persado | Co-founder of the Data Science blog: https://predictivehacks.com/