Word clouds are a useful tool for generating a quick, visual depiction of large amounts of textual data. For example, word clouds of politicians’ speeches can reveal the central theme of each speech. In the example I present here, we will use baby names data provided by the Social Security Administration, available here.
Considering baby names registered in the year 2013, we generate a name cloud that looks like this:
Here we can clearly see that Emma, Sophia and Olivia were popular baby names in 2013. The above word cloud was generated using the Python wordcloud module, available here.
The code for generating the above word cloud is pasted here:
#!/usr/bin/env python
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from wordcloud import WordCloud

## reads data file (columns: name, gender, count)
my_data = np.genfromtxt('yob_2013.txt', delimiter=',',
                        dtype=[('name','U50'),('gender','U1'),('count','i8')])
names = my_data['name']
freqs = my_data['count']

## creates a name -> freq mapping; recent wordcloud releases expect a dict here
## (a name listed under both genders keeps only its last row)
words = dict(zip(names, freqs))

## sets the geometry of the word cloud
cloud_mask = np.array(Image.open("cloud_outline.png"))
wc = WordCloud(background_color="white", mask=cloud_mask)
wordcloud = wc.generate_from_frequencies(words)

## prints to screen
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
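One detail worth a small sketch: a name given to both girls and boys appears on two rows of the file, once per gender. If you want a single entry per name in the cloud, the counts can be summed first. The helper name `name_frequencies` and the sample rows below are my own illustration (the counts are made up, not real SSA figures), and the sketch uses only the standard library:

```python
import csv
import io
from collections import Counter


def name_frequencies(lines):
    """Sum the count column per name from rows of name,gender,count."""
    totals = Counter()
    for name, _gender, count in csv.reader(lines):
        totals[name] += int(count)
    return dict(totals)


# Made-up rows for illustration only, not real SSA figures.
sample = io.StringIO("Emma,F,20\nRiley,F,5\nRiley,M,3\nNoah,M,18\n")
frequencies = name_frequencies(sample)
```

As far as I know, recent releases of the wordcloud package expect `generate_from_frequencies` to receive a mapping from word to frequency, so a dict like `frequencies` should be usable there directly. For the real file, passing an open file handle in place of `sample` should work the same way.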
Whilst interesting and useful, the cloud does not tell us much about popular boy names. It would be useful to be able to filter our data and generate a name cloud of only male names. This can be achieved very easily, thanks to NumPy’s boolean indexing, which lets us select only the rows whose gender is 'M':
my_data = np.genfromtxt('yob_2013.txt', delimiter=',',
                        dtype=[('name','U50'),('gender','U1'),('count','i8')])
names = my_data['name']
freqs = my_data['count']
genders = my_data['gender']

## Uses NumPy boolean-mask indexing to keep only the male rows
names = names[genders=='M']
freqs = freqs[genders=='M']
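To see what the comparison actually does, here is a minimal sketch on a tiny in-memory array laid out like the data file. The rows and counts are hypothetical values I made up for illustration, and the variable names are my own:

```python
import numpy as np

# Hypothetical sample in the same name,gender,count layout (made-up counts).
sample = np.array(
    [("Emma", "F", 20), ("Noah", "M", 18), ("Olivia", "F", 17), ("Liam", "M", 16)],
    dtype=[("name", "U50"), ("gender", "U1"), ("count", "i8")],
)

# Comparing a field with a scalar yields one boolean per row.
is_male = sample["gender"] == "M"

# Indexing with that boolean array keeps only the rows where it is True.
male_names = sample["name"][is_male]
male_freqs = sample["count"][is_male]

# Plain-Python dict of name -> count for the selected rows.
male_frequencies = {str(n): int(c) for n, c in zip(male_names, male_freqs)}
```

The mask is computed once and reused for both columns, which keeps names and counts aligned after filtering.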
After the filtering, we obtain the following word cloud of boy names registered in 2013:
Good luck making your own interesting and useful word clouds!
-Simon