
Visualizing Fifteen Years of Headlines


Over the past ten weeks my Digital Humanities class has examined and discussed a range of readings about big data and visualizations. I have learned that some collections of text are simply too large to read in a reasonable amount of time. This is where text analysis tools come in handy: they can identify trends and connections that we mere mortals might not recognize on our own. When I was assigned to create a big data visualization, I was excited to see what new information I could find.


I chose to work with Voyant, mainly because I recently watched the university librarian give a presentation using that text analysis tool. Because of this previous exposure, I felt cautiously optimistic that I could navigate Voyant's various features.


My first step was to find some data. I wanted a corpus that qualified as big data so that I could look for trends over the long term. While searching online, I stumbled across Kaggle, a website with a large collection of open datasets. I was interested in something news-related, and Kaggle had a range to choose from. I selected one titled “A Million News Headlines”, which contains headlines published over a period of fifteen years by the Australian news outlet ABC (Australian Broadcasting Corporation). The corpus includes a total of 1,103,665 records dating from February 2003 to December 2017.
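
For anyone who wants to poke at the same corpus outside of Voyant, here is a minimal sketch of loading the Kaggle CSV with pandas. The file and column names (abcnews-date-text.csv, publish_date, headline_text) are the ones used in the published dataset, so adjust them if your download differs.

```python
# Minimal sketch: load the "A Million News Headlines" CSV with pandas.
# Assumes the published column names: publish_date (YYYYMMDD) and headline_text.
import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")
df["publish_date"] = pd.to_datetime(df["publish_date"], format="%Y%m%d")

print(len(df))                                    # roughly 1.1 million headlines
print(df["publish_date"].min(), df["publish_date"].max())
print(df.head())
```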


Word cloud created by the Cirrus tool in Voyant showing the most frequent terms in the corpus

After uploading the corpus into Voyant, I first looked at the Cirrus tool, which creates a word cloud of the most frequently used terms. Somewhat surprisingly, it was not necessary to adjust the settings. Usually you have to edit the stop word list to filter out words that do not carry significant meaning, such as prepositions and determiners. That step was not needed here because my corpus consists only of headlines, not the full text of the news articles. A few of the unfamiliar terms in the word cloud are abbreviations (QLD = Queensland, NSW = New South Wales, NT = Northern Territory).
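
To give a sense of what Cirrus is doing behind the scenes, here is a rough sketch of the same term counting in plain Python. The stop-word list below is only a placeholder (Voyant ships its own, much longer list), and the file and column names are the assumptions noted above.

```python
# Rough equivalent of Cirrus: count term frequencies across all headlines,
# skipping a small, illustrative stop-word list.
import re
from collections import Counter

import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")         # assumed file/column names
stop_words = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "at", "over"}

counts = Counter()
for headline in df["headline_text"]:
    for word in re.findall(r"[a-z']+", str(headline).lower()):
        if word not in stop_words:
            counts[word] += 1

print(counts.most_common(25))     # the terms a word cloud would draw largest
```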


The word cloud shows that the corpus deals mainly with Australian news, but the headlines also refer to foreign events. I searched for particular terms to see whether I could discover any trends in ABC’s reporting of foreign affairs. Using the Trends tool, I entered the names of several countries to find out which ones ABC tended to report on. The line graph below reveals that China was mentioned more frequently than England, the United States, North Korea and New Zealand. China’s first spike in the corpus is around 2008, when it hosted the Summer Olympics. The second spike is around 2013, when Xi Jinping became President and reports of Chinese hacking into American government networks made headlines. The line graph also shows a spike in North Korea-related headlines toward the end of the corpus, likely due to the nuclear threats from the North Korean government that dominated world news in 2016 and 2017.
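
A comparable line graph can be drawn outside Voyant by counting, for each year, the headlines that mention each country. The sketch below uses pandas and matplotlib with the same assumed file and column names as above, and it plots raw counts rather than the relative frequencies Voyant shows.

```python
# Sketch of the Trends comparison: yearly counts of headlines mentioning
# each country, one line per term.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("abcnews-date-text.csv")         # assumed file/column names
df["year"] = df["publish_date"] // 10000          # publish_date is YYYYMMDD

for term in ["china", "england", "us", "north korea", "new zealand"]:
    mask = df["headline_text"].str.contains(rf"\b{term}\b", case=False, na=False)
    df[mask].groupby("year").size().plot(label=term)

plt.ylabel("headlines containing term")
plt.legend()
plt.show()
```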


Line graph created using Trends in Voyant showing the frequency of terms across the entire corpus

The Links tool in Voyant produced an interesting visualization. The visual below represents the collocation of terms in my corpus, depicting them in a network-style graph. Each term (node) is connected by a line (edge) to the words most frequently used in close proximity. You can adjust the number of words to include on either side of your term when looking for collocates; the default setting is five words per side, but you can increase it to as many as thirty. Since my corpus consists of short news headlines, I left it at five. Links is a force-directed graph: it uses an algorithm that produces a random layout, so the orientation of nodes and edges is different every time the corpus is reset (Graham, Milligan & Weingart, 250). This also means that the spatial distances between the nodes are irrelevant, which is important to keep in mind when analyzing the graph.
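
The same idea can be approximated outside Voyant by counting how often pairs of words appear within a five-word window of each other. Below is a rough sketch under the same file and column assumptions as above; a library such as networkx could then turn the strongest pairs into a force-directed graph like the one Links draws.

```python
# Sketch of what Links counts: co-occurrences of word pairs within a
# five-word window inside each headline.
import re
from collections import Counter

import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")         # assumed file/column names

WINDOW = 5
pair_counts = Counter()
for headline in df["headline_text"]:
    tokens = re.findall(r"[a-z']+", str(headline).lower())
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + WINDOW]:
            pair_counts[(left, right)] += 1

print(pair_counts.most_common(20))    # candidate edges for a network graph
```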


Links in Voyant represents the collocation of terms in a corpus by depicting them in a network

Links visualizes the relative frequency of each term through the size of its node in the graph. Although the visual above is a still image, Links is interactive in Voyant. Hovering over a single node shows the number of times the term appears in the corpus and highlights the connections between that term and other words (see “man” highlighted above). For example, “man” was connected to keywords like “jailed”, “charged”, “assault” and “court”, while “woman” revealed only one frequently associated word: “missing” (see below). This suggests not only that men were reported on more frequently than women in ABC’s headlines, but also that men were more often described as perpetrators of crimes.
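
Continuing from the pair counts in the previous sketch, the “man”/“woman” contrast can be checked directly by listing each term’s most frequent neighbours, which is roughly what hovering over a node in Links shows. The helper function below is hypothetical, not part of Voyant.

```python
# Continues from the pair_counts Counter built in the previous sketch.
from collections import Counter

def top_collocates(term, pair_counts, n=10):
    """Most frequent words appearing within the window around `term`."""
    neighbours = Counter()
    for (left, right), count in pair_counts.items():
        if left == term:
            neighbours[right] += count
        elif right == term:
            neighbours[left] += count
    return neighbours.most_common(n)

print(top_collocates("man", pair_counts))
print(top_collocates("woman", pair_counts))
```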

I came into this exercise without any particular research question in mind. Instead, I took an exploratory rather than a hypothesis-driven approach, which Graham, Milligan and Weingart describe in Exploring Big Historical Data (236). Without really knowing what I was looking for, I let the data speak to me rather than searching for a specific answer. As the three authors discuss, the exploratory method can be limiting: I spent a substantial amount of time playing around with the visuals and settings just trying to find anything noteworthy.


Although I did not make any earth-shattering discoveries with these data visualizations, this exercise did help familiarize me with some new digital tools. I now have some basic text analysis skills to use on my next historical research project.


Citations:


Graham, Shawn, Ian Milligan, and Scott Weingart. Exploring Big Historical Data: The Historian's Macroscope. London: Imperial College Press, 2016.


Kulkarni, Rohit. A Million News Headlines (2017). abcnews-date-text.csv.zip. doi:10.7910/DVN/SYBGZL. Retrieved from https://www.kaggle.com/therohk/million-headlines


Sinclair, Stefan, and Geoffrey Rockwell. Voyant Tools. Accessed November 14, 2018. http://voyant-tools.org
