Big-Data Analytics and Text Mining Course: Now on campus at AU in fall 2019 (ITEC 696)

In the fall 2019 semester, my big data analytics and text mining course will be available on campus. I’ve taught this course online at SIS for the past several years; starting in fall 2019, it will be offered by the Kogod School of Business (KSB) on Wednesday evenings from 5:30 to 8:00 p.m.

In this course, you will learn how to manage a text mining project using the CRISP-DM approach, and how to use R, RStudio, and a wide variety of free R packages to collect, analyze, and visualize large-scale textual data. In addition to learning the most popular R package for text mining (tm), you will also learn a powerful overall approach to data analytics in R using a collection of packages known as the tidyverse.

The course starts by developing a theoretical background in text mining and an understanding of how it fits into the broader data science community. You will then learn how to manage text mining projects using RStudio, and how to write readable and repeatable code. We will review using R for statistical analysis, and then move to using R for text mining. Using various R packages, such as rvest, you will learn how to collect unstructured text-based data from websites. You will learn how to pre-process text for analysis. In addition, you will learn the basics of social media mining and social network analysis in R (using twitteR, rtweet, and sna), as well as dictionary development and sentiment analysis.
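As a rough illustration of the pre-processing, word-counting, and dictionary-based sentiment steps mentioned above, here is a minimal sketch. It is written in Python purely for illustration (the course itself works in R), and the stop-word list, sentiment dictionary, and sample sentences are made-up assumptions rather than course material.

```python
import re
from collections import Counter

# Hypothetical stop-word list and sentiment dictionary, invented for this sketch.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "but", "of", "to", "this"}
SENTIMENT = {"good": 1, "great": 1, "helpful": 1,
             "bad": -1, "poor": -1, "confusing": -1}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def sentiment_score(tokens):
    """Sum the dictionary weights of the tokens; unknown words count as 0."""
    return sum(SENTIMENT.get(t, 0) for t in tokens)


# Made-up sample documents.
docs = [
    "The workshop was great and the examples were helpful.",
    "This report is confusing, but the data is good.",
]

for doc in docs:
    tokens = preprocess(doc)
    print(bag_of_words(tokens).most_common(3), sentiment_score(tokens))
```

A real project would replace the toy dictionary with a carefully developed one, since the scores are only as good as the word list behind them.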

After your mid-term project progress report, you will learn a variety of more advanced data analytics techniques, including unsupervised machine learning techniques such as clustering and topic modeling, as well as classification and predictive modeling. You will learn some advanced data visualization techniques, including how to use Shiny, R Markdown, and RPubs. You will end the semester by moving from our statistical “bag of words” approach to text mining into Natural Language Processing (using openNLP and cleanNLP), with a focus on named entity recognition (NER). Finally, you will learn how to run R in a cluster computing environment by running R projects on Zorro, American University’s High-Performance Computer (HPC). For more information, please write to me for the draft syllabus.


HICSS 2017 Tutorial on Big Data and Text Mining Challenges and Opportunities

On Wednesday, 4 January 2017, Dr. Normand Peladeau, CEO of Provalis Research, and Dr. Derrick Cogburn, Associate Professor of International Communication and Development and Executive Director of the Institute on Disability and Public Policy at American University, convened the 3rd iteration of the HICSS Tutorial on Big Data Analytics and Text Mining Challenges and Opportunities. This tutorial, held as part of the 50th anniversary of the prestigious Hawaii International Conference on System Sciences (HICSS), had 71 registered participants, including 18 doctoral students, for the half-day event. Dr. Cogburn opened the tutorial by talking about its origins in research questions emerging out of the Global Virtual Teams mini-track he co-chairs with Dr. Mike Hine, Associate Professor at the Sprott School of Business at Carleton University. Dr. Cogburn also talked about how the Tutorial has evolved to include Dr. Peladeau, and how it sparked the creation of a new HICSS mini-track on Text Mining, co-chaired by Dr. Cogburn and Dr. Hine. A PDF of the slide deck from the Tutorial is available here, and some photos are below. For next year’s HICSS conference, Dr. Peladeau will be joining the co-chairing team for the Text Mining mini-track, and will again convene the Tutorial with Dr. Cogburn. We hope all of you will join us by planning to submit papers, volunteering to review, and participating in the conference next year. More information about these activities over the course of the year can be found here on this blog.

Big Data Analytics and Text Mining in the Social Sciences

This site is designed to help researchers understand the opportunities and challenges of “Big Data Analytics” in International Affairs research by introducing some of the tools and techniques used to analyze large-scale unstructured textual data. These approaches are applicable to a range of social science research topics, such as identifying: core themes in State Department blog posts; sentiment and affect in Twitter feeds; emerging areas of concern or interest on email lists; similarities and differences in national reports on international treaty commitments. While the concept of Big Data is relative to each field, as much as 75-80% of available data is unstructured text, making it perhaps the largest single data source for the modern social science investigator. Data of this type includes: email archives, websites, Twitter feeds and other social media, blog posts, speeches, annual reports, published articles, and much more. In the aggregate, these sources can easily run into thousands or millions of discrete textual items, yet amount to perhaps only gigabytes in file size. Textual data at this size and scale is particularly challenging for the analyst using only traditional forms of content analysis, and is challenging even for scholars using Computer-Assisted Qualitative Data Analysis Software (CAQDAS) tools.