Methodology

Harvesting

AIUNT is designed to continuously harvest articles and comments from selected online sources, with updates run weekly. Harvested data are extracted into a structured format (JSON) and stored in a graph database. Wherever possible, the original categorization and tagging are also transferred into the database.

Word Processing

All original content passes through a word processing sequence: TM and OpenNLP for preprocessing, and WEKA for extracting n-grams. Unigrams that meet the minimum frequency threshold (0.003%) are passed to the graph database as a number of instances per individual article and, separately, per article's comments, keeping the reference to the original article or comment (including date, author, full text content, URL, etc.). Bigrams and trigrams are processed separately, with approximately the same threshold applied. The same threshold is used for initial bulk harvests and for weekly updates; subsequent pruning then removes very infrequent words (typos and the like).
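The thresholding step can be illustrated with a minimal Python sketch. It is not the TM/OpenNLP/WEKA pipeline itself. The tokenizer, the function names, and the reading of the threshold as a share of all n-gram instances in the corpus are assumptions made for illustration.

```python
from collections import Counter
import re

# 0.003% expressed as a fraction (assumed to be relative to all n-gram instances).
MIN_FREQUENCY = 0.003 / 100


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberate simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(documents, n=1, min_frequency=MIN_FREQUENCY):
    """Map each n-gram that meets the threshold to its per-document counts.

    documents: {doc_id: text}, where doc_id stands in for the article or
    comment reference kept in the database.
    """
    per_doc = {doc_id: Counter(ngrams(tokenize(text), n))
               for doc_id, text in documents.items()}
    totals = Counter()
    for counts in per_doc.values():
        totals.update(counts)
    corpus_size = sum(totals.values())
    if corpus_size == 0:
        return {}
    kept = [g for g, c in totals.items() if c / corpus_size >= min_frequency]
    return {g: {doc_id: counts[g]
                for doc_id, counts in per_doc.items() if g in counts}
            for g in sorted(kept)}
```

Calling `frequent_ngrams(docs, n=2)` or `n=3` applies the same logic to bigrams and trigrams, mirroring the separate passes described above.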

Source Tagging

Source articles and comments inherit tagging from their original platforms, which is then mapped into seven basic sections (politics, business, editorial pages, local news, art and culture, sport, and other).
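A mapping of this kind might look like the following Python sketch. The seven section names come from the text above; the platform tags in the lookup table and the rule that the first recognized tag wins are hypothetical, since the actual mapping is not documented here.

```python
SECTIONS = ("politics", "business", "editorial pages", "local news",
            "art and culture", "sport", "other")

# Hypothetical platform tags; the real lookup table is platform-specific.
TAG_TO_SECTION = {
    "elections": "politics",
    "markets": "business",
    "opinion": "editorial pages",
    "city": "local news",
    "film": "art and culture",
    "football": "sport",
}


def map_section(platform_tags):
    """Return the basic section for the first recognized tag, else 'other'."""
    for tag in platform_tags:
        section = TAG_TO_SECTION.get(tag.strip().lower())
        if section is not None:
            return section
    return "other"
```

Falling back to "other" keeps every article and comment assigned to exactly one section, even when a platform introduces a tag the table does not know.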

Data Visualizations

Word frequencies are expressed as the rounded number of instances per 100,000 words, relative to the selected source, so that results display as integers across the full range of statistical significance.
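As a worked example of the scale: a term at the 0.003% harvesting threshold corresponds to 3 instances per 100,000 words, and a term at the 0.001% pruning threshold corresponds to 1. A minimal Python sketch of the normalization follows; the function name is illustrative, and Python's built-in rounding is an assumption, since the exact rounding rule is not specified.

```python
def per_100k(instances, total_words):
    """Scale a raw count to a rounded number of instances per 100,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return round(instances * 100_000 / total_words)
```

For instance, 1,234 instances in a source of 2,500,000 words give 49.36 per 100,000, which is displayed as 49.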

Co-occurrences of words are extracted on the article/discussion level and visualized through a chord diagram, which displays the co-appearance in both directions for every pair of words. The co-occurrence matrix is calculated for the 15 most frequent words, after filtering out the most common English words.
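The matrix behind such a chord diagram could be built roughly as in the Python sketch below. It assumes that each article or discussion is already tokenized, that a co-occurrence is counted once per document containing both words, and that both directions of a pair receive the same count; the actual weighting is not specified. The function name and the stopword argument are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_matrix(documents, stopwords=frozenset(), top_n=15):
    """Return (words, matrix) for the top_n most frequent non-stopwords.

    documents: list of token lists, one per article or discussion.
    matrix[i][j] is the number of documents containing both words[i]
    and words[j]; the diagonal stays at zero.
    """
    totals = Counter(t for tokens in documents for t in tokens
                     if t not in stopwords)
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, _ in ranked[:top_n]]
    index = {w: i for i, w in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for tokens in documents:
        present = sorted(set(tokens).intersection(index))
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return words, matrix
```

A square matrix in this shape is the input that d3.js chord layouts expect, with one row and one column per word.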

Auto-complete lists terms that appeared in at least one of the selected platforms and at least one of the content types (articles or comments).
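The selection rule can be sketched in Python as a union over the chosen platforms and content types. The index layout, the function name, and the prefix matching are assumptions for illustration, not a description of the actual storage.

```python
def autocomplete_terms(index, platforms, content_types, prefix=""):
    """List terms seen in any selected (platform, content type) combination.

    index: {(platform, content_type): set of terms}, a hypothetical layout.
    """
    terms = set()
    for platform in platforms:
        for content_type in content_types:
            terms |= index.get((platform, content_type), set())
    return sorted(t for t in terms if t.startswith(prefix))
```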

All visualizations are designed for customization and can be embedded in their customized form. All are built on the open-source d3.js library.

Caching

Platform-level data (instance counts per n-gram) are cached to facilitate quick data overviews. Cache updates are triggered automatically after each weekly harvesting and word processing update.

Additional Filtering

For general top lists, additional filtering is applied to remove the 100 most frequent words in English.

Once a month, the list of n-grams available in auto-complete forms is pruned to remove very infrequent terms (under 0.001% in the overall sample).
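The monthly pruning rule amounts to a simple filter, sketched here in Python. The sketch assumes that "overall sample" means the total number of n-gram instances across all sources; the function and parameter names are illustrative. The 0.001% threshold is written as 1 instance per 100,000 so that the comparison uses exact integer arithmetic.

```python
def prune_terms(term_counts, total_instances, min_per_100k=1):
    """Drop terms whose share of the sample is under min_per_100k / 100,000.

    term_counts: {term: instance count in the overall sample}.
    The default of 1 per 100,000 corresponds to the 0.001% threshold.
    """
    return {term: count for term, count in term_counts.items()
            if count * 100_000 >= min_per_100k * total_instances}
```

With a sample of 1,000,000 instances, a term needs at least 10 instances to stay in the auto-complete list, so a one-off typo would be removed.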