Implementation — Dashboard Summarization Using NLG

Shobha Deepthi V
Technology and Trends
5 min read · Jun 16, 2021


In my previous post, “Introduction — Dashboard Summarization Using NLG”, I discussed at length how NLG can be used in dashboard summarization, the NLG process, and its benefits and business value. In this post I will show how to install an NLG library and walk through an example implementation that converts structured data into unstructured data, i.e., text.

NLG Approaches

NLG implementations can be broadly classified into two approaches: one using templates, and the other using dynamic creation of documents (which I will refer to as Advanced NLG). In spite of all the research into the latter, the results from dynamic creation are still not fully satisfactory. The implementation in this post is therefore limited to template-based NLG.

Template-based NLG

Template-based NLG is the simplest approach, but it can get cumbersome because the developer must code for every possible scenario. It is one of the most widely used approaches: a pre-defined document structure contains gaps that are filled in with text generated from structured data. There are three levels to it.

  1. In Level 1, a simple gap-filling approach is used: usually a Word document with predefined gaps that are filled in from the data.

  2. In Level 2, templating and scripting languages such as Tornado and Python are used. Templates are embedded inside a scripting language that supports coding logic or business rules, but this approach lacks linguistic capabilities.

  3. In Level 3, grammar is added on top of Level 2 to handle punctuation, tense, and prepositions.
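As a rough sketch of the difference between the levels, here are plain Python stand-ins (the data and wording are made up for illustration):

```python
from string import Template

row = {"name": "James Stewart", "rating": 0.98}

# Level 1: pure gap-filling, like a mail merge --
# fixed text with named gaps filled from the data.
level1 = Template("$name has a rating of $rating.")
print(level1.substitute(row))

# Level 2: the template sits inside scripting logic,
# so business rules can change the wording.
verdict = "excellent" if row["rating"] >= 0.9 else "average"
level2 = f"{row['name']} has an {verdict} rating of {row['rating']}."
print(level2)

# Level 3 would additionally apply grammar rules
# (articles, tense, prepositions, punctuation) to the output.
```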

Advanced NLG

This approach uses machine learning to generate text from structured or unstructured data. Algorithms such as Markov chains, RNNs, LSTMs, and Transformers are used to dynamically generate sentences or documents.

  1. Dynamic Sentence Creation — Sentences are dynamically created by the system without needing the developer to explicitly write code for every boundary case. It also allows the system to linguistically “optimise” sentences in a number of ways, including reference, aggregation, ordering, and connectives.
  2. Dynamic Document Creation — This extends dynamic sentence creation: entire documents are generated that are far more structured and form a narrative.
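To make the Markov-chain idea concrete, here is a toy bigram generator (a deliberately tiny sketch with a made-up corpus, nothing like a production model):

```python
import random
from collections import defaultdict

corpus = ("the movie was great and the cast was great "
          "and the movie was long").split()

# Bigram model: map each word to the words observed after it.
model = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    model[prev].append(nxt)

def generate(start, n, seed=0):
    """Walk the chain up to n words, sampling a successor each step."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < n:
        successors = model.get(words[-1])
        if not successors:  # dead end: this word was never followed
            break
        words.append(rng.choice(successors))
    return " ".join(words)

print(generate("the", 6))
```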

NLG Tools

Open-source

  • SimpleNLG — Java based API for NLG
  • Gramex — Python library for NLG
  • NaturalOWL — Low-code tool for generating text from OWL classes

Commercial

  • Wordsmith
  • Arria
  • Ax Semantics (SaaS)
  • Yseop
  • Quill
  • Phrazor
  • textengine.io (SaaS)
  • Automated Insights

Also, cloud providers such as Amazon, Google, IBM, and Microsoft offer NLG as part of their cognitive services.

Installation

Operating System : Red Hat Linux 7
Python Version   : 3.7
NLG Library Used : gramex

1. Install conda

If you do not have conda installed already, follow the official conda installation guide.

2. Create conda environment by name ‘gramex’ for NLG

conda create -y --name gramex python=3.7

3. Source new conda environment

source activate gramex

4. Install gramex

pip install gramex
pip install nlg

5. Install nodejs

conda install -c conda-forge nodejs

6. Install spacy

pip install spacy
python -m spacy download en_core_web_sm

Example

Import the necessary libraries.

Load the data into a pandas dataframe.
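The original code snippets were embedded images and are not reproduced here, so the following is a minimal stand-in. The column names (name, category, rating) and the values are assumptions based on the narrative, and the `_sort` key mirrors the Gramex FormHandler convention described below:

```python
import pandas as pd

# Hypothetical stand-in for the actors dataset used in the walkthrough.
df = pd.DataFrame({
    "name":     ["James Stewart", "Bettie Davis", "Marlon Brando"],
    "category": ["actor", "actress", "actor"],
    "rating":   [0.98, 0.91, 0.75],
})

# sort_args: key = operation, value = list of columns;
# a leading '-' means sort descending on that column.
sort_args = {"_sort": ["-rating"]}

# The same sort expressed in plain pandas:
cols = [c.lstrip("-") for c in sort_args["_sort"]]
ascending = [not c.startswith("-") for c in sort_args["_sort"]]
sorted_df = df.sort_values(cols, ascending=ascending)
print(sorted_df)
```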

Let’s sort the dataframe and create Gramex filters. Here, sort_args is a dictionary whose key is the operation and whose value is a list of columns. A ‘-’ sign before the rating column name indicates that the dataframe should be sorted in descending order of that column.

Let’s define the document structure, i.e., the text content we want to generate. The templatize method takes three parameters, in this order: the text describing the insight, a dictionary defining the operations to perform on the dataset, and finally the dataframe. It returns a nugget object.
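Running `templatize` itself requires the gramex `nlg` package; to show the mechanics without it, here is a hedged pandas mimic of what rendering such a nugget amounts to (the exact template gramex generates will differ, and the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name":     ["James Stewart", "Bettie Davis"],
    "category": ["actor", "actress"],
    "rating":   [0.98, 0.91],
})

def render_nugget(df, sort_cols):
    """Mimic rendering: apply the sort operations, then fill the
    template gaps from the top row of the sorted dataframe."""
    cols = [c.lstrip("-") for c in sort_cols]
    ascending = [not c.startswith("-") for c in sort_cols]
    top = df.sort_values(cols, ascending=ascending).iloc[0]
    # Note: "actor" is still hard-coded here -- the article fixes
    # this later by templatizing it against the category column.
    return f"{top['name']} is the actor with the highest rating."

print(render_nugget(df, ["-rating"]))
```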

Let’s look at what a nugget looks like. The last entry in the nugget is the templatized text (it uses Tornado templates).

Let’s render this nugget. Voila! The output exactly matches the text we provided, but it is now generated by the system.

Now suppose we have a new dataframe, as below, and let’s render the nugget with it.

We see that the text is generated but is not syntactically correct: Bettie is an actress, and the generated text fails to reflect that. We can also see that the dataframe already has a column that can be used to templatize the word “actor” in the input text.

Words in the input text, when templatized, are treated as variables derived from the dataset. The nugget below lists these variables: we can see that the name “James Stewart” and the word “actor” are recognized as variables. But the formula for the “actor” variable is incorrect; in the sorted dataframe, row 0 of the category column holds the value for “actor”. Let’s change that.

Now the variable “actor” looks at the category column value of the first row.

We can further templatize the word “rating”, since it is also a column name in the dataset. To do so, add a new variable specifying which word to templatize and the expression that picks the column name from the dataframe.

Now let’s test sorting the dataframe by a new column.
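Continuing the same pandas stand-in, the corrected variable formulas behave roughly like this: “actor” is read from row 0 of the category column and “rating” from the name of the sort column, so sorting by a new column changes both the subject and the metric in the sentence (data and wording are illustrative):

```python
import pandas as pd

def render_nugget(df, sort_cols):
    col = sort_cols[0].lstrip("-")
    top = df.sort_values(col, ascending=not sort_cols[0].startswith("-")).iloc[0]
    # "actor" now comes from the top row's category column and
    # "rating" from the sorted column's name.
    return f"{top['name']} is the {top['category']} with the highest {col}."

df = pd.DataFrame({
    "name":     ["James Stewart", "Bettie Davis"],
    "category": ["actor", "actress"],
    "rating":   [0.98, 0.91],
    "votes":    [120, 350],
})

print(render_nugget(df, ["-rating"]))  # top by rating: the actor
print(render_nugget(df, ["-votes"]))   # top by votes: the actress
```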


Senior Data Scientist, Cloud Solutions Architect @Cisco