Contributed by Dave Sanderson
Data isn’t much use by itself. To have meaning, data needs context and narrative. These critical components are lacking in most spreadsheets and dashboards, which restricts the depth of understanding and ultimately slows down decisions.
We believe that Data Storytelling is the most effective way to give data meaning, connect the dots, and release the potential of data. It is the most effective way to communicate valuable insights that would otherwise live as numbers in an Excel spreadsheet.
Good data stories are created by people who have a deep understanding of a problem and use data analytics and visualisation to surface the information with clarity. These people don’t just share data, they share stories. The stories they share have impact and clarity, and they drive results. But data storytelling is hard, and time-consuming for humans to manage alone.
A key component in Data Storytelling is the English language. Including English sentences that describe data, along with visualisation, provides far greater clarity and context than data alone.
While companies generally have great Data Storytellers for the most important analysis, they have difficulty scaling this and delivering high-quality stories consistently. It’s time-consuming enough to extract, clean and analyse the data, and it requires special skills to visualise data well. It’s even harder to describe it to decision makers without complicating it. We all wish we had many more Data Storytellers, but this is not feasible. Even if we did, it would still be a slow process. The result is that we too often fall back on just sharing data and spreadsheets and relying on dashboards.
Studies have shown that Text is understood much faster than data alone.
To create compelling Data Stories at scale, we developed a Natural Language Generation (NLG) Engine focused on building English language observations and insights that describe raw data. I want to share our experience and some of our learnings in this article. During our research, it’s been fascinating to learn about the English language and how it can be used to add value to data. Lots of value. What we didn’t expect was that the challenge lies not only in the technical process of generating English, but also in the grammar and content nuances that change the way readers perceive the accuracy, credibility, and insightfulness of statements.
Below are some examples of where NLG can be applied in visualisation. We use NLG to improve our automated Data Stories by augmenting visualisations with the English language to provide context and increase information cognition.
For example, simple headlines go a long way in helping users understand what the visualisation is about. We call these Smart Captions.
The English language can be used in many other ways to increase clarity, including Annotations. Visualisations can only go so far before they become complex. As you add more data points to a presentation, you tend to end up with a Table or a Spreadsheet. These are efficient, but also a sure way to dilute the insight.
Annotations are a great tool for adding clarity to a visualisation. Good analysts do this well: we see annotations scribbled all over our Nugit Reports. In the past, though, these have been created manually, which is slow. To automate storytelling, we need to generate them automatically. Here is an example of NLG adding context to a simple bar chart.
Annotations are more difficult to automate because the statements often use different data from what is represented on the visualisation, such as a change log or external event calendar. Presentation-wise, our designer added a nice touch by setting the NLG text for this use case in a handwriting font, which helps it feel human-written.
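To make the idea of annotating from a separate data source concrete, here is a minimal sketch in Python. It is not how Nugit's engine works; the data, the `annotate` function and the weekly matching rule are all assumptions made up for illustration. It simply attaches each entry from a hypothetical change log to the bar whose week contains the event date.

```python
from datetime import date

# Hypothetical weekly values shown on a bar chart (week start -> clicks).
chart_points = {
    date(2017, 11, 6): 1200,
    date(2017, 11, 13): 1850,
    date(2017, 11, 20): 1300,
}

# Hypothetical change log kept outside the chart's own data.
change_log = [
    (date(2017, 11, 14), "New ad creative launched"),
    (date(2017, 12, 1), "Budget reduced"),
]


def annotate(points, events, window_days=7):
    """Attach each event to the bar whose week contains the event date.

    Events that fall outside every charted week are silently skipped.
    """
    annotations = []
    for event_date, note in events:
        for week_start, value in points.items():
            offset = (event_date - week_start).days
            if 0 <= offset < window_days:
                text = f"{note} ({value:,} clicks this week)"
                annotations.append((week_start, text))
    return annotations


for week_start, text in annotate(chart_points, change_log):
    print(week_start.isoformat(), text)
```

Only the first event lands on a charted week, so a single annotation is produced. A real system would also need to decide which events are worth showing at all, which is the harder part.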
And finally, the shortest but one of the most effective ways to use the English language in storytelling is the Headline or Email Subject, usually consisting of 1 line that communicates a clear highlight and gets the reader’s attention.
Building a Natural Language Generation Engine
Our starting point when developing NLG for Data Storytelling was to first research how Human Analysts annotate data, and the kinds of observations they make.
After reviewing hundreds of statements made by analysts while using Nugit’s platform, as well as looking at manual reports and data summaries shared by our customers, we worked out that we can broadly classify the kinds of language used to explain data into five main groups:
- Peer Group Comparisons
- Highlighting key data points. For example, calling out a maximum value or an anomaly.
ROI peaked during the second week of November
- Trend descriptions
Organic Search has continued an upward trend over the last 4 weeks, driven by an increase in Google traffic.
- Forward-looking statements
Revenue is expected to grow 15% in November vs. the Previous Month, driven by additional Paid Search Investment and better than average SEO.
LinkedIn Advertising is driving 3x better Cost Per Conversion than Facebook Ads. A $500 additional investment in this channel could deliver 32 additional leads per week.
After reviewing hundreds of human analysts’ statements, we noticed that most English language descriptions contain one of the above statement types, or a combination of them. Used alongside visualisations, these descriptions can significantly increase cognition and understanding of data.
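As a rough illustration of how statement types like these could be produced from raw numbers, the sketch below uses simple string templates in Python. This is an assumption for illustration only and not a description of Nugbot: the function names, the input shapes and the rule for detecting a trend are all hypothetical.

```python
def highlight_peak(metric, series):
    """Key data point: call out the period with the maximum value."""
    period = max(series, key=series.get)
    return f"{metric} peaked during {period}"


def describe_trend(metric, values):
    """Trend: report a direction only when every step moves the same way."""
    if len(values) < 2:
        raise ValueError("need at least two values to describe a trend")
    steps = [b - a for a, b in zip(values, values[1:])]
    if all(s > 0 for s in steps):
        direction = "an upward"
    elif all(s < 0 for s in steps):
        direction = "a downward"
    else:
        return f"{metric} showed no consistent trend over the last {len(values)} weeks"
    return f"{metric} has continued {direction} trend over the last {len(values)} weeks"


def compare_peers(metric, a_name, a_value, b_name, b_value):
    """Peer comparison for a lower-is-better metric such as cost per conversion."""
    ratio = b_value / a_value
    return f"{a_name} is driving {ratio:.0f}x better {metric} than {b_name}"


# Hypothetical input values chosen to reproduce the example sentences above.
roi_by_week = {
    "the first week of November": 2.1,
    "the second week of November": 3.4,
    "the third week of November": 2.8,
}
print(highlight_peak("ROI", roi_by_week))
print(describe_trend("Organic Search", [100, 120, 150, 180]))
print(compare_peers("Cost Per Conversion", "LinkedIn Advertising", 10.0, "Facebook Ads", 30.0))
```

Templates like these cover the descriptive part of a statement. The explanatory clauses in the examples, such as "driven by an increase in Google traffic", require additional analysis that this sketch does not attempt.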
After iterating on the technology platform and building out our first completely automated statements, we needed a framework for measuring the quality of the NLG engine that we could quantify and measure our ongoing progress against.
NLG Engine Measurement
We explored various metrics for understanding our progress, and settled on 4 indicators that could be measured on a 4-point scale.
To collect data, we produced 2 sets of Visualisations that combined data and the English language and collected responses from a panel of marketers.
Set 1 was authored by Human Experts, and Set 2 produced automatically by our NLG engine, Nugbot. We collected 2,420 responses to samples including 550 responses to Nugbot text, and 1,870 responses to our human writers of various experience levels. The results were surprising!
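For readers who want to run a similar comparison, the sketch below shows one way to turn raw panel responses into percentages per author type. It is a minimal Python illustration under stated assumptions: the response labels and the toy sample are made up and are not the survey data. The only figures taken from the text are the reported response counts, which the code checks for consistency.

```python
from collections import Counter

# Response counts as reported in the text above.
REPORTED = {"nugbot": 550, "human": 1870}
assert sum(REPORTED.values()) == 2420  # matches the stated total


def share_by_label(responses):
    """Return {author: {label: percentage}} from (author, label) pairs."""
    counts = {}
    for author, label in responses:
        counts.setdefault(author, Counter())[label] += 1
    return {
        author: {
            label: round(100 * n / sum(c.values()), 1)
            for label, n in c.items()
        }
        for author, c in counts.items()
    }


# Made-up toy responses, NOT the survey data; the labels are hypothetical.
sample = [
    ("human", "human"), ("human", "robotic"),
    ("human", "human"), ("human", "unsure"),
    ("nugbot", "human"), ("nugbot", "robotic"),
]
shares = share_by_label(sample)
print(shares)
```

Reporting shares rather than raw counts matters here because the two sets received very different numbers of responses.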
The first indicator is one you would expect Humans to do well at. We asked the question,
Is the text description more likely to be written by a human analyst, or generated by automation?
Our human-created stories were often considered robotic. Nugbot 1, Human 0. 16% of the statements made by Humans were perceived as Robotic, vs. only 10% for our NLG engine. We spend a lot of effort on making Nugbot sound more human, yet Expert humans are also perceived as making robotic statements. As we looked deeper into the types of statements considered human vs. robotic, we found that more complex sentences were usually considered Human, while very simple one-line statements were tagged as either robotic or unsure. This learning helps us focus on this perception in future iterations of our engine.
Insightfulness in data stories
How valuable is the text description in improving understanding of the visualisation?
Surprisingly, around half of the Human-generated text was considered basic, vs. about 30% of the statements generated by Nugbot. The observations rated most insightful were statements that described peer group comparisons. Pointing out trends or discussing specific points in the data was considered simple and basic. I don’t believe there’s a right or wrong response here, as long as the statements help to highlight information you want to ensure the reader picks up.
Is the statement perceived to be factually accurate by the reader? 70% of human statements were perceived to be accurate, whereas Nugbot’s statements were slightly lower at 63%.
While we know that none of the statements were mathematically inaccurate, readers believed the following kinds of statements to be inaccurate:
- Complex calculations that grouped a lot of the visible data in ways that would be difficult to verify without a calculator.
- Statements that were confusing, or where there was a grammatical error.
We should keep this in mind whenever we are tempted to be too fancy or complicated when describing data, and be conscious that this complexity affects the credibility of the observations.
The final indicator was grammar: the quality of the language used in the text description.
Our 4 human experts showed varying levels of grammar. They are data analysts, not English majors, so they make mistakes. People also have short-hand ways to describe data, which are acceptable but perhaps not grammatically correct.
Nugbot produced a slightly higher percentage of statements rated “Great” and fewer rated “Poor”. However, this rating is somewhat subjective, given the short-hand statements. We also found that Grammar did not play a role in how insightful or accurate a statement was perceived to be.
What’s next for data stories?
At this point, there is a lot of value that intelligent technology can deliver in the sharing of information, particularly data. NLG can play a massive role in bringing together the things that typically get lost in data sharing using Dashboards and Spreadsheets. We can re-humanize the process and make it easier for Readers who rely on understanding information to make data-driven decisions.
As a next step, we have some exciting software releases planned for early 2018 that scale our NLG architecture across the entire library of data platforms that Nugit analyses to generate Stories. These improvements will build on the learnings so far, continuously improving clarity so that Nugbot ranks among the best Human Data Storytellers and helps them share their data stories.
This post was originally shared on LinkedIn.
About Dave Sanderson
Dave established Nugit after 13 years of experience leading Digital Marketing teams in Sydney, Hong Kong, and Singapore. While running digital campaigns for some of the largest brands in the region, he grew frustrated by the limitations of dashboards and spreadsheets, and by the amount of data that was wasted. Quitting the corporate life to solve this problem with a new approach, he spent the first few months building Nugit on the beach in Malaysia, where his vision of the Nugit Data Storytelling platform materialised. He’s passionate about using AI technology to do tasks previously reserved for humans and to enable data analytics to scale infinitely in the form of stories that have a proper narrative, context, and purpose. Leading a team of 40+ data scientists, designers, engineers and storytellers, Dave’s goal is to build a world-class SaaS company born in Singapore.