Exploring the Text Analysis Features of SAP HANA


Text Analysis is an advanced feature in SAP HANA that analyzes every part of unstructured text, be it words, sentiments, or semantics. The analyzed data is stored in indexed tables, which makes it easy to understand and categorize. Text Analysis is broadly classified into:

  • Linganalysis or Linguistic Analysis, and
  • Entity and Fact Extraction

The analysis is triggered by creating a full-text index on the table. The index can be created on an existing table or while a new table is being created; in both cases, a new table containing the linguistic analysis results is generated. Based on this, there are two types of full-text index creation:

  •  Explicit Full-Text Index – The index is created manually, after the table already exists. Creating it on a specific column generates a hidden indexed column along with a result table that is usually queried. The name of this result table is $TA_<index_name>, and it is found in the same schema as the source table. The code used is:
   CREATE FULLTEXT INDEX <index_name>
      ON <schema_name>.<table_name> (<column_name>)
      TEXT ANALYSIS ON;
  • Implicit Full-Text Index – The index is created during table creation. In this case, the table must have a TEXT, SHORTTEXT, or BINTEXT column on which the text analysis will be performed:
   CREATE COLUMN TABLE <schema_name>.<table_name>
      (
         ID INTEGER PRIMARY KEY,
         STRING SHORTTEXT(200)
            TEXT ANALYSIS ON
            ASYNC
      );

Note: We cannot directly manipulate this index. Its name also starts with $TA_, but the rest is system-generated.
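As a quick sketch (all object names here are illustrative, not prescribed by SAP), an explicit index created with text analysis enabled can be inspected by querying the generated $TA_ table directly:

   CREATE FULLTEXT INDEX REVIEW_IDX
      ON MYSCHEMA.REVIEWS (REVIEW_TEXT)
      TEXT ANALYSIS ON;

   -- The analysis results are written to a generated table named $TA_REVIEW_IDX
   -- in the same schema as the source table
   SELECT * FROM MYSCHEMA."$TA_REVIEW_IDX";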

Coming to the linguistic analysis (linganalysis), it comes in three basic varieties, namely:

  • Linganalysis Basic
  • Linganalysis Stem
  • Linganalysis Full

SAP defines three core processes that form the crux of linguistic analysis. They are discussed below:

  • Tokenization: This involves breaking the sentence into words or tokens.
  • Stemming/Normalization: Derives the dictionary word (stem) from which the particular word/token originates.
  • Part-of-Speech Tagging: Tags whether the token is a noun, pronoun, adjective, verb, adverb, or any other part of speech.

Linganalysis Basic does only tokenization: the TA_TOKEN column is populated with the individual words, while the TA_TYPE, TA_STEM, and TA_NORMALIZED columns do not hold valid data. Linganalysis Stem does tokenization and stemming: the TA_TOKEN, TA_NORMALIZED, and TA_STEM columns are populated, where TA_TOKEN has the tokens/words, TA_NORMALIZED has the normalized form of the word, and TA_STEM has the dictionary word from which it originates.
Linganalysis Full does all three – tokenization, stemming, and tagging – so TA_TOKEN, TA_STEM, and TA_NORMALIZED are populated, as well as TA_TYPE; the last contains the part-of-speech tag.
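To see what the three variants produce, we can simply select these columns from the generated $TA_ table. A minimal sketch, reusing the illustrative index names from above; the exact values depend on the configuration chosen:

   -- With LINGANALYSIS_FULL, a token such as 'running' would typically show
   -- its normalized form in TA_NORMALIZED, the stem 'run' in TA_STEM,
   -- and a part-of-speech tag in TA_TYPE
   SELECT TA_TOKEN, TA_NORMALIZED, TA_STEM, TA_TYPE
      FROM MYSCHEMA."$TA_REVIEW_IDX"
      ORDER BY TA_COUNTER;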

When we use the linganalysis configurations, we need the following code, where the name of the configuration (LINGANALYSIS_BASIC, LINGANALYSIS_STEM, or LINGANALYSIS_FULL) is given as per our requirement:

   CREATE COLUMN TABLE <schema_name>.<table_name>
      (
         ID INTEGER PRIMARY KEY,
         STRING SHORTTEXT(200)
            TEXT ANALYSIS ON
            CONFIGURATION '<configuration_name>'
            ASYNC
      );
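The same CONFIGURATION clause can also be used with an explicit full-text index on an existing table; a sketch with placeholder names:

   CREATE FULLTEXT INDEX <index_name>
      ON <schema_name>.<table_name> (<column_name>)
      CONFIGURATION 'LINGANALYSIS_STEM'
      TEXT ANALYSIS ON;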

Apart from the columns mentioned above, the following columns also provide some useful information:

  • TA_RULE – Shows which rule the token follows, whether Entity Extraction, LXP (linguistic analysis), etc.
  • TA_COUNTER – Numerically lists the tokens detected in TA_TOKEN.
  • TA_LANGUAGE – SAP HANA supports 31 languages and automatically detects the language of the text fed into the table.
  • TA_SENTENCE – Numerically lists down the sentences.
  • TA_PARAGRAPH – Numerically lists down the paragraphs.
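These columns make it easy to aggregate the analysis output. For instance, a small illustrative query (table and index names as in the earlier sketch) that counts tokens per detected language and sentence:

   SELECT TA_LANGUAGE, TA_SENTENCE, COUNT(*) AS TOKEN_COUNT
      FROM MYSCHEMA."$TA_REVIEW_IDX"
      GROUP BY TA_LANGUAGE, TA_SENTENCE
      ORDER BY TA_LANGUAGE, TA_SENTENCE;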

Here, it is worth mentioning that during language detection, if a sentence mixes words from different languages, HANA reports the language in which the majority of the words are written. For instance:

  • "Es ist tolles Wetter" → DE (German)
  • "it is great weather" → EN (English)
  • "Es ist tolles weather" → DE (German)
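If we want to restrict detection to a known set of languages, the full-text index also accepts a LANGUAGE DETECTION parameter listing the candidate languages. A sketch with placeholder names, assuming only English and German content is expected:

   CREATE FULLTEXT INDEX <index_name>
      ON <schema_name>.<table_name> (<column_name>)
      LANGUAGE DETECTION ('EN', 'DE')
      TEXT ANALYSIS ON;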

Entity and Fact Extraction is the other part of HANA Text Analysis. As the name suggests, it derives entities and meaning from the text, and sentiment based on the words used. Here we can either use the four pre-defined configurations provided by SAP or create our own customized parameter settings. The built-in ones are:

  • EXTRACTION_CORE – this extracts entities like people, places, firms, URLs, and other common items.
  • EXTRACTION_CORE_VOICEOFCUSTOMER – this is the most important one; it enables us to derive positive or negative sentiment from the data, judge the majority opinion, and draw the necessary conclusions.
  • EXTRACTION_CORE_ENTERPRISE – this provides common keywords related to organizations, mergers, and acquisitions. This helps to analyze opinions entailing the company and its competitors.
  • EXTRACTION_CORE_PUBLIC_SECTOR – this is data about public figures and communal events or days, etc.

These are found in HANA under the Contents folder, following the package path root > Hana > ta > config, and each has the file extension .hdbtextconfig.
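To use one of these, the configuration name is passed in the CONFIGURATION clause, and the extracted entities can then be read from the $TA_ table. A minimal sketch with illustrative object names; the exact TA_TYPE values (for example the various positive and negative sentiment types) depend on the configuration:

   CREATE FULLTEXT INDEX FEEDBACK_IDX
      ON MYSCHEMA.FEEDBACK (COMMENT_TEXT)
      CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
      TEXT ANALYSIS ON;

   -- List the tokens that were classified as sentiment entities
   SELECT TA_TOKEN, TA_TYPE
      FROM MYSCHEMA."$TA_FEEDBACK_IDX"
      WHERE TA_TYPE LIKE '%Sentiment%';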

If the built-in configurations are not sufficiently tailored to our needs, we can create our own. These customized configurations allow us to tweak some of the parameters or create our own custom dictionaries and rule sets. The inbuilt configurations provided by SAP are in themselves quite comprehensive, so in most cases we just expand on them based on our industry requirements. Say a specific kind of terminology is used within the company; that terminology can then be incorporated into the custom dictionary.

For example, if

  • Neenopal,
  • Neenopal Intelligent Solutions Pvt. Ltd.,
  • Neenopal Analytics, and
  • Neenopal Intelligent Solutions, etc.

all refer to the same entity (Neenopal Intelligent Solutions Pvt. Ltd.), then we can map them accordingly in a customized dictionary in HANA.

The custom rule sets are the most complicated of the customization files. Unless essential, we do not edit these.

Written by:

Samatri Pal

Head- Technology & Product - NeenOpal Analytics
