Text Analysis is an advanced feature in SAP HANA that analyses every part of an unstructured text, be it words, sentiments or semantics. The analysis results are stored in indexed tables, which makes the data easy to comprehend and to categorize. Text Analysis is broadly classified into:

• Linganalysis or Linguistic Analysis, and
• Entity and Fact Extraction

This analysis is performed by creating a full-text index on the table. The index can be created on an existing table or while creating a new table; in both cases a new table containing the linguistic analysis results is generated. Accordingly, there are two types of full-text index creation:

  • 1. Explicit Full Text Index – The index is created manually, after the table already exists. Creating the index on a specific column generates a hidden index table, which is the one usually queried. The name of this table starts with $TA_ followed by the index name, and it is found in the same schema as the base table. The code used is:

     
      CREATE FULLTEXT INDEX <index_name>
          ON <schema_name>.<table_name> (<column_name>)
          TEXT ANALYSIS ON;
    		
    
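As a concrete sketch (the schema, table, index and column names below are made up for illustration), creating an explicit index on a text column and querying the resulting $TA_ table might look like this:

```sql
-- Base table with a text column to analyse (hypothetical names).
CREATE COLUMN TABLE DEMO.REVIEWS
    (
        ID      INTEGER PRIMARY KEY,
        REVIEW  NVARCHAR(500)
    );

-- Explicitly create the full-text index with text analysis enabled.
CREATE FULLTEXT INDEX REVIEW_IDX
    ON DEMO.REVIEWS (REVIEW)
    TEXT ANALYSIS ON;

-- The analysis results land in a generated table named $TA_<index_name>.
SELECT TA_TOKEN, TA_TYPE
    FROM DEMO."$TA_REVIEW_IDX";
```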
    
  • 2. Implicit Full Text Index – The index is created automatically during table creation. In this case the table must have a TEXT, SHORTTEXT or BINTEXT column, on which the text analysis will be done.

     
     CREATE COLUMN TABLE schema_name.table_name
         (
             ID INTEGER PRIMARY KEY,
             STRING SHORTTEXT (200)
                 TEXT ANALYSIS ON
                 ASYNC
         );
    
    

    Note: We cannot directly manipulate this index. Its name also starts with $TA_, but the rest is system generated.
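One way to look up the system-generated name is via the SYS.FULLTEXT_INDEXES monitoring view (assuming read access to the SYS views; the table name in the filter is illustrative):

```sql
-- List full-text indexes, including implicitly created ones,
-- to discover the system-generated $TA_ index/table name.
SELECT SCHEMA_NAME, TABLE_NAME, INDEX_NAME
    FROM SYS.FULLTEXT_INDEXES
    WHERE TABLE_NAME = 'TABLE_NAME';
```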

Coming back to linganalysis, it comes in three basic varieties, namely:

• Linganalysis Basic
• Linganalysis Stem
• Linganalysis Full

SAP defines three core processes that form the crux of linguistic analysis. They are discussed below:

• Tokenization: This involves breaking the sentence into words or tokens.

• Stemming/Normalization: This derives the dictionary (base) form from which the particular word/token originates.

• Parts of Speech Tagging: Tags whether the token is a noun, pronoun, adjective, verb, adverb or any other part of speech.

Linganalysis Basic does only tokenization: the TA_TOKEN column is populated with the individual words, while the TA_TYPE, TA_STEM and TA_NORMALIZED columns hold no valid data. Linganalysis Stem does tokenization and stemming: TA_TOKEN, TA_STEM and TA_NORMALIZED are populated. TA_TOKEN holds the tokens/words, TA_NORMALIZED holds the normalized form of the word, and TA_STEM holds the dictionary word from which it originates.
Linganalysis Full does all three – tokenization, stemming and tagging – and populates TA_TOKEN, TA_STEM and TA_NORMALIZED as well as TA_TYPE; the last contains the part-of-speech tag.
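To make the full variant concrete, here is a sketch of creating an index with full linguistic analysis and inspecting its output (the table and index names are hypothetical, and the built-in configuration name LINGANALYSIS_FULL is assumed from SAP's standard delivery):

```sql
-- Index the text column with full linguistic analysis.
CREATE FULLTEXT INDEX DOC_IDX
    ON DEMO.DOCS (DOC)
    CONFIGURATION 'LINGANALYSIS_FULL'
    TEXT ANALYSIS ON;

-- For a token such as "running", TA_STEM/TA_NORMALIZED would hold
-- its base form, and TA_TYPE its part-of-speech tag.
SELECT TA_TOKEN, TA_NORMALIZED, TA_STEM, TA_TYPE
    FROM DEMO."$TA_DOC_IDX"
    ORDER BY TA_SENTENCE, TA_COUNTER;
```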

To use a specific linganalysis configuration, we need the following code, where the name of the configuration is given as per our requirement:

 
 CREATE COLUMN TABLE schema_name.table_name
     (
         ID INTEGER PRIMARY KEY,
         STRING SHORTTEXT (200)
             TEXT ANALYSIS ON
             CONFIGURATION '<configuration_name>'
             ASYNC
     );

Apart from the columns mentioned above, the following columns also carry some useful information:

• TA_RULE – Shows which rule was applied: Entity Extraction, LXP (Linguistic Analysis), etc.

• TA_COUNTER – Numerically lists the tokens detected in TA_TOKEN.

• TA_LANGUAGE – The detected language. SAP HANA supports 31 languages and automatically detects the language of the text fed into the table.

• TA_SENTENCE – Numerically lists the sentences.

• TA_PARAGRAPH – Numerically lists the paragraphs.
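For instance, TA_SENTENCE can be used to aggregate the analysis results, such as counting the tokens detected in each sentence (the schema and index names below are hypothetical; the $TA_ table also carries the base table's key column, here ID):

```sql
-- Count the tokens detected in each sentence of each document.
SELECT ID, TA_SENTENCE, COUNT(*) AS TOKEN_COUNT
    FROM DEMO."$TA_DOC_IDX"
    GROUP BY ID, TA_SENTENCE
    ORDER BY ID, TA_SENTENCE;
```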



Here it is worth mentioning that, in the case of language detection, if a sentence mixes words from different languages, HANA reports the language in which the majority of the words are written. For instance:

Sentence                   Language Detected
Es ist tolles Wetter       DE (German)
it is great weather        EN (English)
Es ist tolles weather      DE (German)



Entity and Fact Extraction is the other part of HANA Text Analysis. As the name suggests, it derives meaning from the text and sentiment based on the words used. Here we can use the configurations provided by SAP in the form of four pre-defined configurations, or create our own customised parameter settings. The built-in ones are:

• EXTRACTION_CORE – extracts entities like people, places, firms, URLs and other common items.

• EXTRACTION_CORE_VOICEOFCUSTOMER – the most important one; it enables us to derive positive or negative sentiment from the data, judge the majority opinion and draw the necessary conclusions.

• EXTRACTION_CORE_ENTERPRISE – provides common keywords related to organizations, mergers and acquisitions. This helps to analyse opinions about the company and its competitors.

• EXTRACTION_CORE_PUBLIC_SECTOR – extracts data about public figures, communal events and days, etc.
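As a sketch of how the Voice-of-Customer configuration might be used for sentiment analysis (the table and index names are made up, and the exact TA_TYPE values can vary by release, though types such as StrongPositiveSentiment and WeakNegativeSentiment are typical):

```sql
-- Index customer feedback with the Voice-of-Customer configuration.
CREATE FULLTEXT INDEX FEEDBACK_IDX
    ON DEMO.FEEDBACK (COMMENT_TEXT)
    CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
    TEXT ANALYSIS ON;

-- Tally positive vs. negative mentions to judge the majority opinion.
SELECT TA_TYPE, COUNT(*) AS MENTIONS
    FROM DEMO."$TA_FEEDBACK_IDX"
    WHERE TA_TYPE LIKE '%Sentiment%'
    GROUP BY TA_TYPE;
```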

These are found in HANA under the Content folder, in the package sap.hana.ta.config, and each has the extension .hdbtextconfig.

If the built-in configurations are not sufficiently tailored for our needs, then we opt for creating our own. These customized configurations allow us to tweak some of the parameters or create our own custom dictionaries and rule sets. The inbuilt configurations provided by SAP are in themselves quite robust, so in most cases we simply extend them based on our industry requirements. Say a specific kind of terminology is used within the company; that terminology can be incorporated into a custom dictionary.

For example, if
o Neenopal,
o Neenopal Intelligent Solutions Pvt. Ltd.,
o Neenopal Analytics, and
o Neenopal Intelligent Solutions
all refer to the same entity (Neenopal Intelligent Solutions Pvt. Ltd.), then we can map them to one standard form in a customized dictionary in HANA.
The custom rule sets are the most complicated of the customization files; unless essential, we do not edit them.
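A custom dictionary is an XML file with the extension .hdbtextdict. A minimal sketch of mapping the variants above to one standard form might look like the following; the category name and namespace URI are assumptions based on the usual SAP text analysis dictionary schema:

```xml
<!-- Hypothetical custom dictionary mapping name variants
     to one standard form. -->
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="ORGANIZATION">
    <entity_name standard_form="Neenopal Intelligent Solutions Pvt. Ltd.">
      <variant name="Neenopal"/>
      <variant name="Neenopal Analytics"/>
      <variant name="Neenopal Intelligent Solutions"/>
    </entity_name>
  </entity_category>
</dictionary>
```

With this in place, any of the listed variants found in the text would be extracted as the same standard entity.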