In the first article of this series, we set out to demystify the meaning of machine learning and show how it was relevant and practical to consumer insights.
This second installment does the same for natural language processing (NLP), also known as text analytics. Natural language processing is widely acknowledged as a subfield of artificial intelligence.
Its focus is on enabling computers to process and understand human language and to perform functions such as translation, semantic analysis, text classification, extraction, and summarization.
In practice, NLP draws on multiple disciplines, including computer science, statistics, and linguistics, along with significant computational power, to understand human communication.
When NLP is mentioned, most people are referring to one or more of the functions above.
You might be asking yourself: this is great, but how does it all work? How does a computer actually understand text? What kinds of processes and algorithms are applied?
You might be surprised to learn that, to get useful and accurate NLP outcomes, computer scientists, engineers, linguists, and industry-specific experts often do a significant amount of manual and not-so-sexy work to get software to perform “artificially intelligent” tasks.
A lot of manual work goes into artificial intelligence!
So, let’s say you have a million product reviews you want to analyze, or 1,000 pages of text from consumer interviews or focus groups, and you would like to extract the sentiments and/or understand the main topics being discussed.
If the NLP software you are using is any good, the first step is to clean the data. Yes, just like a messy Excel sheet of numbers, the text has to be cleaned, or at least have its “level of noise” reduced to a minimum.
This critical first step is called pre-processing, and it involves both “normalization” and “tokenization”.
Normalization involves tasks such as removing non-alphabetic characters, converting letters to lower case, removing stop words (e.g. “the,” “a,” “in”), converting numbers to words, and stemming and lemmatization. The basic function of stemming and lemmatization is similar: to reduce words to a common variant, the “stem” or “lemma”. The stem is the part of the word to which inflectional affixes such as -ed, -ize, or mis- are added; stemming sometimes produces results that are not actual words. The lemma is the base or dictionary form of the word.
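To make normalization concrete, here is a toy pipeline in Python. The stop-word list and suffix rules are deliberately tiny illustrative assumptions; real systems rely on full stemmers and lemmatizers, such as those in NLTK or spaCy.

```python
import re

# Toy stop-word list and suffix rules -- illustrative assumptions only;
# production stop-word lists and stemming rules are far more extensive.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is", "it"}
SUFFIXES = ("ization", "izing", "ing", "ized", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix -- a crude stand-in for a real
    stemmer; as noted above, the result may not be a dictionary word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    text = text.lower()                        # convert to lower case
    text = re.sub(r"[^a-z\s]", " ", text)      # drop non-alphabetic chars
    words = [w for w in text.split() if w not in STOP_WORDS]
    return [naive_stem(w) for w in words]

print(normalize("The reviewers LOVED the packaging!"))
# -> ['reviewer', 'lov', 'packag']
```

Note how “loved” becomes “lov”: a stem that is not an actual word, exactly as described above.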
Tokenization refers to segmenting the text into smaller chunks: paragraphs can be tokenized into sentences, and sentences into words or phrases. Those tokens can then be parsed and tagged with anything that is meaningful to the user, such as categories, sentiments, parts of speech, or named entities.
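A bare-bones illustration of tokenization in Python (the regular expressions here are simplifications; real tokenizers also handle abbreviations, decimals, and other punctuation edge cases):

```python
import re

def sentence_tokenize(paragraph):
    """Split a paragraph into sentences on ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def word_tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", sentence)

review = "The battery lasts all day. Shipping was slow, though!"
sentences = sentence_tokenize(review)
print(sentences)
# -> ['The battery lasts all day.', 'Shipping was slow, though!']
print(word_tokenize(sentences[0]))
# -> ['The', 'battery', 'lasts', 'all', 'day']
```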
While there are readily available libraries with code and algorithms to perform the above tasks, if you are building your own lexical analysis framework, you may need to tokenize and label your own text, either because the domain is new or because you want to improve accuracy. You can also work with a software platform that already has custom datasets built, if they are relevant to your space.
The tokenized text becomes a “golden dataset”, which is then used to “train” a statistical model that can be applied to any new text. This is where you may come across the term “supervised machine learning”.
Depending on what you are trying to achieve, there is a range of statistical models that can be applied, from logistic regression to Support Vector Machines (SVMs) to deep neural networks.
The type of statistical model chosen depends on the structure and complexity of the data, and frankly is the result of continuous experimentation to increase accuracy.
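To make the train-then-apply loop concrete, here is a minimal sketch of a word-count (Naive Bayes) sentiment classifier in plain Python. The tiny labeled “golden dataset”, the two labels, and the smoothing choice are all illustrative assumptions; a real system would use a proper library such as scikit-learn and far more training data.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per label from a small labeled 'golden dataset'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Pick the label with the highest log-probability score,
    using add-one smoothing for unseen words."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Illustrative 'golden dataset' -- real training sets are far larger.
golden = [
    ("love this product", "positive"),
    ("great value and quality", "positive"),
    ("terrible waste of money", "negative"),
    ("poor quality very disappointed", "negative"),
]
model = train(golden)
print(predict(model, "great quality product"))   # -> positive
```

The same train/predict split applies whether the model is this toy counter, a logistic regression, an SVM, or a deep neural network: only the statistics inside change.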
Hopefully you now feel a little better prepared to know what is available for your own research and, more importantly, to evaluate future solutions with a clearer understanding of the science and technology supporting them.