Chinese NLP 101

Liu Tao, Co-Founder & CTO, MioTech | 2019-05-14

In this article, we speak with MioTech's CTO about the fundamentals of Chinese NLP in a financial context, its challenges, its progress, and what's in store for this form of artificial intelligence in the near future.

What is natural language processing, and how different is it in a financial context?

Natural language processing is a form of artificial intelligence that deals with the interaction between computers and humans, enabling machines to better understand, interpret and manipulate human (natural) languages. NLP takes on many forms, but its ultimate goal is the cognition, understanding and generation of human language in a manner that is valuable.


In order for machines to derive meaning from natural language, a computer uses algorithms to extract the associations and relationships within a sentence from textual input, and from there collects and processes that information for a given purpose.


NLP is the driver behind many forms of AI you're used to seeing. Whether it is translation (Google Translate), speech recognition (Apple’s Siri, Amazon’s Alexa), or knowledge graphs and intelligent graph queries (Google Search), NLP enables technologies to understand and process human languages.


Natural language processing for the finance sector is still in the exploratory phase. Since finance itself is a highly specialized field, many words carry nuanced meanings in a financial context. From jargon to technical terms, the financial field measures, understands and processes language differently from other industries.


Classification and scoring of news sentiment in AMI, MioTech’s market intelligence tool.


Therefore, NLP that serves the financial sector requires the preparation of a specialized training data set. At present, all NLP architectures require large data sets, which leads us to the biggest challenge NLP faces in the financial field: the lack of deep, professional training data within finance.


At MioTech we apply NLP to named entity recognition (NER), relationship extraction and knowledge graph construction. By using the knowledge graph to compensate for inadequate training data, the data set can be supplemented with other data already associated within the graph.


AMI’s knowledge graph  


Where are we at with NLP and how difficult is this process?


The development of natural language processing has roughly tracked that of artificial intelligence, passing through the following stages:


  • The 1950s–80s: rule-based systems built on human experience and hand-crafted rules;
  • The 1990s to around 2000: statistical methods;
  • From 2000 until now: thanks to the substantial increase in data and computing power, deep learning methods were gradually introduced into NLP, bringing major breakthroughs in machine translation, question-answering systems and automatic summarization.

But it is important to note that natural language processing still faces many challenges. Human language is very concise and omits presumed knowledge in many conversations. Humans can easily understand the context even with these omissions, but for a machine this is a complex process. Take, for example, “I’m engaged.” Without the specific context, it is difficult for a machine to distinguish whether the speaker is engaged to another person or merely temporarily unavailable. As you can see, parsing even English with a computer is not an easy task.


How different is NLP in English as compared to Chinese? What are the challenges facing NLP Chinese as a language? What are the best methods for training Chinese NLP?


From a purely linguistic point of view, English, compared to Chinese, is more direct. By extracting just the nouns within a sentence, you can infer the lexical semantics to a large extent. Since English is a phonetic language, one can also infer intention by parsing grammar, tense, passive/active voice, affixes, and the singular/plural forms of words.


But Chinese characters are logographic, similar to hieroglyphs, meaning there is little inflection to distinguish parts of speech, and word boundaries are not marked in text. Therefore, the machine must infer the specific semantics from context. Due to this uniqueness of the Chinese language, the same NLP model generally performs better in English than in Chinese.


Word segmentation is one of the central difficulties in Chinese NLP. For an English analogy, take “A woman without her man is nothing,” which split differently becomes “A woman/ without her/ man is nothing.” Different word segmentations produce different, even contradictory, meanings.
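To make the segmentation problem concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline for Chinese segmentation. The toy dictionary and example sentence are illustrative assumptions, not MioTech's production approach; real systems use large lexicons or neural models.

```python
# Toy dictionary for illustration only; a real segmenter would load
# a lexicon with hundreds of thousands of entries.
TOY_DICT = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}

def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# The classic ambiguous string 南京市长江大桥 can be read as
# "Nanjing / Yangtze River Bridge" or "Nanjing mayor / Jiang Daqiao".
print(forward_max_match("南京市长江大桥", TOY_DICT))
# → ['南京市', '长江大桥']
```

The greedy longest-match heuristic picks one reading and silently discards the other, which is exactly the ambiguity the paragraph above describes.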


With the widespread use of deep learning, what sets different languages apart has gradually become the amount of available training data. In the past, the amount of Chinese data available for NLP was far less than that of English. Now, however, more and more people in China are investing in artificial intelligence and NLP research, so the problem of insufficient Chinese data sets is improving year on year.


In the financial field, Chinese and English NLP are at roughly the same stage in terms of basic financial knowledge. But for complex tasks such as sentiment analysis, relationship extraction or topic extraction, both Chinese and English NLP have a long way to go, due to the complex scenarios and contexts involved, as well as the substantial amount of training data required.


What do financial institutions seek to gain from a robust NLP solution?


NLP can aid with analyzing an industry's entire value chain and with sentiment analysis, including monitoring and analyzing news, announcements and social media. Machines can help us read and process huge amounts of information in a short time, which is impossible for human beings to do.


For example, commercial banks want to utilize comprehensive data to conduct accurate credit risk management of enterprises and to forecast the potential risks posed by these enterprises in advance. At present, the conventional method is to start from the annual report published by the company and then make a judgment based on the results of a field investigation conducted by a compliance officer. Not only does an enterprise's own risk disclosure come with a time lag between the report and the actual event, but public information coverage is also limited, resulting in an assessment that captures only a basic understanding, just the tip of the iceberg. This is where natural language processing and artificial intelligence can make a difference.


Natural language processing can mine multi-dimensional relationships from information, evaluate the relationships between enterprises, present those relationships intuitively through a knowledge graph, and from there set up early-warning alerts. Once a significant change within the enterprise network is detected, its impact on the entire network can be quickly assessed based on the weights of the relationships.
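The weighted-propagation idea can be sketched in a few lines. This is a hypothetical illustration, not MioTech's algorithm: the company names, the edge weights, and the decay-by-weight rule are all assumptions made up for the example.

```python
from collections import deque

# Hypothetical weighted relationship graph between enterprises.
# Edge weights in (0, 1] stand for relationship strength, e.g. a
# supplier share or ownership stake; names and values are invented.
GRAPH = {
    "CompanyA": [("CompanyB", 0.8), ("CompanyC", 0.3)],
    "CompanyB": [("CompanyD", 0.5)],
    "CompanyC": [],
    "CompanyD": [],
}

def propagate_impact(graph, source, threshold=0.1):
    """Breadth-first propagation of an impact score from `source`.
    The score decays by the edge weight at each hop and stops
    spreading once it falls below `threshold`."""
    impact = {source: 1.0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour, weight in graph.get(node, []):
            score = impact[node] * weight
            if score > impact.get(neighbour, 0.0) and score >= threshold:
                impact[neighbour] = score
                queue.append(neighbour)
    return impact

print(propagate_impact(GRAPH, "CompanyA"))
```

An alert system could then flag every enterprise whose propagated score exceeds some risk threshold whenever a negative event is detected at the source company.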



Mining an industry's value chain based on public financial reports of listed companies is another application of natural language processing within a financial context.


Data on an industry’s value chain are drawn from the financial reports of A-share listed companies, all of which are original data sources. Keywords are extracted from the main business composition disclosed in public financial reports and fed into a pre-trained neural network to obtain aligned vector representations.


Next, we perform density-based clustering on the input vectors, output clusters of different densities, and finally name the clusters.
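The density-based clustering step can be illustrated with a minimal DBSCAN implemented from scratch. The 2-D "keyword vectors" below are toy stand-ins; the real pipeline described above would use high-dimensional embeddings from a pre-trained model, and the `eps`/`min_pts` values here are assumptions chosen for the toy data.

```python
import math

def dbscan(points, eps=1.0, min_pts=2):
    """Minimal DBSCAN: returns one cluster id per point (-1 = noise)."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # noise (may be reclaimed as a border point)
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:     # reclaim noise as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)   # j is also core: expand the cluster
    return labels

# Toy 2-D vectors: two dense groups plus one isolated outlier.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
print(dbscan(vectors, eps=0.5, min_pts=2))
# → [0, 0, 0, 1, 1, -1]
```

Each resulting cluster would then be given a name (the final step above), for example by taking the most frequent or most central keyword in the cluster.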


What do you foresee happening in the future in terms of NLP, especially for Chinese?


As more and more data is generated each day, the number of data sets available for machine training will continue to increase, and with it NLP will continue to see significant improvements.


At the same time, with the development of deep learning, machines can now solve targeted problems more directly, with less reliance on human input. This can be seen in AlphaGo, which first absorbed human experience and then went on to outperform humans.


After the emergence of the BERT model, the accuracy of NLP improved markedly. In some cases, machine reading comprehension in particular, the accuracy of some models has surpassed that of humans. This year, AI beat humans on a Stanford reading comprehension test, the Stanford Question Answering Dataset (SQuAD). The leaderboard on SQuAD's website shows a machine EM score (Exact Match, providing exact answers to questions) of 87.147 (as of March 20), placing a BERT-based model in first place, higher than the human score of 86.831.
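To make the EM metric concrete, here is a small sketch of how Exact Match is typically computed, in the spirit of SQuAD-style evaluation: answers are normalized (lowercased, punctuation and English articles removed) before an exact string comparison. The example predictions and references are invented for illustration.

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop articles (a/an/the) and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predictions, references):
    """Percentage of predictions that exactly match their reference
    answer after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)

preds = ["The Yangtze River", "1997", "Beijing"]
refs  = ["Yangtze river",     "1997", "Shanghai"]
print(exact_match(preds, refs))  # two of three match
```

Note that EM is a strict metric: an answer that is correct but phrased differently scores zero, which is why leaderboards usually report an F1 score alongside it.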


From the perspective of the Chinese language, NLP will continue to develop in tandem with deep learning. As data sets become more and more abundant, the relationship extraction for complex semantics will be more accurate, and the recognition of intentions will gradually improve. In specific application scenarios, especially in the financial sector, it will grow to become more mature.


At MioTech, we’ve adopted the knowledge graph to solve the problem of insufficient training data. With the further development of NLP algorithms, paired with our core technology, entity recognition and relationship extraction will only become more accurate.
