Firms Carve Rosetta Stones for Non-English AI Surveillance
English has long been considered the lingua franca of international business. But as geopolitical, industrial and human trends veer away from an Anglo-centric perspective and become more globalized, the importance of other languages—and the need for fluency in them—rises in turn.
This shift has significant implications for artificial intelligence (AI) in the capital markets. While machine learning, natural-language processing and other AI subsets have evolved to the point where they are becoming intrinsic elements of trading firms’ surveillance and trading operations, there is a problem: most have been developed and trained to analyze English. In an industry increasingly populated by other languages, that simply isn’t enough.
“We’re at the very tip of the iceberg. Technologies like machine learning in particular are starting to mature and you’re seeing people use this technology in their daily lives more. You can see people in compliance and those who use trading surveillance tools recognize that we can now use this technology,” says Jay Biondo, product manager at Trading Technologies. “I still think it’s very much in the early stages but it’s going to start to be adopted more widely and it does open the door to use this technology, maybe like [Apple’s] Siri or [Amazon’s] Alexa for trade surveillance where it actually translates the things for you and speaks to you.”
Determining which language is the most widely used in the world is difficult, but based on the number of native speakers, the most commonly spoken language is Chinese. Others include Spanish, modern standard Arabic, Hindi, French and Russian. With so much of the world speaking these languages, it is inevitable that they will creep into business discussions between traders and investors about market movements.
Content is King
Market sentiment analysis is one of the areas in which AI can be deployed, for instance, but if the models are restricted to just one language then investors may miss crucial signals. PanAgora Asset Management developed its own machine learning models to track chat and blog conversations in Chinese to determine market sentiment.
Mike Chen, portfolio manager at PanAgora, says the solution relies on a full corpus of each language, used to train the natural-language processing model that tracks conversations.
“Whether the machine can learn multiple languages at the same time depends on the relationship between the languages as they appear in the corpus used to train the natural-language processing algorithm. It might require a different clean-up or different processing, segmentation or stemming,” Chen says. “But the core engine does not know and does not care what language it is. If you have a corpus of different languages in it, as long as you have a dataset with a sufficiently large sample, it’ll be able to learn.”
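The separation Chen describes, where preprocessing differs per language but the core engine is language-agnostic, can be sketched roughly as follows. The tokenizers here are deliberately crude illustrations (real systems would use proper stemmers and a Chinese segmenter such as jieba); the function names and language codes are invented for the example.

```python
from collections import Counter

def tokenize_english(text):
    # Crude whitespace segmentation plus naive suffix stripping as a
    # stand-in for stemming.
    return [w.lower().rstrip("s") for w in text.split()]

def tokenize_chinese(text):
    # Chinese has no whitespace word boundaries; a real system would use a
    # segmenter. Character bigrams are a common simple fallback.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

# Per-language preprocessing is swapped in here; everything downstream
# only ever sees a bag of tokens, so the model itself is language-agnostic.
TOKENIZERS = {"en": tokenize_english, "zh": tokenize_chinese}

def featurize(text, lang):
    """Turn raw text into the bag-of-tokens the shared model consumes."""
    return Counter(TOKENIZERS[lang](text))

en_bag = featurize("Stocks rally on earnings", "en")
zh_bag = featurize("股票大涨", "zh")  # character bigrams: 股票, 票大, 大涨
```

Whatever the language, the output has the same shape, which is why, in Chen's words, the core engine "does not know and does not care" what language it is processing.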
Understanding the market environment and the conversations being held around it involves not just taking in the data, but creating actionable insights from it. With natural-language processing, companies are able to take in content from emails, voice calls, chat and written documents on the internet and break that information down in order to categorize important topics. This is then used to generate analytics and insights. For PanAgora to get a big-picture view of the market, it helps to keep a full database of the language so the machine-learning algorithm is more entrenched in it. This allows the algo to learn more quickly if users start using slang or other words in market conversations.
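A minimal sketch of that categorization step might look like the following, with invented topic buckets and keywords; a production system would use trained classifiers rather than keyword lookup.

```python
# Hypothetical topic buckets; the topics and keywords are invented.
TOPIC_KEYWORDS = {
    "earnings": {"earnings", "guidance", "revenue"},
    "m&a": {"merger", "acquisition", "takeover"},
}

def categorize(message):
    # Assign a message to every topic whose keywords it mentions.
    words = set(message["text"].lower().split())
    return [topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws]

# Content from different channels is normalized into the same record shape.
messages = [
    {"channel": "email", "text": "Revenue beat but guidance cut"},
    {"channel": "chat", "text": "Hearing takeover chatter again"},
]
summary = {m["channel"]: categorize(m) for m in messages}
# summary maps each channel to the topics detected in its messages
```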
Chen says another challenge the firm had to deal with has been the sheer number of conversations that take place in blogs and their comment spaces, some of which may not even be written by real people. The challenge is to know enough of a language’s syntax and structure to determine the difference between a human and a bot. PanAgora has had to create a filter that not only understands Chinese and parses it for information but also determines which posts may have been authored by bots.
“When we collect the Chinese retail blog discussions, we filter out a lot of the robot posts. We keep looking at it and couple this with our ability to read and understand Chinese, and our local knowledge. We did a lot of pre-processing to do this,” Chen says. “With robot blog posts, they have familiar heading patterns, or they might say ‘Company ABC recommends this.’ So we filter that. Those are some of the common examples but we have a whole host of them. They’re usually uninformative blog posts.”
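The heading-pattern filtering Chen describes could be sketched as a set of rules like the ones below. The specific patterns and example posts are assumptions for illustration; PanAgora's actual filters are not public.

```python
import re

# Illustrative bot-post filter, loosely modeled on the patterns Chen
# describes ("familiar heading patterns", "Company ABC recommends this").
BOT_PATTERNS = [
    re.compile(r"^(Hot tip|Daily pick):"),         # formulaic headings
    re.compile(r"\b\w+ recommends this\b"),        # templated endorsements
    re.compile(r"(click here|follow link)", re.I), # promotional boilerplate
]

def is_bot_post(text):
    # A post is discarded if any known bot pattern appears in it.
    return any(p.search(text) for p in BOT_PATTERNS)

posts = [
    "Company ABC recommends this stock, click here!",
    "Earnings looked weak this quarter, I'm trimming my position.",
]
human_posts = [p for p in posts if not is_bot_post(p)]
```

In practice such rules would be written over Chinese text and maintained by analysts who, as Chen notes, can read the language and apply local knowledge.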
But while PanAgora uses an entire database of one language, other firms say they need only some knowledge of non-English languages, as they are more concerned with specific words in conversations and their connections to market movements. While it is important to work closely with experts and native speakers, meaning can often be determined by analyzing key words, proper nouns and other reference points, without necessarily requiring fluency.
Steve LoGalbo, director of product management at Nice Actimize, says the technology concerns itself with the content of the conversation, so the machine looks for specific words used in specific contexts.
“For us, communication is communication; it doesn’t matter what language you’re speaking but you have to have the technology that can extract the things that people are saying in those various languages,” LoGalbo says. “We’re using technology that understands language and this includes a text analytics component that understands English, or understands Chinese and Japanese, or different languages, and those text analytics components are extracting interesting conversation topics.”
He says the technology Nice developed is trained to recognize entities, people or places in a conversation, no matter which language it is dissecting.
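One simple way to make entity recognition language-independent is to map surface forms in every supported language to a single canonical identifier. The lexicon and ID scheme below are invented for illustration; real systems combine such lexicons with trained sequence models.

```python
# Hypothetical multilingual entity lexicon: surface forms in different
# languages all resolve to one canonical entity ID.
ENTITY_LEXICON = {
    "Deutsche Bank": "ORG:deutsche_bank",
    "德意志银行": "ORG:deutsche_bank",  # the same organization, in Chinese
    "London": "LOC:london",
    "伦敦": "LOC:london",
}

def spot_entities(text):
    # Return the set of canonical entities mentioned in the text.
    return {eid for surface, eid in ENTITY_LEXICON.items() if surface in text}

# An English sentence and its Chinese counterpart yield the same entity IDs,
# so downstream surveillance logic never has to care about the language.
en_hits = spot_entities("Traders in London discussed Deutsche Bank")
zh_hits = spot_entities("伦敦的交易员讨论了德意志银行")
```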
As the machine monitors conversations, it can start to develop its own understanding of the different languages and mark points of knowledge by itself. Once it does, the software relies less on supervised learning. This has been proven in English-based machine learning and natural-language processing, but is still in its infancy for other languages.
LoGalbo says companies can customize the terms they want to monitor in the supported languages. Once these are set, the system starts to learn patterns to better classify conversations and determine their impact, the same as in English. He adds, however, that there needs to be some level of supervised learning involved, especially in a language other than English.
Limitations
This becomes important for surveillance, as opposed to simply gleaning insight from market chatter. Machine learning and natural-language processing in English have evolved significantly in recent years, but support for other languages tends to lag behind. The technology may be a long way from producing bespoke market insights based on random conversations or predicting potential fraud, but it can eventually catch up. English-based machine learning has also begun experimenting with detecting sentiment in conversations.
The process to allow a machine to learn to read and analyze is not without difficulties. And non-English machine learning can face more limitations than its English-based counterpart.
A big difference, of course, is the availability of datasets to train a system, and people who can annotate the data to begin the guided learning process for the software—the first step toward deeper, more independent learning by a machine. Catherine Havarsi, AI science lead at Agorai and a researcher at the Massachusetts Institute of Technology (MIT), points out it is rare for companies to keep records in different languages as religiously as they do for documents in English.
“Non-English natural-language processing is still behind English and most of it is because there is a lack of datasets and training data available in these languages. Machine learning datasets need to be annotated so you also need someone who understands those languages,” Havarsi says. “Most firms that really focus on keeping records are US- or UK-based companies and other firms just don’t keep documents with different languages.”
Of course, many of the documents that run businesses all over the world are written in English. Most regulations in developed markets, too, are written in English. But by not paying attention to records in a foreign language, the industry risks being blind to information that sheds more insight into certain markets.
Havarsi adds that there may be a limit to the number of non-English languages that can be programmed into a trade surveillance module, so not all languages may make their way into a machine-learning surveillance product.
“There are only [a relatively small number of] languages with their own Wikipedia page and those could be the ones we can pay more attention to. There are a lot of languages out there and there is a drop-off point for natural-language processing,” she says. “Crowdsourcing has been such a huge help in the development of the technology as well so if languages don’t have enough of a crowd—and it has to be an expert crowd too—then it might not have enough to really develop that particular use case.”
Spoken conversations also present an issue for recognizing other languages—also a problem in English—because of different accents.
LoGalbo and Trading Technologies’ Biondo both note it is important to have different training recordings of different accents, even in English, as this drastically changes how words sound and how the technology takes in information.
Slang is another issue the technology has had to deal with, in both English and non-English languages. People take words and use them to mean something else, and it takes a while for the technology to make sense of the new usage. It is especially difficult in other languages. Slang within the Chinese internet community, for instance, is often designed specifically to evade this exact kind of AI surveillance technology, which the Chinese government employs to monitor communications between citizens. PanAgora’s Chen says the asset manager’s approach is to wait until a new term gains prominence before updating its library.
“The library is the natural-language processing model. It just keeps on updating. When a new cyber slang gains prominence, if the algorithm sees it sufficiently enough times, it will pick up on it. It’s fully automated and self-updating,” he says.
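The frequency-gated update Chen describes, where a term joins the library only once the algorithm has seen it "sufficiently enough times", could be sketched as below. The class name and threshold are assumptions; PanAgora's actual mechanism is more sophisticated.

```python
from collections import Counter

class SlangVocabulary:
    """A self-updating vocabulary: terms are promoted into the library
    once they have been observed at least `threshold` times."""

    def __init__(self, threshold=50):
        self.threshold = threshold
        self.counts = Counter()
        self.vocabulary = set()

    def observe(self, tokens):
        # Count every token seen in the incoming stream of posts...
        self.counts.update(tokens)
        # ...and promote any token that has crossed the frequency gate.
        for tok, n in self.counts.items():
            if n >= self.threshold:
                self.vocabulary.add(tok)

# A low threshold just for illustration.
vocab = SlangVocabulary(threshold=3)
for _ in range(3):
    vocab.observe(["韭菜"])  # retail-investor slang, literally "chives"
vocab.observe(["割"])        # seen only once, so not yet promoted
```

The point of the gate is that one-off noise never enters the model, while genuinely prominent new slang is picked up automatically, with no manual retraining.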
Future Innovations
Non-English machine learning, much like regular machine learning, will eventually evolve further. In the future, it will be more sophisticated and possibly develop understanding of both the English content and the non-English data it takes in.
Agorai’s Havarsi notes that the more machine learning and natural-language processing develop, the more non-English languages become integrated. “As we keep developing natural language processing, we’re definitely going to see trends. One is machine translation. But for the financial services industry, you’re looking for more actionable language that may not show up in the translation,” she says.
One future possibility for the technology is not just generating insights but also providing additional learning for international traders. Biondo says Trading Technologies wants to develop the technology to the point where it can translate conversations around certain market movements for clients in other countries.
Currently, the company uses visual cues to point out potential market disruptions. Eventually, Biondo hopes the company will have the ability to provide explanations in different languages to contextualize the movement.
“Some people train their machine-learning models with the actual language from previous cases and that’s the starting point. Then you’re going to have to start to add another layer where you’re putting that with a foreign language while also putting additional detail on things that get to more of the mechanics of that activity,” Biondo says. “Ideally, it becomes more than just a term of art that only the domain expert in the US understands, but something that somebody in Europe can understand as well.”
While some of these potential innovations for non-English machine learning are still far off, those interviewed agree the technology is improving and better solutions may emerge soon.
Originally published in WatersTechnology, February 21, 2019