Cognitive technology and Text Analytics to secure mass gathering events

By Andres Garcia-Silva and Jose Manuel Gomez-Perez (Expert System Spain), and Alessio Mulas and Davide Ariu (Pluribus One).

Web Content as a Primary Source of Information

The value of an intelligence solution for mass gatherings lies in its ability to use all the information available and to give analysts the tools they need to gain visibility, context and insights. The Web, including social media, news, wikis, forums and web sites in general, is a prominent source of user-generated content about events that can be leveraged to identify and assess security threats.
Nevertheless, extracting meaningful information from online sources raises several big data challenges, including the distributed nature of web resources and the fact that this information is mostly unstructured text. Indeed, regardless of scale, processing natural language is a cumbersome task because of the ambiguity of words and sentences, misspellings, the slang and informal language used in social media, and multilingualism, to name a few issues.

Semantic Intelligence Engine

Expert System and Pluribus One work together in the LETSCROWD project to deliver a Semantic Intelligence Engine (SIE) that enables security analysts to monitor and gather text from web resources via a configurable focused crawler. The extracted text is processed with Cogito, Expert System’s cognitive technology, which enriches it with content-based semantic metadata. This metadata enables advanced visualizations that support data analysis and inspection from a security perspective.

Web Crawler

The Web Crawler Module gathers information from several web-based resources. Crawled sources are open (OSINT) and social media oriented (SOCMINT), and legal and ethical boundaries and limitations are taken into account. To cope with the variety of sources and a constantly changing scenario, the Web Crawler Module follows a plugin-based architecture. Several smaller components, or “plugins”, each handle atomic searches with an approach customized for one source, while a core module manages data retention, communication with external modules and other shared tasks.
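The split between source-specific plugins and a shared core can be illustrated with a minimal sketch. It is not the LETSCROWD implementation: the names `Document`, `SourcePlugin`, `CrawlerCore` and `StaticForumPlugin` are hypothetical, and the example plugin reads from an in-memory dictionary instead of a real web source.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A piece of text gathered from one source (hypothetical structure)."""
    source: str
    url: str
    text: str


class SourcePlugin(ABC):
    """Hypothetical plugin interface: one subclass per kind of source."""
    name: str

    @abstractmethod
    def search(self, query: str) -> list[Document]:
        """Run one atomic search against this source."""


class CrawlerCore:
    """Hypothetical core: registers plugins and retains what they return."""

    def __init__(self) -> None:
        self._plugins: dict[str, SourcePlugin] = {}
        self._store: list[Document] = []

    def register(self, plugin: SourcePlugin) -> None:
        self._plugins[plugin.name] = plugin

    def run(self, query: str) -> list[Document]:
        """Query every plugin and keep only documents not stored before."""
        seen = {doc.url for doc in self._store}
        new_docs = []
        for plugin in self._plugins.values():
            for doc in plugin.search(query):
                if doc.url not in seen:
                    seen.add(doc.url)
                    new_docs.append(doc)
        self._store.extend(new_docs)
        return new_docs


class StaticForumPlugin(SourcePlugin):
    """Toy plugin backed by a dict of url -> text, standing in for a forum."""
    name = "forum"

    def __init__(self, posts: dict[str, str]) -> None:
        self._posts = posts

    def search(self, query: str) -> list[Document]:
        return [
            Document(self.name, url, text)
            for url, text in self._posts.items()
            if query.lower() in text.lower()
        ]


if __name__ == "__main__":
    core = CrawlerCore()
    core.register(StaticForumPlugin({
        "http://example.org/1": "Concert tickets on sale",
        "http://example.org/2": "Road closures near the stadium",
    }))
    for doc in core.run("stadium"):
        print(doc.source, doc.url)
```

In a design like this, supporting a new source means writing one more `SourcePlugin` subclass, while deduplication and storage stay in the core.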

Text Understanding: Semantic Technology and Machine Learning

Cogito analyzes text to produce meaningful, actionable intelligence that enhances insight, supports better-informed decisions and strengthens what analytics can reveal. The capability to understand, relate and disseminate intelligence as it is acquired can accelerate the risk assessment of an emerging situation and contribute to the development of a threat profile or to the effectiveness of a security strategy.


The approach to text understanding is based on a representation of knowledge, the Cogito Knowledge Graph, that encodes linguistic knowledge for 14 different languages. Cogito carries out semantic analysis, including word disambiguation, to identify the correct meaning of words and expressions in context, and to understand the relationships between different concepts. The output of this linguistic analysis is used to perform more complex tasks such as information extraction, text classification into taxonomies, and analysis of an author's writing style.

For example, security analysts can monitor categories such as “act of terror” or “religiously inspired terrorism”, where documents are placed if they contain information about a terror attack or if the motivation for the attack was religious. They can inspect sources where entities of type “criminal organizations” are found or where specific people and places are mentioned. The writing style analysis can be used to identify authors who consistently use slang in criminal or cybercriminal contexts.

Text Analytics

We developed a number of intelligence dashboards that provide a high-level, indicator-based, unified view of the documents gathered about a given mass gathering. These dashboards are interactive and dynamic. They allow analysts to filter the document collection using keyword searches, dates, taxonomies, and time series (see left-hand side column), and then to inspect in detail each of the documents that fulfil the criteria.
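The filtering behind such a dashboard can be sketched as a combination of optional criteria that a document must all satisfy. This is an illustrative sketch only, not the dashboard's actual code: `EnrichedDocument` and `filter_documents` are hypothetical names, and the categories are assumed to have been assigned earlier by the text classification step.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class EnrichedDocument:
    """A document plus semantic metadata (hypothetical structure)."""
    title: str
    text: str
    published: date
    categories: set[str] = field(default_factory=set)


def filter_documents(
    docs: list[EnrichedDocument],
    keyword: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
    category: Optional[str] = None,
) -> list[EnrichedDocument]:
    """Keep documents that satisfy every criterion that was supplied."""
    selected = []
    for doc in docs:
        if keyword and keyword.lower() not in doc.text.lower():
            continue  # keyword search is case-insensitive
        if start and doc.published < start:
            continue  # published before the date range
        if end and doc.published > end:
            continue  # published after the date range
        if category and category not in doc.categories:
            continue  # not classified under the requested category
        selected.append(doc)
    return selected


if __name__ == "__main__":
    collection = [
        EnrichedDocument("Post A", "Crowd expected at the festival",
                         date(2019, 5, 1), {"event"}),
        EnrichedDocument("Post B", "Threat reported near the festival",
                         date(2019, 5, 3), {"act of terror"}),
    ]
    hits = filter_documents(collection, keyword="festival",
                            start=date(2019, 5, 2))
    print([doc.title for doc in hits])
```

Leaving a criterion as `None` disables it, which mirrors a dashboard where each filter widget is optional.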

The dashboards include widgets that describe the metadata found in the document collection (columns in the middle and on the right-hand side), such as a tag cloud of frequent terms. They also include a variety of charts (bar, pie, line) that show the distribution of named entities extracted from the documents (people, places, organizations, etc.) and indicate the level of slang per document or the characteristic type of slang used by each author.
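The data behind a named-entity chart amounts to a frequency count per entity type. The sketch below assumes that each document already carries a list of `(type, value)` entity pairs produced by the extraction step. That layout and the function name `entity_distribution` are assumptions made for illustration, not the format Cogito emits.

```python
from collections import Counter


def entity_distribution(docs, entity_type):
    """Count how often each entity of the given type occurs across documents.

    Each document is assumed to be a dict with an "entities" list of
    (type, value) pairs. Returns (value, count) pairs, most frequent first.
    """
    counts = Counter()
    for doc in docs:
        for etype, value in doc["entities"]:
            if etype == entity_type:
                counts[value] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"entities": [("place", "Madrid"), ("person", "Jane Doe")]},
        {"entities": [("place", "Madrid"), ("place", "Rome")]},
    ]
    # Each pair could feed one bar of a bar chart or one slice of a pie chart.
    for value, count in entity_distribution(sample, "place"):
        print(value, count)
```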