A new key for security through the analysis of language and words
IN-Q-TEL (www.inqtel.com) is a CIA-financed investment fund, in operation since 1999, whose goal is “to invest in and encourage the production and research of the most innovative and promising technologies” in support of the activities of the U.S. Intelligence Community. Aware that “at the frantic rate at which innovation advances, it is difficult for a public organization to keep abreast of the most recent advances in information technologies”, the CIA has entrusted IN-Q-TEL with the task of identifying and rapidly selecting new instruments for safeguarding national security, the most interesting industrial projects being financed by the CIA itself. The IN-Q-TEL portfolio presently numbers about fifty enterprises, among which are companies that develop TAL technologies.
The acronym TAL (Automatic Treatment of Language) defines the disciplines which deal with the models, methods, technologies, systems and applications concerning the automatic elaboration of written and spoken language. TAL therefore includes both “Speech Processing” (SP), the elaboration of speech, and “Natural Language Processing” (NLP), the elaboration of text. The technologies for the spoken language encompass the coding of the vocal signal, the synthesis of speech from a written text (equipment capable of “reading”) and the recognition of speech (equipment capable of “writing”). As for the written text, automatic elaboration aims at reproducing the human capacity to understand a language through syntactic and semantic analyzers, mostly based on algorithms, statistical modules, models of knowledge representation and methods of automatic learning.
The technology of automatic treatment of the text
The automatic treatment of the text can concern both the generation (synthesis) and the comprehension (analysis) of the text. Among the applications pertaining to the generation of texts, the following examples can be cited: translations, the creation of summaries of books, articles, etc.
Instead, when we speak of “comprehension”, we mean the identification of the contents of a text from a conceptual point of view. In this case, the most significant applications range from the correctors (lexical, grammatical, syntactic, stylistic) used daily by thousands of people, to the interfaces in natural language (NLP, Natural Language Processing: systems capable of elaborating the type of language which two “human” interlocutors normally use for communication), to the applications for research, automatic classification and the selection and extraction of information from documents.
There are different methods of elaboration, each one characterized by a different level of analysis and interpretation.
Three different levels of textual elaboration
In the full-text elaboration, the text is examined on the basis of the keywords, where a keyword is a string of characters, letters and/or numbers, separated from the other strings of the text, by means of separators such as spaces and punctuation. In this system, no attempt at interpretation of the text is made: the keywords are considered literally, i.e. not for what they express, but for their graphic form.
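In sketch form, full-text elaboration reduces to splitting the text on separators; the following minimal Python illustration (the character pattern is an assumption, not a description of any particular system) shows how the keywords are taken literally, with no interpretation:

```python
import re

def extract_keywords(text):
    # A keyword is a run of letters and/or digits; spaces and
    # punctuation act as separators. No interpretation is made:
    # "harbour" and "harboured" remain distinct keywords.
    return re.findall(r"[A-Za-z0-9]+", text)
```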
At the lexical level of elaboration, the text is submitted to a grammatical analysis. Each element of the sentence, even when composed of more than one word, is associated with a headword of the vocabulary of the language of reference: inflected verb forms are brought back to the infinitive form of the verb, the plurals of nouns and adjectives to the singular, and so on.
The text is thus analyzed in terms of headwords. For example, in the sentence “The ship has entered the harbour”, the noun “harbour” is identified, while in the sentence “He harboured a grudge”, the verb “to harbour” is identified.
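This lexical step can be sketched as a dictionary lookup from inflected form to headword (the entries below are illustrative; real systems rely on full morphological lexicons):

```python
# Toy lemma dictionary mapping inflected forms to headwords
# (illustrative entries only).
LEMMAS = {
    "took": "take", "taken": "take", "taking": "take",
    "entered": "enter", "ships": "ship", "harboured": "harbour",
}

def lemmatize(token):
    # Inflected forms are brought back to the headword; unknown
    # tokens are returned unchanged (lowercased).
    t = token.lower()
    return LEMMAS.get(t, t)
```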
At the semantic level of elaboration, the text is submitted to a linguistic analysis in order to determine the most probable meaning of each term expressed in the text.
When a headword is ambiguous, that is, when it could have more than one meaning (for example, the word “sentence”, which could be understood as a “grammatical unit containing a finite verb” or as a “punishment given by a judge”), a process of “clarification” is set in motion, thanks to which the most probable meaning is chosen from all the other possibilities.
Such a process avails itself of other information present in the system of semantic analyses and takes into account meanings, presumed or ascertained up to that moment, of other relevant headwords in the sentence and in the rest of the text. In short, the determination of each meaning influences the clarification of the others, to the point of reaching a situation of maximum plausibility and coherence at the level of sentence, period and the entire document.
All the information fundamental to the process of clarification, that is, the entire knowledge employed by the system, is represented in the form of a semantic network: a lexical database organized on a conceptual basis, in which the words are not placed in alphabetical order (as in a classical dictionary), but in groups of synonyms or near-synonyms according to the meaning (or concept) they express. In this type of structure, each lexical concept coincides with a node of the semantic network and is connected to the others by precise semantic relations in a hierarchical, inheritance-based structure, so that each node is enriched by the characteristics and meaning of the nearby nodes. Among the relations that tie the meaning nodes together in a multiplicity of ways, the following can be cited:
- the relation of “general” - “specific” between nouns, called hypernymy (e.g. dog – hunting dog – Irish terrier);
- the relation of “specific” - “general” between nouns, called hyponymy (e.g. sheep – mammal);
- the relation of “general” - “specific” between verbs (e.g. to walk – to limp);
- the relation of “whole” - “part” between nouns (e.g. arm – hand – index finger);
- the relation of “predicate verb” - “direct object” between verbs and nouns (e.g. to drive – car).
The final result is a graph characterized by thousands of nodes and millions of relations.
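Such a network can be sketched as a typed graph; the following minimal Python class (concepts and relation names are illustrative, taken from the examples above) shows how a node inherits its position in the hierarchy by walking the “specific – general” links:

```python
class SemanticNetwork:
    """Nodes are lexical concepts; edges are typed semantic relations
    such as hypernymy ("general") and meronymy ("part of")."""

    def __init__(self):
        self.edges = {}  # (concept, relation) -> set of concepts

    def add(self, src, relation, dst):
        self.edges.setdefault((src, relation), set()).add(dst)

    def related(self, concept, relation):
        return self.edges.get((concept, relation), set())

    def hypernym_chain(self, concept):
        # Walk "specific -> general" links up to the most general node.
        chain = []
        while True:
            parents = self.related(concept, "hypernym")
            if not parents:
                return chain
            concept = sorted(parents)[0]
            chain.append(concept)

net = SemanticNetwork()
net.add("Irish terrier", "hypernym", "hunting dog")
net.add("hunting dog", "hypernym", "dog")
net.add("sheep", "hypernym", "mammal")
net.add("hand", "part_of", "arm")
```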
How the semantic analysis functions
To thoroughly understand the process of semantic analysis, let us consider these phrases:
- “I took an express” (type of Italian coffee)
- “He had taken an express (coffee) at the Station bar”
- “We took the express for Naples”
- “They were taking an express (coffee) at the Station bar because they had lost the express for Naples" (1)
The information extracted and analyzed in each one of these sentences varies according to the approach of the textual elaboration used.
With the full-text elaboration, the keywords which compose the text are simply extracted: for the sentence “I took an express (coffee)”, the words “I”, “took”, “an”, “express” are analyzed; for the sentence “He had taken an express (coffee) at the Station bar”, the words “he”, “had”, “taken”, “an”, “express”, “at”, “the”, “station”, “bar”, and so on.
In the examination of the headwords, the lexical elaboration allows a more detailed analysis: in the sentence “I took an express (coffee)”, the elaboration identifies and analyzes the verb “to take” and the noun “express”; in the sentence “He had taken an express (coffee) at the Station bar”, the verb “to take”, the nouns “express”, “bar” and “station”; in the sentence, “They were taking an express (coffee) at the Station bar because they had lost the express for Naples”, the verbs “to take” and “to lose”, the noun “express” (repeated twice) and the nouns “bar”, “station” and “Naples”.
It can be noted how, in the lexical analysis, the inflected forms “took”, “had taken” and “were taking” are all brought back to the headword “take”. In this way, a search for the headword finds all of the documents containing any inflected form of the word, whereas an analogous keyword search covering every possible inflected form would be wholly impractical due to its complexity.
However, a search for the headword “express” finds all of the documents containing this term in all its various meanings, because the search for a headword does not consider the meaning, but only the lexical unit, which is exactly the same in every case.
Semantic elaboration is the most advanced type of analysis because its aim is to understand the correct meaning of every single word of the text, clearly distinguishing between the various possible concepts.
Taking again the four sentences used as examples, the semantic analysis of the sentence “He had taken an express (coffee) at the Station bar”, recognizes the presence of concepts of “to take” in the sense of “eat or take something”, “express” in the sense of “coffee prepared at the moment of asking”, “bar” in the sense of “a public place” and “station” in the sense of a “place for the arrival and departure of trains”.
Consider the two concepts of “express” present in the sentence “They were taking an express at the Station bar because they had lost the express for Naples”: the semantic analysis recognizes the conceptual difference, assigning in the first case the meaning of “coffee prepared at the moment of asking” (the one taken at the station bar) and in the second that of “long-distance rapid train” (the one lost for Naples). This is possible thanks to the semantic network, which contains the possible meanings of the headword “express”, and because the system keeps track of the meanings identified for the other terms of the sentence.
From the point of view of meaning, the sentence “I took an express” is the most ambiguous, because nothing helps in determining the meaning of the verb “to take” (which can also mean “obtain”, “seize”, “acquire”, “eat or drink something”, “turn into a road”, “utilize a means of transport”, etc.). The same applies to the noun “express” (which can mean a “long-distance rapid train”, “a coffee prepared at the moment of asking”, “a letter delivered in a very short time”, etc.), among other possibilities.
To cope with this difficulty, the system can proceed to an analysis of the surrounding text, in order to find elements useful for a better interpretation, or assign to the doubtful terms the meanings most frequently used in the domain of the information being treated.
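The clarification process can be illustrated with a simplified overlap-based disambiguator, a textbook Lesk-style approach rather than a description of any particular system; the clue words attached to each sense below are invented for the example:

```python
# Each sense of an ambiguous headword carries a set of clue words;
# the sense whose clues overlap most with the sentence is chosen.
# Senses and clue sets are illustrative assumptions.
SENSES = {
    "express": {
        "train": {"naples", "platform", "departure", "lost"},
        "coffee": {"bar", "cup", "drink", "barista"},
    }
}

def disambiguate(word, sentence_tokens):
    context = {t.lower() for t in sentence_tokens}
    senses = SENSES[word]
    # Pick the sense with the largest clue/context overlap.
    return max(senses, key=lambda s: len(senses[s] & context))
```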
A support for intelligence activity
The quantity and the heterogeneity of the type and format of the information (about 80% of which is not structured, i.e. not organized in databases or precise schemes, and therefore difficult to examine with automatic systems) render the elaboration and analysis of information complex and laborious.
The intelligence sector, understood both as a collection of organizations, resources, systems and technologies for the protection of the security of a country (homeland security), and as the systematic activities which support business strategies (marketing, competitive intelligence), can derive great advantage from the use of linguistic technologies. With the relentless growth of Internet use and of the technologies for the mass distribution of its contents, all in digital form (or easily convertible into digital form), the quantity of available information has literally exploded, and there seems to be no limit to the proliferation of potentially interesting data. It is senseless to think of redressing the imbalance between the abundance of data and the scarcity of effectively useful information by counting solely on manual capacity, especially if we also consider the high percentage of “background noise”. Instead, it is reasonable to support the analysts with TAL instruments able to perform rapid automatic elaborations. These technologies optimize the phases of identification, search and selection of strategic elements, thereby considerably reducing the complexity of the operations; the elaborated data then necessarily requires human evaluation and exploitation, which, at least in the short term, remains irreplaceable.
Described in the following paragraphs are some of the possible applications of the technologies of text analysis (semantic search, classification, information mining, study of the writing style) and of the spoken language (synthesis and coding of the vocal signal; recognition of the way of speaking, accent, etc.; identification and verification of the speaker) which, thanks to their horizontal and scalable characteristics, can be employed for numerous purposes.
Semantic search
The criterion for evaluating the efficacy of an information search engine is the quality of the signal-to-noise ratio of its answers, where “signal” means the information that one wants to find, and “noise” everything that is included but is not really relevant. Posing queries to a vast archive of texts, not all the signal present will be extracted, and certain unrelated information will be included. The objective of each search system is therefore to optimize the signal-to-noise ratio.
The principal causes of signal loss are the declension or conjugation of terms, the presence of synonyms and the use of other ways of expressing the same concept. The principal sources of noise, on the other hand, are the different meanings that a headword can have, and therefore the presence in a document of the words set out in the query without their expressing the information that is needed.
As an example, one may be looking, through an ordinary search engine, for information on a regulation approved by the Government, expressing this request with the phrase “The Government approves Regulation…”; the reply will also include documents which contain the indicated words, but with other meanings, e.g. “The government of a ship requires the presence of an experienced captain who approves the route and the regulation of all equipment aboard.” All the words of the request are present in the document, but it obviously does not deal with the approval of a regulation by the government. The probability of falling into this kind of “misunderstanding” is very high when searching by keyword or with statistical techniques.
Instead, going beyond the “form” of the keyword (a sequence of characters) and arriving at the “content” (a conceptual entity), one obtains much more satisfying results in terms of “recall” (the capacity to find as much as possible of the information pertaining to what is being searched for), “precision” (the capacity to identify precisely the useful information) and “ranking” (the capacity to order the results correctly: the more relevant documents at the beginning, and the least interesting at the end, because they are more “distant” from what is requested).
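Precision and recall, at least, can be made concrete with simple set arithmetic over the retrieved and the relevant documents, as in this small sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved results that are useful.
    Recall: fraction of the useful information actually found."""
    retrieved, relevant = set(retrieved), set(relevant)
    signal = retrieved & relevant  # the useful results returned
    precision = len(signal) / len(retrieved) if retrieved else 0.0
    recall = len(signal) / len(relevant) if relevant else 0.0
    return precision, recall
```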
With regard to the above, the most widespread search engines of today present considerable limitations, which can easily be observed with the following example. Searching for the term “macchina”, the Google search engine gives more than 4 million results. This would seem very positive: 4 million results pertaining to the request are indeed very many, apart from the time one would have to spend selecting just a few. However, the term “macchina” (in Italian) can be understood as “automobile”, “movie camera”, “mechanical device”, “sewing machine”, etc.
If the user intends to find information relative to “macchina” in the sense of “automobile”, he will naturally not succeed: he will be submerged by an exaggerated quantity of useless noise (of the 4 million results, how many will speak only of the automobile and not of other machines?). Obviously, the user will obtain scarce precision with respect to his aim of identifying information relative only to the automobile.
Likewise, the level of recall will not satisfy intelligence needs: many documents will not be selected although they contain relevant information expressed through terms such as “car”, “vehicle”, “cabriolet”, “jeep”, etc.
On the contrary, search engines based on semantic logic offer better recall and precision.
A semantic search based on the conceptual parameter “macchina” (intended in the sense of automobile) (2) will not only furnish documents which contain synonyms (car, auto, etc.) and hyponyms (Beetle, Fiat, jeep, station wagon, etc.), but will also exclude the noise, that is, all information which expresses different concepts (sewing machine, washing machine, etc.).
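In sketch form, such a search expands the intended concept into its synonyms and hyponyms and matches documents against that set; the concept lexicon below is an illustrative assumption, not a real resource:

```python
# Toy concept lexicon: each sense of "macchina" with its synonyms
# and hyponyms (entries are illustrative only).
CONCEPTS = {
    "macchina/car": {"car", "auto", "automobile", "beetle", "jeep",
                     "station wagon", "cabriolet", "vehicle"},
    "macchina/sewing": {"sewing machine"},
}

def semantic_match(concept, document_terms):
    # A document matches only if it mentions a term of the intended
    # concept; terms of the other senses count as noise and are ignored.
    return bool(CONCEPTS[concept] & {t.lower() for t in document_terms})
```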
An interesting opportunity offered by semantic technology is the “search by circumstance”, that is, the selection of information on the basis of determinate concepts tied to the various elements of the sentence (subject, verb, complement).
For example, by determining the logical function of the elements of the sentence and identifying each “subject-verb-object” group, it is possible to reconstruct a scheme able to represent, with good approximation, the existence of a “problem” (verb+object) for which there exists a “solution” (subject).
In practice, therefore, when one is looking for precise circumstances, that is, for solutions to well-defined problems, it is possible to base the search on these three elements. For example, among the results obtained in reply to a search based on the sentence “Criminals (subject/solution) rob commercial concern (verb+object/problem)”, a sentence like the following could appear: “The witness declared that he had seen a delinquent (subject/solution) who was robbing a tobacconist shop (verb+object/problem)”, but not a sentence like “The owner of the shop next door to the robbed tobacconist shop said that he did not know the criminal arrested yesterday”.
Exactly the same concepts are present in both propositions (delinquent, criminal; robbing, robbed; tobacconist shop, shop), but only in the first sentence do they fill the logical functions requested.
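This circumstance search can be sketched as concept-level matching over subject-verb-object triples; in a real system the triples come from a parser, and the synonym table below is an invented stand-in for the semantic network:

```python
# Hypothetical synonym table normalizing terms to concepts.
SYNONYMS = {
    "criminals": "criminal", "delinquent": "criminal",
    "robbing": "rob", "rob": "rob",
    "tobacconist shop": "commercial concern",
    "commercial concern": "commercial concern",
}

def concept_of(term):
    return SYNONYMS.get(term.lower(), term.lower())

def matches_circumstance(triple, query):
    # The same concepts must fill the same logical roles:
    # subject ("solution"), verb and object ("problem").
    return all(concept_of(a) == concept_of(b)
               for a, b in zip(triple, query))
```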
Automatic classification
The semantic technologies represent a valid support for the categorization of information, both that already present in data banks, archives, records, historical databases, etc., and that coming from other sources (word-processor documents, e-mails, PDFs, web pages, news flows in real time, etc.). Starting from the analysis of non-structured texts and combining objective rules (like those tied to linguistic analysis and to the identification of domains, or topics, valid regardless of scenario and activity) with subjective criteria designed to satisfy specific needs, one can classify, in an automatic or semi-automatic way, a large volume of documents.
As an example of how it works, we shall consider this sentence: “During the night a terrorist act was recorded, damaging the offices of the XY airline company; in the explosion more than twenty cars were damaged”.
From an objective point of view, the text can be classified in different categories, such as “acts of vandalism”, “acts of terrorism”, “airline companies”, “damage to material goods”, etc. However, the system is not limited to objective criteria of categorization, but also takes account of the subjective rules given at the start. According to the viewpoint, certain categories will be preferred to others: in the interests of a police administration, the text will be assigned to the category “acts of vandalism” or “acts of terrorism”, while in the case of an insurance company it will be assigned to the category “damage to material goods”.
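The combination of objective rules and subjective criteria can be sketched as follows; the rules and category names are illustrative assumptions built from the example above:

```python
# Objective rules: concepts found in the text propose candidate
# categories, regardless of who is reading (illustrative entries).
OBJECTIVE_RULES = {
    "explosion": {"acts of terrorism", "damage to material goods"},
    "damaged": {"damage to material goods", "acts of vandalism"},
    "airline": {"airline companies"},
}

def categorize(tokens, profile):
    candidates = set()
    for t in tokens:
        candidates |= OBJECTIVE_RULES.get(t.lower(), set())
    # Subjective criterion: keep only the categories of interest
    # to this particular user (e.g. police vs. insurance company).
    return candidates & profile
```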
There are already numerous production sectors which utilize solutions for the automatic classification of data: one only has to think of the publishing houses, newspapers and press agencies which, on the basis of a regular taxonomy (a tree-structured classification logic), have automated the process of categorization with great efficacy and efficiency.
Automatic classification is also extremely effective for “pre-selecting” enormous quantities of content on which to carry out a more detailed analysis at a later stage. One thinks particularly of investigative needs, above all in complex police investigations, where there is the necessity of analysing, in the most rapid and correct manner, a mountain of heterogeneous data (files, e-mails, pages of on-line sites, blogs, forums, etc.). In these cases, the objective pursued is not to select certain types of information beforehand, but rather to identify a track to follow, a route towards which more in-depth manual analyses can be directed.
TAL applications already exist for the pre-selection of all the potentially important content scattered throughout the documentary holdings at one’s disposal. After an in-depth linguistic-semantic analysis, such technologies arrange the data in a conceptual map, first identified and then classified by topic. At this point, the investigator has in front of him a clear and synthetic picture of the aggregations between the different topics which emerged in the phase of automatic pre-selection; he can then decide more reasonably in which direction to concentrate his attention, perhaps first navigating among the most interesting elements and then opening a certain document to read it completely, and so on.
Information mining
The transformation of data from the non-structured to the structured form has always been a most complex activity: besides the selection, in fact, it is necessary to proceed to a series of other complementary operations, which we can define, more simply, as “activities before the transformation and loading of the information”.
Undoubtedly, at the current state of technological development, it is implausible, in this context too, to hypothesize solutions able to totally replace the activity of the analyst. However, solutions do exist which can furnish a real contribution to the selection, transformation and structuring of the information. Those based on the linguistic approach allow digging deep into the data (for this reason we speak of “information mining”) to identify and extract the required information and, eventually, to identify the relations within it.
Let us consider these sentences:
-“Mario Rossi was arrested at Milan for the murder of Franco Neri, member of the Marseilles clan”.
-“Rossi was arrested at a road block in Piazza Giuseppe Garibaldi; he was driving a Punto and was in possession of a P38”.
The TAL technologies are able to extract from these texts: proper names of people (Mario Rossi, Franco Neri) and of places (Milan, Piazza G. Garibaldi), and particular entities, such as “arms” (P38), “criminal organizations” (the Marseilles clan), “type of car” (Fiat Punto) and “type of crime” (homicide).
Furthermore, significant relations can also be identified, for example: the relation “person – organization”: Franco Neri – Marseilles clan; the relation “person – crime”: Mario Rossi – homicide; the relation “person – place”: Mario Rossi – Milan; the relation “person – car”: Mario Rossi – Fiat Punto; and so on.
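A minimal gazetteer-based sketch of this entity extraction follows; the gazetteer entries are illustrative, and real systems combine such lexicons with contextual and grammatical rules:

```python
# Hypothetical gazetteer mapping known terms to entity types.
GAZETTEER = {
    "p38": "weapon",
    "marseilles clan": "criminal organization",
    "punto": "car model",
    "murder": "crime",
    "milan": "place",
}

def extract_entities(text):
    # Return the (term, type) pairs whose term occurs in the text.
    low = text.lower()
    return sorted((term, etype) for term, etype in GAZETTEER.items()
                  if term in low)
```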
Study of the writing style
With linguistic analysis, it is possible to identify the characteristics which distinguish one document from another. Style, the repeated use of certain words or expressions, and a preference for certain syntactic-logical structures or for certain types of construction can represent a sort of “fingerprint” from which the identity of the writer of a text can be deduced.
Furthermore, a thorough analysis of the writing can also encompass other parameters (the use of foreign words, neologisms, the general length of the propositions, the use of tenses and moods) from which it is possible to deduce, with good probability, certain information about the author. Thanks to the application of legibility indexes, which evaluate the fluency of the sentences, the use of unusual words and other factors indicating the complexity of the document, the level of academic education and the socio-economic provenance of the writer can also be estimated. Likewise, by examining the style of the text (use of rhetoric, linguistic expedients, characteristics of phraseology, etc.), it is possible to delineate the psychological profile of the author: from the type of a person’s discourse, significant elements of his way of thinking, character and temperament can emerge. The study of the writing is therefore a particularly rich source of information concerning the ties which inevitably exist between expressive modality and individual personality.
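A few such stylometric features can be computed mechanically, as in this sketch; the feature set is a small illustrative sample of what authorship studies actually use:

```python
import re

def style_fingerprint(text):
    # Coarse features of the kind used in authorship studies:
    # sentence length, vocabulary richness, word length.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
        "avg_word_len": sum(map(len, words)) / len(words),
    }
```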
The technologies of automatic treatment of the spoken language
Let us now take up the theme of the elaboration of the spoken language, in order to illustrate the different technologies in more detail. First of all, we can divide the area into two broad topics: generation (synthesis and/or coding of the voice) and perception (recognition of the spoken language and/or of the speaker).
Coding and synthesis of the vocal signal
The objectives of the generation of the vocal signal are two:
- the coding of the signal, which consists in memorizing it in compressed form and subsequently reconstructing it;
- the generation of the voice from a written text.
The coding of the vocal signal starts from the observation that the signal perceived by the human ear contains significant redundancy (caused by the acoustic environment, by accessory information, by the particular voice of the speaker, etc.), so that methods must be found to reduce this redundancy in the transmitted signal. Coders have therefore been designed which, on the basis of different methods, try to render the signal as clear and clean as possible (think of the use of cell phones).
Synthesis of the text, instead, means the acoustic production of a written text (text-to-speech). Among the most widespread applications are the reading of newspapers or books to the blind and the reading of messages by computers.
At present, the precision and naturalness of the artificially produced expressions are remarkable; certain studies are now focusing on the characterization of the speakers (male, female, age, etc.) and on the introduction of emotions into the vocal tones, like pain, surprise, joy, and so on.
Also worth mentioning in this context are the intelligent systems able to acquire a vocal request in natural language, identify the suitable reply and transform it into voice.
For these applications, the techniques of text comprehension have a role of fundamental importance and can make possible the realization of extremely sophisticated personal services, able to satisfy an extraordinary range of particular needs.
For example, the Navy Centre for Applied Research in Artificial Intelligence has realized a natural-language interface to a simulation software programme developed by NRL’s Tactical Electronic Warfare division. The users of this system can control the software with commands of the following kind: “How many enemy ships are there?”, “Do not show the unarmed allied ships”, “Some enemy ships are equipped with the device for…”.
Recognition of the spoken language
The recognition of the spoken language (speech-to-text) consists, essentially, in transforming a spoken discourse into a written text, converting the words (or the phonemes of the words) into their written form.
To reduce the margins of error, the phonetic analyzers have been supported with methods of linguistic analysis, which help to capture the sense of the spoken language and thus to diminish the disturbance of the noise present in it; the phonetic recognizer can also be trained to become familiar with the particular voice of the speaker.
The quality of recognition depends on many factors: the speed of the spoken word, the dimension of the vocabulary, the training of the system with respect to the voice of the speaker, etc.
Having said this, it is easy to understand both the feasibility and the quality of the results of some speech-to-text systems (like the recognition of vocal commands, the dictation of texts and numbers, the compilation of forms, word spotting in a conversation) and, in parallel, the particular complexity of realizing adequate software for the recognition of spontaneous spoken language in a conversation between people whose voices have not been previously sampled.
Identification and verification of the speaker
In the area of the technologies of elaboration of the spoken language, the identification and verification of the speaker is used principally to establish the identity of the speaker and, therefore, to confirm with certainty an identity declared (in order, for example, to permit the activation of a system or the go-ahead for a procedure, etc.). One of the most difficult aspects is the recognition of a speaker whose identity has not been declared, but must be found by choosing from among a group of possible candidate “voices”. Obviously, the complexity of the task grows in proportion to the number of voices.
Among the areas of applicability of these techniques are the biometric systems for the identification of a person and recognition for identity or forensic purposes.
The promotion of research by business corporations, in coordination with the universities, represents a fundamental element for the development of highly innovative linguistic instruments and a guarantee of continuity for the evolution of the TAL phenomenon.
In the Italian context, in 2002, on the initiative of the Ministry of Communications, the Forum-TAL (www.forumtal.it) was instituted with the aim of coordinating research and development initiatives in the field of automatic elaboration of the written and spoken language, and of promoting new proposals directed at the employment of these technologies. The founding members of ForumTAL are public and private bodies which have distinguished themselves by their dedication and the results obtained in this field, and which collaborate to stimulate new interest and identify the possible needs of the users.
The diffusion of knowledge of the TAL technologies constitutes an indispensable step for their progress. Unfortunately, the successes of the Italian industry in this sector are still relatively unknown. Nevertheless, this attractive, little-known market has been very active for some years. It is an emerging market which produces highly competitive solutions appreciated by its users, and one which can offer great potential to the world of intelligence. In this context, it is fundamental to study and realize “non-conventional” solutions, designed and personalized on the basis of the specific problems to be resolved, because the objectives to be reached are “particular” and cannot always be pursued with standard methodologies and products.
(1) In the original Italian text, the word “espresso” is used in the sense of a type of Italian coffee. It is also used in other senses which have their equivalents to the English word “express”.
(2) In the original Italian text, the example “macchina” is intended in the sense of “car”, but it also means “machine” and is a prefix word for many other types of mechanical devices. For example, “macchina da presa” = movie camera