Illustration: E. Lacey, Getty Images

Detecting propaganda with NLP

"Propaganda is to a democracy what the bludgeon is to a totalitarian state."

-- Noam Chomsky

As our world is undergoing major crises, propaganda is heavily used to control public opinion and neutralize dissent. The propagandistic toolbox is getting more and more sophisticated – thus, it becomes harder to identify propaganda in public and social media. In this article, we will review the major linguistic expressions of propaganda and see how Natural Language Processing can be used for its automated detection, also pointing out the complexity and limitations of this task.

How propaganda is expressed in language

According to the Institute for Propaganda Analysis, propaganda is the expression of opinion or action by individuals or groups deliberately designed to influence opinions or actions of other individuals or groups with reference to predetermined ends.[1] Thus, we are dealing with a methodology to control the minds of people - and since most people don't appreciate being controlled, propaganda works best when it goes unnoticed.

Propaganda can be organically woven into writing in a multitude of ways[2], such as:

  • Name calling/labeling of the object of propaganda as something the target audience hates/fears or, on the contrary, loves/praises: “Republican congressweasels”, “Bush the Lesser”.
  • Loaded language, the use of words or phrases with strong emotional implications to influence the audience: “how stupid and petty things have become in Washington”.
  • Whataboutism: discrediting the opponent’s position by charging them with hypocrisy without directly disproving their argument: “Qatar spending profusely on Neymar, not fighting terrorism”.
  • Generation of strong emotions, especially fear: “either we go to war or we will perish”.

In general, emotions are a powerful tool – emotional arousal knocks out our ability to think logically and leads us to accept faulty arguments and logical fallacies. Professional propaganda is getting more and more sophisticated in targeting the emotions and the subconscious minds of readers. For example, to create fear, a propagandist might use direct calls to action, such as "we must go to war or we will perish". However, the more direct the statement, the more easily it can be spotted. On a more subtle level, fear can be created indirectly. For example, recent years have seen frequent statements about the clinical insanity - and, thus, unpredictability - of people in power. This technique becomes even more effective given our society's growing interest in mental health issues. Such statements were applied to Donald Trump in 2020 – and, fast forward to 2022, also to Vladimir Putin.

Algorithms are more objective and consistent in spotting propaganda

Detecting propaganda requires not only a cool head, but also training, which is unlikely to become a part of official education curricula. Thus, propaganda can still achieve its goal of shaping the opinion of the majority of people, while the small number of interested and trained individuals who notice it can hardly bring about a significant change in opinion. By contrast, NLP algorithms don't succumb to emotions and fallacies and can be more consistent and objective in flagging propaganda.

This doesn’t mean that automating propaganda detection is easy – on the contrary, it is a highly complex NLP task. The three main challenges of automating propaganda detection are:

  • Training data is sparse. Creating it manually requires extensive annotator training, and annotators often disagree, so a lot of discussion and adjustment is needed.
  • There are many different ways to encode propaganda (e.g. the 13 techniques by Miller), which are expressed in very different ways in writing.
  • The detection of most propaganda techniques requires rich background knowledge or even "intuition", which is difficult to teach to algorithms.

On the other hand, propaganda has a few properties that make automated processing easier. First, the most frequent techniques are also the easiest to detect since they mainly operate in short text spans on the lexical level. This is the case, for example, for loaded language and name calling/labeling. Second, a range of existing NLP techniques can be reused as part of propaganda detection (e.g. sentiment analysis, fake news detection and subjectivity detection). Finally, after detecting several expressions of propaganda in a specific article or by a specific author, we can generalize with a certain confidence that the overall content is propagandistic.
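To make the first and last points concrete, here is a minimal sketch of lexicon-based span detection with a crude document-level aggregate. The lexicon entries, the technique labels and the function names are illustrative assumptions, not part of any published system; real detectors learn such cues from annotated data.

```python
import re

# Hypothetical toy lexicon; a real system would use a curated or learned resource.
LOADED_TERMS = {
    "stupid": "loaded_language",
    "petty": "loaded_language",
    "congressweasels": "name_calling",
}


def find_spans(text):
    """Return (start, end, technique) for each lexicon hit, using character offsets."""
    spans = []
    for match in re.finditer(r"\w+", text):
        technique = LOADED_TERMS.get(match.group().lower())
        if technique:
            spans.append((match.start(), match.end(), technique))
    return spans


def propaganda_density(text):
    """Share of word tokens flagged by the lexicon: a crude document-level signal."""
    tokens = re.findall(r"\w+", text)
    return len(find_spans(text)) / len(tokens) if tokens else 0.0


example = "How stupid and petty things have become in Washington"
print(find_spans(example))
print(propaganda_density(example))
```

A plain lexicon lookup ignores context, so it will both miss novel wording and flag harmless uses; it only illustrates why short, lexical techniques are the most tractable ones.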

The SemEval-2020 task 11 on Detection of Propaganda Techniques in News Articles[3] has given rise to a range of NLP approaches to propaganda detection. It consists of two subtasks - the retrieval of propaganda text spans as well as the identification of the technique used. In the following, we will look at the three main building blocks of an approach to propaganda detection: the creation of training data, linguistic feature engineering as well as the machine learning architecture.

Training data creation

Training data for propaganda detection can be created manually using specific annotation guidelines. Annotation can happen on the document level or on the level of specific text spans, as in the SemEval-2020 task on propaganda detection (cf. the annotation guidelines). Especially the latter is a rather tricky annotation task: on the one hand, it requires a trained eye and knowledge of a multitude of propaganda techniques on the part of the annotators. On the other hand, propaganda detection by humans is highly subjective. Thus, a lot of reflection and discussion is required to achieve an acceptable agreement rate between the annotators. This is the main reason why most currently available propaganda datasets are rather small - for example, the SemEval gold standard set counts around 530 articles with almost 9000 propaganda text spans.
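Agreement between annotators can be quantified. The sketch below computes Cohen's kappa, a common chance-corrected agreement measure, for two annotators who label the same sentences. Kappa is used here only as a simple, well-known illustration; it is not necessarily the measure used for the SemEval dataset, and the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random with their own label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentence-level judgments: 1 = propaganda, 0 = not propaganda.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 1, 1, 0]
print(cohens_kappa(annotator_a, annotator_b))  # 0.5
```

In this invented example the annotators agree on 6 of 8 sentences, but half of that agreement is expected by chance, which is why kappa is well below the raw agreement rate of 0.75.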

Most state-of-the-art deep learning techniques require a large quantity of training data. Automated data augmentation can be used to increase the training dataset. Some of the common techniques are Masked Language Modeling, where parts of the training data are generated by a language model[4], as well as the replacement of words by their nearest neighbours in the semantic space[5].
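The nearest-neighbour idea can be sketched in a few lines. The two-dimensional vectors below are made-up stand-ins for real word embeddings, and the function names are hypothetical; the approach in [5] may differ in how neighbours are chosen and filtered.

```python
import math
import random

# Toy 2-d vectors standing in for real word embeddings (hypothetical values).
EMBEDDINGS = {
    "stupid": (0.9, 0.1),
    "foolish": (0.85, 0.15),
    "petty": (0.7, 0.4),
    "trivial": (0.65, 0.45),
    "war": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_neighbour(word):
    """Most similar other word in the toy vocabulary, or None if the word is unknown."""
    if word not in EMBEDDINGS:
        return None
    others = [w for w in EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))


def augment(tokens, replace_prob=0.5, seed=0):
    """Replace known tokens by their nearest neighbour with probability replace_prob."""
    rng = random.Random(seed)
    augmented = []
    for token in tokens:
        neighbour = nearest_neighbour(token)
        if neighbour is not None and rng.random() < replace_prob:
            augmented.append(neighbour)
        else:
            augmented.append(token)
    return augmented


tokens = ["how", "stupid", "and", "petty", "things"]
print(augment(tokens, replace_prob=1.0))
```

Because replacements keep the token count unchanged, token-level annotations of the original sentence can be carried over to the augmented one, under the assumption that the neighbour preserves the propagandistic load of the original word.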

Linguistic feature engineering

Propaganda is a discourse phenomenon and can be expressed in a variety of ways. It also pertains to several levels of language - the lexical, syntactic, semantic and pragmatic levels. Current machine learning models are not able to capture this diversity from raw text without further abstraction. Thus, a range of additional input features and linguistic knowledge derived from the original text can be used to improve the performance of machine learning, among them:

  • Named entity recognition, i.e. the extraction of person names, locations, companies etc. from text
  • Part-of-speech tagging and syntactic trees for a more accurate syntactic representation
  • Sentiment and subjectivity analysis as a proxy for the emotional load of the content
  • Rhetoric, i.e. the identification of rhetorically salient phrases that are part of a subjective argumentation strategy[6]
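As a rough illustration of how such features end up as model input, the sketch below derives a small feature dictionary from a sentence. Every feature here is a deliberately crude stand-in (a tiny invented word list instead of a subjectivity model, capitalization instead of named entity recognition, one regular expression instead of rhetorical analysis); an actual system would obtain these signals from dedicated NLP components.

```python
import re

# Hypothetical mini-lexicon of subjective words; real systems would use a full resource.
SUBJECTIVE_WORDS = {"stupid", "petty", "perish", "insane", "disgraceful"}


def extract_features(sentence):
    """Return a dictionary of simple surface features for one sentence."""
    tokens = re.findall(r"\w+", sentence)
    n = len(tokens)
    subjective = sum(t.lower() in SUBJECTIVE_WORDS for t in tokens)
    # Capitalized tokens not at sentence start: a rough proxy for named entities.
    capitalized = sum(t[0].isupper() for t in tokens[1:])
    return {
        "num_tokens": n,
        "subjective_ratio": subjective / n if n else 0.0,
        "capitalized_non_initial": capitalized,
        # "either ... or" framing as a rough cue for a black-and-white argument.
        "has_either_or": bool(re.search(r"\beither\b.*\bor\b", sentence, re.IGNORECASE)),
        "exclamation_marks": sentence.count("!"),
    }


features = extract_features("Either we go to war or we will perish!")
print(features)
```

Such a dictionary can be concatenated with the representation produced by a neural model or fed to a classical classifier on its own.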

Building a machine learning architecture

Once the training data is assembled and enriched with additional features, it can be fed into a machine learning architecture. The successful systems in the SemEval task benefit from large-scale pre-trained models and rely on transformers (BERT, GPT-2, XLNet, XLM, RoBERTa, or XLM-RoBERTa) or ensembles of these in combination with an LSTM or a CRF.
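Sequence models such as a transformer with a CRF layer are typically trained on one label per token. One common way to prepare span annotations for this is BIO tagging, sketched below; the BIO scheme, the tag names and the whitespace tokenization are assumptions for illustration, and the source does not state which encoding the SemEval systems used. The model itself is not shown.

```python
import re


def spans_to_bio(text, spans):
    """Map character-level (start, end) spans to one BIO tag per whitespace token."""
    tokens, tags = [], []
    previous_span = None
    for match in re.finditer(r"\S+", text):
        # Index of the first annotated span that overlaps this token, if any.
        current_span = next(
            (i for i, (start, end) in enumerate(spans)
             if match.start() < end and match.end() > start),
            None,
        )
        if current_span is None:
            tag = "O"        # outside any propaganda span
        elif current_span == previous_span:
            tag = "I-PROP"   # continues the span opened by an earlier token
        else:
            tag = "B-PROP"   # first token of a span
        tokens.append(match.group())
        tags.append(tag)
        previous_span = current_span
    return tokens, tags


text = "Those Republican congressweasels voted again"
phrase = "Republican congressweasels"
start = text.index(phrase)
tokens, tags = spans_to_bio(text, [(start, start + len(phrase))])
print(list(zip(tokens, tags)))
```

For the second subtask, the generic `PROP` label could be replaced by the name of the technique, so that the same tagging scheme carries both the span boundaries and the technique.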

Outlook: What NLP cannot cover (yet)

The techniques outlined above are the operative tools of propaganda. They have a concrete and explicit expression at the lexical, syntactic and semantic levels. However, for this kind of misinformation to work, it needs to be rooted in an ideological foundation of deeply ingrained beliefs and attitudes. These beliefs and attitudes are constructed over a long time and spread across a multitude of different sources. Thus, they form an intricate intellectual labyrinth that is hard to unravel both for machines and for humans.

This might be one of the reasons why Noam Chomsky, who is both the father of modern linguistics and an ardent investigator and critic of propaganda, never elaborated the relation between language and propaganda.[7] While Chomsky's linguistic theory was a highly sophisticated rule-based theory of grammar, his analyses of propaganda draw from a vast knowledge of history and human nature that cannot be formalized in syntactic trees.

Summing up

Propaganda targets our unconscious minds and aims to be invisible. Algorithms are free from unconscious bias and thus have the potential to uncover propaganda in more systematic, consistent ways. At the present state of the art, NLP algorithms can detect propaganda techniques that have an explicit expression in language, while the analysis of the deeper context of underlying beliefs and ideologies remains a challenge for the future.

Note: This article was originally published on The Yuan platform.

References and further readings

[1] Institute for Propaganda Analysis. 1938. How to detect propaganda. In Propaganda Analysis. Volume I, chapter 2, pages 210–218. Publications of the Institute for Propaganda Analysis, New York, NY, USA.
[2] Clyde R. Miller. 1939. The Techniques of Propaganda. From “How to Detect and Analyze Propaganda,” an address given at Town Hall. The Center for learning.
[3] Giovanni Da San Martino, Alberto Barrón-Cedeño, Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020. SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1377–1414, Barcelona (online). International Committee for Computational Linguistics.
[4] Andrei Paraschiv and Dumitru-Clementin Cercel. 2020. UPB at SemEval-2020 Task 11: Propaganda detection with domain-specific trained BERT. In Proceedings of the 14th International Workshop on Semantic Evaluation, SemEval ’20, Barcelona, Spain.
[5] Daryna Dementieva, Igor Markov, and Alexander Panchenko. 2020. SkoltechNLP at SemEval-2020 Task 11: Exploring unsupervised text augmentation for propaganda detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, SemEval ’20, Barcelona, Spain.
[6] Verena Blaschke, Maxim Korniyenko, and Sam Tureski. 2020. CyberWallE at SemEval-2020 Task 11: An analysis of feature engineering for ensemble models for propaganda detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, SemEval ’20, Barcelona, Spain.
[7] Annabelle Lukin. 2013. Journalism, ideology and linguistics: The paradox of Chomsky’s linguistic legacy and his ‘propaganda model’. Journalism, 14:96–110. doi:10.1177/1464884912442333.
[8] Noam Chomsky. 2002. Media Control: The Spectacular Achievements of Propaganda. Seven Stories Press, New York, NY, USA.