De-biasing is hard!

The general notion of bias

Bias, as a machine learning concept, is age-old. It refers to giving a learning model prior information about the task at hand that might be helpful. In early machine learning systems, models were biased using statistical priors that encoded the statistician’s expert knowledge about the behaviour of a system. With the advent of deep learning and the availability of enormous amounts of data, the focus has shifted from injecting explicit priors to concepts such as transfer learning. Transfer learning is conceptually simple: take a lot of data and a complex-enough model, and let the model capture the nuances in the data; once the model has captured the knowledge in this vast amount of data, it can transfer that knowledge to other tasks, just as statisticians would transfer their expert knowledge using a prior. As these new deep learning models became popular and transfer learning gained widespread adoption, a new kind of bias was spotted in machine learning systems: one that is prejudicial.

Real-life data is riddled with the imprints of human preconceptions and social biases. Let’s say we are training a large BERT model on a lot of data and, incidentally, terms like “gay” or “Islam” were used so frequently in abusive comments that our language model learned to disproportionately associate those terms with negativity or abuse. Now, if we use this BERT model to transfer its knowledge to a sentiment classification task, our model may classify a sentence like “I am gay, and I follow the path of Islam” as highly negative and toxic, even though the sentence is neither.

Unlike our fellow statistician, who carefully selected the prior knowledge to inject, we weren’t careful when we trained our BERT model on this large corpus of data, and now we have the problem of a false-positive bias [1]. This illustrates the problem of bias in current state-of-the-art NLP models and how they can be very unfair to already marginalised communities.

Gay, queer, homosexual, black, lesbian, Islam are some of the words that have been found to suffer from false-positive bias.

Consequences in NLP

The consequences of bias in NLP models can be felt far and wide, and have been the subject of increasing scrutiny in recent times. Many lenses can be put on to study the body of research in the context of bias and fairness. We choose to put on one that looks at a few examples of the effects of biased models with respect to various marginalised groups and communities.

Early works that hinted at the presence of bias in NLP models studied word associations, measured by the similarity of the corresponding vectors in pre-trained models such as Word2Vec.

As a recap, it has been found that the word embeddings learned by Word2Vec-like models roughly lend themselves to the rules of linear algebra; thus, if we have the four words “Rome”, “Italy”, “Paris” and “France” and their embeddings, then, approximately:

vec(“Rome”) - vec(“Italy”) ≈ vec(“Paris”) - vec(“France”)
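To make this arithmetic concrete, here is a minimal sketch with hand-made two-dimensional vectors. The numbers and the `analogy` helper are illustrative assumptions, not real Word2Vec embeddings; with a trained model the relation would only hold approximately.

```python
import math

# Toy 2-D "embeddings" invented for illustration (not real Word2Vec vectors).
# Dimension 0 loosely encodes the country, dimension 1 encodes "is a capital".
EMBEDDINGS = {
    "Italy": [1.0, 0.0],
    "Rome": [1.0, 1.0],
    "France": [-1.0, 0.0],
    "Paris": [-1.0, 1.0],
    "Germany": [0.2, 0.0],
    "Berlin": [0.2, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Solve "a is to b as ? is to c" with the vector arithmetic a - b + c."""
    query = [
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    # Exclude the three input words, then pick the closest remaining word.
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, EMBEDDINGS[w]))


print(analogy("Rome", "Italy", "France"))  # prints "Paris" for these toy vectors
```

The same arithmetic, applied to a gendered word pair and an occupation instead of capitals and countries, is what surfaces the associations discussed next.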

Bolukbasi et al. use this idea to discover spurious associations between gender and occupation.

In fact, they take many of these word analogies and aggregate them to assess whether there is a systematic bias in terms of occupations and gender. And there is!

Similar behaviour has also been observed in BERT and other widely used transformer-based language models [3]. It is easy to test this ourselves!

Let’s recall that BERT is trained by masking random words in a sentence and making the model predict them. So, if we want to see our biased BERT model in action, all we need to do is make it predict a [MASK] token in the context of an occupation. To do this, we can load BERT:

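from transformers import pipeline

# The original call is cut off after "pipeline("; the "fill-mask" task and the
# "bert-base-uncased" checkpoint are assumptions, not the authors' confirmed settings.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")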

And use it to predict masks such as:

s = "[MASK] is a computer scientist"
fill_mask(s)

s = "[MASK] is a babysitter"
fill_mask(s)

Here are a few such results, which reflect how gender-biased BERT really is.

The ripple effect of these gender-biased models and embeddings can be seen in the downstream tasks where they are used. Take, for example, coreference resolution, the task of identifying mentions of the same entity (person, place, organisation etc.) in different places in the same text document. As a result of this bias, coreference resolution systems often fail to link female pronouns to certain occupations.

Similar wrongful behaviour can also be found in the context of race. For example, Sheng et al. found a peculiar behaviour in the generative capabilities of GPT-2:


Models often relate black men to theft and pimping, Latin Americans to drugs, and women to menial tasks.

It gets even more surprising when we delve into abuse and hate speech detection. Toxicity detection has become an important NLP problem since the advent of social media. To automatically prevent toxic comments from cropping up on social websites, companies like Google have dedicated teams training models to classify a piece of text as toxic or benign. One result of such an endeavour is the Perspective API, which evaluates a comment to determine whether it contains abusive language. We tried it on statements that are not at all abusive, and the results were grim.



In essence, a model that was trained to recognise toxic comments likely formed a spurious connection between toxicity and words that refer to race or disability, rather than learning the real context of the toxicity. When such a model is given a benign sentence containing one of these false-positive biased words, it raises the alarm and predicts the benign sentence to be toxic.

Fighting bias

The fight against bias starts with identification. To that end, one of the earliest methods proposed is WEAT or Word Embedding Association Test. WEAT aims at measuring the degree to which a model associates sets of target words (e.g., African American names, European American names, flowers, insects) with sets of attribute words (e.g., “stable”, “pleasant”, or “unpleasant”). The association between two given words is defined as the cosine similarity between the embedding vectors for the words.

For example, the target lists may be types of flowers and insects, and the attributes pleasant words (e.g., “love”, “peace”) or unpleasant words (e.g., “hatred”, “ugly”). The overall test score is the degree to which flowers are more associated with the pleasant words, relative to insects. A high positive score (the score can range between 2.0 and -2.0) means that flowers are more associated with pleasant words, and a high negative score means that insects are more associated with pleasant words.
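As a rough sketch of how such a score can be computed, the snippet below implements a WEAT-style effect size on hand-made two-dimensional vectors. The embeddings, the function names and the choice of the population standard deviation are assumptions made for illustration; real tests use embeddings from a trained model, and implementations differ in which standard deviation they use.

```python
import math
import statistics

# Toy embeddings invented for illustration; a real test would load them
# from a trained model such as Word2Vec or GloVe.
EMBEDDINGS = {
    "rose": [0.9, 0.1], "tulip": [0.8, 0.3],      # target set X (flowers)
    "wasp": [0.1, 0.9], "roach": [0.3, 0.8],      # target set Y (insects)
    "love": [1.0, 0.1], "peace": [0.9, 0.2],      # attribute set A (pleasant)
    "hatred": [0.1, 1.0], "ugly": [0.2, 0.9],     # attribute set B (unpleasant)
}


def cosine(u, v):
    """Cosine similarity, the word-to-word association used by WEAT."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attrs_a, attrs_b):
    """Mean similarity of `word` to set A minus its mean similarity to set B."""
    w = EMBEDDINGS[word]
    to_a = statistics.mean([cosine(w, EMBEDDINGS[a]) for a in attrs_a])
    to_b = statistics.mean([cosine(w, EMBEDDINGS[b]) for b in attrs_b])
    return to_a - to_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, scaled by their spread."""
    s_x = [association(w, attrs_a, attrs_b) for w in targets_x]
    s_y = [association(w, attrs_a, attrs_b) for w in targets_y]
    # Population standard deviation over all target words (an assumption).
    spread = statistics.pstdev(s_x + s_y)
    return (statistics.mean(s_x) - statistics.mean(s_y)) / spread


flowers, insects = ["rose", "tulip"], ["wasp", "roach"]
pleasant, unpleasant = ["love", "peace"], ["hatred", "ugly"]
score = weat_effect_size(flowers, insects, pleasant, unpleasant)
print(f"effect size: {score:.2f}")
```

With equally sized target sets and this normalisation, the score stays within the range of -2.0 to 2.0, and swapping the two target lists flips its sign.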

When WEAT was performed on popular, heavily used pre-trained models such as Word2Vec and GloVe, the results showed that human biases were rampant in the embeddings generated by those models.

The WEAT paper was seminal in the area of recognising systematic bias in NLP models, and several follow-up papers have built upon or slightly modified the test.

After acknowledgement comes the need to remedy or resolve bias in such models. Depending upon the learning task, the model at hand and the type of bias being addressed, the remedial treatment proposed in the literature may vary greatly. For example, Lu et al. suggest a mechanism they call counterfactual data augmentation (CDA), which augments biased real-world training data to make it less sensitive to gender. In essence, it pairs each training example containing male pronouns like “he” or “him” with a replica containing female pronouns like “she” or “her”. This encourages learning algorithms not to pick up on the distinction. They find that such a simple approach is enough to alleviate, for example, occupation bias towards male and female pronouns in data.
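The idea can be sketched in a few lines. The pronoun table and the helper names below are simplifying assumptions made for illustration, not the authors' implementation: for instance, “her” is always mapped to “him”, although it can also correspond to “his”, and only pronouns are handled.

```python
import re

# Simplified, hypothetical swap table covering pronouns only.
# "her" is ambiguous (him/his); mapping it to "him" is a deliberate shortcut.
SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her",
    "himself": "herself", "herself": "himself",
}
PRONOUN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def counterfactual(sentence):
    """Return a copy of `sentence` with gendered pronouns swapped."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        # Preserve sentence-initial capitalisation.
        return replacement.capitalize() if word[0].isupper() else replacement
    return PRONOUN.sub(swap, sentence)


def augment(corpus):
    """Keep every original example and add its counterfactual, if different."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        flipped = counterfactual(sentence)
        if flipped != sentence:
            augmented.append(flipped)
    return augmented


print(augment(["He is a doctor.", "The weather is nice."]))
```

Sentences without gendered pronouns pass through unchanged, so the augmented corpus only grows by the examples that actually carry a gender signal.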

While the above mechanism may work for simpler models, it does involve costly retraining. For much larger and more involved language models such as BERT, it may be infeasible to retrain the model from scratch. In such cases, approaches which battle bias post-hoc are more useful. One such approach, proposed by Liang et al., aims to remove bias from sentence representations obtained from BERT-like models (for tasks such as sentiment classification) by intervening after the model has been pre-trained.

Their approach relies on first identifying a so-called bias subspace, an abstract latent vector space in which bias resides for all of the data. Put simply, the representation of every biased sentence in English would reside somewhere in the bias subspace. To construct this bias subspace, Liang et al. use a variety of open-source datasets and select all sentences from them that contain, for example, a gendered pronoun (e.g. “he”, “she”, “him”, “her”). Then they use a language model (like BERT) to obtain fixed-length sentence representations for these sentences. Finally, they perform PCA on the sentence representations and keep the top-k principal components, just as we would if we were performing dimensionality reduction on data.

Now, to make sure that any text representation for a downstream task is bias-free, all we need to do is make that representation orthogonal to the principal components, and therefore the bias subspace. The overall intuition is simple: (1) choose a type of bias to remove, let’s say gender bias, (2) collect sentences with gendered pronouns from the web, (3) get sentence representations for the aforementioned sentences using any sentence encoder model, (4) perform PCA on these vectors and store the top-k components, (5) for each sentence in the downstream task, get the sentence representation from the same sentence encoder model, (6) finally, orthogonalise each representation from the previous step with respect to the principal components from step 4.
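Steps 4 to 6 can be sketched with NumPy as follows. This follows the simplified description above rather than the authors' exact procedure; the random vectors stand in for real sentence representations, and the function names, the dimensionality and the value of k are assumptions chosen for illustration.

```python
import numpy as np


def bias_subspace(representations, k):
    """Step 4: top-k principal components of the gendered-sentence vectors."""
    centered = representations - representations.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def remove_bias(representation, components):
    """Step 6: subtract the projection onto the bias subspace."""
    projection = components.T @ (components @ representation)
    return representation - projection


rng = np.random.default_rng(0)

# Stand-in for step 3: 50 sentence representations of dimension 8.
# In practice these would come from a sentence encoder such as BERT.
gendered_reps = rng.normal(size=(50, 8))
components = bias_subspace(gendered_reps, k=2)

# Stand-in for step 5: one downstream sentence representation.
downstream_rep = rng.normal(size=8)
debiased = remove_bias(downstream_rep, components)

# The debiased vector has no remaining component along the bias directions.
print(np.allclose(components @ debiased, 0.0))
```

Because the principal components are orthonormal, a single projection is enough to make the representation orthogonal to the whole subspace.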

Are we really de-biasing?

Granted, researchers are well on their way to identifying and quantifying bias (as we’ve illustrated through some examples above), but doubt looms over whether the successful eradication of bias is even a remote possibility.

In the work of Blodgett et al., the authors survey a large number of papers from leading machine learning conferences that study systematic bias in NLP. They find that the motivations of most of the papers are vague, inconsistent and lacking in normative reasoning. They also find that nearly all of the surveyed papers tackle a very narrow range of potential sources of bias, and do not measure the implications of their de-biasing on other parts of the NLP development and deployment life cycle.

Tackling a narrow range of biases without measuring implications is a potential hazard that may cause more harm than good. Let us discuss, for example, a case where there are two genders and two races in a dataset: “male” or “female” and “white” or “black”. If we were to follow the work in some of the papers exemplified above, we would first have to choose which group to protect, and most of these narrow-focus papers do not provide guidelines on how to extend their methods to multiple groups. So, if we choose to protect the “female” group, then an effect of that may be an increase in bias against the “male” group, to level off the bias against women. This might cause an increase in bias against the “black + male” group, as an unintended effect. The same thing would happen against the “white + female” group if we were to tackle bias against the “black” group. Such negative effects of bias-cancellation have not been extensively studied, but it is easy to argue that these undesirable effects might be present.

In the preceding paragraphs, we argued that bias removal in existing research may have been faulty, but there’s still hope: we simply have not found the right way yet! However, Waseem et al. describe a much more fundamental problem in bias research.

While the bias in data is undoubtedly the cause of problems downstream, it is not the only source of bias in an NLP or ML system. The choice of the dataset, the model to be trained, and the steps that an individual follows to engage in an NLP task are all subject to choice. The authors argue that these choices themselves are subjective and that bias and subjectivity in ML are inescapable and thus cannot simply be removed. By treating de-biasing as an optimization problem, as most papers do, researchers cast their own choices as objective, elevating them above the more subjective notions of bias in data, modelling etc. However, this is a fallacy, as these so-called objective choices are themselves based on observations from those very subjective notions. And this, precisely, is the big problem with bias research in NLP!

Bias research in NLP is narrow and incomplete; bias and subjectivity in ML are inescapable!

Not all is lost

Thus far, we’ve discussed what bias is, how it affects NLP models and how researchers currently aim to fix it. We ended on the utterly disappointing note that it may all be in vain!

But not all is lost, if we can acknowledge that fixing bias in systems goes beyond numerical optimization. Blodgett et al. [9] delineate a few important paths forward to fight bias holistically.

We must acknowledge that fixing bias in systems goes beyond numerical optimization!

First is the need to explore the relationships between language and social hierarchies. Language is the means through which social groups are labeled and beliefs about social groups are transmitted, and these group labels can promote inequalities and stereotypes. An important question is how NLP systems assuage, uphold or exaggerate such pre-existing social hierarchies.

A question in the same vein is how the evolution of language and text affects NLP models. For example, in the news, we often come across the phrase “illegal immigrants”. Whether a sentence goes “Illegal immigrant causes harm to neighbour” or “Illegal immigrant rescues child”, it reinforces the association of “illegal” with “immigrant”, which is likely to result in biased language models. Recently, there have been suggestions to make written language fairer and more inclusive. Can we claim that making the language fairer would fix the problems of bias in NLP models?

Another important topic would be to further sharpen the understanding of bias in NLP by providing explicit statements of why the system behaviours that are described as bias are harmful, in what ways, and to whom. It’s not enough to say that there is bias against women in a language model; we must explore how much it affects downstream tasks and how this cycle reinforces even more bias (if at all).

Finally, it’s also important to understand how certain communities become aware of NLP systems and whether they resist them. If, let’s say, a chatbot in an application is underused by women, then the bot will be less effective when chatting with women, which in turn will worsen their experience with the bot, and a vicious cycle starts. In such cases, it would be important to understand the deeper issue of why the application wasn’t used by women to begin with, before trying to de-bias the system.

Final words

In this post, we aimed to expose how bias in NLP is more nuanced than it seems and why we might not be looking at it the right way. The current strain of papers tends to propose de-biasing mechanisms by looking at very narrow areas of the bias landscape, which limits their practicality. Bias is systemic, and in this blog we hope to have provided ample rationale for why we need to move beyond mathematical optimization to deal with it!


About Prosus

Prosus is a global consumer internet group and one of the largest technology investors in the world. Operating and investing globally in markets with long-term growth potential, Prosus builds leading consumer internet companies that empower people and enrich communities. 

The group is focused on building meaningful businesses in the online classifieds, food delivery, payments and fintech, and education technology sectors in markets including India, Russia, and Brazil. Through its ventures team, Prosus invests in areas including health, logistics, blockchain, and social commerce. Prosus actively seeks new opportunities to partner with exceptional entrepreneurs who are using technology to improve people’s everyday lives.

Every day, millions of people use the products and services of companies that Prosus has invested in, acquired or built, including 99minutos, Avito, Brainly, BUX, BYJU'S, Bykea, Codecademy, DappRadar, DeHaat, dott, ElasticRun, eMAG, Eruditus, Flink, GoodHabitz, Honor, iFood, Klar, LazyPay, letgo, Meesho, Movile, Oda, OLX, PayU, Quick Ride, Red Dot Payment, Remitly, Republic, Shipper, SimilarWeb, Skillsoft, SoloLearn, Swiggy, Udemy, Urban Company and Wolt.

Hundreds of millions of people have made the platforms of Prosus’s associates a part of their daily lives. For listed companies where we have an interest, please see: Tencent, Mail.ru, Trip.com Group Limited, and DeliveryHero.

Today, Prosus companies and associates help improve the lives of more than 2 billion users globally. 

Prosus has a primary listing on Euronext Amsterdam (AEX:PRX) and secondary listings on the Johannesburg Stock Exchange (XJSE:PRX) and A2X Markets (PRX.AJ). Prosus is majority-owned by Naspers.

For more information, please visit www.prosus.com.