Natural Language Processing on Hinglish

Hinglish is a hybrid of Hindi and English, mixed within conversations, individual sentences and even single words. An example: “nahi mei nahi aa sakta”, which translates to “no, I cannot come.” It is gaining popularity as a way of speaking that shows you are modern, yet locally grounded.

India is a diverse country, and one sees both ends of the spectrum: Hindi speakers who read and understand the Devanagari script, and tourists from abroad who may or may not understand the language at all. Local Indian markets are a huge attraction for foreign tourists, yet most vendors there understand only Devanagari, so we saw a need to use Hinglish as a medium where the two ends can meet.

Our program aims to do two things:

  • Detect whether a typed word is English or Hindi written in the Latin script
  • Translate the Hinglish word to both English and Hindi

PART 1

To tackle translation, we built a sequence-to-sequence architecture using TensorFlow.

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French).

Seq2Seq is an encoder-decoder method for machine translation and language processing that maps an input sequence to an output sequence, often with an attention mechanism. The idea is to use two RNNs that work together: special tokens mark where an output sequence starts and ends, and the decoder predicts each next element of the output from the elements that came before it.

A typical sequence-to-sequence model has two parts: an encoder and a decoder. The two parts are effectively separate neural networks combined into one larger network.

Broadly, the task of the encoder network is to understand the input sequence and create a lower-dimensional representation of it. This representation is then passed to the decoder network, which generates a sequence of its own as the output.
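To make the encoder/decoder split and the special tokens more concrete, here is a minimal sketch of how one word pair could be turned into the three integer sequences that a character-level seq2seq model is typically trained on. It uses plain Python rather than our TensorFlow code. The token characters, the helper names (`build_vocab`, `encode`, `make_example`) and the two word pairs are assumptions made up for illustration.

```python
# Special tokens marking where the decoder should start and stop.
# These particular characters are an assumption chosen for illustration.
START, END = "\t", "\n"


def build_vocab(words):
    """Map every character seen in `words` to an integer index."""
    chars = sorted(set("".join(words)))
    return {ch: i for i, ch in enumerate(chars)}


def encode(word, vocab):
    """Turn a string into a list of character indices."""
    return [vocab[ch] for ch in word]


# Tiny made-up training pairs: Hinglish (Latin script) -> Hindi (Devanagari).
pairs = [("nahi", "नहीं"), ("aa", "आ")]

src_vocab = build_vocab([src for src, _ in pairs])
tgt_vocab = build_vocab([START + tgt + END for _, tgt in pairs])


def make_example(src, tgt):
    """Build the encoder input, decoder input and decoder target."""
    encoder_input = encode(src, src_vocab)
    # The decoder sees the target shifted right by one step (START first)...
    decoder_input = encode(START + tgt, tgt_vocab)
    # ...and must predict the target shifted left by one step (END last).
    decoder_target = encode(tgt + END, tgt_vocab)
    return encoder_input, decoder_input, decoder_target


enc_in, dec_in, dec_out = make_example("nahi", "नहीं")
```

At every position the decoder target is simply the decoder input one step ahead, which is what "predicting the next element from the previous ones" means in practice.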

We built one seq2seq model to translate Hinglish (Hindi words written in the Latin script) to Hindi, and a second seq2seq model to translate that Hindi to English. Chained together, the two translate Hinglish to English.
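The two-stage idea can be sketched as plain function composition. This is only an illustration: the dictionaries below are hypothetical stand-ins for the two trained models, and the names `chain`, `hinglish_to_hindi` and `hindi_to_english` are made up for this example.

```python
from typing import Callable

Translator = Callable[[str], str]


def chain(first: Translator, second: Translator) -> Translator:
    """Compose two translators so the output of `first` feeds `second`."""
    def pipeline(text: str) -> str:
        return second(first(text))
    return pipeline


# Hypothetical stand-ins for the two trained seq2seq models.
# A real system would run model inference here instead of a dict lookup.
HINGLISH_TO_HINDI = {"nahi": "नहीं", "pani": "पानी"}
HINDI_TO_ENGLISH = {"नहीं": "no", "पानी": "water"}


def hinglish_to_hindi(word: str) -> str:
    # Unknown words are passed through unchanged.
    return HINGLISH_TO_HINDI.get(word.lower(), word)


def hindi_to_english(word: str) -> str:
    # Unknown words are passed through unchanged.
    return HINDI_TO_ENGLISH.get(word, word)


hinglish_to_english = chain(hinglish_to_hindi, hindi_to_english)
```

One consequence of chaining is visible even in this toy version: whatever the first stage gets wrong (here, a word it does not know) is handed to the second stage as-is, so errors carry through to the final output.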

We chose this chained (linear) approach rather than translating Hinglish to English directly. Even so, we faced several issues. The main one was the limited corpus of Hinglish words, which was a major setback for translating uncommon words to Hindi and English. Also, since the corpus consisted of single words rather than sentences, we could not translate Hinglish sentences into proper English or Hindi. Working with TensorFlow was a challenge too: outdated libraries and the reduced usage of tensorflow-core made debugging difficult. Lastly, Hinglish is an informal language with no fixed spelling for any word, which increased the scope for error during translation.

English follows a Subject-Verb-Object (SVO) structure, but Hindi is a Subject-Object-Verb (SOV) language. Hindi is also morphologically richer than English. In general, these divergences make the translation process difficult and error-prone. Furthermore, translating between Hindi and English has some inherent challenges: (1) the lack of articles in Hindi makes the translation imprecise; (2) English words can carry multiple meanings depending on context.

PART 2

The second part of the project was to classify whether a given word is English or Hindi.

To build this classification model we used an LSTM. Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in deep learning. Unlike standard feedforward neural networks, an LSTM has feedback connections. It can process not only single data points (such as images), but also entire sequences of data (such as speech or video).

Our LSTM model has 3 layers, with ReLU and sigmoid as activation functions and binary cross-entropy as the loss function. After training it for 20 epochs we got an accuracy of 83.4%.
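For readers who have not met these terms, the sketch below shows in plain Python what ReLU, sigmoid, binary cross-entropy and accuracy compute. It is not our model code: the labels, the raw scores and the convention that 1 means "Hindi written in Latin script" are made up for illustration.

```python
import math


def relu(x):
    """ReLU activation: negative inputs become zero."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid activation: squashes a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


# Made-up labels (assumed convention: 1 = Hindi in Latin script, 0 = English)
# and made-up raw scores standing in for the output of the final layer.
labels = [1, 0, 1, 0]
scores = [2.0, -1.0, -0.5, 0.3]

probs = [sigmoid(z) for z in scores]
loss = binary_cross_entropy(labels, probs)
acc = accuracy(labels, probs)
```

The loss is small when confident predictions are right and grows quickly when they are wrong, which is why it is a common choice for two-class problems like this one.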

We then deployed the model as a Flask app, exposed through ngrok. You can check out our code here.

Project Owners -

Kopal Sharma (kopalsharma2000@gmail.com)
Sagarika Raje (raje.sagarika@gmail.com)

BTech in Data Science