Breaking Language Barriers: Introducing Tamil LLaMA v0.2 and Its Expansion to Telugu and Malayalam

Abhinand
9 min read · Jan 23, 2024

Image generated using ChatGPT 4 with DALLE

Building upon the success of my recent Tamil LLaMA project, I’ve enhanced the model to be fully bilingual, achieving superior performance over Meta’s LLaMA 2 across nearly all standard benchmarks. Additionally, using this refined approach, I’ve created some of the first-ever Large Language Models (LLMs) for both Telugu and Malayalam.

Introduction:

I’m thrilled to share the latest advancements in my journey with Tamil LLaMA, a project that has garnered an incredible response since its open-source release on November 10, 2023 (GitHub repo). The enthusiasm and subsequent innovations in languages like Hindi, Odia, and Kannada have been truly inspiring.

For those new to this journey, let me take you back to September 2023. Driven by a passion for my mother tongue, Tamil — one of the world’s most ancient languages — I embarked on a mission to adapt the impressive capabilities of LLaMA 2 for Tamil. Inspired by the Chinese LLaMA, I developed 7B and 13B parameter model variants, focusing on expanding vocabulary and continued pretraining. The result was a success, with the model showcasing remarkable language generation capabilities in Tamil. This led to a detailed paper and the decision to open-source the project, aiming to improve collaboration and further adaptations for other Indian languages.

Despite all the positives, I was aware of some shortcomings that were vital to address, and I planned the next iterations soon after the release of the first version. With that in mind, I conducted several experiments, insights from which were posted on LinkedIn; feel free to read about them here. To keep this article focused and accessible, I won’t delve into the technicalities of the new model development. For those interested, a detailed research paper is in the works.

Since the Tamil LLaMA Instruct models were only trained on Tamil instructions, the goal of creating a bilingual model emerged as a clear necessity, addressing a wide range of use-cases. Thus, Tamil LLaMA v0.2 was born, excelling in both English and Tamil. Recognizing the scarcity of advancements in other Dravidian languages, I applied the same methodology to develop the first-ever LLMs (to the best of my knowledge) for Malayalam and Telugu. These models not only inherit the enhancements made to Tamil LLaMA but also represent significant firsts in their respective language domains.

I’d like to extend a heartfelt shoutout to JarvisLabs.ai; the Telugu and Malayalam LLaMA models wouldn’t have been possible without their GPU support. I covered the pretraining expenses of the models myself, and just when I was beginning to hit my limits, JarvisLabs.ai came in and offered their generous support.

The models are published on the Hugging Face Hub. Please use the links below to access them:

If you appreciate this work and would like to support its continued development, consider buying me a coffee. Your support is invaluable and greatly appreciated.

Stages Involved:

There are three stages to adapting the original LLaMA 2 models for Tamil/Telugu/Malayalam.

  1. Pretraining: Because the original LLaMA models have very little vocabulary and training data for these languages, we expand the vocabulary and carry out extended pretraining on the targeted language to improve language generation capabilities.
  2. Fine-tuning: The foundational model obtained from stage one cannot follow human instructions, so we train it on a large number of instruction-response pairs.
  3. Alignment: This step uses techniques such as RLHF and DPO so that the model responds in line with human preferences.

Note that although the models went through the alignment stage, they are still uncensored for the most part.
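For readers unfamiliar with DPO, the alignment objective can be illustrated with a minimal sketch of the standard per-example DPO loss. This is the textbook formulation, not the training code used for these models; the function name, the `beta` default, and the example log-probabilities are assumptions chosen purely for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    beta=0.1 is an illustrative default, not a value from the article.
    """
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Hypothetical log-probabilities for one (chosen, rejected) preference pair.
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-20.0,
                ref_chosen_logp=-15.0, ref_rejected_logp=-15.0)
print(f"DPO loss: {loss:.4f}")
```

The loss equals log 2 (about 0.693) when the policy and the reference model agree, and drops below that whenever the policy favours the chosen response more strongly than the reference does.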

Base Models:

When I was looking for references on making the model bilingual, the obvious one was Sarvam AI’s OpenHathi, released in December 2023. It is a Hindi language model whose approach is similar to Tamil LLaMA’s but differs starkly in the technical details of pretraining. They follow a two-stage approach: first training the model for translation, then training it on a bilingual next-token prediction task, which is interesting.

But I was able to match their English performance on standard benchmarks and also do fairly well on Tamil/Telugu/Malayalam without going down such a complicated route.

The Tamil LLaMA tokenizer was also improved compared to the first version.
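Why the tokenizer vocabulary matters can be shown with a toy comparison. The sketch below is not the actual Tamil LLaMA tokenizer: it uses a simple greedy longest-match tokenizer with UTF-8 byte fallback as a stand-in for LLaMA’s SentencePiece model, and both the tiny base vocabulary and the added tokens are hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with UTF-8 byte fallback (toy stand-in)."""
    tokens = []
    max_len = max((len(t) for t in vocab), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Character not covered by the vocabulary: one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Hypothetical vocabularies: an English-centric base, then the same base
# expanded with two whole-word Tamil tokens.
words = ["தமிழ்", "மொழி"]  # "Tamil", "language"
base_vocab = {"hello", " "}
expanded_vocab = base_vocab | set(words)

text = " ".join(words)
before = tokenize(text, base_vocab)
after = tokenize(text, expanded_vocab)
print(len(before), "tokens before expansion,", len(after), "after")
```

With the base vocabulary every Tamil character falls back to its raw UTF-8 bytes, so the sequence is many times longer than with the expanded vocabulary. Shorter sequences mean more text fits in the context window and generation takes fewer steps.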

We are comparing Indic Language LLMs on English benchmark scores to:

  • Understand the impact of pretraining on the original LLaMA’s capabilities.
  • Analyze their bilingual capabilities.

To assess performance in the regional languages, I worked with native speakers (except for Tamil, my own mother tongue) and GPT-4 to validate the results. There are no standard LLM benchmarks for Indian languages as of now.

I’d like to take a moment to thank Divya Sri for her amazing help in evaluating Telugu LLM results. Her input was invaluable in assessing the model’s performance in Telugu.

Benchmark results

The new models, Tamil LLaMA v0.2, Telugu LLaMA and Malayalam LLaMA, almost match LLaMA 2’s scores on every benchmark. Compared with OpenHathi, the new models marginally outperform it on all benchmarks.

The OpenHathi benchmark scores are based on the results from the Open LLM Leaderboard rather than the scores reported on their website, to ensure reproducibility. I evaluated the new models using LLM-AutoEval’s OpenLLM Leaderboard benchmark option.
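As a rough illustration of how such comparisons are put together, the sketch below averages per-benchmark scores and computes the difference between two models. It assumes the six-benchmark suite the Open LLM Leaderboard used at the time (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K); the helper names are hypothetical, and the scores are placeholders, not results from this article.

```python
BENCHMARKS = ("ARC", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8K")

def leaderboard_average(scores):
    """Unweighted mean over the benchmark suite; fails loudly on missing entries."""
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return sum(scores[name] for name in BENCHMARKS) / len(BENCHMARKS)

def score_deltas(model, baseline):
    """Per-benchmark difference (model minus baseline), rounded to two decimals."""
    return {name: round(model[name] - baseline[name], 2) for name in BENCHMARKS}

# Placeholder numbers for illustration only; NOT real evaluation results.
model_a = {name: 60.0 for name in BENCHMARKS}
model_b = {name: 58.0 for name in BENCHMARKS}
model_b["HellaSwag"] = 61.5

print("average A:", leaderboard_average(model_a))
print("A minus B:", score_deltas(model_a, model_b))
```

Looking at per-benchmark deltas alongside the average helps, because an average can hide a regression on a single benchmark.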

It is also worth comparing our results with other models such as Kan-LLaMA from Tensoic and Ambari from Cognitive-Lab. Both are Kannada LLMs that followed a similar training strategy.

Evaluation results are from Open LLM Leaderboard for other models.

Compared with most of the base models developed to date for Indian languages using a similar approach, the Tamil LLaMA v0.2 approach seems to marginally outperform or match them.

Fine-Tuned Models:

The fine-tuning stage is where a lot of the differences begin to emerge. The goal was to match or better the original LLaMA 2 model’s performance in English while also improving the abilities in Tamil/Telugu/Malayalam.

  • The base models were fine-tuned on an expansive corpus of close to half a million instructions with an equal proportion of the target language and English samples.
  • To improve the regional knowledge of the models, a synthetic dataset consisting of close to 2000 samples specific to the region’s history and culture was created.
  • DPO is performed to further enhance the capabilities.
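The equal-proportion data mix from the first point can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the `balanced_mix` helper, the sample dictionaries, and the fixed seed are all assumptions.

```python
import random

def balanced_mix(target_samples, english_samples, seed=0):
    """Return a shuffled mix with equal numbers of target-language and English samples."""
    rng = random.Random(seed)
    # Downsample the larger pool so both languages contribute equally.
    n = min(len(target_samples), len(english_samples))
    mix = rng.sample(target_samples, n) + rng.sample(english_samples, n)
    rng.shuffle(mix)
    return mix

# Hypothetical instruction pools of unequal size.
tamil = [{"lang": "ta", "instruction": f"ta-{i}"} for i in range(5)]
english = [{"lang": "en", "instruction": f"en-{i}"} for i in range(8)]

mix = balanced_mix(tamil, english)
counts = {lang: sum(s["lang"] == lang for s in mix) for lang in ("ta", "en")}
print(counts)
```

Downsampling the larger pool is only one way to balance the two languages; upsampling the smaller pool would serve the same goal at the cost of repeated samples.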

As a result, the Tamil LLaMA v0.2 model marginally outperforms LLaMA 2 Chat on every single benchmark except Hellaswag.

Radar Chart of Meta’s LLaMA 2 vs Tamil LLaMA v0.2

Upon taking a closer look at the performance of all the models, including Telugu and Malayalam, it is evident that they all match or exceed the original LLaMA 2 model in almost all of the benchmarks.

Comparison with other Indic Chat/Instruct models

Example Outputs:

Some of the examples from Tamil LLaMA v0.2, Telugu LLaMA and Malayalam LLaMA are given below.

The models were loaded in GGUF 8-bit precision for inference in LM Studio.
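For a sense of what 8-bit loading means in memory terms, here is a back-of-the-envelope sketch. It assumes the GGUF `Q8_0` layout, in which each block of 32 weights is stored as 32 one-byte integers plus one 2-byte scale; it also assumes a round 7 billion weights, which is an illustrative figure rather than the exact size of these models.

```python
def q8_0_size_bytes(n_weights, block_size=32, scale_bytes=2):
    """Approximate storage for weights quantized in Q8_0-style blocks."""
    blocks = -(-n_weights // block_size)  # ceiling division
    # Each block holds block_size int8 values plus one scale factor.
    return blocks * (block_size + scale_bytes)

# Assumed round parameter count, for illustration only.
n_weights = 7_000_000_000
size_gb = q8_0_size_bytes(n_weights) / 1e9
print(f"~{size_gb:.1f} GB of quantized weights")
```

Under these assumptions the weights take roughly 7.4 GB, about half of what 16-bit weights would need. The estimate covers weights only and ignores the KV cache and any tensors kept at a different precision.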

Tamil LLaMA v0.2:

English Conversation
Tamil Conversation

Translation of Tamil Conversation:

User: Why is Chennai the capital of Tamil Nadu?

AI: Chennai is the capital of the Indian state of Tamil Nadu and serves as its political and administrative center. It reaches this stage for various reasons.

1. Historical Significance: Chennai originated as a city of Madras founded in 1639 by the British East India Company. The city soon developed into a commercial center, a cultural center, and a trading terminal. But Chennai did not always become the capital of the Indian subcontinent under British rule. It was an important administrative and military center as well as an educational and cultural center.

2. Political influence: Many factors including the political power of Tamil Nadu, its population, economic growth and historical importance may have led to the choice of Chennai as the capital of the state. It houses the state legislature, the governor’s office and various government institutions.

3. Administrative Coordination: The widespread location of central government offices, state executive branches, courts, law enforcement agencies, educational departments and other important government institutions in Chennai serves as a central point of coordination within various regions across Tamil Nadu. This will make the administration of the state easier to manage and implement.

4. Cultural Significance: Chennai is also the cultural hub of India, representing Tamil Nadu. It boasts of Tamil language, literature, art and heritage, adding to its importance as the official cultural center of Tamil Nadu.

5. Economic Impacts: Chennai has many industries which contribute to the overall economic growth of the state. Multinational companies, business centers, IT, financial services, education, healthcare, etc. are involved in various sectors, making it a central location that benefits various industries and sectors in the state.

6. Infrastructural development: Chennai is one of the largest urban areas in India, not only as one of the most important economic centers of the country, but also has key infrastructures that provide transport, communication, energy, healthcare and other essential services. This ensures effective and efficient distribution to the rest of the state.

In conclusion, Chennai has acquired a combination of historical importance, political influence, administrative integration, cultural importance, economic influences and infrastructural development to achieve its status as the capital of Tamil Nadu.

Telugu LLaMA:

English Conversation
Telugu Conversation

Translation of Telugu Conversation:

User: Does the development of artificial intelligence (AI) technology pose a threat to human society?
AI: To assess and address the potential risks of Artificial Intelligence (AI) technology, we need to consider the following factors:

1. Socio-Economic Implications: AI technologies have made significant progress in various fields, but they may also have socio-economic implications. For example, people highly skilled in AI-based applications may be less needed, increasing the unemployment rate. In addition, AI development may lead to rapid changes, which may have social consequences due to job disruptions or the inability to adapt to new skills.

2. Security and Privacy: With the rapid development and expansion of AI technology, it is crucial to ensure its security and privacy. AI programs often work in secret, and the inability or misdirection to control their performance can lead to personal data and privacy breaches.

3. Jobs and Employment: Using AI technologies can automate some jobs, which may eliminate some employees. On the other hand, AI technologies can create opportunities for other jobs, for example, enabling the development of new professions that require training or require advanced skills.

Malayalam LLaMA:

English Conversation
Malayalam Conversation

Translation of Malayalam Conversation:

User: Give 3 reasons why Artificial Intelligence (AI) can improve human life?

AI:

1. Better communication: AIs are mastering communication, resulting in more efficient and effective communication.
2. Improved decision-making ability: AIs can analyze large amounts of data, allowing them to make better decisions.
3. Improved healthcare: AIs can collect and analyze medical information and data, such as diagnoses, treatment recommendations, and drug effects, increasing cure rates and supporting healthcare.

Conclusion:

As we stand at the forefront of AI and language model advancements, the journey with Tamil LLaMA v0.2 and its expansion to Telugu and Malayalam marks an important milestone. These models are not just technological achievements; they are beacons of hope for the preservation and promotion of linguistic diversity in the digital age. By breaking down language barriers, we open doors to unexplored realms of knowledge, culture, and community engagement.

The success of these models is a testament to the power of collaboration, innovation, and shared vision. I extend my deepest gratitude to everyone who has supported and contributed to this journey.

As we continue to explore the vast potential of AI in language processing, let’s remain committed to inclusivity and diversity. The advancements in Tamil, Telugu, and Malayalam LLaMA models are just the beginning. There’s a whole world of languages waiting to be explored and given a voice in the digital universe.

Stay tuned for more updates, and let’s keep pushing the boundaries of what’s possible in AI and language technology. Together, let’s shape a more connected and linguistically diverse world.
