Bengaluru-based Startup Smallstep.ai Launches Misal 7B for Native Maharashtrian Speakers

(Misal 7B is an LLM-based AI model that addresses issues faced by the native Marathi-speaking community. The model draws its name from misal, a spicy Maharashtrian dish made with moth beans.)

Founder Sagar Sarkale said he saw a lack of AI models in his native language, Marathi. With competition growing among large language models (LLMs) for language translation, he decided to build one for Marathi.

He said training the AI models cost around Rs. 50,000-60,000.

“It’s a staple breakfast for many,” explained Smallstep founder Sagar Sarkale. “We chose the name because it’s something familiar and relatable for Marathi speakers.”

Misal is built on top of Meta’s Llama2 model. Smallstep rolled out four versions of the Misal LLM: two Marathi pre-trained LLMs (Misal-7B-base-v0.1 and Misal-1B-base-v0.1) and two Marathi instruction-tuned LLMs (Misal-7B-instruct-v0.1 and Misal-1B-instruct-v0.1).

As per the company, results indicated that Misal-7B outperformed ChatGPT 3.5 in reading comprehension but lagged in sentiment analysis, paraphrasing and translation.

“With a mere 2% of its data representing non-English languages, it’s evident that Llama2 is not optimally fine-tuned for building GenAI applications,” the company said.

The bootstrapped startup adopted a three-step procedure to develop Instruction Tuned Misal models, with similar processes for both the 7-billion and 1-billion parameter versions.

The company said it identified a significant challenge with Meta’s Llama tokenizer: non-English text, including Marathi, requires more tokens to encode than comparable English text.
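One reason token counts can grow for Marathi is byte fallback: Llama2's tokenizer is a SentencePiece model that can break characters missing from its vocabulary into raw UTF-8 bytes, and every Devanagari code point takes three bytes. The sketch below only measures that byte cost. It is an illustration under stated assumptions, not a measurement of the actual Llama2 tokenizer, whose vocabulary may already cover some Devanagari characters. The function name `utf8_bytes_per_char` is hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


english = "misal"
# The same word in Devanagari, written with escapes so the code points are explicit:
# U+092E (ma), U+093F (vowel sign i), U+0938 (sa), U+0933 (lla)
marathi = "\u092e\u093f\u0938\u0933"

# Worst case under byte fallback: one token per byte.
print(len(english), utf8_bytes_per_char(english))  # 5 1.0
print(len(marathi), utf8_bytes_per_char(marathi))  # 4 3.0
```

In this worst case the four-character Marathi word would cost 12 byte-level tokens, while the five-letter Latin spelling costs at most five.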

To improve performance on Marathi text, Smallstep created a custom SentencePiece tokenizer for the language, adding approximately 15,000 new tokens to Llama2’s existing vocabulary of 32,000 tokens.
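The effect of extending a vocabulary can be sketched with a toy tokenizer. The code below is not Smallstep's method and not how SentencePiece segments text; it swaps in a simple greedy longest-match rule with byte fallback purely to show why adding language-specific tokens shortens sequences. The vocabularies, the `tokenize` function and the two Marathi pieces are hypothetical, and the combined size assumes none of the new tokens duplicate existing ones.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No piece matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical base vocabulary containing only Latin pieces.
base_vocab = {"mis", "al", "m", "i", "s", "a", "l"}
# Hypothetical Marathi pieces added on top of the base vocabulary.
extended_vocab = base_vocab | {"\u092e\u093f", "\u0938\u0933"}

word = "\u092e\u093f\u0938\u0933"  # "misal" in Devanagari

print(len(tokenize(word, base_vocab)))      # 12 byte-level tokens
print(len(tokenize(word, extended_vocab)))  # 2 tokens

# Figures from the article: 32,000 existing tokens plus roughly 15,000 new ones.
approx_vocab_size = 32_000 + 15_000
print(approx_vocab_size)  # 47000, assuming no overlap between the two sets
```

Fewer tokens per word means more Marathi text fits in the same context window and each generated word needs fewer decoding steps, which is the usual motivation for this kind of vocabulary extension.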
