Is Open-source NLP the Future?

About one-third of surveyed tech leaders said their NLP budgets had grown by at least 30% in 2021, and 60% said theirs had grown by at least 10% compared to 2020. Can open-source NLP boost the next generation of start-ups?

The major buzzword in the world of AI over the coming few years is set to be NLP.

You’ve probably made extensive use of NLP already – digital assistants, voice-operated GPS systems, customer service chatbots and speech-to-text dictation software all rely on NLP technology.

NLP, or natural language processing, is, simply put, a branch of artificial intelligence concerned with giving computers the ability to understand spoken words or text in much the same way a human can. The range of applications and potential benefits of this technology is vast.

IBM defines it as a combination of “computational linguistics—rule-based modelling of human language—(and)… statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.”

Why open-source NLP?

Obvious answer: it’s too expensive otherwise!

That is not all, though.

The recent boom in NLP has also driven demand for NLP-as-a-service platforms. In fact, a 2021 survey from AI firms Gradient Flow and John Snow Labs found that almost 60% of tech leaders said their NLP budgets had risen by at least 10% compared to 2020, with about 33% saying their spending had climbed by over 30%.

However, VentureBeat reports:

“Historically, training and deploying these models was beyond the reach of start-ups without substantial capital — not to mention compute resources. But the emergence of open source NLP models, datasets, and infrastructure is democratizing the technology in surprising ways.”

Developing a state-of-the-art language model isn’t easy. Whilst most with the resources for it choose not to make their systems open-source (going for licensing/commercialising instead), the models which are open-sourced are still rather difficult to commercialise given the immense computing power required.

Take Megatron 530B, for example. Created in a collaboration between Microsoft and Nvidia, the model was trained across 560 Nvidia DGX A100 servers hosting 8 Nvidia A100 80GB GPUs each, achieving between 113 and 126 teraflops per GPU. This puts just the cost of training the model in the millions of dollars.
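To give a sense of the scale behind those numbers, here is a rough back-of-the-envelope sketch. It assumes the 113 to 126 teraflops figure is a per-GPU throughput and that it holds uniformly across all 560 servers, which is a simplification made for illustration rather than a reported result. The function name is hypothetical.

```python
# Figures quoted in the article for the Megatron 530B training cluster.
SERVERS = 560
GPUS_PER_SERVER = 8
TFLOPS_PER_GPU_RANGE = (113, 126)  # assumption: sustained throughput per GPU


def cluster_summary(servers, gpus_per_server, tflops_range):
    """Return total GPU count and the aggregate throughput range in petaflops."""
    total_gpus = servers * gpus_per_server
    low_tflops, high_tflops = tflops_range
    # 1 petaflop = 1,000 teraflops
    low_pflops = total_gpus * low_tflops / 1000
    high_pflops = total_gpus * high_tflops / 1000
    return total_gpus, low_pflops, high_pflops


total_gpus, low_pf, high_pf = cluster_summary(
    SERVERS, GPUS_PER_SERVER, TFLOPS_PER_GPU_RANGE
)
print(f"GPUs in the cluster: {total_gpus}")
print(f"Aggregate throughput: {low_pf:.2f} to {high_pf:.2f} petaflops")
```

Under these assumptions the cluster comes to 4,480 GPUs and roughly 500 petaflops or more of aggregate compute, which helps explain why training at this scale is out of reach for most start-ups.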

“Inference – actually running the trained model – is another challenge. Getting inferencing (e.g., sentence autocompletion) time with Megatron 530B down to a half a second requires the equivalent of two $199,000 Nvidia DGX A100 systems. While cloud alternatives might be cheaper, they’re not dramatically so – one estimate pegs the cost of running GPT-3 on a single Amazon Web Services instance at a minimum of $87,000 per year,” according to VentureBeat.

Open-source solutions

Providers such as Cohere, AI21 Labs and OpenAI, which offer access to their models through APIs, are playing an integral role in the democratisation of artificial intelligence. As of March 2021, OpenAI reported that its GPT-3 NLP engine was being actively used by over 300 different apps and by “tens of thousands of developers and producing 4.5 billion words per day.”

Recently, open research efforts such as EleutherAI have helped lower barriers to entry into the world of AI. A ‘grassroots’ collective of AI researchers, it aims to deliver code and datasets for a model similar to GPT-3.

They recently released ‘The Pile’ – “a large, diverse, open-source language modelling data set that consists of many smaller datasets combined together. The objective is to obtain text from as many modalities as possible to ensure that models trained using The Pile will have much broader generalization abilities.”

In June 2021, EleutherAI released GPT-Neo and its successor, GPT-J, under the Apache 2.0 license. GPT-J is a language model trained on Google’s third-generation TPUs that performs nearly on par with the GPT-3 engine.

NLP Cloud – with its five employees – is one of the newest AI start-ups using EleutherAI’s models. According to VentureBeat, founder Julien Salinas said the “idea came to him when he realized that, as a programmer, it was becoming easier to leverage open-source NLP models for business applications but harder to get them to run properly in production.”

Start-ups built on open-source models such as EleutherAI’s could drive the next wave of NLP adoption. Advisory firm Mordor Intelligence forecasts that the NLP market will more than triple its revenue by 2025, as business interest in AI rises.

© 2024 Praxis. All rights reserved.