How to Access Llama3 with Flask?


The world of AI just got a whole lot more exciting with the release of Llama3! This powerful open-source language model, created by Meta, is shaking things up. Llama3, available in 8B and 70B pretrained and instruction-tuned variants, offers a wide range of applications. In this guide, we will explore the capabilities of Llama3 and how to access Llama3 with Flask, focusing on its potential to revolutionize Generative AI.

Learning Objectives

  • Explore the architecture and training methodologies behind Llama3, uncovering its innovative pretraining data and fine-tuning techniques, essential for understanding its exceptional performance.
  • Experience hands-on implementation of Llama3 through Flask, mastering the art of text generation using transformers while gaining insights into the critical aspects of safety testing and tuning.
  • Analyze the impressive capabilities of Llama3, including its enhanced accuracy, adaptability, and robust scalability, while also recognizing its limitations and potential risks, crucial for responsible use and development.
  • Engage with real-world examples and use cases of Llama3, empowering you to leverage its power effectively in diverse applications and scenarios, thereby unlocking its full potential in the realm of Generative AI.

This article was published as a part of the Data Science Blogathon.

Llama3 Architecture and Training

Llama3 is an auto-regressive language model built on an optimized transformer architecture: the standard transformer design with refinements. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. The model was pretrained on over 15 trillion tokens from publicly available sources, with a knowledge cutoff of March 2023 for the 8B model and December 2023 for the 70B model. The fine-tuning data includes publicly available instruction datasets as well as over 10 million human-annotated examples.

Llama3 Impressive Capabilities

As we previously noted, Llama3 has an optimized transformer design and comes in two sizes, 8B and 70B parameters, in both pre-trained and instruction-tuned versions. The tokenizer of the model has a 128K token vocabulary. Sequences of 8,192 tokens were used to train the models. Llama3 has proven to be remarkably capable of the following:

  • Enhanced accuracy: Llama3 has shown improved performance on various natural language processing tasks.
  • Adaptability: The model’s ability to adapt to diverse contexts and tasks makes it an ideal choice for a wide range of applications.
  • Robust scalability: Llama3’s scalability enables it to handle large volumes of data and complex tasks with ease.
  • Coding capabilities: Llama3 is widely regarded as a strong code generator. Generation speed depends on the serving hardware: throughput of 250+ tokens per second has been reported on LPUs (language processing units), which makes them an efficient alternative to GPUs for running large language models.

The most significant advantage of Llama3 is its open-source and free nature, making it accessible to developers without breaking the bank.
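
Returning to the 8,192-token training sequence length mentioned above: in practice the prompt and the generated tokens share that window. The following is a minimal sketch of a budget check under that assumption. The names `CONTEXT_LENGTH`, `fits_in_context`, and `max_new_tokens_allowed` are hypothetical helpers, not part of any library, and the prompt token count would normally come from the model's tokenizer.

```python
CONTEXT_LENGTH = 8192  # training sequence length quoted in this article


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_length: int = CONTEXT_LENGTH) -> bool:
    """Return True if the prompt plus the requested generation fits in the window."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context_length


def max_new_tokens_allowed(prompt_tokens: int,
                           context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after the prompt."""
    return max(0, context_length - prompt_tokens)
```

A check like this could run before each generation call so that overly long prompts are rejected or trimmed early.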

Llama3 Variants and Features

As mentioned earlier, Llama3 comes in two major variants, each available in the 8B and 70B sizes and each suited to different use cases:

  • Pre-trained models: Suitable for natural language generation tasks. These are general-purpose base models that have not been tuned for dialogue.
  • Instruction-tuned models: Optimized for dialogue use cases, outperforming many open-source chat models on industry benchmarks.

Llama3 Training Data and Benchmarks

As noted earlier, Llama3 was pre-trained on over 15 trillion tokens of publicly available data, with a knowledge cutoff of March 2023 for the 8B model and December 2023 for the 70B model. The fine-tuning data includes publicly available instruction datasets and over 10 million human-annotated examples (you heard that right!). The model has achieved impressive results on standard automatic benchmarks, including MMLU, AGIEval English, CommonSenseQA, and more.


Llama3 Use Cases and Examples

Llama3 can be used like the other Llama family models, which makes it easy to adopt. We only need to install transformers and accelerate. We will see a wrapper script in this section. You can find the entire code snippets and the notebook to run with GPU here. I have added the notebook, a Flask app, and an interactive mode script to test the behavior of the model. The Flask app in the next section uses Llama3 through the transformers pipeline.

How to Access Llama3 with Flask?

Let us now explore the steps to access Llama3 with Flask.

Step 1: Set up Python Environment

Create a virtual environment (optional but recommended):

$ python -m venv env
$ source env/bin/activate   # On Windows use `.\env\Scripts\activate`

Install necessary packages:

We need transformers and accelerate, but since Llama3 is new, we install transformers directly from GitHub.

(env) $ pip install -q git+
(env) $ pip install -q flask transformers torch accelerate # datasets peft bitsandbytes
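
Before moving on, it can help to confirm that the packages installed above are importable from the active environment. The snippet below is a small sketch using only the standard library. `REQUIRED` and `missing_packages` are hypothetical names chosen for illustration, and the package list simply mirrors the install command.

```python
import importlib.util

# Top-level module names for the packages installed in this step
REQUIRED = ("flask", "transformers", "torch", "accelerate")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that each module can be located. It does not verify versions or GPU support.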

Step 2: Prepare Main Application File

Create a new Python file and paste the following code into it.

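from flask import Flask, request, jsonify
import transformers
import torch

app = Flask(__name__)

# Initialize the model and pipeline once at import time to avoid reloading on every request
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",  # place the model on the available device(s); needs accelerate
)

@app.route('/generate', methods=['POST'])
def generate():
    # silent=True yields None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    user_message = data.get('message')

    if not user_message:
        return jsonify({'error': 'No message provided.'}), 400

    # Create system message
    messages = [{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"}]

    # Add user message
    messages.append({"role": "user", "content": user_message})

    # Render the conversation into one prompt string in the Llama3 chat format
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # Stop on the end-of-sequence token or on Llama3's end-of-turn token
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    # Sampling settings follow the example on the Llama3 model card
    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )

    # The pipeline returns prompt + completion, so slice off the prompt
    generated_text = outputs[0]['generated_text'][len(prompt):].strip()
    response = {
        'message': generated_text
    }

    return jsonify(response), 200

if __name__ == '__main__':
    # Assumed minimal entry point; port 5000 matches the URL used when running the app
    app.run(port=5000)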
The above code initializes a Flask web server with a single route, /generate, responsible for receiving and processing user messages and returning AI-generated responses.
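
To make the role of `apply_chat_template` more concrete, here is a rough pure-Python illustration of the prompt string that the Llama3 chat format produces. It is only a sketch: `render_llama3_prompt` is a hypothetical helper written for this article, and the tokenizer's own template remains the authoritative source. Each turn ends with `<|eot_id|>`, which is why that token is a natural stop condition during generation.

```python
BOS = "<|begin_of_text|>"


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama3 chat format for a list of role/content dicts."""
    parts = []
    for index, message in enumerate(messages):
        turn = (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
        # The beginning-of-text token is emitted once, before the first turn
        parts.append(BOS + turn if index == 0 else turn)
    if add_generation_prompt:
        # An open assistant header cues the model to write the reply
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Printing the result for a system and user message shows the header and end-of-turn markers that surround each message.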

Step 3: Run Flask Application

Run the Flask app by executing the following command:

(env) $ export
(env) $ flask run --port=5000

Now, you should have the Flask app running at http://localhost:5000. You can test the API with tools like Postman or curl, or even write a simple HTML frontend page.
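
As an alternative to Postman or curl, the endpoint can also be called from Python. The sketch below uses only the standard library and assumes the app is listening on http://localhost:5000 and that the /generate route accepts a JSON body with a `message` field and returns a JSON object with a `message` field, as in the code above. `build_generate_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(message, base_url="http://localhost:5000"):
    """Build a POST request for the /generate route with a JSON body."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message, base_url="http://localhost:5000", timeout=120):
    """Send the message to the running Flask app and return the reply text."""
    request = build_generate_request(message, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["message"]
```

With the server running, `print(ask("Who are you?"))` should print the model's reply. The generous timeout is there because generation can take a while on modest hardware.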

Interactive Mode Using Transformers AutoModelForCausalLM

To interactively query the model within Jupyter Notebook, paste this in a cell and run:

import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = 'meta-llama/Meta-Llama-3-8B-Instruct'

class InteractivePirateChatbot:
    def __init__(self):
        self._tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
        self._tokenizer.pad_token = self._tokenizer.eos_token
        self._model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto", offload_buffers=True)

    def _prepare_inputs(self, messages):
        try:
            # Apply the chat template so the system and user turns reach the model as one conversation
            inputs = self._tokenizer.apply_chat_template(messages, add_generation_prompt=True, truncation=True, max_length=512, return_tensors="pt", return_dict=True)
            input_ids = inputs["input_ids"].to(self._model.device)
            attention_mask = inputs["attention_mask"].to(self._model.device)
            return {'input_ids': input_ids, 'attention_mask': attention_mask}
        except Exception as e:
            print(f"Error preparing inputs: {e}")
            return None

    def ask(self, question):
        try:
            messages = [
                {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
                {"role": "user", "content": question}
            ]

            prepared_data = self._prepare_inputs(messages)
            if prepared_data is None:
                print("Error preparing inputs. Skipping...")
                return

            output = self._model.generate(**prepared_data, max_length=512, num_beams=5, early_stopping=True)

            # Decode only the newly generated tokens, not the prompt that precedes them
            prompt_length = prepared_data['input_ids'].shape[-1]
            answer = self._tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
            print("Pirate:", answer)
        except Exception as e:
            print(f"Error generating response: {e}")

generator = InteractivePirateChatbot()
while True:
    question = input("User: ")
    generator.ask(question)

The above code will allow you to quickly interact and see how the model works. Find the entire code here.

User: "Who are you?"

Pirate: "Arrrr, me hearty! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas! I be here to swab yer decks with me clever responses and me trusty parrot, Polly, perched on me shoulder. So hoist the colors, me matey, and let's set sail fer a swashbucklin' good time!"

Now that we have seen how the model works, let’s look at the safety and responsibility guidance.

Responsibility and Safety

Meta has taken a series of steps to ensure responsible AI development, including implementing safety best practices, providing resources like Meta Llama Guard 2 and Code Shield safeguards, and updating the Responsible Use Guide. Developers are encouraged to tune and deploy these safeguards according to their needs, weighing the benefits of alignment and helpfulness for their specific use case and audience. All these links are available in the Hugging Face repository for Llama3.

Ethical Considerations and Limitations

While Llama3 is a powerful tool, it’s essential to acknowledge its limitations and potential risks. The model may produce inaccurate, biased, or objectionable responses to user prompts. Therefore, developers should perform safety testing and tuning tailored to their specific applications of the model. Meta recommends incorporating Purple Llama solutions into workflows, specifically Llama Guard, which provides a base model to filter input and output prompts to layer system-level safety on top of model-level safety.
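
The idea of layering system-level safety on top of model-level safety can be sketched as a thin wrapper that checks both the user input and the model output. The code below is only an illustration under stated assumptions: `guarded_generate`, `REFUSAL`, and the `is_safe` callable are hypothetical, and `is_safe` stands in for whatever classifier is used (for example a Llama Guard style model). It does not show the actual Llama Guard API.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that request."


def guarded_generate(
    user_message: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """Filter the input, call the model, then filter the output."""
    # Input filter: refuse before spending any compute on generation
    if not is_safe(user_message):
        return REFUSAL
    reply = generate(user_message)
    # Output filter: the model may still produce objectionable text
    if not is_safe(reply):
        return REFUSAL
    return reply
```

In the Flask app, `generate` would wrap the pipeline call, so the route handler stays unchanged apart from passing requests through this wrapper.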


Meta has reshaped the landscape of artificial intelligence with the introduction of Llama3, a potent open-source language model. With its availability in both 8B and 70B pretrained and instruction-tuned versions, Llama3 presents a multitude of possibilities for innovation. This guide has explored Llama3’s capabilities and shown how to access Llama3 with Flask, emphasizing its potential to redefine Generative AI.

Key Takeaways

  • Meta developed Llama3, a powerful open-source language model available in both 8B and 70B pretrained and instruction-tuned versions.
  • Llama3 has demonstrated impressive capabilities, including enhanced accuracy, adaptability, and robust scalability.
  • The model is open-source and completely free, making it accessible to developers and low-budget researchers.
  • Users can utilize Llama3 with transformers, leveraging the pipeline abstraction or Auto classes with the generate() function.
  • Llama3 and Flask enable developers to explore new horizons in Generative AI, fostering innovative solutions like chatbots and content generation, pushing human-machine interaction boundaries.

Frequently Asked Questions

Q1. What is Llama3?

A. Meta developed Llama3, a powerful open-source language model available in both 8B and 70B pre-trained and instruction-tuned versions.

Q2. What are the key features of Llama3?

A. Llama3 has demonstrated impressive capabilities, including enhanced accuracy, adaptability, and robust scalability. Research and tests have shown that it delivers more relevant and context-aware responses, ensuring that each solution is finely tuned to the user’s needs.

Q3. Is Llama3 open-source and free and can I use Llama3 for commercial purposes?

A. Yes, Llama3 is open-source and free to use, including for commercial purposes, which makes it accessible to developers without breaking the bank. However, we recommend reviewing the licensing terms and conditions to ensure compliance with any applicable regulations.

Q4. Can I fine-tune Llama3 for my specific use case?

A. Yes, Llama3 can be fine-tuned for specific use cases by training it further on task-specific data and adjusting the training hyperparameters. This can help improve the model’s performance on specific tasks and datasets.

Q5. How does Llama3 compare to other language models like BERT and RoBERTa?

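A. Llama3 is a much larger, more recent model trained on a far larger dataset. It is a decoder-only model built for text generation, whereas BERT and RoBERTa are encoder-only models aimed at language understanding tasks such as classification. Llama3 outperforms BERT and RoBERTa in various natural language processing tasks, particularly those that involve generating text.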

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.