LLM - Running Llama on an NPU: How Much Faster Is It?

Today I will put an otherwise idle NPU to work and check how much of a performance difference it actually makes.

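An NPU is a separate processing unit optimized for the computations that AI workloads need. If your machine has one, it shows up in the Performance tab of Task Manager. (This post is written for the Intel® NPU.)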

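The main appeal of an NPU is that it eases a long-standing constraint: until now, running AI locally depended heavily on the GPU, and the memory requirements in particular called for an expensive card.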

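An NPU addresses this by taking over the AI computation. Because it ships as part of the CPU package, models of a size suited to an ordinary personal PC can now run locally without leaning on the CPU cores or the GPU, and that is its biggest advantage.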

The model I will test on the NPU is a Llama model (TinyLlama 1.1B Chat).

To use the NPU, it is best to update the Intel NPU driver to the latest version first.

Supported Operating Systems By the Intel® NPU Driver (www.intel.com)

Intel also runs an accelerator program to help developers make good use of the NPU for AI, so joining it is worthwhile.

AI PC Acceleration Program: learn more about the program and become part of Intel's global network (www.intel.co.kr)

With the driver installed, let's build a local chatbot with a Llama model.

The full code is as follows.

import datetime

import torch
from transformers import pipeline, TextStreamer, set_seed
import intel_npu_acceleration_library

# Requires: pip install transformers==4.42.4
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("Loading the model...")
pipe = pipeline(
    "text-generation", model=model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
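print("Compiling the model for NPU...")

# Quantize to int8 and offload to the NPU.
# Comment out this call to keep the model on the CPU for comparison.
pipe.model = intel_npu_acceleration_library.compile(pipe.model, torch.int8)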

streamer = TextStreamer(pipe.tokenizer, skip_special_tokens=True, skip_prompt=True)

set_seed(42)


messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot. You can ask me anything.",
    },
]

print("NPU Chatbot is ready! Please ask a question. Type 'exit' to quit.")
while True:
    query = input("User: ")
    if query.lower() == "exit":
        break
    messages.append({"role": "user", "content": query})

    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    start_time = datetime.datetime.now()
    print("Assistant: ", end="", flush=True)
    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        streamer=streamer,
    )
    end_time = datetime.datetime.now()
    span_time = end_time - start_time
    print(span_time)
    reply = out[0]["generated_text"].split("<|assistant|>")[-1].strip()
    messages.append({"role": "assistant", "content": reply})
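
The last two lines of the loop recover the assistant's reply from the pipeline output, which by default contains the whole prompt followed by the generated text. Here is a minimal sketch of that step on a hand-written string, assuming the `<|system|>` / `<|user|>` / `<|assistant|>` markers that the TinyLlama chat template emits. The helper name `extract_reply` and the sample text are made up for illustration.

```python
def extract_reply(generated_text: str, marker: str = "<|assistant|>") -> str:
    """Return the text after the last assistant marker, with whitespace stripped."""
    return generated_text.split(marker)[-1].strip()


# Hand-written stand-in for out[0]["generated_text"]; not real model output.
sample = (
    "<|system|>\nYou are a friendly chatbot.</s>\n"
    "<|user|>\nWhat is an NPU?</s>\n"
    "<|assistant|>\nAn NPU is a processor built for neural-network math."
)

reply = extract_reply(sample)
print(reply)
```

If the marker is missing, `split` returns the whole string as a single element, so the full text is kept rather than lost.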

Running the chatbot script requires two packages:

pip install transformers==4.42.4
pip install intel-npu-acceleration-library
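
Since the transformers version is pinned, it can help to confirm what is actually installed before running the chatbot. The sketch below uses only the standard library's `importlib.metadata`. The helper names `installed_version` and `matches_pin` are hypothetical and not part of either package.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package: str, pinned: str) -> bool:
    """True only when the package is installed at exactly the pinned version."""
    return installed_version(package) == pinned


for name in ("transformers", "intel-npu-acceleration-library"):
    print(name, installed_version(name))

print("transformers pin OK:", matches_pin("transformers", "4.42.4"))
```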

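Testing the NPU's performance is simple.

Comment out the part of the code above that uses intel-npu-acceleration-library (the compile call) and the model runs on the CPU instead; then compare the elapsed times. On my machine a response took about 18 seconds with the NPU and about 1 minute 20 seconds without it, so the NPU's benefit was easy to feel.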
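
To put a number on that comparison, the two timings can be turned into a speedup ratio. This is a rough sketch: `time_call` and `speedup` are hypothetical helpers, and the 80 and 18 second figures are the approximate single-run timings quoted above, not a fresh benchmark.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


cpu_seconds = 80.0  # about 1 min 20 s without the NPU
npu_seconds = 18.0  # about 18 s with the NPU
ratio = speedup(cpu_seconds, npu_seconds)
print(f"NPU speedup: {ratio:.1f}x")  # prints "NPU speedup: 4.4x"
```

Because the chatbot samples its output (`do_sample=True`), responses can differ in length between runs. Comparing the same prompt, or dividing elapsed time by the number of generated tokens, gives a fairer picture than a single wall-clock reading.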

Wrapping Up

With NPUs enabling local AI bots, better image processing, and more, I expect this market to grow, and watching how it develops should be fun.
