Hugging Face Library

NLP/Week 2 2021. 9. 17. 18:06
Installation

!pip install transformers

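Using the Pipeline API

1) What is a pipeline?

A pipeline is the most basic object in the library. It selects the most suitable pretrained model for the given task, and that model is downloaded and cached when the object (for example, a classifier) is created.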

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]
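
As a rough sketch of the last step a classification pipeline hides, assuming the model emits one raw score (logit) per label: the scores are turned into probabilities with a softmax and the most probable label is returned. The logits, the `id2label` mapping, and the `postprocess` helper below are hypothetical and only illustrate the idea; they are not the library's internals.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Return the most probable label in the same shape the pipeline prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


# Hypothetical logits for one sentence; a real model would compute these.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess([-1.5, 1.7], id2label)
print(result)
```

The pipeline also runs tokenization and the model's forward pass before this step, so this covers only the final conversion from scores to a label.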

Pipelines are available for tasks such as the following.

feature-extraction (get the vector representation of a text)

fill-mask

ner (named entity recognition)

question-answering

sentiment-analysis

summarization

text-generation

translation

zero-shot-classification

2) Choosing a specific model

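Instead of the default model, you can choose the model the pipeline uses. Models can be searched for on the Model Hub. To prepare for the 2021 National Institute of Korean Language AI Language Proficiency Evaluation, let's try SKT's ko-gpt-trinity-1.2B-v0.5.

Searching the Model Hub for a model

After downloading the model and generating some text, we can see that the output is fairly good, as shown below.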

generator = pipeline("text-generation", model="skt/ko-gpt-trinity-1.2B-v0.5")
generator(
    "강한 햇빛을 많이 쬐면",
    max_length=50,
    num_return_sequences=1,
)

[{'generated_text': '강한 햇빛을 많이 쬐면 피부암에 걸릴 위험이 높아진다는 연구결과가 나왔다.\n 미국 하버드대 의대 연구진은 미국암학회(AACR) 연례회의에서 발표한 연구보고서에서 햇빛에 노출된 쥐는 그렇지 않은 쥐에'}]

Using Transformers

1) Preprocessing

For the model to understand text, the text first has to be preprocessed. This is done with a tokenizer.

The tokenizer:

splits the input into words, subwords, or symbols, which are called tokens

maps each token to an integer

adds extra inputs that may be useful to the model
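
The three steps above can be sketched with a toy tokenizer. This is a minimal illustration under stated assumptions, not how the library's tokenizers are implemented: the vocabulary, the special tokens `<s>` and `</s>`, and the helper names are all hypothetical, and real tokenizers use a learned subword vocabulary.

```python
# Toy vocabulary (hypothetical); a real tokenizer ships a learned vocabulary.
VOCAB = {"<unk>": 0, "<s>": 1, "</s>": 2, "i": 3, "am": 4, "hungry": 5, ".": 6}


def split(text):
    """Step 1: split the input into tokens (lowercased words and periods)."""
    return text.lower().replace(".", " .").split()


def to_ids(tokens):
    """Step 2: map each token to an integer, falling back to <unk>."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]


def encode(text):
    """Step 3: add extra inputs (special tokens and an attention mask)."""
    ids = [VOCAB["<s>"]] + to_ids(split(text)) + [VOCAB["</s>"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


encoded = encode("I am hungry.")
print(encoded)
```

Whether special tokens are added, and which ones, depends on the checkpoint, which is one reason the tokenizer has to match the one used in pretraining.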

The text has to be tokenized in the same way as when the model was pretrained, which the AutoTokenizer class takes care of.

from transformers import AutoTokenizer

checkpoint = "skt/ko-gpt-trinity-1.2B-v0.5"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = ["나는 배가 고프다.", "인공지능 공부 재미있다."]
output = tokenizer(inputs)
# For PyTorch tensors, pad to a common length and use return_tensors (plural):
# output = tokenizer(inputs, padding=True, return_tensors="pt")
output

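Passing sentences to the tokenizer returns a dictionary that can be used as model input. input_ids holds the ids of the tokens in each sentence, and attention_mask marks which positions are real tokens. The return_tensors argument selects the tensor type to return, e.g. return_tensors='pt' returns PyTorch tensors (in that case the sequences must be brought to the same length with padding or truncation).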
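
To make the padding and truncation requirement concrete, here is a minimal sketch in plain Python of what bringing a batch to a common length involves. The ids are the ones from this post's tokenizer output; the `pad_batch` helper and the pad id of 0 are assumptions for illustration, since the real pad id depends on the tokenizer.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Truncate to max_length (if given), then right-pad to the longest sequence."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[30496, 36775, 30021, 29292, 29975], [43226, 32899, 32579, 34045]]
padded = pad_batch(batch)
print(padded)
```

Once every row has the same length, the batch can be stored as a single rectangular tensor, which is why tensor output needs padding or truncation.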

{ 'input_ids': [[30496, 36775, 30021, 29292, 29975], [43226, 32899, 32579, 34045]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1]] }

tokenizer.tokenize shows how a sentence is split into tokens.

tokenized = tokenizer.tokenize(inputs)
tokenized

['▁나는', '▁배가', '▁고', '프', '다.', '▁인공지능', '▁공부', '▁재미', '있다.']
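
In the output above, the "▁" prefix appears to follow the SentencePiece-style convention of marking the start of a new word, while pieces without it continue the previous word. Under that assumption, a minimal sketch of reversing the split looks like this; `detokenize` is a hypothetical helper, not a library function.

```python
tokens = ['▁나는', '▁배가', '▁고', '프', '다.', '▁인공지능', '▁공부', '▁재미', '있다.']


def detokenize(tokens):
    """Join subword pieces and turn each word-start marker back into a space."""
    return "".join(tokens).replace("▁", " ").strip()


text = detokenize(tokens)
print(text)
```

Note that '고프다.' was split into three pieces ('▁고', '프', '다.'), which is how subword tokenizers handle words that are not in the vocabulary as a whole.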

2) Using the model

from transformers import AutoModel

checkpoint = "skt/ko-gpt-trinity-1.2B-v0.5"
model = AutoModel.from_pretrained(checkpoint)

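# The model expects tensors, so tokenize with padding and return_tensors="pt".
output = tokenizer(inputs, padding=True, return_tensors="pt")
outputs = model(**output)
print(outputs.last_hidden_state.shape)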

Loading the model and passing it the input yields a hidden state of shape (batch, length, hidden_size), as shown below, which can then be processed for a variety of tasks.

torch.Size([2, 5, 1920]) #(batch, length, hidden_size)
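
One common way to process the hidden state is to average the token vectors of each sentence into a single sentence vector while skipping padded positions. The post does not prescribe this step, so the following is only a sketch under that assumption: it uses plain nested lists in place of tensors, a tiny made-up hidden state, and a hypothetical `mean_pool` helper.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of each sentence, ignoring padded positions."""
    pooled = []
    for vectors, mask in zip(hidden_states, attention_mask):
        kept = [v for v, m in zip(vectors, mask) if m == 1]
        dim = len(vectors[0])
        pooled.append([sum(v[d] for v in kept) / len(kept) for d in range(dim)])
    return pooled


# Made-up stand-in for last_hidden_state: batch=2, length=3, hidden_size=2.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors)
```

The result has shape (batch, hidden_size) and could be fed to a small classifier or used for similarity search.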