Extraction-Based MRC

NLP/MRC 2021. 10. 12. 16:57

Pre-processing

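Tokenization → BPE (byte-pair encoding) is commonly used to address the OOV (out-of-vocabulary) problem
Attention Mask → marks which tokens in the input sequence (e.g. padding) should be ignored during the attention computation
Token Type IDs → when the input consists of two or more sequences (e.g. question and context), assigns an ID to each one so the model can tell them apart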
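
To make the attention mask and token type IDs concrete, here is a minimal sketch that builds them by hand for a question/context pair. It assumes a BERT-style layout ([CLS] question [SEP] context [SEP]) and already-tokenized input; `build_inputs` and its parameter names are hypothetical, not a library API. In practice a tokenizer produces these fields for you.

```python
def build_inputs(question_tokens, context_tokens, max_seq_length, pad_token="[PAD]"):
    """Build BERT-style pair inputs from pre-tokenized question and context (illustrative only)."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]

    # Segment 0: [CLS] + question + first [SEP]; segment 1: context + final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)

    # 1 = real token that attention should look at.
    attention_mask = [1] * len(tokens)

    pad_len = max_seq_length - len(tokens)
    if pad_len < 0:
        raise ValueError("input is longer than max_seq_length")

    # Padding is marked with 0 in the attention mask so it is ignored.
    tokens += [pad_token] * pad_len
    attention_mask += [0] * pad_len
    token_type_ids += [0] * pad_len
    return tokens, attention_mask, token_type_ids


tokens, attention_mask, token_type_ids = build_inputs(
    ["who", "wrote", "it", "?"],
    ["alice", "wrote", "it", "."],
    max_seq_length=12,
)
print(tokens)
print(attention_mask)
print(token_type_ids)
```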

Post-processing

Removing impossible answers
The end position comes before the start position
The predicted position falls outside the context
The answer is longer than the preset max_answer_length
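
The three conditions above can be sketched as a single check. This is a minimal illustration, assuming token indices with inclusive bounds; `is_valid_span`, `context_start`, and `context_end` are hypothetical names chosen for this example.

```python
def is_valid_span(start, end, context_start, context_end, max_answer_length):
    """Return True if (start, end) is a possible answer span (inclusive token indices)."""
    # 1. End position must not come before the start position.
    if end < start:
        return False
    # 2. Both positions must lie inside the context (not in the question or padding).
    if start < context_start or end > context_end:
        return False
    # 3. The span must not be longer than max_answer_length tokens.
    if end - start + 1 > max_answer_length:
        return False
    return True


# Context occupies token positions 6..20 in this made-up example.
print(is_valid_span(8, 10, 6, 20, max_answer_length=30))   # valid span
print(is_valid_span(10, 8, 6, 20, max_answer_length=30))   # end before start
print(is_valid_span(2, 8, 6, 20, max_answer_length=30))    # starts in the question
print(is_valid_span(6, 20, 6, 20, max_answer_length=5))    # too long
```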

Finding the best answer

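Find the N highest-scoring positions in each of the start and end position predictions
Remove impossible start/end combinations
Sort the remaining combinations in descending order of summed score
Select the combination with the highest score as the final prediction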
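
The procedure above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the code of any particular library: the scores are plain lists of floats, indices are inclusive token positions, and `best_span`, `context_start`, and `context_end` are hypothetical names.

```python
def best_span(start_scores, end_scores, context_start, context_end,
              n_best_size=20, max_answer_length=30):
    """Pick the highest-scoring valid (start, end) pair from per-token scores."""
    # Step 1: top-N indices for start and end, by score.
    top_starts = sorted(range(len(start_scores)),
                        key=lambda i: start_scores[i], reverse=True)[:n_best_size]
    top_ends = sorted(range(len(end_scores)),
                      key=lambda i: end_scores[i], reverse=True)[:n_best_size]

    # Step 2: drop impossible combinations.
    candidates = []
    for s in top_starts:
        for e in top_ends:
            if s < context_start or e > context_end:
                continue  # outside the context
            if e < s or e - s + 1 > max_answer_length:
                continue  # reversed or too long
            candidates.append((start_scores[s] + end_scores[e], s, e))

    # Steps 3-4: sort by summed score and return the best one (None if nothing is valid).
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[0] if candidates else None


start_scores = [0.1, 0.2, 5.0, 0.3, 1.0, 0.0]
end_scores = [0.0, 4.0, 0.5, 3.0, 0.2, 0.1]
best = best_span(start_scores, end_scores, context_start=2, context_end=5,
                 n_best_size=3, max_answer_length=3)
print(best)  # (score, start, end)
```

Note that the highest end score in this example sits at index 1, which is outside the context, so the pair built from the two individually best positions is rejected and the next valid combination wins.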

Code

max_seq_length = 384 # maximum length, in tokens, of question + context + special tokens
pad_to_max_length = True
doc_stride = 128 # number of tokens that overlap between chunks when a long context is split
max_train_samples = 16
max_val_samples = 16
preprocessing_num_workers = 4 # number of parallel workers to use for preprocessing
batch_size = 4
num_train_epochs = 2
n_best_size = 20
max_answer_length = 30 # maximum length of a predicted answer
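
To illustrate how max_seq_length and doc_stride interact, here is a rough sketch of splitting a long context into overlapping windows. It assumes, as in the comment above, that doc_stride is the number of tokens shared by consecutive windows. `split_context` and `max_context_len` (the room left for the context after the question and special tokens) are hypothetical names for this example.

```python
def split_context(context_tokens, max_context_len, doc_stride):
    """Split a token list into windows of at most max_context_len tokens,
    where consecutive windows overlap by doc_stride tokens."""
    if doc_stride >= max_context_len:
        raise ValueError("doc_stride must be smaller than max_context_len")

    step = max_context_len - doc_stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + max_context_len])
        if start + max_context_len >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows


windows = split_context(list(range(10)), max_context_len=4, doc_stride=2)
for w in windows:
    print(w)
```

The overlap means an answer that would be cut at one window boundary can still appear whole in the neighboring window.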

http://jalammar.github.io/illustrated-bert/

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)


https://huggingface.co/datasets

Hugging Face – The AI community building the future.
