Dense Embedding

jnnwnn 2021. 10. 13. 14:11

Dense Embedding

Complementary to sparse representations by design

Low-dimensional, dense vectors (length 50-1000)
Individual dimensions do not correspond to specific terms
Most elements are non-zero
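
To make the contrast with sparse representations concrete, here is a minimal sketch. The toy vocabulary, the `sparse_bow` helper, and the random stand-in for an encoder output are all assumptions for illustration; a real dense vector would come from a trained encoder.

```python
import numpy as np

# Hypothetical toy vocabulary: each sparse dimension maps to exactly one term.
vocab = ["dense", "sparse", "vector", "passage", "query", "bert", "term", "index"]

def sparse_bow(text):
    """Bag-of-words counts: one dimension per vocabulary term."""
    tokens = text.lower().split()
    return np.array([tokens.count(term) for term in vocab], dtype=float)

sparse_vec = sparse_bow("dense vector for a passage")

# Stand-in for an encoder output: a random real-valued vector (illustration only).
rng = np.random.default_rng(0)
dense_vec = rng.standard_normal(4)

# Fraction of non-zero elements in each representation.
sparse_nonzero = np.count_nonzero(sparse_vec) / sparse_vec.size
dense_nonzero = np.count_nonzero(dense_vec) / dense_vec.size
```

In the sparse vector only the dimensions of terms that occur in the text are non-zero, while every dimension of the dense vector carries a value and none of them maps to a single term.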

Dense Encoder

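Pretrained language models (PLMs) such as BERT are typically used, though various other neural network architectures are also possible.
The output at the CLS token is used as the embedding.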
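
Taking the CLS output amounts to slicing position 0 from the encoder's final hidden states. The sketch below uses a random array as a stand-in for those hidden states; the shapes and the `cls_pool` name are assumptions, not part of any particular library.

```python
import numpy as np

# Assumed shapes: a batch of 2 sequences, 5 tokens each, hidden size 8.
batch_size, seq_len, hidden_size = 2, 5, 8

# Stand-in for the encoder's final hidden states (random, for illustration only).
rng = np.random.default_rng(0)
last_hidden_state = rng.standard_normal((batch_size, seq_len, hidden_size))

def cls_pool(hidden_states):
    """Return the vector at position 0 ([CLS]) for every sequence in the batch."""
    return hidden_states[:, 0, :]

embeddings = cls_pool(last_hidden_state)  # shape: (batch_size, hidden_size)
```

Each question or passage is thus reduced to a single fixed-size vector, regardless of its length in tokens.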

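Training objective: reduce the distance between the embeddings of a related question and passage (that is, increase their inner product).
Challenge: how do we find related question/passage pairs? → Use existing MRC datasets.
Reduce the dense embedding distance between a related question and passage (positive example), and increase the embedding distance between an unrelated question and passage (negative example).
- Negative samples can be drawn randomly from the corpus, or chosen to be confusing (hard negatives).
- In-batch negatives: the positive passages of the other questions in the same batch serve as negatives.
Objective function: the negative log-likelihood (NLL) loss of the positive passage.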
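
The objective with in-batch negatives can be sketched in NumPy as follows. This is a minimal illustration under assumed shapes, not the implementation of any specific library; the function name `in_batch_nll` and the random embeddings are hypothetical. Row i of the score matrix holds the inner products of question i with every passage in the batch, and the positive passage for question i sits on the diagonal.

```python
import numpy as np

def in_batch_nll(q_emb, p_emb):
    """Mean NLL of the positive passage, using in-batch negatives.

    q_emb: (B, d) question embeddings
    p_emb: (B, d) passage embeddings; p_emb[i] is the positive for q_emb[i]
    """
    scores = q_emb @ p_emb.T                              # (B, B) inner products
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives on the diagonal

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))

# Passages aligned with their questions give a low loss ...
aligned_loss = in_batch_nll(q, q * 3.0)
# ... while misplacing the positives (rolling the rows) gives a high one.
shuffled_loss = in_batch_nll(q, np.roll(q * 3.0, 1, axis=0))
```

Minimizing this loss raises the inner product of each positive pair relative to the other passages in the batch, which is why no separate negative sampling step is needed for this variant.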

Passage Retrieval With Dense Encoder

Inference: embed the passages and the query separately, then rank the passages by how close they are to the query.
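
The ranking step can be sketched as below, assuming inner product as the closeness measure (consistent with the training objective) and a small hand-written set of passage embeddings. The `retrieve` function and all values are hypothetical.

```python
import numpy as np

def retrieve(query_emb, passage_embs, top_k=3):
    """Rank passages by inner product with the query; return (indices, scores)."""
    scores = passage_embs @ query_emb        # (N,) one score per passage
    order = np.argsort(-scores)[:top_k]      # highest score first
    return order, scores[order]

# Hypothetical passage embeddings (N=4 passages, d=3).
passage_embs = np.array([
    [0.1, 0.0, 0.9],
    [0.8, 0.1, 0.1],
    [0.4, 0.4, 0.2],
    [0.0, 1.0, 0.0],
])
query_emb = np.array([1.0, 0.0, 0.0])

top_idx, top_scores = retrieve(query_emb, passage_embs, top_k=2)
```

Because the passages are embedded independently of the query, their embeddings can be computed once ahead of time, leaving only the query encoding and the scoring for query time.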

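Ways to Improve Dense Encoding Performance

Better training methods (DPR, Dense Passage Retrieval): https://arxiv.org/abs/2004.04906
Better encoder models (pretrained models larger and more accurate than BERT)
Better data (more data, preprocessing, etc.)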

https://github.com/danqi/acl2020-openqa-tutorial/blob/master/slides/part5-dense-retriever-e2e-training.pdf
