1. NLP intro & BOW & Word Embedding

NLP / Week 1

jnnwnn 2021. 9. 6. 14:42

NLP: Natural Language Understanding (NLU) + Natural Language Generation (NLG) (major conferences: ACL, EMNLP, NAACL)

NLP tasks

Low-Level
- Tokenization, Parsing
Word and Phrase Level
- NER, POS tagging, noun-phrase chunking, dependency parsing, coreference resolution
Sentence Level
- Sentiment analysis, Machine Translation
Multi-sentence and paragraph level
- Entailment prediction: predicting whether one sentence logically entails the other
- QA, dialog system, Summarization

Text Mining: Extract useful information and insights from text and document data (major conferences: KDD, WSDM, CIKM, ICWSM, The WebConf)

Document clustering
related to computational social science

Information retrieval: the field that studies search technology (major conferences: SIGIR, WSDM, CIKM, RecSys)

A fairly mature field
Much of the research is on recommender systems

NLP Trend

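Each word is represented as a vector; a text is a sequence of these words
RNN models (LSTM, GRU) are used to handle sequences
Transformer Model
- self-supervised training
- the field has moved toward taking a pretrained model that is not task-specific and adapting it to a task through transfer learning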

Bag-of-Words Representation

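Bag of Words is a way of turning text into numbers that ignores word order entirely and looks only at how often each word occurs (its frequency). Taken literally, the name means a bag of words, so picture a bag with words in it. Take every word in a document you have and put them all in the bag, then shake the bag to mix them up. If a particular word appears N times in the document, the bag holds N copies of that word. And because the bag has been shaken, the order of the words no longer matters.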

[Source] https://wikidocs.net/22650
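
To make the bag picture concrete, here is a minimal Python sketch of a Bag-of-Words count vector. The function names `build_vocab` and `bag_of_words` are hypothetical, and splitting on whitespace after lowercasing is an assumption made only for illustration; a real pipeline would likely use a proper tokenizer.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(doc, vocab):
    """Return a count vector: position i holds how often vocab word i occurs in doc."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = n
    return vector


docs = ["the cat sat on the mat", "the mat sat on the cat"]
vocab = build_vocab(docs)
vectors = [bag_of_words(d, vocab) for d in docs]

print(vocab)    # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(vectors)  # [[2, 1, 1, 1, 1], [2, 1, 1, 1, 1]]
```

The two sentences differ only in word order, so they end up with the same vector. That is the "shaken bag" in action: the counts survive, the order does not.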