Extractive Summarization with TextRank

NLP 2021. 1. 13. 17:31

2021/01/12 - [NLP] - Extractive Summarization with TextRank - 1


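To apply PageRank to text for extractive summarization, we first need to build a sentence graph. Each sentence is a node, and each edge weight is the similarity between the two sentences it connects. Sentence similarity can be computed with cosine similarity, a measure of how similar two documents are regardless of their size.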

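The cosine similarity of two sentence vectors A and B is defined as

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

In code, the cosine similarity between two sentences can be computed as follows.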

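    from nltk.cluster.util import cosine_distance


    def core_cosine_similarity(vector1, vector2):
        """
        Measure the cosine similarity between two vectors.
        :param vector1: word-count vector of the first sentence
        :param vector2: word-count vector of the second sentence
        :return: cosine similarity; 0 <= value <= 1 for non-negative count vectors
        """
        return 1 - cosine_distance(vector1, vector2)


    # Method of the summarizer class (the class definition is not shown in this post).
    def _sentence_similarity(self, sent1, sent2, stopwords=None):
        if stopwords is None:
            stopwords = []

        sent1 = [w.lower() for w in sent1]
        sent2 = [w.lower() for w in sent2]

        all_words = list(set(sent1 + sent2))

        vector1 = [0] * len(all_words)
        vector2 = [0] * len(all_words)

        # build the vector for the first sentence
        for w in sent1:
            if w in stopwords:
                continue
            vector1[all_words.index(w)] += 1

        # build the vector for the second sentence
        for w in sent2:
            if w in stopwords:
                continue
            vector2[all_words.index(w)] += 1

        # a sentence made up only of stopwords has a zero vector, for which the
        # cosine is undefined; treat it as having no similarity
        if not any(vector1) or not any(vector2):
            return 0.0

        return core_cosine_similarity(vector1, vector2)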

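To compute the cosine similarity between sentences, each sentence must first be represented as a vector. Given the two sentences "I have a cat." and "My cat likes tuna.", we take every word that appears in either sentence (I, have, a, cat, My, likes, tuna) and record how often each one occurs in each sentence.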

	I	have	a	cat	My	likes	tuna
s1	1	1	1	1	0	0	0
s2	0	0	0	1	1	1	1

This gives the vectors s1 = [1, 1, 1, 1, 0, 0, 0] and s2 = [0, 0, 0, 1, 1, 1, 1], and all that remains is to compute the cosine similarity between them.
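As a quick check of this example, the following sketch builds the two count vectors and evaluates the cosine similarity by hand. It uses only the standard library. The helper names `count_vector` and `cosine_similarity` are illustrative and not part of the post's code, and the sentences are assumed to be already tokenized with punctuation removed.

```python
import math


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in `tokens`."""
    return [tokens.count(word) for word in vocabulary]


def cosine_similarity(v1, v2):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Assumption: tokens are pre-tokenized, with punctuation already removed.
sent1 = ["I", "have", "a", "cat"]
sent2 = ["My", "cat", "likes", "tuna"]
vocabulary = ["I", "have", "a", "cat", "My", "likes", "tuna"]

s1 = count_vector(sent1, vocabulary)
s2 = count_vector(sent2, vocabulary)

similarity = cosine_similarity(s1, s2)
print(s1, s2, similarity)
```

Only "cat" is shared, so the dot product is 1, and each vector has length 2, which gives 1 / (2 × 2) = 0.25.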

Computing the cosine similarity for every pair of sentences yields the similarity matrix.

    import numpy as np


    def get_symmetric_matrix(matrix):
        """
        Build a symmetric matrix from a triangular one.
        :param matrix: square matrix with only one triangle filled in
        :return: symmetric matrix
        """
        return matrix + matrix.T - np.diag(matrix.diagonal())


    # Method of the summarizer class.
    def _build_similarity_matrix(self, sentences, stopwords=None):
        # create an empty similarity matrix
        sm = np.zeros([len(sentences), len(sentences)])

        # fill only the upper triangle; similarity is symmetric, so the
        # lower triangle is mirrored in below
        for idx1 in range(len(sentences)):
            for idx2 in range(idx1 + 1, len(sentences)):
                sm[idx1][idx2] = self._sentence_similarity(sentences[idx1], sentences[idx2], stopwords=stopwords)

        # get the symmetric matrix
        sm = get_symmetric_matrix(sm)

        # normalize each column so that it sums to 1
        norm = np.sum(sm, axis=0)
        # `out` keeps columns whose sum is 0 at 0 instead of leaving them uninitialized
        sm_norm = np.divide(sm, norm, out=np.zeros_like(sm), where=norm != 0)

        return sm_norm
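To see what the column normalization does, here is a small self-contained sketch on a hand-written 3×3 similarity matrix. The values are made up for illustration. The point is that every column with a non-zero total sums to 1 afterwards, which is the form the PageRank update works with.

```python
import numpy as np

# Hypothetical pairwise similarities for three sentences (symmetric, zero diagonal).
sm = np.array([
    [0.0, 0.5, 0.25],
    [0.5, 0.0, 0.25],
    [0.25, 0.25, 0.0],
])

# Total similarity in each column.
col_sums = sm.sum(axis=0)

# Divide each column by its total; `out` supplies zeros for any column whose total is 0.
sm_norm = np.divide(sm, col_sums, out=np.zeros_like(sm), where=col_sums != 0)

print(sm_norm)
print(sm_norm.sum(axis=0))
```

Passing `out` matters when `where` is used: positions that are skipped keep whatever `out` already holds, so starting from zeros gives a well-defined result for isolated sentences.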

We then apply PageRank, described in the previous post, to the resulting similarity matrix.

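    def __init__(self):
        self.damping = 0.85  # damping coefficient, usually .85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 100  # maximum number of iterations
        self.text_str = None
        self.sentences = None
        self.pr_vector = None

    def _run_page_rank(self, similarity_matrix):
        # start with a score of 1 for every sentence
        pr_vector = np.ones(len(similarity_matrix))

        # iterate until the total score stops changing or the step limit is reached
        # NOTE: convergence is judged on the sum of the scores, not on each score;
        # for a column-normalized matrix this sum changes very little between
        # iterations, so the loop can stop early
        previous_pr = 0
        for epoch in range(self.steps):
            pr_vector = (1 - self.damping) + self.damping * np.matmul(similarity_matrix, pr_vector)
            if abs(previous_pr - sum(pr_vector)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr_vector)

        return pr_vector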
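One thing to watch in this loop is the stopping rule, which compares the sum of the scores. With a column-normalized matrix and an all-ones start, that sum stays at the number of sentences, so the check can pass before the individual scores have settled. A minimal sketch of a stricter variant, which stops when the largest per-sentence change drops below the threshold, might look like this. The function `page_rank`, its parameter names, and the example matrix are illustrative assumptions, not the post's code.

```python
import numpy as np


def page_rank(matrix, damping=0.85, min_diff=1e-5, steps=100):
    """Power iteration on a column-normalized matrix with an element-wise stopping rule."""
    pr_vector = np.ones(len(matrix))
    for _ in range(steps):
        new_pr = (1 - damping) + damping * (matrix @ pr_vector)
        # Stop once no single score moved by more than min_diff.
        converged = np.max(np.abs(new_pr - pr_vector)) < min_diff
        pr_vector = new_pr
        if converged:
            break
    return pr_vector


# Hypothetical column-normalized similarity matrix for three sentences.
matrix = np.array([
    [0.0, 2 / 3, 0.5],
    [2 / 3, 0.0, 0.5],
    [1 / 3, 1 / 3, 0.0],
])

scores = page_rank(matrix)
print(scores)
print(np.argsort(scores)[::-1])  # sentence indices, highest score first
```

Scores produced this way will generally differ from those of the sum-based check, so the two should not be mixed when comparing results.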

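Running the code on the body of the news article at www.koreaherald.com/view.php?ud=20210112000709 produces a PageRank vector with one score per sentence.

Extracting the sentences with the highest scores in this vector gives the extractive summary.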

    def get_top_sentences(self, number=5):
        # normalize_whitespace is a helper defined elsewhere in the code (not shown in this post)
        top_sentences = {}

        if self.pr_vector is not None:
            # sentence indices sorted by PageRank score, highest first
            sorted_pr = list(np.argsort(self.pr_vector))
            sorted_pr.reverse()

            # slicing avoids an IndexError when `number` exceeds the sentence count
            for index, sent_idx in enumerate(sorted_pr[:number]):
                sent = normalize_whitespace(self.sentences[sent_idx])
                top_sentences[index] = sent + " : " + str(self.pr_vector[sent_idx])

        return top_sentences
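The selection step boils down to an `argsort` over the scores. As a sketch with made-up scores and placeholder sentences (the function name `top_sentence_indices` is hypothetical), the following picks the `number` highest-scoring sentences. It also shows one optional variation that the post's code does not include: restoring document order so the summary reads in the order of the original text.

```python
import numpy as np


def top_sentence_indices(pr_vector, number=5, keep_document_order=False):
    """Return the indices of the `number` highest-scoring sentences."""
    # argsort is ascending, so reverse it and cut to the requested size.
    ranked = np.argsort(pr_vector)[::-1][:number]
    if keep_document_order:
        ranked = np.sort(ranked)
    return [int(i) for i in ranked]


# Made-up scores for five placeholder sentences.
scores = np.array([1.2, 0.7, 1.3, 0.9, 1.0])
sentences = ["s0", "s1", "s2", "s3", "s4"]

by_score = top_sentence_indices(scores, number=3)
by_position = top_sentence_indices(scores, number=3, keep_document_order=True)

print([sentences[i] for i in by_score])
print([sentences[i] for i in by_position])
```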

The summary selected the following sentences. (The PageRank score of each sentence is appended at its end.)

<Summarization>
1. “I pay around 10 million won in operation fees for the fitness club, and with the COVID-19 pandemic, I am now left with 1.9 million won in my bank account and 90 million won in bank debt,” said Oh Sung-young, head of a gym owners’ association, in a Facebook post Thursday. : 1.3474517437036213
2. Since last week, many owners of coffee shops, bars and internet cafes have staged protests against the government. : 1.3067347804620488
3. “We are filing this lawsuit out of desperation as the government’s COVID-19 regulations have put our lives on the edge,” said Ko Jang-soo, head of the cafe owners’ association. : 1.2630233945843305
4. Since late November, when Level 2 social distancing rules were imposed in Seoul, Incheon and Gyeonggi Province, coffee shops in the capital region have been barred from providing dine-in services. : 1.2134520315075341
5. In the suit, to be filed with the Seoul Central District Court on Thursday, the group said it would demand around 1 billion won ($908,638), or 5 million won for each cafe owner involved. : 1.18997645490276

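At present, the criteria for judging the accuracy of an extractive summary are ambiguous, which makes it hard to evaluate the implementation. A sensible next step would be to compare these results with those obtained using TF-IDF or other sentence-similarity measures, and with a set of important sentences chosen by hand.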
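One simple way to make the comparison with hand-picked sentences concrete is to treat both selections as sets of sentence indices and measure their overlap. The sketch below is an assumption about how such a check could look, not an established benchmark. The function name `selection_overlap` and the index lists are invented for illustration.

```python
def selection_overlap(predicted, reference):
    """Precision, recall and F1 between two collections of sentence indices."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: indices chosen by TextRank vs. indices picked by a human reader.
textrank_pick = [2, 5, 7, 11, 14]
human_pick = [2, 3, 7, 14]

precision, recall, f1 = selection_overlap(textrank_pick, human_pick)
print(precision, recall, f1)
```

The same function can score any other selection method, for example a TF-IDF based variant, against the same hand-picked reference.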

References

https://medium.com/analytics-vidhya/sentence-extraction-using-textrank-algorithm-7f5c8fd568cd
