    Title: Analyzing the Writing Styles of Taiwan’s Four Major Newspapers with Text Mining Techniques
    A Case Study of Text Mining on Taiwan’s Newspapers
    Authors: 葉昱廷
    Ye, Yu-Ting
    Contributors: 余清祥

    Yue, Ching-Syang
    Cheng, Wen-Huei

    Ye, Yu-Ting
    Keywords: Writing Style
    Similarity Index
    Taiwan’s Newspaper
    Exploratory Data Analysis
    Social Network
    Date: 2019
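    Abstract: Much as each author has a distinct writing style, newspapers often differ noticeably in how they report the same topic, owing to differences in angle, word choice, and narrative structure, so the source of an article can often be identified from the text alone. This study takes newspaper reporting as its subject and uses text-mining statistical methods, including similarity indices and multivariate analysis, to compare the writing styles of Taiwan’s four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times), considering only word frequencies and not word meanings. The data cover 2012 to 2018. To avoid interference from differences in subject matter, the analysis is based on each newspaper’s daily front-page headline story. Because of restrictions on data download, headline titles are available for all four newspapers, but the comparison of article content covers only Apple Daily and Liberty Times.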
    Like an author’s writing style, each newspaper has its own viewpoint and narrative approach, and its articles can often be identified simply by reading them. In this study, we explore and compare the news reporting styles of Taiwan’s four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times). We analyze front-page headline news in order to limit the influence of nuisance factors, such as differences in political positions and target audience. The headlines considered were published between 2012 and 2018. Headline titles could be downloaded for all four newspapers, but headline content is available only for Apple Daily and Liberty Times.
    We first applied Exploratory Data Analysis (EDA), computing similarity indices such as the Jaccard and Yue indices on word frequencies and word types, to evaluate the similarities among the four newspapers. We also considered multivariate tools, including t-SNE (t-distributed Stochastic Neighbor Embedding), GAP (Generalized Association Plots), Cluster Analysis, and Neural Network. We fed the similarity indices into these multivariate tools to visualize the differences among the newspapers and to classify observations into groups.
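    As a rough illustration of the two similarity indices named above, the following minimal sketch computes a Jaccard index on word types and a frequency-based index on relative word frequencies. It assumes that the "Yue index" refers to the Yue-Clayton similarity measure, θ = Σ p_i q_i / (Σ p_i² + Σ q_i² − Σ p_i q_i); the record itself does not give the formula. The function names and the toy token lists are hypothetical, and Chinese text is assumed to have been segmented into words beforehand.

```python
from collections import Counter


def jaccard_index(tokens_a, tokens_b):
    """Jaccard index on word types: |A ∩ B| / |A ∪ B|."""
    types_a, types_b = set(tokens_a), set(tokens_b)
    union = types_a | types_b
    if not union:
        return 0.0
    return len(types_a & types_b) / len(union)


def yue_clayton_index(tokens_a, tokens_b):
    """Yue-Clayton similarity on relative word frequencies (assumed form)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    # Relative frequency of each word within its own corpus.
    p = {w: c / total_a for w, c in counts_a.items()}
    q = {w: c / total_b for w, c in counts_b.items()}
    # Only words shared by both corpora contribute to the cross term.
    cross = sum(p[w] * q[w] for w in p.keys() & q.keys())
    sq_a = sum(v * v for v in p.values())
    sq_b = sum(v * v for v in q.values())
    return cross / (sq_a + sq_b - cross)


# Hypothetical, already-segmented headline tokens for two newspapers.
paper_a = ["election", "election", "budget", "court"]
paper_b = ["election", "crime", "crime", "court"]

print(jaccard_index(paper_a, paper_b))      # shares 2 of 4 word types
print(yue_clayton_index(paper_a, paper_b))  # also weights by frequency
```

    The Jaccard index looks only at which word types occur, while the frequency-based index also reflects how often each word is used, which is one plausible reason the two indices could group the same headlines differently.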
    For both headline titles and contents, the results show significant differences in word usage among the four newspapers. However, the groupings of titles and contents based on the similarity indices are quite different. For the headline titles, the Jaccard indices grouped titles by time, whereas the Yue indices grouped titles by media outlet (i.e., three groups). For the headline contents, the words used in Apple Daily and Liberty Times can be classified into five or six classes of topics, with Liberty Times emphasizing political terms and Apple Daily focusing on social affairs and crime. We also applied machine learning methods with cross-validation to distinguish headline articles of Apple Daily from those of Liberty Times, treating the 2012-2017 data as the training set and the 2018 data as the testing set. Support Vector Machine (SVM) achieved 95.35% prediction accuracy with 3,316 variables.
