    Please use this identifier to cite or link to this item: http://nccur.lib.nccu.edu.tw/handle/140.119/119087

    Title: 運用資料探勘及支持向量機建立運動新聞媒體分類器
    Using Exploratory Data Analysis and Support Vector Machine to Build Media Classifiers on Sport News
    Authors: 褚承威
    Chu, Cheng-Wei
    Contributors: 薛慧敏
    Chu, Cheng-Wei
    Keywords: 體育新聞
    Sports news
    Feature selection
    Support vector machine
    Text categorization
    Date: 2018
    Issue Date: 2018-07-31 13:44:58 (UTC+8)
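    Abstract: News is the reporting of recently occurring events and presents the actual state of an issue, event, or process at the time. Newspapers were formerly the medium for spreading news, but with the rapid growth of the Internet and changes in public habits, print newspapers have evolved into online news. Online news content includes text, images, and even audio and video, and each media outlet has its own usage habits. Past studies compared how the use of news content differs across media outlets and then identified the outlet manually. This study instead uses exploratory data analysis (EDA) and the TF-IDF (term frequency-inverse document frequency) keyword screening method to select text and non-text variables, and uses the selected variables to build support vector machine (SVM) media classifiers. In building the classifiers, we find that non-text variables alone already yield high accuracy, with image specifications being relatively important variables. When only text variables are considered, a small number of text variables is enough to build an excellent classifier.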
    News is a report that describes the state of a problem, event, or process at a given time. In the past, newspapers were the most common medium for spreading news. As the Internet and social media have grown rapidly, people's habits have changed, and a majority of people now prefer to read digital news rather than printed news. This study aims to develop a classifier that predicts the newspaper publisher of a digital news article. More than four thousand sports news articles published in December 2017 by the four major Taiwanese newspapers (United Daily News, Apple Daily, China Times, and Liberty Times) are collected as training data. An item of digital news commonly consists of a title, text content, and photos. Hence, the first and essential step of the analysis is to quantify input variables (features) from the available information in each article. Moreover, to explore the routine practices of each newspaper and to improve computational efficiency, an initial exploratory data analysis (EDA) is conducted on the input variables, and the relatively important variables are selected for classifier development. For the text data, term frequency-inverse document frequency (TF-IDF) is applied as the keyword selection method. The selected variables are then used to build newspaper classifiers with a support vector machine (SVM). We find that a simple classifier based on 19 non-text input variables can achieve high accuracy; among these variables, the image dimensions are the most critical. On the other hand, when only text information is considered, a small number of text variables suffices to produce excellent classification results.
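The TF-IDF keyword selection step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not say how TF-IDF scores are aggregated or how many keywords are kept, so summing scores per publisher and keeping the top `k` terms are assumptions made here. The function names, the toy articles, and the publisher labels "A" and "B" are all hypothetical, and the articles are assumed to be already tokenized (Chinese text would first need word segmentation).

```python
import math
from collections import Counter, defaultdict


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf = relative frequency in the document, idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords_per_class(docs, labels, k=2):
    """Sum tf-idf per class (an assumed aggregation) and keep the top k terms."""
    per_class = defaultdict(Counter)
    for doc_scores, label in zip(tfidf_scores(docs), labels):
        per_class[label].update(doc_scores)
    return {
        label: [term for term, _ in totals.most_common(k)]
        for label, totals in per_class.items()
    }


# Hypothetical, already-tokenized toy articles from two made-up publishers.
docs = [
    ["pitcher", "wins", "game", "photo"],
    ["pitcher", "strikeout", "game"],
    ["guard", "scores", "game", "video"],
    ["guard", "rebound", "game"],
]
labels = ["A", "A", "B", "B"]

keywords = top_keywords_per_class(docs, labels, k=2)
```

In this toy data the term "game" occurs in every article, so its inverse document frequency is zero and it is never selected. The selected terms would then serve as the text input variables for a classifier such as an SVM.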
    Reference: Chinese-language section
    1. Cortes C., & Vapnik V. (1995), Support vector networks, Machine Learning, Boston, Kluwer Academic, 273-297.
    2. Cristianini N., & Shawe-Taylor J. (2010), Kernel-Induced Feature Spaces, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, New York, Cambridge University, 27-37.
    3. Joachims T. (1998), Text Categorization with Support Vector Machines: Learning with Many Relevant Features, University Dortmund, Dortmund, Germany.
    Description: Master's
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0105354020
    Data Type: thesis
    DOI: 10.6814/THE.NCCU.STAT.014.2018.B03
    Appears in Collections: [Department of Statistics] Theses

    Files in This Item:

    File          Size      Format
    402001.pdf    2013Kb    Adobe PDF

    All items in 政大典藏 are protected by copyright, with all rights reserved.
