This article is about 11,177 characters long; reading it takes roughly 37 minutes.
Summarization also rests on another piece of theoretical logic: important sentences usually contain important words, and words that discriminate between documents in a corpus (discriminatory words) are overwhelmingly the important ones. A sentence that contains highly discriminative words is therefore itself important. This yields a very simple measure: compute the TF-IDF (term frequency-inverse document frequency) score of every word, then derive a normalized score for each sentence from the importance of the words it contains. That score can serve as the criterion for selecting sentences for the summary.
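The idea behind the weighting can be sketched in a few lines of pure Python. This is a minimal toy illustration of the classic tf * log(N/df) formula, not the exact variant scikit-learn uses (which adds smoothing and normalization options); the corpus and function name are made up for the example.

```python
import math

# Toy corpus: each "document" is a tokenized sentence.
docs = [
    ["the", "king", "draws", "the", "sword"],
    ["the", "king", "rules", "britain"],
    ["merlin", "raises", "the", "heir"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)           # term frequency within this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    return tf * math.log(N / df)              # rarer across the corpus -> higher weight

# "the" occurs in every document, so idf = log(3/3) = 0 and its score vanishes;
# "sword" occurs in only one document and gets a positive score.
print(tf_idf("the", docs[0]))    # 0.0
print(tf_idf("sword", docs[0]))  # tf = 1/5, idf = log(3)
```

A word common to the whole corpus scores zero no matter how often it appears in one sentence, which is exactly the "discriminatory word" behaviour described above.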
TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a collection or corpus: a word's importance increases in proportion to how often it appears in the document, but is offset by how frequently it appears across the corpus. Variants of TF-IDF weighting are widely used by search engines to score and rank a document's relevance to a user query. Besides TF-IDF, web search engines also apply link-analysis-based ranking methods to decide the order of documents in the results. Rather than running the method on a full article, I practised on a short passage, using the preceding paragraph as the sample text (saved as news.txt):

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the sample text and split it into sentences
with open('news.txt') as f:
    news_content = f.read()
sentences = nltk.sent_tokenize(news_content)

# Build a TF-IDF matrix: one row per sentence, one column per term
vectorizer = TfidfVectorizer(norm='l2', min_df=0, use_idf=True,
                             smooth_idf=False, sublinear_tf=True)
sklearn_binary = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0
print(sklearn_binary.toarray())
```
Output:
```
['accept', 'accepting', 'altria', 'and', 'announce', 'approaches', 'arthur', 'as', 'at', 'be', 'birth', 'britain', 'british', 'by', 'caliburn', 'ceremonial', 'character', 'decides', 'despite', 'destined', 'dies', 'draws', 'ector', 'eligible', 'embedded', 'enters', 'entrusted', 'explaining', 'fearing', 'fifteen', 'following', 'for', 'full', 'gender', 'growing', 'hardships', 'heir', 'her', 'hesitation', 'his', 'however', 'if', 'in', 'inspired', 'invasion', 'is', 'king', 'knight', 'known', 'large', 'leadership', 'leaving', 'legends', 'legitimate', 'loyal', 'mantle', 'merlin', 'monarch', 'name', 'nativity', 'never', 'no', 'not', 'of', 'or', 'pendragon', 'people', 'period', 'preserving', 'publicly', 'pulling', 'raises', 'recognize', 'responsible', 'ruler', 'saber', 'saxons', 'she', 'shoulders', 'sir', 'slab', 'son', 'soon', 'stone', 'subjects', 'surrogate', 'sword', 'symbolic', 'that', 'the', 'this', 'threat', 'throne', 'to', 'turmoil', 'uther', 'welfare', 'when', 'who', 'will', 'withdraws', 'without', 'woman']
```

This is followed by the sentence-by-term TF-IDF matrix: one row per sentence and one column per term above, mostly zeros (for example, in the first row 'arthur' scores 0.31622502 and 'altria' scores 0.15095332; the full numeric dump is omitted here).
Reposted from: http://drwgl.baihongyu.com/