Study Notes on the NLTK Basics Tutorial (13)
Published: 2019-06-29


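Summarization can also rest on another intuition: an important sentence usually contains important words, and most words that are discriminatory across the corpus are important. A sentence that contains highly discriminatory words is therefore itself important. This leads to a very simple measure: compute the TF-IDF (term frequency-inverse document frequency) score of every word, then derive a normalized average score for each sentence from the importance of its words. That score can serve as the criterion for choosing sentences for the summary.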

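TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus. Variants of TF-IDF weighting are often used by search engines to score or rank how relevant a document is to a user's query. Besides TF-IDF, web search engines also use ranking methods based on link analysis to decide the order in which documents appear in search results.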
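To make the definition above concrete, here is a minimal sketch that computes the plain textbook form, tf(t, d) * ln(N / df(t)), in pure Python. The function name `tf_idf`, the whitespace tokenization, and the toy sentences are my own assumptions for illustration; library implementations such as scikit-learn apply extra smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical sentences, not the article's news.txt).
docs = [
    "the king draws the sword",
    "the knight raises the king",
    "the saxons invade",
]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

A word that occurs in every document, such as "the" here, gets a weight of zero, while a word unique to one document, such as "sword", gets the highest weight in that document. That is the discriminatory-word behavior the summarization idea relies on.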
The book does not run this on a whole introduction and practices on only the first three sentences; I used the first paragraph:

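import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the text to summarize (news.txt in the working directory).
with open('news.txt') as f:
    news_content = f.read()

results = []
# Each sentence is treated as one document.
sentences = nltk.sent_tokenize(news_content)
vectorizer = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True)
# Rows are sentences, columns are vocabulary terms.
sklearn_binary = vectorizer.fit_transform(sentences)
# Written for older scikit-learn; newer releases rename get_feature_names()
# to get_feature_names_out() and may reject min_df=0.
print(vectorizer.get_feature_names())
print(sklearn_binary.toarray())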
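For reference, this is how I understand the weighting that these particular `TfidfVectorizer` settings select, based on the scikit-learn documentation (`sublinear_tf=True`, `smooth_idf=False`, `norm='l2'`). The symbols are my own notation: n is the number of sentences, df(t) the number of sentences containing term t, and tf(t, d) the raw count of t in sentence d.

```latex
\begin{align}
\mathrm{tf}'(t,d) &= 1 + \ln \mathrm{tf}(t,d) \qquad \text{for } \mathrm{tf}(t,d) > 0 \\
\mathrm{idf}(t) &= \ln\frac{n}{\mathrm{df}(t)} + 1 \\
w(t,d) &= \frac{\mathrm{tf}'(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \bigl(\mathrm{tf}'(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
\end{align}
```

As a sanity check against the printed matrix: with n = 8 sentences, 'altria' is nonzero in 5 rows and 'dies' in 1, so within a row where both occur once the ratio of their weights should be (ln(8/5) + 1) / (ln 8 + 1) ≈ 0.477. In the fourth row the values are 0.14155101 and 0.29652856, whose ratio is also about 0.477.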

Output:

['accept', 'accepting', 'altria', 'and', 'announce', 'approaches', 'arthur', 'as', 'at', 'be', 'birth', 'britain', 'british', 'by', 'caliburn', 'ceremonial', 'character', 'decides', 'despite', 'destined', 'dies', 'draws', 'ector', 'eligible', 'embedded', 'enters', 'entrusted', 'explaining', 'fearing', 'fifteen', 'following', 'for', 'full', 'gender', 'growing', 'hardships', 'heir', 'her', 'hesitation', 'his', 'however', 'if', 'in', 'inspired', 'invasion', 'is', 'king', 'knight', 'known', 'large', 'leadership', 'leaving', 'legends', 'legitimate', 'loyal', 'mantle', 'merlin', 'monarch', 'name', 'nativity', 'never', 'no', 'not', 'of', 'or', 'pendragon', 'people', 'period', 'preserving', 'publicly', 'pulling', 'raises', 'recognize', 'responsible', 'ruler', 'saber', 'saxons', 'she', 'shoulders', 'sir', 'slab', 'son', 'soon', 'stone', 'subjects', 'surrogate', 'sword', 'symbolic', 'that', 'the', 'this', 'threat', 'throne', 'to', 'turmoil', 'uther', 'welfare', 'when', 'who', 'will', 'withdraws', 'without', 'woman']
[[0. 0. 0.15095332 0. 0. 0. 0.31622502 0. 0. 0. 0. 0. 0. 0.20340954 0. 0. 0.31622502 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.31622502 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.31622502 0. 0.17386773 0.24504638 0. 0. 0. 0. 0. 0.31622502 0. 0. 0. 0. 0. 0.31622502 0. 0. 0. 0. 0.15095332 0. 0.31622502 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.31622502 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.15095332 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0.23250474 0. 0.11098857 0. 0.23250474 0. 0. 0.14955705 0.23250474 0. 0.23250474 0. 0. 0. 0. 0. 0. 0.23250474 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.23250474 0. 0. 0. 0. 0.18017058 0. 0. 0. 0.11098857 0. 0.23250474 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.23250474 0. 0. 0. 0. 0. 0.23250474 0.23250474 0. 0.23250474 0. 0.23250474 0. 0. 0. 0. 0.23250474 0. 0. 0. 0. 0.18017058 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.23250474 0. 0. 0. 0. 0. 0. 0. 0. 0.14955705 0. 0.18017058 0. 0. 0. 0.14955705 0. 0. 0.23250474]
 [0. 0. 0. 0. 0. 0. 0. 0.18736875 0. 0. 0. 0. 0. 0.18736875 0. 0. 0. 0. 0. 0. 0. 0. 0.29128766 0. 0. 0. 0.29128766 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.13904921 0. 0. 0. 0. 0. 0. 0. 0.1601566 0. 0.29128766 0. 0. 0. 0. 0. 0. 0.29128766 0. 0.22572213 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.29128766 0. 0. 0. 0. 0. 0.18736875 0. 0.29128766 0. 0.29128766 0. 0. 0. 0.29128766 0. 0. 0. 0. 0. 0. 0. 0.18736875 0. 0. 0. 0. 0.29128766 0. 0. 0. 0.]
 [0. 0. 0.14155101 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.29652856 0. 0. 0.29652856 0. 0. 0. 0. 0. 0.29652856 0. 0. 0. 0. 0. 0. 0.29652856 0. 0. 0. 0. 0. 0. 0. 0. 0.16303816 0.22978336 0. 0.29652856 0. 0. 0.29652856 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.29652856 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.14155101 0. 0. 0.29652856 0.19073992 0. 0.22978336 0. 0.29652856 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.24121053 0. 0.20022545 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.31127497 0. 0. 0. 0. 0.31127497 0. 0. 0. 0.31127497 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.31127497 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.25158536 0. 0. 0. 0.31127497 0. 0. 0. 0. 0. 0. 0. 0. 0.31127497 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.25158536 0. 0.31127497 0. 0. 0.31127497 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0.10632924 0. 0. 0.22274414 0. 0.14327861 0. 0. 0. 0. 0.22274414 0. 0.17260697 0.22274414 0. 0. 0. 0.22274414 0. 0. 0. 0. 0.22274414 0. 0. 0.22274414 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.10632924 0. 0. 0. 0.22274414 0.22274414 0. 0. 0. 0. 0. 0. 0.22274414 0. 0. 0. 0. 0. 0. 0.17260697 0. 0. 0. 0. 0. 0. 0.10632924 0. 0. 0.17260697 0. 0. 0. 0. 0. 0.22274414 0. 0.17260697 0. 0. 0.14327861 0. 0. 0.22274414 0. 0.22274414 0.22274414 0. 0. 0.17260697 0. 0.22274414 0.10632924 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.14327861 0.22274414 0. 0.]
 [0. 0.24521796 0.11705736 0.19002219 0. 0. 0. 0. 0. 0.24521796 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.24521796 0. 0. 0. 0.24521796 0. 0.11705736 0. 0. 0.24521796 0. 0. 0. 0. 0.13482643 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.24521796 0. 0. 0. 0. 0. 0.24565801 0. 0. 0.19002219 0. 0.24521796 0. 0.24521796 0. 0. 0.24521796 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.19002219 0.24521796 0. 0.19819534 0.24521796 0. 0. 0. 0. 0. 0.24521796 0. 0. 0.15773474 0. 0. 0.]
 [0. 0. 0. 0.38872173 0. 0. 0. 0. 0. 0. 0. 0.22958532 0. 0. 0.22958532 0. 0. 0. 0.29627299 0. 0. 0.29627299 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.22958532 0. 0. 0. 0.14142901 0.29627299 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.29627299 0. 0. 0. 0. 0.29627299 0. 0. 0. 0. 0. 0. 0. 0.14142901 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.19057553 0.29627299 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.29627299 0.]]
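The matrix by itself does not yet pick any sentences. One plausible way to finish the idea described at the top (a normalized average score per sentence, then keep the best ones) is sketched below. The helper name `rank_sentences`, the choice of averaging only the nonzero weights in each row, and the toy numbers are my assumptions, not something the listing above defines; the rows could come from `sklearn_binary.toarray()`.

```python
def rank_sentences(tfidf_rows, sentences, top_n=3):
    """Pick the top_n sentences by mean nonzero TF-IDF weight.

    tfidf_rows: one sequence of weights per sentence (one matrix row each).
    Returns the chosen sentences in their original document order.
    """
    scored = []
    for position, (row, sentence) in enumerate(zip(tfidf_rows, sentences)):
        nonzero = [w for w in row if w > 0]
        # Average over the terms that actually occur in the sentence;
        # an empty row scores 0.0 instead of dividing by zero.
        score = sum(nonzero) / len(nonzero) if nonzero else 0.0
        scored.append((score, position, sentence))

    # Highest scores first, then restore reading order for the summary.
    best = sorted(scored, key=lambda item: item[0], reverse=True)[:top_n]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]


# Toy rows and placeholder sentences (hypothetical values for illustration).
toy_rows = [
    [0.5, 0.5, 0.0],
    [0.9, 0.0, 0.0],
    [0.2, 0.2, 0.2],
    [0.0, 0.0, 0.0],
]
toy_sentences = ["first", "second", "third", "fourth"]
summary = rank_sentences(toy_rows, toy_sentences, top_n=2)
print(summary)
```

One caveat with this choice: because each row is L2-normalized, short sentences with few distinct words tend to get a higher average, so the ranking may favor them. Summing the weights or setting a minimum sentence length would be reasonable alternatives.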


Reposted from: http://drwgl.baihongyu.com/
