A Beginner's IT Notes: First Experience with Python Word Clouds



Published: 2019-06-06


Assignment source:

1. Download a long Chinese novel.

2. Read the text to be analyzed from a file.

    import jieba

    txt = open(r'G:\aa\三体.txt', 'r', encoding='utf8').read()  # read the Three-Body novel
    jieba.load_userdict(r'G:\aa\three.txt')  # load the Three-Body custom dictionary

    with open(r'G:\aa\stops_chinese.txt', 'r', encoding='utf8') as f:  # Chinese stop-word list
        stops = f.read().split('\n')  # one stop word per line

3. Install jieba and use it for Chinese word segmentation.
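The post gives no code for this step, so here is a minimal sketch of what the segmentation call might look like. It assumes jieba has been installed with `pip install jieba`. The helper name `segment` and its `cutter` parameter are illustrative and not part of jieba; only `jieba.lcut` and `jieba.add_word` are library calls.

```python
def segment(text, user_words=(), cutter=None):
    """Split text into a list of non-blank tokens.

    By default this uses jieba.lcut; `cutter` is a hypothetical hook that
    lets any callable str -> list[str] be swapped in.
    """
    if cutter is None:
        import jieba  # third-party package: pip install jieba

        for word in user_words:
            jieba.add_word(word)  # register a term so it is kept as one token
        cutter = jieba.lcut
    # Drop tokens that are empty or contain only whitespace.
    return [tok for tok in cutter(text) if tok.strip()]
```

Calling `segment(txt)` on the novel text would then produce a token list like the `wordsls` used in the later steps.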

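4. Update the dictionary with the domain-specific vocabulary of the text being analyzed.

First, download the txt file of the text you want to analyze.

Then download a domain dictionary from a dictionary site. Reference download address: https://pinyin.sogou.com/dict/. The dictionaries there come as .scel files, which the following script converts to plain text: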

    # -*- coding: utf-8 -*-
    import os
    import struct

    # Offset of the pinyin table
    startPy = 0x1540

    # Offset of the Chinese phrase table
    startChinese = 0x2628

    # Global pinyin table: index -> pinyin syllable
    GPy_Table = {}

    # Parse result: a list of (frequency, pinyin, Chinese phrase) tuples


    def read_u16(data, pos):
        """Read a little-endian unsigned 16-bit integer at data[pos:pos + 2]."""
        return struct.unpack_from('<H', data, pos)[0]


    # Convert raw bytes (UTF-16LE) to a string, skipping NUL characters
    def byte2str(data):
        pos = 0
        text = ''
        while pos < len(data):
            c = chr(read_u16(data, pos))
            if c != chr(0):
                text += c
            pos += 2
        return text


    # Build the pinyin table
    def getPyTable(data):
        data = data[4:]
        pos = 0
        while pos < len(data):
            index = read_u16(data, pos)
            pos += 2
            lenPy = read_u16(data, pos)
            pos += 2
            py = byte2str(data[pos:pos + lenPy])

            GPy_Table[index] = py
            pos += lenPy


    # Get the pinyin of one phrase
    def getWordPy(data):
        pos = 0
        ret = ''
        while pos < len(data):
            index = read_u16(data, pos)
            ret += GPy_Table[index]
            pos += 2
        return ret


    # Read the Chinese phrase table
    def getChinese(data):
        GTable = []
        pos = 0
        while pos < len(data):
            # Number of homophones
            same = read_u16(data, pos)

            # Length of the pinyin index table
            pos += 2
            py_table_len = read_u16(data, pos)

            # Pinyin index table
            pos += 2
            py = getWordPy(data[pos: pos + py_table_len])

            # Chinese phrases
            pos += py_table_len
            for i in range(same):
                # Length of the phrase
                c_len = read_u16(data, pos)
                # The phrase itself
                pos += 2
                word = byte2str(data[pos: pos + c_len])
                # Length of the extension data
                pos += c_len
                ext_len = read_u16(data, pos)
                # Word frequency
                pos += 2
                count = read_u16(data, pos)

                # Save the entry
                GTable.append((count, py, word))

                # Move to the next entry
                pos += ext_len
        return GTable


    def scel2txt(file_name):
        print('-' * 60)
        with open(file_name, 'rb') as f:
            data = f.read()

        print("Dictionary name:", byte2str(data[0x130:0x338]))
        print("Dictionary type:", byte2str(data[0x338:0x540]))
        print("Description:", byte2str(data[0x540:0xd40]))
        print("Examples:", byte2str(data[0xd40:startPy]))

        getPyTable(data[startPy:startChinese])
        return getChinese(data[startChinese:])


    if __name__ == '__main__':
        # Folder containing the .scel files; change to your own
        in_path = r"C:\Users\Administrator\Downloads"
        # Folder for the converted text dictionaries
        out_path = r"C:\Users\Administrator\Downloads\text"
        os.makedirs(out_path, exist_ok=True)
        fin = [fname for fname in os.listdir(in_path) if fname.endswith(".scel")]
        for f in fin:
            try:
                words = scel2txt(os.path.join(in_path, f))
                file_path = os.path.join(out_path, str(f).split('.')[0] + '.txt')
                # Save the result, one phrase per line
                with open(file_path, 'a+', encoding='utf-8') as file:
                    for word in words:
                        file.write(word[2] + '\n')
                # Note: the source .scel file is deleted after conversion
                os.remove(os.path.join(in_path, f))
            except Exception as e:
                print(e)
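The byte2str helper above reads two bytes at a time as an unsigned 16-bit integer and turns each value into a character. For characters in the Basic Multilingual Plane, that amounts to decoding UTF-16 little-endian text and dropping NUL padding. The sketch below illustrates the equivalence on made-up bytes; `decode_field` is a hypothetical name, not part of the script.

```python
import struct


def byte2str(data):
    """Character-by-character version, in the style of the converter script."""
    chars = []
    for pos in range(0, len(data) - 1, 2):
        code = struct.unpack_from('<H', data, pos)[0]  # little-endian uint16
        if code != 0:
            chars.append(chr(code))
    return ''.join(chars)


def decode_field(data):
    """Hypothetical shortcut: decode UTF-16LE, then strip NUL padding."""
    return data.decode('utf-16-le').replace('\x00', '')


# A fixed-width field holding two characters followed by NUL padding (sample data).
field = '词库'.encode('utf-16-le') + b'\x00' * 6
print(byte2str(field), decode_field(field))
```

Writing the byte order explicitly as `'<H'` keeps the result independent of the machine the script runs on.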


5. Generate the word-frequency statistics.

    # Count word frequencies
    wcdict = {}
    for word in tokens:
        if len(word) == 1:
            continue  # skip single-character tokens
        wcdict[word] = wcdict.get(word, 0) + 1
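The same count can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the author's code, and the `tokens` list here is sample data.

```python
from collections import Counter

# Sample tokens for illustration; in the article this list comes from segmentation.
tokens = ["三体", "的", "地球", "三体", "文明", "地球", "三体"]

# Count only tokens longer than one character, mirroring the loop above.
wcdict = Counter(word for word in tokens if len(word) > 1)

print(wcdict)
```

A `Counter` behaves like a dict, so the sorting step that follows works on it unchanged.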


6. Sort.

    # Sort by frequency, highest first
    wcls = list(wcdict.items())
    wcls.sort(key=lambda x: x[1], reverse=True)
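Python's sort is stable, so words with equal counts keep the order in which they were first inserted into the dictionary. If a reproducible order is wanted regardless of input order, one option is a composite sort key, sketched below on an invented frequency table.

```python
# Sample frequency table; the numbers are invented for illustration.
wcdict = {"地球": 2, "三体": 3, "文明": 2, "宇宙": 1}

# Sort by count descending, then by word ascending to break ties.
wcls = sorted(wcdict.items(), key=lambda item: (-item[1], item[0]))

print(wcls[:3])
```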


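7. Exclude grammatical function words such as pronouns, articles, and conjunctions (stop words).

    with open(r'G:\aa\stops_chinese.txt', 'r', encoding='utf8') as f:  # Chinese stop-word list
        stops = f.read().split('\n')  # one stop word per line

    # wordsls is the token list produced by segmentation, e.g. jieba.lcut(txt)
    tokens = [token for token in wordsls if token not in stops]
    print("Tokens after vs. before filtering:", len(tokens), len(wordsls))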
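Membership tests on a list scan every element, so for a long novel it may be worth turning the stop words into a set first. The sketch below assumes that approach and uses a made-up string in place of the file contents. Dropping blank entries also avoids the empty string that a trailing newline would otherwise leave in the list.

```python
# Stand-in for the contents of the stop-word file (sample data).
raw = "的\n了\n是\n\n"

# splitlines() handles the line endings; a set gives O(1) average lookups.
stops = {line.strip() for line in raw.splitlines() if line.strip()}

# Sample token list standing in for the segmented novel.
wordsls = ["三体", "的", "世界", "是", "真实", "了"]
tokens = [token for token in wordsls if token not in stops]

print(len(tokens), len(wordsls))
```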


8. Output the TOP 20 words by frequency and save the results to a file.

    import pandas as pd

    # Print the 20 most frequent words
    for item in wcls[:20]:
        print(item)

    # Save the sorted word frequencies to a CSV file
    pd.DataFrame(wcls).to_csv('three.csv', encoding='utf-8')
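If pandas is not available, the standard csv module can write a similar table. The sketch below is an alternative under that assumption; the helper name `save_top`, the column names, and any file name passed to it are illustrative.

```python
import csv


def save_top(wcls, path, n=20):
    """Write the first n (word, count) pairs to a CSV file with a header row."""
    top = wcls[:n]  # slicing is safe even when fewer than n words exist
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(top)
    return top
```

For example, `save_top(wcls, "top20.csv")` would write the header plus up to 20 rows.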


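9. Generate the word cloud.

    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Read the CSV of word frequencies
    txt = open('three.csv', 'r', encoding='utf-8').read()

    # Segment the text and join the tokens with spaces, as WordCloud expects
    cut_text = " ".join(jieba.lcut(txt))
    # For Chinese text, WordCloud usually needs font_path set to a CJK-capable font
    mywc = WordCloud().generate(cut_text)

    plt.imshow(mywc)
    plt.axis("off")
    plt.show()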
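Each word appears only once in three.csv, so re-segmenting that file likely gives WordCloud little information about the original counts. One alternative, sketched here rather than taken from the post, is to pass the counts directly through `WordCloud.generate_from_frequencies`. It assumes the wordcloud package is installed and that `font_path` points to a font with Chinese glyphs on your machine; the helper names and the output file name are hypothetical.

```python
def to_frequencies(wcls, max_words=200):
    """Turn a sorted list of (word, count) pairs into a dict of the top entries."""
    return dict(wcls[:max_words])


def render_cloud(wcls, font_path, out_png="cloud.png"):
    """Draw a word cloud from counts and save it as a PNG (needs wordcloud)."""
    from wordcloud import WordCloud  # third-party: pip install wordcloud

    # font_path is an assumption: it must name a CJK-capable font file.
    wc = WordCloud(font_path=font_path, background_color="white",
                   width=800, height=600)
    wc.generate_from_frequencies(to_frequencies(wcls))
    wc.to_file(out_png)
    return out_png
```

With this approach the size of each word follows its count in the novel, and `background_color` is one way to change the background.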


Default shape:

With a modified background:

Source code:

Reposted from: https://www.cnblogs.com/JGaoLin/p/10586208.html
