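English Word Segmentation (Without Tools Like re)

Published: January 16, 2024

Split input English text into individual meaningful words, without using tools such as re.

(The template for this note was created by a Python script on 2024-01-15 23:34:05. The note is aimed at coders who know basic programming and are familiar with Python strings.)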


Article URL: https://blog.csdn.net/m0_57158496/article/details/135613713


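Contents

◆ English Word Segmentation
- 1. The Idea Takes Root
- 2. Algorithm Walkthrough
  - 2.1 Removing Non-Letter Characters
  - 2.2 Counting Word Frequency
  - 2.3 Word Segmentation
- 3. Complete Source Code (Python)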

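◆ English Word Segmentation

1. The Idea Takes Root

Today I came across jieba on CSDN, and an idea popped into my head: "Could I write a piece of code that segments text the way jieba does?" So I started experimenting.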


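2. Algorithm Walkthrough

The test text is 英文美文.txt, which contains 6k+ characters.

[Screenshots of the results: the token list and the word-frequency statistics, middle portion omitted]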

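2.1 Removing Non-Letter Characters

Description: every character that is not one of the 52 upper- and lower-case English letters is replaced with an English space, so punctuation, digits, and Chinese characters all become word separators.

[Screenshot of the code's output]

Python code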


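    def _isletter(self):
        ''' Replace non-letter characters with spaces. '''
        lowers = ''.join(chr(i) for i in range(ord('a'), ord('z')+1)) # Build the string of 26 lowercase letters.
        letters = set(lowers + lowers.upper()) # All 52 letters; a set gives fast membership tests.
        #input(letters) # Debug check of the letter set.
        words = [i if i in letters else ' ' for i in self.words] # Replace each non-letter with an English space.
        
        return ''.join(words)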
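
The same step can be tried outside the class. The following is a minimal standalone sketch, not the article's own code: the function name `strip_non_letters` is hypothetical, and it takes the letter set from the standard library's `string.ascii_letters` instead of building it with `chr()` and `ord()`. It assumes, as the method above does, that only ASCII letters count as letters.

```python
import string

# Assumption: only the 52 ASCII letters count as letters.
LETTERS = frozenset(string.ascii_letters)


def strip_non_letters(text):
    """Replace every character that is not an ASCII letter with a space."""
    return ''.join(ch if ch in LETTERS else ' ' for ch in text)


# The apostrophe, the period, and the Chinese characters all become spaces,
# so splitting on whitespace yields letter-only fragments.
tokens = strip_non_letters("I'm an old man. 我爱Python。").split()
print(tokens)  # ['I', 'm', 'an', 'old', 'man', 'Python']
```

Note that the contraction "I'm" falls apart into "I" and "m". This is presumably why fragments such as 'm', 's', 'd' and 're' appear in the list of words to discard in the segmentation step.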


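2.2 Counting Word Frequency

Description: each distinct word is paired with its number of occurrences. The pairs are then sorted by frequency in descending order, and words with the same frequency stay in alphabetical order.

[Screenshot of the code's output]

Python code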


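    def _count(self, words):
        ''' Count word frequencies. '''
        words = [(i, words.count(i)) for i in set(words)] # Pair each distinct word with its count.
        words.sort(key=lambda x: x[0]) # Sort alphabetically by word.
        words.sort(key=lambda x: x[-1], reverse=True) # Stable sort by frequency, descending; ties stay alphabetical.
        
        return words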
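
The method above calls `list.count` once per distinct word, so the whole list is rescanned each time. For longer texts, a dictionary can gather the same counts in a single pass. The sketch below is one possible alternative, not the article's method: `count_words` is a hypothetical name, and the two sorts are folded into a single sort key that should give the same order.

```python
def count_words(words):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1  # single pass over the list
    # Sort by descending count, then by the word itself.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


print(count_words(['love', 'Python', 'love', 'old', 'Python', 'love']))
# [('love', 3), ('Python', 2), ('old', 1)]
```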


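2.3 Word Segmentation

Description: the cleaned text is split on whitespace, and words from a hand-written list of uninformative words (pronouns, auxiliaries, prepositions and the like, plus their capitalized forms) are dropped. The remaining words are passed to the frequency counter.

[Screenshot of the code's output]

Python code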


    def split(self):
        ''' Segment the text into words. '''
        nowords = ('I', 'me', 'my', 'mine', 'you', 'your', 'she', 'her', 'hers', 'he', 'his', 'him', 'we', 'our', 'ours', 'they', 'their', 'them', 'its', 'it', 'a', 'an', 'm', 's', 'd', 'did', 'do', 'doing', 'does', 'done', 'can', 'would', 'am', 'is', 'was', 'are', 'were', 'be', 'have', 'has', 'often', 'always', 'to', 'too', 'very', 'many', 'any', 'in', 'on', 'with', 'at', 'of', 'up', 'down', 'go', 'goes', 'went', 'for', 'about', 'now', 'if', 'but', 're', 'from', 'the', 'there', 'this', 'that', 'than', 'when', 'what', 'where', 'who', 'why', 'so', 'as', 'yes', 'no', 'not', 'join', 'or', 'and', 'by')
        nowords = set(nowords) | {i.title() for i in nowords} # Also match the capitalized forms.
        #input(nowords) # Debug check of the discarded-word set.
        words = [i for i in self._isletter().split() if i not in nowords] # Drop the discarded words; split() already skips empty strings.
        print(words) # Print the token list.
        
        return self._count(words)
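
The filter above matches only the lowercase and capitalized forms, so an all-caps "THE" would pass through. One possible variation, sketched here under stated assumptions, lowercases each word before the lookup. The names `tokenize` and `STOP_WORDS` are hypothetical, and the stop-word set is deliberately much shorter than the article's list.

```python
import string

LETTERS = frozenset(string.ascii_letters)
# Hypothetical, much shorter stop-word set than the one in the article.
STOP_WORDS = frozenset({'i', 'm', 'a', 'an', 'the', 's', 're', 'is', 'it'})


def tokenize(text, stop_words=STOP_WORDS):
    """Split text into letter-only words and drop stop words, ignoring case."""
    cleaned = ''.join(ch if ch in LETTERS else ' ' for ch in text)
    # Compare in lowercase so 'The', 'THE' and 'the' are all filtered;
    # the surviving words keep their original spelling.
    return [word for word in cleaned.split() if word.lower() not in stop_words]


print(tokenize("I'm an old man. It's THE Python I love."))
# ['old', 'man', 'Python', 'love']
```

The trade-off is that a case-insensitive filter cannot tell a stop word from a capitalized name spelled the same way, so whether it is an improvement depends on the text.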


3. Complete Source Code (Python)


#!/usr/bin/env python
# coding: utf-8


'''
English word segmentation
'''

class EnSplit:
    
    def __init__(self, text):
        self.words = text
        
    def _isletter(self):
        ''' Replace non-letter characters with spaces. '''
        lowers = ''.join(chr(i) for i in range(ord('a'), ord('z')+1)) # Build the string of 26 lowercase letters.
        letters = set(lowers + lowers.upper()) # All 52 letters; a set gives fast membership tests.
        #input(letters) # Debug check of the letter set.
        words = [i if i in letters else ' ' for i in self.words] # Replace each non-letter with an English space.
        
        return ''.join(words)
        
    def _count(self, words):
        ''' Count word frequencies. '''
        words = [(i, words.count(i)) for i in set(words)] # Pair each distinct word with its count.
        words.sort(key=lambda x: x[0]) # Sort alphabetically by word.
        words.sort(key=lambda x: x[-1], reverse=True) # Stable sort by frequency, descending; ties stay alphabetical.
        
        return words

    def split(self):
        ''' Segment the text into words. '''
        nowords = ('I', 'me', 'my', 'mine', 'you', 'your', 'she', 'her', 'hers', 'he', 'his', 'him', 'we', 'our', 'ours', 'they', 'their', 'them', 'its', 'it', 'a', 'an', 'm', 's', 'd', 'did', 'do', 'doing', 'does', 'done', 'can', 'would', 'am', 'is', 'was', 'are', 'were', 'be', 'have', 'has', 'often', 'always', 'to', 'too', 'very', 'many', 'any', 'in', 'on', 'with', 'at', 'of', 'up', 'down', 'go', 'goes', 'went', 'for', 'about', 'now', 'if', 'but', 're', 'from', 'the', 'there', 'this', 'that', 'than', 'when', 'what', 'where', 'who', 'why', 'so', 'as', 'yes', 'no', 'not', 'join', 'or', 'and', 'by')
        nowords = set(nowords) | {i.title() for i in nowords} # Also match the capitalized forms.
        #input(nowords) # Debug check of the discarded-word set.
        words = [i for i in self._isletter().split() if i not in nowords] # Drop the discarded words; split() already skips empty strings.
        print(words) # Print the token list.

        return self._count(words)


if __name__ == '__main__':
    text = '''
    I'm an old man. I love Python.
    我是一个老男人，我爱Python。
    '''
    # The sample above is then replaced by the 6k+ character test file
    # (assumed here to be UTF-8 encoded).
    with open('/sdcard/Documents/英文美文.txt', encoding='utf-8') as f:
        text = f.read()
    en = EnSplit(text)
    print('\n'.join([f"{i[0]}: {i[-1]}" for i in en.split()]))



