Five Ways to Implement Sensitive-Word Filtering in Python


Category: Python | Source: Internet | Author: Anonymous | Published: 2025-04-06 22:19:39


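1. Replacement with replace

replace is the simplest form of string substitution. When a string may contain sensitive words, we call the string's replace method to substitute asterisks (*) for each of them.

Drawback:

This works acceptably when the text and the word list are both small, but it becomes inefficient as either grows, because every word requires another full pass over the text.

Sample code: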

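# The sample sentence means: "I am a superman from the stars, with superman powers!"
text = '我是一个来自星星的超人，具有超人本领！'
# Each word is masked with as many asterisks as it has characters (two here).
text = text.replace("超人", '*' * len("超人")).replace("星星", '*' * len("星星"))
print(text)  # 我是一个来自**的**，具有**本领！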


When there are several sensitive words, put them in a list and replace them one by one.

Sample code:

text = '我是一个来自星星的超人，具有超人本领！'
words = ['超人', '星星']
for word in words:
    # Mask each word with as many asterisks as it has characters.
    text = text.replace(word, '*' * len(word))
print(text)  # 我是一个来自**的**，具有**本领！
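The loop above applies the words in list order, which matters when one sensitive word is contained in another: if the shorter word is replaced first, the longer one no longer matches. A minimal sketch of one way to handle this, assuming that masking the longest word first is the desired behavior (the function name `mask_with_replace` and the `mask` parameter are hypothetical, not from the original article):

```python
def mask_with_replace(text, words, mask="*"):
    """Mask every occurrence of each word, handling longer words first."""
    # Longest words first, so a short word cannot break up a longer one.
    for word in sorted(set(words), key=len, reverse=True):
        if word:  # skip empty strings, which replace() would match everywhere
            text = text.replace(word, mask * len(word))
    return text


print(mask_with_replace("superman and superwoman", ["super", "superman"]))
# ******** and *****woman
```

With the words in their given order, "super" would be masked first and "superman" would be left only partly hidden as "*****man".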


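2. Regular expressions

Regular expressions are a simple and effective way to match and filter sensitive words. The approach here relies on the alternation operator "|", which matches any one of several target strings. Note that the sample replaces every match with a fixed '***', regardless of the word's length.

Sample code: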

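import re

def filter_words(text, words):
    # Escape each word so that regex metacharacters in it are matched literally.
    pattern = '|'.join(re.escape(word) for word in words)
    # Every match is replaced with a fixed three-asterisk mask.
    return re.sub(pattern, '***', text)

if __name__ == '__main__':
    text = '我是一个来自星星的超人，具有超人本领！'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一个来自***的***，具有***本领！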
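The sample above always substitutes three asterisks. If the mask should instead match the length of each word, as in the replace-based version, `re.sub` also accepts a function as the replacement. The following is a sketch under that assumption; the name `filter_words_regex` and the `mask` parameter are illustrative only:

```python
import re


def filter_words_regex(text, words, mask="*"):
    """Mask each matched word with one mask character per matched character."""
    # "|" takes the first alternative that matches, so try longer words first.
    ordered = sorted((w for w in set(words) if w), key=len, reverse=True)
    if not ordered:
        return text  # an empty pattern would match at every position
    # re.escape keeps characters such as "." or "+" literal.
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    return pattern.sub(lambda m: mask * len(m.group()), text)


print(filter_words_regex("a.b axb", ["a.b"]))  # *** axb
```

Compiling the pattern once is convenient when the same word list is applied to many texts, since the compiled object can be kept and reused.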


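3. Using the third-party ahocorasick library

The library implements the Aho-Corasick algorithm, which finds all occurrences of many words in a single pass over the text. The package is installed under the name pyahocorasick and imported as ahocorasick.

Installing the ahocorasick library:

pip install pyahocorasick

Sample code: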

import ahocorasick

def filter_words(text, words):
    # Build the automaton from the word list.
    A = ahocorasick.Automaton()
    for index, word in enumerate(words):
        A.add_word(word, (index, word))
    A.make_automaton()
    # Collect the inclusive (start, end) span of every match.
    result = []
    for end_index, (insert_order, original_value) in A.iter(text):
        start_index = end_index - len(original_value) + 1
        result.append((start_index, end_index))
    # Mask each span; the text length never changes, so the indices stay valid.
    for start_index, end_index in result[::-1]:
        text = text[:start_index] + '*' * (end_index - start_index + 1) + text[end_index + 1:]
    return text

if __name__ == '__main__':
    text = '我是一个来自星星的超人，具有超人本领！'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一个来自**的**，具有**本领！
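The masking loop in the sample above builds a new copy of the whole string for every match. For texts with many matches, one possible alternative is to mask the characters in a list and join it once at the end. This is only a sketch: `mask_spans` is a hypothetical helper, it assumes the same inclusive (start, end) spans as the sample, and it assumes `mask` is a single character so the text length is preserved.

```python
def mask_spans(text, spans, mask="*"):
    """Mask inclusive (start, end) index spans; overlapping spans are fine."""
    chars = list(text)
    last = len(chars) - 1
    for start, end in spans:
        # Clamp to the text so an out-of-range span cannot raise IndexError.
        for i in range(max(start, 0), min(end, last) + 1):
            chars[i] = mask
    return "".join(chars)


print(mask_spans("abcdef", [(1, 3), (2, 4)]))  # a****f
```

The helper depends only on the spans, so it could be paired with any matcher that reports them.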


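4. Trie (prefix tree)

A trie is an efficient structure for matching and filtering sensitive words. Words that share a prefix share a path in the tree, so from each starting position in the text a single walk down the tree checks all words at once.

Sample code: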

class TreeNode:
    def __init__(self):
        self.children = {}    # char -> child TreeNode
        self.is_end = False   # True if a word ends at this node

class Tree:
    def __init__(self):
        self.root = TreeNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TreeNode()
            node = node.children[char]
        node.is_end = True

    def search(self, word):
        # Return True only if the exact word was inserted.
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end

def filter_words(text, words):
    tree = Tree()
    for word in words:
        tree.insert(word)
    # From every start position i, walk the trie and record each match (i, j).
    result = []
    for i in range(len(text)):
        node = tree.root
        for j in range(i, len(text)):
            if text[j] not in node.children:
                break
            node = node.children[text[j]]
            if node.is_end:
                result.append((i, j))
    # Mask each span; the text length never changes, so the indices stay valid.
    for start_index, end_index in result[::-1]:
        text = text[:start_index] + '*' * (end_index - start_index + 1) + text[end_index + 1:]
    return text

if __name__ == '__main__':
    text = '我是一个来自星星的超人，具有超人本领！'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一个来自**的**，具有**本领！
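The scan above records every match, including overlapping ones, and restarts from each character of the text. A common variation keeps only the longest match at each position and then skips past it. The sketch below illustrates that variation with a nested-dict trie; `END`, `build_trie`, and `filter_longest` are hypothetical names chosen for this example, not part of the original code.

```python
END = object()  # sentinel key marking that a word ends at this node


def build_trie(words):
    """Build a nested-dict trie: each node maps a character to a child node."""
    root = {}
    for word in words:
        if not word:
            continue  # ignore empty words
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def filter_longest(text, words, mask="*"):
    """Mask the longest word starting at each position, then skip past it."""
    root = build_trie(words)
    out = []
    i = 0
    while i < len(text):
        node, longest = root, 0
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if END in node:
                longest = j - i + 1  # remember the longest match so far
        if longest:
            out.append(mask * longest)
            i += longest
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(filter_longest("abcde", ["ab", "abcd"]))  # ****e
```

The trade-off is that words overlapping an already masked match are not reported: with the words "ab" and "bc", the text "abc" becomes "**c" here, whereas the all-matches scan masks all three characters.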


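5. DFA algorithm

A DFA (Deterministic Finite Automaton) is another efficient way to match and filter sensitive words. The basic idea is to detect words through state transitions: the text is scanned only once, and that single scan checks for all sensitive words. The implementation below follows the Aho-Corasick construction, that is, a trie of transitions plus failure links that tell the automaton where to resume after a mismatch.

Sample code: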

class DFA:
    def __init__(self, words):
        self.words = words
        self.build()

    def build(self):
        self.transitions = {}  # (state, char) -> next state
        self.fails = {}        # state -> fallback state on a mismatch
        self.outputs = {}      # state -> list of words ending at this state
        state = 0
        # Phase 1: build the trie of transitions.
        for word in self.words:
            current_state = 0
            for char in word:
                next_state = self.transitions.get((current_state, char), None)
                if next_state is None:
                    state += 1
                    self.transitions[(current_state, char)] = state
                    current_state = state
                else:
                    current_state = next_state
            self.outputs.setdefault(current_state, []).append(word)
        # Phase 2: breadth-first pass that fills in the failure links.
        queue = []
        for (start_state, char), next_state in self.transitions.items():
            if start_state == 0:
                queue.append(next_state)
                self.fails[next_state] = 0
        while queue:
            r_state = queue.pop(0)
            for (state, char), next_state in self.transitions.items():
                if state == r_state:
                    queue.append(next_state)
                    fail_state = self.fails[state]
                    while (fail_state, char) not in self.transitions and fail_state != 0:
                        fail_state = self.fails[fail_state]
                    self.fails[next_state] = self.transitions.get((fail_state, char), 0)
                    # Inherit the words recognised at the fallback state.
                    # A list is used (not a joined string) so each word keeps its own length.
                    if self.fails[next_state] in self.outputs:
                        self.outputs.setdefault(next_state, []).extend(self.outputs[self.fails[next_state]])

    def search(self, text):
        state = 0
        result = []
        for i, char in enumerate(text):
            while (state, char) not in self.transitions and state != 0:
                state = self.fails[state]
            state = self.transitions.get((state, char), 0)
            # Report an inclusive (start, end) span for every word ending at position i.
            for word in self.outputs.get(state, []):
                result.append((i - len(word) + 1, i))
        return result

def filter_words(text, words):
    dfa = DFA(words)
    result = []
    for start_index, end_index in dfa.search(text):
        result.append((start_index, end_index))
    # Mask each span; the text length never changes, so the indices stay valid.
    for start_index, end_index in result[::-1]:
        text = text[:start_index] + '*' * (end_index - start_index + 1) + text[end_index + 1:]
    return text

if __name__ == '__main__':
    text = '我是一个来自星星的超人，具有超人本领！'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一个来自**的**，具有**本领！
