Python Automation Scripts for Working with txt Files


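Preparation: Python libraries used

  • re (regular expression operations, for complex text matching)
  • csv (reading and writing CSV files)
  • json (reading and writing JSON files)
  • collections (for counting word frequency)
  • difflib (for comparing text)
  • matplotlib and wordcloud (for generating word clouds)

Only matplotlib and wordcloud are third-party packages and need to be installed, for example with pip install matplotlib wordcloud. The other libraries are part of the Python standard library.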

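1. Reading txt content

1.1. Reading a txt file line by line

Reading the txt file is the first step in data processing. The following example reads a txt file line by line: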

def read_txt_file_by_line(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            print(line.strip())

# Example call
read_txt_file_by_line('example.txt')

1.2. Reading the entire txt file

To read the whole txt file into a single string, use the following code:

def read_txt_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

# Example call
content = read_txt_file('example.txt')
print(content)
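
As an alternative sketch, the standard library's pathlib can read the whole file in a single call. The helper name read_txt_with_pathlib is hypothetical and not part of the original article.

```python
from pathlib import Path

def read_txt_with_pathlib(filepath, encoding='utf-8'):
    """Return the whole file as one string (hypothetical helper)."""
    # Path.read_text opens the file, reads it, and closes it again.
    return Path(filepath).read_text(encoding=encoding)
```

Calling read_txt_with_pathlib('example.txt') should return the same string as read_txt_file('example.txt').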

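2. Comparing two txt files

2.1. Basic text comparison

Sometimes we need to check whether two txt files have the same content. The following code reports each line that differs: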

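from itertools import zip_longest

def compare_txt_files(file1, file2):
    with open(file1, 'r', encoding='utf-8') as f1, open(file2, 'r', encoding='utf-8') as f2:
        content1 = f1.readlines()
        content2 = f2.readlines()

    # zip_longest (unlike zip) keeps going when one file has extra lines;
    # the missing side is reported as an empty string.
    for number, (line1, line2) in enumerate(zip_longest(content1, content2, fillvalue=''), start=1):
        if line1 != line2:
            print(f'Difference found at line {number}:\nFile1: {line1!r}\nFile2: {line2!r}')

# Example call
compare_txt_files('file1.txt', 'file2.txt')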
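
When only a yes/no answer is needed, a minimal sketch with the standard library's filecmp module may be enough. The function name files_are_identical is hypothetical. Note that it compares raw bytes, so two files with the same text but different encodings or line endings count as different.

```python
import filecmp

def files_are_identical(file1, file2):
    """Return True if both files have exactly the same bytes."""
    # shallow=False makes filecmp compare the file contents instead of
    # trusting matching os.stat() metadata (size, modification time).
    return filecmp.cmp(file1, file2, shallow=False)
```

For example, files_are_identical('file1.txt', 'file2.txt') returns a boolean without printing anything.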

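2.2. Highlighting differences

To show the differences between txt files more clearly, we can print them in unified diff format, where removed lines are marked with - and added lines with +. The difflib library does this: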

import difflib

def highlight_differences(file1, file2):
    with open(file1, 'r', encoding='utf-8') as f1, open(file2, 'r', encoding='utf-8') as f2:
        content1 = f1.readlines()
        content2 = f2.readlines()

    diff = difflib.unified_diff(content1, content2, fromfile='file1', tofile='file2')
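    for line in diff:
        # Lines from unified_diff normally end with '\n' already,
        # so suppress print's own newline to avoid blank lines
        print(line, end='')

# Example call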
highlight_differences('file1.txt', 'file2.txt')
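
For color highlighting in a browser, difflib also provides HtmlDiff, which renders a side-by-side HTML table. The following is a minimal sketch; the function name write_html_diff and the output file name are assumptions for illustration.

```python
import difflib

def write_html_diff(file1, file2, output_html):
    """Write a side-by-side HTML comparison of two text files."""
    with open(file1, 'r', encoding='utf-8') as f1, open(file2, 'r', encoding='utf-8') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()

    # make_file returns a complete HTML page as a string; changed, added,
    # and deleted text is marked with CSS classes that the page styles.
    html = difflib.HtmlDiff().make_file(lines1, lines2, fromdesc=file1, todesc=file2)

    with open(output_html, 'w', encoding='utf-8') as out:
        out.write(html)
```

Calling write_html_diff('file1.txt', 'file2.txt', 'diff.html') produces a page that can be opened in any browser.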

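3. Filtering txt file content

3.1. Filtering out lines that contain a keyword

When processing a txt file, you may need to drop the lines that contain a specific keyword. Here is an example: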

def filter_lines_by_keyword(filepath, keyword):
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
    filtered_lines = [line for line in lines if keyword not in line]
    return filtered_lines

# Example call
filtered = filter_lines_by_keyword('example.txt', 'filter_keyword')
for line in filtered:
    print(line.strip())
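
The same idea extends to several keywords, and to keeping matching lines instead of dropping them. This is a sketch under assumptions: the helper filter_lines_by_keywords and its keep_matching parameter are hypothetical, and it works on any iterable of lines rather than on a file path.

```python
def filter_lines_by_keywords(lines, keywords, keep_matching=False):
    """Filter lines by several keywords (hypothetical helper).

    keep_matching=False drops lines containing any keyword;
    keep_matching=True keeps only those lines.
    """
    result = []
    for line in lines:
        has_keyword = any(keyword in line for keyword in keywords)
        if has_keyword == keep_matching:
            result.append(line)
    return result

sample = ['INFO start\n', 'DEBUG detail\n', 'ERROR failed\n']
print(filter_lines_by_keywords(sample, ['DEBUG', 'ERROR']))  # ['INFO start\n']
```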

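3.2. Filtering out blank lines and comment lines

Sometimes blank lines and comment lines (for example, lines starting with #) need to be removed. The following code does this: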

def filter_empty_and_comment_lines(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
    filtered_lines = [line for line in lines if line.strip() and not line.strip().startswith('#')]
    return filtered_lines

# Example call
filtered = filter_empty_and_comment_lines('example.txt')
for line in filtered:
    print(line.strip())
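
For large files, a generator can apply the same filter without loading every line into memory first. The name iter_content_lines and the comment_prefix parameter are assumptions made for this sketch.

```python
def iter_content_lines(lines, comment_prefix='#'):
    """Yield lines that are neither blank nor comments, one at a time."""
    for line in lines:
        stripped = line.strip()
        # Keep the line only if it has content and is not a comment
        if stripped and not stripped.startswith(comment_prefix):
            yield line
```

Because an open file object is itself an iterable of lines, it can be passed directly, for example inside a with open('example.txt', encoding='utf-8') as file block.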

4. Merging multiple txt files

4.1. Simple merge

To concatenate the contents of several txt files into one file, use the following code:

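def merge_txt_files(file_list, output_file):
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for file in file_list:
            with open(file, 'r', encoding='utf-8') as infile:
                content = infile.read()
            outfile.write(content)
            # Add a newline only if the file did not end with one, so the
            # next file starts on its own line without an extra blank line
            if content and not content.endswith('\n'):
                outfile.write('\n')

# Example call
merge_txt_files(['file1.txt', 'file2.txt', 'file3.txt'], 'merged.txt')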
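
Instead of listing the files by hand, they can be collected from a directory. The sketch below assumes that all .txt files directly inside one directory should be merged in file-name order; merge_txt_in_directory is a hypothetical name.

```python
from pathlib import Path

def merge_txt_in_directory(directory, output_file):
    """Merge every .txt file directly inside `directory`, in name order."""
    output_path = Path(output_file).resolve()
    # sorted() makes the merge order deterministic; the output file is
    # skipped in case it already exists in the same directory.
    sources = [p for p in sorted(Path(directory).glob('*.txt'))
               if p.resolve() != output_path]
    with open(output_path, 'w', encoding='utf-8') as outfile:
        for path in sources:
            text = path.read_text(encoding='utf-8')
            outfile.write(text)
            if text and not text.endswith('\n'):
                outfile.write('\n')
    # Return the merged file names so the caller can check the order
    return [p.name for p in sources]
```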

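4.2. Interleaving files line by line

To interleave the lines of several files (the first line of each file, then the second line of each, and so on), use the following code: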

from contextlib import ExitStack

def merge_files_by_line(file_list, output_file):
    # ExitStack closes every input file even if an error occurs midway
    with ExitStack() as stack, open(output_file, 'w', encoding='utf-8') as outfile:
        files = [stack.enter_context(open(file, 'r', encoding='utf-8')) for file in file_list]
        while True:
            lines = [file.readline() for file in files]
            # readline() returns '' only at end of file
            if all(line == '' for line in lines):
                break
            for line in lines:
                if line:
                    outfile.write(line.strip() + '\n')

# Example call
merge_files_by_line(['file1.txt', 'file2.txt', 'file3.txt'], 'merged_by_line.txt')
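
When the lines are already in memory, the interleaving order can be sketched with itertools.zip_longest. The helper interleave_lines is hypothetical and only illustrates the ordering; it does no file I/O.

```python
from itertools import zip_longest

def interleave_lines(*line_lists):
    """Take one line from each list in turn until all lists are exhausted."""
    merged = []
    # zip_longest pads exhausted lists with None so longer lists keep going
    for group in zip_longest(*line_lists):
        merged.extend(line for line in group if line is not None)
    return merged

print(interleave_lines(['a1', 'a2', 'a3'], ['b1'], ['c1', 'c2']))
# ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```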

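5. Converting txt files to other formats

5.1. Converting to CSV

Sometimes the content of a txt file needs to be converted to CSV for further processing or analysis. Here is an example: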

import csv

def txt_to_csv(txt_file, csv_file):
    with open(txt_file, 'r', encoding='utf-8') as infile, open(csv_file, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        for line in infile:
            writer.writerow(line.strip().split())

# Example call
txt_to_csv('example.txt', 'output.csv')

This code reads the txt file line by line and splits each line on whitespace (spaces or tabs) to produce the CSV fields.
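
Splitting on any whitespace breaks fields that contain spaces themselves. If the input is known to be tab-separated (an assumption about the data, not something the original states), a sketch like the following splits on tabs only. The helper tsv_text_to_csv_text is hypothetical and works on strings to keep the example self-contained.

```python
import csv
import io

def tsv_text_to_csv_text(text):
    """Convert tab-separated text to CSV text, keeping spaces inside fields."""
    out = io.StringIO(newline='')
    writer = csv.writer(out, lineterminator='\n')
    for line in text.splitlines():
        if line.strip():                       # skip blank lines
            writer.writerow(line.split('\t'))  # split on tabs only
    return out.getvalue()

print(tsv_text_to_csv_text('name\tcity\nAlice Smith\tNew York, NY\n'))
```

The csv writer quotes the field New York, NY automatically because it contains a comma.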

5.2. Converting to JSON

Besides CSV, JSON is another common data storage format. The following example converts a txt file to JSON:

import json

def txt_to_json(txt_file, json_file):
    data = []
    with open(txt_file, 'r', encoding='utf-8') as infile:
        for line in infile:
            data.append(line.strip())

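    with open(json_file, 'w', encoding='utf-8') as outfile:
        # ensure_ascii=False writes non-ASCII text as-is instead of \uXXXX escapes
        json.dump(data, outfile, indent=4, ensure_ascii=False)

# Example call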
txt_to_json('example.txt', 'output.json')

This code stores each line of the txt file as one element of a JSON array.
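
By default the json module escapes non-ASCII characters, which matters when the txt file contains text such as Chinese. The small sketch below shows the difference; lines_to_json_text is a hypothetical helper that works on a list instead of a file.

```python
import json

def lines_to_json_text(lines, ensure_ascii=True):
    """Serialize a list of lines to a JSON array string (hypothetical helper)."""
    return json.dumps(lines, ensure_ascii=ensure_ascii)

lines = ['hello', '你好']
print(lines_to_json_text(lines))                      # ["hello", "\u4f60\u597d"]
print(lines_to_json_text(lines, ensure_ascii=False))  # ["hello", "你好"]
```

Both outputs decode to the same data; ensure_ascii=False only makes the file easier to read.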

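6. Extracting data from txt files

6.1. Extracting text that matches a pattern

Sometimes we need to pull text matching a specific pattern out of a txt file. Regular expressions (the re library) can do this. The following example extracts all text that matches a given pattern: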

import re

def extract_pattern_from_txt(pattern, txt_file):
    with open(txt_file, 'r', encoding='utf-8') as file:
        content = file.read()
    # findall returns all non-overlapping matches as a list
    return re.findall(pattern, content)

# Example call: extract all numbers
pattern = r'\d+'
matches = extract_pattern_from_txt(pattern, 'example.txt')
print("Matches found:", matches)
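
When a pattern contains capture groups, re.findall returns tuples of the groups rather than whole matches. As an illustrative sketch, the hypothetical helper extract_dates pulls apart dates written as YYYY-MM-DD; the date format is an assumption chosen for the example.

```python
import re

def extract_dates(text):
    """Return (year, month, day) tuples for dates written as YYYY-MM-DD."""
    # With several capture groups, re.findall returns a list of tuples
    return re.findall(r'\b(\d{4})-(\d{2})-(\d{2})\b', text)

print(extract_dates('Released 2024-08-17, patched 2024-09-02.'))
# [('2024', '08', '17'), ('2024', '09', '02')]
```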

6.2. Extracting email addresses or URLs

A similar approach extracts email addresses or URLs:

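import re

def extract_emails_and_urls(txt_file):
    # Simple patterns for illustration; they do not cover every valid address
    # or URL, and the URL pattern also matches trailing punctuation such as '.' or ','
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    url_pattern = r'https?://(?:[a-zA-Z0-9]|[$-_@.&+]|[!*(),]|%[0-9a-fA-F]{2})+'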
    
    with open(txt_file, 'r', encoding='utf-8') as file:
        content = file.read()
        
    emails = re.findall(email_pattern, content)
    urls = re.findall(url_pattern, content)
    
    return emails, urls

# Example call
emails, urls = extract_emails_and_urls('example.txt')
print("Emails found:", emails)
print("URLs found:", urls)
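
The same address or URL often appears several times in a file, and re.findall returns every occurrence. A minimal sketch for removing duplicates while keeping the original order follows; unique_in_order is a hypothetical helper name.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys keep insertion order (Python 3.7+), so the order is preserved
    return list(dict.fromkeys(items))

print(unique_in_order(['a@x.com', 'b@y.org', 'a@x.com']))  # ['a@x.com', 'b@y.org']
```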

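7. Counting word frequency in a txt file

7.1. Counting word occurrences

We can count how often each word appears in a txt file and sort the results by frequency. Words are split on whitespace, so letter case and any punctuation attached to a word are counted as-is. The following example shows how: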

from collections import Counter

def count_word_frequency(txt_file):
    with open(txt_file, 'r', encoding='utf-8') as file:
        words = file.read().split()
        word_freq = Counter(words)
    return word_freq

# Example call
word_freq = count_word_frequency('example.txt')
for word, freq in word_freq.most_common():
    print(f'{word}: {freq}')
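
To count The and the. as the same word, the text can be normalized first. This is a sketch that assumes English-like text: it lowercases everything and keeps only runs of ASCII letters, digits, and apostrophes, so it is not suitable for languages without spaces between words. The name count_words_normalized is hypothetical.

```python
import re
from collections import Counter

def count_words_normalized(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    # Lowercase first, then keep runs of letters, digits, and apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)

freq = count_words_normalized('The cat saw the dog. The dog ran!')
print(freq.most_common(2))  # [('the', 3), ('dog', 2)]
```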

7.2. Generating a word cloud

To visualize the word frequency distribution, we can generate a word cloud:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_word_cloud(txt_file):
    with open(txt_file, 'r', encoding='utf-8') as file:
        text = file.read()
        
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

# Example call
generate_word_cloud('example.txt')

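8. Generating txt reports automatically

8.1. Generating a report from a template

A txt template can be used to generate a report by filling in dynamic data. Placeholders in the template are written as {{ key }}, with one space on each side of the key. The following example shows how: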

def generate_report_from_template(template_file, output_file, data):
    with open(template_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        content = infile.read()
        for key, value in data.items():
            content = content.replace(f'{{{{ {key} }}}}', str(value))
        outfile.write(content)

# Example call
data = {
    'name': 'Alice',
    'date': '2024-08-17',
    'summary': 'This is a summary of the report.'
}
generate_report_from_template('template.txt', 'report.txt', data)
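
The standard library offers a comparable mechanism in string.Template, which uses $name placeholders instead of {{ name }}. The sketch below is an alternative, not the method used above; render_template is a hypothetical helper that works on strings.

```python
from string import Template

def render_template(template_text, data):
    """Fill $name placeholders in template_text with values from data."""
    # safe_substitute leaves unknown placeholders untouched
    # instead of raising KeyError like substitute() does
    return Template(template_text).safe_substitute(data)

text = 'Report for $name on $date: $missing'
print(render_template(text, {'name': 'Alice', 'date': '2024-08-17'}))
# Report for Alice on 2024-08-17: $missing
```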

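8.2. Generating report content dynamically

Sometimes the report content itself has to be built dynamically. The following example shows how: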

def generate_dynamic_report(output_file, sections):
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for section in sections:
            outfile.write(f'# {section["title"]}\n\n')
            outfile.write(f'{section["content"]}\n\n')

# Example call
sections = [
    {
        "title": "Introduction",
        "content": "This is the introduction section of the report."
    },
    {
        "title": "Data Analysis",
        "content": "This section contains the analysis of the data."
    }
]
generate_dynamic_report('dynamic_report.txt', sections)