Comparing Files Across Multiple Directories

To check whether duplicate files exist across multiple storage directories, you can compare MD5 checksums using the following steps:

1. Extract file paths

  • First collect the paths of all files in your directory tree; the find command lists them recursively (a single-invocation variant is sketched below):
    find /traixxxnent/zpxxxxx -type f > file_list.txt
    find /yfxxxmanent/zpxxxx -type f >> file_list.txt
    # Repeat for every directory in your list, appending to the same file_list.txt
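
  • Since find accepts multiple starting points, the whole list can also be built in one invocation (a minimal sketch; the paths are placeholders for your actual directories):
    find /dir/a /dir/b /dir/c -type f > file_list.txt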
    

2. Compute MD5 values

  • Use the following loop to compute the MD5 of every file in the list (IFS= and read -r keep paths with leading spaces or backslashes intact; a faster batched variant is sketched below):
    while IFS= read -r filepath; do
        md5sum "$filepath" >> md5_checksums.txt
    done < file_list.txt
    
  • The resulting md5_checksums.txt contains one line per file: the MD5 value followed by the path.
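
  • Invoking md5sum once per file is slow on large trees; GNU xargs can batch many paths per invocation (a sketch assuming GNU xargs: -a reads arguments from the file, -d '\n' splits on newlines only):
    xargs -a file_list.txt -d '\n' md5sum > md5_checksums.txt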

3. Find duplicate files

  • Use the following command to find MD5 values that appear more than once (i.e., duplicate content):

    awk '{print $1}' md5_checksums.txt | sort | uniq -d > duplicate_md5.txt
    
  • Then use grep to pull out the paths of those duplicated files (-F matches the hashes as fixed strings, -f reads them from duplicate_md5.txt):

    grep -Ff duplicate_md5.txt md5_checksums.txt > duplicate_files.txt
    
  • duplicate_files.txt now lists the path of every file whose content appears more than once. An equivalent one-pipeline variant is sketched below.
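
  • With GNU coreutils, the last two steps can be collapsed into one pipeline (a sketch; -w32 makes uniq compare only the 32-character hash prefix, and --all-repeated=separate prints each duplicate group separated by blank lines):
    sort md5_checksums.txt | uniq -w32 --all-repeated=separate > duplicate_files.txt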

4. Output results

  • If you need a formatted report of the duplicate files or paths, post-process duplicate_files.txt to suit your needs; see the sketch below.
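
  • For example, sorting on the hash field puts every duplicate set on adjacent lines (a minimal formatting sketch):
    sort -k1,1 duplicate_files.txt > duplicate_report.txt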

Equivalently, as a Python script:

[root@rg2-bgw-prometheus001 mmwei3]# cat test_file_md5_compare_v2.py
import hashlib
import os

# List of directories to scan
directories = [
    "/train33/asrmlg/permanent/zpxie2",
    "/yfw-b3-mix01/asrmlg/permanent/zpxie2",
    # Add more directories as needed
]

# Map MD5 value -> list of file paths with that content
md5_dict = {}

# Compute the MD5 of a file by streaming it in 4 KB chunks
def calculate_md5(file_path):
    hasher = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Walk every directory and group files by MD5
for directory in directories:
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                file_md5 = calculate_md5(file_path)
            except OSError:
                continue  # Skip files that cannot be read
            if file_md5 in md5_dict:
                md5_dict[file_md5].append(file_path)
            else:
                md5_dict[file_md5] = [file_path]

# Print duplicate files
print("Duplicate file paths:")
for md5, paths in md5_dict.items():
    if len(paths) > 1:
        print(f"MD5: {md5}")
        for path in paths:
            print(f"  {path}")

5. For a huge number of small files, we can switch to a cheaper comparison strategy, for example discarding size mismatches up front: only files of identical size can be identical, so only those need hashing.

[root@rg2-bgw-prometheus001 mmwei3]# cat test_file_md5_compare.py
import hashlib
import os
from collections import defaultdict

# List of directories to check
directories = [
    "/train33/asrmlg/permanent/zpxie2",
    "/yfw-b3-mix01/asrmlg/permanent/zpxie2",
    # Add more directories as needed
]

size_dict = defaultdict(list)
md5_dict = {}

# Group files by size
for directory in directories:
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(file_path)
                size_dict[file_size].append(file_path)
            except OSError:
                continue  # Skip files that cannot be accessed

# Compute the MD5 hash for files with the same size
def calculate_md5(file_path):
    hasher = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

for size, files in size_dict.items():
    if len(files) > 1:  # Only calculate MD5 for files with the same size
        for file_path in files:
            file_md5 = calculate_md5(file_path)
            if file_md5 in md5_dict:
                md5_dict[file_md5].append(file_path)
            else:
                md5_dict[file_md5] = [file_path]

# Print duplicate files
print("Duplicate file paths:")
for md5, paths in md5_dict.items():
    if len(paths) > 1:
        print(f"MD5: {md5}")
        for path in paths:
            print(f"  {path}")


6. Alternatively, use concurrent.futures to hash with a thread pool; MD5 hashing here is dominated by disk I/O, so multiple threads help despite Python's GIL.

[root@rg2-bgw-prometheus001 mmwei3]# cat test_file_md5_compare_v3.py
from concurrent.futures import ThreadPoolExecutor, as_completed
import hashlib
import os
from collections import defaultdict

# List of directories to check
directories = [
    "/train33/asrmlg/permanent/zpxie2",
    "/yfw-b3-mix01/asrmlg/permanent/zpxie2",
    # Add more directories as needed
]

size_dict = defaultdict(list)
md5_dict = {}

# Group files by size
for directory in directories:
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(file_path)
                size_dict[file_size].append(file_path)
            except OSError:
                continue  # Skip files that cannot be accessed

# Function to calculate the MD5 hash of a file
def calculate_md5(file_path):
    hasher = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return file_path, hasher.hexdigest()

# Use multithreading to compute MD5 hashes, collecting results as they finish
with ThreadPoolExecutor(max_workers=8) as executor:
    for size, files in size_dict.items():
        if len(files) > 1:  # Only process files with the same size
            futures = [executor.submit(calculate_md5, file) for file in files]
            for future in as_completed(futures):
                try:
                    file_path, file_md5 = future.result()
                except OSError:
                    continue  # Skip files that could not be read
                md5_dict.setdefault(file_md5, []).append(file_path)

# Print duplicate files
print("Duplicate file paths:")
for md5, paths in md5_dict.items():
    if len(paths) > 1:
        print(f"MD5: {md5}")
        for path in paths:
            print(f"  {path}")

7. Alternatively, consider fdupes, a ready-made tool that applies the same size-then-checksum strategy:

https://github.com/adrianlopezroche/fdupes.git
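
A typical invocation scans all directories in one run (a usage sketch; /dir/a and /dir/b are placeholders, and -r recurses into subdirectories):

fdupes -r /dir/a /dir/b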
