Interview live-coding question: streaming log-file reads + dynamic sharding to crush TOPK statistics! The interviewer was stunned!

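Problem

The interviewer hands you a 100G log file (assume it sits in the current directory) and asks you to extract every IP address from it and report the TOPK most frequent IPs. Time is short: you need an efficient Java solution quickly, one that makes full use of a multi-core CPU, with code that is brief yet covers the core logic.

Challenges

Memory limit: loading the whole 100G file into memory would crash the machine.
Processing speed: reading line by line on a single thread is too slow.
Multi-thread synchronization: threads counting IPs concurrently must avoid data races.
Sharding strategy: how to split the large file efficiently and distribute it across threads.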

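Approach

1. Streaming reads (the core optimization)

• Principle: read the file line by line with BufferedReader so that only a small buffer is in memory at any time, which avoids running out of memory.
• Implementation: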

try (BufferedReader reader = new BufferedReader(new FileReader("log.log"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // Process each log line
    }
}
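
As a slightly fuller sketch of the same idea, the helper below streams a file with an explicit charset and a larger buffer, and hands each line to a callback. The class name `StreamingReadSketch`, the 1 MiB buffer, and the UTF-8 encoding are assumptions made for illustration, not part of the original solution.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

final class StreamingReadSketch {
    // Assumed buffer size (1 Mi chars); BufferedReader's default is 8192 chars
    private static final int BUFFER_CHARS = 1 << 20;

    /** Feeds every line of the file to the handler and returns the number of lines read. */
    static long forEachLine(Path file, Consumer<String> handler) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8),
                BUFFER_CHARS)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Memory use stays bounded by the buffer plus the current line
                handler.accept(line);
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "log.log");
        long n = forEachLine(file, line -> { /* extract IPs here */ });
        System.out.println("lines read: " + n);
    }
}
```

Passing the per-line work in as a callback keeps the reading loop separate from whatever the line handler does, which makes it easy to swap in counting, batching, or sharding later.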

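2. Dynamic sharding strategy (sharding by IP hash)

• Principle: as each line is read, extract the IP, compute its hash, and assign it to a shard based on that hash.
• Implementation: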

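// Shard by IP hash: THREAD_COUNT shards, one per worker thread
// Math.floorMod never returns a negative index, unlike Math.abs(hash) % n when hash is Integer.MIN_VALUE
int partition = Math.floorMod(ip.hashCode(), THREAD_COUNT);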

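• Advantages:
• Distinct IPs spread roughly evenly across shards, so no thread is systematically handed far more keys than the others (a single very hot IP still lands entirely on one shard).
• No prior knowledge of the IP distribution is needed; the sharding adapts to the log content as it is read.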
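
One way to picture hash sharding end to end is the minimal sketch below: every IP is routed to one single-threaded worker, so each worker can count into a plain HashMap without locks, and the shard maps hold disjoint keys. The class `HashShardSketch` and its method names are hypothetical, and dispatching one task per IP is only for brevity; a real run over 100G would hand each shard batches instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class HashShardSketch {
    /** Maps an IP to a shard index in [0, shardCount); floorMod stays non-negative for any hash. */
    static int shardOf(String ip, int shardCount) {
        return Math.floorMod(ip.hashCode(), shardCount);
    }

    /** Counts IPs with one single-threaded worker per shard; each worker owns a plain HashMap. */
    static Map<String, Long> count(Iterable<String> ips, int shardCount) throws InterruptedException {
        List<ExecutorService> workers = new ArrayList<>();
        List<Map<String, Long>> shardCounts = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            workers.add(Executors.newSingleThreadExecutor());
            shardCounts.add(new HashMap<>());
        }
        for (String ip : ips) {
            int shard = shardOf(ip, shardCount);
            Map<String, Long> local = shardCounts.get(shard);
            // Only this shard's single worker thread ever touches `local`, so no lock is needed
            workers.get(shard).execute(() -> local.merge(ip, 1L, Long::sum));
        }
        Map<String, Long> total = new HashMap<>();
        for (int i = 0; i < shardCount; i++) {
            workers.get(i).shutdown();
            workers.get(i).awaitTermination(1, TimeUnit.HOURS);
            // Shards hold disjoint keys, so putAll never overwrites a count
            total.putAll(shardCounts.get(i));
        }
        return total;
    }
}
```

Because the same IP always hashes to the same shard, the final merge is a plain union of the shard maps rather than a sum across them.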

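3. Multi-threaded concurrent processing

• Thread pool: use ExecutorService to manage threads and avoid repeatedly creating and destroying them.
• Concurrent counting: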

// Thread-safe counter
private static final ConcurrentHashMap<String, LongAdder> ipCounter = new ConcurrentHashMap<>();
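
To show how the pool and the counter fit together, here is a minimal sketch in which several pool threads increment one shared `ConcurrentHashMap<String, LongAdder>`. The class `ConcurrentCountSketch` and the idea of passing pre-extracted IPs in as batches are assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

final class ConcurrentCountSketch {
    /** Counts the IPs of every batch on a fixed pool, all threads sharing one counter map. */
    static ConcurrentHashMap<String, LongAdder> count(List<List<String>> batches, int threads)
            throws InterruptedException {
        ConcurrentHashMap<String, LongAdder> counter = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (List<String> batch : batches) {
                pool.execute(() -> {
                    for (String ip : batch) {
                        // computeIfAbsent creates the adder atomically; increment() is safe across threads
                        counter.computeIfAbsent(ip, key -> new LongAdder()).increment();
                    }
                });
            }
        } finally {
            // Without shutdown() the non-daemon pool threads keep the JVM alive
            pool.shutdown();
        }
        pool.awaitTermination(1, TimeUnit.HOURS);
        return counter;
    }
}
```

LongAdder tends to outperform AtomicLong when many threads update the same key, at the cost of a slightly more expensive `sum()` when the totals are finally read.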

4. Merging shard results

• Merge logic: merge each thread's counts into a global Map, then compute the TOPK with a priority queue.
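
The merge-then-TOPK step can be sketched roughly as follows. The class `TopKMergeSketch` is hypothetical, and the bounded min-heap is one common way to do the selection: it keeps at most k entries, so the cost is O(n log k) instead of heapifying every distinct IP.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopKMergeSketch {
    /** Sums per-thread partial counts into one map. */
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((ip, count) -> total.merge(ip, count, Long::sum));
        }
        return total;
    }

    /** Returns the k most frequent entries, most frequent first, using a min-heap of at most k entries. */
    static List<Map.Entry<String, Long>> topK(Map<String, Long> counts, int k) {
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest count seen so far
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }
}
```

With k fixed at something like 10, the heap stays tiny no matter how many distinct IPs the log contains; only the merged map grows with the data.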

Complete implementation


/**
 * @author : lighting
 */
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;
import java.util.regex.*;

public class LogTopKIP {
    private static final int THREAD_COUNT = Runtime.getRuntime().availableProcessors();
    private static final Pattern IP_PATTERN = Pattern.compile(
            "\\b(?:(?:25[0-5]|2[0-4]\\d|1?\\d\\d?)\\.){3}(?:25[0-5]|2[0-4]\\d|1?\\d\\d?)\\b"
    );

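    public static void main(String[] args) throws Exception {
        // Lines per task; one task per line would flood the pool with tiny tasks and futures
        final int batchSize = 10_000;
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        // Cap the in-flight batches so the reader cannot outrun the workers and fill the heap
        Semaphore inFlight = new Semaphore(THREAD_COUNT * 2);
        // Shared thread-safe counter that every worker merges into
        ConcurrentHashMap<String, LongAdder> counter = new ConcurrentHashMap<>();

        // 1. Stream the log file and submit it to the pool batch by batch
        try (BufferedReader reader = new BufferedReader(new FileReader("log.log"))) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    submitBatch(executor, inFlight, batch, counter);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                submitBatch(executor, inFlight, batch, counter);
            }
        } finally {
            executor.shutdown(); // otherwise the pool threads keep the JVM alive
        }

        // 2. Wait for all batches; by then every partial count has been merged into counter
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);

        // 3. Compute the TOPK with a min-heap that never holds more than k entries
        int k = 10;
        PriorityQueue<Map.Entry<String, Long>> topKHeap = new PriorityQueue<>(
                (a, b) -> Long.compare(a.getValue(), b.getValue())
        );
        counter.forEach((ip, adder) -> {
            topKHeap.offer(new AbstractMap.SimpleEntry<>(ip, adder.sum()));
            if (topKHeap.size() > k) {
                topKHeap.poll(); // evict the smallest count seen so far
            }
        });

        // Print the results, most frequent first
        List<Map.Entry<String, Long>> top = new ArrayList<>(topKHeap);
        top.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        System.out.println("TOP " + k + " IP statistics:");
        for (int i = 0; i < top.size(); i++) {
            System.out.println((i + 1) + ". " + top.get(i).getKey() + ": " + top.get(i).getValue());
        }
    }

    // Count one batch on a worker thread and merge its per-line counts into the shared counter
    private static void submitBatch(ExecutorService executor, Semaphore inFlight, List<String> batch,
                                    ConcurrentHashMap<String, LongAdder> counter) throws InterruptedException {
        inFlight.acquire(); // blocks the reader while too many batches are pending
        executor.execute(() -> {
            try {
                for (String l : batch) {
                    processLine(l).forEach((ip, count) ->
                            counter.computeIfAbsent(ip, key -> new LongAdder()).add(count));
                }
            } finally {
                inFlight.release();
            }
        });
    }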

    // Process a single log line: extract its IPs and count them
    private static Map<String, Long> processLine(String line) {
        Map<String, Long> partialCount = new HashMap<>();
        Matcher matcher = IP_PATTERN.matcher(line);
        while (matcher.find()) {
            String ip = matcher.group();
            partialCount.put(ip, partialCount.getOrDefault(ip, 0L) + 1);
        }
        return partialCount;
    }
}