MIL-NCE HowTo100M: PyTorch GPU distributed training code (Python)

22 files in total: 13 .py, 5 .csv, 2 .txt. Uploaded 2023-04-05 13:09:05; 22.02 MB ZIP.

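In the title, "MIL-NCE" names the training objective and "HowTo100M" names the dataset; the archive contains PyTorch code for training on HowTo100M with distributed GPUs. MIL (Multiple Instance Learning) is a weakly supervised setting in which a label is attached to a bag of instances rather than to each individual instance. NCE (Noise-Contrastive Estimation) fits a model by learning to discriminate observed data from noise samples, which makes it practical for large-scale pretraining. "100M" refers to the dataset, which contains more than 100M video clips, and not to a generic data scale.

The folder `MIL-NCE_HowTo100M-master` appears to be a copy of a GitHub repository. The layout is flat: there are no `scripts/`, `models/` or `utils/` directories, and no `requirements.txt`, `config.py` or `Dockerfile`. Judging by the file names and the README, the contents are:

1. `README.md`: project description, requirements, data download, training and evaluation instructions, and result tables.
2. `main_distributed.py`: the training entry point used in the README command.
3. `args.py`: presumably the command-line argument definitions (learning rate, batch size, and so on).
4. `s3dg.py`: presumably the S3D video model definition.
5. `loss.py`: presumably the MIL-NCE loss.
6. `video_loader.py`, `hmdb_loader.py`, `youcook_loader.py`, `msrvtt_loader.py`: data loaders for HowTo100M training and for the three evaluation datasets.
7. `eval_hmdb.py`, `eval_youcook.py`, `eval_msrvtt.py`: evaluation scripts.
8. `metrics.py`, `utils.py`: evaluation metrics and helper functions.
9. `csv/`: video lists for HowTo100M and the evaluation sets.
10. `checkpoint/`, `log/`: output directories for model checkpoints and training logs.

To use the code, follow the README:

1. Install the dependencies listed in its Requirements section.
2. Download the word2vec files, the HowTo100M captions and the videos.
3. Launch `main_distributed.py` with flags that match your hardware and data paths.
4. Monitor the logs and adjust the parameters as needed.
5. Evaluate the trained checkpoint with the `eval_*.py` scripts, or reuse it in your own task.

For beginners, the key to distributed training is understanding data parallelism, model parallelism and their combination, and knowing how to use `torch.nn.parallel.DistributedDataParallel` in PyTorch. Large datasets also call for efficient data loading, such as multi-process loading and per-process sampling.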
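The data-parallel sampling idea mentioned above can be illustrated without any GPU. The sketch below is a pure-Python approximation of how a distributed sampler such as PyTorch's `torch.utils.data.distributed.DistributedSampler` splits one epoch across processes. The function name `shard_indices` and its parameters are hypothetical, the code is not taken from this repository, and it assumes the dataset has at least as many samples as there are processes.

```python
import math
import random


def shard_indices(dataset_size, num_replicas, rank, epoch=0, seed=0, shuffle=True):
    """Return the sample indices that one rank would process in one epoch."""
    indices = list(range(dataset_size))
    if shuffle:
        # Every rank seeds the shuffle identically, so all ranks agree on the order.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the first indices so the total divides evenly across ranks.
    per_rank = math.ceil(dataset_size / num_replicas)
    total = per_rank * num_replicas
    indices += indices[: total - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total:num_replicas]


if __name__ == "__main__":
    world_size = 4
    shards = [shard_indices(10, world_size, r, epoch=1) for r in range(world_size)]
    for r, shard in enumerate(shards):
        print(f"rank {r}: {shard}")
```

Because every process derives the same permutation from `seed + epoch`, the shards are disjoint apart from the padding, and changing `epoch` reshuffles the assignment. Each process then wraps its own model replica in `DistributedDataParallel`, which averages gradients across processes.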


Archive contents (22 files):

- MIL-NCE_HowTo100M-master
  - utils.py (1KB)
  - hmdb_loader.py (3KB)
  - checkpoint
    - readme.txt (47B)
  - main_distributed.py (11KB)
  - eval_youcook.py (3KB)
  - loss.py (616B)
  - video_loader.py (6KB)
  - youcook_loader.py (5KB)
  - metrics.py (798B)
  - LICENSE (11KB)
  - msrvtt_loader.py (4KB)
  - eval_hmdb.py (4KB)
  - eval_msrvtt.py (3KB)
  - csv
    - msrvtt_test.csv (73KB)
    - validation_youcook.csv (223KB)
    - howto100m_videos.csv (18.97MB)
    - hmdb51.csv (482KB)
    - all_videos.csv (18.97MB)
  - args.py (6KB)
  - s3dg.py (13KB)
  - README.md (14KB)
  - log
    - readme.txt (47B)

# MIL-NCE End-to-End HowTo100M training on GPUs with PyTorch

This repo contains open source PyTorch distributed training code for the CVPR'20 paper: [End-to-End Learning of Visual Representations from Uncurated Instructional Videos](https://arxiv.org/abs/1912.06430) [1].

The original codebase from [1] relies on Google and DeepMind's internal tools as well as on TPU v3 accelerators, which makes it challenging to release as is.

Instead, this repository provides an implementation of [1] using PyTorch / ffmpeg with a reasonable number of GPUs.

The training code was run on the French public AI cluster [Jean-Zay](https://www.idris.fr/eng/) (see Acknowledgements below). It was specifically designed to run on a SLURM-based cluster for multi-node distributed training, but can easily be modified for any other cluster management system.

This open source PyTorch implementation of the paper has several minor differences:

- It uses a cosine learning rate decay instead of the stepwise decay described in [1].
- Batch normalization statistics are not shared across GPUs and nodes, as this operation is much slower on GPUs than on TPUs.
- It uses a slightly different spatio-temporal resolution for the input training video clips.

If you only plan to reuse the pretrained S3D model from [1], please visit the following [repo](https://github.com/antoine77340/S3D_HowTo100M) instead.

If you use this code, we would appreciate it if you could cite both [1] and [2].

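The cosine learning rate decay listed among the differences above can be sketched as follows. This is a minimal illustration of the standard cosine schedule, with an optional linear warmup added as an assumption; the function `cosine_lr` and its parameters are hypothetical, and the exact schedule used in `main_distributed.py` may differ.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step` with optional linear warmup, then cosine decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_lr(step, total_steps=1000, base_lr=0.001), 6))
```

Unlike a stepwise schedule, the rate changes smoothly at every step, so there are no milestone epochs to tune.
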
## Requirements

- Python 3
- PyTorch (>= 1.0)
- [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) with ffmpeg
- pandas
- tqdm (for evaluation only)
- scikit-learn (for linear evaluation only)
- SLURM cluster management for distributed training (can easily be modified for other cluster management systems)

## Getting preliminary word2vec and HowTo100M data for training

You will first need to download the word2vec matrix and dictionary and unzip the file in the same directory as the code, in the data folder.

```sh
wget https://www.rocq.inria.fr/cluster-willow/amiech/word2vec.zip
unzip word2vec.zip
```

Then you will need to download the preprocessed HowTo100M captions and unzip the csv files somewhere. To download the csv files:

```sh
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/howto100m_captions.zip
```

Finally, the preprocessed HowTo100M videos (12 TB in total) can be downloaded by filling in this Google form: https://forms.gle/hztrfnFQUJWBtiki8.
We advise you to save the HowTo100M videos as well as the caption files on fast-access disks such as SSDs to significantly speed up training.

## Training MIL-NCE on HowTo100M

The following command trains the S3D model on a single node, uses all of its GPUs, and checkpoints the model in the directory checkpoint/milnce; the logs are written to the *log* directory.
Do not forget to replace *path_to_howto_csv* with the path to the HowTo100M csv caption files and *path_to_howto_videos* with the path where the HowTo100M videos were downloaded.

```sh
python main_distributed.py --n_display=1 \
       --multiprocessing-distributed --batch_size=256 \
       --num_thread_reader=40 --cudnn_benchmark=1 --pin_memory \
       --checkpoint_dir=milnce --num_candidates=4 --resume --lr=0.001 \
       --warmup_steps=10000 --epochs=300 --caption_root=path_to_howto_csv --video_path=path_to_howto_videos
```

You can also monitor the evaluation on the zero-shot YouCook2 retrieval task by specifying the argument *--evaluate* as well as *--eval_video_root=path_to_youcook2_video*.

- Note 1: The batch size value set here is the total batch size for the node, so if the batch size is 256 and there are 4 GPUs, the batch size for each GPU will be 64.
- Note 2: An epoch here is equivalent to processing 1238911 video-text training samples, which is the number of different videos in HowTo100M. It is not the same as the number of different training video clips, as there are more than 100M clips.
- Note 3: The training code should be distributed over multiple tasks with SLURM for distributed training.

## Linear evaluation of representation on HMDB-51 action recognition dataset

### Download the videos

Please download the original HMDB videos at: [https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#Downloads](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#Downloads)

### Evaluation

This evaluation extracts video features on HMDB using a model checkpoint (replace *the_path_to_the_checkpoint* with its path) and trains a linear SVM on the features using scikit-learn.

To run the evaluation:

```sh
python eval_hmdb.py --batch_size=16 --num_thread_reader=20 --num_windows_test=10 \
       --eval_video_root=path_to_the_videos --pretrain_cnn_path=the_path_to_the_checkpoint
```

You will need to replace *path_to_the_videos* with the root folder containing the downloaded HMDB videos.

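The linear evaluation protocol described above (frozen features, then a linear SVM) can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the features here are synthetic Gaussian blobs rather than real HMDB video features, the helper names `linear_eval` and `make_fake_features` are hypothetical, and the actual settings in `eval_hmdb.py` (regularization strength, data splits, feature dimension) are not shown in this document.

```python
import numpy as np
from sklearn.svm import LinearSVC


def linear_eval(train_x, train_y, test_x, test_y, C=1.0):
    """Fit a linear SVM on frozen features and return top-1 accuracy in percent."""
    clf = LinearSVC(C=C, max_iter=10000)
    clf.fit(train_x, train_y)
    return 100.0 * clf.score(test_x, test_y)


def make_fake_features(rng, n_per_class, centers, noise=0.5):
    """Synthetic stand-in for extracted video features (one Gaussian blob per class)."""
    xs = [c + noise * rng.standard_normal((n_per_class, c.shape[0])) for c in centers]
    ys = [np.full(n_per_class, k) for k in range(len(centers))]
    return np.concatenate(xs), np.concatenate(ys)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = 5.0 * np.eye(3, 16)  # 3 classes, 16-dimensional features
    train_x, train_y = make_fake_features(rng, 50, centers)
    test_x, test_y = make_fake_features(rng, 20, centers)
    print(f"top-1 accuracy: {linear_eval(train_x, train_y, test_x, test_y):.1f}%")
```

Only the linear classifier is trained in this protocol; the video network stays fixed, so the accuracy reflects the quality of the pretrained representation.
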
This table compares the results of the linear evaluation of the representation on HMDB-51 between the original implementation and this one, for various numbers of training epochs and training batch sizes.

<table><tbody>
<tr>
<th valign="bottom">Implementation</th>
<th valign="bottom">Epochs</th>
<th valign="bottom">Total batch size</th>
<th valign="bottom">Accelerator</th>
<th valign="bottom">CPU cores</th>
<th valign="bottom">Training input size</th>
<th valign="bottom">Top-1 accuracy</th>
</tr>
<tr><td align="left">TPU Tensorflow [1]</td>
<td align="center">2645</td>
<td align="center">8192</td>
<td align="center">64 x Cloud TPU v3 128GB</td>
<td align="center">64 x N.A.</td>
<td align="center">32 frames at 200x200</td>
<td align="center">53.1</td>
</tr>
<tr><td align="left">TPU Tensorflow [1]</td>
<td align="center">206</td>
<td align="center">512</td>
<td align="center">4 x Cloud TPU v3 128GB</td>
<td align="center">4 x N.A.</td>
<td align="center">16 frames at 200x200</td>
<td align="center">54.2</td>
</tr>
<tr><td align="left">This implementation</td>
<td align="center">150</td>
<td align="center">512</td>
<td align="center">2 x 4 Tesla V100 32GB</td>
<td align="center">2 x 40</td>
<td align="center">16 frames at 224x224</td>
<td align="center">53.5</td>
</tr>
<tr><td align="left">This implementation</td>
<td align="center">300</td>
<td align="center">1024</td>
<td align="center">4 x 4 Tesla V100 32GB</td>
<td align="center">4 x 40</td>
<td align="center">16 frames at 224x224</td>
<td align="center">54.3</td>
</tr>
</tbody></table>

## Zero-Shot evaluation retrieval on MSR-VTT and YouCook2

### Download the MSR-VTT videos

A mirror link of the MSR-VTT testing videos can be found at: [https://www.mediafire.com/folder/h14iarbs62e7p/shared](https://www.mediafire.com/folder/h14iarbs62e7p/shared)

### Download the YouCook2 videos

A mirror link of our downloaded YouCook2 testing videos can be found at:
[https://www.rocq.inria.fr/cluster-willow/amiech/Youcook2_val.zip](https://www.rocq.inria.fr/cluster-willow/amiech/Youcook2_val.zip)

### Evaluation on MSR-VTT

This evaluation runs zero-shot text-video retrieval on the subset of the MSR-VTT test set used in [1].
You will need to replace *the_path_to_the_checkpoint* with your model checkpoint path and *path_to_the_msrvtt_videos* with the root folder containing the downloaded MSR-VTT testing videos.

```sh
python eval_msrvtt.py --batch_size=16 --num_thread_reader=20 --num_windows_test=10 \
       --eval_video_root=path_to_the_msrvtt_videos --pretrain_cnn_path=the_path_to_the_checkpoint
```

This table compares the retrieval results between the original implementation and this one, for various numbers of training epochs and training batch sizes.

<table><tbody>   <th valign="bottom">Implementation</th> <th valign="bottom">Epochs</th> <th valign="bottom">Total batch size</th> <th valign="bottom">Accelerator</th> <th valign="bottom">CPU cores</th> <t
