## Introduction
This directory contains our PyTorch implementation of Transformer-XL. Note that our state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster, and our PyTorch codebase currently does not support distributed training. Here we provide two sets of hyperparameters and scripts:
- `*large.sh` scripts are for the SoTA setting with large models, which might not be directly runnable on a local GPU machine.
- `*base.sh` scripts are for the base models, which can be run on a few GPUs.
In our preliminary experiments, the PyTorch implementation produces results similar to those of the TF codebase under the same settings.
## Prerequisite
- PyTorch 0.4: `conda install pytorch torchvision -c pytorch`
## Data Preparation
`bash getdata.sh`
## Training and Evaluation
#### Replicate the "bpc = 1.06" result on `enwik8` with a 12-layer Transformer-XL
- Make sure the machine has **4 GPUs**, each with **at least 11 GB of memory**
- Training
`bash run_enwik8_base.sh train --work_dir PATH_TO_WORK_DIR`
- Evaluation
`bash run_enwik8_base.sh eval --work_dir PATH_TO_WORK_DIR`
#### Replicate the "PPL = 24.03" result on `wikitext-103` with Transformer-XL
- Make sure the machine has **4 GPUs**, each with **at least 11 GB of memory**
- Training
`bash run_wt103_base.sh train --work_dir PATH_TO_WORK_DIR`
- Evaluation
`bash run_wt103_base.sh eval --work_dir PATH_TO_WORK_DIR`
#### Other options:
- `--batch_chunk`: this option allows one to trade speed for memory. For `batch_chunk > 1`, the program splits each training batch into `batch_chunk` sub-batches and performs the forward and backward passes on each sub-batch sequentially, with the gradients accumulated and divided by `batch_chunk`. Hence, memory usage decreases roughly in proportion to `batch_chunk`, while computation time increases correspondingly (see the first sketch after this list).
- `--div_val`: when using adaptive softmax and adaptive embedding, the embedding dimension is divided by `div_val` from bin $i$ to bin $i+1$. This saves both GPU memory and parameters (see the second sketch after this list).
- `--fp16` and `--dynamic-loss-scale`: run in pseudo-fp16 mode (fp16 storage, fp32 math) with dynamic loss scaling.
- Note: to explore the `--fp16` option, please make sure the `apex` package is installed (https://github.com/NVIDIA/apex/).
- To see performance without the recurrence mechanism, simply use `mem_len=0` in all your scripts.
- To see performance of a standard Transformer without relative positional encodings or recurrence mechanisms, use `attn_type=2` and `mem_len=0`.
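
Below is a minimal, self-contained sketch of the gradient-accumulation behaviour described for `--batch_chunk`. The toy linear model, loss, and variable names are placeholders chosen for illustration; this is not the actual training loop in `train.py`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)                      # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(32, 16)                    # one full training batch
target = torch.randn(32, 1)
batch_chunk = 4                               # e.g. --batch_chunk 4

optimizer.zero_grad()
for data_i, target_i in zip(data.chunk(batch_chunk, 0),
                            target.chunk(batch_chunk, 0)):
    loss = nn.functional.mse_loss(model(data_i), target_i)
    (loss / batch_chunk).backward()           # gradients accumulate in .grad
optimizer.step()                              # single update per full batch
```

Because only one sub-batch is active in the forward/backward pass at a time, peak activation memory shrinks with `batch_chunk`, while the number of forward/backward passes per update grows by the same factor.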
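Similarly, a short sketch of how `--div_val` shrinks the embedding dimension across frequency bins. The values of `d_embed`, `div_val`, and the cutoffs below are assumed for illustration and are not taken from the provided scripts.

```python
d_embed = 512                     # embedding dim of the first (most frequent) bin
div_val = 4                       # e.g. --div_val 4
cutoffs = [20000, 40000, 200000]  # assumed vocabulary cutoffs -> 4 frequency bins

for i in range(len(cutoffs) + 1):
    d_i = d_embed // (div_val ** i)   # dimension divided by div_val per bin
    print(f"bin {i}: embedding dim = {d_i}")
# bin 0: 512, bin 1: 128, bin 2: 32, bin 3: 8
```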
#### Other datasets:
- `Text8` character-level language modeling: check out `run_text8_base.sh`
- `lm1b` word-level language modeling: check out `run_lm1b_base.sh`