Falcon
A key design decision in training LLMs is whether publicly available data is sufficient to train a powerful model. The preceding example of Dolly 2.0 showed how a relatively small, high-quality dataset of 15K prompts could be used to fine-tune a 12B-parameter model to approximate the performance of the 175B-parameter ChatGPT. There is also evidence, however, that web data alone, given sufficient normalization and filtering but no manual curation, can produce high-quality models. The open-source Falcon family of models illustrates this idea [17, 18]. The Falcon models make heavy use of the RefinedWeb dataset, a collection of filtered, deduplicated, and normalized publicly available web data, along with select curated additions.
We can load the Falcon-7B model using the following commands:
import transformers
import torch
from transformers import AutoTokenizer

model = "tiiuae/falcon-7b"

# Load the tokenizer and build a text-generation pipeline for Falcon-7B
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # load weights in half precision to reduce memory use
    device_map="auto",           # place the model on available devices automatically
)