With the release of LLaMA3, I tried asking it questions in Korean, and wow, the answers are surprisingly plausible. Of course, only about 5% of the training data is multilingual, so we can't expect performance on par with English, but the future looks bright.
Unfortunately, the context length is short, but there are techniques to work around this, and the follow-up models are slated to support much longer contexts, so that is something to look forward to.
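For reference, one common workaround for RoPE-based models is position-interpolation-style scaling. Below is a minimal sketch using the `rope_scaling` option in Hugging Face transformers; the model id is the gated official repo, the scaling factor is arbitrary, and whether this works well for LLaMA3 without additional fine-tuning is not something the release confirms.

```python
# Illustrative only: linear RoPE position interpolation to stretch the
# context window of a RoPE-based model. Further fine-tuning is typically
# needed for good quality at the extended length.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"  # gated repo; requires accepting the license

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 2.0}  # ~8K -> ~16K positions (assumed factor)

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```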
In my personal opinion, $$$ wins. Two clusters of 24K H100 GPUs were used for training, the base training data grew tremendously, and human annotators and human evaluators were deployed en masse. And if you look at the contributor list, it is a huge project with an enormous number of engineers.
Alongside the various technical improvements, the importance of data keeps becoming more prominent. Compute matters too, but hardware prices will gradually fall, so within the next 5 to 10 years anyone with data should be able to build 7B-class models with ease.
Several models have emerged so far, but there seems to be no study that has drawn out the full capability a given parameter count can hold. Various techniques keep extracting more of a model's hidden ability, but what is really needed is data. More data, and more high-quality data, is needed not only for training but also for evaluation. Pipelines for collecting data keep evolving as well. Eventually this collected data will naturally feed back into further model research, and the gap between those who hold data and those who do not will widen.
Personally, I greatly appreciate Meta releasing the model, but it would be very nice if they at least released the human-annotated data used for tuning and evaluation.
✦ Two model sizes unveiled, 8B and 70B (base and instruction-tuned versions)
✦ A tiktoken-based tokenizer is used
↳ Vocab size expanded to 128K
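A quick way to see the expanded vocabulary is to load the tokenizer; the sketch below goes through Hugging Face transformers rather than tiktoken directly, which is my own choice for illustration, and the repo is gated behind the license agreement.

```python
# Minimal sketch (assumes access to the gated meta-llama repo on Hugging Face).
# The point is just to confirm the ~128K-entry vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(tok.vocab_size)                    # ~128,000 entries (vs. 32,000 in LLaMA2)
print(tok.encode("안녕하세요, LLaMA3!"))  # non-English text maps to comparatively few tokens
```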
✦ 15T tokens for pre-training
↳ They found that model performance keeps improving as more data is fed in
↳ 7x more data than LLaMA2, including 4x more code data
↳ A pipeline to ensure data quality was also developed: heuristic filtering → NSFW filtering → semantic deduplication → text-quality classifiers (see the sketch after this list). The classifier in the final stage was trained on a synthetic dataset generated by LLaMA2.
↳ Data in 30 languages accounts for about 5% of the total
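Meta has not released the actual pipeline code, but the staged filter described above can be pictured roughly as follows. Every filter function here is a hypothetical stand-in, not Meta's implementation.

```python
# Schematic sketch of a staged pre-training data filter, loosely following the
# order described above: heuristics -> NSFW -> semantic dedup -> quality model.
# All filters are simplified placeholders.
from typing import Iterable


def heuristic_filter(doc: str) -> bool:
    """Cheap rules first: drop very short documents."""
    return len(doc.split()) >= 50


def nsfw_filter(doc: str) -> bool:
    """Placeholder for an NSFW classifier (here, a trivial keyword list)."""
    banned = {"nsfw_example_term"}
    return not any(word in doc.lower() for word in banned)


def semantic_dedup(docs: Iterable[str]) -> list[str]:
    """Placeholder for embedding-based near-duplicate removal (exact dedup here)."""
    seen, unique = set(), []
    for doc in docs:
        key = doc.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def quality_classifier(doc: str) -> bool:
    """Placeholder for a learned text-quality model; Meta reportedly trained
    theirs on LLaMA2-generated synthetic labels."""
    return True  # accept everything in this sketch


def build_corpus(raw_docs: Iterable[str]) -> list[str]:
    docs = [d for d in raw_docs if heuristic_filter(d) and nsfw_filter(d)]
    docs = semantic_dedup(docs)
    return [d for d in docs if quality_classifier(d)]
```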
✦ Tuning
↳ According to Hugging Face staff, 10M human-annotated examples were used
↳ SFT, PPO, and DPO were used as tuning stages (a minimal DPO sketch follows this list)
↳ Torchtune, officially announced a short while ago, was used
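The release does not detail how DPO was applied, but for reference, here is a minimal sketch of the standard DPO objective in plain PyTorch. The function name, argument names, and the beta value are my own choices for illustration; how the per-response log-probabilities are computed is assumed to happen elsewhere.

```python
# Minimal sketch of the DPO objective: push the policy toward the chosen
# response and away from the rejected one, relative to a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_logps_chosen: torch.Tensor,    # summed log-probs of chosen responses under the policy
    policy_logps_rejected: torch.Tensor,  # summed log-probs of rejected responses under the policy
    ref_logps_chosen: torch.Tensor,       # same, under the frozen reference model
    ref_logps_rejected: torch.Tensor,
    beta: float = 0.1,                    # assumed temperature; tuned in practice
) -> torch.Tensor:
    chosen_ratio = policy_logps_chosen - ref_logps_chosen
    rejected_ratio = policy_logps_rejected - ref_logps_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```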
✦ Context length of around 8K
✦ The LLaMA3 license is similar to LLaMA2's
↳ Free to use as long as you have no more than 700M monthly active users; beyond that, you need to contact Meta for a separate license
✦ Both the 8B and 70B models are only the early release of the LLaMA3 series
↳ Additional models are scheduled to be released over time
↳ These include a 400B version and a multimodal-capable version, with a focus on base performance improvements, multilingual translation/conversation, and much longer context lengths
✦ Performance
↳ On academic benchmarks it outperforms almost all existing open LLMs. If MoE versions and the like appear from the community, I'm very curious to see what happens
↳ Beyond the academic benchmarks, 1,800 curated prompts across 12 categories were also evaluated by human raters. Based on the 70B model, its responses were rated better in most cases than those of Claude Sonnet, Mistral Medium, GPT-3.5, and LLaMA2
↳ Some benchmark numbers have also been released for the 400B model, which is still in training, and results approaching or exceeding GPT-4 level are expected