Okay, this isn't a review or anything, just a personal take on the Llama 3 model by Meta and how I feel about the direction of the AI industry concerning LLMs.

Don't take it too seriously.

TL;DR: Llama 3 is truly impressive.

I encourage you to try any model or their fine-tuned variations offered by the open-source community; you'll likely feel the same.

Experiment with a 70B model through HuggingChat; it's incredibly fast and approaches GPT-4 performance, sometimes even surpassing it. On Groq (not Grok), it's even faster, thanks to their inference-optimized LPUs, although you'll need to deal with the inference queue since it's free. I would prefer they switch to a subscription model for more predictability and sustainability.

The parameter count of GPT-4 is rumored to be around 1.7 trillion, while the 70B model has, as the name says, 70 billion parameters. The 70B model is a small fraction of GPT-4's size, yet it often outperforms it. GPT-4 is also suspected to be a mixture-of-experts model, combining eight 220B expert models. Do the math and you'll get 1.76 trillion.

The 70B model often delivers responses that surpass those of the latest GPT-4. I won't provide an example here to avoid any misconceptions, but imagine posing a hypothetical scenario of all-out war between two countries. The depth and breadth of the response from the 70B model are astounding, resembling a thoroughly researched and articulated article. In contrast, GPT-4 tends to produce a more generic and lackluster response.

However, there's a caveat for non-English speakers: in terms of multilingual and multimodal capabilities, GPT-4 remains the superior choice.

Claude Opus closely approaches GPT-4's performance in Korean, Japanese, and Chinese, the languages I'm comfortable with, at least in reading. When transcribing text from Chinese images, Claude sometimes outperforms GPT-4. Even in Korean, Claude is quite impressive, but I really don't like its overly rigid wall of ethics. I've already mentioned this in my previous posts, but unfortunately, Claude is only getting worse in this politically correct bullshit department.

Llama also has a history of political correctness being a nuisance, but it's not as bad just yet. More importantly, it's open-source, so you can always tweak it to your liking, and the open-source community will always release less restrictive or entirely uncensored versions. So, it's not a big deal.

Note that what Meta released are just starting points: foundation, or base, models. The real magic happens when the community gets their hands on them and starts fine-tuning them to their liking. You can improve them in any way you want with your own data and fine-tuning techniques.

They sometimes call these models checkpoints for a good reason. They are just that: checkpoints, like save points in a video game. They are not the final destination. You can always extend their training, or even undo parts of it, targeting specific layers or parameters, to get a better model. With open-source models, the possibilities are endless. With proprietary models, you're stuck with what they provide you.
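To make that concrete, here's a minimal sketch of what community fine-tuning typically looks like, assuming the Hugging Face transformers and peft libraries and access to the gated meta-llama/Meta-Llama-3-8B checkpoint; the LoRA hyperparameters are illustrative, not a recipe.

```python
# Minimal LoRA fine-tuning sketch (assumes: transformers and peft installed, and
# access granted to the gated meta-llama/Meta-Llama-3-8B repo on Hugging Face;
# hyperparameters are illustrative only, not a recommended recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"              # one of the "checkpoints" Meta released
tokenizer = AutoTokenizer.from_pretrained(base)  # needed later to prepare your own data
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base weights and trains small adapter matrices on top,
# here targeting only the attention projection layers.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a tiny fraction of the 8B parameters will train

# From here you'd feed your own dataset to a Trainer (or trl's SFTTrainer) as usual.
```

LoRA is only one of many approaches, but it's the one behind most community fine-tunes precisely because the base checkpoint stays frozen and only tiny adapters need training.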


I am hopeful that the Llama 3 400B model will significantly improve in multilingual aspects. I eagerly await its release.

On the multimodal front, I'm pretty confident it's just a matter of time before Llama 3 catches up with GPT-4. Meta already has the technology and resources to do so. They provide lightning-fast generative image models akin to SDXL Turbo, SDXL Lightning, Stable Cascade, etc. Essentially, they can create images even as you type your prompt; that's how fast they are.

Adding multimodal capabilities to Llama 3 won't be much of an issue. As I've laid out in my AI repo on GitHub, AI models don't see or hear as humans do. They analyze the given data, pixel by pixel in the case of images, using figurative magnifying glasses called filters, or kernels. I won't go into the technical details here, but you get the idea.
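To make the magnifying-glass analogy concrete, here's a toy sketch, assuming PyTorch; the image is random noise and the kernel values are a textbook edge detector, nothing taken from an actual model.

```python
# Toy illustration of a convolutional "magnifying glass" (assumes PyTorch;
# the image is random noise and the kernel is a classic edge-detection filter).
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 64, 64)               # batch of 1, single channel, 64x64 "pixels"
kernel = torch.tensor([[-1., -1., -1.],
                       [-1.,  8., -1.],
                       [-1., -1., -1.]]).reshape(1, 1, 3, 3)

features = F.conv2d(image, kernel, padding=1)  # slide the kernel over every pixel neighborhood
print(features.shape)                          # torch.Size([1, 1, 64, 64]): one feature map
```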

Take image segmentation, for instance: Meta already provides an open-source solution called Segment Anything, and it has been widely used by the open-source community, including myself. Virtually any Stable Diffusion UI integrates this solution in the form of an extension or plugin.
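Here's roughly how little glue code that takes with Meta's segment-anything package; the checkpoint path, image file, and click coordinates below are placeholders you'd swap for your own.

```python
# Rough sketch of prompting Segment Anything with a single click
# (assumes: segment-anything, opencv-python, and numpy installed, plus a downloaded
# SAM checkpoint; the checkpoint path, image, and coordinates are placeholders).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")     # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# One foreground click (x, y) is enough of a prompt to get candidate masks back.
masks, scores, _ = predictor.predict(point_coords=np.array([[320, 240]]),
                                     point_labels=np.array([1]))
print(masks.shape, scores)  # a handful of binary masks, each with a confidence score
```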

All that remains after peeling away these outer wrappers of multimodality is the LLM's ability to understand the context of the given medium, whether it's text, image, audio, or video, and generate a coherent response. All media just boil down to data, and LLMs are designed to process data as embeddings and tokens.
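A tiny sketch of that last point, assuming the transformers library and access to the gated Llama 3 tokenizer (any tokenizer would do): text goes in, integer token IDs come out, and that's all the model ever sees.

```python
# Everything the model "reads" is just token IDs and their embedding vectors
# (assumes transformers is installed and you have access to the gated Llama 3 repo).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ids = tok.encode("Media boil down to data.")
print(ids)                              # a short list of integers
print(tok.convert_ids_to_tokens(ids))   # the sub-word pieces those integers stand for
```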

In a nutshell, Meta has all the pieces of the multimodal puzzle; it's just a matter of time.

In terms of context length, Llama 3 lags far behind its proprietary competitors with a meager 8K tokens. However, this isn't a significant issue for most users. Plus, don't be naive enough to believe that longer context lengths are always better. Effectiveness often deteriorates as the context length increases. Meta's developers aren't foolish; they didn't limit the context length without reason. Given the current limitations of the autoregressive transformer architecture, longer context lengths might bring more problems than benefits.
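If the 8K window worries you in practice, a few lines are enough to check whether a prompt will fit before you send it; this sketch assumes the transformers tokenizer and treats 8,192 tokens as the advertised limit.

```python
# Quick check of a prompt against Llama 3's 8K context window
# (assumes transformers and access to the gated Llama 3 tokenizer; 8,192 is the advertised limit).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
CONTEXT_WINDOW = 8192

def fits(prompt: str, reserve_for_reply: int = 1024) -> bool:
    """True if the prompt still leaves room for a reply within the window."""
    return len(tok.encode(prompt)) + reserve_for_reply <= CONTEXT_WINDOW

print(fits("Summarize this article for me: ..."))  # True for anything short
```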

Don't think OpenAI's Sora, its video-generation model, actually understands the world's physics. It's still based on the same transformer architecture as GPT-4, just with a few tweaks here and there, known as diffusion transformers or vision transformers. Essentially, they are glorified transformers combined with a diffusion architecture. Still limited, still not grasping physics or the world. They just feed video data to the models as tokens, in this case called patches.
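The "patches" part is less mysterious than it sounds. Here's a toy sketch, assuming NumPy; the frame is random noise and the 16x16 patch size is just illustrative, showing how a frame gets chopped up and flattened into a sequence of visual tokens.

```python
# Turning a frame into "patches", the visual analogue of tokens
# (assumes NumPy; the frame is random noise and the 16x16 patch size is illustrative).
import numpy as np

frame = np.random.rand(224, 224, 3)                 # one RGB video frame
P = 16                                              # patch size
patches = frame.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)            # 196 patches, each flattened to a vector
print(patches.shape)                                # (196, 768): a "sentence" of visual tokens
```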

See the commonality? All these models, be it multimodal or unimodal, are based on the same transformer architecture and they have to deal with tokens.

The autoregressive part, which you might find tricky to understand, simply means the model learns its way through the data, one token at a time, predicting the next token based on the previous ones. It’s similar to a detective solving a case, one clue at a time. The detective can’t see the whole picture at once, just as the model can’t see all the data at once. This is a limitation of the current transformer architecture.
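If the detective analogy feels too fuzzy, here's the same idea laid bare as a greedy decoding loop; I'm using the small, ungated gpt2 checkpoint via transformers purely for brevity, but the loop is identical for Llama 3 or any other decoder-only model.

```python
# Autoregression laid bare: predict one token, append it, repeat
# (assumes transformers and PyTorch; gpt2 is used only because it's small and ungated).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok.encode("The detective examined the", return_tensors="pt")
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits                            # scores over the whole vocabulary
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedily pick the next token
    ids = torch.cat([ids, next_id], dim=-1)                   # append the new "clue" and repeat
print(tok.decode(ids[0]))
```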

At this pace, reaching Artificial General Intelligence (AGI) anytime soon seems unlikely. We need a new architecture beyond transformers, something that can understand the world as we do, not just process tokens. Something that truly learns its way, not just predicts the next token autoregressively.

Indeed, both proprietary and open-source models are still far from achieving AGI. I won't even blink an eye unless some groundbreaking architecture emerges. Another transformer? Nah... I'm not interested.

One obvious advantage of open-source models is that they always leave room for improvement in the hands of the community, unlike proprietary models.

In the realm of open-source AI, Meta has emerged as a formidable competitor, embodying what OpenAI could have and should have been.

There are tons of open-source LLMs, right? A significant number of them are just fine-tuned versions of the base models released by Meta: the Llama iterations. Zoom out and let that picture sink in. Training an LLM from scratch is a daunting task and prohibitive for most due to costs and whatnot.

If it's open-source versus proprietary, that essentially means Meta versus OpenAI, at least for the moment. That OpenAI part? That can change. But the Meta part? That's a different story. Again, pause and think why. Meta has provided not only models but also platform-independent frameworks like PyTorch for quite a while. Big names in this department? TensorFlow, JAX, and PyTorch. In other words, Google versus Meta. You might be tempted to bring up Apple. But honestly, they don't count. Not yet. I wouldn't even mention them in the same breath as Google and Meta, to be fair.

I'm doubtful the honeymoon between Microsoft and OpenAI will last long. I know about Microsoft; let's just leave it at that. They'll part ways sooner rather than later. Their contract even specifies how long the relationship can continue in the most ideal scenario. In other words, if the current trend continues, OpenAI can't compete with Meta in the long run.

I'll just give you a bit of a techno-history lesson: if it comes down to open-source versus proprietary, nine times out of ten, open-source wins in the long run. Big time.

One more personal take on the future of deploying AI models: local versus cloud deployment.

You might believe that locally run models offer greater security and privacy, yet the cloud provides unmatched speed, efficiency, and scalability. Despite owning the latest high-end devices capable of running these models without any issues, I've begun to favor the cloud for its convenience and speed.

It's not limited to LLMs. It's the same for all AI models, whether text, image, audio, or video. Ever wondered why Stable Diffusion models are so small? It's because they are designed, essentially, to run locally. They can't compete with Midjourney, a massive model, in terms of quality, for that very reason. Midjourney is designed to run in the cloud, hence there is little limit on its model size.

The 400B Llama 3 model I'm excited about? Its sheer size makes it impossible to run locally unless you're willing to shell out an arm and a leg for a workstation equipped with as much GPU memory as necessary. 400B in full precision is no joke. Here's some quick math: just multiply the parameters, 400 billion in this case, by 4 bytes, since every parameter is stored in 32-bit floating-point precision. That's 1.6 terabytes of GPU memory. You can't run that locally, can you?

Half precision? Multiply by 2 bytes. That's 800 gigabytes. Still not feasible for most. 8-bit precision, or 1 byte per parameter? That's 400 gigabytes. And you still need to consider overhead and other factors for wiggle room in your system.
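The same back-of-the-envelope math in a few lines, so you can plug in other sizes and precisions yourself; it counts weights only and ignores activations, the KV cache, and framework overhead.

```python
# Back-of-the-envelope GPU memory for model weights alone
# (ignores activations, KV cache, and framework overhead; 400B is the announced size).
params = 400e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: {gb:,.0f} GB")   # fp32: 1,600 GB   fp16/bf16: 800 GB   int8: 400 GB
```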

We might consider compressing the model using techniques like quantization and pruning, but that’s a whole different story.

It might sound more like a joke than a real-world scenario, but running these huge models, even if possible, would risk blowing out your neighborhood's entire electrical grid. Not to mention the heat generated. You'd need a cooling system that could rival those used in data centers. Would you go to the length of remodeling your home to accommodate the power requirements and cooling systems?

Yeah, models are increasingly optimized for memory efficiency, but that's not the end of the story. Have you ever been satisfied with your bandwidth? As a lifelong audiophile and videophile, I never have. The bigger and more complex the models they churn out, the more you end up yearning for them. It's a never-ending cycle.

I'd just go with the cloud instead of trading off precision for memory. It's not worth it in the long run.

I, myself, a known nerd and geek who gobbles up the latest tech and top-end devices like candy, am seriously considering ceasing upgrades on my local devices and transitioning entirely to cloud-based solutions for AI models.

You never go offline for long these days anyway. Cloud services are becoming more reliable and affordable, and the convenience they offer is unparalleled: no hassle at all if you know what I mean.

It's important to note that if a major AI player lacks the infrastructure to offer cloud services to their customers, this could pose a significant problem in the future. Customers will simply switch to competitors who do.

One enduring truth remains: in the tech industry, customer loyalty is never guaranteed.

You might wonder about the ultimate question then: Edge vs. Cloud.

That's another story I won't tell. Not because I lack an opinion, but because it's a topic that could easily lead to a war of words.

I won't go there. Not today.

But I'm pretty sure you'd know where I stand on this matter, if you have followed me this far.