
Blending Science and Art: The Multimodal Craft of an Exceptional Gen AI Paper

5 Apr

Technical writing is one of my favorite reads. It’s clear, succinct, and informative. DeepMind’s technical paper on Gemini 1.5 epitomizes all I love about technical writing. Read the abstract for a glimpse into the groundbreaking advancements encapsulated in Gemini 1.5 Pro; it’s a masterclass in effective communication. We learn how to deliver maximum insight with minimum word count.

In just 177 words, my DeepMind colleagues articulate:

  • #ProductCapabilities: “a highly compute-efficient multimodal* mixture-of-experts model** capable of recalling and reasoning*** over fine-grained information from millions of tokens of context”
  • #UniqueSellingPoint: “near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k)”
  • #UseCases: “surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content”
Gemini 1.5 Pro is able to translate from English to Kalamang with similar quality to a human

The science of writing succinctly

In a few words, the paper abstract communicates the model’s superior performance, its leap over existing benchmarks, and its novel capabilities. It sparks curiosity about the future potential of large language models, a true testament to powerful, precise, impactful technical communication.

How did the Gemini 1.5 paper authors achieve this mastery? By following the guiding principles of Brevity (saying more with fewer words) that my friend and thought partner D G McCullough and I recently summarized as “Trust, Commit, Distill”:

  • #Trust means believing in the power of your message without over-explaining or adding unnecessary details. Trust empowers the communicator to eliminate redundancy, focusing on what’s truly important. The Gemini 1.5 paper authors trust their curious readers to look up terms that may be new to them. On first read, I had to look up “mixture-of-experts”, but the context from my two years of working with data and AI allowed me to “guesstimate” its meaning before getting the proper definition.
  • #Commit refers to sticking with the essentials of your message, understanding your message’s objective, and resisting tangents or unnecessary explanations that dilute the message’s impact. (Which requires discipline!)
  • #Distill requires boiling your message down to full potency. Like distilling a liquid to increase its purity, we must strip away the non-essential until the most impactful, clear, and concise message remains. Every word and idea then serves a purpose–and voila! Your message becomes clearer and more memorable.

The art of replacing hundreds of words with a single image

The saying “A picture is worth a thousand words” truly shines in technical communication. A single, well-chosen image can articulate complex ideas with more efficiency and impact than verbose descriptions. The Gemini 1.5 paper’s authors skillfully weave in visual elements, showcasing a deep grasp of conciseness. This approach not only makes complex AI and machine learning concepts approachable and captivating but also boosts understanding and enhances the reader’s journey. It demonstrates that when it comes to sharing the latest scientific breakthroughs, visual simplicity can convey a wealth of information.

With the entire text of Les Misérables in the prompt (1382 pages), Gemini 1.5 Pro locates a famous scene from a hand-drawn sketch

Simplify complexity with brevity

In our fast-paced world, where attention is a rare commodity and people often skim rather than read, the skill of conveying ideas briefly and through visual storytelling stands out as a significant edge. Simplifying complex concepts into engaging visuals and concise explanations can mean the difference between being noticed and being ignored.

Richard Feynman, the celebrated physicist, Nobel laureate, and cherished educator, famously stated, “If you can’t explain it simply, you don’t understand it well enough.”


Feynman’s approach isn’t just about words; it involves using visuals and images to make intricate ideas more approachable. After all, the deepest insights are usually the easiest to understand when we apply brevity to break down complexity.

DeepMind’s Gemini 1.5 technical paper exemplifies this principle perfectly. It’s essential reading for anyone intrigued by general AI (especially with #GoogleCloud #NEXT24 on the horizon), and it’s an exemplary model for those dedicated to honing their communication skills.

#TechnicalWriting #Innovation #ArtificialIntelligence #LanguageModels #Brevity #BrevityRules #GoogleCloud #NEXT24 #DeepMind

Read the full abstract

“In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro’s long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.” https://storage.googleapis.com/deepmindmedia/gemini/gemini_v1_5_report.pdf

Define the key terms used in the abstract

* #Multimodality: Gemini is natively multimodal. Prior to Gemini, AI models were first trained on a single modality, such as text or images, and then the corresponding embeddings were concatenated. For example, the embedding of an image would be generated by an AI model trained on images, the embedding of the text describing the image would be generated by an AI model trained on text, and then the two embeddings would be concatenated to represent the image and its caption. Instead, the Gemini family of models was trained on content that is inherently multimodal such as text, images, videos, code, and audio. Imagine being able to ask a question about a picture, or generate a poem inspired by a song – that’s the power of Gemini.
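To make the contrast concrete, here is a minimal sketch of the older “late fusion” approach versus a natively multimodal pass. The encoder functions, dimensions, and fusion steps are hypothetical stand-ins of my own, not Gemini’s actual architecture:

```python
import numpy as np

# Illustrative stand-ins only: these encoders, dimensions, and fusion steps
# are hypothetical, not Gemini's actual architecture.

def text_encoder(text: str) -> np.ndarray:
    """Stand-in for a text-only model's embedding."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(512)

def image_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for an image-only model's embedding."""
    rng = np.random.default_rng(int(image.sum()) % 2**32)
    return rng.standard_normal(512)

# Pre-Gemini "late fusion": embed each modality separately, then concatenate.
caption = "A hand-drawn sketch of a barricade scene"
image = np.zeros((224, 224, 3))
late_fusion = np.concatenate([text_encoder(caption), image_encoder(image)])
print(late_fusion.shape)  # (1024,) -- two independent views stitched together

# Natively multimodal (conceptually): one model consumes an interleaved
# sequence of text tokens and image patches and attends across modalities:
#   joint = multimodal_model(tokenize(caption) + patchify(image))
```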

** #Mixture-of-Experts Model: At the core of Gemini’s groundbreaking capabilities lies its innovative mixture-of-experts model architecture. Unlike traditional neural networks that route all inputs through a uniform set of parameters, the mixture-of-experts model consists of numerous specialized sub-networks, each adept at handling different types of information or tasks—these are the “experts.” Upon receiving an input, a gating mechanism intelligently directs the input to the most relevant experts. This selective routing allows the model to leverage specific expertise for different aspects of the input, akin to consulting specialized departments within a larger organization for their unique insights. For Gemini, this means an unparalleled ability to process and integrate a vast array of multimodal data—whether it’s textual, visual, auditory, or code-based—by dynamically engaging the most suitable experts for each modality. The result is a model that not only excels in its depth and breadth of understanding but also in computational efficiency, as it can focus its processing power where it matters most, without overburdening the system with irrelevant data processing. This approach revolutionizes how AI models handle complex, multimodal inputs, enabling more nuanced interpretations and creative outputs than ever before.

A Mixture of Experts (MoE) layer embedded within a recurrent language model https://openreview.net/pdf?id=B1ckMDqlg
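For intuition about that gating mechanism, here is a minimal sketch of a mixture-of-experts layer with top-k routing. The expert count, dimensions, and single-linear-map “experts” are illustrative simplifications, not Gemini’s actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 64, 8, 2  # illustrative sizes

# Each "expert" here is a single linear map; real experts are feed-forward nets.
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(num_experts)]
gate_w = rng.standard_normal((d_model, num_experts)) / np.sqrt(d_model)

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their weighted outputs."""
    logits = x @ gate_w                          # (tokens, num_experts)
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        chosen = np.argsort(logits[i])[-top_k:]  # indices of the top-k experts
        weights = np.exp(logits[i][chosen])
        weights /= weights.sum()                 # softmax over chosen experts
        for w, e in zip(weights, chosen):
            out[i] += w * (token @ experts[e])   # only k experts do any work
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)                   # (4, 64)
```

The computational win is in the last comment: every token activates only k of the experts, so capacity grows with the number of experts while the per-token compute stays roughly constant.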

*** #Reasoning: Gemini goes beyond simple pattern recognition. It uses a technique called “uncertainty-routed chain-of-thought” to reason about and understand complex relationships within and across modalities. This enables it to answer open-ended questions, solve problems, and generate creative outputs that are not just factually accurate but also logically coherent.
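As I read it in the Gemini reports, the idea is to sample several chain-of-thought answers and keep the majority vote only when the model’s consensus is strong enough, otherwise falling back to a single greedy answer. A minimal sketch, assuming a hypothetical `generate` model call and an illustrative confidence threshold:

```python
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for a model call that returns a final answer string."""
    raise NotImplementedError  # replace with a real model call

def uncertainty_routed_cot(prompt: str, k: int = 8, threshold: float = 0.6) -> str:
    # Sample k chain-of-thought answers at a nonzero temperature.
    samples = [generate(prompt, temperature=0.7) for _ in range(k)]
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / k >= threshold:
        return answer                          # strong consensus: trust the vote
    return generate(prompt, temperature=0.0)   # uncertain: greedy decoding
```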

The 6 Layers of Generative AI Technology Stack

3 Apr

Did you know that the “T” in ChatGPT stands for Transformer, Google’s revolutionary architecture that brought the concept of “self-attention” to AI? And that Google pioneered custom silicon for deep learning workloads with TPUs? After combing through dozens of technical papers and posts, I summarized my learnings in one visual below.
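Since self-attention is the Transformer’s key trick, here is a minimal single-head sketch of the scaled dot-product operation; the shapes, and the absence of masking and multiple heads, are simplifications for illustration:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Every token attends to every other token in the sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product similarity
    return softmax(scores) @ v               # attention-weighted mix of values

rng = np.random.default_rng(0)
seq_len, d = 5, 16
x = rng.standard_normal((seq_len, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # (5, 16)
```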

All the recent AI talk brought back memories of the fall semester of 2003, when I signed up for a Neural Networks course 👩‍🎓. After several classes of advanced algebra and calculus, I was excited to see their practical applications in natural language processing and speech recognition use cases. Little did I know that in 2023, computers would not only be able to understand human speech almost perfectly, they would also gain a voice of their own, thanks to decision-making capabilities similar to our own.

I majored in Telecommunications Engineering and always found the Open Systems Interconnection Reference Model, more commonly known as the OSI model, extremely useful in visually depicting all the key layers of the networking tech stack. So I thought to myself: what if I built a similar reference model for AI? After all, at the core of AI lies a neural network. And I’ve successfully demystified a variety of tech stacks using the good ol’ OSI model before, from PaaS to SDN/NFV. Let me know what you think!

Thank you to Philip Moyer for the inspiration, and to Priyanka Vergadia and Neama Dadkhahnikoo for the technical review.

#artificialintelligence #deeplearning #machinelearning #selfattention #naturallanguageprocessing #googlecloud

And here’s an animated version of “The 6 Layers of Generative AI Technology Stack”. To me, it’s like watching a delicious multi-layer cake being assembled layer by layer, except instead of vanilla cake, lemon custard and cream-cheese frosting, our recipe calls for infrastructure, modeling and application layers as key ingredients. Who knew that a stack of AI layers could be so captivating?

Check out my LinkedIn post The 6 Layers of Generative AI Technology Stack

Easy Search… because tapping is superior to typing

5 Feb

Do you think that the Google search engine has not really innovated since its launch in 1997? Would you agree that the user experience of searching the Internet from a mobile device could be improved? The good news is, somebody already improved it!

In my first Tech Surprise, “Digital Sandwich”, I asked a question: how long did it take to research all the lunch options, and which device(s)/app(s) were used? The answer is below:

Device(s): ___________________ iPad
App(s): _____________________ EasySearch 4 iPad
Time used: ___________________ 30 seconds

25% of American smartphone owners go online mostly from their mobile devices. Some of them want to quickly find information on a specific topic; others are just looking for inspiration on what to do when they have some time to kill, for example when exploring an unfamiliar city.

iPhones and iPads are more versatile than the devices we used to play with in the past: laptops, music players or game consoles. They are light and equipped with large touchscreens. A third of smartphone owners prefer using them for Web browsing or e-mail even when they are near their computers. Over the past two years, iPhone users have spent 45 percent more time e-mailing on their smartphones and 15 percent less time e-mailing on their laptops.

The touchscreens of iPhones and iPads make tapping a more natural way of interacting with the Web than the typing we were accustomed to in the era of PCs and laptops. There is enough content out there, so the challenge lies more in synthesizing it quickly than in adding to it. That’s why tapping is superior to typing.

The recent financial results of Apple confirm that consumers expect to do “more” stuff with “less” of a device. Sales of “tapping” mobile devices soar (iPhones: 142 percent unit growth; iPads: 183 percent unit increase), while traditional “typing” and bulky gear grows far more slowly (Macs: 14 percent unit increase).

Easy Search

With Easy Search you take advantage of the touchscreen of your iPad or iPhone: you type once, search everywhere (see the sketch after this list):
– Images – how does it look?
– Wikipedia – what is it?
– Maps – where is it?
– Facebook – are my friends using it?
– Twitter – are my friends using it NOW?
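A minimal sketch of that fan-out pattern, with one query dispatched to several sources concurrently. The URL templates and the async structure are my own illustrative stand-ins, not Easy Search’s actual implementation:

```python
import asyncio

# Hypothetical search targets; a real app would use each service's own API.
SOURCES = {
    "Images":    "https://www.google.com/search?tbm=isch&q={q}",
    "Wikipedia": "https://en.wikipedia.org/wiki/Special:Search?search={q}",
    "Maps":      "https://www.google.com/maps/search/{q}",
    "Facebook":  "https://www.facebook.com/search/top?q={q}",
    "Twitter":   "https://twitter.com/search?q={q}",
}

async def search_one(name: str, template: str, query: str) -> tuple[str, str]:
    url = template.format(q=query.replace(" ", "+"))
    # A real app would fetch and render each result page here.
    return name, url

async def search_everywhere(query: str) -> dict[str, str]:
    # Type once: the same query fans out to every source at the same time.
    tasks = [search_one(name, tpl, query) for name, tpl in SOURCES.items()]
    return dict(await asyncio.gather(*tasks))

print(asyncio.run(search_everywhere("digital sandwich")))
```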

Doing more with less is a necessity in modern nomads’ lives. Constantly on the road, they rely on their mobile devices to quickly synthesize a lot of information and decide on where they are heading next. The nomadic lifestyle is powered by apps that help them navigate the digital and real worlds with more fun and less stress.