Internet News
Google Glass in an AI World
I often use surfing as a metaphor for new technology. Go too early and you don't catch the wave. Go too late and you don't catch it either. Similarly next generation hardware or software may be too early for its time. I found myself wondering if this was the case for Google Glass and AI.
For those who don't remember, Google Glass was an early augmented reality headset that despite early excitement was ultimately shuttered. I spent time with the developer version of Google Glass in 2013 and, while promising, didn't think it was ready. But the technical capabilities of the device were impressive especially for its time. Glass featured:
- a camera for taking photos and video
- a microphone for accepting voice commands
- a speaker for audio output only you could hear (bone conduction)
- a mini projector to display information and interface controls in the corner of your field of vision
- a trackpad for controlling the interface and voice commands
- a number of sensors for capturing and reacting to device movement, like head gestures
- WiFi and Bluetooth connectivity
What Google Glass didn't have is AI. That is, vision and language models that can parse and react to audio and video from the real world. As I illustrated in a look at early examples of multi-modal personal assistants: faced with a rat's nest of signs, you want to know if it's ok to park your car. A multi-modal assistant could take an image (live camera feed or still photo), a voice command (in natural language), and possibly some additional context (time, location, historical data) as input and assemble a response (or action) that considers all these factors.
Google Glass had a lot of the technical capabilities (except for processing power) to make this possible in a lightweight form factor. Maybe it just missed the AI wave.
iOS18 Photos: Tab Bar to Single Scroll View
The most significant user interface change from iOS 17 to iOS 18 is the new navigation in Apple's Photos app. The ubiquitous tab bar that's become the default navigation model in mobile apps is gone and in its place is one long scrolling page. So how does it work and why?
Most mobile applications have adopted a bottom bar for primary navigation controls. On Android it's called bottom navigation and on iOS, a tab bar, but the purpose is the same: make the top-level sections of an application visible and let people move between them.
And it works. Across multiple studies and experiments, companies found when critical parts of an application are made more visible, usage of them increases. For example, Facebook saw that not only did engagement go up when they moved from a “hamburger” menu to a bottom tab bar in their iOS app, but several other important metrics went up as well. Results like this made use of tab bars grow.
But in iOS 18, Apple removed the tab bar in their Photos app. Whereas the prior version had visible tabs for the top-level sections (Library, For You, Albums, Search), the redesign is just a single scroll view. The features previously found in each tab are now accessed by scrolling up and down vs. switching between tabs. One notable exception is Search which stays anchored at the top of the screen.
In addition to the persistent Search button, there's also a Select action and user profile image that opens a sheet with account settings. As you scroll up into your Photo library a persistent set of View controls appears at the bottom of the screen as well. The Close action scrolls you to the end of your Photo library and reveals a bit of the actions below making the location of features previously found in tabs more clear.
It's certainly a big change and given the effectiveness of tab bars, it's also a change that has people questioning why. I have no inside information on Apple's decision-making process here but based on what I've learned about how people use Google Photos, Yahoo! Photos, and Flickr, I can speculate.
- By far the dominant use of a Photo gallery is scrolling to find an image whether to share, view, or just browse.
- Very few people organize their photo libraries and those that do, do it rarely.
- People continue to have poor experiences with searching images, despite lots of improvements, so they default to browsing when trying to find photos.
- Most automatic curation features like those found in For You just get ignored.
All that together can easily get you to the design answer of "the app should just be a scrolling list of all your Photos". Of course there are trade-offs. The top-level sections and their features are much less visible, and thereby less obvious. The people who do make use of features like Albums and Memories now need to scroll to them vs. tapping once. But as iOS 18 rolls out to everyone in the Fall, we'll see if these trade-offs were worth it.
A Visual Approach to Help Pages
As the functionality and scope of Web sites and applications has grown over the years, so has the prevalence of Help pages. Nearly every feature has an explanatory article outlining how to use it and why. But most Help pages are walls of text making them hard to act on. So a few years ago, we tried something different.
First let's look at the status quo. This Help page from Amazon is both pretty typical and by those standards, pretty good. It's specific to one topic, brief, outlines steps clearly, and includes links to help people accomplish their intended task. Companies iterated to these kinds of Help pages because they mostly work and because they're less work.
Keeping Help text up to date and accurate is less labor-intensive than updating images or videos with the same information. But as the old saying goes, a picture is worth a lot of words and there's a reason many people turn to video tutorials to learn how to do things instead of reading about how to do them.
When building Polar several years ago, we wanted a more approachable and fun way of helping people learn how to use our product. And while you might say "the best Help pages are no Help pages, just make your app easy to use", not all Help pages are smearing over usability issues. Some introduce higher level concepts, others outline capabilities, and some serve as marketing for specific features.
So with those goals in mind, we iterated to a simple formula. Each concept or feature gets a Help page that has a title alongside 1-2 sentences and as many sections consisting of a title, 1-2 sentences, plus a graphic as needed.
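The formula above can be written down as a small data structure, which also makes it easy to check a page against it. This is only a sketch: the class names, the `check_page` helper, and the rough sentence counter are assumptions for illustration, not how Polar's Help pages were actually built.

```python
from dataclasses import dataclass, field


@dataclass
class HelpSection:
    title: str
    body: str         # 1-2 sentences
    image_path: str   # the graphic for this section
    image_alt: str    # alt text for people who can't see the graphic


@dataclass
class HelpPage:
    title: str
    intro: str        # 1-2 sentences
    sections: list[HelpSection] = field(default_factory=list)


def count_sentences(text: str) -> int:
    """Rough sentence count: split on ., ! and ? and ignore empty pieces."""
    for mark in "!?":
        text = text.replace(mark, ".")
    return len([part for part in text.split(".") if part.strip()])


def check_page(page: HelpPage, max_sentences: int = 2) -> list[str]:
    """Return a list of places where a page drifts from the formula."""
    problems = []
    if count_sentences(page.intro) > max_sentences:
        problems.append(f"intro of '{page.title}' is longer than {max_sentences} sentences")
    for section in page.sections:
        if count_sentences(section.body) > max_sentences:
            problems.append(f"section '{section.title}' is longer than {max_sentences} sentences")
        if not section.image_alt.strip():
            problems.append(f"section '{section.title}' is missing alt text")
    return problems
```

A check like this keeps the text short and flags graphics that would leave screen reader users without a description.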
This approach meant people primarily relied on images (or their alt tags if visually impaired) to figure out how to get things done. So we iterated a fair amount on the images to find the right balance of detail and abstraction. Make the UI too realistic and it becomes hard to focus on the relevant elements. Realistic UI images also need updating anytime the actual product UI changes. Conversely, make the image too simplistic and it doesn't provide enough detail for people to actually learn how to do things.
Of course, not all Help topics are well suited to an image but the process of trying to create one often triggers ideas on how to simplify the actual UI or concepts within a product. So it's worth the iteration.
But is a visual approach to Help pages able to scale? Assuming it works, can companies invest the time and effort needed to generate all these images and keep them up to date? Perhaps in a time of image generation AI models, it's increasingly possible through automated or supervised pipelines. Time will tell!
Intent-driven User Interfaces
Increasingly when I see designers defaulting to more UI controls and form elements in software interface designs, I encourage them to consider the implications of intent-driven instructions. Here's why...
For years I've used this image of Adobe Illustrator's user interface evolution to highlight the continuous march of "more features, more UI" that drives nearly every software company's releases. The end result for end users is more functions they don't know about and don't use. Not great.
So what's the alternative? Perhaps something like Christian Cantrell's Photoshop assistant demos. In this series of videos, Christian uses natural language instructions connected to Photoshop's APIs to do things like mask the subject of a series of photos, blur the background in images, create layers and more. All without needing to know how and without clicking a bunch of windows, icons, menus, and pointers (WIMP).
Intent-driven instructions to mask the subject of multiple images in Photoshop:
Intent-driven instructions to blur the backgrounds of multiple images in Photoshop:
Intent-driven instructions to create layers and objects in Photoshop:
While these kinds of interactions won't immediately replace conventional graphical user interface controls, it's pretty clear they enable a new way to control software with hundreds of features... just tell it what you want to do.
Distraction Control for the Web
Browsing the Web on your smartphone these days can feel like a gauntlet: accept this cookie consent, close this newsletter promo, avoid this app install banner. This morass of attention-seeking actions makes it hard to focus on content. Enter Apple's Distraction Control feature.
There are more than 7 billion active smartphones on the planet. This is the Web they are getting.
I won't get into how the Web became a minefield of pop-ups, banners, overlays, modals, and other forms of annoyance. For that you can take a look at my Mind the Gap presentation which goes into depth on why and what designers can do about it. But it's pretty clear the average mobile Web experience sucks.
And when things suck, people usually decide to do something about it. In this case, with iOS 18, Apple is giving average folks a chance to fight back with Distraction Control. When turned on, this new feature allows anyone to remove distracting elements on Web pages complete with a satisfying animation.
Newsletter pop-up? Boom, gone. Mobile app banner? Boom. Interstitial ad? Boom. Is it perfect? No. Elements might come back after you remove them if the page is reloaded. Accessing the control takes a few taps. But it's a way for people to fight back against Web clutter and we need more.
The Death of Lorem Ipsum
For years, designers have used Lorem Ipsum text as a placeholder in interface design layouts. But unless you're designing a pseudo-Latin text reader, using actual content provides a much more realistic picture of what a UI design needs to support. Today Large Language Models (LLMs) can provide designers with highly relevant content instantly so Lorem Ipsum can finally die.
It's long been argued (well at least by me in 2019) that using Lorem Ipsum text to mock up application interfaces fails to represent real content, often leading to usability issues and unrealistic designs that don't account for actual text lengths, line breaks, or content hierarchy in a final product. But Lorem Ipsum persisted as a design tool of choice because getting real content was hard.
To get very realistic content, designers would need access to where real content existed or pester engineers or domain experts to collect realistic content for them. It's not hard to see why some of these requests took a while or never got prioritized. And while some teams took the time to build tooling that enabled more realistic content in the design process, Lorem Ipsum was a much easier path for most.
Today, Large Language Models (LLMs) can not only generate sample content but also create highly specific and relevant content for just about any application you're designing. And given these tools are fast, widely available and free, there's no excuse to not use very realistic content in application designs. For example, if you're designing a food delivery app, a few prompts will give you real content, real quick.
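As a rough illustration of what such a prompt could look like, here's a small helper that assembles one for a given design element. The function name, parameters, and wording are all assumptions made for this sketch. It only builds the prompt text, which you would then paste into (or send to) whichever LLM you use.

```python
def build_content_prompt(app_type: str, element: str, count: int, max_chars: int) -> str:
    """Assemble a prompt asking an LLM for realistic placeholder content.

    Hypothetical helper: the wording below is just one reasonable phrasing.
    """
    return (
        f"You are helping a designer mock up a {app_type}. "
        f"Write {count} realistic examples of {element}. "
        f"Keep each one under {max_chars} characters, vary the lengths, "
        "and return one example per line with no numbering."
    )


# Example: restaurant names for a food delivery app's search results.
prompt = build_content_prompt("food delivery app", "restaurant names", 10, 40)
print(prompt)
```

Asking for varied lengths is the useful part: it surfaces the long names and awkward line breaks that Lorem Ipsum hides.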
So there's no excuses for Lorem Ipsum no more.
A Proliferation of Terms
When working through the early stages of a product design, it's common that labels for objects and actions emerge organically. No one is overly concerned about making these labels consistent (yet). But if this proliferation of terms doesn't get reined in early, both product design and strategy get harder.
Do we call it a library, a folder, a collection, a workspace, a section, a category, a topic? How about a document, page, file, entry, article, worksheet? And... what's the difference? While these kinds of decisions might not be front and center when working out designs for a product or feature, they can impact a lot.
For starters, having clear definitions for concepts helps keep teams on the same page. When engineering works on implementing a new object type, they're aligned with what design is thinking, which is what the sales team is pitching potential customers on. Bringing a product to life is hard enough, why complicate things by using different terms for similar things or vice versa?
Inconsistent terms are obviously also a comprehension issue for the people using our products. "Here it's called a Document, there it's called an Article. Are those the same?" Additionally, undefined terms often lead to miscellaneous bins in our user interfaces. "What's inside Explore?" When the definition of objects and actions isn't clear, what choice do we have but to drop them into vague sounding containers like Discover?
The more a product gets developed (especially by bigger teams) the more things can diverge because people's mental model of what terms mean can vary a lot. So it's really useful to proactively put together a list of the objects and actions that make up an application and draft some simple one-liner definitions for each. These lists almost always kick off useful high-level discussions within teams on what we're building and for who. Being forced to define things requires you to think them through: what is this feature doing and why?
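A list like this doesn't need special tooling, but keeping it in a structured form makes it possible to check interface copy against it. The sketch below is one hypothetical way to do that. The `Term` fields, the sample glossary entries, and the `find_stray_labels` helper are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str                     # the one label the team agrees to use
    kind: str                     # "object" or "action"
    definition: str               # a simple one-liner
    avoid: tuple[str, ...] = ()   # labels that should be replaced by `name`


# Sample entries only; a real list comes out of team discussion.
GLOSSARY = [
    Term("Collection", "object", "A named group of documents.", avoid=("Folder", "Library")),
    Term("Document", "object", "A single piece of content a person creates.", avoid=("Article", "Entry")),
]


def find_stray_labels(ui_strings: list[str], glossary: list[Term]) -> list[tuple[str, str, str]]:
    """Return (ui_string, stray_label, preferred_label) for each discouraged label found.

    Matching is on whole words only, so plurals like "Articles" are not caught.
    """
    hits = []
    for text in ui_strings:
        words = {word.strip(".,!?:;\"'").lower() for word in text.split()}
        for term in glossary:
            for label in term.avoid:
                if label.lower() in words:
                    hits.append((text, label, term.name))
    return hits


print(find_stray_labels(["Open Article", "New Document"], GLOSSARY))
```

Here "Open Article" gets flagged because the glossary says that object is called a Document.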
And of course, consistent labels also ease comprehension for users. Once people learn what something means, they'll be able to apply that knowledge elsewhere instead of having to contend with mystery meat navigation.
Ask LukeW: PDF Parsing with Vision Models
Over the years, I've given more than 300 presentations on design. Most of these have been accompanied by a slide deck to illustrate my points and guide the narrative. But making the content in these decks work well with the Ask Luke conversational interface on this site has been challenging. So now I'm trying a new approach with AI vision models.
To avoid application specific formats (Keynote, PowerPoint), I've long been making my presentation slides available for download as PDF documents. These files usually consist of 100+ pages and often don't include a lot of text, leaning instead on visuals and charts to communicate information. To illustrate, here are a few of these slides from my Mind the Gap talk.
In an earlier article on how we built the Ask Luke conversational interface, I outlined the issues with extracting useful information from these documents. I wanted the content in these PDFs to be available when answering people's design questions in addition to the blog articles, videos and audio interviews that we were already using.
But even when we got text extraction from PDFs working well, running the process on any given PDF document would create many content embeddings of poor quality (like the one below). These content chunks would then end up influencing the answers we generated in less than helpful ways.
To prevent these from clogging up our limited context (how much content we can work with to create an answer) with useless results, we set up processes to remove low quality content chunks. While that improved things, the content in these presentations was no longer accessible to people asking questions on Ask Luke.
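The article doesn't say how low quality chunks were identified, so the filter below is purely an assumed heuristic: it drops chunks that are very short or mostly numbers and symbols, which is what text extracted from chart-heavy slides tends to look like. The function names and thresholds are hypothetical.

```python
def looks_useful(chunk: str, min_words: int = 8, min_alpha_ratio: float = 0.6) -> bool:
    """Assumed heuristic: keep chunks with enough words, most of which contain letters."""
    words = chunk.split()
    if len(words) < min_words:
        return False
    alpha_words = sum(1 for word in words if any(ch.isalpha() for ch in word))
    return alpha_words / len(words) >= min_alpha_ratio


def filter_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks that would likely waste space in a limited context."""
    return [chunk for chunk in chunks if looks_useful(chunk)]


chunks = [
    "42 % 17 % 2019 2020 2021 3x 9 10",   # axis labels from a chart
    "Mind the Gap",                        # a title slide
    "Bottom navigation makes the top-level sections of an app visible to people.",
]
print(filter_chunks(chunks))
```

A filter like this protects the context window, but it also throws away every slide that communicates visually.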
So we tried a different approach. Instead of extracting text from each page of a PDF presentation, we ran it through an AI vision model to create a detailed description of the content on the page. In the example below, the previous text extraction method (on the left) gets the content from the slide. The new vision model approach (on the right) though, does a much better job creating useful content for answering questions.
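In outline, the vision model approach is a loop over page images. The sketch below assumes the PDF pages have already been rendered to images, and takes the model call as a `describe_page` function you supply. That function, the prompt wording, and the record fields are all placeholders, since the article doesn't name the model or API used.

```python
from typing import Callable


def describe_pdf_pages(
    page_images: list[bytes],
    describe_page: Callable[[bytes, str], str],
    deck_title: str,
) -> list[dict]:
    """Create one text record per page from a vision model's description.

    `describe_page(image, prompt)` is a stand-in for whatever vision model
    call you use; it is not a real library function.
    """
    prompt = (
        f"This is a slide from the presentation '{deck_title}'. "
        "Describe its content in detail, including any charts or diagrams, "
        "so the description can be used to answer design questions."
    )
    records = []
    for number, image in enumerate(page_images, start=1):
        description = describe_page(image, prompt).strip()
        if description:  # skip pages the model had nothing to say about
            records.append({"source": deck_title, "page": number, "text": description})
    return records
```

Each record can then be embedded and retrieved like any other content chunk, with the page number kept for citing the source slide.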
Here's another example illustrating the difference between the PDF text extraction method used before and the vision AI model currently in use. This time instead of a chart, we're generating a useful description of a diagram.
This change is now rolled out across all the PDFs the Ask Luke conversational interface can reference to answer design questions. Gone are useless content chunks and there's a lot more useful content immediately available.
Thanks to Yangguang Li for the dev help on this change.
Ask LukeW: Text Generation Differences
As the number of highly capable large language models (LLMs) released continues to quickly increase, I added the ability to test new models when they become available in the Ask Luke conversational interface on this site.
For context, there are a number of places in the Ask Luke pipeline that make use of AI models to transform, clean, embed, retrieve, generate content and more. I put together a short video that explains how this pipeline is constructed and why if you're interested.
Specifically for the content generation step, once the right content is found, ranked, and assembled into a set of instructions, I can select which large language model to send these instructions to. Every model gets the same instructions unless they can support a larger context window. In which case they might get more ranked results than a model with a smaller context size.
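One simple way to picture this is a token budget: keep adding ranked results, best first, until the model's context window is full. The sketch below is an assumption about how such a step could work, not the actual Ask Luke code. The four-characters-per-token estimate is a crude rule of thumb, and the window sizes are invented.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def select_results(ranked_results: list[str], context_window: int, reserved_tokens: int = 1000) -> list[str]:
    """Take ranked results in order until the context budget is used up.

    `reserved_tokens` leaves room for the instructions and the generated answer.
    """
    budget = context_window - reserved_tokens
    selected = []
    used = 0
    for result in ranked_results:
        cost = estimate_tokens(result)
        if used + cost > budget:
            break
        selected.append(result)
        used += cost
    return selected


# Three ranked results of about 100 tokens each, with made-up window sizes.
results = ["a" * 400, "b" * 400, "c" * 400]
print(len(select_results(results, context_window=1250)))  # smaller window
print(len(select_results(results, context_window=2000)))  # larger window
```

The instructions stay identical across models. Only the number of ranked results that fit changes.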
Despite the consistent instructions, switching LLMs can have a very big impact on answer generation. I'll leave you to guess which of these two answers is powered by OpenAI's GPT-4 and which one comes from Anthropic's new (this week) Claude 3.5 Sonnet.
Some of you might astutely point out that the instruction set could be altered in specific ways when changing models. Recently, we've found the most advanced LLMs to be more interchangeable than before. But there are still differences in how they generate content as you can clearly see in the example above. Which one is best though... could soon be a matter of personal preference.
Thanks to Yangguang Li and Sam for the dev help on this feature.
Ask LukeW: Dynamic Preview Cards
After adding the Ask Luke feature to my site last year, I began sharing interesting questions people asked and their answers. But doing so manually meant creating an image in Photoshop and attaching it to posts on Twitter, LinkedIn, etc. Now with dynamic Open Graph previews, these preview cards get created on the fly. Pretty sweet.
Ask Luke is an AI-powered conversational interface that uses the thousands of articles, videos, audio files, and PDFs I've created over the years to answer people's questions about digital product design. Every time the system answers a question, it does so dynamically. So technically, each answer is unique.
To make each question and answer pair sharable, the first step was to enable creating a unique link to it. The second was to use Vercel's image generation library to create a preview card each time someone makes a link.
The dynamic preview card for each question and answer pair includes as much of the question as we can in addition to a bit of the response. It also adapts to varying question and answer lengths since it is generated dynamically.
When shared on Twitter, LinkedIn, Apple Messages, Slack, and any other application that supports Open Graph previews, an image with the question and answer is displayed providing a sense of what the link leads to.
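Applications find the preview through Open Graph meta tags in the shared page's HTML. The sketch below shows roughly what producing those tags for a question and answer pair could look like. It does not cover the image rendering done with Vercel's library. The URLs, character limits, and function names are assumptions for illustration.

```python
from html import escape


def truncate(text: str, limit: int) -> str:
    """Collapse whitespace and cut long text at a word boundary, adding '...'."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    cut = text[: limit - 3].rsplit(" ", 1)[0]
    return cut + "..."


def open_graph_tags(question: str, answer: str, page_url: str, image_url: str) -> str:
    """Build Open Graph meta tags for one question and answer pair."""
    tags = {
        "og:type": "website",
        "og:title": truncate(question, 120),      # as much of the question as fits
        "og:description": truncate(answer, 200),  # a bit of the response
        "og:url": page_url,
        "og:image": image_url,                    # the dynamically generated card
    }
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags.items()
    )


print(open_graph_tags(
    'What is "mobile first"?',
    "Designing for mobile first forces you to focus on what matters most.",
    "https://example.com/ask/123",      # hypothetical unique link
    "https://example.com/og/123.png",   # hypothetical image endpoint
))
```

Because `og:image` points at a URL, the card itself can be rendered on request for each unique link rather than created by hand.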
Thanks to Yangguang Li, Thanh Tran, and Sam for the tips and help with this.
Bolting on AI Features
As more companies embrace new AI-enabled capabilities, a commonly held position is that established players will "win" by integrating AI features into their existing platforms and products. But the more established these companies are... the more competing interests they face when doing so.
Consider Microsoft's Web browser Edge and its start-up experience. Edge is now your AI-powered browser. But it's also your way to browse, shop, find, create, game, protect, learn, pin, personalize, sign-in, import, sync on the go, and discover. In other words, any new AI feature faces stiff competition from all the other existing features that are still vying for people's attention and use. (Lots of internal team objectives to hit at Microsoft)
Sure, the AI features are mentioned first but they're likely tuned out and skipped over like all the other browser features being promoted during setup. After years of being asked to adopt sign-in, shopping, syncing, personalization and more, people have learned to ignore and dismiss marketing messages especially when they come as fast and furious as they do on Windows.
And it's not just setup. Once you start using Edge, the right side-panel is loaded with icons. Of course, the most brightly colored icon is the AI feature but it's right there alongside search, shopping, tools, games, Microsoft 365, Outlook, Drop, Browser essentials, Collections, and more. Once again the effect is to tune it all out. Too much, too often. And the new AI features fare the same as all the existing features.
Looking beyond Edge, the same issues persist across Windows. Yes, there's a new AI feature icon but it's competing with Windows Start, Microsoft Start, and god knows what else (after a while I was afraid to click on anything else).
Contrast this situation with new products and companies that start with AI capabilities at their core. They don't have a laundry list of pre-AI features competing for attention. They are not beholden to the revenue streams and teams behind those features. They can build from the ground up and use AI-based capabilities to build the core of their offering leading to new paradigms and value adds.
Of course, new entrants don't have massive user bases to leverage. But often a large existing user base is a disadvantage, because adoption of new features isn't earned. Bolt on an AI feature to an existing user base and some subset will try it or use it. The numbers "look good" but even turkeys fly in hurricanes.
Building from the ground up means you have to earn each user by providing value not just promotions. But it also means you're creating something valuable if people decide to come and especially if they stay. When you're integrating features, it's often harder to tell.