Developer News

AI Models and Headsets

LukeW - Thu, 02/15/2024 - 2:00pm

We've all heard the adage that "the future is here, it just isn't evenly distributed yet." With rapid advancement in multi-modal AI models and headset computing, we're at a stage where the future is clear but isn't implemented (yet). Here are two examples:

Multi-modal AI models can take videos, images, audio, and text as input and apply them as context to provide relevant responses and actions. Coupled with a lightweight headset with a camera, microphone, and speakers, this provides people with new ways to understand and interact with the World around them.

While these capabilities both exist, large AI models can't run locally on glasses ...yet. But the speed and cost of running models keeps decreasing while their abilities keep increasing.

Video generation models can go not only from text to video and image to video, but also from video to video. This enables people to modify what they're watching on the fly. Coupled with immersive video on a spatial computing platform, this enables dynamic environments, entertainment, and more.

Again, these capabilities exist separately, but the kind of instant, immersive (high resolution) video generation needed for Apple's Vision Pro format isn't here... yet.

There are no Original Ideas. But...

LukeW - Wed, 02/14/2024 - 2:00pm

Mark Twain is believed to have said "There is no such thing as an original idea." The implication is that, as a species, we are constantly building on what came before us: inspired and driven by what we've seen and experienced. Personally, I like to phrase a similar sentiment as "There are no original ideas. But there are original executions."

Anyone who has been around product design has probably experienced idea fetish: the belief that a good idea is all you need to be successful. This was elegantly debunked by Steve Jobs in 1995.

"You know, one of the things that really hurt Apple was after I left John Sculley got a very serious disease. It’s the disease of thinking that a really great idea is 90% of the work. And if you just tell all these other people “here’s this great idea,” then of course they can go off and make it happen. And the problem with that is that there’s just a tremendous amount of craftsmanship in between a great idea and a great product.

Designing a product is keeping five thousand things in your brain and fitting them all together in new and different ways to get what you want. And every day you discover something new that is a new problem or a new opportunity to fit these things together a little differently.

And it’s that process that is the magic."

As Steve points out, designing products is fitting thousands of different things together in different ways. So every execution of an idea is an explosion of possibility and thereby originality. How you execute an idea is always original simply because of the number of variables in play. Hence "There are no original ideas. But there are original executions."

Apple Vision Pro: First Experience

LukeW - Sun, 02/04/2024 - 2:00pm

Like many technology nerds out there, I got to explore Apple's new Vision Pro headset and Spatial Computing operating system firsthand this weekend. Instead of a product review (there's plenty out there already), here are my initial thoughts on the platform interactions and user interface potential. So basically a UI nerd's take on things.

Apple's vision of Spatial Computing essentially has two modes: an infinite canvas of windows and their corresponding apps, which can be placed and interacted with anywhere within your surroundings, and an immersive mode that replaces your physical surroundings with a fully digital environment. Basically a more Augmented Reality (AR)-ish mode and a more Virtual Reality (VR)-ish mode.

Whether viewing a panoramic photo, exploring an environment, or watching videos in cinematic mode, the ability to fully enter a virtual space is really well done. Experiences made for this format are at the top of my list to try out. It's where things are possible that make use of and alter the entirety of space around you. That said, the apps and content that make use of this capability are few and far between right now.

But I expect lots of experimentation and some truly spatial computing first interactions to emerge. Kind of like the way Angry Birds fully embraced multi-touch on the iPad and created a unique form of gameplay as a result.

While it's clearly being used to render the spatial OS, the deep understanding Apple Vision Pro's camera and sensor system has of your environment feels under-utilized by apps so far. I say this without a deep understanding of the APIs available to developers, but when I see examples of the data Apple Vision Pro has available (video below), it feels like more is possible.

So that's certainly compelling but what about the app environment, infinite windows, and getting things done in the AR-ish side of Spatial OS? With this I worry Apple's ecosystem might be holding them back vs. moving them forward. The popular consensus is that having a deep catalog of apps and developers is a huge advantage for a new platform. And Apple's design team has made it clear that they're leaning into this existing model as a way to transition users toward something new.

But that also means all the bad stuff comes along with the good. At a micro level, I found it very incongruent with the future of computing to face a barrage of pop-up modal windows during device setup and every time I accessed a new app. I know these are consistent patterns on MacOS, iOS, and iPadOS but that's the point: do they belong in a spatial OS? And frankly given their prominence, frequency, and annoyance... do they belong in any OS?

Similarly, retaining the WIMP paradigm (windows, icons, menus, and pointers) might help bridge the gap for people familiar with iPhones and Macs but making this work with the Vision Pro's eye tracking and hand gestures, while technically very impressive, created a bunch of frustration for me. It's easily the best eye and hand tracking I've experienced but I still ended up making a bunch of mistakes with unintended consequences. Yes, I'm going to re-calibrate to see if it fixes things but my broader point stands.

Is Apple now locked in a habit of porting their ecosystem from screen to screen to screen? And, as a result, tethered to too many constraints, requirements, and paradigms about what an app is and how we should interact with it? Were they burned by skeuomorphic design and no longer want to push the user interface in non-conventional ways?

One approach might be to look outside of WIMP and lean more into a model like OCGM (objects, containers, gestures, manipulations) designed for natural user interfaces (NUI). Another is starting simple, from the ground up. As a counter example to Apple Vision Pro, consider Meta's Ray Ban glasses. They are light, simple, and relatively cheap. For input there's an ultra-wide 12 MP camera and a five-microphone array. The only user interface is your voice and a single hardware button.

When combined with vision and language AI models, this simple set of controls offers up a different way of interacting with reality than the Apple Vision Pro. One without an existing app ecosystem, without windows, menus, and icons. But potentially one with a new way of bringing computing capabilities to the real World around us.

Which direction this all goes... we'll see. But it's great to have these two distinct visions for bringing compute capabilities to our eyes.

Common Visual Treatments

LukeW - Tue, 01/30/2024 - 2:00pm

In the context of a software interface, things that work the same should mostly look the same. This isn't to say consistency always wins in UI design but common visual treatments teach people how to get things done in software. So designers need to be intentional when applying them.

People make sense of what they see in an interface by recognizing the similarities and differences between visual elements. Those big white labels in a menu? Because they all look the same, we assume they work the same as well. When one label opens a dropdown menu, another links to a different page, and a third reveals a video on hover, we no longer know what to expect. Consequently our ability to use the software to accomplish things degrades along with our confidence.

Though Amazon's header has lots of options, there's a common visual representation for the elements in it that reinforces how things work. A white label on the dark blue background is a link to a distinct section of the site. If there's a light gray triangle to the right of the label, you'll get a set of choices that appear when you hover over it. And last, but not least, the white label with a menu icon to the left of it reveals a side panel on top of the current page when you click. Here's a simplified image of this:

Each distinct visual representation (white label, white label with arrow to right, white label with icon to left) is consistently matched with a distinct action (link to a section, reveal choices on hover, open side panel). The end result is that once people interact with a distinct visual element, they know what to expect from all the elements that look the same.

If one of the white labels in Amazon's header that lacked a light gray triangle also revealed a menu but did it on click instead of on hover, people's prior experience wouldn't line up with this behavior and they'd have to reset their understanding of how navigation on Amazon works.

While one such instance doesn't seem like a big deal... I mean it's only a little gray triangle... do this enough times in an interface design and people will increasingly get confused and feel like it's harder and harder to get what they need done.

Discomfort is a Strategic Advantage

LukeW - Fri, 01/19/2024 - 2:00pm

When things are going well, it's natural to feel comfortable. The better things go... the more comfortable you get. This isn't just true for humans; it applies to companies as well. But in both cases, being uncomfortable is a strategic advantage.

I often got confused looks from co-workers when they asked me how a project was going: "things are going really well, I don't like it." Why would you not like when things are going well? My mindset has always been if things are good, there's more opportunity for them to get worse. But if things are bad there's a lot of room for them to get better.

In reflecting on this, there's an underlying belief that being uncomfortable is a better state to be in than being comfortable. Discomfort means you're not satisfied with the current situation. You know it can be better and you're motivated to make it so.

"It’s wild, but comfort can be a poison— John Nack

In the context of product design, this adds up to a mindset that design is never done and there are always things to improve. So you spend time understanding what is broken at a deeper level and keep iterating to improve it. Usually this type of process leads you back to core, critical flows. Fixing what really matters.

When you're comfortable, you instead assume the core product is doing fine and begin to fill time by thinking up what else to do, adding new features, or veering away from what actually matters. Discomfort with the status quo drives urgency and relevance.

"To grow new markets means making yourself uncomfortable. It means you can’t keep doing more of what got you here." -What Steve Jobs taught me about growth

Discomfort is also a prerequisite of doing something new. When you're solving a problem in a different way, it won't be immediately understood by others and you'll get a lot more head shakes than nods of agreement. To get through that, you need to be ok with being uncomfortable. The bigger the change, the longer you'll be uncomfortable.

But how do you motivate yourself and your teams to be uncomfortable? I often find myself quoting the words of the late, great Bill Scott. When explaining how he decided what to do next, he always looked for "butterflies in the stomach and a race in the heart". He wanted to be both uncomfortable (butterflies) and excited (race). Because comfort, while nice, isn't really that exciting.

Video: Using Website Content in AI Interfaces

LukeW - Sun, 01/07/2024 - 2:00pm

In this two minute video from my How AI Ate My Website talk, I outline how to automatically answer people's design questions using the content from this Web site and embeddings. I also explain why that approach differs from how broader Large Language Models (LLMs) generate answers. It's a quick look at how to make use of AI models to rethink how people can interact with Web sites.

Transcript

When we have all these cleaned up bits of content, how do we get the right ones to assemble a useful answer to someone's question? Well, all those chunks of content get mapped to a multi-dimensional vector space that puts related bits of information together. So things that are mobile-touch-ish end up in one area, and things that are e-commerce-ish end up closer to another area.

This is a pretty big simplification, but it's a useful way of thinking about what's happening. To get into more details... enter the obligatory system diagram.

The docs that we have, videos, audios, webpages, get cleaned up and mapped to parts of that embedding index. When someone asks a question, we retrieve the most relevant parts, rank them, sometimes a few times, put it together for an AI language model to summarize in the shape of an answer.

And sometimes we even get multiple answers and rank the best one before showing it to anybody. Feedback is also a really important part of this, and why kind of starting with something that roughly works and iterating is more important than doing it exactly right the first time.
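To make that retrieval flow concrete, here's a minimal sketch in Python. It is not the actual Ask Luke implementation: embed() and generate() are stand-ins for whatever embedding and language models are used, a single cosine-similarity pass replaces the multi-stage ranking and answer re-ranking described above, and in practice chunk embeddings would be precomputed rather than recalculated per question.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(question, chunks, embed, generate, k=5):
    """Retrieve the k most relevant chunks and summarize them as an answer.

    chunks: list of {"text": ..., "source": ...} dicts.
    embed / generate: placeholders for real model calls (assumptions).
    """
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c["text"])), reverse=True)
    top = ranked[:k]
    context = "\n\n".join(c["text"] for c in top)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt), [c["source"] for c in top]
```

Returning the sources alongside the generated text is what makes the citation experience described later possible.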

So what's the impact of doing all this versus just using something like ChatGPT to ask questions?

Well for starters, you get very different kinds of answers, much more focused and reflecting a particular point of view versus general world knowledge. As you can see in the difference between a ChatGPT answer on the left to, why do designs look the same, versus the answer you get from Ask Luke.

On the Ask Luke side, you also get citations, which allow us to do a bunch of additional things, like object-specific experiences. On Ask Luke, you ask a question, get an answer, with citations to videos, audio files, webpages, PDFs, etc. Each one has a unified, but document-type specific interface.

The More Features You Add...

LukeW - Tue, 12/19/2023 - 2:00pm

As Dave Fore once said: "features are the currency of software development and marketing." Spend time in any software company and you'll begin to echo that sentiment. But there's consequences...

The first of which is feature-creep: loosely defined as “the tendency to add just another little feature until the whole product is overwhelmed with them”. That pretty much sounds like a bad thing, so why does it keep happening?

Multiple studies have shown that before using a product, people judge its quality based on the number of features it has. It's only after using the product that they realize the usability issues too many features create.

So in order to maximize initial sales companies build products with many features. But to maximize repeat sales, customer satisfaction, and retention companies need to prioritize ease-of-use over features. Cue the inevitable redesign cycle that software applications go through... design is never done.

The more you own, the more you maintain.

The other key issue with more features is more maintenance. Every feature that goes out the door is a commitment to bug fixes, customer support, and the resources required to keep the feature running and updated. Too often these costs aren't considered enough when features get launched. And an increasing number of features inevitably begin to bog down what a company can do going forward. Companies get stuck in their self-inflicted feature morass negatively impacting their ability to move quickly to address new customer and market needs, which often matters more than a few incremental features.

Like consumer shopping decisions, product team decisions are weighted toward short-term vs. long-term value. Launching new features within software companies typically gets you the accolades, promotions, and clout. Maintaining old features, much less so.

For both consumers and product teams the upfront allure of more features usually wins out, but in both cases, long-term consequences await. So sail the feature-seas mindfully please.

Video: PDFs & Conversational Interfaces

LukeW - Sun, 12/17/2023 - 2:00pm

This two minute video from my How AI Ate My Website talk highlights the importance of cleaning up the source materials used for conversational interfaces. It illustrates the issues PDF documents can create for large-language model generated answers and how to address them.

Transcript

PDFs are special in another way, as in painfully special. Let's look at what happened to our answers when we added 370 plus PDFs to our embedding index. On the left is an answer to the question, what is design? Pretty good response and sourced from a bunch of web pages.

When PDFs got added to the index, the response to this question changed a lot and not in a way that I liked. But more importantly, only one PDF was cited as a source instead of multiple web pages.

So what happened?

What happened is a great demonstration of the importance of the document processing, aka cleanup step, I emphasized before. This ugly spreadsheet shows the ugly truth of PDFs. They have a ton of layout markup to achieve their good looks.

But when breaking them down, you can easily end up with a bunch of bad content chunks like the ones here. After scoring all our content embeddings, we were able to get rid of a bunch that were effectively junk and clogging up our answers.

Removing those now gives a much better balance of PDFs, videos, podcasts, and web pages, all of which gets cited in the answer to what is design. More importantly, however, the answer itself actually got better.
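The talk doesn't spell out how the content embeddings were scored to find the junk, but one simple way to illustrate the idea is a quality heuristic that flags chunks which are mostly layout residue rather than prose. The scoring and threshold below are made up for the example.

```python
import re

def chunk_quality(text):
    """Rough quality score: mostly-prose chunks score high, chunks that are
    mostly layout residue (numbers, punctuation, stray fragments) score low."""
    if not text.strip():
        return 0.0
    letters = sum(ch.isalpha() for ch in text)
    words = re.findall(r"[A-Za-z]{3,}", text)
    return (letters / len(text)) * min(len(words) / 20, 1.0)

def filter_chunks(chunks, threshold=0.4):
    """Drop chunks that look like PDF layout junk rather than content."""
    return [c for c in chunks if chunk_quality(c["text"]) >= threshold]
```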

Video: Suggested Questions in Conversational UI

LukeW - Wed, 12/13/2023 - 2:00pm

If you've ever designed a conversational interface, you've probably found that people often don't know what they could or should ask. In this 2 minute video from my How AI Ate My Website talk, I discuss the importance of suggested questions in the Ask Luke conversational UI on this site and walk through some of the design iterations we tried before landing on our current solution.

Transcript

So now we have an expandable conversational interface that collapses and extends to make finding relevant answers much easier. But there's something missing in this screenshot... and that's suggested questions.

For the purpose of this presentation, I simplified the UI a bit in the past few examples. But on the real site, each answer also includes a series of suggested questions. The first few of these are related to the question you just asked, and additional ones come from the rest of the corpus of content.
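As a rough illustration of that mix (not the site's actual code), suggestions could be drawn from a pool of pregenerated questions: a few nearest to the question just asked, topped up with others from the rest of the corpus. embed() and similarity() below are stand-ins for real model calls.

```python
import random

def suggest_questions(asked, corpus_questions, embed, similarity,
                      n_related=3, n_total=5):
    """Return a few suggestions related to the question just asked,
    topped up with questions from the rest of the corpus."""
    asked_vec = embed(asked)
    ranked = sorted(corpus_questions,
                    key=lambda q: similarity(asked_vec, embed(q)),
                    reverse=True)
    related = ranked[:n_related]
    rest = [q for q in corpus_questions if q not in related]
    extra = random.sample(rest, min(n_total - n_related, len(rest)))
    return related + extra
```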

Suggested questions are pretty critical because they address the issue of, what should I ask? And it turns out, lots of people have that problem, because a very large percent of all the questions asked kick off with one of these suggestions.

We knew from the start these were important, but it took a bit to get to the design solution you see here. At first, we experimented with an explicit action to trigger suggested questions.

Need an idea for what to ask? Just hit the lightbulb icon.

We then iterated to a more clear, what can I ask, link and icon that works the same way. But in both cases, the burden was on the user to ask for suggested questions.

So we began exploring a series of designs that put suggested questions directly after each answer, automatically. With this approach, there was no work required on the part of the user to show suggested questions.

These iterations continued until we got to suggested questions directly in line in our expandable conversational interface.

Video: Embedded Experiences in Conversational UI

LukeW - Tue, 12/05/2023 - 2:00pm

In this 2.5min video from my How AI Ate My Website talk, I walk through how a conversational (chat) interface powered by generative AI can cite the materials it uses to answer people's questions through a unified embedded experience for different document types like videos, audio, Web pages, and more.

Transcript

Now, as I mentioned, answers stem from finding the most relevant parts of documents, stitching them together, and citing those replies. You can see one of these citations in this example.

This also serves as an entry point into a deeper, object-specific experience. What does that mean? Well, when you see these cited sources, you can tap into any one of them to access the content. But instead of just linking out to a separate window or page, which is pretty common, we've tried to create a unified way of exploring each one.

Not only do you get an expanded view into the document, but you also get document-specific interactions, and the ability to ask additional questions scoped just to that open document.

Here's how that looks in this case for an article. You can select a citation to get the full experience, which includes a summary, the topics in the article, and again, the ability to ask questions just of that document. In this case, about evolving e-commerce checkout.

There's more document types than just webpages, though. Videos, podcasts, PDFs, images, and more. On Ask Luke, you ask a question, get an answer, with citations to videos, audio files, webpages, PDFs, etc. Each one has a unified, but document-type specific interface.

The video experience, for example, has an inline player, a scrubber with a real-time transcript, the ability to search that transcript, some auto-generated topics, summaries, and the ability to ask questions just of what's in the video.

When you search within the transcript, you can also jump directly to that part of the video in the inline player. Audio works the same way, just an audio player instead of a video screen. Here you can see the diarization and cleanup work at play, which is how we have the conversation broken down by speakers and their names and the timestamp for the transcript.

Webpages have a reader view, just like videos and audio files. We show a summary, key topics, give people the ability to ask questions scoped to that article, and by now you get the pattern.

Video: Structuring Website Content with AI

LukeW - Sun, 12/03/2023 - 2:00pm

To create useful conversational interfaces for specific sets of content like this Website, we can use a variety of AI models to add structure to videos, audio files, and text. In this 2.5 minute video from my How AI Ate My Website talk, I discuss how, and also illustrate that if you can model a behavior, you can probably train a machine to do it at scale.

Transcript

There's more document types than just web pages. Videos, podcasts, PDFs, images, and more. So let's look at some of these object types and see how we can break them down using AI models in a way that can then be reassembled into the Q&A interface we just saw.

For each video file, we first need to turn the audio into written text. For that, we use a speech-to-text AI model. Next, we need to break that transcript down into speakers. For that, we use a diarization model. Finally, a large language model allows us to make a summary, extract keyword topics, and generate a list of questions each video can answer.
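As a sketch of how those per-video steps could be chained together: transcribe(), diarize(), and llm() below are placeholders for the speech-to-text, diarization, and large language models mentioned above, not specific services.

```python
def process_video(video_path, transcribe, diarize, llm):
    """Turn one video into the structured pieces a Q&A interface can use."""
    transcript = transcribe(video_path)          # speech-to-text model
    segments = diarize(video_path, transcript)   # who said what, and when
    return {
        "transcript": transcript,
        "segments": segments,
        "summary": llm(f"Summarize this talk:\n{transcript}"),
        "topics": llm(f"List the keyword topics covered:\n{transcript}"),
        "questions": llm(f"List questions this talk answers:\n{transcript}"),
    }
```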

We also explored models for identifying objects and faces, but don't use them here. But we did put together a custom model for one thing, keyframe selection. There's also a processing step that I'll get to in a bit, but first let's look at this keyframe selection use case.

We needed to pick out good thumbnails for each video to put into the user interface. Rather than manually viewing each video and selecting a specific keyframe for the thumbnail, we grabbed a bunch automatically, then quickly trained a model by providing examples of good results. Show the speaker, eyes open, no stupid grin.

In this case, you can see it nailed the which Paris girl are you backdrop, but left a little dumb grin, so not perfect. But this is a quick example of how you can really think about having AI models do a lot of things for you.

If you can model the behavior, you can probably train a machine to do it at scale. In this case, we took an existing model and just fine-tuned it with a smaller number of examples to create a useful thumbnail picker.
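The talk describes fine-tuning an existing model on a small set of labeled examples. As a simplified stand-in for that idea, the sketch below trains a tiny classifier on top of features from a pretrained vision model (embed_image() is a placeholder) and uses it to pick the best-scoring candidate frame. The labeled examples need to include both good and bad frames.

```python
from sklearn.linear_model import LogisticRegression

def train_thumbnail_picker(labeled_frames, embed_image):
    """labeled_frames: list of (frame_path, is_good) pairs with both labels present."""
    X = [embed_image(path) for path, _ in labeled_frames]
    y = [int(good) for _, good in labeled_frames]
    return LogisticRegression(max_iter=1000).fit(X, y)

def pick_thumbnail(candidate_frames, model, embed_image):
    """Return the candidate frame the classifier scores highest."""
    scores = [model.predict_proba([embed_image(p)])[0][1] for p in candidate_frames]
    return candidate_frames[scores.index(max(scores))]
```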

In addition to video files, we also have a lot of audio, podcasts, interviews, and so on. Lots of similar AI tasks to video files. But here I wanna discuss the processing step on the right.

There's a lot of cleanup work that goes into making sure our AI generated content is reliable enough to be used in citations and key parts of the product experience. We make sure proper nouns align, aka Luke is Luke. We attach metadata that we have about the files, date, type, location, and break it all down into meaningful chunks that can be then used to assemble our responses.
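A minimal sketch of that cleanup and chunking step might look like the following. The alias table, metadata fields, and chunk size are illustrative assumptions rather than the pipeline's actual schema.

```python
import re
from dataclasses import dataclass
from typing import Optional

ALIASES = {"Luke Wroblewski": "Luke", "LukeW": "Luke"}  # example values only

@dataclass
class Chunk:
    text: str
    source: str        # file the chunk came from
    doc_type: str      # "video", "podcast", "article", "pdf", ...
    date: str
    location: Optional[str] = None

def chunk_document(text, source, doc_type, date, size=800):
    """Normalize proper nouns, attach metadata, and split into chunks."""
    text = re.sub(r"\s+", " ", text).strip()
    for variant, canonical in ALIASES.items():
        text = text.replace(variant, canonical)   # make sure Luke is Luke
    return [Chunk(text=text[i:i + size], source=source,
                  doc_type=doc_type, date=date)
            for i in range(0, len(text), size)]
```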

Video: Expanding Conversational Interfaces

LukeW - Thu, 11/30/2023 - 2:00pm

In this 4 minute video from my How AI Ate My Website talk, I illustrate how focusing on understanding the problem instead of starting with a solution can guide the design of conversational (AI-powered) interfaces. So they don't all have to look like chatbots.

Transcript

But what if instead we could get closer to the way I'd answer your question in real life? That is, I'd go through all the things I've written or said on the topic, pull them together into a coherent reply, and even cite the sources, so you can go deeper, get more context, or just verify what I said.

In this case, part of my response to this question comes from a video of a presentation just like this one, but called Mind the Gap. If you select that presentation, you're taken to the point in the video where this topic comes up. Note the scrubber under the video player.

The summary, transcript, topics, speaker diarization, and more are all AI generated. More on that later, but essentially, this is what happens when a bunch of AI models effectively eat all the pieces of content that make up my site and spit out a very different interaction model.

Now the first question people have about this is how is this put together? But let's first look at what the experience is, and then dig into how it gets put together. When seeing this, some of you may be thinking, I ask a question, you respond with an answer.

Isn't that just a chatbot? Chatbot patterns are very familiar to all of us, because we spend way too much time in our messaging apps. The most common design layout of these apps is a series of alternating messages. I say something, someone replies, and on it goes. If a message is long, space for it grows in the UI, sometimes even taking up a full screen.

Perhaps unsurprisingly, it turns out this design pattern isn't optimal for iterative conversations with sets of documents, like we're dealing with here. In a recent set of usability studies of LLM-based chat experiences, the Nielsen-Norman group found a bunch of issues with this interaction pattern, in particular with people's need to scroll long conversation threads to find and extract relevant information. As they called out, this behavior is a significant point of friction, which we observed with all study participants.

To account for this, and a few additional considerations, we made use of a different interaction model, instead of the chatbot pattern. Through a series of design explorations, we iterated to something that looks a little bit more like this.

In this approach, previous question and answer pairs are collapsed, with a visible question and part of its answer. This enables quick scanning to find relevant content, so no more scrolling massive walls of text. Each question and answer pair can be expanded to see the full response, which as we saw earlier can run long due to the kinds of questions being asked.

Here's how things look on a large screen. The most recent question and answer is expanded by default, but you can quickly scan prior questions, find what you need, and then expand those as well. Net-net, this interaction works a little bit more like a FAQ pattern than a chatbot pattern, which kind of makes sense when you think about it. The Q&A process is pretty similar to a help FAQ. Have a question, get an answer.

It's a nice example of how starting with the problem space, not the solution, is useful. I bring this up because too often designers start the design process with something like a competitive audit, where they look at what other companies are doing and, whether intentionally or not, end up copying it, instead of letting the problem space guide the solution.

In this case, starting with understanding the problem versus looking at solutions got us to a more of a FAQ thing than a chatbot thing. So now we have an expandable conversational interface that collapses and extends to make finding relevant answers much easier.

AI Models Enable New Capabilities

LukeW - Tue, 11/28/2023 - 2:00pm

In the introduction to my How AI Ate My Website talk, I frame AI capabilities as a set of language and vision operations that allows us to rethink how people experience Web sites. AI tasks like text summarization, speech to text, and more can be used to build new interactions with existing content as outlined in this short 3 minute video.

Transcript

How AI Ate My Website. What do most people picture with that title, AI eating a website? They might perhaps imagine some scary things, like a giant computer brain eating up web pages on its way to global dominance.

In truth though, most people today probably think of AI as something more like ChatGPT, the popular large language model from OpenAI. These kinds of AI models are trained on huge amounts of data, including sites like mine, which gives them the ability to answer questions such as, Who is Luke? ChatGPT does a pretty good job, so I guess I don't need an intro slide in my presentations anymore.

But it's not just my site that's part of these massive training sets. And since large language models are essentially predicting the next token in a sequence, they can easily predict very likely, but incorrect answers. For instance, it's quite likely a product designer like me went to CMU, but I did not. Even though ChatGPT keeps insisting that I did, in this case, for a master's degree.

No problem though, because of reinforcement learning, many large language models are tuned to please us. So correct them, and they'll comply, or veer off into weird spaces.

Let's zoom out to see this relationship between large language models and websites. A website like mine, including many others, has lots of text. That text gets used as training data for these immense auto-completion machines, like ChatGPT. That's how it gets the ability to create the kinds of responses we just looked at.

This whole idea of training giant machine brains on the totality of published content on the internet can lead people to conjure scary AI narratives.

But thinking in terms of a monolithic AI brain isn't that helpful for understanding AI capabilities and how they can help us. While ChatGPT is an AI model, it's just one kind, a large language model. There's lots of different AI models that can be used for different tasks, like language operations, vision operations, and more.

Some models do more than one task, others are more specialized. What's very different from a few years ago though, is that general purpose models, things that can do a lot of different tasks, are now widely available and effectively free.

We can use these AI models to rethink what's possible when people interact with our websites, to enable experiences that were impossible before, to go from scary AI thing to awesome new capabilities, and hopefully make the web cool again, because right now, sorry, it's not very cool.

Early Glimpses of Really Personal Assistants

LukeW - Fri, 11/24/2023 - 2:00pm

Recently I've stumbled into a workflow that's starting to feel like the future of work. More specifically, a future with really personal assistants that accelerate and augment people's singular productivity needs and knowledge.

"The future is already here – it's just not evenly distributed." -William Gibson, 2003

Over the past few months, I've been iterating on a feature of this Website that answers people's digital product design questions in natural language using the over 2,000 text articles, 375 presentations, 100 videos, and more that I've authored over the past 28 years. While the project primarily started as a testbed for conversational interface design, it's morphed into quite a bit more.

Increasingly, I've started to use the Ask Luke functionality as an assistant that knows my work almost as well as I do, can share it with others, and regularly expands its usefulness. For example, when asked a question on Twitter (ok, X) I can use Ask Luke to instantly formulate an answer and respond with a link to it.

Ask Luke answers use the most relevant parts of my archive of writings, presentations, and more when responding. In this case, the response includes several citations that were used to create the final answer (a rough sketch of how such citations could be structured follows the list):

  • a video that begins at the 56:04 timestamp where the topic of name fields came up in a Q&A session after my talk
  • a PDF of a presentation I gave on Mobile checkout where specific slides outlined the pros and cons of single name fields
  • and several articles I wrote that expanded on name fields in Web forms
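For illustration only, a citation like the ones above could be represented as a small structure carrying its document type and type-specific details. The field names and placeholder URLs below are hypothetical, not the site's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Citation:
    title: str
    doc_type: str                      # "video", "pdf", "article", "audio", ...
    url: str
    timestamp: Optional[str] = None    # e.g. "56:04" for video or audio
    slide: Optional[int] = None        # for PDF presentations

citations = [
    Citation("Q&A after a talk", "video", "https://example.com/talk", timestamp="56:04"),
    Citation("Mobile checkout deck", "pdf", "https://example.com/deck.pdf", slide=12),
    Citation("Name fields in Web forms", "article", "https://example.com/article"),
]
```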

It's not hard to see how the process of looking across thousands of files, finding the right slides, timestamps in videos, and links to articles would have taken me a lot longer than the ~10 seconds it takes Ask Luke to generate a response. Already a big personal productivity gain.

I've even found that I can mostly take questions as they come to me and produce responses as this recent email example shows. No need to reformat or adjust the question, just paste it in and get the response.

But what about situations where I may have information in my head but haven't written anything on the topic? Or where I need to update what I wrote in light of new information or experiences I've come across? As these situations emerged, we expanded the admin features for Ask Luke to allow me to edit generated answers or write new answers (often through audio dictation).

Any new or edited answer then becomes part of the index used to answer subsequent questions people ask. I can also control how much an edited or new answer should influence a reply and which citations should be prioritized alongside the answer. This grows the content available in Ask Luke and helps older content remain relevant.
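One way this could work under the hood, sketched with assumed field names rather than the site's real schema: an authored answer becomes another entry in the retrieval index, carrying a weight that boosts its ranking and a list of citations to surface first.

```python
def add_authored_answer(index, question, answer_text, embed,
                        weight=2.0, pinned_citations=None):
    """Fold an edited or dictated answer back into the retrieval index.

    embed is a stand-in for a real embedding model; the dict shape is assumed.
    """
    index.append({
        "text": answer_text,
        "embedding": embed(question + "\n" + answer_text),
        "source": "authored-answer",
        "weight": weight,                            # boosts retrieval score
        "pinned_citations": pinned_citations or [],  # surfaced before others
    })
```

At query time, multiplying a chunk's similarity score by its weight would let an authored answer outrank older material on the same topic.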

Having an assistant that can accept instructions (questions) in the exact form you get them (no rewriting), quickly find relevant content in your digital exhaust (documents, presentations, recordings, etc.), assemble responses the way you would, cite them in detail, and help you grow your personal knowledge base... well it feels like touching the future.

And it's not hard to imagine how similar really personal assistants could benefit people at work, home, and school.


AI Models in Software UI

LukeW - Sun, 11/19/2023 - 2:00pm

As more companies work to integrate the capabilities of powerful generative AI language and vision models into new and existing software, high-level interaction patterns are emerging. I've personally found these distinct approaches to AI integration useful for talking with folks about what might work for their specific products and use cases.

In the first approach, the primary interface affordance is an input that directly (for the most part) instructs one or more AI models. In this paradigm, people are authoring prompts that result in text, image, video, etc. generation. These prompts can be sequential, iterative, or unrelated. Marquee examples are OpenAI's ChatGPT interface or Midjourney's use of Discord as an input mechanism. Since there are few, if any, UI affordances to guide people, these systems need to respond to a very wide range of instructions. Otherwise people get frustrated with their primarily hidden (to the user) limitations.

The second approach doesn't include any UI elements for directly controlling the output of AI models. In other words, there are no input fields for prompt construction. Instead, instructions for AI models are created behind the scenes as people go about using application-specific UI elements. People using these systems could be completely unaware an AI model is responsible for the output they see. This approach is similar to YouTube's use of AI models (more machine learning than generative) for video recommendations.

The third approach is application specific UI with AI assistance. Here people can construct prompts through a combination of application-specific UI and direct model instructions. These could be additional controls that generate portions of those instructions in the background. Or the ability to directly guide prompt construction through the inclusion or exclusion of content within the application. Examples of this pattern are Microsoft's Copilot suite of products for GitHub, Office, and Windows.

These entry points for AI assistance don't have to be side panels, they could be overlays, modals, inline menus and more. What they have in common, however, is that they supplement application specific UIs instead of completely replacing them.
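To illustrate the third approach, here's a hypothetical sketch of how an application might assemble a model instruction from its own UI state plus whatever the person typed. The field names and example values are invented for the sketch and don't mirror any particular product's implementation.

```python
def build_prompt(ui_state, user_instruction):
    """Combine application-specific UI choices with a direct instruction."""
    parts = []
    if ui_state.get("selection"):
        parts.append(f"Selected content:\n{ui_state['selection']}")
    if ui_state.get("tone"):
        parts.append(f"Use a {ui_state['tone']} tone.")
    parts.append(f"Instruction: {user_instruction}")
    return "\n\n".join(parts)

prompt = build_prompt(
    {"selection": "Q3 revenue grew 12%...", "tone": "concise"},
    "Rewrite this as a one-paragraph summary.",
)
```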

Actual implementations of any of these patterns are likely to blur the lines between them. For instance, even when the only interface is an input for prompt construction, the system may append or alter people's input behind the scenes to deliver better results. Or an AI assistance layer might primarily serve as an input for controlling the UI of an application instead of working alongside it. Despite that, I've still found these three high-level approaches to be helpful in thinking through where and how AI models are surfaced in software applications.

Until the Right Design Emerges...

LukeW - Wed, 11/15/2023 - 2:00pm

Too often, the process of design is cut short. When faced with user needs or product requirements, many designers draft a mockup or wireframe informed by what they've seen or experienced before. But that's actually when the design process starts, not ends.

"Art does not begin with imitation, but with discipline."—Sun Ra, 1956

Your first design, while it may seem like a solution, is usually just an early definition of the problem you are trying to solve. This iteration surfaces unanswered questions, puts assumptions to the test, and generally works to establish what you need to learn next.

"Design is the art of gradually applying constraints until only one solution remains."—Unknown

Each subsequent iteration is an attempt to better understand what is actually needed to solve the specific problem you're trying to address with your design. The more deeply you understand the problem, the more likely you are to land on an elegant and effective solution. The process of iteration is a constant learning process that gradually reveals the right path forward.

"True simplicity is, well, you just keep on going and going until you get to the point where you go... Yeah, well, of course." —Jonathan Ive, September, 2013

When the right approach reveals itself, it feels obvious. But only in retrospect. Design is only obvious in retrospect. It takes iteration and discipline to get there. But when you do get there, it's much easier to explain your design decisions to others. You know why the design is the right one and can frame your rationale in the context of the problem you are trying to solve. This makes presenting designs easier and highlights the strategic impact of designers.

Multi-Modal Personal Assistants: Early Explorations

LukeW - Mon, 11/13/2023 - 2:00pm

With growing belief that we're quickly moving to a world of personalized multi-modal software assistants, many companies are working on early glimpses of this potential future. Here are a few ways you can explore bits of what these kinds of interactions might become.

But first, some context. Today's personal multi-modal assistant explorations are largely powered by AI models that can perform a wide variety of language and vision tasks like summarizing text, recognizing objects in images, synthesizing speech, and lots more. These tasks are coupled with access to tools, information, and memory that makes them directly relevant to people's immediate situational needs.

To simplify that, here's a concrete example: faced with a rat's nest of signs, you want to know if it's ok to park your car. A personal multi-modal assistant could take an image (live camera feed or still photo), a voice command (in natural language), and possibly some additional context (time, location, historical data) as input and assemble a response (or action) that considers all these factors.
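Here's a rough sketch of that flow, with vision_model() and llm() standing in for whatever models an assistant would actually call; the request shape is an assumption made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class AssistantRequest:
    image: bytes                                   # camera frame or still photo
    voice_command: str                             # transcribed natural-language ask
    context: dict = field(default_factory=dict)    # time, location, history

def can_i_park_here(request, vision_model, llm):
    """Combine image, voice, and context into a single answer."""
    signs = vision_model(request.image)            # read the text on each sign
    prompt = (f"Signs: {signs}\n"
              f"Time: {request.context.get('time')}\n"
              f"Question: {request.voice_command}\n"
              "Answer whether parking is allowed right now.")
    return llm(prompt)
```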

So where can you try this out? As mentioned, several companies are tackling different parts of the problem. If you squint a bit at the following list, it's hopefully clear how these explorations could add up to a new computing paradigm.

OpenAI's native iOS app can take image and audio input and respond in both text and speech using their most advanced large language model, GPT-4... if you sign up for their $20/month GPT+ subscription. With an iPhone 15 Pro ($1,000+), you can configure the phone's hardware action button to directly open voice control in OpenAI's app. This essentially gives you an instant assistant button for audio commands. Image input, however, still requires tapping around the app and only works with static images, not a real-time camera feed.

Humane's upcoming AI Pin (preorder $699) handles multiple inputs with a built-in microphone, camera, touch surface, and sensors for light, motion, GPS, and more. It likewise makes use of a network connection ($24/month) and Large Language Models to respond to natural language requests, but instead of relying on your smartphone's screen and speakers for output, it uses its own speaker and laser projection display. Definitely on the "different" end of the hardware and display spectrum.

Rewind's Pendant (preorder for $59) is a wearable that captures what you say and hear in the real world and then transcribes, encrypts, and stores it on your phone. It's mostly focused on the audio input side of a multi-modal personal assistant but the company's goal is to make use of what the device captures to create a "personalized AI powered by truly everything you’ve seen, said, or heard."

New Computer's Dot app (not yet available) has released some compelling videos of a multi-modal personal assistant that runs on iOS. In particular, the ability to add docs and images that become part of a longer term personal memory.

While I'm sure more explorations and developed products are coming, this list lets you touch parts of the future while it's being sorted out... wrinkles and all.

Always Be Learning

LukeW - Wed, 11/08/2023 - 2:00pm

The mindset to “always be learning” is especially crucial in the field of digital product design where not only is technology continuously evolving, but so are the people we're designing for.

To quote Bruce Sterling, because people are “time bound entities moving from cradle to grave”, their context, expectations, and problems are always changing. So design solutions need to change along with them.

As a result, designers have to keep learning about how our products are being used, abused, or discarded, and we need to feed those lessons back into our designs. Good judgement comes from experience, and experience comes from bad judgements. Therefore, continuous learning is crucial for refining judgement and improving design outcomes.

"There’s the object, the actual product itself, and then there’s all that you learned. What you learned is as tangible as the product itself, but much more valuable, because that’s your future." -Jony Ive, 2014

So how can we always be learning? Start with the mindset that you have a lot to learn and sometimes unlearn. Spend your time in environments that encourage deeper problem-understanding and cross-disciplinary collaboration. This means not just designing but prototyping as well. Design to build, build to learn.

Recognize the patterns you encounter along the way and make time to explore them. This extends what you've learned into a more broadly useful set of skills and better prepares you for the next set of things you'll need to learn.

Rapid Iterative Testing and Evaluation (RITE)

LukeW - Thu, 11/02/2023 - 2:00pm

Rapid Iterative Testing and Evaluation or RITE is a process I've used while working at Yahoo! and Google to quickly make progress on new product designs and give teams a deeper shared understanding of the problem space they're working on.

RITE is basically a continuous process of designing and building a prototype, testing it with users, and making changes within a short period, typically a few days. The goal is to quickly identify and address issues, and then iterate on the design based on what was learned. This gives teams regular face time with end users and collectively grows their knowledge of the needs, environments, and expectations of their customers.

The way I've typically implemented RITE is every Monday, Tuesday, and Wednesday, we design and build a prototype. Then every Thursday, we bring in people to use the prototype through a series of 3-5 usability tests that the whole team attends. On Friday, we discuss the results of that testing together and decide what to change during the following week. This cycle is repeated week after week. In some cases running for months.

This approach puts customers front and center in the design process and allows for quick adaptation to issues and opportunities each week. The RITE method is also useful because it provides insights not just opinions. In other words, if there's a debate about a design decision, we can simply test it with users that week. This squashes a lot of open-ended discussions that don't result in action because the cost of trying something out is incredibly low. "OK we'll try it."

The cadence of weekly user tests also really aligns teams on common goals as everyone participates in observing problems and opportunities, exploring solutions, and seeing the results of their proposals. Over and over again.
