
Generative AI in the Real World: Faye Zhang on Using AI to Improve Discovery




In this episode, Ben Lorica and AI Engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get a much better picture of what any object is and whether or not it’s something the user would want.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

0:00: Today we have Faye Zhang of Pinterest, where she’s a staff AI engineer. And so with that, very welcome to the podcast.

0:14: Thanks, Ben. Huge fan of the work. I’ve been fortunate to attend both the Ray and NLP Summits, where I know you serve as chair. I also love the O’Reilly AI podcast. The recent episode on A2A and the one with Raiza Martin on NotebookLM have been really inspirational. So, great to be here.

0:33: All right, so let’s jump right in. So one of the first things I really wanted to talk to you about is this work around PinLanding. And you’ve published papers, but I guess at a high level, Faye, maybe describe for our listeners: What problem is PinLanding trying to address?

0:53: Yeah, that’s a great question. I think, in short, we’re trying to solve this trillion-dollar discovery crisis. We’re living through the greatest paradox of the digital economy. Essentially, there’s infinite inventory but very little discoverability. Picture one example: A bride-to-be asks ChatGPT, “Now, find me a wedding dress for an Italian summer vineyard ceremony,” and she gets great general advice. But meanwhile, somewhere in Nordstrom’s hundreds of catalogs, there sits the perfect terracotta Soul Committee dress, never to be found. And that’s a $1,000 sale that will never happen. And if you multiply this by a billion searches across Google, SearchGPT, and Perplexity, we’re talking about a $6.5 trillion market, according to Shopify’s projections, where every failed product discovery is money left on the table. So that’s what we’re trying to solve—essentially connecting the semantic organization of platform catalogs with user context and search.

2:05: So, before PinLanding was developed, and if you look across the industry and other companies, what would be the default—what would be the incumbent system? And what would be insufficient about this incumbent system?

2:22: There have been researchers working on this problem across the past decade; we’re definitely not the first. I think number one is to understand catalog attribution. Back in the day, there was multitask R-CNN generation, as we remember, [that could] identify fashion shopping attributes. So you would pass an image into the system, and it would identify, okay: This shirt is red, and that material may be silk. And then, in recent years, because we can leverage large-scale VLMs (vision language models), this problem has become much easier.

3:03: And then I think the second route people come in by is via the content organization itself. Back in the day, [there was] research on joint graph modeling over shared attribute similarity. And a lot of ecommerce stores also do, “Hey, if people like this, you might also like that,” and that relationship graph gets captured in their organization tree as well. We utilize a vision large language model and the foundation model CLIP by OpenAI to recognize what this content or piece of clothing could be for. And then we connect that with LLMs to discover all the possibilities—scenarios, use cases, price points—to connect the two worlds together.
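
As a rough illustration of the kind of CLIP-based attribute recognition Faye describes, here is a minimal sketch—not the PinLanding pipeline itself—that scores a product image against candidate attribute descriptions with OpenAI’s CLIP via the Hugging Face transformers library (the image path and labels are hypothetical):

```python
# Zero-shot attribute tagging with CLIP: score an image against text labels.
# Illustrative only; not Pinterest's production code.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dress.jpg")  # hypothetical catalog image
candidate_labels = [
    "a red silk dress",
    "a terracotta linen dress for a summer vineyard wedding",
    "a black wool winter coat",
]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)

for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{p:.2f}  {label}")
```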

3:55: To me that implies you have some rigorous eval process or even a separate team doing eval. Can you describe to us at a high level what is eval like for a system like this? 

4:11: Definitely. I think there are internal and external benchmarks. For the external ones, there’s Fashion200K, a public benchmark anyone can download from Hugging Face that measures how accurately your model predicts fashion item attributes. We measure performance using recall@k, which checks whether the correct label appears among the top-k predicted attributes, and we were able to see 99.7% recall at top ten.
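
For listeners who haven’t worked with retrieval metrics, recall@k just asks whether the correct label shows up in the model’s top k predictions, averaged over the evaluation set. A minimal sketch with toy data (not the Fashion200K harness):

```python
def recall_at_k(ranked_predictions, ground_truth, k=10):
    """Fraction of examples whose correct label appears in the top-k predictions."""
    hits = sum(1 for ranked, gold in zip(ranked_predictions, ground_truth)
               if gold in ranked[:k])
    return hits / len(ground_truth)

# Toy example: the correct attribute is ranked in the top 10 for 2 of 3 items.
preds = [["red", "silk", "midi"], ["blue", "cotton"], ["wool", "coat"]]
gold = ["silk", "linen", "coat"]
print(recall_at_k(preds, gold, k=10))  # ~0.67
```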

4:47: The other topic I wanted to talk to you about is recommendation systems. So obviously there’s now talk about, “Hey, maybe we can go beyond correlation and go towards reasoning.” Can you [tell] our audience, who may not be steeped in state-of-the-art recommendation systems, how you would describe the state of recommenders these days?

5:23: For the past decade, [we’ve been] seeing tremendous movement—foundational shifts in how RecSys essentially operates. Just to call out a few big themes I’m seeing across the board: Number one, it’s moving from correlation to causation. Back then it was, hey, a user who likes X might also like Y. But now we actually understand why content is connected semantically, and our LLM-based models are able to reason about user preferences and what they actually are.

5:58: The second big theme is probably the cold-start problem, where companies leverage semantic IDs to handle new items by encoding and understanding the content directly. For example, if this is a dress, then you understand its color, style, theme, etc.
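
The semantic ID idea, roughly: derive a discrete, content-based identifier for a brand-new item from its embedding, so it can be retrieved and recommended before it has any interaction history. Production systems typically learn these codes (for example with residual quantization); the k-means sketch below, with synthetic embeddings and hypothetical cluster counts, is only meant to show the shape of the approach:

```python
# Toy illustration of semantic IDs for cold-start items. Real systems usually
# learn the codes (e.g., residual quantization); k-means here just shows the idea.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
catalog_embeddings = rng.normal(size=(1000, 64))  # stand-in for content embeddings

# Two code levels: a coarse category code, then a finer style code.
coarse = KMeans(n_clusters=16, random_state=0).fit(catalog_embeddings)
fine = KMeans(n_clusters=64, random_state=0).fit(catalog_embeddings)

new_item = rng.normal(size=(1, 64))  # a brand-new item's content embedding
semantic_id = (int(coarse.predict(new_item)[0]), int(fine.predict(new_item)[0]))
print(semantic_id)  # e.g., (3, 41): usable for retrieval before any clicks exist
```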

6:17: And I think there are other big themes we’re seeing; for example, Netflix is moving from isolated systems toward unified intelligence. Just this past year, Netflix [updated] their multitask architecture with shared representations, into one they call the UniCoRn system, to enable company-wide improvements [and] optimizations.

6:44: And very lastly, I think on the frontier side—this is actually what I learned at the AI Engineer Summit from YouTube. It’s a DeepMind collaboration, where YouTube is now using a large recommendation model, essentially teaching Gemini to speak the language of YouTube: of, hey, a user watched this video, then what might [they] watch next? So a lot of very exciting capabilities happening across the board for sure. 

7:15: Generally it sounds like the themes from years past still map over in the following sense, right? So there’s content—the difference being now you have these foundation models that can understand the content that you have more granularly. It can go deep into the videos and understand, hey, this video is similar to this video. And then the other source of signal is behavior. So those are still the two main buckets?

7:53: Correct. Yes, I would say so. 

7:55: And so the foundation models help you on the content side but not necessarily on the behavior side?

8:03: I think it depends on how you want to see it. For example, on the embedding side, which is a kind of representation of a user entity, there have been transformations [since] back in the day with the BERT Transformer. Now it’s got long-context encapsulation. And those are all with the help of LLMs. So we can better understand users—not just their next or last clicks, but “hey, [in the] next 30 days, what might a user like?”

8:31: I’m not sure this is happening, so correct me if I’m wrong. The other thing that I would imagine that the foundation models can help with is, I think for some of these systems—like YouTube, for example, or maybe Netflix is a better example—thumbnails are important, right? The fact now that you have these models that can generate multiple variants of a thumbnail on the fly means you can run more experiments to figure out user preferences and user tastes, correct? 

9:05: Yes, I would say so. I was lucky enough to be invited to one of the engineer network dinners, [and was] speaking with the engineer who actually works on the thumbnails. Apparently it was all personalized, and the approach you mentioned enabled their rapid iteration of experiments and definitely yielded very positive results for them.

9:29: For the listeners who don’t work on recommendation systems, what are some general lessons from recommendation systems that generally map to other forms of ML and AI applications? 

9:44: Yeah, that’s a great question. A lot of the concepts still apply. For example, the knowledge distillation. I know Indeed was trying to tackle this. 

9:56: Maybe Faye, first define what you mean by that, in case listeners don’t know what that is. 

10:02: Yes. Knowledge distillation is essentially, in the model sense, learning from a parent model—one with many more parameters and better world knowledge (and the same applies to ML systems)—and distilling that into smaller models that can operate much faster but still, hopefully, encapsulate the learning from the parent model.

10:24: So I think what Indeed faced back then was the classic precision-versus-recall trade-off in production ML. Their binary classifier needs to filter the batch of jobs you would recommend to candidates. But this process is obviously very noisy, training data is sparse, and there are latency constraints. In the work they published, they couldn’t really get effective separation of résumé content from Mistral and maybe Llama 2. And then they were happy to learn [that] out-of-the-box GPT-4 achieved something like 90% precision and recall. But obviously GPT-4 is more expensive and has close to 30 seconds of inference time, which is much slower.

11:21: So what they did was use the distillation concept to fine-tune GPT-3.5 on labeled data and then distill it into a lightweight BERT-based model using a temperature-scaled softmax, and they were able to achieve millisecond latency and a comparable recall-precision trade-off. So I think that’s one of the learnings we see across the industry: The traditional ML techniques still work in the age of AI. And I think we’re going to see a lot more of that in production work as well.
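
The temperature-scaled softmax Faye mentions is the standard knowledge-distillation loss: soften both the teacher’s and the student’s output distributions so the student learns the teacher’s relative preferences, not just its top label. A minimal PyTorch sketch of that loss (not Indeed’s actual code; hyperparameters are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft targets from the teacher with the usual hard-label loss."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```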

11:57: By the way, one of the underappreciated things in the recommendation system space is actually UX, in some ways, right? Good UX for delivering recommendations can move the needle. How you present your recommendations might make a material difference.

12:24: I think that’s very much true. Although I can’t claim to be an expert on it because I know most recommendation systems deal with monetization, so it’s tricky to put, “Hey, what my user clicks on, like engage, send via social, versus what percentage of that…

12:42: And it’s also very platform specific. So you can imagine TikTok as one single feed—the recommendation is just on the feed. But YouTube is, you know, the stuff on the side or whatever. And then Amazon is something else. Spotify and Apple [too]. Apple Podcast is something else. But in each case, I think those of us on the outside underappreciate how much these companies invest in the actual interface.

13:18: Yes. And I think there are multiple iterations happening on any given day, [so] you might see a different interface than your friends or family because you’re actually being grouped into A/B tests. So I think it’s very much true that the engagement and performance of the UX have an impact on a lot of the search/rec system as well, beyond the data we just talked about.

13:41: Which brings to mind another topic that I’ve been interested in over many, many years, which is this notion of experimentation. Many of the most successful companies in this space have invested in experimentation tools and platforms, where people can run experiments at scale. Those experiments can be done much more easily and monitored in a much more principled way, so that anything they do is backed by data. So I think companies underappreciate the importance of investing in such a platform.

14:28: I think that’s very much true. A lot of larger companies actually build their own in-house A/B testing or experimentation frameworks. Meta does; Google has their own. And even different cohorts of products—monetization, social, and so on—have their own niche experimentation platforms. So I think that thesis is very much true.

14:51: The last topic I wanted to talk to you about is context engineering. I’ve talked to numerous people about this. So every six months, the context window for these large language models expands. But obviously you can’t just stuff the context window full, because one, it’s inefficient. And two, actually, the LLM can still make mistakes because it’s not going to efficiently process that entire context window anyway. So talk to our listeners about this emerging area called context engineering. And how is that playing out in your own work? 

15:38: I think this is a fascinating topic, where you will hear people passionately say, “RAG is dead.” And it’s really, as you mentioned, [that] our context window gets much, much bigger. Like, for example, back in April, Llama 4 had this staggering 10 million token context window. So the logic behind this argument is quite simple. Like if the model can indeed handle millions of tokens, why not just dump everything instead of doing a retrieval?

16:08: I think there are quite a few fundamental limitations to this. I know folks from Contextual AI are passionate about this. I think number one is scalability. A lot of times in production, at least, your knowledge base is measured in terabytes or petabytes—not tokens—so something even larger. And number two, I think, would be accuracy.

16:33: The effective context window is very different—honestly—from what is advertised in product launches. We see performance degrade long before the model reaches its “official limits.” And then I think number three is probably efficiency, and that kind of aligns with our human behavior as well: Do you read an entire book every time you need to answer one simple question? So I think context engineering has slowly evolved from a buzzword a few years ago into an engineering discipline now.

17:15: I’m appreciative that the context windows are increasing. But at some level, I also acknowledge that, to some extent, it’s kind of a feel-good move on the part of the model builders. It makes us feel good that we can put more things in there, but it may not actually help us answer the question precisely. Actually, a few years ago, I wrote kind of a tongue-in-cheek post called “Structure Is All You Need.” Basically, whatever structure you have, you should use it to help the model, right? If the data is in a SQL database, then maybe you can expose the structure of the data. If it’s a knowledge graph, you leverage whatever structure you have to give the model better context. So the case against just stuffing the model with as much information as possible, for all the reasons you gave, is valid. But also, philosophically, it doesn’t make any sense to do that anyway.
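
As a minimal illustration of the “use the structure you have” point: rather than stuffing raw rows into the window, you can hand the model just the schema. The table and column names below are hypothetical, and how the prompt actually gets sent to a model is left out:

```python
# Give the model the structure (schema), not the whole database.
schema = """
CREATE TABLE orders (order_id INT, user_id INT, total_usd DECIMAL, created_at DATE);
CREATE TABLE users  (user_id INT, country TEXT, signup_date DATE);
"""

question = "What was the average order value for UK users who signed up in 2024?"

prompt = (
    "You write SQL. Use only the tables and columns in this schema.\n"
    f"{schema}\n"
    f"Question: {question}\nSQL:"
)
# `prompt` then goes to whatever LLM you use; the schema gives precise context
# in a few hundred tokens instead of an entire dumped table.
```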

18:30: What are the things that you are looking forward to, Faye, in terms of foundation models? What kinds of developments in the foundation model space are you hoping for? And are there any developments that you think are below the radar? 

18:52: I think, to better utilize the concept of “context engineering,” there are essentially two loops. Number one, there’s the inner loop—what happens within the LLM. And then there’s the outer loop: What can you do as an engineer to optimize a given context window, etc., to get the best results out of the product? Within the context loop, there are multiple tricks we can do: For example, there’s vector plus Excel or regex extraction, and there are metadata filters. And then for the outer loop—this is a very common practice—people are using LLMs as a reranker, sometimes a cross-encoder. So the thesis is, hey, why would you overburden an LLM with ranking 20,000 candidates when there are things you can do to reduce that to the top hundred or so? So all of this—context assembly, deduplication, and diversification—helps our production systems [go] from a prototype to something [that’s] more real-time, reliable, and able to scale.
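
A simplified sketch of the retrieve-then-rerank pattern described here, using the sentence-transformers library: a cheap bi-encoder narrows a large corpus to a shortlist, and a cross-encoder reranks only that shortlist before anything is assembled into the LLM’s context. The model names and documents are illustrative assumptions, not what any particular production system uses:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")              # cheap recall stage
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # precise scoring stage

query = "summer vineyard wedding dress"
docs = ["terracotta linen midi dress", "black wool overcoat", "floral silk gown"]

# Stage 1: vector retrieval narrows the corpus to the top candidates.
doc_emb = retriever.encode(docs, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# Stage 2: the cross-encoder reranks only the shortlist.
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)
for (_, doc), score in sorted(zip(pairs, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {doc}")
```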

20:07: One of the things I wish for—and I don’t know, this is wishful thinking—is that the models could be a little more predictable. That would be nice. By that, I mean, if I ask a question in two different ways, it’ll basically give me the same answer. It would help if the foundation model builders could somehow increase predictability and maybe provide us with a little more explanation for how they arrive at the answer. I understand they’re giving us the tokens, and maybe some of the reasoning models are a little more transparent, but give us an idea of how these things work, because it’ll impact what kinds of applications we’d be comfortable deploying these things in. For example, agents. If I’m using an agent to use a bunch of tools but I can’t really predict its behavior, that impacts the types of applications I’d be comfortable using a model for.

21:18: Yeah, definitely. I very much resonate with this, especially now that most engineers have, you know, AI-powered coding tools like Cursor and Windsurf—and as an individual, I very much appreciate the train of thought you mentioned: why an agent does certain things. Why is it navigating between repositories? What is it looking at while it’s doing this call? I think these are very much appreciated. I know there are other approaches—look at Devin, the fully autonomous engineer peer. It just takes things, and you don’t know where it goes. But I think in the near future there will be a nice marriage between the two, especially now that Windsurf is part of Devin’s parent company.

22:05: And with that, thank you, Faye.

22:08: Awesome. Thank you, Ben.
