Towards talking to computers with Codex
About seven years ago, when I was a junior in high school, I built a “self-learning natural language search engine” called Wykki.
It used “natural language” in that it was able to separate a user’s prompt like “How old is Barack Obama” into a question stub (“How old is blank”) and a subject (“Barack Obama”) using some hard-coded tricks and simple heuristics.
It then had a backend that connected those question stubs to properties in Freebase — think Wikipedia-as-a-database — so it could answer that question about Obama with his person.age property.
Wykki was also “self-learning” in that, if it came across a question stub it hadn’t seen before, it had a UI that let users “teach” it which Freebase property that question referred to.
Once you knew about those tricks it wasn’t all that impressive — I wouldn’t use “natural language” or “self-learning” to describe Wykki today — but seeing it work for the first time was a pretty cool experience.
Wykki was never more than a short-lived side project, but it got me really excited about the idea of accessing APIs (Freebase in this case) using natural language — and made me realize how difficult of a problem it is.
Over the past few years I’ve had a lot of shower thoughts about how I’d approach it with the background knowledge I have now — like maybe learning APIs from their docs pages or auto-generated OpenAPI specs — but those never materialized into anything.
The Codex live demo
This week, a thirty-minute Codex demo by OpenAI’s Greg Brockman, Ilya Sutskever and Wojciech Zaremba showed me we’re now much closer to solving this problem than I could’ve imagined even a year ago.
As I wrote about last month, Codex is OpenAI’s latest giant language model that can write code, which also powers GitHub’s autocomplete-on-steroids Copilot tool.
Tuesday’s Codex demo started off with a bit of history, mostly about how the code generation demos that people were doing with GPT-3 last summer inspired the researchers to build a benchmark capturing those types of tasks, and to then start optimizing a GPT-type model to solve it (mostly by training it on a lot of open-source code).
Thus the initial version of Codex powering Copilot was born, followed quickly by the improved version that was behind the private beta, a coding challenge, and the four demonstrations during the presentation.
(Demo links go to timestamps in the YouTube recording.)
The first demo was fairly simple: telling Codex to “say Hello World” produced a Python program that prints “Hello, World!” to the console.
In the next commands, they asked it to “say that with empathy,” “say that five times,” and “wrap it in a web server,” showing that Codex can write some more complex code, but more importantly that it keeps track of the commands it has received and the code it has written so far, and uses them as context for what it writes next.
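Roughly, the program that this sequence of instructions builds up to would look something like the sketch below (my reconstruction, not Codex’s literal output):

```python
# A rough reconstruction of the kind of program the "Hello World" demo built up
# step by step; the code Codex actually generated in the demo may have differed.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # "say Hello World" + "say that with empathy" + "say that five times"
    return "\n".join(["Hello, World! It's so wonderful to see you!"] * 5)

if __name__ == "__main__":
    app.run()  # "wrap it in a web server"
```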
Another demo was quite similar, but a lot more complex: writing a simple game in Javascript: “add a person,” “make it 100 pixels tall,” “make it move left and right with the arrow keys,” “stop it from going off-screen,” and “make the person lose when it collides with a boulder.” The key thing here was that Codex works best when you keep asking it to take small steps, and that it’s always easy to go back and try slightly different phrasing to improve your results.
“We think these text instructions will become a type of source code that people can pass around [instead of the actual code].”
The next demo (which actually happened second) is where it gets really interesting.
Viewers were asked to leave their email addresses in a web form.
We then watched as the demonstrators used Codex to create a small Python script that looked up the current Bitcoin price and emailed it to us.
Crucially, they did not explicitly tell Codex how to find out the current Bitcoin price — from training on millions of lines of open-source code, it apparently already knew that Coinbase has a world-readable API for querying this.
You could ask Codex to write a program that uses the current Bitcoin price without knowing what Coinbase is, or without even knowing what a REST API is!
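A script like the one Codex wrote might look roughly like this sketch, which assumes Coinbase’s public spot-price endpoint and a locally available SMTP server; it’s my own reconstruction, not the demo’s actual code:

```python
# Sketch of a "current Bitcoin price by email" script in the spirit of the demo;
# the Coinbase endpoint and SMTP setup here are my assumptions, not the demo's code.
import smtplib
from email.message import EmailMessage

import requests

# Coinbase's public spot-price endpoint (no API key required).
resp = requests.get("https://api.coinbase.com/v2/prices/BTC-USD/spot")
price = resp.json()["data"]["amount"]

msg = EmailMessage()
msg["Subject"] = f"Bitcoin is at ${price} right now"
msg["From"] = "codex-demo@example.com"  # placeholder addresses
msg["To"] = "viewer@example.com"
msg.set_content(f"The current Bitcoin price is ${price} (via Coinbase).")

with smtplib.SMTP("localhost") as server:  # assumes a local SMTP server
    server.send_message(msg)
```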
They took this idea to the next level in the fourth and final demo, for which they switched to an iPad running Microsoft Word with a custom Codex plugin.
The plugin mostly consisted of a big button to trigger speech recognition, the output of which got fed into Codex, which then translated it to code and ran it (with a bit of glue to give it access to Word’s APIs).
This enabled some really cool interactions.
After pasting some badly-formatted text, they could for example say “remove initial spaces,” and a few seconds later Codex had written and run code that used the Word API to iterate through each line of text and delete any leading spaces.
Next, they said “make every fifth line bold,” and a few seconds later… every fifth line was bold!
That’s where the demo ended, but this got me really excited.
Modern software and services are full of functionality that’s hidden three layers deep in some convoluted UI or API, and that most people therefore never learn how to use.
Codex plugins like this can enable those people to use that functionality — and they won’t even have to know that under the hood it’s doing this by generating code on the fly.
Brockman on Twitter, a few hours after the demo:
The history of computing has been moving the computer closer to the human — moving from punch cards to assembly to higher level languages.
Codex represents a step towards a new interface to computers — being able to talk to your computer and having it do what you intend.
There are a lot of unanswered questions about how well this works with arbitrary APIs and outside of a controlled demo environment, but given OpenAI’s track record with GPT-x I’m not too worried about those.
I really think that during that half hour last Tuesday evening, I witnessed the next big thing in how we’ll interact with our computing devices a few years from now.
Exciting!!
Karpathy on Tesla Autopilot at CVPR ’21
Tesla’s head of AI Andrej Karpathy did a keynote at the CVPR 2021 Workshop on Autonomous Driving with updates on the company’s Autopilot self-driving system.
Just like his talk last year at Scaled ML 2020, this was a great watch if you’re interested in productized AI.
The talk kicks off with the value that “incremental autonomy” is already providing today, in the form of automatic emergency braking, traffic control warnings (“there’s a red light ahead!”), and pedal misapplication mitigation (PMM) — stopping the driver from flooring it when they meant to hit the brakes.
Examples of “incremental autonomy”
Karpathy then goes into details of the next generation of Autopilot: Tesla has “deleted” the radar sensor from recent new cars and is now relying on vision alone.
“If our [human] neural network can determine depth and velocity, can synthetic neural nets do it too?
Internally [at Tesla], our answer is an unequivocal yes.” This is backed by the fact that the new vision-only approach for Autopilot has a higher precision and recall than the previous sensor fusion approach.
Where does the Autopilot team get a large and diverse enough dataset to train a vision model like this?
From the million-car fleet of course!
There are now 221 manually-implemented triggers running on the Tesla fleet to detect scenarios that they may want to look at for training data.
(Could “inactive traffic lights on the back of a moving truck” be the 222nd?) Once collected, these images are labeled offline with a combination of human annotators, the old radar sensors, and very large neural nets — which would be too slow to deploy in the cars, but are very useful in this offline setting.
The loop of the Tesla Data Engine is then: (1) deploy models in ghost mode; (2) observe their predictions; (3) fine-tune triggers for collecting new training data; (4) create new unit tests out of wrong predictions; (5) add similar examples to the dataset; (6) retrain; and repeat.
At 1.5 petabytes, the final dataset for this first release of the new Autopilot system went through this shadow mode loop seven times.
It contains six billion labeled objects across one million 10-second videos.
The neural network trained on this data has a ResNet-ish backbone for basic image processing, which branches into “heads,” then “trunks,” and then “terminal” detectors.
This amortizes learning into different levels, and allows multiple engineers to first work on different heads in parallel and then sync up to retrain the backbone.
I hadn’t heard of this structure for letting a large (50-ish person) team collaborate on one big neural network before — very cool.
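Karpathy didn’t show code, but the shared-backbone-with-heads pattern itself is easy to sketch; the toy PyTorch model below is my own illustration of the idea, not Tesla’s actual network:

```python
# Toy illustration of a shared backbone with per-task heads -- the collaboration
# pattern Karpathy described, not Tesla's actual architecture.
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared feature extractor that all teams periodically retrain together.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Separate heads that different engineers can iterate on in parallel.
        self.traffic_light_head = nn.Linear(64, 4)  # e.g. red/yellow/green/off
        self.lane_head = nn.Linear(64, 8)           # e.g. lane-boundary parameters
        self.object_head = nn.Linear(64, 10)        # e.g. object classes

    def forward(self, x):
        features = self.backbone(x)
        return {
            "traffic_lights": self.traffic_light_head(features),
            "lanes": self.lane_head(features),
            "objects": self.object_head(features),
        }

model = MultiHeadNet()
outputs = model(torch.randn(2, 3, 224, 224))  # a batch of two camera frames
```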
And finally, on the deployment side, Tesla is now also vertically-integrated: they built their own FSD (“Full Self Driving”) Computer, with their own neural engine.
Karpathy wrapped by re-emphasizing auto-labeling: using a much heavier model than you could ever use in production to do (a first stab at) data labeling offline, to then be cleaned up a bit by a human, is very powerful.
And his overall conclusion remained in line with Tesla’s overall stance on self-driving: no fleet, no go.
GitHub Copilot + OpenAI Codex = Microsoft synergy?
GitHub Copilot
GitHub previewed Copilot, “your AI pair programmer,” this week.
Accessed through a Visual Studio Code extension and powered by OpenAI’s brand-new Codex language model, it auto-suggests “whole lines or entire functions right inside your editor.” These suggestions are based on context from the rest of your code.
You can, for example, write a method’s signature and a docstring comment describing what it should do, and Copilot may be able to synthesize the rest of the method for you.
Other use cases include autofilling repetitive code, generating tests based on method implementations (which seems a bit backward?), and showing alternative code completions.
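For the signature-plus-docstring use case, the kind of prompt/completion pair I have in mind looks like the sketch below; the body is my guess at a plausible suggestion, not actual Copilot output:

```python
# I'd write the signature and docstring; a Copilot-style tool might then suggest
# a body like this one (my guess at a plausible completion, not real output).
def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards,
    ignoring case, spaces, and punctuation."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]
```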
One of the places where Copilot really shines is in helping developers navigate new packages and frameworks.
In my job as an ML engineer, I often run into the problem of finding a package that may help me do a thing I need to do, but not knowing exactly how to get it to do that thing because I’m not familiar with the package’s architecture, standards, and quirks (hi pandas).
In that situation, I now usually context switch to Google and StackOverflow to see a few examples of the package in use.
Copilot can bring this process right into my IDE: I could just import the package, write a comment describing what I want to do, and cycle through a few examples that Copilot learned from open-source code until I understand how the package wants me to interact with it.
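Concretely, that comment-driven workflow could look something like this; the completion below is what I’d hope a tool like Copilot suggests for a pandas chain I can never remember, not something Copilot actually produced:

```python
import pandas as pd

# Hypothetical order data; in practice this would come from a CSV or database.
df = pd.DataFrame({
    "order_date": ["2021-05-01", "2021-05-20", "2021-06-03"],
    "order_value": [20.0, 35.0, 50.0],
})

# The comment I'd write as a prompt:
# average order value per month, sorted from highest to lowest

# ...and the kind of completion I'd hope to cycle through (my guess, not real output):
monthly_avg = (
    df.assign(month=pd.to_datetime(df["order_date"]).dt.to_period("M"))
      .groupby("month")["order_value"]
      .mean()
      .sort_values(ascending=False)
)
print(monthly_avg)
```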
OpenAI’s Harri Edwards describes this quite eloquently:
Trying to code in an unfamiliar language by googling everything is like navigating a foreign country with just a phrase book.
Using GitHub Copilot is like hiring an interpreter.
I also like Patrick McKenzie’s take on Twitter:
I’m probably more bullish on this product than my model of most programmers.
Contrary to naive expectations, it doesn’t decrease demand for programmers; it probably decreases unproductive time of junior programmers stumped by the “white page problem.”
For many years folks, often non-technical, have mentioned tauntingly “Wait until you automate programmers out of a job” and that was the exact opposite of what happened when we introduced cutting edge “AI” [emphasis mine] like compilers and interpreters to liberate programmers from programming.
Besides looking like it’ll be a very cool and useful tool, Copilot’s launch is also interesting in a broader productized AI context.
From last October’s OpenAI and Microsoft: GPT-3 and beyond in DT #50:
So this suggests that the partnership goes beyond just the exchange of Microsoft’s money and compute for OpenAI’s trained models and ML brand strength (an exchange of cloud for clout, if you will) that we previously expected.
Are the companies actually also deeply collaborating on ML and systems engineering research?
I’d love to find out.
If so, this could be an early indication that Microsoft — who I’m sure is at least a little bit envious of Google’s ownership of DeepMind — will eventually want to acquire OpenAI.
And it could be a great fit.
Looking at Microsoft’s recent acquisition history, it has so far let GitHub (which it acquired two years ago) continue to operate largely autonomously.
Microsoft hasn’t acquired OpenAI (yet?), but we can obviously see its stake in the company at work here.
After last month’s launch of GPT-3-powered code completion in Microsoft Power Platform, I expected to see more of the same: mostly small features in Microsoft’s Office-related suite of products, powered by fine-tuned GPT-3 models.
This is different.
First, Copilot is powered by a new, as yet unpublished OpenAI model: Codex, which “has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.” This isn’t just a slightly finetuned GPT-3.
Second, Copilot is distinctly a feature built into GitHub, not into a Microsoft-branded product.
GitHub still appears to operate mostly independently (other than a few Azure integrations) but — and I hate to use the word — that’s some serious synergy between these two companies Microsoft has a stake in.
From the Copilot FAQ:
If the technical preview is successful, our plan is to build a commercial version of GitHub Copilot in the future.
We want to use the preview to learn how people use GitHub Copilot and what it takes to operate it at scale.
I’m guessing that right now, GitHub’s use of Codex is free (or at least covered by Microsoft’s OpenAI investment), and that they’re sharing a lot of data back and forth about how Copilot is being used.
When GitHub commercializes though, I wonder what this relationship will be.
Will Microsoft (exclusively?) license and run the model on their own infrastructure, or will they ping OpenAI’s Codex API?
And if it’s the latter, what will differentiate Copilot from any other IDE plugins that ping that same API?
Can anyone just undercut Copilot’s pricing by piping Codex results into an editor extension at lower margins?
As I wrote in last July’s One AI model, four competing services, though, there may be room in a productized AI market for many apps / services powered by the same model — you can always differentiate on UX or a more specialized market.
The technical preview of Copilot works best for Python, JavaScript, TypeScript, Ruby, and Go.
I’ve joined the waitlist for Copilot, and I’m excited to try it out at work.
For one, I wonder how well it’ll understand our internal Python packages, which don’t appear in any open-source codebases — I guess that may be a good test of how well we adhere to coding standards.
In line with that, I imagine a version of Codex / Copilot finetuned to a company’s entire internal codebase could be a very cool upsell for this product, especially when that company’s code is already hosted on GitHub.
Dare I say synergy again?
Artificial Intelligence and COVID-19
Although my daily new arXiv submissions notification emails have been full of papers about fighting COVID-19 with AI for the past year and a half, I’ve so far decided against writing about them in DT.
From early on in the pandemic, the preprints all seemed quite far removed from real-world applications, and I’m generally always a bit hesitant when I see AI pitched as a silver bullet solution to big societal problems.
I’m revisiting that now because Maxime Nauwynck, biomedical engineer and former PhD student at the UAntwerp Vision Lab, has written an extensive overview of how AI has contributed to dealing with the COVID-19 pandemic for The Gradient.
I still think I was mostly right to skip covering all the preprints — as Nauwynck highlights for example, a review of 300+ arXiv articles on detecting COVID-19 in CT images by Roberts et al. (2020) found that not a single one was fit for clinical use — but there are actually now a few cool AI-powered systems related to COVID-19 deployed in the real world.
These are all from Nauwynck’s article, so check that out for the full details, but I’ll highlight a few of the ones I found most interesting:
- BlueDot and HealthMap, two companies that use natural language processing to scrape local news, warned customers about “a new type of pneumonia in Wuhan, China” on December 30th and 31st 2019, respectively — a solid week before the US Centers for Disease Control and World Health Organization did the same.
- Alizila (part of Alibaba) has a system for detecting COVID-19 in CT scans, that by March of 2020 had already helped diagnose over 30,000 people across 26 hospitals in China. Now that PCR tests and rapid tests have become much more widely available over the past year, though, I don’t know if such systems are still in use.
- To forecast/nowcast the actual (not just positive-tested) numbers of COVID-19 cases, hospitalizations, and deaths for a region, several organizations now use machine learning models and ensembles. Youyang Gu’s model was quite popular on Twitter for a while, and the US CDC has one too.
- DeepMind used AlphaFold 2 to predict the shapes of some proteins related to COVID-19.
Nauwynck also goes into some more cutting-edge research, like AI-powered (or at least AI-assisted) medicine and vaccine development, but beyond some automated electron microscopy image segmentation tools that help reduce manual labor, those approaches don’t seem to have had many real-world applications yet.
I do think, though, that we’ll now see a lot more attention (and funding) going to AI-assisted medicine than we did before the pandemic, similar to how the development of COVID-19 vaccines has accelerated mRNA-based vaccine technology.
That means the coming few years will be pretty exciting for AI-assisted life science.
To follow along with those developments, I recommend Nathan Benaich’s monthly Your Guide to AI newsletter, which has a recurring AI in Industry: life (and) science section.
The AI Incident Database
The Discover app of the AI Incident Database
The Partnership on AI to Benefit People and Society (PAI) is an international coalition of organizations with the mission “to shape best practices, research, and public dialogue about AI’s benefits for people and society.” Its 100+ member organizations cover a broad range of interests and include leading AI research labs (DeepMind, OpenAI); several universities (MIT, Cornell); most big tech companies (Google, Apple, Facebook, Amazon, Microsoft); news media (NYT, BBC); and humanitarian organizations (Unicef, ACLU).
PAI recently launched a new project: the AI Incident Database .
The AIID mimics the FAA’s airplane accidents database and is similarly meant to help “future researchers and developers avoid repeated bad outcomes.” It’s launching with a set of 93 incidents, including an autonomous car that killed a pedestrian, a trading algorithm that caused a flash crash, and a facial recognition system that caused an innocent person to be arrested (see DT #43).
For each incident, the database includes a set of news articles that reported about it: there are over 1,000 reports in the AIID so far.
It’s also open source on GitHub, at PartnershipOnAI/aiid.
Systems like this (and Amsterdam’s AI registry, for example) are a clear sign that productized AI is quickly starting to mature as a field, and that lots of good work is being done to manage its impact.
Most importantly, I hope these projects will help us have more sensible discussions about regulating AI.
Benedict Evans’ essays Notes on AI Bias and Face recognition and AI ethics are excellent reads on this; he compares calls to “regulate AI” to wanting to regulate databases — it’s not the right level of abstraction, and we should be thinking about specific policies to address specific problems instead.
A dataset of categorized AI incidents, managed by a broad coalition of organizations, sounds like a great step in this direction.
Photoshop's Neural Filters
Light direction is one of many new AI-powered features in Photoshop; in the middle picture, the light source is on the left; in the right picture, it’s moved to the right.
Adobe’s latest Photoshop release is jam-packed with AI-powered features.
The pitch, by product manager Pam Clark:
You already rely on artificial intelligence features in Photoshop to speed your work every day like Select Subject, Object Selection Tool, Content-Aware Fill, Curvature Pen Tool, many of the font features, and more.
Our goal is to systematically replace time-intensive steps with smart, automated technology wherever possible.
With the addition of these five major new breakthroughs, you can free yourself from the mundane, non-creative tasks and focus on what matters most – your creativity.
Adobe is branding the most exciting of these new features as Neural Filters: neural-network-powered image manipulations that are parameterized by sliders in the Photoshop UI.
Some of them automate tasks that were previously very labor-intensive, while others enable changes that were previously impossible.
Here’s a few of both:
- Style transfer: apply one photo’s style to another, like the classic “make this look like a Picasso / Van Gogh / Monet.”
- Smart portraits: subtly change a photo subject’s age, expression, gaze direction, pose, hair thickness, etc.
- Colorize: infer colors for black-and-white photos based on their contents.
- JPEG Artifacts Removal: smooth out the blocky artifacts that occur on patches of JPEG-compressed photos.
These all run on-device and came out of a collaboration between Adobe Research and NVIDIA, implying they’re best suited to machines with beefy GPUs — not surprising.
However, the blog post is a little vague about the specifics here (“performance is particularly fast on desktops and notebooks with graphics acceleration”), so I wonder whether Neural Filters is also optimized for any other AI accelerator chips that Adobe can’t mention yet.
In particular, Apple recently showed off their new A14 chips that feature a much faster Neural Engine.
These chips launched in the latest iPhones and iPads but will also be in a new line of non-Intel “Apple Silicon” Macs, rumored to be announced next month — what are the chances that Apple will boast about the performance of Neural Filters on the Neural Engine during the presentation?
I’d say pretty big.
(Maybe worthy of a Ricky, even?)
Anyway, this Photoshop release is exactly the kind of productized AI that I started DT to cover: advanced machine learning models — that only a few years ago were just cool demos at conferences — wrapped up in intuitive UIs that fit into users’ existing workflows.
It’s now just as easy to tweak the intensity of a smile or the direction of a gaze in a portrait photo as it is to manipulate its hue or brightness.
That’s pretty amazing.
OpenAI and Microsoft: GPT-3 and beyond
OpenAI is exclusively licensing GPT-3 to Microsoft.
What does this mean for their future relationship?
GPT-3 is OpenAI’s latest gargantuan language model (see DT #42) that’s uniquely capable of performing many different “text-in, text-out” tasks — demos range from imitating famous writers to generating code (#44) — without needing to be fine-tuned: its crazy scale makes it a few-shot learner.
In July 2019, OpenAI announced it got a $1 billion investment from Microsoft.
Back then, this raised some eyebrows in the (academic) machine learning community, which can sometimes be a bit allergic to the commercialization of AI (#19).
The exact terms of the investment were never disclosed, but some key elements of the deal were.
Tom Simonite for WIRED:
Most interesting bit of the OpenAI announcement: “we intend to license some of our pre-AGI technologies, with Microsoft becoming our preferred partner.”
Now, a year and a bit later, that’s exactly what happened.
From the OpenAI blog:
In addition to offering GPT-3 and future models via the OpenAI API, and as part of a multiyear partnership announced last year, OpenAI has agreed to license GPT-3 to Microsoft for their own products and services.
What does that mean?
Nick Statt for The Verge:
A Microsoft spokesperson tells The Verge that its exclusive license gives it unique access to the underlying code of GPT-3, which contains technical advancements it hopes to integrate into its products and services.
In their blog post, Microsoft pitches this as a way to “expand [their] Azure-powered AI platform in a way that democratizes AI technology,” to which the community again reacted negatively: if you want to democratize AI, why not just open-source GPT-3’s code and training data?* I agree that “democratizing” is a bit of a stretch, but I think there’s a much more interesting discussion to be had here than the one on a self-congratulatory word choice in a corporate press release.
Perhaps ironically, that discussion also starts from overanalyzing another few words in that very same press release.
According to Microsoft’s blog post about the licensing deal, GPT-3 “is trained on Azure’s AI supercomputer.” I wonder if that means OpenAI is now using Microsoft’s open-source DeepSpeed library (#34) to train its GPT models.
DeepSpeed is a library for distributed training of enormous ML models that has specific features to support training large Transformers; Microsoft Research claimed in May that it’s capable of training models with up to 170 billion parameters (#40).
GPT-3 is a 175-billion-parameter Transformer that was released in June, just one month later.
That seems unlikely to be a coincidence, and Microsoft’s latest DeepSpeed update (#49) even includes some experimental work using the GPT-3 architecture.
So this suggests that the partnership goes beyond just the exchange of Microsoft’s money and compute for OpenAI’s trained models and ML brand strength (an exchange of cloud for clout, if you will) that we previously expected.
Are the companies actually also deeply collaborating on ML and systems engineering research?
I’d love to find out.
If so, this could be an early indication that Microsoft — who I’m sure is at least a little bit envious of Google’s ownership of DeepMind — will eventually want to acquire OpenAI.
And it could be a great fit.
Looking at Microsoft’s recent acquisition history, it has so far let GitHub (which it acquired two years ago) continue to operate largely autonomously.
This makes it an attractive potential parent company for OpenAI: the lab probably wouldn’t have to give up too much of its independence under Microsoft’s stewardship.
So unless OpenAI actually invents and monetizes some form of artificial general intelligence (AGI) in the next five to ten years — which I don’t think they will — I wouldn’t be surprised if they end up becoming Microsoft’s DeepMind.
* One big reason for not open-sourcing GPT-3’s code and data is security; see my coverage of OpenAI’s staged release strategy for GPT-2 (#8, #13, #22, #27).
Autonomous trucks will be the first big self-driving market
Autonomous trucking is where I think self-driving vehicle technology will have its first big impact, long before e.g. the taxi or ride-sharing industries.
Long-distance highway truck driving — with hubs at city borders where human drivers take over — is a much simpler problem to solve than inner-city taxi driving.
Beyond the obvious lower complexity of not having to deal with traffic lights, small streets and pedestrians, a specific highway route between two high-value hubs can also be mapped in high detail much more economically than an ever-changing city center could.
And, of course, self-driving trucks won’t have the 11-hour-per-day driving safety limit imposed on human drivers.
Taken together, this makes for quite an attractive pitch.
In recent news, Jennifer Smith at the Wall Street Journal reported that startup Ike Robotics has reservations for its first 1,000 heavy-duty autonomous trucks, from “transport operators Ryder System Inc., NFI Industries Inc. and the U.S. supply-chain arm of German logistics giant Deutsche Post AG.”
Tapping into big carriers’ logistics networks and operational expertise means Ike can focus on the technology piece—systems engineering, safety and technical challenges such as computer vision—said Chief Executive Alden Woodrow.
“They are going to help us make sure we build the right product, and we are going to help them prepare to adopt it and be successful,” said Mr. Woodrow, who worked on self-driving trucks at Uber Technologies Inc. before co-founding Ike in 2018.
Unlike rival startups, Ike wants to be a software-as-a-service provider of self-driving tech for existing logistics operators, instead of becoming one themselves.
It’ll be interesting to see how well this business model works out when competitors start offering a similar service — the biggest question is how easy or hard it’ll be for an operator to swap one self-driving SaaS out for another.
If it’s easy, that’ll make for a very competitive space.
(On the disruption side: there are nearly 3 million truck drivers in the United States alone, so widespread automation here can be quite impactful.
Until today, I thought trucking was the biggest profession in most US states because of this 2015 NPR article, but apparently that was based on wrongly interpreted statistics; the most common job is in retail — no surprise there.
Nonetheless, trucking is currently a major profession.
A decade from now it may no longer be.)
The deepfake detection ratrace
Microsoft is launching Video Authenticator, an app that helps organizations “involved in the democratic process” detect deepfakes — videos that make people look like they’re saying things they’ve never said by superimposing automatically-generated voice tracks and face movements over real videos.
Deepfakes are usually made using generative adversarial networks (GANs) like those in Samsung AI’s neural avatars project (see DT #15) and in the popular open-source DeepFaceLab app.
Because of all the obvious ways in which deepfakes can be abused, this has been a popular research area for technology platform companies: a bit over a year ago, Facebook launched their deepfake detection challenge and Google contributed to TU Munich’s FaceForensics benchmark (#23).
Microsoft has now productized these research efforts with Video Authenticator.
The app checks photos and videos for the “subtle fading or greyscale elements” that may occur at a deepfake’s blending boundary — where the fake facial movements mix in with the real background media — and gives users a confidence score for whether a face is manipulated.
This happens in real-time and frame-by-frame for videos, which I imagine will be particularly useful for detecting subtle fakery, like a mostly-real video with a few small tweaks that change its message.
Video Authenticator initially won’t be made publicly available.
Instead, Microsoft is privately distributing it to news outlets, political campaigns, and media companies through the AI Foundation’s Reality Defender 2020 program, “which will guide organizations through the limitations and ethical considerations inherent in any deepfake detection technology.” This makes sense, since deepfakes represent a typical cat-and-mouse AI security game — new models will surely be trained specifically to fool Video Authenticator, which this limited release approach attempts to slow down.
I’d be interested to learn about how organizations integrate Video Authenticator into their existing workflows for validating the veracity of newsworthy videos.
I haven’t really come across any examples of big-name news organizations getting fooled by deepfakes yet, but I imagine it’s much more common on social media where videos aren’t vetted by journalists before being shared.
Snapchat's platform for creative ML models
SnapML is a software stack for building Lenses that use machine learning models to interact with the Snapchat camera.
You can build and train a model in any ONNX-compatible framework (like TensorFlow or PyTorch) and drop it straight into Snapchat’s Lens Studio as a SnapML component.
SnapML can then apply some basic preprocessing to the camera feed, run it through the model, and format the outputs in a way that other Lens Studio components can understand.
A segmentation model outputs a video mask, an object detection model outputs bounding boxes, a style transfer model outputs a new image, etc.
You even have control over how the model runs: once every frame, in the background, or triggered by a user action.
(More details in the docs.)
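The model-export half of that workflow is standard ONNX tooling; for a PyTorch model it might look roughly like the sketch below (the tiny model and input shape are placeholders, and the Lens Studio import itself happens in the app):

```python
# Sketch of exporting a trained PyTorch model to ONNX so it can be dropped into
# Lens Studio as a SnapML component; the model and input shape are placeholders.
import torch
import torch.nn as nn

# Stand-in for whatever you actually trained (e.g. a small segmentation net).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),  # single-channel mask output
)
model.eval()

dummy_input = torch.randn(1, 3, 256, 256)  # the camera-frame size you expect
torch.onnx.export(
    model,
    dummy_input,
    "segmentation.onnx",   # this file is what gets imported into Lens Studio
    input_names=["image"],
    output_names=["mask"],
)
```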
Matthew Moelleman has written some great in-depth coverage on SnapML for the Fritz AI Heartbeat blog, including a technical overview and a walkthrough of making a pizza segmentation Lens.
As he notes, SnapML has the potential to be super interesting as a platform:
Perhaps most importantly, because these models can be used directly in Snapchat as Lenses, they can quickly become available to millions of users around the world.
Indeed, if you create an ML-powered Snapchat filter in Lens Studio, you can easily publish and share it using a Snapcode, which users can scan to instantly use the Lens in their snaps.
I don’t think any other platform has such a streamlined (no-code!) system for distributing trained ML models directly to a user base of this size.
Early SnapML user Hart Woolery, also speaking to Heartbeat:
It’s a game-changer.
At least within the subset of people working on live-video ML models.
This now becomes the easiest way for ML developers to put their work in front of a large audience.
I would say it’s analogous to how YouTube democratized video publishing.
It also lowers the investment in publishing, which means developers can take increased risks or test more ideas at the same cost.
Similar to YouTube, the first commercial applications of SnapML have been marketing-related: there’s already a process for submitting sponsored Lenses, which can of course include an ML component.
It’s not too hard to imagine that some advertising agencies will specialize in building SnapML models that, for example, segment Coke bottles or classify different types of Nike shoes in a Lens.
I bet you can bootstrap a pretty solid company around that pitch.
Another application could be Lenses that track viral challenges: count and display how many pushups someone does, or whether they’re getting all the steps right for a TikTok dance.
Snapchat is building many of these things itself, but the open platform leaves lots of room for creative ML engineers to innovate—and even get a share of this year’s $750,000 Official Lens Creators fund.
(See some of the creations that came out of the fund here.)
The big question for me is whether and how Snapchat will expand these incentives for creating ML-powered Lenses.
The Creators fund tripled in size from 2019 to 2020; will we see it grow again next year?
Or are we going to get an in-app Snapchat store for premium Lenses with revenue sharing for creators?
In any case, I think this will be a very exciting space to follow over the next few years.
GPT-3 demos: one month in
OpenAI is expanding access to its API powered by GPT-3, the lab’s latest gargantuan language model.
As I wrote in last month’s DT #42, what makes GPT-3 special is that it can perform a wide variety of language tasks straight out of the box, making it much more accessible than its predecessor, GPT-2:
For example, if you feed it several questions and answers prefixed with “Q:” and “A:” respectively, followed by a new question and “A:”, it’ll continue the passage by answering the question—without ever having to update its weights!
Other examples include parsing unstructured text data into tables, improving English-language text, and even turning natural language into Bash terminal commands (but can it do git?).
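That Q:/A: pattern is just a text prompt; against the completion API as it worked at the time, a call would look roughly like this sketch (the engine name and parameters are illustrative):

```python
# Sketch of the Q:/A: few-shot pattern against OpenAI's completion API as it
# worked at the time; the engine name and parameters here are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"

prompt = """Q: What is the capital of France?
A: Paris
Q: Who wrote Hamlet?
A: William Shakespeare
Q: What is the tallest mountain on Earth?
A:"""

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=10,
    temperature=0.0,
    stop="\n",  # stop at the end of the answer line
)
print(response.choices[0].text.strip())  # hopefully: "Mount Everest"
```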
At the time, only a few companies (like Casetext, MessageBird and Quizlet) and researchers (like Janelle Shane) had access to the API.
But the rest of us could sign up for a waitlist, and over the past few weeks OpenAI has started sending out invites.
I’ve collected some of the coolest demos here, roughly grouped by topic.
I know it’s a lot of links, but many of these are definitely worth a look!
They’re all very impressive, very funny, or both.
A big group of projects generates some form of code.
Two other projects imitate famous writers.
Another set of projects restructures text into new forms.
- Another experiment by Andrew Mayne can transform a movie script into a story (and the reverse). I found this demo particularly impressive: the story also includes a lot of relevant and interesting details that were not in the original script.
- Francis Jervis had GPT-3 turn plain language into legal language. For example, “My apartment had mold and it made me sick” became “Plaintiff’s dwelling was infested with toxic and allergenic mold spores, and Plaintiff was rendered physically incapable of pursing his or her usual and customary vocation, occupation, and/or recreation.” (More here.)
- Mckay Wrigley built a site called Learn From Anyone, where you can ask Elon Musk to teach you about rockets, or Shakespeare to teach you about writing.
Some projects are about music.
- Arram Sabeti used GPT-3 for a bunch of different things, including generating songs: he had both Lil Wayne and Taylor Swift write songs called “Harry Potter,” with great results. (The blog post also contains a fake user manual for a flux capacitor and a fake essay about startups on Mars by Paul Graham.)
- Sushant Kumar got the API to write vague but profound-sounding snippets about music. For example, “Innovation in rock and roll was often a matter of taking a pop melody and playing it loudly.” And, “You can test your product by comparing it to a shitty product it fixes. With music, you can’t always do that.” (It also generates tweets for blockchain, art, or any other word.)
And finally, some projects did more of the fun prompt-and-response text generation we saw from GPT-2 earlier:
GPT-3 generating episode titles and summaries for the Connected podcast.
I also got my own invite to try GPT-3 for This Episode Does Not Exist!, my project to generate fake episode titles and summaries for my favorite podcasts, like Connected and Hello Internet.
It used to work by fine-tuning GPT-2 on metadata of all previous episodes of the show for 600 to 1,000 epochs, a process that took about half an hour on a P100 GPU on Colab.
Now, with GPT-3 I can simply paste 30ish example episodes into the playground (more is beyond the input character limit), type “Title:”, and GPT-3 generates a few new episodes—no retraining required!
Once I get a chance to wrap this into a Python script, it’ll become so much easier for me to add new podcasts and episodes to the website.
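That script would mostly just be prompt assembly, something along these lines (the episode data, engine name, and parameters below are placeholders, not finished code):

```python
# Rough sketch of the script I have in mind: build a few-shot prompt from
# existing episodes and let GPT-3 continue it. Episode data, engine name, and
# parameters are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"

episodes = [
    ("Example episode title", "Example episode summary."),
    # ...30ish more (title, summary) pairs, up to the prompt length limit
]

prompt = ""
for title, summary in episodes:
    prompt += f"Title: {title}\nSummary: {summary}\n\n"
prompt += "Title:"

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=150,
    temperature=0.8,
    stop="\n\n",  # stop after one generated episode
)
print("Title:" + response.choices[0].text)
```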
One AI model, four competing services
Melody ML, Acapella Extractor, Vocals Remover, and Moises.ai are all services that use AI to separate music into different tracks by instrument.
Like many of these single-use AI products, they wrap machine learning models into easy-to-use UIs and APIs, and sell access to them as a service (after users exceed their free tier credits).
Here’s a few examples of their outputs:
As you can tell, these services all have pretty similar-quality results.
That’s no accident: all four are in fact built on top of Spleeter, an open-source AI model by French music service Deezer—but none of them are actually by Deezer.
So these services are basically just reselling Amazon’s or Google’s GPU credits at a markup—not bad for what I imagine to be about a weekend’s worth of tying everything together with a bit of code.
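To give a sense of how little code the core of such a service needs, here’s a minimal sketch based on Spleeter’s documented Python interface (which may have changed since); everything else is UI, file handling, and billing:

```python
# Minimal sketch of the separation step these services wrap, based on Spleeter's
# documented Python interface at the time (it may have changed since).
from spleeter.separator import Separator

# "2stems" splits vocals from accompaniment; 4- and 5-stem models also exist.
separator = Separator("spleeter:2stems")
separator.separate_to_file("song.mp3", "output/")
# -> output/song/vocals.wav and output/song/accompaniment.wav
```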
There’s a lot of low-hanging fruit in this space, too: even just within the audio domain, there are 22 different tasks on Papers with Code for which you can find pretrained, state-of-the-art models that are just waiting to be wrapped into a service.
(And for computer vision, there are 807 tasks.)
I actually quite like the idea of this.
You need a whole different skillset to turn a trained model into a useful product that people are willing to pay for: from building out a thoughtful UI and the relevant platform/API integrations, to finding a product/market fit and the right promotional channels for your audience.
As long as the models are open-source and licensed to allow commercial use, I think building products like this and charging money for them is completely fair game.
Since the core technology is commoditized by the very nature of the underlying models being open-source, the competition shifts to who has the best execution around those same models.
For example, the Melody ML service restricts both free and paid users to a maximum length of 5 minutes per song.
Moises.ai saw that and thought they could do better: for $4/month, they’ll process songs up to 20 minutes long.
Similarly, the person who built both Vocals Remover and Acapella Extractor figured the pitch worked better in the form of those two separate, specialized websites.
They even set up namesake YouTube channels that respectively post instrumentals-only and vocals-only versions of popular songs—some with many thousands of views—and of course link those back to the websites.
Clever!
It’s really cool to see how the open-source nature of the AI community, along with how easy it is to build websites that integrate with cloud GPUs and payments services nowadays, is enabling these projects to pop up more and more.
So who’s picking up something like this as their next weekend project?
Let me know if you do!
(Thanks for the link to Acapella Extractor, Daniël!
Update: I previously thought the Melody ML service was by Deezer, but someone at Deezer pointed out it was built by a third party.)
Is it enough for only big tech to pull out of facial recognition?
Big tech companies are putting an end to their facial recognition APIs.
Besides their obvious privacy problems, commercial face recognition APIs have long been criticized for their inconsistent recognition accuracies for people of different backgrounds.
Frankly said, these APIs are better at identifying light-skinned faces than dark-skinned ones.
Joy Buolamwini and Timnit Gebru first documented a form of this in their 2018 Gender Shades paper, and there have been many calls to block facial recognition APIs from being offered ever since; see Jay Peter’s article in The Verge for some more historical context.
It took two years and the recent reckoning with discrimination and police violence in the United States (see DT #41) for IBM to finally write a letter to the US Congress announcing they’re done with the technology:
IBM no longer offers general purpose IBM facial recognition or analysis software.
IBM firmly opposes and will not condone uses of any technology, including facial recognition technology offered by other vendors, for mass surveillance, racial profiling, violations of basic human rights and freedoms, or any purpose which is not consistent with our values and Principles of Trust and Transparency.
Amazon and Microsoft followed soon after, pausing police use of their equivalent APIs.
Notably Google, where Gebru works, has never had a facial recognition API.
Now that these big-name tech companies are no longer providing facial-recognition-as-a-service, however, this does expose a new risk.
Benedict Evans, in his latest newsletter:
The catch is that this tech is now mostly a commodity (and very widely deployed in China) - Google can say “wait”, but a third-tier bucketshop outsourcer can bolt something together from parts it half-understands and sell it to a police department that says ‘it’s AI - it can’t be wrong!’.
This is a real risk, and that’s why the second half of these announcements is equally—if not more—important.
Also from IBM’s letter to congress:
We believe now is the time to begin a national dialogue on whether and how facial recognition technology should be employed by domestic law enforcement agencies.
The real solution here is not for individual big tech companies to be publicly shamed into stopping their facial recognition APIs, but for the technology to be regulated by law—so that a “third-tier bucketshop outsourcer” can’t do the same thing, but out of the public eye.
So: these are good steps, but this week’s news is far from the last chapter in the story of face recognition.
FirefliesAI: meeting audio to searchable notes
Fireflies.ai turns meetings into notes.
Fireflies.ai records and transcribes meetings, and automatically turns them into searchable, collaborative notes.
The startup’s virtual assistant, adorably named Fred, hooks into Google Calendar so that it can automatically join an organization’s Zoom, Meet or Skype calls.
As it listens in, it extracts useful notes and information which it can forward to appropriate people in the organization through integrations like Slack and Salesforce.
Zach Winn for MIT News:
“[Fred] is giving you perfect memory,” says [Sam] Udotong, who serves as Fireflies’ chief technology officer.
“The dream is for everyone to have perfect recall and make all their decisions based on the right information.
So being able to search back to exact points in conversation and remember that is powerful.
People have told us it makes them look smarter in front of clients.”
As someone who externalizes almost everything I need to remember into an (arguably overly) elaborate system of notes, calendars and to-do apps, I almost feel like this pitch is aimed directly at me.
I haven’t had a chance to try it out yet, but I’m hoping to give it a shot on my next lunchclub.ai call (if my match is up for it, of course).
Fireflies is not alone, though.
It looks like this is becoming a competitive space in productized AI, with Descript (DT #18, #24), Microsoft’s Project Denmark (#23), and Otter.ai (#40) all currently working on AI-enabled smart transcription and editing of long-form audio data.
Exciting times!
Pinterest's AI-powered automatic board groups
Pinterest’s UX flow for ML-based grouping within boards. (Pinterest Engineering Blog.)
Pinterest has added new AI-powered functionality for grouping images and other pins on a board.
The social media platform is mostly centered around finding images and collecting (pinning) them on boards.
After working on a board for a while, though, some users may pin so much that they no longer see the forest for the trees.
That’s where this new feature comes in:
For example, maybe a Pinner is new to cooking but has been saving hundreds of recipe Pins.
With this new tool, Pinterest may suggest board sections like “veggie meals” and “appetizers” to help the Pinner organize their board into a more actionable meal plan.
Here’s how it works:
- When a user views a board that has a potential grouping, a suggestion pops up showing the suggested group and a few sample pins.
- If the user taps it, the suggestion expands into a view with all the suggested pins, where she can deselect any pins she does not want to add to the group. (Which I’m sure is very valuable training data!)
- The user can edit the name for the section, and then it gets added to her board.
Coming up with potential groupings is a three-step process.
First, a graph convolutional network called PinSage computes an embedding based on text associated with the pin, visual features extracted from the image, and the graph structure.
Then the Ward clustering algorithm (chosen because it does not require a predefined number of clusters) generates potential groups.
Finally, a filtered count of common annotations for pins in the group decides the proposed group name.
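Pinterest’s code isn’t public, but the middle clustering step is easy to reproduce with scikit-learn; in the sketch below, random vectors stand in for the PinSage embeddings and the distance threshold is an arbitrary choice:

```python
# Sketch of the Ward clustering step with scikit-learn; random vectors stand in
# for the PinSage embeddings, and the distance threshold is an arbitrary choice.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

pin_embeddings = np.random.rand(200, 256)  # 200 pins, 256-dim embeddings

# With a distance threshold (and n_clusters=None), the number of groups is
# determined by the data rather than fixed in advance -- the property Pinterest
# wanted from Ward clustering.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=5.0,
    linkage="ward",
)
labels = clustering.fit_predict(pin_embeddings)
print(f"Found {labels.max() + 1} candidate board sections")
```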
Pinterest has really been on a roll lately with adding AI-powered features to its apps, including visual search (DT #23) and AR try-on for shopping (DT #33).
This post by Dana Yakoobinsky and Dafang He on the company’s engineering blog has the full details on their implementation of this latest feature, as well as some future plans to expand it.
Cloudflare's ML-powered bot blocking
Cloudflare’s overview of good and bad bots.
Web infrastructure company Cloudflare is using machine learning to block “bad bots” from visiting their customers’ websites.
Across the internet, malicious bots are used for content scraping, spam posting, credit card surfing, inventory hoarding, and much more.
Bad bots account for an astounding 37% of internet traffic visible to Cloudflare (humans are responsible for 60%).
To block these bots, Cloudflare built a scoring system based on five detection mechanisms: machine learning, a heuristics engine, behavior analysis, verified bots lists, and JavaScript fingerprinting.
Based on these mechanisms, the system assigns a score of 0 (probably a bot) to 100 (probably a human) to each request passing through Cloudflare—about 11 million requests per second, that is.
These scores are exposed as fields for Firewall Rules, where site admins can use them in conjunction with other properties to decide whether the request should pass through to their web servers or be blocked.
Of those detection mechanisms, machine learning is responsible for 83% of detections.
Because support for categorical features and inference speed were key requirements, Cloudflare went with gradient-boosted decision trees as their model of choice (implemented using CatBoost).
They run at about 50 microseconds per inference, which is fast enough to enable some cool extras.
For example, multiple models can run in shadow mode (logging their results but not influencing blocking decisions), so that Cloudflare engineers can evaluate their performance on real-world data before deploying them into the Bot Management System.
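As a rough picture of what that kind of classifier looks like, here’s a minimal CatBoost sketch; the features and training data are made up for illustration and have nothing to do with Cloudflare’s actual feature set:

```python
# Minimal sketch of a gradient-boosted bot classifier with CatBoost; the features
# and data are made up for illustration, not Cloudflare's actual feature set.
from catboost import CatBoostClassifier

# A mix of categorical (user agent, country) and numerical (requests/minute) features.
X_train = [
    ["curl/7.68.0", "US", 350.0],
    ["Mozilla/5.0 (Windows NT 10.0)", "NL", 2.5],
    ["python-requests/2.25", "CN", 900.0],
    ["Mozilla/5.0 (Macintosh)", "DE", 1.0],
]
y_train = [1, 0, 1, 0]  # 1 = bot, 0 = human

model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
model.fit(X_train, y_train, cat_features=[0, 1])  # columns 0 and 1 are categorical

print(model.predict_proba([["curl/7.68.0", "US", 500.0]]))
```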
Alex Bocharov wrote about the development of this system for the Cloudflare blog.
It’s a great read on adding an AI-powered feature to a larger product offering, with good coverage of all the tradeoffs involved in that process.
Bias reductions in Google Translate
Gender-specific translations from Persian, Finnish, and Hungarian in the new Google Translate.
Google is continuing to reduce gender bias in its Translate service.
Previously, it might translate “o bir doktor” in Turkish, a language that does not use gendered pronouns, to “he is a doctor”—assuming doctors are always men—and “o bir hemşire” to “she is a nurse”—assuming that nurses are always women.
This is a very common example of ML bias, to the point that it’s covered in introductory machine translation courses like the one I took in Edinburgh last year.
That doesn’t mean it’s easy to solve, though.
Back in December 2018, Google took a first step toward reducing these biases by providing gender-specific translations in Translate for Turkish-to-English phrase translations, like the example above, and for single word translations from English to French, Italian, Portuguese, and Spanish (DT #3).
But as they worked to expand this into more languages, they ran into scalability issues: only 40% of eligible queries were actually showing gender-specific translations.
Google’s original and new approaches to gender-specific translations.
They’ve now overhauled the system: instead of attempting to detect whether a query is gender-neutral and then generating two gender-specific translations, it now generates a default translation and, if this translation is indeed gendered, also rewrites it to an opposite-gendered alternative.
This rewriter uses a custom dataset to “reliably produce the requested masculine or feminine rewrites 99% of the time.” As before, the UI shows both alternatives to the user.
Another interesting aspect of this update is how they evaluate the overall system:
We also devised a new method of evaluation, named bias reduction, which measures the relative reduction of bias between the new translation system and the existing system.
Here “bias” is defined as making a gender choice in the translation that is unspecified in the source.
For example, if the current system is biased 90% of the time and the new system is biased 45% of the time, this results in a 50% relative bias reduction.
Using this metric, the new approach results in a bias reduction of ≥90% for translations from Hungarian, Finnish and Persian-to-English.
The bias reduction of the existing Turkish-to-English system improved from 60% to 95% with the new approach.
Our system triggers gender-specific translations with an average precision of 97% (i.e., when we decide to show gender-specific translations we’re right 97% of the time).
The standard academic metrics (recall and average precision) did not answer the most important question about the two different approaches, so the developers came up with a new metric specifically to evaluate relative bias reduction.
Beyond machine translation, this is a nice takeaway for productized AI in general: building the infrastructure and metrics to measure how your ML system behaves in its production environment is at least as important as designing the model itself.
In the December 2018 post announcing gender-specific translations, the authors mention that one next step is also addressing non-binary gender in translations; this update does not mention that, but I hope it’s still on the roadmap.
Either way, it’s commendable that Google has continued pushing on this even after the story has been out of the media for a while now.
Rosebud AI's GAN photo models
None of these models exist. (Rosebud AI)
Rosebud AI uses generative adversarial networks (GANs) to synthesize photos of fake people for ads.
We’ve of course seen a lot of GAN face generation in the past (see DT #6, #8, #23), but this is one of the first startups I’ve come across that’s building a product around it.
Their pitch to advertisers is simple: take photos from your previous photoshoots, and we’ll automatically swap out the model’s face with one better suited to the demographic you’re targeting.
The new face can either be GAN-generated or licensed from real models on the generative.photos platform.
But either way, Rosebud AI’s software takes care of inserting the face in a natural-looking way.
This raises some obvious questions: is it OK to advertise using nonexistent people?
Do you need models’ explicit consent to reuse their body with a new face?
How does copyright work when your model is half real, half generated?
I’m sure Rosebud AI’s founders spend a lot of time thinking about these questions; and as they do, you can follow along with their thoughts on Twitter and Instagram.
Software 2.0 at Plumerai
The stacked layers of Plumerai’s Larq ecosystem: Larq Compute Engine, Larq, and Larq Zoo.
A few days ago, we published a new blog post about software 2.0 at Plumerai, which touches on some points that are interesting for productized artificial intelligence at large.
As a reminder, Plumerai—and my day-to-day research there—is centered around Binarized Neural Networks (BNNs): deep learning models in which weights and activations are not floating-point numbers but can only be -1 or +1.
Larq is our ecosystem of open-source packages for BNN development: larq/zoo has pretrained, state-of-the-art models; larq/larq integrates with TensorFlow Keras to provide BNN layers and training tools; and larq/compute-engine is an optimized converter and inference engine for deploying models to mobile and edge devices.
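For a feel of what that middle layer looks like in practice, here’s a minimal sketch of a binarized model defined with Larq’s Keras-compatible layers, following the patterns in Larq’s documentation:

```python
# Minimal sketch of a binarized model using Larq's Keras-compatible layers,
# following the patterns in Larq's documentation.
import larq as lq
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Binarized layers: weights (and incoming activations) constrained to -1/+1.
    lq.layers.QuantDense(256,
                         kernel_quantizer="ste_sign",
                         kernel_constraint="weight_clip"),
    lq.layers.QuantDense(10,
                         input_quantizer="ste_sign",
                         kernel_quantizer="ste_sign",
                         kernel_constraint="weight_clip"),
    tf.keras.layers.Activation("softmax"),
])
lq.models.summary(model)  # reports binarized vs. full-precision parameters
```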
Andrej Karpathy, who was previously at OpenAI and is now developing Tesla’s self-driving software, first wrote about his vision for Software 2.0 back in 2017.
I recommend reading the whole essay, but it boils down to the idea that large chunks of currently human-written software will be replaced by learned neural networks—something we already see happening in areas like computer vision, machine translation, and speech recognition/synthesis.
Let’s look at two benefits Karpathy notes about this shift that are relevant to our work on Larq.
First, neural networks are agile.
Depending on computational requirements—running on a high-power chip vs. an energy-efficient one—deep learning models can be scaled by trading off size for accuracy.
Take one of the BNN families in our Zoo package, for example: the XL version of QuickNet achieves 67.0% top-1 ImageNet classification accuracy at a size of 6.2 MB, but it can also be scaled down to get 58.6% at just 3.2 MB.
Depending on the power and accuracy requirements of your application, you can swap in one for the other without having to otherwise change your code.
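In larq-zoo terms, that swap is a one-line change; the sketch below follows the package’s documentation at the time, so the exact class names may have shifted since:

```python
# Sketch of swapping QuickNet variants from larq-zoo; the class names follow the
# larq-zoo documentation at the time and may have changed since.
from larq_zoo.sota import QuickNet, QuickNetXL

# Smaller and faster, lower accuracy:
model = QuickNet(weights="imagenet")

# Drop-in replacement when the accuracy is worth the extra size:
# model = QuickNetXL(weights="imagenet")

model.summary()
```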
Second, deep learning models constantly get better.
If we think of a cool new training trick or other optimization for BNNs, we can push it out in an updated version of the QuickNet family.
In fact, that’s exactly what we did last week:
A great example of the power of this integrated approach is the recent addition of one-padding across the Larq stack.
Padding with ones instead of zeros simplifies binary convolutions, reducing inference time without degrading accuracy.
We not only enabled this in [Larq Compute Engine], but also implemented one-padding in Larq and retrained our QuickNet models to incorporate this feature.
All you need to do to get these improvements is update your pip packages.
I’m super excited about the idea of Software 2.0, and—as you can probably tell from the paragraphs above—I’m pumped to be working on it every day at Plumerai, both on the research side (making better models) and the software engineering side (improving Larq).
You can read more about our Software 2.0 aspirations in this blog post: The Larq Ecosystem: State-of-the-art binarized neural networks and even faster inference.
Unscreen by remove.bg
Landing page for unscreen
Unscreen is a new zero-click tool for automatically removing the background from videos.
It’s the next project from Kaleido, the company behind remove.bg, which I’ve covered extensively on Dynamically Typed: from their initial free launch (DT #3) and Golden Kitty award (DT #5), to the launch of their paid photoshop plugin (DT #12) and cat support (yes, really: DT #16).
Unscreen is another great example of a highly-targeted, easy-to-use AI product, and I’m excited to see it evolve—probably following a similar path to remove.bg, since they’ve already pre-announced their HD, watermark-free pro plan on the launch site.
BERT in Google Search
Improved natural language understanding in Google Search. (Google)
Google Search now uses the BERT language model to better understand natural language search queries:
This breakthrough was the result of Google research on transformers: models that process words in relation to all the other words in a sentence, rather than one-by-one in order.
BERT models can therefore consider the full context of a word by looking at the words that come before and after it—particularly useful for understanding the intent behind search queries.
The query in the screenshots above is a good example of what BERT brings to the table: its understanding of the word “to” between “brazil traveler” and “usa” means that it no longer confuses whether the person is from Brazil and going to the USA or the other way around.
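Google’s production setup isn’t public, but you can get a feel for this kind of bidirectional context with an off-the-shelf BERT model; the sketch below uses the open-source bert-base-uncased model via the Hugging Face transformers library, which is not what Google Search runs:

```python
# Sketch of BERT's bidirectional context using the open-source bert-base-uncased
# model via Hugging Face transformers; Google Search runs its own models, so this
# only illustrates the general idea.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on *both* sides of the blank to fill it in.
for prediction in fill_mask("a traveler from brazil [MASK] the usa needs a visa."):
    print(prediction["token_str"], round(prediction["score"], 3))
```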
Google is even using concepts that BERT learns from English-language web content for other languages, which led to “significant improvements in languages like Korean, Hindi and Portuguese.” Read more in Pandu Nayak’s post for Google’s The Keyword blog: Understanding searches better than ever before.
Descript's Podcast Studio
The all-new Descript podcast studio. (Descript)
Descript launched their podcast studio app.
As I wrote in DT #18, Descript is a great example of a productized AI company:
Descript takes an audio file (like a podcast or conference talk recording) as input and transcribes it using machine learning.
Then, it lets you edit the transcript and audio in synchrony, automatically moving audio clips around as you cut, paste, and shuffle around bits of text.
The team has now launched a multitrack podcast production app using this same technology.
As they put it, it’s “the version of Descript we’ve dreamed of since conceiving of the company.” The podcast studio allows you to edit multiple speakers’ audio tracks by editing the transcribed text of what they said; Descript takes care of splicing and syncing all the audio.
It also comes with some crazy new (beta) functionality called Overdub.
The feature lets you replace a few words of a transcript and then uses your newly inserted text to generate an audio version of what you typed in your own voice.
Sounds amazing!
But also dangerous—what if someone has a recording of your voice?
Can they just make a convincing audio clip of you saying whatever they want?
Nope.
Lyrebird, the team behind the feature, has built in safeguards to prevent that from happening:
Invariably, to first experience Overdub is to experience wunderschrecken—a simultaneous feeling of wonder and dread.
Rest assured, you can only use Overdub on your own voice.
We built this feature to save you the tedium of re-recording/splicing time every time you make an editorial change, not as a way make deep fakes.
The Lyrebird team deserves credit for figuring this out — in order to train a voice model, you need to record yourself speaking randomly generated sentences, preventing others from using pre-existing recordings to create a model of your voice.
Read all about the new podcast studio and Overdub in Descript CEO Andrew Mason’s Medium post: Introducing Descript Podcast Studio & Overdub.