Dynamically Typed

DALL·E and CLIP: OpenAI's Multimodal Neural Networks

Two example prompts and resulting generated images from DALL·E

Two example prompts and resulting generated images from DALL·E

OpenAI’s new “multimodal” DALL·E and CLIP models combine text and images, and also mark the first time that the lab has presented two separate big pieces of work in conjunction. In a short blog post, which I’ll quote almost in full throughout this story because it also neatly introduces both networks, OpenAI’s chief scientist Ilya Sutskever explains why:

A long-term objective of artificial intelligence is to build “multimodal” neural networks—AI systems that learn about concepts in several modalities, primarily the textual and visual domains, in order to better understand the world. In our latest research announcements, we present two neural networks that bring us closer to this goal.

These two neural networks are DALL·E and CLIP. We’ll take a look at them one by one, starting with DALL·E.

The name DALL·E is a nod to Salvador Dalí, the surrealist artist known for that painting of melting clocks, and to WALL·E, the Pixar science-fiction romance about a waste-cleaning robot. It’s a bit silly to name an energy-hungry image generation AI after a movie in which lazy humans have fled a polluted Earth to float around in space and do nothing but consume content and food, but given how well the portmanteau works and how cute the WALL·E robots are, I probably would’ve done the same. Anyway, beyond what’s in a name, here’s Sutskever’s introduction of what DALL·E actually does:

The first neural network, DALL·E, can successfully turn text into an appropriate image for a wide range of concepts expressible in natural language. DALL·E uses the same approach used for GPT-3, in this case applied to text–image pairs represented as sequences of “tokens” from a certain alphabet.

DALL·E builds on two previous OpenAI models, combining GPT-3’s capability to perform different language tasks without finetuning with Image GPT’s capability to generate coherent image completions and samples. As input it takes a single stream — first text tokens for the prompt sentence, then image tokens for the image — of up to 1280 tokens, and learns to predict the next token given the previous ones. Text tokens take the form of byte-pair encodings of letters, and image tokens are patches from a 32 x 32 grid in the form of latent codes found using a variational autoencoder similar to VGVAE. This relatively simple architecture, combined with a large, carefully designed dataset, gives DALL·E the following laundry list of capabilities, each of which have interactive examples in OpenAI’s blog post:

  • Controlling attributes
  • Drawing multiple objects
  • Visualizing perspective and three-dimensionality
  • Visualizing internal and external structure (like asking for a macro or x-ray view!)
  • Inferring contextual details
  • Combining unrelated concepts
  • Zero-shot visual reasoning
  • Geographic and temporal knowledge

A lot of people from the community have written about DALL·E or played around with its interactive examples. Some of my favorites include:

I think DALL·E is the more interesting of the two models, but let’s also take a quick look at CLIP.

CLIP’s performance on different image classification benchmarks.

CLIP’s performance on different image classification benchmarks.

Sutskever:

CLIP has the ability to reliably perform a staggering set of visual recognition tasks. Given a set of categories expressed in language, CLIP can instantly classify an image as belonging to one of these categories in a “zero-shot” way, without the need to fine-tune on data specific to these categories, as is required with standard neural networks. Measured against the industry benchmark ImageNet, CLIP outscores the well-known ResNet-50 system and far surpasses ResNet in recognizing unusual images.

Instead of training on a specific benchmark like ImageNet or ObjectNet, CLIP pretrains on a large dataset of text and images scraped from the internet (so without specific human labels for each images). It performs a proxy training task: “given an image, predict which out of a set of 32,768 randomly sampled text snippets, was actually paired with it in our dataset.” To then do actual classification on a benchmark dataset, the labels are transformed to be more descriptive (e.g. a “cat” label becomes “a photo of a cat”), and CLIP calculates for each label how likely it is to be paired with the image. It predicts the most likely one to be the label. As you can see from the image above, this approach is highly effective across datasets. It’s also very efficient because, being a zero-shot model, CLIP doesn’t need to be (re)trained or finetuned for different datasets.

My favorite application so far of CLIP is by Travis Hoppe, who used it to visualize poems using Unsplash photos — worth a click! Another interesting one is actually how it’s used in combination with DALL·E: after DALL·E generates 512 plausible images for a prompt, CLIP ranks their quality, and only the 32 best ones are returned in the interactive viewer. Instead of researchers cherry-picking the best results to show in a paper, a different neural net can actually perform this task!

AlphaFold 2: DeepMind's structural biology breakthrough

AlphaFold’s predictions vs. the experimentally-determined shapes of two CASP14 proteins.

AlphaFold’s predictions vs. the experimentally-determined shapes of two CASP14 proteins.

DeepMind’s AlphaFold 2 is a major protein folding breakthrough. Protein folding is a problem in structural biology where, given the one-dimensional RNA sequence of a protein, a computational model has to predict what three-dimensional structure the protein “folds” itself into. This structure is much more difficult to determine experimentally than the RNA sequence, but it’s essential for understanding how the protein interacts with other machinery inside cells. In turn, this can give insights into the inner workings of diseases — “including cancer, dementia and even infectious diseases such as COVID-19” — and how to fight them.

Biennially since 1994, the Critical Assessment of Techniques for Protein Structure Prediction (CASP) has determined the state of the art in computational models for protein folding using a blind test. Research groups are presented (only) with the RNA sequences of about 100 proteins whose shapes have recently been experimentally determined. They blindly predict these shapes using their computational models and submit them to CASP to be evaluated with a Global Distance Test (GDT) score, which roughly corresponds to how far each bit of the protein is from where it’s supposed to be. GDT scores range from 0 to 100, and a model that scores at least 90 across different proteins would be considered good enough to be useful to science (“competitive with results obtained from experimental methods”).

Before CASP13 in 2018, no model had ever scored significantly above 40 GDT. That year, the first version of AlphaFold came in at nearly 60 GDT — already “stunning” at the time (see DT #21). At CASP14 this year, AlphaFold 2 blew its previous results out of the water and achieved a median score of 92.4 GDT across all targets. This was high enough for CASP to declare the problem as “solved” in their press release and to start talking about new challenges for determining the shape of multi-protein complexes.

I’ve waited a bit to write about AlphaFold 2 until the hype died down because, oh boy, was there a lot of hype. DeepMind released a slick video about the team’s process, their results were covered with glowing features in Nature and The New York Times, and high praise came even from the leaders of DeepMind’s biggest competitors, including OpenAI’s Ilya Sutskever and Stanford HAI’s Fei-Fei Li. It was a pretty exciting few days on ML twitter.

Columbia University’s Mohammed AlQuraishi, who has been working on protein folding for over a decade, was one of the first people to break the CASP14 news. His blog post about CASP13 and AlphaFold 1 was also widely circulated back in 2018, so a lot of people in the field were interested in what he’d have to say this year. Last week, after the hype died down a bit, AlQuraishi published his perspective on AlphaFold 2. He summarized it by saying “it feels like one’s child has left home:” AF2 got results he did not expect to see until the end of this decade, even when takin into account AF1 — bittersweet for someone whose lab has also been working on this same problem for a long time.

AlQuraishi is overall extremely positive about DeepMind’s results here, but he does express disappointment at their “falling short of the standards of academic communication” — the lab has so far been much more secretive about AF2 than it was about AF1 (which is open-source). AlQuraishi’s post is very long and technical, but if you want to know exactly how impressive AlphaFold 2 is, learn the basics of how it works, read about its potential applications in broader biology, or see some of the hot takes against it debunked, the post is definitely worth the ~75 minutes of your time. (I always find it energizing to see someone excitedly explain a big advancement in their field that they did not directly work on; here’s the link again.)

I personally also can’t wait to see the first practical applications of AlphaFold, which I expect we’ll start to see DeepMind talk about in the coming years. (Hopefully!) For one, they’ve already released AlphaFold’s predictions for some proteins associated with COVID-19.

Google AI's ethics crisis

Google AI is in the middle of an ethics crisis. Timnit Gebru, the AI ethics researcher behind Gender Shades (see DT #42), Datasheets for Datasets (#41), and much more, got pushed out of the company after a series of conflicts. Karen Hao for MIT Technology Review:

A series of tweets, leaked emails, and media articles showed that Gebru’s exit was the culmination of a conflict over [a critical] paper she co-authored. Jeff Dean, the head of Google AI, told colleagues in an internal email (which he has since put online) that the paper “didn’t meet our bar for publication” and that Gebru had said she would resign unless Google met a number of conditions, which it was unwilling to meet. Gebru tweeted that she had asked to negotiate “a last date” for her employment after she got back from vacation. She was cut off from her corporate email account before her return.

See Casey Newton’s coverage on his Platformer newsletter for both Gebru’s and Jeff Dean’s emails (and here for his extended statement). This story unfolded over the past week and is probably far from over, but from everything I’ve read so far — which is a __lot, hence this email hitting your inbox a bit later than usual — I think think Google management made the wrong call here. Their statement on the matter focuses on missing references in Gebru’s paper, but as Google Brain Montreal researcher Nicolas Le Roux points out:

… [The] easiest way to discriminate is to make stringent rules, then to decide when and for whom to enforce them. My submissions were always checked for disclosure of sensitive material, never for the quality of the literature review.

This is echoed by a top comment on HackerNews. From Gebru’s email, it sounds like frustrations had been building up for some time, and that the lack of transparency surrounding the internal rejection of this paper was simply the final straw. I think it would’ve been more productive for management to start a dialog with Gebru here — forcing a retraction, “accepting her resignation” immediately and then cutting off her email only served to escalate the situation.

Gebru’s research on the biases of large (compute-intensive) vision and language models is much harder to do without the resources of a large company like Google. This is a problem that academic ethics researchers often run into; OpenAI’s Jack Clark, who gave feedback on Gebru’s paper, has also pointed this out. I always found it admirable that Google AI, as a research organization, intellectually had the space for voices like Gebru’s to critically investigate these things. It’s a shame that it was not able to sustain an environment in which this could be fostered.

In the end, beside the ethical issues, I think Google’s handling of this situation was also a big strategic misstep. 1500 Googlers and 2100 others have signed an open letter supporting Gebru. Researchers from UC Berkeley and the University of Washington said this will have “a chilling effect” on the field. Apple and Twitter are publicly poaching Google’s AI ethics researchers. Even mainstream outlets like The Washington Post and The New York Times have picked up the story. In the week leading up to NeurIPS and the Black in AI workshop there, is this a better outcome for Google AI than letting an internal researcher submit a conference paper critical of large language models?

Methods is Papers with Code's machine learning knowledge graph

Papers with Code’s Methods page for the residual block (cropped).

Papers with Code’s Methods page for the residual block (cropped).

A few weeks ago, Papers with Code launched Methods , a knowledge graph of hundreds of machine learning concepts:

We are now tracking 730+ building blocks of machine learning: optimizers, activations, attention layers, convolutions and much more! Compare usage over time and explore papers from a new perspective.

I’ve started using Methods as my go-to reference for many things at work. Sitting at a more abstracted level than the documentation for your ML library of choice, it’s an incredibly useful resource for anyone doing ML research or engineering. Each Methods page contains the following sections:

  • A concise description of what the method is and how it works, including math and a diagram where relevant
  • A chronological list of papers that use the method
  • A breakdown of tasks from the site’s State-of-the-Art leaderboards for which the method is used
  • A graph of how the method’s use changed over time, compared to other methods of the same category (for example, Adam vs. SGD for optimizers)
  • A list of components: other methods that contribute to this method (for example, 1x1 convolutions and ReLUs are components of residual blocks)
  • A list of categories for the method

I’ve found the last of those sections to be especially handy for answering those hard-to-Google “what’s the name of that other thing that’s kind of like this thing again?” questions. (Also see this Twitter thread by the project’s co-creator Ross Taylor for a few example uses of the other sections.) Methods launched just a month ago, and given how useful it already is, I’m very excited to see how it grows in the future.

One additional feature I’d find useful is the inverse of the components section: I also want to know which methods build on top of the method I’m currently viewing. Another thing I’d like to see is an expansion of code links for methods to also include TensorFlow snippets—but since Facebook AI Research bought Papers with Code late last year, I’m guessing that keeping these snippets exclusive to (FAIR-controlled) PyTorch may be a strategic decision rather than a technical one.

OpenAI's GPT-3: a language model that doesn't need finetuning

OpenAI announced GPT-3, the next generation of its language model. As we’re used to by now, it’s another order of magnitude bigger than previous models, at 175 billion parameters—compared to 1.5 billion for GPT-2 and 17 billion for Microsoft’s Turing NLG (DT #33). It’s not the model’s size that’s interesting, though, but what this enables. From the abstract of the 74-page paper by Brown et al. (2020) detailing GPT-3:

Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. … For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.

This is super cool! Where GPT-2 could only complete a passage from a given input in a natural-sounding way, GPT-3 can now do several tasks just from being shown examples. Instead of fine-tuning the model for specific tasks like translation, question-answering, or generating podcast episode titles that do not exist (👀), the model can do everything out of the box. For example, if you feed it several questions and answers prefixed with “Q:” and “A:” respectively, followed by a new question and “A:”, it’ll continue the passage by answering the question—without ever having to update its weights! Other example include parsing unstructured text data into tables, improving English-language text, and even turning natural language into Bash terminal commands (but can it do git?).

OpenAI rolled out its previous model in stages, starting with a 117-million parameter version (“117M”) in February 2019 (DT #8), followed by 345M in May of that year (DT #13), 774M in September with a six-month follow up blog post (DT #22), and finally the full 1.5-billion parameter version in November (DT #27). The lab is doing the same for GPT-3, which is also the first model that it’s making commercially available in the form of an API. Just a few vetted organizations have had access to the API so far. Ashlee Vance for Bloomberg:

To date, Casetext has been using the technology to improve its legal research search service, MessageBird has tapped it for customer service, and education software maker Quizlet has used it to make study materials.

Janelle Shane als has access to GPT-3, and she has used the API to make some “spookily good Twitter bots” on her AI Weirdness blog.

I’m glad OpenAI staging the release of their API this way again, since valid criticism has already started popping up: Anima Anandkumar pointed out on Twitter that the GPT-2 has “produced shockingly racist and sexist paragraphs without any cherry picking.” (Also see this follow-up discussion with OpenAI policy director Jack Clark.) These type of bias problems have to be worked out before the model can responsibly be released beyond a few trusted partners, which OpenAI CEO Sam Altman also acknowledged this in Vance’s piece:

As time goes on, more organizations will gain access, and then the API will be public. “I don’t know exactly how long that will take,” Altman said. “We would rather be on the too-slow than the too-fast side. We will mistakes here, and we will learn.”

As the OpenAI API gets released more broadly and integrated into more products, I’ll keep following its progress.

Datasheets for datasets and Model Cards for model reporting

Google’s model card for their face detection model. (Google)

Google’s model card for their face detection model. (Google)

Datasheets for Datasets and Model Cards for Model Reporting . These two papers aim to improve transparency and accountability in machine learning models and the datasets that were used to create them.

From the abstract of the first paper by Gebru et al. (2018):

The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on.

The paper goes on to provide a set of questions and a workflow to properly think through and document each of these aspects of a dataset in a dataseheet. It also has example datasheets for two standard datasets: Labeled Faces in the Wild and the Movie Review Data.

From the abstract of the second paper by Mitchell et al. (2019):

Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.

This is essentially the same principle, but now applied to a trained model instead of a dataset. The paper also includes details on how to fill in each part of a model card, as well as two examples: a smile detection model and a text toxicity classifier. I’ve also seen some model cards in the wild recently: Google has them for their face detection and object detection APIs and OpenAI has one for their GPT-2 language model (but not yet for GPT-3, as far as I can tell).

I’m excited to try creating a dataset datasheet and a model card at work—which also makes me think: practicing making these should really have been part of my AI degree. I’ve also added both papers to my machine learning resources list.

Distill: Exploring Bayesian Optimization

Bayesian optimization of finding gold along a line, using the probability of improvement (PI) acquisition function. (Agnihotri and Batra, 2020.)

Bayesian optimization of finding gold along a line, using the probability of improvement (PI) acquisition function. (Agnihotri and Batra, 2020.)

Apoorv Agnihotri and Nipun Batra wrote an article Exploring Bayesian Optimization for Distill. This technique is used in hyperparameter optimization, where evaluating any one point—like the combination of a learning rate, a weight decay factor, and a data augmentation setting—is expensive: you need to train your entire model to know how well the hyperparameters performed.

This is where Bayesian optimization comes in. It centers around answering the question “Based on what we know so far, what point should we evaluate next?” The process uses acquisition functions to trade off exploitation (looking at points in the hyperparameter space that we think are likely to be good) with exploration (looking at points we’re very uncertain about). Given an appropriate acquisition function and priors, it can help find a good point in the space in surprisingly few iterations.

Bayesian optimization was one of the tougher subjects to wrap my head around in graduate school, so I was very excited to see it get the Distill treatment. Agnihotri and Batra explain the process through an analogy of picking the best places to dig for gold which, incidentally, was also one of its first real-world applications in the 1950s! You can read the full explainer here; also check out DragonFly and BoTorch, two tools for automated Bayesian optimization from my ML resources list.

Report: Toward Trustworthy AI Development

That’s a lot of authors and institutions.

That’s a lot of authors and institutions.

A large coalition of big-name ML researchers and institutions published Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. The 80-page report recognizes “that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development” and presents a set of recommendations for providing evidence of the “safety, security, fairness, and privacy protection of AI systems.” Specifically, they outline two types of mechanisms:

  1. Mechanisms for AI developers to substantiate claims about their AI systems—going beyond just saying a system is “privacy-preserving” in the abstract, for example.
  2. Mechanisms that users, policy makers, and regulators can use to increase the specificity and diversity of demands they make to AI developers—again, going beyond abstract, unenforceable requirements.

The two-page executive summary and single-page list of recommendations (categorized across institutions, software, and hardware) are certainly worth a read for anyone who is to some extent involved in AI development, from researchers to regulators:

Recommendations from the report by Brundage et al. (2020).

Recommendations from the report by Brundage et al. (2020).

On the software side, I found the audit trail recommendation especially interesting. The authors state the problem as such:

AI systems lack traceable logs of steps taken in problem-definition, design, development, and operation, leading to a lack of accountability for subsequent claims about those systems’ properties and impacts.

Solving this will have go far beyond just saving a git commit history that traces the development of a model. For the data collection, testing, deployment, and operational aspects, there are no reporting or verification standards in widespread use yet.

I don’t think these standards can be sensibly defined for “AI” at large, so they’ll have to be implemented on an industry-by-industry basis. There are a whole different set of things to think about for self-driving cars than for social media auto-moderation, for example.

(Also see OpenAI’s short write-up of the report.)

Distill: Early Vision in CNNs

The largest neuron groups in the mixed3a layer of InceptionV1. (Olah et al., 2020)

The largest neuron groups in the mixed3a layer of InceptionV1. (Olah et al., 2020)

Chris Olah and his OpenAI collaborators published a new Distill article: An Overview of Early Vision in InceptionV1 . This work is part of Distill’s Circuits thread, which aims to understand how convolutional neural networks work by investigating individual features and how they interact through the formation of logical circuits (see DT #35). In this new article, Olah et al. explore the first five layers of Google’s InceptionV1 network:

Over the course of these layers, we see the network go from raw pixels up to sophisticated boundary detection, basic shape detection (eg. curves, circles, spirals, triangles), eye detectors, and even crude detectors for very small heads. Along the way, we see a variety of interesting intermediate features, including Complex Gabor detectors (similar to some classic “complex cells” of neuroscience), black and white vs color detectors, and small circle formation from curves.

Each of these five layers contains dozens to hundreds of features (a.k.a. channels or filters) that the authors categorize into human-understandable groups, which consist of features that detect similar things for inputs with slightly different orientations, frequencies, or colors. This goes from conv2d0, the first layer where 85% of filters fall into two simple categories (detectors for lines and for contrasting colors, in various orientations), all the way up to mixed3b, the fifth layer where there are over a dozen complex categories (detectors for small heads, for circles/loops, and much more). We’ve known that there are line detectors in early network layers for a long time, but this detailed taxonomy of later-layer features is novel—and it must’ve been an enormous amount of work to create.

A cicuits-based visualization of the black & white detector neuron group in layer mixed3a of InceptionV1. (Olah et al., 2020)

A cicuits-based visualization of the black & white detector neuron group in layer mixed3a of InceptionV1. (Olah et al., 2020)

For a few of the categories, like black & white and small circle detectors in mixed3a, and boundary and fur detectors in mixed3b, the article also investigates the “circuits” that formed them. Such circuits show how strongly the presence of a feature in the input positively or negatively influences (“excites” or “inhibits”) different regions of the current feature. One of the most interesting aspects of this research is that some of these circuits—which were learned by the network, not explicitly programmed!—are super intuitive once you think about them for a bit. The black & white detector above, for example, consists mostly of negative weights that inhibit colorful input features: the more color features in the input, the less likely it is to be black & white.

The simplicity of many of these circuits suggests, to me at least, that Olah et al. are currently exploring one of the most promising paths in AI explainability research. (Although there is an alternate possibility, as pointed out by the authors: that they’ve found a “taxonomy that might be helpful to humans but [that] is ultimately somewhat arbitrary.”)

Anyway, An Overview of Early Vision in InceptionV1 is one of the most fascinating machine learning papers I’ve read in a long time, and I spent a solid hour zooming in on different parts of the taxonomy. The groups for layer mixed3a are probably my favorite. I’m also curious about how much these early-layer neuron groups generalize to other vision architectures and types of networks—to what extent, for example, do these same neuron categories show up in the first layers of binarized neural networks?

If you read the article and have more thoughts about it that I didn’t cover here, I’d love to hear them. :)

Google releases TensorFlow Quantum

“A high-level abstract overview of the computational steps involved in the end-to-end pipeline for inference and training of a hybrid quantum-classical discriminative model for quantum data in TFQ. “

“A high-level abstract overview of the computational steps involved in the end-to-end pipeline for inference and training of a hybrid quantum-classical discriminative model for quantum data in TFQ. “

Google has released TensorFlow Quantum (TFQ), its open-source library for training quantum machine learning models. The package integrates TensorFlow with Cirq, Google’s library for working with Noisy Intermediate Scale Quantum (NISQ) computers (scale of ~50 - 100 qubits). Users can define a quantum dataset and model in Cirq and then use TFQ to evaluate it and extract a tensor representation of the resulting quantum states. For now Cirq computes these representations (samples or averages of the quantum state) using millions of simulation runs, but in the future it will be able to get them from real NISQ processors. The representations feed into a classical TensorFlow model and can be used to compute its loss. Finally, a gradient descent step updates the parameters of both the quantum and classical models.

A key feature of TensorFlow Quantum is the ability to simultaneously train and execute many quantum circuits. This is achieved by TensorFlow’s ability to parallelize computation across a cluster of computers, and the ability to simulate relatively large quantum circuits on multi-core computers.

TensorFlow Quantum is a collaboration with the University of Waterloo, (Google/Alphabet) X, and Volkswagen, which aims to use it for materials (battery) research. Other applications of quantum ML models include medicine, sensing, and communications.

These are definitely still very much the early days of the quantum ML field (and of quantum computing in general), but nonetheless it’s exciting to see this amount of software tooling and infrastructure being built up around it. For lots more details and links to sample code and notebooks, check out the Google AI blog post by Alan Ho and Masoud Mohseni here: Announcing TensorFlow Quantum: An Open Source Library for Quantum Machine Learning.

Distill: Zoom in on Circuits

“By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks.

“By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks."

Chris Olah et al. wrote a fascinating new Distill article about “circuits” in convolutional neural networks. The authors aim to reposition the field of AI interpretability as a natural science, like biology and chemistry:

There are two common proposals for dealing with this [lack of shared evaluation measures in the field of interpretability], drawing on the standards of adjacent fields. Some researchers, especially those with a deep learning background, want an “interpretability benchmark” which can evaluate how effective an interpretability method is. Other researchers with an HCI background may wish to evaluate interpretability methods through user studies.

But interpretability could also borrow from a third paradigm: natural science. In this view, neural networks are an object of empirical investigation, perhaps similar to an organism in biology. Such work would try to make empirical claims about a given network, which could be held to the standard of falsifiability.

Olah et al. do exactly this by investigating the Inception v1 network architecture in detail and presenting three speculative claims about how convolutional neural networks work:

  1. Features are the fundamental unit of neural networks. They correspond to directions. These features can be rigorously studied and understood.
  2. Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
  3. Analogous features and circuits form across models and tasks.

For the former two claims, they present substantive evidence: examples of curve detectors, high-low frequency detectors, and pose-invariant dog head detectors for their claim about features; and examples of again curve detectors, oriented dog head detection, and car + dog superposition neurons for the circuits claim.

As always, the article is accompanied by very informative illustrations, and even some interesting tie-backs to the historical invention of microscopes and discovery of cells. I found it a fascinating read, and it made me think about how these findings would look in the context of binarized neural networks. You can read the article by Olah et al. (2020) on Distill: Zoom In: An Introduction to Circuits.

Chollet's Abstraction and Reasoning Corpus

From top to bottom: Chollet’s hierarchy of intelligence, and two sample tasks from ARC. (François Chollet))

From top to bottom: Chollet’s hierarchy of intelligence, and two sample tasks from ARC. (François Chollet))

Keras creator François Chollet has published his 64-page manifesto on the path “toward more intelligent and human-like” AI in a paper titled The Measure of Intelligence that “formalizes things [he’s] been talking about for the past 10 years.” This is one of the most inspiring papers I’ve read in a long time, and it has many people around the office very excited too. Broadly, Chollet covers three topics: (1) the context and history of evaluating the intelligence of humans and machines; (2) a new perspective of what a framework for evaluating intelligence should be; and (3) the Abstraction and Reasoning Corpus (ARC), his implementation of this framework.

(1) Context and history. In cognitive science, there are are two opposing views of how the human mind works:

One view in which the mind is a relatively static assembly of special-purpose mechanisms developed by evolution, only capable of learning what is it programmed to acquire, and another view in which the mind is a general-purpose “blank slate” capable of turning arbitrary experience into knowledge and skills, and that could be directed at any problem.

Chollet explains that early (symbolic) AI research focused on the former view, creating intricate symbolic representations of problems over which computers could search for solutions, while current (deep learning) AI research focuses on the latter, creating “randomly initialized neural networks that starts blank and that derives its skills from training data.” He argues that neither of these approaches is sufficient for creating human-like intelligence, which, as he introduces through the lense of psychometrics, is mostly characterized by the ability to broadly generalize on top of some low-level core knowledge that all humans are born with.

(2) A new perspective. Chollet presents a new framework that is meant to be an “actionable perspective shift in how we understand and evaluate flexible or general artificial intelligence.” It evaluates these broad cognitive generalization abilities by modelling an intelligent system as something that can output static “skill programs” to achieve some task. The system’s intelligence is then measured by how efficiently it can generate these skills. Formally:

The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.

(3) Abstraction and Reasoning Corpus (ARC). Chollet finally proposes a practical implementation of the framework. An ARC task, as pictured above, consists of several example before and after grids, and one final before grid for which the intelligent system’s generated skill must figure out the correct after grid. Each task is designed so that the average human can solve it quite easily, and so that it depends only on core knowledge (and not learned things like the concept of arrows). Tasks range from simple object counting to more complex things like continuing a line that bounces off edges. There are 400 tasks to train on and 600 tasks to test on, of which 200 are secret and used to evaluate a competition.

I’ve barely scratched the surface of the paper here, and I highly recommend reading it in full and trying out ARC for yourself!

  • The Measure of Intelligence on arXiv: Chollet (2019)
  • The Abstraction and Reasoning Corpus on GitHub, including a version you can test yourself on: fchollet/ARC
  • Chollet’s twitter thread with some more background about how the paper came to be.

Papers with Code sotabench

The sotabench homepage. (sotabench)

The sotabench homepage. (sotabench)

The team behind Papers with Code has launched sotabench. The name derives from “state of the art (sota)” + “benchmark”, and its mission is precisely that: to benchmark every open source model —for free!

This is super cool. A researcher just needs to implement a small Python file that specifies how to run their model on some given test data. They can then submit their repository to sotabench, which tracks it and runs the model on standardized test data for every commit to the master branch. This way, it independently keeps track of whether models achieve the performance claimed by the authors (within some benchmark-specific error range).

The project is run by Atlas ML, a company whose mission is to “advance open source deep learning” (emphasis mine).

We believe the software of the future should be accessible to everyone, not just large technology companies. We are realising this future by building breakthrough tooling that allows the world to build and collaborate on ambitious deep learning projects.

Atlas ML was co-founded by Robert Stojnic, one of the first Wikipedia engineers. It’s therefore not surprising that the team’s main objective is to push the open and collaborative values that also drive Wikipedia. The meta dataset resulting from sotabench will also surely lead to lots of interesting research on reproducibility and model characteristics vs. performance. Check out the project at sotabench.com.