ML Research links
Wu et al. (2021) at OpenAI used a fine-tuned GPT-3 to recursively summarize books.
The model first separately summarizes sections of a book, then concatenates those summaries together and summarizes the result, and continues the process until it converges on a concise summary of the entire book.
This works surprisingly well!
For Romeo and Juliet, as visualized on the Summarizing Books demo page, this process takes it from 25,433 words (the whole play), to 5,809 words (72 summaries of sections), to 692 words (7 summaries of section summaries), to 116 words (the final summary).
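A minimal sketch of that recursive loop in Python, assuming a hypothetical summarize(chunk) function backed by the fine-tuned model (the real system also splits on section boundaries and was trained with human feedback):

```python
def split_into_chunks(text, max_words=2000):
    """Naively split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def recursive_summarize(text, summarize, max_words=150):
    """Summarize chunks, concatenate the summaries, and repeat until short enough."""
    while len(text.split()) > max_words:
        chunk_summaries = [summarize(chunk) for chunk in split_into_chunks(text)]
        text = "\n".join(chunk_summaries)
    return text
```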
The result is usually a bit worse than an “average” human summarizer, but importantly this recursive process allows researchers to trace back how the model constructed the summary: What part of the book was the source of this plot point in the summary?
What parts of lower-level summaries did the model not deem important enough to include in a higher level?
Constructing models in such a way that these kinds of questions can be answered is part of OpenAI’s larger research effort into the alignment problem: to “ensure that machine learning models act in accordance with human intentions.” (A core part of their mission.)
In 2019, Google AI introduced Translatotron, “the first ever model that was able to directly [end-to-end] translate speech between two languages,” instead of chaining together separate speech recognition, machine translation, and speech synthesis models (see DT #14).
Jia et al. (2021) updated the model to create Translatotron 2, which is newly able to do voice transfer — making the translated speech sound like it was spoken by the same voice as the input speech — “even when the input speech contains multiple speakers speaking in turns.” (Check out the blog post for some samples of generated audio.) One significant change from the original Translatotron is that both the voice and content of the input speech are now captured using a single encoder, which the authors claim makes the model less likely to be abused for spoofing arbitrary audio content (making someone’s voice say something they never said).
But I’m a bit surprised that this is such a central part of the blog post, since there are plenty of dedicated voice-mimicking speech generation models out there already that would be easier to use for this purpose anyway.
Big AI labs’ focus on multimodal neural networks — architectures that combine different types of input and output data, like images and text — continues.
After OpenAI’s DALL·E and CLIP, Stanford HAI’s Foundation Models, and DeepMind’s Perceiver IO, Google AI has now announced WIT: a Wikipedia-based image-text dataset.
Bridging the gap between human-annotated image captions (too labor-intensive) and broad web-scraped ones (too messy and English-centric), Srinivasan et al. (2021) created WIT by “extracting multiple different text selections associated with an image from Wikipedia articles and Wikimedia image links.” This results in a Creative Commons-licensed dataset of 37.5 million image-text examples, across 11.5 million unique images and 108 languages.
Until now these big multimodal models have mostly been trained on proprietary datasets by large private labs; this open dataset should help lower the barrier to entry for university labs to research similar models.
Perceiver IO is DeepMind’s new general-purpose architecture for processing a wide variety of input modalities — like images, videos, 3D point clouds, and sounds — into output vectors.
First, Perceiver (without the IO) scaled Transformers’ concept of attention to much larger input sizes, “without introducing domain-specific assumptions,” by encoding the inputs to a small fixed-size latent array and attending over that.
Now, Perceiver IO (arXiv, GitHub) extends this by also applying attention to the decoding side, so that one input can produce multiple outputs and both the inputs and outputs can be a mix of modalities.
“This opens the door for all sorts of applications, like understanding the meaning of a text from each of its characters, tracking the movement of all points in an image, processing the sound, images, and labels that make up a video, and even playing games, all while using a single architecture that’s simpler than the alternatives.” With OpenAI releasing DALL·E and CLIP and Stanford HAI launching the Foundation Models research center, both also this year, these large multimodal networks have become a central focus of leading AI labs.
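A rough PyTorch sketch of that encoder idea, with toy dimensions of my own choosing: a small learned latent array cross-attends over an arbitrarily long input, so attention cost never grows quadratically with input size. (The real Perceiver IO adds a query-based cross-attention decoder and many more refinements.)

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Encode a long input sequence into a small fixed-size latent array."""
    def __init__(self, input_dim=256, latent_dim=512, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.attn = nn.MultiheadAttention(latent_dim, num_heads,
                                          kdim=input_dim, vdim=input_dim,
                                          batch_first=True)

    def forward(self, inputs):                           # inputs: (batch, seq_len, input_dim)
        batch = inputs.shape[0]
        q = self.latents.unsqueeze(0).expand(batch, -1, -1)
        latents, _ = self.attn(q, inputs, inputs)        # latents cross-attend over the inputs
        return latents                                   # (batch, num_latents, latent_dim)

# e.g. a 10,000-step input gets compressed into 64 latent vectors
x = torch.randn(2, 10_000, 256)
print(LatentCrossAttention()(x).shape)                   # torch.Size([2, 64, 512])
```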
Stanford HAI’s new Center for Research on Foundation Models (“foundation models” is their name for large self-supervised models like GPT-3 and CLIP) has open-sourced Mistral, a “framework for transparent and accessible large-scale language model training.” It’s on GitHub at stanford-crfm/mistral.
OpenAI has released v1.0 of Triton, its Python-like GPU programming language for neural networks.
“Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS—something that many GPU programmers can’t do—in under 25 lines of code.
Our researchers have already used it to produce kernels that are up to 2x more efficient than equivalent Torch implementation.” It’s interesting to see how many different intermediate languages have popped up in recent years — JAX and MLIR are the other big ones — that sit somewhere between the abstraction level of frameworks (TensorFlow/PyTorch) and of bare-metal GPU languages (CUDA), all trying to find an architectural balance in what bits to keep flexible and what bits to abstract away from developers.
I always find it very hard to estimate which of these things are going to be mainstream and which will mostly be used as “glue” by framework- and low-level language builders.
I guess time will tell.
Triton is available on GitHub at openai/triton.
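For a feel of what Triton code looks like, here is a minimal vector-addition kernel in the style of the official tutorials; the block size and launch grid are my own toy choices. Each program instance handles one block of the output:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(98_432, device="cuda")
y = torch.rand(98_432, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)                # number of program instances to launch
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```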
AlphaFold, DeepMind’s protein folding neural network that represented a breakthrough in structural biology, is now open-source.
The model’s paper, Highly accurate protein structure prediction with AlphaFold by Jumper et al. (2021), got published in Nature; the code is on GitHub at deepmind/alphafold.
Lots of people in the community were asking for this.
Distill, my favorite machine learning journal, is going on hiatus.
Maybe I jinxed this last month when I hoped that the founding of Anthropic, a new AI safety research company started by many of the people behind Distill, wouldn’t impact their work on the journal.
Oops.
Over the past five years, Distill’s innovations of being web-only — not forcing articles to fit into two-column static PDFs — and explicitly caring about publishing explainers and artifacts, have pushed AI explainability to a whole new level.
I’ll miss this feed of highly-polished interactive articles a lot, but I also understand the editorial team’s decision here: they found that their mentorship, article template, community, and dedicated authors were more central to the excellent quality of work on Distill than the fact that Distill is its own journal.
They think the future of Distill-style articles is self-publication, “either on one-off websites or on a hypothetical ‘Distill Arxiv.’” See the editorial team’s blog post for more of their thoughts on this, and some other considerations — volunteer burnout also played a role.
A lot of the people behind some of my favorite recent machine learning research (like Circuits and Multimodal Neurons) have joined up to form a new AI safety and research company called Anthropic, and raised a $124 million series A round “to build more reliable, general AI systems.” I hope they keep publishing their research to Distill!
Cool new paper from Drain et al. (2021) at Microsoft Research: DeepDebug is a Transformer-based model that can fix Python bugs using stack traces, back translation and code skeletons.
One interesting contribution is their “neural bugs” injection model, which was trained to revert bug-fixing commits and “can generate near arbitrary edits that are drawn from the distribution of mistakes developers actually make.” On the QuixBugs benchmark, DeepDebug increases the number of bug fixes found by 50% while reducing false positives from 35% to 5%, all while decreasing the run timeout from six hours to one minute.
Can I get this in PyCharm?
Papers with Code has a new feature to link papers to independent reproducibility reports done as part of their ML Reproducibility Challenge (RC2020) event, which now covers NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR and ECCV.
Both in university and at my previous and current jobs, paper reproductions have always been some of my favorite learning experiences: you don’t truly understand a paper (and the math inside it) until you’ve coded it up and gotten it to perform similarly to the original!
It’s great to see some more formal infrastructure being built up around this practice, now including Papers with Code’s recurring event, standardized reports, and cross-linking.
Google launched Know Your Data, a new tool that “helps researchers, engineers, product teams, and decision makers understand datasets with the goal of improving data quality, and helping mitigate fairness and bias issues.” It includes 70+ existing image datasets for which the tool can find corrupted data, sensitive subjects, coverage gaps, and balance problems.
This looks like a solid technical step towards more equitable and reliable machine learning.
In response to the announcement that NeurIPS 2021 will have a datasets track (cool!), Cyril Diagne wrote a Twitter thread covering some of his favorite sources of publicly available visual datasets, including Kaggle (646 computer vision datasets), Visual Data (527 datasets) and Bifrost (1900 datasets).
A great source of project inspiration!
As part of their earlier research project to translate Fon — a language spoken by two million people across Benin, Nigeria and Togo — to French, Bonaventure Dossou and Chris Emezue built FFRTranslate.com.
They’ve wrapped their neural machine translation model into an easy-to-use website for translating back and forth between the two languages, and both the model and dataset are open-source on GitHub at bonaventuredossou/ffr-v1.
Dossou and Emezue are both MSc students and they’ve so far paid for the server costs of this project out of pocket.
They set up a GoFundMe and Paypal to help with the ongoing costs; I donated $20 through the latter and encourage you to also chip in if you can.
(For Dutch readers: the iDEAL option on GoFundMe doesn’t work because the project isn’t Dutch, and the website silently fails if you try to use it.)
ML resource: Published at ICLR 2021, Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges is a free 150-page proto-book by Bronstein et al. (2021) that “attempts to distill ‘all you need to build the architectures that are all you need.’” They express popular architectures like CNNs, GNNs, Transformers and LSTMs all using a common geometric blueprint.
Co-author Petar Veličković on Twitter: “Hence we believe that our work can be a useful way to navigate the increasingly challenging landscape of deep learning architectures.” Direct PDF link (large).
Distill #2: Branch Specialization by Voss et al. (2021), a chapter in the Circuits thread which includes previous work like Zoom in on Circuits, Early Vision in CNNs, Curve Detectors, Equivariance, High-Low Frequency Detectors, and Curve Circuits.
In this article, the authors find that similar circuit-level functions tend to group themselves in network branches, which are “sequences of layers which temporarily don’t have access to ‘parallel’ information which is still passed to later layers.” For example, all 30 curve-related features in InceptionV1’s mixed3b_5x5 layer are concentrated in just one of the layer’s four branches.
The authors hypothesize that this is because of a positive feedback loop during training, where the earlier layer in a branch is incentivized to form low-level features that the later layer uses as primitives for higher-level features.
One cool thing about Distill is that it also invites non-AI researchers to provide commentary on articles.
In this case, Matthew Nolan and Ian Hawes, neuroscientists at the University of Edinburgh, see a “striking parallel” with the separation of cortical pathways in the human brain.
Distill #1: Multimodal Neurons in Artificial Neural Networks by Goh et al. (2021), which investigates CLIP, OpenAI’s multimodal neural network that learned to match images on the internet to text snippets that surround them.
(Probably) unlike older image classification models, CLIP has neurons that “light up” for high-level concepts that were never explicitly part of any classification dataset.
“These neurons don’t just select for a single object.
They also fire (more weakly) for associated stimuli, such as a Barack Obama neuron firing for Michelle Obama or a morning neuron firing for images of breakfast.” The article has deep dives into three neuron families: (1) the Region Neurons family (like neurons for the USA or for Europe; these links take you to the neurons’ pages on OpenAI Microscope); (2) the Person Neurons family (including Lady Gaga and Ariana Grande); and (3) the Emotion Neurons family (including sleepy and happy).
It also highlights a baker’s dozen other families, from holidays and religions to brands and fictional universes.
There’s even an LGBTQ+ neuron that responds to things like rainbow flags and the word “pride”!
Beyond this exploration, the article looks at how these abstractions in CLIP can be used: for understanding language, emotion composition, and typographic attacks.
The authors also note that “CLIP makes it possible for end-users to ‘roll their own classifier’ by programming the model via intuitive, natural language commands — this will likely unlock a broad range of downstream uses of CLIP-style models.” Sound familiar?
I wonder how long it’ll take until OpenAI launches a v2 of their API that uses CLIP (+ DALL·E?) for image processing and generation the way v1 uses GPT-3 for text.
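“Rolling your own classifier” with CLIP really is just a few lines; a minimal sketch using the open-sourced clip package, with a made-up image path and label set:

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")   # downloads the pretrained CLIP model
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # hypothetical input image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a rainbow flag"]
text = clip.tokenize(labels)

with torch.no_grad():
    logits_per_image, _ = model(image, text)              # image-to-label similarity scores
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```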
Distill #3: Weight Banding by Petrov et al.
(2021), another chapter in the Circuits thread.
This article explores why weights in some layers display a very distinct banding pattern when visualized using Nonnegative Matrix Factorization (NMF), with the following process: “For each neuron, we take the weights connecting it to the previous layer.
We then use NMF to reduce the number of dimensions corresponding to channels in the previous layer down to 3 factors, which we can map to RGB channels.” This pattern occurs in the final convolutional layer across InceptionV1, ResNet50, and VGG19 (but not AlexNet, which does not use global average pooling).
The authors hypothesize that this horizontal banding pattern “is a learned way to preserve [vertical] spatial information as it gets lost through various pooling operations,” which is supported by an experiment in which they rotate the input images by 90 degrees: the bands also rotate by 90 degrees and become vertical.
The article concludes that banding is an example of emergent structure in vision models, but that we can’t say much about whether this structure is “good” or “bad” or how its presence should influence architectural decisions; not the most significant conclusions, but a very interesting observation nonetheless.
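A rough scikit-learn sketch of the visualization recipe quoted above; taking absolute weight values to satisfy NMF’s nonnegativity constraint is my own simplification, not necessarily what the authors do:

```python
import numpy as np
from sklearn.decomposition import NMF

def neuron_weight_image(weights, neuron):
    """weights: conv kernel of shape (out_channels, in_channels, kH, kW)."""
    w = np.abs(weights[neuron])                     # (in_channels, kH, kW), made nonnegative for NMF
    in_c, kh, kw = w.shape
    X = w.reshape(in_c, kh * kw).T                  # one row per spatial position
    rgb = NMF(n_components=3, init="nndsvda", max_iter=500).fit_transform(X)
    rgb = rgb / (rgb.max() + 1e-8)                  # normalize the 3 factors into [0, 1] for RGB
    return rgb.reshape(kh, kw, 3)

# e.g. visualize neuron 0 of a random stand-in for a final conv layer's weights
fake_weights = np.random.randn(512, 528, 3, 3)
print(neuron_weight_image(fake_weights, 0).shape)   # (3, 3, 3): a tiny RGB image per neuron
```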
EleutherAI, “a grassroots collection of researchers working to open source AI research,” has scaled its open-source GPT-Neo implementation up to GPT-3 size.
The weights are available to download for free, and you can play around with the pretrained models in an example Colab notebook.
Yay open science!
Now that I’ve run out of free GPT-3 credits on the OpenAI API, maybe I’ll be able to use this to generate new content for This Episode Does Not Exist! — drop me a message if you’d like me to try it out for your favorite podcast.
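If you’d rather skip the Colab, a quick sketch of generating text, assuming the EleutherAI/gpt-neo-2.7B checkpoint on the Hugging Face hub and the transformers pipeline API (it’s a roughly 10 GB download):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")
prompt = "In this episode of the podcast, we discuss"
out = generator(prompt, max_length=60, do_sample=True, temperature=0.9)
print(out[0]["generated_text"])   # prompt plus the model's sampled continuation
```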
I wrote about Google’s AI ethics crisis last December, when the company pushed out their Ethical Artificial Intelligence Team’s co-lead Timnit Gebru after a series of conflicts around a critical paper she was working on.
Her dismissal was not received well by her team and the community at large.
A few months later, Google also fired Margaret Mitchell, the team’s other co-lead.
And now it seems that the dust has settled a bit, internally at least: according to a post on Google’s The Keyword blog, Dr. Marian Croak, long-time VP at the company, “has created and will lead a new center of expertise on responsible AI within Google Research.” I wonder how much of Gebru’s and Mitchell’s original team is sticking around for this new group — the researchers who spoke out publicly did not seem to have much faith left in their ability to work on ethical AI issues from inside Google.
Three new long reads on Distill: a bit of a meta article about how they think about Visualizing Weights, which has been an important feature in lots of the publication’s recent articles; a new entry to the Circuits thread on reverse-engineering Curve Circuits; and an application of Neural Cellular Automata (NCA) for generating Self-Organizing Textures.
That last one features a fun interactive graphic at the top.
I didn’t get around to reading these in detail before sending out today’s DT — they’re quite long — but wanted to share them anyways.
Google has released Model Search, an open-source AutoML platform for the TensorFlow ecosystem.
The pitch: “Model Search is domain agnostic, flexible and is capable of finding the appropriate architecture that best fits a given dataset and problem, while minimizing coding time, effort and compute resources.” It can run on a single machine or in a distributed setting, and uses a reinforcement learning-inspired “explore & exploit” methodology to find a model architecture that optimizes for user-specified metrics.
For efficiency, Model Search also uses knowledge distillation and weight sharing between experiment runs.
It’s available on GitHub at google/model_search.
A new paper by Jurowetzki et al. (2021) quantifies transition flows of machine learning researchers between industry and academia and “finds that researchers working within the field of deep learning as well as those with higher average impact tend to transition into industry.” Juan Mateos Garcia, one of the authors, refers to this as “the AI brain drain.” This is quite a controversial topic and I haven’t made up my mind about how I feel about it yet, so I’d love to hear your thoughts.
(Hit that reply button!)
AI lab tooling long read #1: DeepMind published a blog post about using JAX to accelerate their research.
JAX is a modern take on the NumPy API that “includes an extensible system of composable transformations that help support machine learning research” by taking care of differentiation, vectorization (like abstracting batching away from the researcher), and JIT-compilation (for GPUs and TPUs).
The Python library now underpins many of DeepMind’s recent publications, and they’ve also open-sourced several components of their internal ecosystem on top of JAX: Haiku, Optax, RLax, Chex, and Jraph (“it’s pronounced gif”).
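The transformations compose nicely; a toy example of grad, jit, and vmap on a tiny squared-error loss (nothing DeepMind-specific, just vanilla JAX):

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((jnp.dot(x, w) - y) ** 2)                # simple squared-error loss

grad_fn = jax.jit(jax.grad(loss))                            # differentiation + XLA compilation
per_example_loss = jax.vmap(loss, in_axes=(None, 0, 0))      # auto-vectorize over the batch axis

w = jnp.ones(3)
x = jnp.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y = jnp.array([1.0, 2.0])
print(grad_fn(w, x, y))             # gradient of the mean loss w.r.t. w
print(per_example_loss(w, x, y))    # one loss value per example
```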
AI lab tooling long read #2: OpenAI added a blog post about scaling Kubernetes to 7,500 nodes.
Kubernetes is a system for orchestrating Docker containers across a datacenter, and I think most compute-heavy companies use it by now.
Both startups I’ve worked at also use it for their machine learning workloads — but at a scale on the order of tens or hundreds of nodes, not many thousands.
At that scale, a whole load of problems and potential optimizations suddenly become worth their engineer-time to look at, and that’s exactly what OpenAI does in this detailed post.
(A fun fact I quite enjoy and will probably never have a better excuse to share in DT than now: Kubernetes is abbreviated as K8s — “K-then-eight-letters-then-s,” like how internationalization is i18n — and there’s a management tool for Kubernetes called K9s.
At first sight, the name just looks like a typical programmer move, “K8+1s = K9s,” but it has another level to it: if you pronounce K9s as a word, it sounds like “canines” — dogs!
So the logo for K9s is a dog.
🐩)
Papers with Code is continuing on its quest to index more and more aspects of machine learning research.
They’ve now launched a new Datasets section that lets you search benchmarks by the dataset and modality they’re based on.
The page for ImageNet, for example, has a description, samples, usage statistics and metadata about the dataset, and links to 52 related benchmarks.
Papers with Code’s execution has been impressive lately: in the last year and a half, they’ve also launched sotabench, Methods, arXiv integration and Axcell.
These are all useful resources, and it seems their acquisition by Facebook in the middle of it all hasn’t slowed the team down one bit!
Ludwig Schubert, Chelsea Voss and Chris Olah published a new entry to the Distill Circuits thread, in which they model connections in trained convolutional neural networks as logical circuits to figure out how they work; I covered what makes this research so interesting last April in Distill: Early Vision in CNNs.
Using feature visualization, dataset examples, and synthetic tuning curves, this new article goes in-depth on a relatively unintuitive class of neurons: High-Low Frequency Detectors, which activate when they encounter “directional transitions from low to high spatial frequency.” In one very cool section of the article, the authors combine clusters of high- and low-frequency circuit components into two generic HF- and LF-factors, and show that they play the same roles in the implementation of high-low frequency detectors as their individual components do.
As always, the article is a great weekend long read.
PlotNeuralNet is an open-source LaTeX package for drawing deep neural networks, also featuring a Python wrapper.
I’ve spent hours doing this by hand in Figma in the past, so this is a very welcome change.
(Thanks for the tip, Tim!)
François Chollet’s NeurIPS talk on abstraction and reasoning, based on his 2019 Measure of Intelligence paper (see DT #26), is a great watch.
It covers the shortcut rule (“you get what you optimize for — at the detriment of everything else”); levels of generalization; and what kinds of abstraction he thinks deep learning can and cannot solve (by itself).
The tutorial also included a talk on analogical reasoning by Melanie Mitchell and one on math reasoning by Christian Szegedy; this live-tweeted thread by Antonio Vergari summarizes all three talks in bite-size chunks.
(RE: the previous link, it’d be super cool to see DeepMind try to tackle the ARC challenge — maybe someday.)
New from DeepMind, in Nature: Mastering Atari, Go, chess and shogi by planning with a learned model by Schrittwieser et al. (2020).
The paper describes DeepMind’s new games-playing reinforcement learning algorithm MuZero, which is the latest evolution of the lab’s previous AlphaGo (2016), AlphaGo Zero (2017), and AlphaZero (2018) algorithms.
The key improvement in MuZero is that it doesn’t need to be explicitly told the rules of the games it plays: it learns its own model of the environment that “just models aspects that are important to the agent’s decision-making process.” This helps it achieve state-of-the-art (and superhuman) results on the Atari suite, Go, chess, and shogi.
There are some new developments in Google’s ongoing AI ethics crisis (see DT #54).
CEO Sundar Pichai issued a company-wide memo apologizing for the fact that “Dr. [Timnit] Gebru’s departure … seeded doubts and led some in our community to question their place at Google.” This doesn’t address the central issue, and it did not land well with the community; see the tweets from Gebru and Jack Clark, as well as Khari Johnson’s interview with Gebru and NPR’s coverage of the story.
In response to the memo, a group of Google AI researchers sent the executives a list of demands asking for leadership and policy changes.
And meanwhile, someone made a fake Twitter account (complete with a GAN-generated profile picture) that opposed Gebru’s side of the story by pretending to be an ex-researcher from the Google ethics team.
I don’t think this’ll be resolved anytime soon.
Chris Olah et al. have a cool new Distill article in the Circuits thread: Naturally Occurring Equivariance in Neural Networks.
“We sometimes think of understanding neural nets as being like reverse engineering a regular computer program.
In this analogy, equivariance is like finding the same inlined function repeated throughout the code.”
SE4ML lab’s updated Engineering best practices for Machine Learning includes tips for managing data, code, training, deployment, teams, and high-level governance.
Cool video demo of speech recognition for overlapping voices by Google AI’s Quan Wang: VoiceFilter-Lite lets users enroll their voices on their phones (“this is my phone, please remember my voice”) and is then very accurately able to filter out other peoples’ voices while transcribing text.
This all happens on-device and has super low latencies.
Fantastic slides by Ilharco et al. (2020) for the EMNLP 2020 tutorial on High Performance Natural Language Processing (PDF), which got a lot of love on Twitter.
Alexander Rush: “Every slide is current to the minute.
Amazing set of diagrams.” Jeff Dean: “hundreds of slides that will transform your understanding!”
Hamel Husain wrote in a post on the GitHub blog that the company is going to assist in developing fast.ai’s nbdev, a literate programming environment for Python.
Donald Knuth describes literate programming as “a move away from writing computer programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.” I linked to nbdev last year (DT #28) but haven’t seen it in the wild much since then.
I’m also not yet convinced that this notebooks-centric approach scales to making large software projects easier to follow.
(Maybe that’s just because I’m used to the current standard structures for Python projects?) But the fact that GitHub is embracing nbdev so enthusiastically makes me curious about how it’ll develop in the coming years, and whether it’ll start to pop up more outside of fast.ai’s courses/projects.
Also presented at EMNLP 2020, the Language Interpretability Tool (LIT) is an open-source platform for visualizing and understanding NLP models.
It builds on top of Google’s previous What-If Tool, and supports “local explanations, including salience maps, attention, and rich visualizations of model predictions, as well as aggregate analysis including metrics, embedding spaces, and flexible slicing.” James Wexler and Ian Tenney introduce the tool in a post on the Google AI Blog, which also includes a few demos.
Cool new Distill paper from Hilton et al. (2020): Understanding RL Vision.
The authors train a reinforcement learning agent to play a procedurally-generated video game based on single frames as input, and then develop an interactive interface (embedded in the article!) to study what different parts of the network learn.
Using Circuits editing (see DT #37), they then make the agent blind to e.g. left-moving enemies in the game, and experimentally show that this indeed makes it fail more often by missing such enemies.
“Our results depend on levels in CoinRun being procedurally-generated, leading us to formulate a diversity hypothesis for interpretability.
[Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).] If it is correct, then we can expect RL models to become more interpretable as the environments they are trained on become more diverse.” As always, the full article is a great Sunday long read.
Efemarai is a tool to visualize, inspect and debug deep learning code.
The visualization bit looks coolest: “Efemarai can scan the execution of your machine learning code and automatically generate interactive 3D visualizations.
All you need is a single line of Python to explore the entire computational graph of your model with all of its values, parameters and gradients.”
In his follow-up to A very short history of some times we solved AI, Julian Togelius asks How many AGIs can dance on the head of a pin? Intelligence — let alone artificial general intelligence — is an extremely ill-defined concept; and the path from current machine learning software to a self-aware Terminator is… unclear, to put it mildly.
So, Togelius argues that the popular philosophical AGI debates (about how to contain or align an exponentially self-improving intelligence explosion) are completely moot.
“If you don’t believe in angels, it makes no sense discussing how much space they occupy.
It just becomes a word game.” It’s a good read that once again very much aligns with my views on AGI.
Wu et al. (2020) present four variations of the popular MNIST digit recognition dataset for “four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge’ez (Ethiopic), Vai, Osmanya, and N’Ko.” They’re formatted so that they can be used as drop-in replacements for any existing MNIST model, and the authors show that LeNet achieves similar classification accuracies to classic MNIST on each of the new datasets.
The data is open-source at Daniel-Wu/AfroMNIST.
Kyle Wiggers wrote a feature for Venture Beat on Facebook’s new M2M-100 model — a machine translation model that, unlike e.g. Google Translate for many language pairs, does not use English as a go-between.
Instead of translating from A to English and then from English to B, it translates from A directly to B — which for 100 languages means there are nearly 10,000 combinations.
The model was trained on 2200 of these combinations, and is a new state of the art (in terms of BLEU) for many non-English language pairs.
The model has 15 billion parameters, continuing the trend that strength really is in numbers for NLP and MT models.
FAIR has open-sourced M2M-100 at pytorch/fairseq.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale is a paper under review for ICLR 2021 that’s been making the rounds on Twitter.
I found Yannic Kilcher’s explainer video — which starts with a lovely rant about “double-blind” peer review — a good introduction to the model, which could be the start of Transformers overtaking convolutional models at the very largest scales of computer vision.
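The core trick is delightfully simple: chop the image into 16x16 patches, flatten each one, and feed the resulting sequence to a standard Transformer. A rough sketch of just that patch-embedding step, with toy dimensions (the full model adds a class token, position embeddings, and the Transformer encoder on top):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of linearly projected 16x16 patches."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # a conv with kernel = stride = patch size is equivalent to "flatten each patch, then Linear"
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, imgs):                     # imgs: (batch, 3, H, W)
        x = self.proj(imgs)                      # (batch, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (batch, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                              # torch.Size([1, 196, 768]): 196 "words" of dim 768
```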
Cool new arXiv.org feature: Papers with Code-discovered implementations are now linked right on a paper’s abstract page.
I’ve always found it quite easy to find any available implementations with a few quick Google and GitHub searches, but integrations like this are great for discoverability.
Building on their previous three years of graduate school application mentoring programs, Black in AI has launched an Academic Positions program to support Black junior researchers getting started in “careers in academia, industry, and policy.” The launch blog post includes details about the program, tips on how academics and organizations can support it, and lots of additional resources.
This is a great link to amplify within your ML network!
TensorSensor is a Python package that “clarifies” (visualizes) the dimensions of tensors in numpy, TensorFlow or PyTorch.
I recently had to reproduce a paper that wrote down its math in a simplified form that ignored the out-channel dimension of convolutional filters, and spent a lot of time trying to get all my matrices to line up correctly with that extra dimension.
This tool would’ve made that a lot easier!
Also check out Terence Parr’s introduction to TensorSensor.
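Usage is a one-line context manager; a sketch based on the package’s clarify() API, with deliberately mismatched shapes so it has something to complain about:

```python
import numpy as np
import tsensor

W = np.random.rand(764, 100)
x = np.random.rand(200, 1)          # wrong shape on purpose
with tsensor.clarify():             # on error, annotates the expression with each operand's shape
    y = W @ x                       # raises, and tsensor points at W (764,100) vs x (200,1)
```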
We’ve seen “tuning hyperparameters without grad students” with Dragonfly (DT #11) but… how much does a researcher’s experience actually correlate with their skills for tuning an ML model?
Anand et al. (2020) investigated this and found a strong positive correlation between experience and final model accuracy, and “that an experienced participant finds better solutions using fewer resources on average.” Glad to see my skills aren’t completely automatable yet!
(The paper is co-authored by Jan van Gemert, who was the first person to explain to me what a convolution is, in a guest lecture during my first year of undergrad.
😊)
Microsoft has updated DeepSpeed, its open-source library for efficiently training massive ML models (see DT #34, #40), with four big improvements: 3D parallelism for training trillion-parameter models; ZeRO-Offload for 10x bigger model training on a single GPU; Sparse Attention kernels for 10x longer input sequences in Transformers; and 1-bit Adam for reducing network load in multi-GPU training.
My work focuses on tiny models rather than large ones, so I haven’t gotten a chance to try DeepSpeed, but if any of you have, I’d love to hear about your experience!
The NumPy paper is out!
It’s a Nature article by Harris et al. (2020).
Is this going to break records in citation counts, given that pretty much every machine learning paper should probably reference it?
Either way, I’ve updated my repo of BibTeX citations for Python packages to add it.
Chaitanya K. Joshi wrote an essay for The Gradient where he argues that Transformers are Graph Neural Networks, equating the former’s attention mechanism to the latter’s aggregation functions.
It’s a great introduction to both model types, and Joshi poses that these two subfields of machine learning can learn a lot from each other.
(Also, he represents nodes in a GNN using emojis instead of letters, and references them as such in the text, which I love.) Great weekend read.
Data Readiness for Natural Language Processing (Olsson and Sahlgren, 2020) is a detailed guide describing “how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis methods.” Nice link for your NLP toolbox!
When I covered BigGAN in February 2019 (DT #6), its image generation results were very impressive — but the model was also incredibly expensive to train, requiring a cluster of hundreds of TPUs.
Now, just a year and a half later, Han et al. (2020) introduced not-so-BigGAN: close-enough image quality trained on just 4 Tesla-V100 GPUs — an order of magnitude less compute.
The speed of this progress is amazing.
Google’s Dataset Search (DT #15) now contains over 31 million datasets, a tripling since its initial launch two years ago.
Natasha Noy and Omar Benjelloun wrote up an analysis of the types of datasets that are now available for the Google AI blog.
Social science plus geoscience make up almost half the datasets, and together with biology, agriculture, and medicine, they comprise three quarters.
The post also includes some best practices for publishing datasets so that Dataset Search can properly index them for other researchers to find.
Although increasingly enormous do-it-all language models like T5 and GPT-3 (DT #42, #44) have been getting a lot of attention (haha) lately, smaller and more parameter-efficient models are still improving a lot as well.
A recent interesting one is REALM by Guu et al. (2020) at Google AI, which, unlike these larger models, separates the encoding of language from the encoding of knowledge. Instead of implicitly storing information about the world in the language model’s weights, it introduces a neural retriever that learns to find relevant snippets of text from Wikipedia to be fed into the language model as context alongside the original query.
As a result, it achieves a score of 40.4 on Natural Questions with just 300 million parameters, compared to T5’s score of 36.6 with 11 billion parameters—10% better results at 35x fewer parameters.
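Conceptually, the retrieve-then-read loop looks something like the hand-wavy sketch below; this is the general idea, not REALM’s actual architecture or training setup (which learns the retriever end-to-end with a masked-language-modeling objective), and all the function names are stand-ins:

```python
import numpy as np

def answer(query, encode, doc_embeddings, docs, language_model, k=5):
    """Retrieve the k most relevant snippets, then condition the language model on them."""
    q = encode(query)                              # dense query embedding
    scores = doc_embeddings @ q                    # inner-product relevance scores over all snippets
    top_k = np.argsort(-scores)[:k]                # indices of the best-matching snippets
    context = " ".join(docs[i] for i in top_k)
    return language_model(f"{context}\n\nQuestion: {query}\nAnswer:")
```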
TF-Coder is TensorFlow’s new tensor manipulation utility.
Given a few examples of input and output tensors, it generates TF2 code that transforms the input into the output.
Check out the code on GitHub, try it out in a Colab notebook, or read about how it works in Shi et al. (2020).
Does GPT-3, OpenAI’s latest iteration of their gargantuan language model (DT #42, #44) mean we’re imminently close to artificial general intelligence (AGI) like some of the Twitter hype has been suggesting?
Reinforcement learning researcher Julian Togelius says no: in A very short history of some times we solved AI, he argues that we’ve been moving the goalposts for AGI for decades.
“Algorithms for search, optimization, and learning that were once causing headlines about how humanity was about to be overtaken by machines are now powering our productivity software.
And games, phone apps, and cars.
Now that the technology works reliably, it’s no longer AI (it’s also a bit boring).” Forgive the long quotes, but I share Togelius’ views on AGI almost exactly, and he communicates them very succinctly: “So when will we get to real general artificial intelligence?
Probably never.
Because we’re chasing a cloud, which looks solid from a distance but scatters in all directions as we drive into it.” For his more optimistic conclusion, read the full blog post.
Cool new dataset by Zhou et al. (2020): HoliCity: A City-Scale Data Platform for Learning Holistic 3D Structures.
Covering a 20-square-kilometer area of central London, it aligns 6,300 high-resolution panorama photos with a 3D CAD model of the city, “with the ultimate goal of supporting real-world applications including city-scale reconstruction, localization, mapping, and augmented reality.” The dataset’s website includes a few samples with interactive sliders between their RGB, plane, CAD, and semantic views.
Google has released its Model Card Toolkit, a JSON spec that makes it easier to specify the capabilities and gotchas of trained ML models (see DT #41).
Gradio is an open-source Python library for generating quick web UIs around ML models: use it to “play around with your model in your browser by dragging-and-dropping in your own images (or pasting your own text, recording your own voice, etc.).”
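A minimal sketch with a stubbed-out classify function; Gradio handles the upload widget, preprocessing, and serving:

```python
import gradio as gr

def classify(image):
    # image arrives as a numpy array; plug in your actual model here (hypothetical stub)
    return {"cat": 0.7, "dog": 0.3}

# "image" in, "label" out: Gradio builds the drag-and-drop UI and confidence bars for you
gr.Interface(fn=classify, inputs="image", outputs="label").launch()
```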
Josh Meyer has created a handy markdown template for creating datasheets for datasets (see DT #41).
This would’ve come in very handy a month ago when I was writing a dataset datasheet at work and copying over all the questions from the PDF of Gebru et al. (2018) by hand.
Today in gargantuan language models: Google’s new state-of-the-art model for translating from 100 languages to English has 600 billion parameters.
Compare this to OpenAI’s GPT-3 at 175 billion parameters from June (see DT #42) and Microsoft’s Turing-NLG at 17 billion parameters from February (DT #33).
Google’s 600 billion-parameter Transformer took four days to train on 2048 (!) TPUs, which is actually relatively little for that model size.
This training process is therefore also the focus of the paper describing the model: Lepikhin et al. (2020) introduce GShard, “an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code.”
In a paper to be published at ICSE 2020, Liem and Panichella (2020) introduce two heuristics that can be used to semi-automatically uncover high-level issues in data labels and representations.
In ImageNet for example, they find that the synonymous “laptop” and “notebook” labels consistently confuse models, and argue that such oracle issues warrant closer collaboration between the machine learning and software testing communities.
The paper, called Oracle Issues in Machine Learning and Where to Find Them, also comes with an amazing video where the authors—animated as talking portrait paintings from the wizarding world—describe their “potion for better Defense Against the Dark ML Arts.” It may be the most perfect thing I’ve ever shared in this section.
Nick Cammarata et al. published a new article on curve detectors in Distill’s Circuits thread (see DT #35, #37).
It’s a deep dive into the 3b:379 neuron of the InceptionV1 network, and—as usual—it’s an exceptionally well-written and well-illustrated post.
I also really like the idea of a CNN learning how to do e.g. curve detection (which hasn’t been solved classically!), and then teaching us how to implement it by hand.
Goes to show the power of the Circuits hypothesis.
Vincent Sitzmann et al. introduced SIREN, a new activation function for implicit neural representations, a technique to encode a signal (e.g. an image, audio sample, video clip, or 3D scene) in the parameters of a neural network.
Their main innovation is using a periodic activation function (based on a sine wave) instead of the usual ReLU, TanH, or Softplus nonlinearities, which yields very impressive results.
Check out their paper video and demo site.
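The change is easy to express in code; a minimal sine layer along the lines of the paper, where the w0=30 frequency scaling and the uniform initialization bounds follow my reading of its recommendations:

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a scaled sine nonlinearity, SIREN-style."""
    def __init__(self, in_features, out_features, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_features, out_features)
        bound = 1 / in_features if is_first else math.sqrt(6 / in_features) / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)   # keeps activations well-distributed

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

# e.g. map 2D pixel coordinates to RGB values to fit a single image
siren = nn.Sequential(SineLayer(2, 256, is_first=True), SineLayer(256, 256), nn.Linear(256, 3))
print(siren(torch.rand(1024, 2)).shape)   # torch.Size([1024, 3])
```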
Watch tip: Superintelligence: The Idea That Eats Smart People, a 2016 keynote by Maciej Ceglowski in which he compares machine learning to alchemy and artificial general intelligence to the philosopher’s stone.
It’s quite relevant to the issues in our field today.
Free online talk + meetup on June 18th from the PyData Boston group: Causal Modeling in Machine Learning by AI research engineer Robert Osazuwa Ness.
“Oun yìn wàn nouwé” means “I love you” in Fon, an African language spoken by approximately two million people across Benin, Nigeria and Togo.
Aiming to translate texts from his mother, Bonaventure Dossou worked with Chris Emezue to scrape data from a Jehovah’s Witness Bible and create a basic Fon to French machine translation model.
Since the language is “mostly spoken and rarely documented,” this is a low-resource neural machine translation problem, which presents a number of additional challenges.
(See also Sennrich and Zhang (2019), by my former NLP/NMT professor in Edinburgh.)
Related: Jo and Gebru (2019) point out that many AI fairness problems are rooted in the data collection and annotation process, and offer “five key approaches in document collection practices in archives that can inform data collection in sociocultural ML.” These can be summarized as consent, inclusivity, power, transparency, and ethics & privacy, with details in Table 1 of their paper: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.
Google also proposed a new optimization technique: speeding up neural network training with data echoing.
It’s quite simple: while one bottlenecked part of the training pipeline is getting the next input ready, the current input gets “echoed” through the rest of the model graph, reducing training time while preserving predictive performance.
This is cool work, and hopefully it’ll get upstreamed to TensorFlow for everyone to use.
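In a tf.data pipeline, echoing can be expressed as a per-element repeat placed right after the slow stage, so the downstream (accelerator-bound) part of the pipeline sees each example e times; a hedged sketch of the idea, not Google’s implementation:

```python
import tensorflow as tf

def echo(dataset, e=2):
    """Repeat each element e times, placed right after the slow part of the pipeline."""
    return dataset.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(e))

# toy stand-in for an expensive read/augmentation stage
slow_pipeline = tf.data.Dataset.range(10).map(lambda x: x * 2)
# echo, then re-shuffle so repeated examples don't all land in the same batch
echoed = echo(slow_pipeline, e=2).shuffle(buffer_size=100).batch(4)
for batch in echoed.take(2):
    print(batch.numpy())
```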
Google’s People + AI Research (PAIR) group has a set of open-source tools and platforms “that make ML models more understandable, trustworthy, and fair,” including several model visualization and feature attribution projects.
Microsoft released the second version of DeepSpeed and its Zero Redundancy Optimizer (ZeRO-2, see DT #34).
These improvements enable training models that are an order of magnitude larger and faster than previously possible: up to 170 billion parameters, at up to 10x previous state-of-the-art speeds.
It’s open-source on GitHub at microsoft/DeepSpeed.
OpenAI released an analysis showing that “since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months.
Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet” (Hernandez and Brown, 2020).
Papers With Code, the site that has benchmarked the performance of over 20,000 ML models on 2,500 standard tasks, now links results in plots back directly to the tables they came from in a paper.
Ross Taylor wrote up their automated results extraction method, which is open-source on GitHub: paperswithcode/axcell.
PyTorch Serve is an open-source tool by Facebook and Amazon to easily turn ML models into API endpoints accessible from the web: pytorch/serve.
Cool progress on automated chip design from Anna Goldie and Azalia Mirhoseini at Google Brain: “[whereas] existing baselines require human experts in the loop and take several weeks to generate, our method can generate placements in under six hours that outperform or match their manually designed counterparts.” Check out their Google AI blog post on the research: Chip Design with Deep Reinforcement Learning.
Andrej Karpathy’s weekend project: a 150-line Python implementation of an autograd engine and PyTorch-like neural net library on top of it, all in Jupyter notebooks on GitHub: karpathy/micrograd.
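A tiny usage sketch of the library’s scalar Value class, which tracks its own gradients through a little computation graph:

```python
from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a * b + b ** 3        # builds a small computation graph of scalar Values
c.backward()              # backpropagate through the graph
print(a.grad, b.grad)     # dc/da = b = 2.0, dc/db = a + 3*b^2 = 8.0
```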
OpenAI Microscope is a collection of visualizations of every significant layer and neuron of eight vision “model organisms” which are often studied in interpretability.
(See for example OpenAI’s Distill paper on early vision in InceptionV1; DT #37.)
TensorFlow Profiler provides a set of tools that you can use to measure the training performance and resource consumption of your TensorFlow models.
Neurologists Joseph Makin et al. at UC San Francisco used a 250-electrode brain implant to decode human brain signals into text with techniques from machine translation—at a word error rate of only 3%.
The implant technology won’t be widely usable anytime soon, if ever, but you can download the code it runs on anyway: jgmakin/machine_learning.
Shuyang Cheng et al. at self-driving car company Waymo have extended Google’s reinforcement-learned image data augmentation technique, AutoAugment, to work with LIDAR data.
Elon Musk still believes LIDAR sensors are useless “appendices” for self-driving cars but the rest of the industry, Waymo included, is evidently not getting anywhere closer to agreeing with that thesis.
There’s a cool new Distill post about visualizing neural networks with the grand tour and lots of other linear and non-linear visualizations.
Google has a collection of courses and resources to help developers improve their technical documentation: Technical Writing Courses for Engineers.
Self-driving car company Wayve wrote a blog post about predicting a distribution of different near-term future traffic scenarios based on a car’s current situation and possible next actions.
HiPlot is Facebook Research’s new “lightweight interactive visualization tool to help AI researchers discover correlations and patterns in high-dimensional data using parallel plots and other graphical ways to represent information.”
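It works straight from a notebook too; a minimal sketch with made-up hyperparameter records:

```python
import hiplot as hip

experiments = [
    {"lr": 0.001, "dropout": 0.1, "layers": 4, "val_acc": 0.82},
    {"lr": 0.010, "dropout": 0.3, "layers": 6, "val_acc": 0.79},
    {"lr": 0.003, "dropout": 0.2, "layers": 4, "val_acc": 0.85},
]
hip.Experiment.from_iterable(experiments).display()   # renders the interactive parallel-coordinates plot
```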
Neural Tangents is a high-level neural network API for specifying complex, hierarchical, neural networks of both finite and infinite width.
Google released Open Images V6, a new version of “the largest annotated image dataset in many regards.” It now features local narratives, such as the one I embedded above, consisting of “synchronized voice, text, and mouse traces over the objects being described.”