Rami Al-Rfou

Senior Staff Research Scientist

Google Research

Biography

Rami Al-Rfou is a Senior Staff Research Scientist at Waymo Research. He leads a team building foundational models for motion and driving, drawing on his expertise in large language models.

Previously, Rami was a technical lead for assisted writing applications such as SmartReply at Google Research. His research focused on improving the pretraining of large language models through token-free architectures, synthetic datasets constructed with knowledge-base-based generative models, and improved sampling strategies for multilingual datasets. These pretrained language models, trained on more than 100 languages, are used in query understanding, web page understanding, semantic search, and response ranking in conversations.

Al-Rfou’s research goes beyond language into designing better architectures for understanding large-scale data such as graphs. He repurposes language modeling tools to produce novel graph learning algorithms that measure node and graph similarities. These modeling ideas have been deployed at large scale for spam detection and personalization applications.

Al-Rfou received his PhD in Computer Science from Stony Brook University in 2015 under the supervision of Prof. Steven Skiena. He investigated how to use deep learning representations to build a truly massive multilingual NLP pipeline that supports more than 100 languages. Massively multilingual modeling has gained significant momentum in the years since. Al-Rfou’s experience in sequential modeling and crosslingual applications spans 10 years of academic and industrial research, with applications that have touched the lives of millions of users and open-source code that has helped thousands of students.

Experience

Senior Staff Research Scientist

Waymo Research

Mar 2021 – Present Mountain View, CA

Responsibilities include:

Foundational Motion Models TLM

Staff Research Scientist

Google Research

Jun 2015 – Mar 2021 Mountain View, CA

Responsibilities include:

SmartReply Technical Lead
Deep Retrieval Research Lead

Research Intern

Microsoft Research

Jun 2013 – Aug 2013 New York City, NY

Host: Leon Bottou
“Investigated new ways to improve semi-supervised learning with word embeddings.”

Research Intern

Google Research

Jun 2012 – Aug 2012 Mountain View, CA

Host: Jay Ponte
“Developed a language-independent, semi-supervised method for multilingual coreference resolution utilizing word embeddings and a fine-tuned dual-encoder ranking model.”

Software Engineer Intern

Google

Jun 2011 – Aug 2011 Mountain View, CA

Host: Mario Guajardo
“Developed a visualization system for the internal networks of Google’s data centers.”

Education

PhD in Natural Language Processing

Stony Brook University

Sep 2010 – Jun 2015 Stony Brook, NY

Dissertation: Polyglot: A Massive Multilingual Natural Language Processing Pipeline. Adviser: Steven Skiena.
Committee: Yejin Choi, Leman Akoglu, Leon Bottou

BSc. in Computer Engineering

University of Jordan

Sep 2004 – Feb 2009 Amman, Jordan

Dissertation: TCP Performance over Wireless Networks: Analysis & Simulation.
GPA: 3.79/4.0

Talks

Attention & Language

Tutorial on how the attention mechanism enhances language modeling.

Sep 17, 2020 12:00 PM — 1:30 PM

Rami Al-Rfou

Slides Video

Token-Free Language Modeling

How to eliminate segmentation (the last preprocessing step) from NLP models.

Feb 1, 2019 12:00 PM — 1:30 PM

Rami Al-Rfou

Slides

Conversation Modeling as a Search Problem

How to utilize response ranking for conversation modeling.

Apr 16, 2017 12:00 PM — 1:30 PM

Rami Al-Rfou

Slides

Featured Publications

Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, Yinfei Yang

April 2020 EMNLP 2020

LAReQA: Language-agnostic answer retrieval from a multilingual pool

We present LAReQA, a challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool. Unlike previous cross-lingual tasks, LAReQA tests for “strong” cross-lingual alignment, requiring semantically related cross-language pairs to be closer in representation space than unrelated same-language pairs. This level of alignment is important for the practical task of cross-lingual information retrieval. Building on multilingual BERT (mBERT), we study different strategies for achieving strong alignment. We find that augmenting training data via machine translation is effective, and improves significantly over using mBERT out-of-the-box. Interestingly, model performance on zero-shot variants of our task that only target “weak” alignment is not predictive of performance on LAReQA. This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation, and suggests that measuring both weak and strong alignment will be important for improving cross-lingual systems going forward. We release our dataset and evaluation code at https://github.com/google-research-datasets/lareqa

PDF Dataset

Recent Publications


Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou

PDF

Machine Translation Aided Bilingual Data-to-Text Generation and Semantic Parsing

Oshin Agarwal, Mihir Kale, Heming Ge, Siamak Shakeri, Rami Al-Rfou

PDF

mT5: A massively multilingual pre-trained text-to-text transformer

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel

PDF Code

LAReQA: Language-agnostic answer retrieval from a multilingual pool

Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, Yinfei Yang

PDF Dataset

Wiki-40B: Multilingual Language Model Dataset

Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou

PDF Code Dataset


Patents

Systems and Methods for Determining Graph Similarity
US Patent Application US 16/850,570
Selective text prediction for electronic messaging
US Patent Application US 15/852,916
Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval
US Patent Application US 15/476,280
Cooperatively training and/or using separate input and response neural network models for determining response(s) for electronic communications
US Patent Application US 15/476,292
Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases
Issued Oct 02, 2017 US 9514098 B1

Contact

631 371 3165
701 N Rengstorff Avenue, Apt 19, Mountain View, CA 94043

Projects

YouTube SmartReply

Graph Structure Understanding

Gmail SmartCompose

Gmail SmartReply

Polyglot NER

Polyglot Embeddings
