Rami Al-Rfou is a Staff Research Scientist at Google Research. In his current role, he is a technical lead for assisted writing applications such as SmartReply. His research focuses on improving the pretraining of large language models through token-free architectures, synthetic datasets constructed with knowledge-base-driven generative models, and improved sampling strategies for multilingual datasets. These pretrained language models, trained on 100+ languages, are used in query understanding, web page understanding, semantic search, and response ranking in conversations.
Al-Rfou’s research goes beyond language into designing better architectures for understanding large-scale data such as graphs. He repurposes language modeling tools to produce novel graph learning algorithms that measure node and graph similarities. These modeling ideas have been deployed at scale for spam detection and personalization applications.
Al-Rfou received his PhD in Computer Science from Stony Brook University in 2015, under the supervision of Prof. Steven Skiena. He investigated how to use deep learning representations to build a truly massive multilingual NLP pipeline supporting 100+ languages. Massively multilingual modeling has gained significant momentum since then. Al-Rfou’s experience in sequential modeling and cross-lingual applications spans 10 years of academic and industrial research, with applications that have touched the lives of millions of users and open-sourced code that has helped thousands of students.
We present LAReQA, a challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool. Unlike previous cross-lingual tasks, LAReQA tests for “strong” cross-lingual alignment, requiring semantically related cross-language pairs to be closer in representation space than unrelated same-language pairs. This level of alignment is important for the practical task of cross-lingual information retrieval. Building on multilingual BERT (mBERT), we study different strategies for achieving strong alignment. We find that augmenting training data via machine translation is effective, and improves significantly over using mBERT out-of-the-box. Interestingly, model performance on zero-shot variants of our task that only target “weak” alignment is not predictive of performance on LAReQA. This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation, and suggests that measuring both weak and strong alignment will be important for improving cross-lingual systems going forward. We release our dataset and evaluation code at https://github.com/google-research-datasets/lareqa
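The “strong alignment” criterion above can be made concrete with a small sketch. The toy embedding vectors and helper below are purely illustrative (they are not actual mBERT outputs or the LAReQA evaluation code); the point is only the comparison: a relevant cross-language question–answer pair must score higher than an unrelated same-language pair.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence encodings (illustrative values only).
question_en   = np.array([1.0, 0.2, 0.0])   # English question
answer_de     = np.array([0.9, 0.3, 0.1])   # relevant German answer
distractor_en = np.array([0.1, 1.0, 0.5])   # unrelated English candidate

# "Strong" cross-lingual alignment: the semantically related cross-language
# pair must be closer in representation space than the unrelated
# same-language pair.
strong = cosine(question_en, answer_de) > cosine(question_en, distractor_en)
```

A model with only “weak” alignment might rank all English candidates above all German ones regardless of relevance, which is exactly the failure mode LAReQA is designed to expose.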
Systems and Methods for Determining Graph Similarity, US Patent Application US16/850,570
Selective text prediction for electronic messaging, US Patent Application US15/852,916
Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval, US Patent Application US15/476,280
Cooperatively training and/or using separate input and response neural network models for determining response(s) for electronic communications, US Patent Application US15/476,292
Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases, Issued Oct 02, 2017, US 9514098 B1