Language Modeling

Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique …

Machine Translation Aided Bilingual Data-to-Text Generation and Semantic Parsing

We present a system for bilingual Data-To-Text Generation and Semantic Parsing. We use a text-to-text generator to learn a single model that works for both languages on each of the tasks. The model is aided by machine translation during both …

Attention & Language

A tutorial on how attention mechanisms enhance language modeling.

Bridging the Gap for Tokenizer-Free Language Models

Purely character-based language models (LMs) have been lagging in quality on large scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the model is …

Character-Level Language Modeling with Deeper Self-Attention

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability …

Token-Free Language Modeling

How to eliminate segmentation (the last preprocessing step) from NLP models.