With more than 300 million learners, Duolingo has the world's largest collection of language-learning data at its fingertips. This allows us to build unique systems and uncover new insights about the nature of language and learning. We are committed to sharing our data and findings with the broader research community.
We present the task of second language acquisition (SLA) modeling. Given a history of errors made by learners of a second language, the task is to predict errors that they are likely to make at arbitrary points in the future. We describe a large corpus of more than 7M words produced by more than 6k learners of English, Spanish, and French using Duolingo, a popular online language-learning app. Then we report on the results of a shared task challenge aimed at studying the SLA task via this corpus, which attracted 15 teams and synthesized work from various fields including cognitive science, linguistics, and machine learning.
We present half-life regression (HLR), a novel model for spaced repetition practice with applications to second language acquisition. HLR combines psycholinguistic theory with modern machine learning techniques, indirectly estimating the "half-life" of a word or concept in a student’s long-term memory. We use data from Duolingo, a popular online language learning application, to fit HLR models, reducing error by 45%+ compared to several baselines at predicting student recall rates. HLR model weights also shed light on which linguistic concepts are systematically challenging for second language learners. Finally, HLR was able to improve Duolingo daily student engagement by 12% in an operational user study.
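To make the "half-life" idea concrete, here is a minimal sketch of how a half-life estimate can be turned into a recall prediction. It assumes the exponential forgetting-curve form p = 2^(-lag / h) with h = 2^(theta · x), which the abstract does not spell out; the feature names and weight values below are hypothetical and are not fitted to any data.

```python
def half_life(weights, features):
    """Estimated half-life in days, assuming h = 2 ** (theta . x)."""
    dot = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 2.0 ** dot


def recall_probability(lag_days, h):
    """Predicted recall after lag_days, assuming p = 2 ** (-lag / h)."""
    return 2.0 ** (-lag_days / h)


# Hypothetical weights and features for one student-word pair (illustration only).
weights = {"bias": 1.0, "times_correct": 0.5, "times_wrong": -0.5}
features = {"bias": 1.0, "times_correct": 4.0, "times_wrong": 2.0}

h = half_life(weights, features)      # 2 ** (1.0 + 2.0 - 1.0) = 4.0 days
p = recall_probability(4.0, h)        # lag equals the half-life, so p = 0.5
print(h, p)
```

Under these assumptions, recall is 1.0 immediately after practice and falls to 0.5 once the lag equals the estimated half-life, which is what makes the half-life a natural quantity for scheduling the next review.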
Collecting (or "sampling") information that one expects to be useful is a powerful way to facilitate learning. However, relatively little is known about how people decide which information is worth sampling over the course of learning. We describe several alternative models of how people might decide to collect a piece of information inspired by "active learning" research in machine learning. We additionally provide a theoretical analysis demonstrating the situations under which these models are empirically distinguishable, and we report a novel empirical study that exploits these insights. Our model-based analysis of participants’ information gathering decisions reveals that people prefer to select items which resolve uncertainty between two possibilities at a time rather than items that have high uncertainty across all relevant possibilities simultaneously. Rather than adhering to strictly normative or confirmatory conceptions of information search, people appear to prefer a "local" sampling strategy, which may reflect cognitive constraints on the process of information gathering.
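The contrast between "local" and "global" uncertainty can be illustrated with two standard active-learning scores. This is a rough sketch, not the models from the study: it assumes margin sampling as a stand-in for uncertainty between two possibilities and Shannon entropy as a stand-in for uncertainty across all possibilities, and the two candidate items are made up.

```python
import math


def entropy_score(probs):
    """Global uncertainty: Shannon entropy (bits) over all possibilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def margin_score(probs):
    """Local uncertainty: high when the top two possibilities are close."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


# Hypothetical belief distributions over three categories for two items.
item_a = [0.5, 0.5, 0.0]   # a pure two-way toss-up
item_b = [0.4, 0.3, 0.3]   # uncertainty spread over all three

for name, probs in [("A", item_a), ("B", item_b)]:
    print(name, round(entropy_score(probs), 3), round(margin_score(probs), 3))
```

With these made-up numbers, the entropy score ranks item B higher while the margin score ranks item A higher, so the two strategies would sample different items from the same pool.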
We show that student learning can be accurately modeled using a mixture of learning curves, each of which specifies error probability as a function of time. This approach generalizes Knowledge Tracing, which can be viewed as a mixture model in which the learning curves are step functions. We show that this generality yields order-of-magnitude improvements in prediction accuracy on real data. Furthermore, examination of the learning curves provides actionable insights into how different segments of the student population are learning. To make our mixture model more expressive, we allow the learning curves to be defined by generalized linear models with arbitrary features. This approach generalizes Additive Factor Models and Performance Factors Analysis, and outperforms them on a large, real world dataset.
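As a rough sketch of the mixture idea, the snippet below predicts a student's next error by weighting each learning curve by how well it explains the errors observed so far. It assumes time is measured in practice trials indexed from 0 and that errors are independent given the curve; the two curves, their parameters, and the prior weights are hypothetical, not fitted values.

```python
def sequence_likelihood(curve, errors):
    """Probability of an observed 0/1 error sequence under one learning curve."""
    prob = 1.0
    for t, made_error in enumerate(errors):
        q = curve(t)  # error probability at trial t
        prob *= q if made_error else (1.0 - q)
    return prob


def predict_next_error(priors, curves, errors):
    """Posterior-weighted error probability at the next trial."""
    t_next = len(errors)
    joint = [p * sequence_likelihood(c, errors) for p, c in zip(priors, curves)]
    total = sum(joint)
    return sum(j / total * c(t_next) for j, c in zip(joint, curves))


def step_curve(t):
    """Step function (Knowledge Tracing style): error rate drops at trial 2."""
    return 0.8 if t < 2 else 0.1


def smooth_curve(t):
    """Gradual improvement: error rate decays geometrically with practice."""
    return 0.6 * 0.7 ** t


# Hypothetical mixture with equal prior weight on each curve.
print(predict_next_error([0.5, 0.5], [step_curve, smooth_curve], [1, 1]))
```

With a single step-function component the prediction reduces to that curve's value at the next trial; adding smoother components is what lets the mixture describe students who improve gradually rather than all at once.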
Public version of a tool used inside Duolingo to develop content that is appropriate for different learner levels (beginner, intermediate, etc.). It is aligned to the CEFR framework and uses multilingual domain adaptation to transfer what it learns from English CEFR-labeled vocabulary to other languages.
Data for the 2018 Shared Task on Second Language Acquisition Modeling (SLAM). This corpus contains 7 million words produced by learners of English, Spanish, and French. It includes user demographics, morpho-syntactic metadata, response times, and longitudinal errors for 6k+ users over 30 days.
Data used to develop our half-life regression (HLR) spaced repetition algorithm. This is a collection of 13 million user-word pairs for learners of several languages with a variety of language backgrounds. It includes practice recall rates, lag times between practices, and other morpho-lexical metadata.
We are a diverse team of engineers and researchers who collaborate closely with UX designers, linguists, and others to build innovative systems based on world-class research. We are growing, so check out our job openings below!
Develop ML-driven technologies that are used by millions of people every day.
Conduct ML research using the world's largest collection of language-learning data.
Leverage our treasure trove of data to drive product changes and business results.
Lead a team and help develop next-generation AI technologies used by millions.