The non-profit OpenAI has trained a language model that can write coherent texts by predicting the next word from all the preceding words in a text. The researchers are not releasing the full model for fear of abuse.
OpenAI calls its model GPT-2. The language model has 1.5 billion parameters and was trained on a dataset of eight million web pages. According to the researchers, GPT-2 scores better on language modeling tasks than models trained on specific domains. To keep their dataset as large and diverse as possible, the researchers chose not to base it on news articles, Wikipedia entries or books alone.
Instead, they scraped all outbound links from Reddit that had received at least three karma. “This can be seen as a heuristic indicator that other users found the link interesting, educational or just plain funny,” they write in their white paper. They removed Wikipedia pages because these are often used in other datasets. The result was a 40GB text file they call WebText.
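The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpenAI's actual pipeline: the sample records, the `keep_link` helper and the choice to drop Wikipedia at the link level by hostname are all hypothetical.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold mentioned in the article


def keep_link(url: str, karma: int) -> bool:
    """Keep a link if it has enough karma and does not point to Wikipedia.

    Hypothetical helper: dropping Wikipedia by hostname is an assumption
    about how such a filter could work.
    """
    host = (urlparse(url).hostname or "").lower()
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    return karma >= MIN_KARMA and not is_wikipedia


# Made-up sample records in the form (url, karma).
sample_links = [
    ("https://example.com/article", 5),
    ("https://en.wikipedia.org/wiki/Language_model", 40),
    ("https://example.org/low-score", 1),
    ("https://example.net/essay", 3),
]

selected = [url for url, karma in sample_links if keep_link(url, karma)]
print(selected)
```

Only the two non-Wikipedia links with at least three karma survive the filter; the text of the pages behind such links would then be collected into the dataset.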
By training their language model on this dataset, they say they arrived at a model that can be used for many tasks across different domains. As examples they cite answering questions, summarizing and translating, with the advantage that the model learns these from raw text rather than from task-specific training data.
OpenAI demonstrates its language model by having it write a variety of texts, with the sole aim of predicting the next word based on a given text: the starting point is always a short text written by humans, which the model then continues. The model adopts the style and the content of that text. This does not always go well, the researchers admit, and the model performs poorly on technical subjects in particular, but in many other cases, sometimes after several attempts, the synthetic texts read like realistic articles. The model can be fine-tuned by training GPT-2 further on specific datasets. OpenAI gives the example of writing reviews after training on Amazon Reviews.
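The idea of continuing a human-written prompt one word at a time can be illustrated with a toy example. The sketch below is not GPT-2: a simple word-pair counter stands in for the neural network, and it looks only at the last word rather than at all previous words. The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def continue_text(followers: dict, prompt: str, max_new_words: int = 5) -> str:
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = followers.get(words[-1])
        if not candidates:
            break  # the last word never appeared in the training text
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Tiny made-up training text; the real model was trained on 40GB of text.
corpus = "the cat sat on the mat . the cat sat on the sofa . the cat slept ."
model = train_bigram(corpus)
print(continue_text(model, "the cat", max_new_words=3))  # the cat sat on the
```

Each generated word is appended to the text and becomes part of the context for the next prediction, which is the same loop a large model runs, only with a far richer notion of context.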
Jack Clark of the non-profit told the Guardian that the trained model is being withheld so that it can first become clear what it can and cannot do. “There are a lot more people than us who are better at imagining what evil it can do.” Instead, OpenAI is releasing a smaller model on GitHub for researchers to experiment with.
OpenAI is an organization that focuses on research into the responsible use of artificial intelligence and is supported by Microsoft, Nvidia, GitHub and Elon Musk, among others.