Our project is a computer program that uses Markov chains to build a statistical model of the style and content of an input text, and then generates new text in that style.
The purpose of this project is to explore what characterizes the style of a particular author, and to try to use that stylistic information to generate text. This kind of information is useful both for generating text and for identifying it. Identifying the style of a text has a variety of applications beyond text generation, including forensics (for example, matching the writing of a criminal to text posted on social media) and history (for example, dating a document, or determining whether a newly discovered document is a forgery).
This project was inspired by several existing projects at the intersection of computer science and art. One is Experiments in Musical Intelligence, or EMI, a computer program that can imitate the style of classical composers. This program can actually fool some listeners into thinking that its compositions were written by classical composers like Bach (http://www.nytimes.com/1997/11/11/science/undiscovered-bach-no-a-computer-wrote-it.html). Another precedent for this project is Google's DeepDream, which uses deep learning techniques to imitate the style of a painter and apply it to a given picture. DeepDream is not designed to perfectly mimic a painter's style, but it does learn and reproduce noticeable aspects of that style (https://github.com/google/deepdream). Some final precedents for our project are the computer programs that write simple news articles from preset templates, created by companies like Automated Insights and Narrative Science, and neural networks that have learned how to write Shakespeare (https://www.wired.com/2015/10/this-news-writing-bot-is-now-free-for-everyone/, http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
The way the program works is that it uses several different Markov chains, trained on different aspects of the text, to model its style and content. We chose Markov chains, which are stochastic models that transition from state to state based on learned probabilities, because much of the current research in this area uses machine learning; we wanted to see how good a model we could build with a simpler, Markov-chain based approach, which sets our project apart from many current research projects. However, unlike spam bots or news-writing programs, which use Markov chains alongside premade templates, our project learns and generates templates from the input text itself. These templates can then be filled with words, either from the source text or from a different source (to apply the style to a different topic).
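To make the underlying technique concrete, here is a minimal sketch of a word-level Markov chain in Python. This is an illustration only, not the project's actual code, and the function names are ours:

    import random
    from collections import defaultdict

    def train_word_chain(words):
        # Record, for each word, every word observed to follow it.
        transitions = defaultdict(list)
        for current, nxt in zip(words, words[1:]):
            transitions[current].append(nxt)
        return transitions

    def generate(transitions, start, length=20):
        word, output = start, [start]
        for _ in range(length):
            followers = transitions.get(word)
            if not followers:  # no observed successor, so stop early
                break
            # Choosing uniformly from the list of observed successors is
            # equivalent to sampling weighted by learned frequency.
            word = random.choice(followers)
            output.append(word)
        return " ".join(output)

    corpus = "the cat sat on the mat and the cat slept on the mat".split()
    print(generate(train_word_chain(corpus), "the"))

Because a transition that occurs more often in the training text appears more times in the successor list, it is proportionally more likely to be chosen, which is exactly the learned-probability behavior described above.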
From a technical standpoint, our program separates the training and generation steps into three parts. The first part trains a Markov chain on transitions from one punctuation mark to the next, and generates a list of punctuation marks on which to base the generated text. For instance, it might learn a transition from a period to a comma 25% of the time, from a period to a period 25% of the time, and from a period to a dash 50% of the time. It learns these transitions from the input text, and generates text using a weighted random number generator. The second part trains a Markov chain on the number of words (measured by counting spaces) between punctuation marks. For instance, a period might transition to a run of 5 words 10% of the time, 2 words 5% of the time, and so on; these transitions are also learned from the input text. The third part learns transitions from one word to the next. This part of the program is the most similar to previous Markov-chain based programs.
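As a rough sketch of how the first part might work, the following Python snippet trains a punctuation-to-punctuation chain and draws the next mark with a weighted random choice. It is illustrative only; the punctuation set and function names are our assumptions, not the project's actual implementation:

    import random
    import re
    from collections import Counter, defaultdict

    def train_punctuation_chain(text):
        # Pull out the punctuation marks in order, then count how often
        # each mark is followed by each other mark.
        marks = re.findall(r"[.,;:!?-]", text)
        counts = defaultdict(Counter)
        for current, nxt in zip(marks, marks[1:]):
            counts[current][nxt] += 1
        return counts

    def next_mark(counts, current):
        # Weighted random draw: a transition observed twice as often
        # is twice as likely to be chosen.
        followers = counts[current]
        marks, weights = zip(*followers.items())
        return random.choices(marks, weights=weights)[0]

With the example distribution above (period to comma 25%, period to period 25%, period to dash 50%), next_mark(counts, ".") would return "," about a quarter of the time, "." about a quarter of the time, and "-" about half the time.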