Authorship Attribution with Python

Authorship attribution is the task of identifying the author of a given text. It should not be confused with author profiling, which infers attributes of the author such as age, gender or ethnicity. In this article, I will introduce you to a machine learning project on Authorship Attribution with Python.

What is Authorship Attribution?

Authorship Attribution is the process of identifying the probable author of a given document from a set of documents whose authorship is known. Establishing who wrote a text is becoming a significant issue as the volume of anonymous content grows with rapidly increasing internet use around the world.

Applications of Authorship Attribution include detecting plagiarism, inferring the author of inappropriate communications that have been sent anonymously or under a pseudonym, and resolving unclear or disputed historical issues of authorship.

In simple words, Authorship Attribution is the means of determining the author of a text when it is not known who wrote it. It is useful when two or more people claim to have written something, or when no one admits to having written it.

Machine Learning Project on Authorship Attribution with Python

In machine learning, Authorship Attribution is a kind of classification problem. It differs from ordinary text classification, however, because writing style matters as well as content, whereas text categorization relies on content alone.

In addition, classifiers and feature sets may behave differently on different data, and the feature set for authorship attribution is not as clear-cut as it is in text categorization. These differences make authorship attribution the harder task.

In the section below, I will take you through a Machine Learning project on Authorship Attribution with the Python programming language.

Authorship Attribution with Python

Now let’s start the task of authorship attribution with Python by importing the necessary Python libraries and the dataset. I’ll be using a few books from Project Gutenberg as my training and testing datasets:

import nltk
nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/

Data Preparation

The NLTK Gutenberg corpus contains 18 books. I will use austen-sense.txt as the test set and austen-emma.txt and shakespeare-caesar.txt as the training set:

emma = nltk.corpus.gutenberg.words('austen-emma.txt')
emma = ' '.join(emma)
caesar = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')
caesar = ' '.join(caesar)
sense = nltk.corpus.gutenberg.words('austen-sense.txt')
sense = ' '.join(sense)

I’ll limit the set of states to the 26 lowercase English letters plus the space character, 27 states in total:

state = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", " "]
# create a dictionary that maps each character to its state index
char2id_dict = {}
for index, char in enumerate(state):
    char2id_dict[char] = index

Now, I’m going to define a function that creates a transition matrix from a given text:
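
A minimal sketch of such a function, assuming a first-order character-level Markov chain: it counts how often each of the 27 states follows another and normalises every row into probabilities. The name `create_transition_matrix`, the add-one smoothing and the nested-list representation are assumptions made for illustration. The block repeats the state list so that it runs on its own.

```python
import string

# The 27 states: 26 lowercase letters plus the space character.
STATES = list(string.ascii_lowercase) + [" "]
CHAR2ID = {char: index for index, char in enumerate(STATES)}


def create_transition_matrix(text, smoothing=1.0):
    """Return a 27x27 matrix of character transition probabilities.

    Entry [i][j] estimates P(next state is j | current state is i).
    `smoothing` (an assumption, add-one by default) is added to every
    count so that unseen transitions keep a non-zero probability.
    """
    size = len(STATES)
    counts = [[smoothing] * size for _ in range(size)]

    # Keep only characters that belong to the state set; anything else
    # (digits, punctuation) is dropped before pairs are counted.
    ids = [CHAR2ID[c] for c in text.lower() if c in CHAR2ID]

    # Count each consecutive pair of states.
    for current_id, next_id in zip(ids, ids[1:]):
        counts[current_id][next_id] += 1

    # Normalise each row so that it sums to 1 (all-zero rows stay zero).
    matrix = []
    for row in counts:
        total = sum(row) or 1.0
        matrix.append([count / total for count in row])
    return matrix


toy_matrix = create_transition_matrix("the cat sat on the mat")
```

Smoothing matters later on: without it, a single transition that never occurs in the training text would have probability zero and its logarithm would be undefined.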

Now I’ll compare the log-likelihood of the book austen-sense.txt under the model trained on Jane Austen and under the model trained on Shakespeare:
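
One way to make this comparison, sketched under the assumption that each author is represented by a matrix of character transition probabilities: sum the logarithms of the transition probabilities along the test text, once per author, and pick the author with the larger total. The function name `log_likelihood`, the tiny two-state alphabet and the two hand-written matrices are hypothetical and only keep the example short. They are not the article's data.

```python
import math


def log_likelihood(text, matrix, char2id):
    """Sum log P(next char | current char) over consecutive pairs in text.

    `matrix[i][j]` is the probability of moving from state i to state j.
    Characters missing from `char2id` are skipped. A zero probability
    contributes -inf, which is why smoothed matrices are preferable.
    """
    ids = [char2id[c] for c in text.lower() if c in char2id]
    total = 0.0
    for current_id, next_id in zip(ids, ids[1:]):
        prob = matrix[current_id][next_id]
        total += math.log(prob) if prob > 0 else float("-inf")
    return total


# Hypothetical two-state alphabet and author models, for illustration only.
char2id = {"a": 0, "b": 1}
author_1 = [[0.9, 0.1],   # after "a": mostly "a" again
            [0.5, 0.5]]
author_2 = [[0.1, 0.9],   # after "a": mostly "b"
            [0.5, 0.5]]

sample = "aaaa"
score_1 = log_likelihood(sample, author_1, char2id)
score_2 = log_likelihood(sample, author_2, char2id)

# The author whose model gives the higher log-likelihood is predicted.
predicted = "author_1" if score_1 > score_2 else "author_2"
print(predicted)
```

Applied to the article's data, the same call would be made twice on `sense`, once with a matrix estimated from `emma` and once with a matrix estimated from `caesar`, using `char2id_dict` as the mapping.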


The log-likelihood is higher under the model trained on Jane Austen than under the one trained on Shakespeare, so we attribute the given text to Jane Austen. I hope you liked this article on Authorship Attribution with Python. Feel free to ask your valuable questions in the comments section below.
