Project : Spooky_Author_Identification

This is a competition to identify the author of a sentence from its text. This project follows a Kaggle competition held during Halloween 2017. The competition page is https://www.kaggle.com/c/spooky-author-identification

Data description

The competition dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer, so you may notice the odd non-sentence here and there. Your objective is to accurately identify the author of the sentences in the test set.

In [21]:
import re
import nltk

import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

nltk.download('stopwords')
nltk.download('punkt')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sunghwanki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sunghwanki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[21]:
True

1. Load and check data

1.1 Load data

In [22]:
# Load train and Test set
train = pd.read_csv('../3_spooky-author-identification/data/train.csv')
test = pd.read_csv('../3_spooky-author-identification/data/test.csv')

# Store author data at y_train
# y_train = train['author']

# Take a look at the first 5 rows in the data
train.head()
Out[22]:
id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
In [23]:
# Check the data set
print("Train data : ", train.shape)
print("Test  data : ", test.shape)
Train data :  (19579, 3)
Test  data :  (8392, 2)
In [24]:
# Check the train data set's columns
print("Train data columns Qty :", len(train.columns), "\n\n")
print("Train data columns :", train.columns)
Train data columns Qty : 3


Train data columns : Index(['id', 'text', 'author'], dtype='object')

According to the competition page, there are three distinct author initials:

  1. EAP - Edgar Allan Poe : American writer whose poetry and short stories revolve around tales of mystery, the grisly, and the grim. Arguably his most famous work is the poem "The Raven", and he is widely considered the pioneer of the genre of detective fiction.
  2. HPL - HP Lovecraft : Best known for his works of horror fiction; the stories he is most celebrated for revolve around the fictional mythology of the infamous creature "Cthulhu" - a chimeric hybrid with the head of an octopus, a humanoid body, and wings on its back.
  3. MWS - Mary Shelley : Was involved in a whole panoply of literary pursuits - novelist, dramatist, travel writer, biographer. She is most celebrated for the classic tale Frankenstein; or, The Modern Prometheus, in which the scientist Victor Frankenstein creates the Monster that has come to be associated with his name.

Visualize some basic statistics in the data, like the distribution of entries for each author.

For this purpose, I will use the seaborn visualisation library to plot some simple bar plots.

In [25]:
z = {'EAP': 'Edgar Allan Poe', 'MWS': 'Mary Shelley', 'HPL': 'HP Lovecraft'}

# Map author codes to full names, then count, so labels and counts stay aligned
author_counts = train.author.map(z).value_counts()

plt.figure(figsize=(15,6))
g = sns.barplot(x = author_counts.index,
                y = author_counts.values)
g.set_title('Text distribution by Author')
plt.show()
In [26]:
all_words = train['text'].str.split(expand=True).unstack().value_counts()
In [27]:
plt.figure(figsize=(15,6))
# Skip the two most frequent tokens and plot the next 48 words
g = sns.barplot(x = all_words.index.values[2:50],
                y = all_words.values[2:50])
g.set_title('Word frequencies in the train data')
plt.show()

1.2 Word cloud

  • A visualization method that can be used when you have word-frequency data
  • The importance of each word is shown with font size or color
  • This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence

Create three DataFrames that hold the texts of Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL), respectively.

In [28]:
# Classify by author
EAP = train[train['author']=='EAP']
MWS = train[train['author']=='MWS']
HPL = train[train['author']=='HPL']
In [29]:
plt.figure(figsize=(20,15))
plt.subplot(311)
wc = WordCloud(background_color="white", max_words=10000,
               stopwords=STOPWORDS, max_font_size= 40)

wc.generate(" ".join(EAP.text))
plt.title("EAP - Edgar Allen Poe", fontsize=25)
plt.imshow(wc.recolor( colormap= 'viridis' , random_state=17), alpha=0.9)
plt.axis('off')

plt.subplot(312)
wc.generate(" ".join(MWS.text))
plt.title("MWS - Mary Shelley", fontsize=25)
plt.imshow(wc.recolor( colormap= 'viridis' , random_state=17), alpha=0.9)
plt.axis('off')

plt.subplot(313)
wc.generate(" ".join(HPL.text))
plt.title("HPL - HP Lovecraft", fontsize=25)
plt.imshow(wc.recolor(colormap = 'viridis', random_state=17), alpha=0.9)
plt.axis('off')
Out[29]:
(-0.5, 399.5, 199.5, -0.5)
In [30]:
train['sentences'] = train.text.transform(lambda x: len(sent_tokenize(x)))
train['words'] = train.text.transform(lambda x: len(word_tokenize(x)))
train['text_length'] = train.text.transform(lambda x: len(x))

test['sentences'] = test.text.transform(lambda x: len(sent_tokenize(x)))
test['words'] = test.text.transform(lambda x: len(word_tokenize(x)))
test['text_length'] = test.text.transform(lambda x: len(x))

text_analysis = train.groupby("author")[['sentences','words','text_length']].sum()
text_analysis
Out[30]:
        sentences   words  text_length
author
EAP          8206  232184      1123585
HPL          5876  173979       878178
MWS          6128  188824       916632
In [31]:
text_analysis.plot.bar(subplots = True, layout=(1,3), figsize=(12,6))
plt.show()

2. Natural Language Processing

2.1 Stopword Removal

Words that appear very frequently in a corpus contribute little to learning or prediction, because they do not help a model distinguish one text from another. For example, words such as "i", "me", "my", "it", "this", "that", "is", and "are" occur constantly but say little about the actual meaning of a sentence. Such stopwords, including terms like "to" or "the", should be removed at the preprocessing stage. NLTK provides a predefined list of 179 English stopwords.

In [40]:
len(stopwords.words('english')), stopwords.words('english')[:5]
Out[40]:
(179, ['i', 'me', 'my', 'myself', 'we'])
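
As a quick illustration (a minimal sketch; the sentence below is made up for demonstration), stopwords can be filtered out of a cleaned, tokenized sentence:

# Illustrative only: drop NLTK stopwords from a sample sentence
stops = set(stopwords.words('english'))
sample = "The night was dark and the old house stood silent on the hill."
words = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print([w for w in words if w not in stops])

Common words such as "the", "was", "and", and "on" are dropped, while content words like "night", "dark", and "house" remain.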

2.2 Stemming

In morphology and information retrieval, stemming is the process of stripping affixes from inflected or derived words to isolate their stem. The stem is not necessarily identical to the morphological root; the purpose of stemming is simply that related words are mapped to the same stem, even when that stem differs from the true root.
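
A small sketch (illustrative only, using the same SnowballStemmer that is instantiated below) shows how related word forms collapse onto a common stem, and how that stem need not be a dictionary word:

# Illustrative only: related forms map to the same stem
demo_stemmer = SnowballStemmer('english')
for w in ['connect', 'connected', 'connecting', 'connections', 'happiness']:
    print(w, '->', demo_stemmer.stem(w))

The first four forms all reduce to "connect", while "happiness" is cut down to the non-word stem "happi".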

In [32]:
stopwords.words
Out[32]:
<bound method WordListCorpusReader.words of <WordListCorpusReader in '/Users/sunghwanki/nltk_data/corpora/stopwords'>>
In [33]:
stemmer = SnowballStemmer('english')

def text_to_words(text):

    # Replace non-alphabetic characters with spaces
    letters_only = re.sub('[^a-zA-Z]', ' ', text)

    # Convert to lowercase and split into individual words
    words = letters_only.lower().split()

    # Build a set of stopwords for fast membership tests
    stops = set(stopwords.words('english'))

    # Remove stopwords
    meaningful_words = [w for w in words if w not in stops]

    # Stem each remaining word
    stemming_words = [stemmer.stem(w) for w in meaningful_words]

    # Rejoin into a single space-delimited string
    return ' '.join(stemming_words)
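
As a quick sanity check (an illustrative snippet, not part of the original pipeline), the function can be applied to a single sentence from the train set to see the combined effect of cleaning, stopword removal, and stemming:

# Illustrative check: preprocess one sentence from the train set
sample = train['text'].iloc[0]
print(sample)
print(text_to_words(sample))

The second line should match the first entry of EAP_clean shown further below ('process howev afford mean ascertain dimens ...').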

2.3 Improved workflow

In order to preprocess a large amount of text data, it is necessary to improve the workflow. The process-based multiprocessing package lets the work run on several CPU cores in parallel, which reduces the overall processing time.

In [34]:
# Improved workflow processing speed
from multiprocessing import Pool

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    # Get workers parameter from keyword item
    workers = kwargs.pop('workers')

    # Define a process pool with the number of workers
    pool = Pool(processes = workers)

    # Split the data into `workers` chunks and apply the function
    # to each chunk in a separate process
    result = pool.map(_apply_df, [(d, func, kwargs)
            for d in np.array_split(df, workers)])
    pool.close()
    pool.join()

    # Combine the partial results
    return pd.concat(list(result))

2.4 Put all the preprocessing steps together

Now stopword removal and stemming are applied to the text data using the improved workflow above.

In [41]:
# Preprocess the text data in the train set
train_clean = apply_by_multiprocessing(train['text'], text_to_words, workers=4)

# Preprocess the text data by author
EAP_clean = apply_by_multiprocessing(EAP['text'], text_to_words, workers=4)
MWS_clean = apply_by_multiprocessing(MWS['text'], text_to_words, workers=4)
HPL_clean = apply_by_multiprocessing(HPL['text'], text_to_words, workers=4)
In [42]:
EAP_clean.head(3)
Out[42]:
0    process howev afford mean ascertain dimens dun...
2    left hand gold snuff box caper hill cut manner...
6    astronom perhap point took refug suggest non l...
Name: text, dtype: object
In [43]:
# Number of words
train['num_words'] = train_clean.apply(lambda x: len(str(x).split()))

# Number of unique words (duplicates removed)
train['num_uniq_words'] = train_clean.apply(lambda x: len(set(str(x).split())))
In [44]:
import seaborn as sns

fig, axes = plt.subplots(ncols=2)
fig.set_size_inches(18, 6)
print('Average of number of words per text :', train['num_words'].mean())
print('Median of number of words per text :', train['num_words'].median())
sns.distplot(train['num_words'], bins=100, ax=axes[0])
axes[0].axvline(train['num_words'].median(), linestyle='dashed')
axes[0].set_title('Number of words distribution')

print('Average of number of unique words per text:', train['num_uniq_words'].mean())
print('Median of number of unique words per text :', train['num_uniq_words'].median())
sns.distplot(train['num_uniq_words'], bins=100, color='g', ax=axes[1])
axes[1].axvline(train['num_uniq_words'].median(), linestyle='dashed')
axes[1].set_title('Number of unique words distribution')
plt.show()
Average of number of words per text : 13.076612697277696
Median of number of words per text : 11.0
Average of number of unique words per text: 12.723121712038408
Median of number of unique words per text : 11.0