Fake News Detection
Distinguishing between real news stories and deliberate hoaxes or sarcastic news has become increasingly important with the spread of such information over social media networks. Big tech and social media companies are particularly interested in the reliability of content being disseminated on their platforms. These platforms would ideally like to be able to detect and flag articles suspected of being so-called “fake news” automatically.
Here we will require the student to train and test a fake news detector using machine learning techniques. The dataset used for the task is a recently compiled and released open dataset described in this paper. This particular dataset contains headlines only: decisions about the legitimacy of the news articles must be based on the headline alone.
In the description below, I suggest some Python packages and tools that you can use to complete certain tasks. If you use another programming language for the project you will need to find appropriate replacements.
The first stage in the pipeline is to preprocess and clean the dataset.
Training and test splits
The very first thing you will need to do is split the data into training and test sets. Write a Python script to perform the split: 75% of the data for training and the remaining 25% for testing. Take appropriate measures to ensure that the test set is not biased in any way (for example, by preserving the real/fake class balance in both splits). Store the resulting training and test sets in files using any convenient data format you like. Collect and record statistics on the resulting training and test sets, including the total numbers of real and fake news headlines in each set.
If you plan to use a validation set (as opposed to cross validation) for model selection, this would be a good time to split off the validation set too.
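One way to implement a class-balanced split is with sklearn's train_test_split and its stratify argument. The sketch below uses a tiny made-up dataset as a stand-in for the real one; the variable names and the fixed random_state are illustrative choices, not requirements.

```python
# Sketch of a stratified 75/25 split; the headlines/labels lists here
# are placeholders for the real dataset after loading.
from sklearn.model_selection import train_test_split

headlines = ["real headline %d" % i for i in range(8)] + \
            ["fake headline %d" % i for i in range(8)]
labels = ["real"] * 8 + ["fake"] * 8

# stratify=labels keeps the real/fake ratio identical in both splits,
# which guards against a class-imbalanced (biased) test set.
X_train, X_test, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.25, stratify=labels, random_state=42)

print(len(X_train), len(X_test))                    # 12 4
print(y_test.count("real"), y_test.count("fake"))   # 2 2
```

Fixing random_state makes the split reproducible, which matters for the report; the split statistics printed above are exactly the kind of numbers to record.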
Feature extraction
The second part of preprocessing is to extract the features you will need for the remainder of the analysis. You may revisit this stage many times as you become more familiar with the dataset and the kinds of features that are useful for the classification task. A good starting point is a bag-of-words model, which transforms each headline into a fixed-length representation suitable for classification. The sklearn.feature_extraction.text package may be useful here.
The features you choose will affect the performance of the final classifier, and there are many possibilities (e.g. stop word removal, TF-IDF encoding, infrequent word removal, etc.). Choose something you think is reasonable to start with and later you can experiment with alternatives on the validation set.
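As a concrete starting point, a bag-of-words encoding with stop-word removal can be built with CountVectorizer from sklearn.feature_extraction.text. The two headlines below are invented examples, not from the dataset:

```python
# Minimal bag-of-words sketch; the tiny corpus stands in for the
# training headlines.
from sklearn.feature_extraction.text import CountVectorizer

train_headlines = [
    "president signs new trade deal",
    "aliens endorse president in shock poll",
]

# Fit the vocabulary on the TRAINING data only, then reuse the same
# fitted vectorizer to transform validation/test headlines.
vectorizer = CountVectorizer(stop_words="english")  # stop-word removal
X_train = vectorizer.fit_transform(train_headlines)

print(X_train.shape)  # (number of headlines, vocabulary size)
print(sorted(vectorizer.vocabulary_))
```

Swapping CountVectorizer for TfidfVectorizer gives the TF-IDF variant, and the min_df parameter implements infrequent-word removal, so these alternatives are easy to compare later on the validation set.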
Exploratory data analysis
Use the training portion of the dataset to perform some exploratory data analysis. The goal at this stage is to become acquainted with the data and gain insights into the kinds of features that may be useful for classification.
Consider carefully which subset of the data should be used for exploratory analysis.
Find the top-20 most frequently used words in real and fake headlines and use a bar plot to show their relative frequencies. What can you say about these words? What changes when stop words are removed?
Compare the distribution of headline lengths in real and fake headlines using appropriate plots
(e.g. a boxplot). Are fake headlines usually shorter or longer? Document all your findings and any other interesting observations that you can find.
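The counting part of this analysis can be done with collections.Counter before any plotting. The sketch below assumes the training headlines are available as two lists of strings (the names and the tiny example headlines are placeholders):

```python
# Word frequencies and headline lengths for the EDA; the two lists are
# toy placeholders for the real/fake headlines in the training split.
from collections import Counter

real_headlines = ["markets rise on trade news", "senate passes budget bill"]
fake_headlines = ["area man passes budget", "markets secretly run by cats"]

def top_words(headlines, n=20):
    """Return the n most frequent lowercased words across the headlines."""
    counts = Counter(word for h in headlines for word in h.lower().split())
    return counts.most_common(n)

print(top_words(real_headlines, 5))
print(top_words(fake_headlines, 5))

# Headline lengths in words, ready for a boxplot comparison
# (e.g. with matplotlib or seaborn).
real_lengths = [len(h.split()) for h in real_headlines]
fake_lengths = [len(h.split()) for h in fake_headlines]
print(real_lengths, fake_lengths)
```

The (word, count) pairs returned by top_words can be fed directly into a bar plot; repeating the computation after filtering out a stop-word list answers the second question above.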
Model training and validation
Train a supervised classification model on your features and calculate validation accuracy, either on a held-out validation set or using cross-validation. Record the final accuracy of the classifier. How many of the headlines are correctly classified by the model? How many are misclassified? Investigate the kinds of errors being made (e.g. using the sklearn.metrics package).
Document all findings. Save the model to disk (e.g. using the Python pickle module).
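Putting the vectorizer and classifier into a single sklearn Pipeline keeps cross-validation honest (the vocabulary is refit inside each fold) and makes pickling the whole model straightforward. The data below is a toy placeholder and the file name is an arbitrary choice:

```python
# Sketch: a MultinomialNB baseline scored with cross-validation, then
# pickled to disk. The tiny dataset stands in for the training split.
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

headlines = ["real news %d today" % i for i in range(10)] + \
            ["fake shock claim %d" % i for i in range(10)]
labels = [1] * 10 + [0] * 10  # 1 = real, 0 = fake

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, headlines, labels, cv=5)
print("mean CV accuracy:", scores.mean())

# Refit on all training data, then persist the fitted pipeline.
model.fit(headlines, labels)
with open("fake_news_model.pkl", "wb") as f:
    pickle.dump(model, f)
```

Because the pipeline bundles feature extraction with the classifier, the pickled object can later be loaded and applied directly to raw headline strings.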
Model selection
Select multiple candidate models that you want to compare. These could include different classifiers (e.g. naive Bayes (MultinomialNB), logistic regression, SVMs, etc.), different hyperparameters, or different sets of features. Use a validation set or cross-validation to compare the accuracy of the different models. Create plots to compare a subset of the models that you investigated during model selection. Retain the most effective model for evaluation.
It is important that you conduct a reasonably thorough investigation of the different alternatives in this section.
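One lightweight way to organize the comparison is a dictionary of named candidate pipelines scored with the same cross-validation procedure. The three candidates and the toy data below are illustrative, not an exhaustive search:

```python
# Sketch of comparing candidate models with cross-validation; the
# candidates and the toy dataset are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

headlines = ["genuine report %d" % i for i in range(10)] + \
            ["hoax story %d" % i for i in range(10)]
labels = [1] * 10 + [0] * 10

candidates = {
    "nb+counts": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logreg+tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "svm+tfidf": make_pipeline(TfidfVectorizer(), LinearSVC()),
}
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, headlines, labels, cv=5)
    print("%s: mean=%.3f std=%.3f" % (name, scores.mean(), scores.std()))
```

Collecting the per-fold scores (rather than just the means) also gives you the spread needed for an honest comparison plot, e.g. a boxplot of fold accuracies per model.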
Evaluation
Estimate the out-of-sample error of the model that you found to be most accurate during model selection by evaluating it on the held-out test set. Use the sklearn.metrics package (or similar) to benchmark the model in several ways. Create an ROC plot for the model and compute its AUC. Generate the confusion matrix for the model on the test set, and comment on the implications of the resulting confusion matrix for a real production classifier.
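The relevant sklearn.metrics calls are sketched below on a toy model and a two-headline "test set"; in the real evaluation the model would be the winner from model selection and the test set would be the one split off at the start.

```python
# Sketch of the final test-set evaluation: AUC, ROC points, and the
# confusion matrix. Data and model are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = ["real story %d" % i for i in range(8)] + \
        ["fake claim %d" % i for i in range(8)]
y_train = [1] * 8 + [0] * 8
test = ["real story update", "fake claim update"]
y_test = [1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train, y_train)

# Probability of the positive ("real") class, needed for ROC/AUC.
proba = model.predict_proba(test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))

# Rows = true class, columns = predicted class (label order 0, 1).
print(confusion_matrix(y_test, model.predict(test)))

# fpr/tpr pairs can be fed straight into a matplotlib ROC plot.
fpr, tpr, thresholds = roc_curve(y_test, proba)
```

Note that ROC/AUC requires scores or probabilities rather than hard labels, so a classifier without predict_proba (e.g. LinearSVC) would need its decision_function output instead.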
Report
The final submission should be a report documenting all assumptions, design decisions, and findings. Include visualizations, plots, and tables. You should strive to make your work completely reproducible using only the report document: include details on everything you tested and all results. Document and justify all design decisions. Include an appendix explaining how to run your code to reproduce the experiments.
Important: do NOT include code as images (e.g. screenshots of code) in your report. Include code snippets as text.
The page limit for the report is 25 pages (including all figures and references).
You can use Python. You are free to use any external libraries you like (e.g. scikit-learn, pandas, seaborn, etc.), as long as you document which ones you used in the report. You can use both Python scripts and IPython notebooks for implementation as you see fit.