Data

Wluper Dataset Downloads


In our paper, "Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks", we analysed which characteristics of a text classification dataset could make it difficult for a model to learn. To do so, we required a large number of text classification datasets of different difficulties - from easy to hard.
In total, we used 89 datasets in the work: 34 distinct, publicly available datasets; 51 created through different combinations of the original 34; and four fake, synthetically generated datasets for error analysis. All 89 datasets are freely available here to the research community, in the form in which we used them.

All datasets are stored as .csv files with two columns: the data item and its label. All datasets have, at minimum, a training and a testing set. Many datasets also contain a validation set. Clicking any of the buttons below will download a .zip file of the corresponding dataset. Each .zip file has the same name as the dataset and contains two directories: "training", containing the training set, and "eval" containing the testing and validation sets. All .csv files are named according to the same pattern: DATASET_NAME__[FULL/DEV/TEST].csv
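
The layout described above might be read in Python roughly as follows. This is a minimal sketch under several assumptions that should be checked against the actual files: that the FULL file is the training set and lives in "training", that the DEV and TEST files live in "eval", that the .csv files have no header row, and that the data item comes before its label. The helper names `split_path` and `load_split` are hypothetical and are not taken from our GitHub code.

```python
import csv
from pathlib import Path

# Assumed mapping from the split tag in the file name to its directory.
SPLIT_DIRS = {"FULL": "training", "DEV": "eval", "TEST": "eval"}


def split_path(root, dataset_name, split):
    """Build the expected path of one split inside an unzipped dataset."""
    return Path(root) / SPLIT_DIRS[split] / f"{dataset_name}__{split}.csv"


def load_split(root, dataset_name, split):
    """Return (items, labels); both lists are empty if the split is absent."""
    path = split_path(root, dataset_name, split)
    items, labels = [], []
    if not path.exists():  # e.g. a dataset without a validation set
        return items, labels
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:  # skip blank or malformed rows
                continue
            items.append(row[0])   # assumed: data item in the first column
            labels.append(row[1])  # assumed: label in the second column
    return items, labels
```

Returning empty lists for a missing split keeps the same call usable for datasets that ship without a validation set.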

In addition, our GitHub provides code for loading the data, along with a demonstration of how to use our difficulty-analysis code, which implements the findings from our paper.


Download All Datasets (2.4GB) Demo Code Difficulty-Calculating Code



Downloads (A-Z)




AG's News Topic Classification Dataset - version 3 (AG)

AG's News Topic Classification Dataset was constructed and used as a benchmark by [1]. It is drawn from a collection of more than 1 million news articles, gathered from more than 2000 news sources for research purposes. It has four classes: "Business", "Sci/Tech", "Sports" and "World". Each class contains 30000 training samples and 1900 testing samples. In total, the training set has 108000 sentences, the validation set has 12000 sentences and the test set has 7600 sentences.

References:
https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

[1] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



Amazon Reviews Binary Classification

A dataset of Amazon product reviews compiled by [1], posed as a binary classification task: each review is classified as either positive or negative.

References:
[1] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



Amazon Reviews Star Rating Classification

A dataset of Amazon product reviews compiled by [1]. The task is to classify each review by how many stars it was given.

References:
[1] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



ATIS Intent Classification Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- ATIS [1]
- Airline Twitter Sentiment [2]

References:
[1] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip
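
The combined datasets on this page all follow the same scheme: items from several source datasets are pooled, and each item is labelled with the dataset it came from. A minimal sketch of that idea is given here. The function name `combine_by_origin`, the example sentences, the label strings and the shuffling step are all illustrative assumptions, not a description of how the released files were built.

```python
import random


def combine_by_origin(sources, seed=0):
    """Merge datasets, labelling each item with the name of its source.

    `sources` maps a dataset name to a list of text items. The original
    task labels are dropped because the new label is the origin.
    """
    combined = [(text, name) for name, texts in sources.items() for text in texts]
    random.Random(seed).shuffle(combined)  # shuffling is an assumption
    return combined


# Hypothetical example items, for illustration only.
example = combine_by_origin({
    "ATIS": ["show me flights from boston to denver"],
    "AIRLINE_TWITTER": ["my flight was delayed again", "great crew today"],
})
```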



Airline Twitter Sentiment and Classic Literature

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Airline Twitter Sentiment [1]
- Classic Literature

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Airline Twitter Sentiment + Corporate Messaging + Disaster Tweets + New Year's Resolutions + Self Driving Car Twitter Sentiment + Text Emotion Classification

A dataset composed of six others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Airline Twitter Sentiment [1]
- Corporate Messaging [1]
- Disaster Tweets [1]
- New Year's Resolutions [1]
- Self Driving Car Twitter Sentiment [1]
- Text Emotion Classification [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Airline Twitter Sentiment

The Airline Twitter Sentiment dataset is crowd-sourced by [1]. It has 3 classes describing the sentiment of people's tweets about airlines. The training set has 12619 sentences and the test set has 2226 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



ATIS Intent Classification Combined with Classic Literature

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- ATIS [1]
- Classic Literature

References:
[1] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



ATIS Intent Classification Combined With Corporate Messaging

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- ATIS [1]
- Corporate Messaging [2]

References:
[1] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Classic Literature

This dataset consists of The Complete Works of William Shakespeare, War and Peace by Leo Tolstoy, Wuthering Heights by Emily Brontë and The War of the Worlds by H.G. Wells. Each data item is a sentence from one of these books. All sentences longer than 100 words are discarded. The label of each sentence is the author who wrote it. All books were downloaded from the Project Gutenberg [1] website. There are 40489 training sentences, 5784 validation sentences and 11569 testing sentences.

References:
[1] http://www.gutenberg.org/

Download Data .zip
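
The construction of the Classic Literature dataset (one sentence per item, author as label, sentences over 100 words dropped) can be sketched as follows. The sketch assumes the books have already been split into sentences and counts words by splitting on whitespace. Both are assumptions, as is the hypothetical function name `label_sentences`.

```python
MAX_WORDS = 100  # sentences longer than this are discarded


def label_sentences(sentences_by_author, max_words=MAX_WORDS):
    """Return (sentence, author) pairs, dropping over-long sentences."""
    pairs = []
    for author, sentences in sentences_by_author.items():
        for sentence in sentences:
            # Whitespace tokenisation is an assumption about "words".
            if len(sentence.split()) <= max_words:
                pairs.append((sentence, author))
    return pairs
```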



Classic Literature Combined With Corporate Messaging

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Classic Literature
- Corporate Messaging [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Classic Literature and Deflategate Tweets

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Classic Literature
- Deflategate Tweets [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Classic Literature and the Large Movie Review Corpus

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Classic Literature
- Large Movie Review Corpus [1]

References:
[1] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.

Download Data .zip



Classic Literature and Review Sentiments

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Classic Literature
- Review Sentiments [1]

References:
[1] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.

Download Data .zip



Classic Literature and Stanford Sentiment Treebank 3-Class

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Classic Literature
- Stanford Sentiment Treebank 3-Class Data [1]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).

Download Data .zip



Classic Literature and Text Emotion Classification

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Classic Literature
- Text Emotion Classification [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Corporate Messaging Dataset

The Corporate Messaging dataset is crowd-sourced by [1]. It has 4 classes describing what corporations talk about on social media: "Information", "Action", "Dialogue" and "Exclude". The training set has 2650 sentences and the test set has 468 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Corporate Messaging Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Corporate Messaging [1]
- Airline Twitter Sentiment [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Corporate Messaging Combined With Deflategate Tweets

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Corporate Messaging [1]
- Deflategate Tweets [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



DBPedia Ontology Classification Dataset - version 2 (DB PEDIA) - Subsampled to 10% Size

The DBpedia datasets are licensed under the terms of the GNU Free Documentation License [1]. The DBPedia ontology classification dataset was constructed and used as a benchmark in [2]. It has 14 classes; the training set contains 560000 items and the testing set 70000. We split off 10% of the training set as a validation set with size 5600.

References:
[1] http://wiki.dbpedia.org - Dataset 2.0, 2015
[2] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



DBPedia Ontology Classification Dataset - version 2 (DB PEDIA)

The DBpedia datasets are licensed under the terms of the GNU Free Documentation License [1]. The DBPedia ontology classification dataset was constructed and used as a benchmark in [2]. It has 14 classes; the training set contains 560000 items and the testing set 70000. We split off 10% of the training set as a validation set with size 5600. Because the dataset is large and we had to train a large number of models, we sped up training by randomly sampling 10% of the dataset, preserving the class distribution, to form our training, validation and test sets.

References:
[1] http://wiki.dbpedia.org - Dataset 2.0, 2015
[2] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip
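
Sampling 10% of a dataset "based on the class distribution", as described for DBPedia, amounts to a stratified sample: the same fraction is drawn from every class. A minimal sketch follows. The function name `stratified_subsample`, the rounding rule and the "keep at least one item per class" rule are assumptions and may differ from the procedure we actually used.

```python
import random
from collections import defaultdict


def stratified_subsample(items, labels, fraction=0.1, seed=0):
    """Sample `fraction` of the items from each class independently."""
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_class):
        members = by_class[label]
        # Assumed rounding rule; keeps at least one item per class.
        keep = max(1, round(len(members) * fraction))
        sampled.extend((item, label) for item in rng.sample(members, keep))
    rng.shuffle(sampled)
    return sampled
```

Sampling within each class, rather than from the pooled data, keeps the class proportions of the subsample close to those of the full dataset.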



New England Patriots Deflategate Sentiment (DFG)

The New England Patriots Deflategate sentiment dataset is crowd-sourced by [1]. It is gathered from Twitter chatter around deflated footballs. It has five sentiment classes: negative, slightly negative, neutral, slightly positive and positive. The training set has 8250 sentences, the validation set has 1178 sentences and the test set has 2358 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Disaster Tweets Topics Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Disaster Tweets Topics [1]
- Airline Twitter Sentiment [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Disaster Tweets Topics Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Disaster Tweets Topics [1]
- ATIS [2]

References:
[1] https://www.figure-eight.com/data-for-everyone/
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Economic News Article Tone and Relevance (ENR)

The Economic News Article Tone and Relevance dataset is crowd-sourced by [1]. It records whether each article is about the US economy and, if so, the tone of the article on a scale of 1-9. In this project, we posed it as a binary classification task by taking only two classes: Yes or No. The training set has 5593 sentences, the validation set has 799 sentences and the test set has 1599 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Fake Dataset 25 Classes

A synthetically generated dataset with 25 classes. Each class consists of 1000 identical copies of a randomly generated, 25 letter string.

Download Data .zip



Fake Dataset 50 Classes

A synthetically generated dataset with 50 classes. Each class consists of 1000 identical copies of a randomly generated, 25 letter string.

Download Data .zip



Fake Dataset 100 Classes

A synthetically generated dataset with 100 classes. Each class consists of 1000 identical copies of a randomly generated, 25 letter string.

Download Data .zip



Fake Dataset 1000 Classes

A synthetically generated dataset with 1000 classes. Each class consists of 1000 identical copies of a randomly generated, 25 letter string.

Download Data .zip
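
The four fake datasets share one recipe: every class is a single random 25-letter string repeated 1000 times. A minimal sketch of such a generator is given here. The alphabet (lowercase ASCII), the integer class labels and the function name `make_fake_dataset` are assumptions, and the sketch does not reproduce the exact strings in the released files.

```python
import random
import string


def make_fake_dataset(num_classes, copies=1000, length=25, seed=0):
    """Return (text, label) rows: each class is one random string repeated."""
    rng = random.Random(seed)
    rows = []
    for label in range(num_classes):
        # One random string per class; lowercase letters are an assumption.
        text = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        rows.extend((text, label) for _ in range(copies))
    return rows
```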



Grammar and Online Product Reviews (GPR)

The Grammar and Online Product Reviews dataset comes from [1]. It is a list of product reviews with 5 classes (ratings from 1 to 5). The training set has 49730 sentences, the validation set has 7105 sentences and the test set has 14209 sentences.

References:
[1] https://www.kaggle.com/datafiniti/grammar-and-online-product-reviews

Download Data .zip



Hate Speech (HS)

The hate speech data comes from [1] and has three classes: "offensiveLanguage", "hateSpeech" and "neither". The training set has 17348 sentences, the validation set has 2478 sentences and the test set has 4957 sentences.

References:
[1] Davidson, T., Warmsley, D., Macy, M. and Weber, I., 2017. Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

Download Data .zip



Large Movie Review Corpus (LMRC)

The large movie review corpus was compiled by [1] and contains 50000 reviews from IMDB, with an even number of positive and negative reviews. No more than 30 reviews are included for any one movie, to avoid correlated ratings. It contains two classes: a negative review has a score <= 4 out of 10 and a positive review has a score >= 7 out of 10; no neutral class is included.

References:
[1] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.

Download Data .zip
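
The score thresholds used by the Large Movie Review Corpus can be written as a small labelling rule. This is only an illustration of the thresholds stated in the description. The function name `review_label` and the label strings are assumptions.

```python
def review_label(score):
    """Map a rating out of 10 to a sentiment label, or None if excluded."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # scores of 5 and 6 are neutral and not part of the corpus
```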



Large Movie Review Corpus Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Large Movie Review Corpus [1]
- Airline Twitter Sentiment [2]

References:
[1] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Large Movie Review Corpus Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Large Movie Review Corpus [1]
- ATIS [2]

References:
[1] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Large Movie Review Corpus + Review Sentiment + Stanford Sentiment Treebank

A dataset composed of three others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Large Movie Review Corpus [1]
- Review Sentiment [2]
- Stanford Sentiment Treebank 3-class [3]

References:
[1] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.
[2] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.
[3] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).

Download Data .zip



Large Movie Review Corpus + Review Sentiments + YouTube Spam Classification

A dataset composed of three others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Large Movie Review Corpus [1]
- Review Sentiments [2]
- YouTube Spam Classification [3]

References:
[1] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.
[2] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.
[3] Alberto, T.C., Lochter, J.V. and Almeida, T.A., 2015, December. Tubespam: Comment spam filtering on youtube. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on (pp. 138-143). IEEE.

Download Data .zip



London-based Restaurants' Reviews on TripAdvisor (LRR)

The London-based restaurants' reviews on TripAdvisor dataset [1] is a subset of a larger dataset (more than 1.8 million restaurants) that was created by extracting data from Tripadvisor.co.uk. It has five classes (ratings from 1 to 5). The training set has 12056 sentences, the validation set has 1722 sentences and the test set has 3445 sentences.

References:
[1] https://www.kaggle.com/PromptCloudHQ/londonbased-restaurants-reviews-on-tripadvisor

Download Data .zip



New Year's Resolution Dataset Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- New Years Resolutions [1]
- Airline Twitter Sentiment [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



New Year's Resolution Dataset Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- New Years Resolutions [1]
- ATIS [2]

References:
[1] https://www.figure-eight.com/data-for-everyone/
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



New Years Resolutions (NYR)

The 2015 New Year's resolutions dataset is crowd-sourced by [1]. It contains demographic and geographical data of users and resolution categorizations. This dataset has 10 classes. The training set has 3507 sentences, the validation set has 501 sentences and the test set has 1003 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



New Year's Resolution (NYR)

The 2015 New Year's resolutions dataset is crowd-sourced by [1]. It contains demographic and geographical data of users and resolution categorizations. This dataset has 115 classes. The training set has 3507 sentences, the validation set has 501 sentences and the test set has 1003 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Paper Sentence Classification Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Paper Sentence Classification [1]
- Airline Twitter Sentiment [2]

References:
[1] https://archive.ics.uci.edu/ml/datasets/Sentence+Classification
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Paper Sentence Classification Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Paper Sentence Classification [1]
- ATIS [2]

References:
[1] https://archive.ics.uci.edu/ml/datasets/Sentence+Classification
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Paper Sentence Classification Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Paper Sentence Classification [1]
- ATIS [2]

References:
[1] https://archive.ics.uci.edu/ml/datasets/Sentence+Classification
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Paper Sentence Classification (PSC)

The paper sentence classification dataset comes from [1]. It contains sentences from the abstracts and introductions of 30 articles spanning biology, machine learning and psychology. There are 5 classes in total. The training set has 2181 sentences, the validation set has 311 sentences and the test set has 625 sentences.

References:
[1] https://archive.ics.uci.edu/ml/datasets/Sentence+Classification

Download Data .zip



Classification of Political Social Media (PSM)

The classification of political social media dataset is crowd-sourced by [1]. Social media messages from US Senators and other American politicians are classified into 9 classes ranging from "attack" to "support". The training set has 3500 sentences, the validation set has 500 sentences and the test set has 1000 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



QC - Experimental Data for Question Classification

The Question Classification dataset comes from [1]. It classifies questions into six classes: "NUM", "LOC", "HUM", "DESC", "ENTY" and "ABBR". The training set has 4096 sentences, the validation set has 546 sentences and the test set has 500 sentences.

References:
[1] Li, X. and Roth, D., 2002, August. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1 (pp. 1-7). Association for Comput

Download Data .zip



RS - Review Sentiments

The review sentiments dataset was generated by [1] using an approach that goes from group-level labels to instance-level labels, which is evaluated on three large review datasets: IMDB, Yelp, and Amazon. The dataset contains classes, the training set has 2100 sentences, the validation set has 300 sentences and the test set has 600 sentences.

References:
[1] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.

Download Data .zip



Review Sentiments Combined With Airline Twitter Sentiment 1

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Review Sentiments [1]
- Airline Twitter Sentiment [2]

References:
[1] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Review Sentiments Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Review Sentiments [1]
- ATIS [2]

References:
[1] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Review Sentiments Combined With Large Movie Review Corpus

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Review Sentiments [1]
- Large Movie Review Corpus [2]

References:
[1] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.

[2] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.

Download Data .zip



Review Sentiments Combined With Stanford Sentiment Treebank 3-Class

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Review Sentiments [1]
- Stanford Sentiment Treebank 3-class [2]

References:
[1] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.
[2] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).

Download Data .zip



Self Driving Car Sentiments Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Self Driving Car Sentiment [1]
- Airline Twitter Sentiment [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Self Driving Car Sentiments Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Self Driving Car Sentiment [1]
- ATIS [2]

References:
[1] https://www.figure-eight.com/data-for-everyone/
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Self Driving Cars Twitter Sentiment (SDC)

The Self-driving cars Twitter sentiment analysis dataset is crowd-sourced by [1]. It has 6 classes describing sentiment towards self-driving cars: very positive, slightly positive, neutral, slightly negative, very negative and not relevant. The training set has 6082 sentences and the test set has 1074 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



SMS Spam Collection (SMSS)

The SMS Spam Collection is a collection of labeled messages for mobile phone spam research [1]. It contains two classes: spam and ham. The training set has 3901 sentences, the validation set has 558 sentences and the test set has 1115 sentences.

References:
[1] http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

Download Data .zip



SNIPS Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- SNIPS Natural Language Understanding Benchmark [1]
- Airline Twitter Sentiment [2]

References:
[1] https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-and-snips-2b8ddcf9fb19
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



SNIPS + ATIS + Classic Literature

A dataset composed of three others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- SNIPS Natural Language Understanding Benchmark [1]
- ATIS [2]
- Classic Literature

References:
[1] https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-and-snips-2b8ddcf9fb19
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



SNIPS + ATIS + Paper Sentence Classification

A dataset composed of three others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- SNIPS Natural Language Understanding Benchmark [1]
- ATIS [2]
- Paper Sentence Classification [3]

References:
[1] https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
[3] https://archive.ics.uci.edu/ml/datasets/Sentence+Classification

Download Data .zip



SNIPS + ATIS + Stanford Sentiment Treebank 3-Class

A dataset composed of three others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- SNIPS Natural Language Understanding Benchmark [1]
- ATIS [2]
- Stanford Sentiment Treebank 3-Class Data [3]

References:
[1] https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
[3] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).

Download Data .zip



SNIPS Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- SNIPS Natural Language Understanding Benchmark [1]
- ATIS [2]

References:
[1] https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-and-snips-2b8ddcf9fb19
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



SNIPS Natural Language Understanding Benchmark (SNIPS)

The SNIPS natural language dataset was open-sourced by SNIPS [1]. It has 7 intents: "AddToPlaylist", "BookRestaurant", "GetWeather", "PlayMusic", "RateBook", "SearchCreativeWork", "SearchScreeningEvent". The training set contains 13784 sentences and the test set contains 700 sentences.

References:
[1] https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-and-snips-2b8ddcf9fb19

Download Data .zip



Sougou

A Chinese dataset of news articles converted to Roman alphabet characters using Pinyin. This incarnation of the dataset was assembled by [1]. It contains 450 000 training samples and 60 000 test samples across five classes. Note: When loading this dataset with Python, you _must_ run the following lines of code before attempting to load the data:

```python
import sys
import csv

# Raise the csv module's per-field size limit (the default is 131072 characters).
csv.field_size_limit(sys.maxsize)
```

Otherwise the code will fail because each data item has so many characters.
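
On platforms where a C long is 32 bits wide (for example Windows), passing `sys.maxsize` to `csv.field_size_limit` raises an `OverflowError`. The following is a minimal sketch of a more portable variant. The helper names `raise_csv_field_limit` and `load_rows` are hypothetical, and the sketch assumes UTF-8 encoded files in which each row holds the data item followed by its label.

```python
import csv
import sys


def raise_csv_field_limit():
    """Set the csv field size limit as high as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The limit must fit in a C long; halve it and retry.
            limit //= 2


def load_rows(path):
    """Read every row of a .csv file as a tuple of strings."""
    raise_csv_field_limit()
    # UTF-8 encoding is an assumption about the downloaded files.
    with open(path, newline="", encoding="utf-8") as handle:
        return [tuple(row) for row in csv.reader(handle)]
```

Calling `load_rows` on one of the extracted .csv files would then return a list of row tuples without hitting the default field size limit.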

References:
[1] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



SST Binary Classification Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank Binary Classification [1]
- Airline Twitter Sentiment [2]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



SST Binary Classification Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank Binary Classification [1]
- ATIS [2]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Stanford Sentiment Treebank Binary Classification

The Stanford Sentiment Treebank was introduced by [1]. It is the first corpus with fully labeled parse trees, which can be used to capture linguistic features and to predict compositional semantic effects. It contains 5 sentiment classes: very negative, negative, neutral, positive and very positive; however, we have filtered this to just two classes: positive and negative. The training data is split into phrases rather than sentences, following the approach of [2]. The training set has 117220 sentences, the validation set has 872 sentences and the test set has 1821 sentences.

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] https://gist.github.com/wpm/52758adbf506fd84cff3cdc7fc109aad

Download Data .zip



SST 3-Class Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank 3-Class Classification [1]
- Airline Twitter Sentiment [2]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] https://www.figure-eight.com/data-for-everyone

Download Data .zip



SST 3-Class Classification Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank 3-Class Classification [1]
- ATIS [2]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



SST 3-Class Classification + Classic Literature + Large Movie Review Corpus

A dataset composed of three others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank 3-Class Classification [1]
- Classic Literature
- Large Movie Review Corpus [2]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.

Download Data .zip



SST 3-Class Classification Combined With Large Movie Review Corpus

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank 3-Class Classification [1]
- Large Movie Review Corpus [2]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.

Download Data .zip



SST 3-Class Classification Combined With SST Binary Classification

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Stanford Sentiment Treebank 3-Class Classification [1]
- Stanford Sentiment Treebank Binary Classification [1]

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).

Download Data .zip



Stanford Sentiment Treebank 3-Class

The Stanford Sentiment Treebank was introduced by [1]. It is the first corpus with fully labeled parse trees, which can be used to capture linguistic features and to predict compositional semantic effects. It contains 5 sentiment classes: very negative, negative, neutral, positive and very positive; however, we have filtered this to just three classes: positive, negative and neutral. The training data is split into phrases rather than sentences, following the approach of [2]. The training set has 236076 sentences, the validation set has 1100 sentences and the test set has 2210 sentences.

References:
[1] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642).
[2] https://gist.github.com/wpm/52758adbf506fd84cff3cdc7fc109aad

Download Data .zip



Text Emotion Classification Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Text Emotion Classification [1]
- Airline Twitter Sentiment [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Text Emotion Classification Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Text Emotion Classification [1]
- ATIS [2]

References:
[1] https://www.figure-eight.com/data-for-everyone/
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



Text Emotion Classification Combined With New Year's Resolutions Tweets

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- Text Emotion Classification [1]
- New Year's Resolution Tweets [1]

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Text Emotion Classification (TE)

The Text Emotion Classification dataset was crowdsourced by [1]. It contains 13 classes of emotional content, such as happiness or sadness. The training set has 34000 sentences and the test set has 6000 sentences.

References:
[1] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



Yahoo! Answers Dataset

A dataset of questions and answers from Yahoo gathered by [1]. Each data item consists of the question title, question content and best answer, and each label is one of 10 possible categories for the answer. It contains 1.4 million items of training data and 60 000 items of test data.

References:
[1] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



Yelp Review Full Star Dataset (YELP)

The Yelp reviews dataset consists of reviews extracted from the Yelp Dataset Challenge 2015 data [1]. It was constructed and used as a benchmark in [2]. In total, there are 650,000 training samples and 50,000 testing samples with 5 classes; we split off 10% of the training set as a validation set. This dataset is a smaller version of the full dataset, subsampled to just 5% of its usual size.
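
The 10% validation split mentioned above could be reproduced along the following lines. This is only a sketch: the function name `split_off_validation`, the random shuffle and the fixed seed are assumptions, since the source does not say whether the split was random, stratified or taken from the end of the file.

```python
import random


def split_off_validation(rows, fraction=0.1, seed=0):
    """Return (training_rows, validation_rows) with `fraction` held out."""
    shuffled = list(rows)
    # A seeded shuffle (an assumption) keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

Applied to the rows of a training .csv file, this returns two disjoint lists that together contain every original row.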

References:
[1] http://www.yelp.com/dataset_challenge
[2] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



Yelp Review Binary Classification (YELP)

The Yelp reviews dataset consists of reviews extracted from the Yelp Dataset Challenge 2015 data [1]. It was constructed and used as a benchmark in [2]. Here, reviews are classified as either positive or negative. In total, there are 560,000 training samples and 38,000 testing samples with 2 classes; we split off 10% of the training set as a validation set.

References:
[1] http://www.yelp.com/dataset_challenge
[2] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



Yelp Review Full Star Dataset (YELP)

The Yelp reviews dataset consists of reviews extracted from the Yelp Dataset Challenge 2015 data [1]. It was constructed and used as a benchmark in [2]. In total, there are 650,000 training samples and 50,000 testing samples with 5 classes; we split off 10% of the training set as a validation set.

References:
[1] http://www.yelp.com/dataset_challenge
[2] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).

Download Data .zip



YouTube Spam Classification (YTS)

The YouTube Spam Classification dataset comes from [1], a public set of comments collected for spam research. The dataset contains 2 classes. The training set has 1363 sentences, the validation set has 194 sentences and the test set has 391 sentences.

References:
[1] Alberto, T.C., Lochter, J.V. and Almeida, T.A., 2015, December. Tubespam: Comment spam filtering on youtube. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on (pp. 138-143). IEEE.

Download Data .zip



YouTube Spam Classification Combined With Airline Twitter Sentiment

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- YouTube Spam Classification [1]
- Airline Twitter Sentiment [2]

References:
[1] Alberto, T.C., Lochter, J.V. and Almeida, T.A., 2015, December. Tubespam: Comment spam filtering on youtube. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on (pp. 138-143). IEEE.
[2] https://www.figure-eight.com/data-for-everyone/

Download Data .zip



YouTube Spam Classification Combined With ATIS

A dataset composed of two others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- YouTube Spam Classification [1]
- ATIS [2]

References:
[1] Alberto, T.C., Lochter, J.V. and Almeida, T.A., 2015, December. Tubespam: Comment spam filtering on youtube. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on (pp. 138-143). IEEE.
[2] Price, P.J., 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Download Data .zip



YouTube Spam Classification + Text Emotion Classification + Paper Sentence Classification + Review Sentiment Classification

A dataset composed of four others, with the label for each item of data being the dataset it originally came from.

Dataset is composed of:
- YouTube Spam Classification [1]
- Text Emotion Classification [2]
- Paper Sentence Classification [3]
- Review Sentiment Classification [4]

References:
[1] Alberto, T.C., Lochter, J.V. and Almeida, T.A., 2015, December. Tubespam: Comment spam filtering on youtube. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on (pp. 138-143). IEEE.
[2] https://www.figure-eight.com/data-for-everyone/
[3] https://archive.ics.uci.edu/ml/datasets/Sentence+Classification
[4] Kotzias, D., Denil, M., De Freitas, N. and Smyth, P., 2015, August. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597-606). ACM.

Download Data .zip