Dataset – The WISE Lab

BiliBili Time Synchronized Comments (TSC) Dataset

This is the Time Synchronized Comments (TSC) dataset from BiliBili.com (commonly known as the B-site), which is one of the largest user-interactive video sharing websites in China. We used the dataset for the Personalized Key Frame Recommendation research published in SIGIR 2017, which attempts to display personalized key frames for different users even on the same video. This released dataset is larger and more complete than what we used in the paper, including more than 500K users, 900 videos, and 1.5 million time synchronized comments. This data can support large-scale model training for various research tasks in Recommendation, IR, Multimedia, etc. Please stay tuned as we are going to release a more complete version which includes the user profiles and video metadata. The dataset can be accessed by clicking here (https://www.dropbox.com/sh/3p51zbc9as2l72a/AACpBNbzxrlV-jS09GG9G1Rea?dl=0). We appreciate your citing the following paper if using this dataset for your research.

References:

[1] Xu Chen, Yongfeng Zhang, Qingyao Ai, Hongteng Xu, Junchi Yan, and Zheng Qin. Personalized Key Frame Recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), August 7 – 11, 2017, Tokyo, Japan.

JD.com E-Commerce Data

JD.com is one of the largest Chinese E-commerce websites. This dataset contains consumer purchasing behaviors, user ratings, reviews, and product metadata from Jan 1, 2011 to Mar 31, 2014 (three and a quarter years), covering 15 first-level product categories, 987 second-level product categories, nearly 2 million users, over 100K products, and over 60 million reviews. Each textual review in this dataset consists of three sub-reviews: a positive review, a negative review, and an overall review. The dataset can be downloaded by clicking here (https://pan.baidu.com/s/1hsQSTbm), and the download passcode is 3ru2.

References:

[1] Yongfeng Zhang, Min Zhang, Yi Zhang, Guokun Lai, Yiqun Liu, Honghui Zhang, Shaoping Ma. Daily-Aware Personalized Recommendation based on Feature-Level Time Series Analysis. In Proceedings of the 24th International World Wide Web Conference (WWW 2015), May 18 – 22, 2015, Florence, Italy.

Amazon Baby Registry Dataset

This dataset is from Amazon Baby Registry (http://www.amazon.com/babyregistry). On this website, people register a wishlist of products to purchase for their new baby. As a result, each list is a set of complementary products. Each list contains the user_id and a list of product_ids; for each product, we know its title, brand, price, category (book, toy, etc.), and product URL. The dataset can be downloaded here (http://yongfeng.me/attach/amazon_baby_registry.zip).

References:

[1] Qi Zhao, Yongfeng Zhang, Yi Zhang, Daniel Friedman. Multi-Product Utility Maximization for Economic Recommendation. In Proceedings of the 10th International Conference on Web Search and Data Mining (WSDM 2017), February 6 – 10, 2017, Cambridge, UK.

Dianping Review Dataset

This dataset contains user reviews and detailed business metadata crawled from DianPing.com, a well-known Chinese online review website. It comprises 3,605,300 reviews written by 510,071 users about 209,132 businesses. The numerical ratings of this dataset are used for collaborative filtering (Localized Matrix Factorization) in [1] and [2], and the textual reviews are used for sentiment analysis and explainable recommendation in [3] and [4], respectively. Detailed data format descriptions are included in the readme.txt file. This dataset can be downloaded here (http://pan.baidu.com/s/1dDxkY0x). The download password is “t23c”, and the extraction password for the zip file is “yongfeng.me”.

References:

[1] Yongfeng Zhang, Min Zhang, Yiqun Liu, Shaoping Ma and Shi Feng. Localized Matrix Factorization for Recommendation based on Matrix Block Diagonal Forms. In Proceedings of the 22nd International Conference on World Wide Web (WWW 2013), May 13 – 17, 2013, Rio de Janeiro, Brazil.

[2] Yongfeng Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Improve Collaborative Filtering Through Bordered Block Diagonal Form Matrices. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2013), July 28 – August 1, 2013, Dublin, Ireland.

[3] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level Sentiment Classification. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 – 11, 2014, Gold Coast, Australia.

[4] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu and Shaoping Ma. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 – 11, 2014, Gold Coast, Australia.

Phrase-level Sentiment Labeled Reviews

This dataset contains reviews from two domains: restaurants and digital cameras. From each review, we extract all pairs of product feature word and user opinion word (e.g. service | good, price | reasonable, etc.), together with the sentiment polarity of each pair. The methods for sentiment polarity labelling and feature-opinion pair extraction are presented in [1] and [2], respectively, and this dataset is used for explainable recommendation in [3]. We also provide the toolkit for extracting such product feature and user opinion words from arbitrary (English or Chinese) textual corpora. For details, please refer to the “Sentires” toolkit under the Software tab of this website.

The labeled Dianping dataset can be downloaded here (http://pan.baidu.com/s/1zsWMU) with the password “ouk2”, and the labeled digital camera (DC) review dataset can be downloaded here (http://pan.baidu.com/s/1bnGlVdl) with the password “tier”. The extraction password for both zip files is “yongfeng.me”. A brief description of the data format is as follows:

A user review is formatted as an XML entry of the form:

User_id Item_id flavor_rating environment_rating service_rating

review_text

feature-opinion pairs matched in the review_text, each of the form [feature_word, opinion_word, sentiment_polarity, times_of_occurrence, reversed_or_not]

For example, [service, good, +1, 1, Y] means that the pair ‘service | good’ is matched once in the review and that the pair itself carries a positive sentiment (+1). However, the pair is reversed by a negation word such as ‘not’ (Y means reversed, N means not reversed), so the final sentiment of this pair in this review is negative.
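The following Python sketch illustrates how one such five-field entry could be read and how its final sentiment follows from the polarity and the reversal flag. It is not part of the released data or of the Sentires toolkit. The names FeatureOpinionPair and parse_pair are hypothetical, and the sketch assumes that each entry appears as a bracketed, comma-separated string and that feature and opinion words contain no commas; the readme shipped with the data remains the authority on the actual file layout.

```python
from typing import NamedTuple


class FeatureOpinionPair(NamedTuple):
    """One matched pair; the class and field names are illustrative only."""
    feature: str
    opinion: str
    polarity: int      # sentiment of the pair itself: +1 or -1
    occurrences: int   # how many times the pair is matched in the review
    is_reversed: bool  # True if a negation word flips the sentiment

    @property
    def final_polarity(self) -> int:
        # A reversed pair contributes the opposite of its own polarity.
        return -self.polarity if self.is_reversed else self.polarity


def parse_pair(text: str) -> FeatureOpinionPair:
    """Parse a string such as '[service, good, +1, 1, Y]'.

    Assumes a bracketed, comma-separated entry with exactly five fields.
    """
    fields = [field.strip() for field in text.strip().strip("[]").split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {text!r}")
    feature, opinion, polarity, count, flag = fields
    if flag not in ("Y", "N"):
        raise ValueError(f"reversal flag must be 'Y' or 'N', got {flag!r}")
    return FeatureOpinionPair(feature, opinion, int(polarity), int(count), flag == "Y")


if __name__ == "__main__":
    # The entry from the description above: positive pair, negated in the review.
    pair = parse_pair("[service, good, +1, 1, Y]")
    print(pair.feature, pair.opinion, pair.final_polarity)
```

Keeping the pair's own polarity separate from the reversal flag, as the data format does, lets the same pair be counted as positive in one review and negative in another without relabeling the pair itself.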

References:

[1] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level Sentiment Classification. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 – 11, 2014, Gold Coast, Australia.

[2] Yunzhi Tan, Yongfeng Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. A Unified Framework for Emotional Elements Extraction based on Finite State Matching Machine. Natural Language Processing and Chinese Computing, Communications in Computer and Information Science (CCIS), Volume 400, 2013, pp 60-71.

[3] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu and Shaoping Ma. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 – 11, 2014, Gold Coast, Australia.