Kloud Kd Tree implements Kd Tree in a map reduce framework. Details are in this paper.
-
under
./MapReduceKDT/
-
Independent Kd Trees
-
Distributed Kd Trees
-
Run
./ikdt_index.sh
or./dkdt_index.sh
under./MapReduceKDT/
-
Index trees are stored in
ikdt_index_output.py
as a python variable.
-
Run
./ikdt_query.sh
or./dkdt_query.sh
under./MapReduceKDT/
-
Output stored in
ikdt_query_output.py
as a python variable.
-
./generate.py
provides basic functionality to generate test data with various rows/dimensions/distributions. -
./index.py
can perform single machine Kd tree tests. -
./index_lsh.py
can perform single machine indexing using E2LSH scheme.
Eyra Search: avdanced way to discover walls and handles
using underlying knowledge graphs to discover related Facebook walls and @handles.
- Currently, search on Facebook using descriptive query is not effective in returning wall pages. For example, the query "Luxury department store" returns a single Australian page for a US-based user.
label propagation, transfer learning, collaborative filtering
- Unlike Facebook pages, Twitter @handles lack default category.
- Official API lacks good support for finding:
- verified, most followed @handles related to NBA
GET
- @handles in both Golf and Fashion
GET
- top 3 categories that @elonmusk should be in
ASSIGN
- Similar @handles to @BarackObama
FIND
- @handles matching _arbitrary_description_
SEARCH
(Not just string match but semantic,relational match as well)
code for our Silverback paper on efficient probabilistic association mining and a lightweight distribution framework.
We address the problem of large scale probabilistic association rules mining and consider the trade-offs between accuracy of the mining results and quest of scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated in an industrial application, and we present the commercially deployed Silverback framework, developed at Voxsup Inc. Silverback tackles the storage efficiency by proposing a probabilistic columnar infrastructure and using Bloom filters and reservoir sampling techniques. In addition, a probabilistic pruning technique has been introduced based on Apriori for mining frequent item-sets. The proposed target-driven technique yields a significant reduction on the size of the frequent item-set candidates. We present extensive experimental evaluations which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into corresponding research techniques. The experiments indicate that, when compared to traditional Hadoop-based approach for improving scalability by adding more hosts, Silverback – which has been commercially deployed and developed at Voxsup Inc. since May 2011 – has much better run-time performance with negligible accuracy sacrifices.
Two publications that describe our system:
30th IEEE ICDE 2014
Yusheng Xie, Diana Palsetia, Ankit Agrawal, Goce Trajcevski, and Alok Choudhary. Silverback: Scalable Association Mining For Massive temporal Data in Columnar Probabilistic DatabasesNU-EECS-13-06
Yusheng Xie, Diana Palsetia, Kunpeng Zhang, Ankit Agrawal, Goce Trajcevski, Alok Choudhary: Silverback: Scalable Association Mining For Massive Temporal Data in Columnar Probabilistic Databases
- Use a columnar, probabilistic storage for transactions.
- Frequent item-sets mining using Bloom Filter operations.
- Min-hash based prunning technique for drastically reducing candidate item-sets.
- A distributed Framework for multi-node scalability (Ill.)
- False-positive impact introduced by using Bloom Filters
- The dots perfectly lining up with the red line would imply there is no error introduced through Bloom Filters. The plots below show that Bloom Fitler introduces marginal errors.
- Multi-node scalability and performance
config file available upon request.
silverback core: core implementatation of mining algorithms and distributed framework. data source-independent code.
silverback fb/tw: data-dependent implementation of communication services. Different data sources may require different sets of allowable operations.
Please see the README file in each folder. To use,
$ mkdir silverback_deployment
$ cp silverback_core/*.* silverback_deployment/
$ cp locality_sensitivy_hashing/*.* silverback_deployment/
$ cp silverback_twitter_module/*.* silverback_deployment/ #(or facebook_module)
$ #then create a config.py in silverback_deployment/
Good packages for minHash
- https://github.com/embr/lsh
- https://github.com/go2starr/lshhdc
- https://github.com/rahularora/MinHash
MuSES: Multilingual Sentiment Elicitation System for Social Media Data
MuSES consists of three different sentiment identification algorithms. Our first algorithm augments previous compositional semantic rules by adding social media-specific rules. In the second algorithm, we define a scoring function that measures the degree of a sentiment, instead of simply classifying a sentiment into binary polarities. All such scores are calculated based on a large volume of customer reviews. Due to the special characteristics of social media texts, we propose a third algorithm, which takes emoticons, negation word position, and domain-specific words into account. In addition, we propose a label-free process to transfer multilingual sentiment knowledge between different languages. We conduct our experiments on user comments from Facebook, tweets from Twitter, and multilingual product reviews from Amazon.
- Publication
IEEE Intelligent Systems Magazine
Yusheng Xie, Zhengzhang Chen, Yu Cheng, Kunpeng Zhang, Daniel K. Honbo, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. MuSES: a Multilingual Sentiment Elicitation System for Social Media Data
-
The web is multilingual. A recent study shows that more than 50% of the conversions on Facebook are non-English; yet most of the infrastructural works on sentiment analysis (e.g., NLTK, WordNet) focus primarily on English.
-
Labeling process for a "new" language can be really expensive.
-
Unlike traditional documents such as Newspaper articles, Internet languages possess a lot of quirky traits such as the lack of strict grammar, heavy usage of emoticons, etc.
-
Massive amount of new tweets, comments, blogs pose a serious engineering challenge.
-
Overall Architecture
- LingPipe
guess_language
guess-language
guess-language
- Languages on Facebook
- A novel zero-effort labeling system that leverages knowledge bases like Wikipedia, and labels word-level sentiment for non-English words.
- Compositional Semantic Rule Algorithm
- Numeric Sentiment Identification Algorithm
- How to associate numeric scores with the degree of textual sentiment;
- If we assign a score to each word/phrase, how to combine all the scores of multiple words for a sentence.
- Bag-of-Words and Rule-based Algorithm
- Due to the special characteristics of social media texts, we define some domain rules to analyze sentiments. What people write on Facebook or Twitter is different in style from traditional product reviews or blog articles. Social media texts are often short and contain Internet idioms, etc.
- To unify the independent results from the three individual algorithms, MuSES employs high-level meta-learning techniques. Decision tree (DT), neural networks (NN), random forest model (RF), and logistic regression (LR) are used as our sentimental classification models. Five discriminative features are considered by these models. These features include 3 basic ones and 2 derived ones:
- S1, integer output from compositional semantic rules;
- S2, float output from the numeric sentimental identification algorithm;
- S3, output from the back-of-word and rule-based algoithm (
P
,N
, orO
); - S1+S2; and
- S1−S2.
Datasets originally prepared by @ych133
Collaborators: @miccrun @kpzhang @ych133 @liuhaotian @honbo @palsetia @alokchoudhary
project link We investigate a class of emerging online marketing challenges in social networks; macro behavioral targeting (MBT) is introduced as non-personalized broadcasting efforts to massive populations. We propose a new probabilistic graphical model for MBT. Further, a linear-time approximation method is proposed to circumvent an intractable parametric representation of user behaviors. We compare the proposed model with the existing state-of-the-art method on real datasets from social networks. Our model outperforms in all categories by comfortable margins.
Publication:
13th SIAM Data Mining 2013
Yusheng Xie, Zhengzhang Chen, Kunpeng Zhang, Md. Mostofa Ali Patwary, Yu Cheng, Haotian Liu, Ankit Agrawal, and Alok Choudhary. Graphical Modeling of Macro Behavioral Targeting in Social Networks- A version has been sent to a top Marketing journal for review.
- Target audience: Brand and social media marketers and researchers
- Problem: Marketers spend millions every month to acquire Facebook fans
- We spent $30M and acquired 20M FB fans. Now what? – a major US retailer
- Research question: How to optimally engage one’s fans so that they stay active and retain their value to the brand?
- FB user => Fan => Engaged Fan => Customer
Incorporating Conditional Random Fields (CRF) and Active Learning to Improve Sentiment Identification
Many machine learning, statistical, and computational linguistic methods have been developed to identify sentiment in sentences in documents, yielding promising results. However, most of the current state-of-the-art methods focus on individual sentences and ignore the impact of context on the meaning of a sentence. In this paper, we propose a method based on conditional random fields to incorporate sentence structure and context information in addition to syntactic information. We also investigate how human interaction affects the accuracy of sentiment labeling using limited training data. We propose and evaluate two different active learning strategies for labeling sentiment data.
-
Publication
35th ACM SIGIR 2012
Kunpeng Zhang, Yusheng Xie, Yu Cheng, Doug Downey, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. Sentiment Identification by Incorporating Syntactics, Semantics and Context Information -
Publication
Elsevier Neural Networks
Incorporating Conditional Random Fields and Active Learning to Improve Sentiment Identification
- How to take full advantage of the sentence structure;
- How to use context information to capture the relationship among sentences and to improve document-level sentiment classification;
- How to account for Internet language word set and emoticons;
- How to incorporate human interaction to improve sentiment identification accuracy and construct a large training dataset.
-
We want to capture the context information (e.g., neighboring sentences or sentences connected by transition words) among sentences in a document. The procedure of sentiment identification therefore becomes a kind of sequence labeling.
-
The goal of the model is to give a label to each sentence corresponding to the sentence sequence. We use CRF as a tool to model this sequence labeling problem.
- Number of Positive/Negative Words
- Containing Any Positive/Negative Emoticons
- Comparative Sentences (JJR vs RBR vs JJS vs RBS)
- Type of Conjunction Words (IN vs CC)
- Sentence Position in Paragraph
- Sentence/clause Complexity
- Position of Positive/Negative Words
- Position of Negation Words
- Comparison Subject
- Similarity to Neighboring Sentences
- Levenshtein
- LSI-Levenshtein
python ./prepare.py
check the parameters, paths, etc.python ./generate_template.py
if necessary.bash ./run_crf.sh
modifyrun_crf.sh
if necessary.
- Dataset contains long, complicated sentences from Stanford & Northwestern.
- CRF++ is a great tool.
Collaborators: @kpzhang @hixiaoxi
Recommendation systems automatically recommend relevant items to users based on their preferences, which are vital to the success of online retailers and content providers. Collaborative filtering works well in practice at web scale. However, one common difficulty in collaborative filtering rec- ommendation systems is the "cold start" problem. The word "cold" refers to the items that are not yet rated by any user or the users who have not yet rated any items. We propose ELVER, an algorithm for recommending cold items from large, sparse user-item matrices. We use ELVER to recommend and optimize page-interest targeting on Facebook. Special traits of a social network like Facebook have influenced the design of ELVER. Existing techniques for cold recommendation mostly rely on content features in the event of lacking user ratings. Traditional items (e.g., movies or music) have rich, organized content features like actors, directors, awards, etc. Since it is very hard to construct universally meaningful features for the millions of Facebook pages, ELVER makes minimal assumption of content features.
2013 IEEE International Conference on Big Data
Yusheng Xie, Zhengzhang Chen, Kunpeng Zhang, Yu Cheng, Chen Jin, Ankit Agrawal, and Alok Choudhary. Elver: Recommending Facebook Pages in Cold Start Situation Without Content Features