Pinned Repositories
Analysis-Wikipedia-Entities
Goal: To understand the Wikipedia dataset, especially the entity infoboxes. Task: We have taken the Wikipedia dump; our aim is to extract information about various entity types. The steps are as follows (a sketch of the first few steps follows the list):
1. Given the Wikipedia dump, gather all the pages from Wikipedia that have infoboxes on them.
2. Find the set of all possible entity types on Wikipedia.
3. Find the set of all possible attributes that can be associated with any entity type on Wikipedia.
4. From a few values of these attributes, infer the data type of each attribute as one of the following: string, set of strings, duration, set of durations, number, date, or other.
5. Find the various units that can be used to express the value of a numeric attribute. E.g., for the "height" attribute of "person" entities, the units could be cm or inches.
6. For numeric attributes, find typical ranges (using the most popular unit). E.g., for "person" entities, the "age" attribute should have the range 0-150 years.
7. Merge attributes that are semantically similar but have different names across different entities of the same type. E.g., automatically identify that the attribute "birthdate" is the same as "bdate".
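A minimal sketch of steps 1-4 under simplifying assumptions: the dump has already been split into per-page wikitext strings, the infobox template name is used as a rough proxy for entity type, and the type inference is deliberately crude. All names here are illustrative, not the repository's actual API.

```python
import re

# Infobox template name, e.g. "{{Infobox person" -> "person" (steps 1-2).
INFOBOX_RE = re.compile(r"\{\{Infobox\s+([^|}\n]+)", re.IGNORECASE)
# Infobox fields, e.g. "| birth_date = 1879" (step 3).
ATTR_RE = re.compile(r"^\s*\|\s*([\w ]+?)\s*=\s*(.+)$", re.MULTILINE)

def extract_infobox(wikitext):
    """Return (entity_type, {attribute: raw_value}) or None."""
    match = INFOBOX_RE.search(wikitext)
    if not match:
        return None
    entity_type = match.group(1).strip().lower()
    attrs = {k.strip().lower(): v.strip() for k, v in ATTR_RE.findall(wikitext)}
    return entity_type, attrs

def infer_type(values):
    """Crudely infer an attribute's data type from a few sample values (step 4)."""
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in values):
        return "number"
    if all(re.search(r"\b\d{4}\b", v) for v in values):
        return "date"
    return "string"
```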
aspect_category
azure-docs
Open source documentation of Microsoft Azure
Extract_EmailIDs_Unstructured_webpages
Goal: To understand basic crawling, and to use simple heuristics to handle real-world, unclean web data and extract email IDs. Input: 2,000 business webpages crawled from Yelp. Each webpage is an HTML page containing details about a business. It does not contain the email ID, but it does contain the business's website address, which can be used to find the site's contact-us page and thereby extract the email ID. The task is to obtain structured data for each business: business name, phone number, home page URL, contact-us URL, and email ID.
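A minimal sketch of the extraction heuristic, assuming each business record already yields a homepage URL. The function name and regexes are illustrative, and real crawling needs the error handling and politeness delays omitted here.

```python
import re
import requests

# A permissive email pattern and a heuristic for "contact us" links.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CONTACT_RE = re.compile(r'href="([^"]*contact[^"]*)"', re.IGNORECASE)

def extract_email(homepage_url):
    """Fetch the homepage, prefer a linked contact-us page, return one email or None."""
    html = requests.get(homepage_url, timeout=10).text
    match = CONTACT_RE.search(html)
    if match:
        contact_url = requests.compat.urljoin(homepage_url, match.group(1))
        # Search the contact page first, then fall back to the homepage.
        html = requests.get(contact_url, timeout=10).text + html
    emails = EMAIL_RE.findall(html)
    return emails[0] if emails else None
```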
guidance
A guidance language for controlling large language models.
Handwritten-Digit-Recognition
The problem of handwriting recognition is to interpret intelligible handwritten input automatically. It is of great interest to the pattern recognition research community because of its applicability to many fields, enabling more convenient input devices and more efficient data organization and processing. We have to code a complete digit recognizer and test it on the MNIST digit dataset. As a benchmark for testing classification algorithms, MNIST has been widely used to design novel handwritten digit recognition systems. The dataset consists of 70,000 grayscale images, each 28×28 pixels (784 values). The recognizer reads the image data, extracts features from it, and uses a k-nearest neighbor classifier to recognize any test image. To carry out the experiments, we randomly divide the data into two partitions: the training set is used to build the classifier and the test set to determine its accuracy.
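A minimal k-nearest neighbor sketch on raw pixel vectors, assuming X_train and X_test are NumPy float arrays of shape (n, 784) and y_train is an integer array of digit labels; the actual recognizer also extracts features before classifying.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test image by majority vote among its k nearest neighbors."""
    preds = []
    for x in X_test:
        # Euclidean distance from x to every training image.
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        # Majority vote over the k nearest labels.
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)
```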
ImplementingEigenFaces
The goal of this mini project is to become familiar with the ideas of image representation, PCA, LDA, and face recognition, and to understand the practical difficulties of developing real-world systems that achieve acceptable accuracy.
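A minimal eigenfaces sketch via PCA, assuming faces is an (n_images, n_pixels) matrix of flattened grayscale face images; LDA and the recognition step (e.g., nearest neighbor in eigenface space) are left out, and the names are illustrative.

```python
import numpy as np

def eigenfaces(faces, n_components=20):
    """Return the mean face, the top principal components, and projection weights."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # SVD of the centered data yields the principal components directly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]        # the "eigenfaces"
    weights = centered @ components.T     # low-dimensional face representations
    return mean_face, components, weights
```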
Phrase-Translation
Words may not always be the best atomic unit of a sentence: one word in the source language often corresponds to multiple words in the target language, and a word-based model breaks down in these cases. This is the motivation for building a phrase-based model for translation.
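A toy illustration of why phrases help: a single source word can map to a multi-word target phrase. This greedy, monotone decoder with a hypothetical two-entry phrase table is only a sketch; real phrase-based systems score many segmentations and allow reordering.

```python
# Hypothetical phrase table: source phrases (tuples) -> target phrases.
phrase_table = {("natürlich",): "of course", ("das", "haus"): "the house"}

def translate(source_words):
    """Greedily cover the source with the longest known phrases, left to right."""
    out, i = [], 0
    while i < len(source_words):
        for j in range(len(source_words), i, -1):  # longest match first
            phrase = tuple(source_words[i:j])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i = j
                break
        else:
            out.append(source_words[i])  # pass unknown words through
            i += 1
    return " ".join(out)

print(translate(["natürlich", "das", "haus"]))  # "of course the house"
```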
Top-K-Influentials-in-Temporal-Graph
Given a social network graph, our objective is to find the top-k influential nodes such that, if these k nodes are made seeds of information, the information spreads to the maximal number of nodes within a certain number of time steps. We also wish to optimise k so that there is a reasonable trade-off between cost and time.
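A sketch of the standard greedy approach under a time-limited independent cascade, assuming graph maps each node to its out-neighbours and every edge activates with a fixed probability p. The cascade model and all parameter names are assumptions for illustration, not necessarily what this repository implements.

```python
import random

def cascade(graph, seeds, steps, p=0.1):
    """Simulate one time-limited independent cascade; return the spread size."""
    active, frontier = set(seeds), set(seeds)
    for _ in range(steps):
        frontier = {v for u in frontier for v in graph.get(u, [])
                    if v not in active and random.random() < p}
        active |= frontier
    return len(active)

def greedy_top_k(graph, k, steps, trials=100):
    """Greedily add the seed with the best Monte Carlo marginal gain."""
    seeds = []
    for _ in range(k):
        best = max((n for n in graph if n not in seeds),
                   key=lambda n: sum(cascade(graph, seeds + [n], steps)
                                     for _ in range(trials)))
        seeds.append(best)
    return seeds
```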
try_sentiment
This is an attempt to implement NRC-Canada's sentiment module for SemEval'14
satarupaguha11's Repositories
satarupaguha11/Extract_EmailIDs_Unstructured_webpages
satarupaguha11/Handwritten-Digit-Recognition
satarupaguha11/Phrase-Translation
satarupaguha11/Top-K-Influentials-in-Temporal-Graph
satarupaguha11/try_sentiment
satarupaguha11/Analysis-Wikipedia-Entities
satarupaguha11/aspect_category
satarupaguha11/azure-docs
satarupaguha11/guidance
satarupaguha11/ImplementingEigenFaces
satarupaguha11/ImplementingPerceptronAlgorithms
satarupaguha11/reRankURL
This project is based on the Personalized Web Search Challenge organized by Kaggle. The aim of the challenge is to re-rank the URLs of each SERP (search engine results page) returned by the search engine according to the personal preferences of the user.
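A minimal re-ranking sketch, assuming a hypothetical user_click_counts dictionary recording how often the user previously clicked each domain; the actual challenge uses far richer session features and relevance labels.

```python
from urllib.parse import urlparse

def rerank(serp_urls, user_click_counts):
    """Reorder a SERP by a simple personal-preference score per domain."""
    def score(url):
        return user_click_counts.get(urlparse(url).netloc, 0)
    # A stable sort keeps the engine's original order for tied scores.
    return sorted(serp_urls, key=score, reverse=True)
```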
satarupaguha11/satarupaguha11.github.io
Personal Webpage
satarupaguha11/SearchEngineForWikipedia
Given a query, search the Wikipedia corpus (46 GB) and return the titles of the top ten retrieved documents, in ranked order. Queries can be either phrase queries or field-based queries. Multi-level indexes were built to improve retrieval speed. Evaluation is done primarily on the basis of the quality of the results and the time taken for retrieval (less than 1 second). Keeping the index small was also a challenge, so compression techniques were used.
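A sketch of the core index structure, assuming documents are already tokenized: postings store doc-id gaps rather than absolute ids, the standard trick that makes compression (e.g., variable-byte coding) effective. Multi-level indexing and field-based queries are not shown, and the function name is illustrative.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index {term: [(doc_id_gap, term_freq), ...]} from {doc_id: [tokens]}."""
    index = defaultdict(list)
    last_doc = defaultdict(int)  # last absolute doc id seen per term
    for doc_id in sorted(docs):  # ascending ids make gaps small and positive
        for term in set(docs[doc_id]):
            gap = doc_id - last_doc[term]
            index[term].append((gap, docs[doc_id].count(term)))
            last_doc[term] = doc_id
    return index
```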
satarupaguha11/sentiment-stateOfTheArt
satarupaguha11/stanford-sentiment
Code released by Stanford for sentiment analysis.
satarupaguha11/Top-k
satarupaguha11/TopKInfluentialsTwitterSpreadLink
satarupaguha11/transformers
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.