

Text Categorization using WordNet and Scholarly Article Domain Classification using Supervised Learning Algorithms A PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF BACHELOR IN TECHNOLOGY IN COMPUTER ENGINEERING SUBMITTED BY HARSH KUMAR 2K14/CO/041 KUSHAGRA JAIN 2K14/CO/055 SANIDHYA PAL 2K14/CO/105 VARUN KASHYAP 2K14/CO/139 UNDER THE SUPERVISION OF Ms. MINNI JAIN DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING DELHI TECHNOLOGICAL UNIVERSITY (FORMERLY DELHI COLLEGE OF ENGINEERING) Bawana Road, Delhi-110042 May, 2018

DELHI TECHNOLOGICAL UNIVERSITY (FORMERLY DELHI COLLEGE OF ENGINEERING) Bawana Road, Delhi-110042

CANDIDATE'S DECLARATION

This is to declare that the project titled "Text Categorization using WordNet and Scholarly Article Domain Classification using Supervised Algorithms", which is submitted by us to the Department of Computer Science and Engineering, Delhi Technological University, Delhi, in partial fulfillment of the requirement for the award of the degree of Bachelor in Technology, is original and not copied from any source without proper citation. This work has not previously formed the basis for the award of any Degree, Diploma, Associateship, Fellowship or other similar title or recognition.

Place: Delhi
Date:

Harsh Kumar 2K14/CO/041
Kushagra Jain 2K14/CO/055
Sanidhya Pal 2K14/CO/105
Varun Kashyap 2K14/CO/139

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING DELHI TECHNOLOGICAL UNIVERSITY (FORMERLY DELHI COLLEGE OF ENGINEERING) Bawana Road, Delhi-110042

CERTIFICATE

This is to certify that the Project Dissertation titled "Text Categorization using WordNet and Scholarly Article Domain Classification using Supervised Algorithms", which is submitted by HARSH KUMAR 2K14/CO/041, KUSHAGRA JAIN 2K14/CO/055, SANIDHYA PAL 2K14/CO/105 and VARUN KASHYAP 2K14/CO/139 to the Department of Computer Science and Engineering, Delhi Technological University, Delhi, in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology, is a record of the project work carried out by the students under my supervision. To the best of my knowledge, this work has not been submitted in part or full for any Degree or Diploma to this University or elsewhere.

Place: Delhi
Date: 15th May 2018

Ms. Minni Jain
Assistant Professor

ACKNOWLEDGEMENT

We take this opportunity to express our gratitude to everyone who helped and supported us throughout this project. We express our special appreciation and thanks to our guide, Ms. Minni Jain, who has been a tremendous mentor to us, for her continuous support from the initial conceptualization stage through ongoing advice to the project's conclusion. We wish to thank her for her undivided support, patience, motivation, and immense knowledge, which inspired and encouraged us to take the right course. Last, but not least, we want to thank our friends and family, who appreciated our work and motivated us.

Abstract

With the rapid spread of the Internet and the increase in online information, technology for automatically classifying huge amounts of diverse text information has come to play a very important role. In the 1990s, the performance of computers improved sharply and it became possible to handle large quantities of text data. This led to the machine learning approach, in which classifiers are created automatically from text data labeled with categories. This approach provides excellent accuracy, reduces labor, and ensures conservative use of resources. Text categorization is a growing area in the field of text mining. Text documents possess a huge number of features due to their unstructured nature. In today's society, a large portion of the world's population gets its information through electronic devices. This opens up the possibility of enhancing their reading experience by personalizing information for readers based on their previous preferences.

Keywords: Multi-label Text Categorization; Lexical Analysis; Semantic Analysis; WordNet; Natural Language Processing; Machine Learning; Support Vector Machine; Feature Selection; Natural Language Toolkit

Contents

Abstract
Acknowledgement
1 Introduction
2 Multi-label Text Classification
2.1 Introduction to text classification
2.2 Pre-processing
2.2.1 Data transformation
2.2.2 Feature selection
2.2.3 Part-of-Speech Filtering
2.3 Common text classification issues
2.3.1 Overfitting
2.3.2 Imbalanced Data
2.4 Multi-label classification
2.5 Choosing a classifier
2.5.1 Support vector machines
2.5.2 Logistic Regression
2.5.3 Naïve Bayes
2.5.4 Random Forest
3 WordNet
3.1 WordNet Introduction
3.2 Database Contents
3.3 Lexical Ontologies
3.4 Limitations

4 Text Categorization Using Support Vector Machines
4.1 Support Vector Machines
4.2 Why Should SVMs Work Well for Text Categorization
5 Text Categorization: Implementation Details and Results
5.1 Implementation Details
5.2 Results
6 Conclusions
6.1 Summary
6.2 Future Directions
References
Appendix

List of Figures 1.1 The text classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 The text classification process . . . . . . . . . . . . . . . . . . . . . 9 2.2 Visual representation of Overfitting. . . . . . . . . . . . . . . . . 12 3.1 Example Entry (Hamburger- WordNet) . . . . . . . . . . . . . 17 3.2 WordNet for Container. . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.1 Support Vector Machine. . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Learning without best features. . . . . . . . . . . . . . . . . . . . . 24 4.3 Classes and Support Vectors. . . . . . . . . . . . . . . . . . . . . . 29 5.1 Classification with ACM Classifier. . . . . . . . . . . . . . . . . 30 5.2 Naïve Bayes- Confusion Matrix. . . . . . . . . . . . . . . . . . . . 34 5.3 SVM- Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . 36

Chapter 1

Introduction

With the rapid growth of the Internet, it has become natural to handle text not as print but as online information. Today, we can search books and news electronically. Almost all companies and many individuals have their own web pages and publish their information there. When there is some information to find, it is usual to search for it on the Internet. In this way, a great deal of information has become open to the public. Whoever has much information needs to classify it. When the scale of the collection is small, or the texts were written by oneself, it is possible to classify them by hand. However, in the current situation, where ordinary people post a great many texts to the Internet, classifying them manually is becoming impossible. It is therefore necessary to classify text information automatically into various fields. Moreover, new services that had never been considered before appeared along with the growth of online information. They include mail filters, web filters, and online help desks. Mail filters shut out unsolicited business e-mails or spam e-mails sent to many unspecified persons by classifying e-mail as "ordinary mail" or "spam mail." Web filters mainly prevent children from accessing undesirable website content, such as violence and pornography, by classifying web sites into "harmless" or "detrimental" categories. Online help desks that distribute electronic mails from customers to operators may belong to a very special field. Text classification technology is essential to all these services.

In looking back upon the historical development of text classification technology, we find that the rule-based approach, i.e., the method of writing classification rules by hand, was mainly used until the second half of the 1980s. This approach is simple, but it is difficult to hand-craft rules that achieve high accuracy, and many rules have to be rewritten whenever the domain changes. In the 1990s, however, the performance of computers improved rapidly and it became possible to handle a great amount of text data. This led to the machine learning approach, which creates classifiers automatically from text data that were previously labeled with categories by hand. This approach became popular because of its high classification accuracy, reduction of labor, and conservative use of resources. Various machine learning methods have been applied to text categorization to date. These include k-nearest-neighbor [Yang94], decision trees [Lewis94] and naive Bayes [Lewis94]. Using a large number of words as features in these methods, each of which can potentially contribute to the overall task, poses a challenge for machine learning approaches. More specifically, the difficulties in handling a large input space are twofold:

• How to design a learning algorithm that makes effective use of large-scale feature spaces.
• How to choose an appropriate subset of words for effective classification.

These two factors are interdependent [Yang97]. We should develop learning machines together with feature selection criteria, because the suitable set of words differs greatly depending on the learning method [Lewis94]. Support Vector Machines (SVMs) [Vapnik95, Cortes95] construct the optimal hyperplane that separates a set of positive examples from a set of negative examples with a maximum margin. A margin is, intuitively, the distance from a data point to the classification boundary. SVMs have been shown to achieve good generalization

performance for a wide variety of classification problems that require large-scale input spaces, such as handwritten character recognition [Vapnik95] and face detection [Osuna98]. Two studies have explored the use of SVMs for text categorization [Joachims98, Dumais98]. Although they both achieve promising performance, they use completely different feature (word) selection strategies. In [Joachims98], words are considered features only if they occur in the training data at least three times and if they are not stop words such as "and" and "or." This approach then employs the inverse document frequency (IDF) [Salton88] as the value for each feature. In contrast, [Dumais98] considers only 300 words for each category, selected by thresholding on high mutual information [Cover91]. The feature value in this case is a binary value indicating whether a word appears in a text or not. A natural question about SVM text categorization occurs to us: how much influence do different feature selection strategies have? Is there one best strategy for choosing appropriate words? Feature selection becomes especially delicate in agglutinative languages such as Japanese and Chinese, because in these languages word identification itself is not a straightforward task. Unknown words output by word-segmentation and part-of-speech tagging systems contain both important keywords (like personal and company names) and useless fragments of words. The selection of these unknown words is crucial for these languages. To address these questions, this thesis investigates the effect of prior feature selection in SVM-based text categorization using Japanese newspaper articles. In our experiments, the size of the input space was gradually increased by two distinct criteria: mutual information (MI) filtering and part-of-speech (POS) filtering. MI filtering selects discriminating words for a particular category from an information-theoretic viewpoint. Words with higher mutual information are more representative of a specific text category. In contrast, POS filtering constructs the word input space based on parts of speech.
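The MI criterion can be illustrated with a small self-contained sketch. The toy English corpus and the `mutual_information` helper below are illustrative assumptions, not the thesis's actual Japanese-newspaper setup:

```python
from math import log

# Toy labeled corpus standing in for newspaper articles: (words, category).
corpus = [
    (["goal", "match", "news"], "sports"),
    (["match", "final", "news"], "sports"),
    (["election", "vote", "news"], "politics"),
    (["election", "speech", "news"], "politics"),
]

def mutual_information(word, corpus):
    """MI (in bits) between the event 'word occurs in the document'
    and the document's category label."""
    n = len(corpus)
    mi = 0.0
    for present in (True, False):
        for cat in {c for _, c in corpus}:
            joint = sum(1 for ws, c in corpus
                        if (word in ws) == present and c == cat) / n
            p_w = sum(1 for ws, _ in corpus if (word in ws) == present) / n
            p_c = sum(1 for _, c in corpus if c == cat) / n
            if joint > 0:
                mi += joint * log(joint / (p_w * p_c), 2)
    return mi

# A perfectly discriminating word carries 1 bit of information about the
# category; a word that occurs everywhere carries none, so MI filtering
# would keep the former and drop the latter.
print(mutual_information("election", corpus))  # 1.0
print(mutual_information("news", corpus))      # 0.0
```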

SVMs [Joachims98, Dumais98, Taira99] have been applied to text categorization with remarkable success. However, the inductive approach cannot guarantee a sufficiently high accuracy when there is a great difference between the training and test data distributions. This problem becomes extremely serious if the amount of training data is small, which is often the case under many practical conditions such as the classification of on-line Internet texts. Therefore, it is reasonable to utilize the distribution of the unlabeled test data for training as well as the distribution of the small number of labeled training data. Nigam et al. proposed an EM-based method with naive Bayes to take account of the distribution of the test data when there is only a small amount of training data [Nigam00]. Although the method shows substantial improvement over the performance of the standard naive Bayes classifier, one of its limitations is a tendency to fall into local optima. In contrast, Joachims adapted a transductive method to SVMs [Joachims99] and obtained an improvement in classification performance. Transduction is a general learning framework that minimizes classification errors only for the test data, while induction tries to minimize classification errors for both the training and test data [Vapnik98]. Transductive SVM (TSVM) achieves high performance by assuming that the proportion of unlabeled examples to be classified into the positive class is determined by the ratio of positive to negative examples in the training data. Nevertheless, the possibility of a decrease in performance remains under different but typical conditions, when the ratio of positive to negative examples in the training data is very different from that in the test data, e.g., when a classifier that learned from articles in 1995 is applied to classify articles related to the Internet in 2000.

1.1 Previous research

In a study by William B. Cavnar and John M. Trenkle, the authors investigated how the N-gram model can be used to classify texts into categories. The N-gram model is used because it has

some tolerance for, e.g., spelling errors. They achieved 99.8% classification confidence for news articles and about 80% classification confidence for computer-related articles [6]. Thorsten Joachims investigated how well a Support Vector Machine (SVM) classifies text compared to other classification models. The comparison is done by classifying a number of news articles into categories. The SVM achieved a classification confidence of about 86%, while naive Bayes achieved 72% [3]. In a study from Stanford University, the authors describe how, with the help of an SVM, they used a new algorithm for machine learning. The most common way to use an SVM is with a randomly selected dataset that has already been classified, but in this study they used pool-based active learning, in which the machine has access to a pool of unclassified units and can request their classifications. The results show that the authors' algorithm can contribute both to the traditional supervised learning setting and to the transductive setting, in which a classifier only tries to predict on a known set of examples. The study also showed that a larger amount of unclassified data increased the quality of the classifier. L. Douglas Baker and Andrew Kachites McCallum write, in their article Distributional Clustering of Words for Text Classification, about distributional clustering in text classification. In this method, words are grouped depending on their class label. The result of the study is that they could reduce feature dimensionality by three orders of magnitude while losing only 2% accuracy. They used a classification tool based on the naive Bayes model. The data they examined consisted of news articles with a total of 62,258 words. Their method of distributional clustering achieved a classification confidence of 82.1% [8].

In the article Learning and Revising User Profiles: The Identification of Interesting Web Sites, the authors M. Pazzani and D. Billsus used a naive Bayes classifier to determine which web pages a user could be interested in. They tested the classifier on nine different areas and it proved to be both effective and accurate. The authors noted that the classification confidence increased with the number of examples used, which was expected based on previous research [9]. For statistical classifier design using a large number of unclassified text examples and sub-categories, I also apply a probabilistic model, called an Extended Tied Mixture (ETM) model [Ueda01], with a Bayesian learning algorithm. I also apply a confidence-based "active labeling" method [Ueda01] for text categorization to efficiently select informative samples for labeling. The usefulness of the method is assessed through experimental results. Recently, based on probabilistic classifier design, several methods using a large number of unlabeled samples gathered at little cost have been proposed to overcome the small sample size problem [Shahshahani94, Miller97, Nigam00]. Before those methods were developed, McLachlan et al. [McLachlan87] originally formulated this kind of problem as partial nonrandom classification. In this thesis, to overcome this small sample size problem, I utilize ETM models, which are natural extensions of McLachlan's basic model, for text categorization. In addition, we can often assume a potential sub-category for a given category; for example, the "sports" category could have the sub-categories "baseball" and "soccer." The ETM models also contain the idea of latent sub-categories. Furthermore, I consider the issue of how to select informative samples for new labeling from unlabeled samples to be an important practical problem. Active labeling is required when the number of labeled samples is initially too small or the current classification performance is unsatisfactory. This active labeling technique is essentially different from conventional active learning for pattern recognition (e.g., [Cohn96]) in the sense that the label information is completely missing.

1.2 Using Machine Learning in Text Classification

scikit-learn provides implementations for a large number of machine learning models, spanning a few different families:

• Linear models: Linear Regression, Logistic Regression, …
• Ensemble models: Random Forest, Gradient Boosting Trees, AdaBoost, …
• Bayesian models: (Multinomial/Gaussian/…) Naive Bayes, Gaussian Processes, …
• Support Vector Machines, k-Nearest Neighbors, and various other models.

scikit-learn, on the other hand, does not implement deep learning models or more specialized models, such as Bayesian graphical models or sequence-based models. For those, you will have to use specialized libraries such as tensorflow, keras or hmmlearn. Ultimately, no model is universally superior to all other models, and the performance of a model greatly depends on the data used and the nature of the classification/prediction problem being solved. This is known as the No Free Lunch theorem. Now, let us benchmark the following three models:

• Logistic Regression: A linear classifier, mostly similar to traditional linear regression, but that fits the output of the logistic function.
• (Multinomial) Naive Bayes: A Bayesian model that assumes total independence between features. In our case, this means that P("football") is unrelated to P("stadium"), which of course is a terrible assumption. Still, this model works surprisingly well with the Bag of Words model, and has notably been used for spam detection.
• Random Forest: Random Forest (as the name might suggest) is the ensembling of a large number of decision trees, each trained on a random subset of the input features. They work well when complex feature relations are involved and are relatively robust to overfitting.
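A minimal benchmarking sketch with scikit-learn might look as follows. The toy corpus and label names are invented for illustration; a real benchmark would use a proper labeled dataset and held-out evaluation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training corpus: Bag-of-Words features plus category labels.
train_docs = ["the team scored a late goal", "a great goal won the match",
              "parliament passed the new bill", "the senate debated the bill"]
train_labels = ["sports", "sports", "politics", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# The three models under comparison, with reasonable default parameters.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the same features and classify one unseen sentence.
predictions = {}
for name, model in models.items():
    model.fit(X, train_labels)
    predictions[name] = model.predict(vec.transform(["an injury-time goal"]))[0]
```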

We usually also perform a hyperparameter search for each model: tuning each of its "knobs" (number of trees for Random Forest, penalty for Logistic Regression) until we find the optimal values. For simplicity, we skip this step below and directly provide reasonably good parameters for each model.

Model evaluation

One common mistake when evaluating a model is to train and test it on the same dataset: this is problematic because it does not evaluate how well the model works in realistic conditions, on unseen data, and models that overfit the data will seem to perform better. It is common practice to split the data into three parts:

• A training set that the model will be trained on.
• A validation set used for finding the optimal hyperparameters.
• A test set to evaluate the model's performance.

Figure 1.1: The text classification
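Such a three-way split can be sketched with two successive calls to scikit-learn's train_test_split; the 60/20/20 proportions here are just one common choice, not a prescription:

```python
from sklearn.model_selection import train_test_split

# Placeholder data; in practice these would be vectorized documents and labels.
data = list(range(100))
labels = [i % 2 for i in range(100)]

# First split off a 20% test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    data, labels, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 80 = 20 samples

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```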

Chapter 2

Multi-label Text Classification

In this chapter, an overview of text classification will be given. We will start with a general introduction. After that, the pre-processing procedure will be explained and common classification issues will be reviewed. Lastly, the choice of classifiers and their function will be discussed.

2.1 Introduction to text classification

The goal of automatic text classification is to assign a document to a category by evaluating its text components. It has been applied successfully many times and is integrated into our everyday lives. For instance, newspaper articles and academic papers are often organized by subject or field. Text classification provides a way to organize countless articles and papers automatically. Another well-known application is spam filtering. It is becoming more and more common to receive unsolicited emails daily. By using text classification to label these emails as spam, they can be filtered automatically. Sculley et al. [3] achieve the following results in spam filtering by using two variants of Support Vector Machines: 0.88 accuracy, 0.91 precision, 0.85 recall.

Another interesting application of text classification is automated threat detection in social media. Kandias et al. [4] use text classification to detect negative attitudes towards law enforcement. They classify comments into two categories, one with negative-attitude comments and one with neutral-attitude comments.

2.2 Pre-processing

Before one can build a classification model, it is necessary to transform the data into useful features. The first six steps of Figure 2.1 will be used as reference for the pre-processing process.

Figure 2.1: The text classification process

2.2.1 Data transformation

Tokenization is the first important step in the transformation of the dataset. Before we can use a text for classification, we need to be able to identify the separate features of a text. This can be done in different ways. For example, we can split the text into a set of words, or we can split it into a set of word sequences. These words or word sequences will be referred to as tokens hereafter.

Splitting the text into a set of words is done by searching for specific token separators. Two examples of separators that are easy to determine are white spaces and newline characters. However, some separators are ambiguous, such as a dot: it can be part of a token when it is in a number such as 35.0, but it can also indicate the end of a sentence. Splitting the text into a set of word sequences is called n-gram processing. An n-gram is a continuous sequence of n entities of a text. Entities can be text elements like words, syllables or characters. The benefit of using n-grams is the gain of more context in a single token and that information from word order can be captured. Some words are more meaningful when combined with a second or even a third word. For example, the word 'advisory' is more meaningful when it is combined with the word 'board'. Abou-Assaleh et al. [5] show that using n-grams can be beneficial for text classification performance when using up to three entities per token. Stemming is an example of natural language processing. There are other methods of language processing that can be used, but stemming is the most common. Stemming means transforming words with approximately the same semantics to their standard form; for example, reducing all words in plural form to singular form. Another option is to transform all words in past tense to present tense. Examples of other methods of language processing are text simplification or adding synonyms of the words to the token set. Deleting stop words can reduce the size of the feature set without performance loss. Stop words like 'the' and 'a' are likely to have no predictive value. Making a vector representation of a text means transforming the feature set of tokens into a dictionary where numbers represent words, since a classifier can only work with numbers and not with the textual representation of a word. There are several ways to vectorize; vectorization by count and tf-idf will be discussed [6].
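The tokenization and n-gram steps described above can be sketched in a few lines of plain Python. The regular expression is a deliberate simplification: it keeps decimal numbers like 35.0 intact, but a production tokenizer would handle many more cases:

```python
import re

def tokenize(text):
    """Split on whitespace and punctuation, but keep decimal numbers
    such as 35.0 as a single token (the ambiguous-dot case)."""
    return re.findall(r"\d+\.\d+|\w+", text.lower())

def ngrams(tokens, n):
    """All continuous sequences of n tokens (word-level n-grams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The advisory board reported 35.0 percent growth.")
print(tokens)
# ['the', 'advisory', 'board', 'reported', '35.0', 'percent', 'growth']

# Bigrams keep 'advisory' and 'board' together in a single token.
print(ngrams(tokens, 2)[:2])
# ['the advisory', 'advisory board']
```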

A count vectorizer determines the value of a token based on its frequency; this is called term frequency. It keeps track of how many times each separate token occurs in a text. The more often a token occurs, the higher its value. Tf-idf takes this one step further: it considers the term frequency but also takes the token's specificity into account. This combination is called term frequency – inverse document frequency. For example, the token 'the' occurs with high frequency in almost all texts, so a tf-idf vectorizer will assign a low value to this token. A word that occurs often in a few texts but rarely in the others will receive a higher value.

2.2.2 Feature selection

Feature selection is a way of finding the best possible feature set by reducing the number of noisy or unspecific features. Reducing the feature set to a smaller set of specific features can prevent overfitting. It is also important when the feature set is too large to handle, that is, when processing and classifying the entire feature set takes so much time that it is hard to work with. In that case, performing dimensionality reduction can reduce the training time of the algorithm without performance loss [7]. An example would be to filter out the words that occur often in all texts of the dataset; these words do not contribute much to classification performance but still consume resources.

2.3 Common text classification issues

2.3.1 Overfitting

Overfitting occurs when the model learns the training data too literally instead of learning the general patterns that can be found in the data. It prevents the model from correctly predicting unseen data. For a visual representation, see Figure 2.2.
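With scikit-learn, the contrast between the two vectorizers can be sketched as follows. The tiny corpus is made up for illustration; the point is that the ubiquitous token 'the' receives the lowest inverse-document-frequency weight:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the match was exciting",
        "the election was close",
        "the match decided the title"]

# Count vectorizer: each cell of the matrix is a raw term frequency.
counts = CountVectorizer().fit_transform(docs)

# Tf-idf vectorizer: term frequency scaled by inverse document frequency.
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(docs)
vocab = tfidf_vec.vocabulary_
idf = tfidf_vec.idf_

# 'the' occurs in every document, so its idf weight is the smallest;
# 'title' occurs in only one document, so its idf weight is larger.
print(idf[vocab["the"]] < idf[vocab["title"]])  # True
```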

(a) Right fit by the model (b) Overfitting by the model

Figure 2.2: Visual representation of overfitting

There is no general solution to overfitting, because there are many reasons why it can happen. A solution that is often applicable is to collect more data. In cases where this is not possible, the origin of the overfitting problem and its corresponding solution should be determined. There is, however, a way to detect overfitting by using cross-validation [8]. Cross-validation is a technique of partitioning the training and test data; this is different from just having fixed percentages of training and test data. An example is the k-fold method. When using k-fold cross-validation, the dataset is partitioned into k equally large subsets. One of these subsets is declared the test set and the remaining sets form the training data. This process repeats itself k times with different test and training sets. Every subset is thus used once to produce a prediction, and these results are combined. The results represent how well the model generalizes to new data. If the cross-validation score is much higher than the score on truly unseen data, it is likely that the model is overfitting. For small datasets, cross-validation can provide some protection against overfitting. If there is little data, a classifier is likely to overfit due to high variance, which will be explained in Section 2.5. Cross-validation uses all data to train and test, which means the model is trained on more data than with a fixed train/test split. This reduces variance and can lower overfitting.
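The k-fold partitioning can be sketched with scikit-learn's KFold (k = 5 over ten placeholder samples; real use would pass vectorized documents):

```python
from sklearn.model_selection import KFold

data = list(range(10))   # placeholder samples
kf = KFold(n_splits=5)   # k equally large subsets

fold_sizes = []
for train_idx, test_idx in kf.split(data):
    # Each round: 8 samples form the training data, 2 are held out as
    # the test set, and every sample is held out exactly once overall.
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```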

2.3.2 Imbalanced data

Real-world datasets often lack an even distribution: some classes may have more data points than others. This should be taken into account, as the classifier might develop a bias towards the larger classes. With a biased classifier, it could seem that the classifier achieves a good performance. For instance, suppose there are two classes x and y, where class x has 80 data points and class y has 20. If the classifier were to develop a bias for class x and classify everything as that class, an accuracy of 0.8 would be achieved. This seems to be a good performance even though all instances of class y are wrongly predicted. There are several methods to tackle this problem. One option is to resample the dataset in order to get equally sized classes. When there is a small dataset, over-sampling is used: copies of instances of the small classes are added to the dataset. However, having duplicates in the data could cause overfitting. When there is a large dataset, under-sampling is used: instances of the large classes are deleted. When the dataset does not allow for under-sampling or over-sampling, there are other methods, such as assigning different misclassification costs. With this method, misclassifying a large class has a lower cost than misclassifying a smaller class. This way, the classifier is encouraged to assign the label of the smaller class more often.

2.4 Multi-label classification

As mentioned in Chapter 1, most traditional text classification algorithms are based on binary problems. In a binary setup, an instance from the dataset can receive only one label. A common approach is to convert the multi-label problem into multiple separate binary problems. This has been applied successfully; for instance, Dhillon et al. [9] apply this by using a one-versus-rest method and achieve satisfactory results. In this method, one binary classifier is trained per label, as opposed to training a single classifier for all labels. This solution is independent of the algorithm used for classification.
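The one-versus-rest construction can be sketched with scikit-learn. The three documents and two labels below are invented for illustration; any base classifier could replace the logistic regression:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up multi-label corpus: the third document carries two labels.
docs = ["goal scored in the final match",
        "election results announced today",
        "the minister attended the match"]
labels = [["sports"], ["politics"], ["politics", "sports"]]

# One binary indicator column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

X = CountVectorizer().fit_transform(docs)

# One binary logistic-regression classifier is trained per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

print(list(mlb.classes_))  # ['politics', 'sports']
print(Y.shape)             # (3, 2)
```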

2.5 Choosing a classifier

A classifier is an algorithm that can predict the labels of unseen data. There are many different kinds of classifiers, suitable for different problems, and choosing the right one is crucial for the performance of the program. The choice mostly depends on the shape and size of the dataset. A classifier's classification error is defined as the sum of the bias and the variance. The bias of a classifier is low if the model represents the true data distribution well; it does not depend on training set size. A high bias can cause the classifier to underfit, which means that the model misses important relations between features. The variance of a classifier is low if there are few fluctuations between training sets; it depends on the training set size. A high variance can cause the classifier to overfit, which means that the classifier models random noise in the dataset instead of the actual relations. With a small dataset, the variance is often high. In this case, a classifier that can deal with high variance, like naïve Bayes, is a good choice. This classifier counteracts the high variance by processing all features independently of each other. When the dataset is bigger, the shape of the data and the scope of the problem become important. The scope of our problem is text classification. Sebastiani [10] evaluated different classifiers for text classification and found the following:

1. Boosting-based classifier committees, support vector machines, example-based methods, and regression methods deliver top-notch performance.
2. Neural networks and on-line linear classifiers work very well, although slightly worse than the previously mentioned methods.

3. Batch linear classifiers and probabilistic Naïve Bayes classifiers perform the worst of the learning-based classifiers.
4. The data is insufficient to say anything about decision trees.

The dataset on which the program will be trained is relatively small. The shape and size of the dataset will be detailed further in Chapter 3. Based on the findings of Sebastiani [10] presented above, four classifiers have been selected for comparison:

1. Support vector machines
2. Logistic regression
3. Naïve Bayes
4. Random forest

Support vector machines and logistic regression were chosen as they performed the best in Sebastiani's research. Naïve Bayes was chosen as a baseline: even though it performed worst in Sebastiani's research, the dataset is small and Naïve Bayes works well with high variance. As the fourth classifier, the random forest was chosen in order to see how a decision-tree-based classifier performs on the dataset. The principles on which these classifiers are based, and how they work, are explained next.

2.5.1 Support vector machines

A support vector machine [11] is, from the outset, a binary classifier. Binary means that the classifier divides objects into two classes. In classification, this means that it determines whether an object belongs to the class or not. In order to do that, values are mapped into a space. A linear function is used to determine a boundary (a hyperplane) that divides the objects into two classes.

The boundary is based on the distance to the closest objects in either class. This distance needs to be maximized. The closest objects are called the support vectors.

2.5.2 Logistic regression

A logistic regression model is a binary classifier. It is similar to linear regression, where the task is to ascertain a value for each object in the dataset. In contrast to the outcome of linear regression, the outcome of logistic regression is not continuous but binary. A logistic function is used to determine the probability of a value belonging to the class or not.

2.5.3 Naïve Bayes

A naïve Bayes classifier is a binary classifier. It is based on Bayes' theorem:

P(A | B) = P(B | A) P(A) / P(B)

This theorem shows a way to find the probability of A given evidence B. In classification, this means that we have an object A and category evidence B. A certain evaluation method, such as a threshold, is chosen to determine whether the object belongs to the class or not. For example, if the probability is higher than 0.6, the object belongs to the category. This classifier considers all features to be independent.

2.5.4 Random forest

A random forest classifier averages multiple decision trees built on random samples from the dataset. A decision tree breaks the dataset down into smaller subsets while simultaneously building a tree with decision nodes and leaf nodes. A decision node has two or more branches leading to further choices or leaves. A leaf node represents a category. Because the trees in a random forest are based on random data, they can be noisy and lack meaning. The random forest averages these trees to create a model with low variance: the irrelevant trees cancel each other out, and the remaining meaningful trees yield the result.

Figure 2.3: Visual representation of Overfitting
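The variance-reduction idea behind the random forest can be demonstrated with a toy simulation: each "tree" below is a noisy rule that errs 30% of the time, yet the majority vote over many such trees is almost always correct. The numbers are invented for illustration and are not taken from any experiment in this report.

```python
# Variance reduction by averaging, the idea behind random forests: many
# noisy predictors, each mimicking a tree grown on a random sample, are
# combined by majority vote into one low-variance model.
import random

random.seed(0)

def noisy_stump(x):
    # true rule: class 1 iff x > 0.5, but each "tree" flips the answer
    # 30% of the time, mimicking trees grown on random, noisy samples
    truth = 1 if x > 0.5 else 0
    return truth if random.random() > 0.3 else 1 - truth

def forest(x, n_trees=201):
    # majority vote over many noisy stumps
    votes = sum(noisy_stump(x) for _ in range(n_trees))
    return 1 if votes > n_trees / 2 else 0

# a single stump errs often; the majority vote essentially never does
print(forest(0.9), forest(0.1))
```

Each individual stump is wrong 30% of the time, but for 201 votes the probability of a wrong majority is vanishingly small: the irrelevant trees cancel each other out, exactly as described above.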

Chapter 3 WordNet

3.1 WordNet Introduction

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.

3.2 Database Contents

As of November 2012, WordNet's latest online version is 3.1. The database contains 155,327 words organized in 175,979 synsets for a total of 207,016 word-sense pairs; in compressed form, it is about 12 megabytes in size. WordNet includes the lexical categories nouns, verbs, adjectives and adverbs but ignores prepositions, determiners and other function words.
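A minimal in-memory sketch of the objects just described (synsets, glosses, usage examples, and word-sense pairs), built from one hand-made synset rather than the real WordNet database:

```python
# A toy model of WordNet's core objects. The synset below is hand-made
# for illustration; a real system would read WordNet's database files.
from dataclasses import dataclass, field

@dataclass
class Synset:
    words: list                                 # the synonym set
    gloss: str                                  # short defining gloss
    examples: list = field(default_factory=list)  # usage examples

good = Synset(
    words=["good", "right", "ripe"],
    gloss="most suitable or right for a particular purpose",
    examples=["a good time to plant tomatoes", "the time is ripe"],
)

def word_sense_pairs(synsets):
    # each (word, synset) membership counts as one word-sense pair
    return [(w, s.gloss) for s in synsets for w in s.words]

print(len(word_sense_pairs([good])))   # 3 words in one synset → 3 pairs
```

This is why WordNet's word-sense pair count (207,016) exceeds both its word count and its synset count: a polysemous word contributes one pair per synset it belongs to.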

Words from the same lexical category that are roughly synonymous are grouped into synsets. Synsets include simplex words as well as collocations like "eat out" and "car pool." The different senses of a polysemous word form are assigned to different synsets. The meaning of a synset is further clarified with a short defining gloss and one or more usage examples. An example adjective synset is:

good, right, ripe – (most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes")

All synsets are connected to other synsets by means of semantic relations. These relations, which are not all shared by all lexical categories, include:

Nouns
• hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)
• hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)
• coordinate terms: Y is a coordinate term of X if X and Y share a hypernym (wolf is a coordinate term of dog, and dog is a coordinate term of wolf)
• meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
• holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window)

Verbs
• hypernym: the verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (to perceive is a hypernym of to listen)

• troponym: the verb Y is a troponym of the verb X if the activity Y is doing X in some manner (to lisp is a troponym of to talk)
• entailment: the verb Y is entailed by X if by doing X you must be doing Y (to sleep is entailed by to snore)
• coordinate terms: those verbs sharing a common hypernym (to lisp and to yell)

These semantic relations hold among all members of the linked synsets. Individual synset members (words) can also be connected with lexical relations. For example, (one sense of) the noun "director" is linked to (one sense of) the verb "direct", from which it is derived, via a "morphosemantic" link.

The morphology functions of the software distributed with the database try to deduce the lemma or stem form of a word from the user's input. Irregular forms are stored in a list; looking up "ate" will return "eat," for example.

Figure 3.1: Example Entry
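The lemma lookup just described can be sketched as follows. The exception list and suffix rules here are tiny hand-made stand-ins for WordNet's real exception files and detachment rules:

```python
# Sketch of lemma lookup in the style of WordNet's morphology functions:
# irregular forms come from a stored exception list, regular forms from
# suffix-detachment rules. Both tables below are toy stand-ins.
IRREGULAR = {"ate": "eat", "geese": "goose", "better": "good"}
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]

def lemma(word):
    if word in IRREGULAR:                 # exception list is checked first
        return IRREGULAR[word]
    for suffix, repl in SUFFIX_RULES:     # then try detachment rules
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repl
    return word                           # already a base form

print(lemma("ate"))    # eat  (irregular form, from the exception list)
print(lemma("dogs"))   # dog  (regular form, by stripping -s)
```

A production system would of course also validate the candidate lemma against the database, since naive suffix stripping can produce non-words.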

3.3 Lexical Ontology

WordNet is sometimes called an ontology, a persistent claim that its creators do not make. The hypernym/hyponym relationships among the noun synsets can be interpreted as specialization relations among conceptual categories. In other words, WordNet can be interpreted and used as a lexical ontology in the computer science sense. However, such an ontology should be corrected before being used, because it contains hundreds of basic semantic inconsistencies; for example, there are (i) common specializations for exclusive categories and (ii) redundancies in the specialization hierarchy. Furthermore, transforming WordNet into a lexical ontology usable for knowledge representation should normally also involve (i) distinguishing the specialization relations into subtypeOf and instanceOf relations, and (ii) associating intuitive unique identifiers with each category. Although such corrections and transformations have been performed and documented as part of the integration of WordNet 1.7 into the cooperatively updatable knowledge base of WebKB-2, most projects claiming to re-use WordNet for knowledge-based applications (typically, knowledge-oriented information retrieval) simply re-use it directly.

Figure 3.2: WordNet for Container

3.4 Limitations

WordNet does not include information about the etymology or the pronunciation of words, and it contains only limited information about usage. WordNet aims to cover most of everyday English and does not include much domain-specific terminology.

WordNet is the most commonly used computational lexicon of English for word sense disambiguation (WSD), a task aimed at assigning the context-appropriate meanings (i.e. synset members) to words in a text [9]. However, it has been argued that WordNet encodes sense distinctions that are too fine-grained. This issue prevents WSD systems from achieving a level of performance comparable to that of humans, who do not always agree when confronted with the task of selecting a sense from a dictionary that matches a word in a context. The granularity issue has been tackled by proposing clustering methods that automatically group together similar senses of the same word.
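The WSD task mentioned above can be illustrated with the simplest dictionary-based approach, a simplified Lesk algorithm: choose the sense whose gloss shares the most words with the context. The two-sense inventory below is hand-made, not taken from WordNet:

```python
# Simplified Lesk word sense disambiguation: score each candidate sense
# by the word overlap between its gloss and the context sentence, and
# pick the highest-scoring sense. Toy sense inventory for illustration.
def lesk(context, senses):
    ctx = set(context.lower().split())
    def overlap(gloss):
        return len(ctx & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))

BANK = {
    "bank#1": "a financial institution that accepts deposits and money",
    "bank#2": "sloping land beside a body of water such as a river",
}
print(lesk("he deposited money at the bank", BANK))   # → bank#1
```

Even this crude overlap count captures the core difficulty discussed above: when glosses are very fine-grained, the overlaps between competing senses shrink and the choice becomes unreliable.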

Chapter 4 Text Categorization Using Support Vector Machines

4.1 Support Vector Machines

A support vector machine (SVM) is a machine learning method that divides space into a positive-example side and a negative-example side. It constructs the hyperplanes for which the margin between the positive and negative examples becomes the maximum. These hyperplanes serve as the optimum solution based on the concept of structural risk minimization.

The conceptual structure of SVM is shown in Fig. 4.1. SVM calculates the hyperplanes that separate a positive example from a negative example in hyperspace. We call the distance between the positive-side hyperplane nearest the negative examples and the negative-side hyperplane nearest the positive examples the margin. SVM calculates the optimal hyperplanes that supply the maximum margin, where w · x + b = 0 is the final border hyperplane for classification. The training examples on w · x + b = 1 and w · x + b = −1 are called "support vectors." However, when positive and negative examples cannot be separated completely, the separating hyperplanes are determined by also taking errors into consideration.

Figure 4.1: Support Vector Machines.

The problem of maximizing the margin on the training data can be converted to a quadratic programming problem that minimizes the objective function (formula 4.1) under the restricted condition of formula (4.2) by introducing Lagrange multipliers αi. Here, yi is a variable showing the label of the example xi, which is positive or negative: when xi is a positive example, yi = +1, and when xi is a negative example, yi = −1. We can determine w and b as follows, using the solved αi, positive support vectors xa, and negative support vectors xb obtained by quadratic programming:

Moreover, SVM can also handle nonlinear hypotheses by replacing the inner product xi · xj in formula (4.1) with a nonlinear function K(xi, xj), which is called a kernel function.

When learners learn the parameters of a classification model, the phenomenon called "overfitting" arises: the more complicated the model is, the better it fits the training data, yet the larger the classification error for new data becomes. The complexity of a model in SVM is mathematically defined by a measure called the VC dimension (Vapnik-Chervonenkis dimension), and an optimum solution can be calculated based on the concept of structural risk minimization [Vapnik95], which minimizes the sum of the classification error and the complexity of the model on the training data without overfitting.

Moreover, in machine learning methods before SVM, when the dimension of the data increases, the amount of calculation and storage capacity required increases abruptly, and the so-called "curse of dimensionality" occurs. For this reason, high-dimensional spaces could not be handled. SVM uses only the support vectors for calculation. Thus, SVM excels over conventional machine learning methods in that it can also handle very-high-dimensional input. It has been reported that SVM surpasses Rocchio, decision trees, naïve Bayes and Bayes nets in accuracy [Dumais98].
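The geometry described above can be checked with a few lines of arithmetic: for support vectors lying on w · x + b = ±1, the margin (the distance between the two hyperplanes) is 2/||w||. The hyperplane below is chosen by hand for illustration, not trained:

```python
# Margin geometry of an SVM: for a hyperplane w·x + b = 0 whose support
# vectors lie on w·x + b = ±1, the margin equals 2 / ||w||.
# The weight vector and points are hand-picked, not learned.
import math

def margin(w):
    return 2 / math.sqrt(sum(wi * wi for wi in w))

def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, 1.0], -1.0
print(margin(w))               # 2 / sqrt(2) ≈ 1.414
print(decision(w, b, [1, 1]))  # +1 → lies on the positive support hyperplane
print(decision(w, b, [0, 0]))  # -1 → lies on the negative support hyperplane
```

This also makes the optimization objective concrete: maximizing the margin 2/||w|| is the same as minimizing ||w||, which is the quadratic programming problem mentioned above.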

In addition, SVM is a large-margin classifier like boosting. However, the margin in boosting is expressed by the sum of the distance costs of all samples to the classification border, whereas the margin in SVM is the distance between two hyperplanes along the support vectors.

SVMs are based on structural risk minimization. The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest generalization error. The following upper bound (4.5) connects errorg(h), the generalization error of a hypothesis h, with the error of h on the training set, errort(h), and the complexity of h [Vapnik95]:

errorg(h) ≤ errort(h) + sqrt( ( d (ln(2n/d) + 1) + ln(4/η) ) / n )    (4.5)

This bound holds with a probability of at least 1 − η. In the second term on the right-hand side, n denotes the number of training examples and d is the VC dimension, which is a property of the hypothesis space and indicates its complexity. Equation (4.5) reflects the well-known trade-off between the training error and the complexity of the hypothesis space. A simple hypothesis space (small d) would probably not contain good approximating functions and would lead to a high training (and true) error. On the other hand, a too-rich hypothesis space (high d) would lead to a small training error, but the second term on the right-hand side of (4.5) would be large (overfitting).

The right complexity is crucial for achieving good generalization. In the following, we assume that linear threshold functions represent the hypothesis space, in which w and b are parameters of a hyperplane and x is an input vector:

h(x) = sign(w · x + b) = { +1 if w · x + b > 0;  −1 otherwise }    (4.6)

Note that the VC dimension of these hyperplanes does not always depend on the number of input features. Instead, the VC dimension depends on the Euclidean length ||w|| of the weight vector w. Equation (4.6) supports the possibility that SVM text categorization achieves good generalization even if a huge number of words are given as the input space. Further experimental evaluations are required, because equations (4.3) and (4.4) give us only a loose bound.

Basically, SVM finds the hyperplane that separates the training data with the shortest weight vector (i.e., minimal ||w||). This hyperplane maximizes the margin between the positive and negative samples. Since the optimization problem is difficult to handle numerically, Lagrange multipliers are introduced to translate the problem into an equivalent quadratic optimization problem. For this kind of optimization problem, efficient algorithms exist that are guaranteed to find the global optimum. The result of the optimization process is a set of coefficients αi for which (4.7) is minimal. These coefficients can be used to construct the hyperplane satisfying the maximum-margin requirement.

SVMs can handle nonlinear hypotheses by simply substituting every occurrence of the inner product in the equation with a kernel function K(x1, x2). More precisely, any function which satisfies Mercer's condition [Vapnik95] can be a kernel function. Among the many types of kernel functions available, we focus on the d-th-degree polynomial functions:

Kpoly(x1, x2) = (x1 · x2 + 1)^d

4.2 Why Should SVMs Work Well for Text Categorization?

SVMs are very universal learners. In their basic form, SVMs learn linear threshold functions. Nevertheless, by a simple "plug-in" of an appropriate kernel function, they can be used to learn polynomial classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets. One remarkable property of SVMs is that their ability to learn can be independent of the dimensionality of the feature space. SVMs measure the complexity of hypotheses based on the margin with which they separate the data, not the number of features. This means that we can generalize even in the presence of very many features, provided our data is separable with a wide margin using functions from the hypothesis space.

4.2.1 High-dimensional input space

When learning text classifiers, one has to deal with very many (more than 10,000) features. Since SVMs use overfitting protection which does not necessarily depend on the number of features, they have the potential to handle these large feature spaces.
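The polynomial kernel quoted above can be verified numerically: for d = 2 on 2-D inputs, it equals an ordinary inner product in an explicit 6-dimensional feature space, computed without ever constructing that space. The vectors below are toy values for illustration:

```python
# The kernel trick for Kpoly(x1, x2) = (x1·x2 + 1)^d: the kernel value
# equals the inner product of the explicitly expanded feature vectors.
import math

def poly_kernel(x1, x2, d=2):
    dot = sum(a * b for a, b in zip(x1, x2))
    return (dot + 1) ** d

def phi(x):
    # explicit feature map for d = 2 on 2-D input:
    # phi(x) = (1, √2·x1, √2·x2, x1², √2·x1·x2, x2²)
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]

a, b = [1.0, 2.0], [3.0, 0.5]
lhs = poly_kernel(a, b)                            # via the kernel trick
rhs = sum(p * q for p, q in zip(phi(a), phi(b)))   # via the explicit map
print(lhs, rhs)   # both equal (1·3 + 2·0.5 + 1)^2 = 25
```

For text, where input vectors already have tens of thousands of dimensions, the explicit expansion would be astronomically large; the kernel computes the same inner product at the cost of one dot product.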

4.2.2 Few irrelevant features

One way to avoid these high-dimensional input spaces is to assume that most of the features are irrelevant. Feature selection tries to determine these irrelevant features. Unfortunately, in text categorization there are only very few irrelevant features. Figure 4.2 shows the results of an experiment on the Reuters "acq" category. All features are ranked according to their (binary) information gain. Then a naïve Bayes classifier [2] is trained using only those features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000, and 4001-9962. The results in Figure 4.2 show that even the features ranked lowest still contain considerable information and are somewhat relevant. A classifier using only those "worst" features performs much better than random. Since it seems unlikely that all those features are completely redundant, this leads to the conjecture that a good classifier should combine many features (learn a "dense" concept) and that aggressive feature selection may result in a loss of information.

Figure 4.2: Learning without best features.

4.2.3 Document vectors are sparse

For each document, the corresponding document vector contains only few entries which are not zero. Kivinen et al. [4] give both theoretical and empirical evidence, in the mistake-bound model, that "additive" algorithms, which have an inductive bias similar to that of SVMs, are well suited for problems with dense concepts and sparse instances.

Most text categorization problems are linearly separable: all Ohsumed categories are linearly separable, and so are many of the Reuters tasks. The idea of SVMs is to find such linear (or polynomial, RBF, etc.) separators. These arguments give theoretical evidence that SVMs should perform well for text categorization.

Figure 4.3: Classes and Support Vectors

4.3 Summary

This project report investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually increased by using mutual information (MI) filtering and part-of-speech (POS) filtering, which determine the portion of words that are appropriate for learning from information-theoretic and linguistic perspectives, respectively.

The SVM results common to both filterings are that 1) the optimal number of features differed completely across categories, and 2) the average performance over all categories was best when all of the words were used. In addition, a comparison of the two filtering methods clarified that POS filtering on SVMs consistently outperformed MI filtering, which indicates that SVMs cannot find irrelevant parts of speech on their own. These results suggest a simple strategy for SVM text categorization: use all of the words found through a rough filtering technique such as part-of-speech tagging.

This chapter has described various aspects of feature selection in SVM text categorization. It suggests a simple but highly practical strategy for organizing the large number of words found through a rough filtering technique such as part-of-speech tagging.
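The sparsity argument of Section 4.2.3 is easy to measure directly: each bag-of-words vector has nonzero entries only for the words its document actually contains. A quick check on three invented documents:

```python
# Measuring document-vector sparsity: build bag-of-words vectors over
# the combined vocabulary and compute the fraction of nonzero entries.
# The three documents are invented; real corpora are far sparser still.
from collections import Counter

docs = [
    "the cat sat on the mat",
    "stock prices rose as the market rallied",
    "the team won the match in extra time",
]
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

vectors = [bow(d) for d in docs]
nonzero = sum(1 for v in vectors for x in v if x)
density = nonzero / (len(docs) * len(vocab))
print(round(density, 3))   # well below 1: most entries are zero
```

With a realistic vocabulary of tens of thousands of words and documents of a few hundred distinct words, the density drops to a fraction of a percent, which is exactly the "sparse instances, dense concept" regime discussed above.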

Chapter 5 Text Categorization: Implementation Details and Results

Work has been done in the field of text document classification and clustering using WordNet. The earliest such work was done in 1999, when Scott and Matwin used WordNet for text classification with the Ripper algorithm. In 2000, Jensen and Martinez used both conceptual and contextual features to improve text classification. In another work, in 2008, Wang and Domeniconi built semantic kernels using Wikipedia. In 2015, Wei et al. suggested a semantic approach for text clustering using WordNet; they used the ontology structure and relations for word sense disambiguation, and the concept of lexical chains to find related words. In yet another work, Li et al. (2012) used a thesaurus and WordNet for text categorization, with K-nearest-neighbour and neural network classifiers. But to the best of our knowledge, no work has been done in multi-label text categorization using WordNet.

There are different types of relationships between words defined in WordNet: synonyms, antonyms, hypernyms, hyponyms, meronyms, etc. In this work, we have used the hypernym relation. A hypernym is "a kind of" relation among words; for example, 'organism' is a hypernym of 'plant', as every plant is an organism. The reasons for choosing the hypernym relation are as follows:

1. We want to obtain a more general and conceptual meaning of the words.
2. The authors Dave et al. (2003) and Sedding (2004) found, after experimentation, that hypernyms give better results than synonyms.

5.1 Implementation

Lexical Analysis Module

The input given to the Lexical Analysis module is a set of journal articles. This module scans the abstract of each journal article and identifies the tokens in it. There are two stages in this module, namely Stop Words Removal and the Lexical Analyzer.

1. Stop Words Removal: Firstly, stop words are removed from the abstracts of the journal articles by using a predefined stop word list¹.
2. Lexical Analyzer: It scans the abstract of the journal article word by word and matches the words against the ACM Computing Classification System². It identifies the tokens and stores them, along with their frequency of occurrence, in the tokens table. This module also extracts the tokens from the title as well as from the given keyword list in the article. If the same token is found in the abstract as well as in the title or in the keyword list, then its frequency is increased by 2.

The output of this module is a list of tokens, along with their frequencies, for each journal article. A journal article can be represented as a set of tokens with their frequencies, and the entire collection can be represented formally as in definitions 1 and 2 given below:

Definition 1 (Journal Article). A journal article, denoted Ji = {(t1, fi1), (t2, fi2), …, (tm, fim)}, is represented as a set of tokens tj together with their corresponding frequencies fij.

Definition 2 (Journal Articles Set). A journal articles set, denoted J = {J1, J2, …, Ji, …, Jn}, is a set of journal articles, where n is the total number of articles in J.

Figure 5.1: Classification with ACM Classifier.

¹ It contains a list of 571 stop words that was developed by the SMART project.
² It is the standard classification system used by the ACM Digital Library for the computing field.
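The Lexical Analysis module described above can be sketched as follows. The stop-word list below is a tiny stand-in for the 571-word SMART list, and the example article is invented:

```python
# Sketch of the Lexical Analysis module: strip stop words, count token
# frequencies in the abstract, and boost by 2 any token that also
# appears in the title or the keyword list, as specified in Section 5.1.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "is", "for", "and", "to", "this"}

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def lexical_analysis(title, keywords, abstract):
    freq = Counter(tokens(abstract))
    boost = set(tokens(title)) | {k.lower() for k in keywords}
    for t in list(freq):
        if t in boost:
            freq[t] += 2          # token also found in title/keyword list
    return freq

f = lexical_analysis(
    title="Text Categorization with WordNet",
    keywords=["classification"],
    abstract="this paper studies text categorization and classification of text",
)
print(f["text"], f["categorization"], f["classification"])   # 4 3 3
```

The returned counter plays the role of one row of the tokens table, i.e. the {(tj, fij)} set of Definition 1.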

Semantic Analysis Module

The next module is Semantic Analysis. Its purpose is to reduce the dimensionality using the concept of semantics: the aim is to find the minimum number of tokens required to categorize a journal article. We use WordNet in this module. The hypernym relationship of WordNet is used to generate token trees. This module has two components, as given below:

1. Token Tree: A token tree is a hierarchical structure constructed for a token, using its hypernyms up to four levels.
2. Token Forest: A token forest is a set of token trees for all the tokens in a journal article.

Classification Module

In this module, we classify the reduced set of tokens into their respective categories/classes. The standard ACM Computing Classification System is used for this purpose.

Tokens Table / Dictionary

The tokens table, or dictionary, is a separate component in our algorithm. It is used by the first two modules, the Lexical Analyzer and Semantic Analysis, to store the tokens.

Algorithm 1. Main Program
Input: Title – T, Set of Keywords – K = {k1, k2, k3, …, kn}, Abstract – Ab, The 2012 ACM Computing Classification System – A, WordNet – W, Set of journal articles – J.
Output: { | Ti is the title belonging to some class Cj, where i > 0, j >= 1}
for (each Title Ti)
Begin
1.1 LAM = Lexical Analyzer (T, K, Ab, A, J)
1.2 = Semantic Analysis (LAM)
1.3 = Classification (Ti, µ)
End
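The token tree and token forest of the Semantic Analysis module can be sketched with a hand-made hypernym map standing in for WordNet (a real implementation would query WordNet's hypernym relation):

```python
# Token tree construction: from each token, climb the "is-a" (hypernym)
# chain up to four levels; a token forest is the set of such chains for
# one article. The hypernym map below is a toy stand-in for WordNet.
HYPERNYM = {
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "mammal": "animal", "animal": "organism",
    "tulip": "flower", "flower": "plant", "plant": "organism",
}

def token_tree(token, levels=4):
    chain = [token]
    for _ in range(levels):
        parent = HYPERNYM.get(chain[-1])
        if parent is None:        # reached a root before four levels
            break
        chain.append(parent)
    return chain

def token_forest(article_tokens):
    return {t: token_tree(t) for t in article_tokens}

print(token_tree("dog"))   # ['dog', 'canine', 'carvivore', ...][:5] levels
```

Because distinct tokens often meet at a shared ancestor ('organism' above), classifying on these generalized chains reduces the dimensionality, which is exactly the module's stated purpose.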

5.2 Results

Naïve Bayes Confusion Matrix:

Figure 5.2: Naïve Bayes – Confusion Matrix.

SVM Confusion Matrix:

Figure 5.3: SVM – Confusion Matrix.

5.2.1 Classification Metric – Accuracy

No. of Documents | Naïve Bayes | SVM
5000             | 0.81        | 0.91
10000            | 0.84        | 0.93
20000            | 0.85        | 0.95

Table 5.1: Accuracy of Naïve Bayes and SVM.

5.3 Classification Reports

The Naïve Bayes classification report for the dataset is as follows:

The tabular data above presents the precision, recall, F1 and support scores for our dataset. These scores were obtained when Naïve Bayes was used as the classifier for the entire dataset.

The SVM classification report for the dataset is as follows:

The tabular data above presents the precision, recall, F1 and support scores for our dataset. These scores were obtained when SVM was used as the classifier for the entire dataset.
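The precision, recall, F1 and support figures in these reports can be computed from first principles. The sketch below does so for a single class on invented predictions, not on the report's dataset:

```python
# Per-class precision, recall, F1 and support, computed directly from
# true and predicted labels. The label lists are toy data.
def prf(true, pred, label):
    tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(true, pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(true, pred) if t == label and p != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    support = sum(1 for t in true if t == label)   # true members of the class
    return precision, recall, f1, support

true = ["sci", "sci", "rec", "sci", "rec", "rec"]
pred = ["sci", "rec", "rec", "sci", "rec", "sci"]
print(prf(true, pred, "sci"))   # (0.666…, 0.666…, 0.666…, 3)
```

A full classification report simply repeats this per class and then averages, weighting by support.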

Chapter 6 Conclusions

6.1 Summary

This report introduces support vector machines for text categorization. It provides both theoretical and empirical evidence that SVMs are very well suited for text categorization. The theoretical analysis concludes that SVMs acknowledge the particular properties of text: (a) high-dimensional feature spaces, (b) few irrelevant features (a dense concept vector), and (c) sparse instance vectors. The experimental results show that SVMs consistently achieve good performance on text categorization tasks, outperforming existing methods substantially and significantly. With their ability to generalize well in high-dimensional feature spaces, SVMs eliminate the need for feature selection, making the application of text categorization considerably easier. Another advantage of SVMs over conventional methods is their robustness: SVMs show good performance in all experiments, avoiding the catastrophic failure observed with conventional methods on some tasks. Furthermore, SVMs do not require any parameter tuning, since they can find good parameter settings automatically. All this makes SVMs a very promising and easy-to-use method for learning text classifiers from examples.

In Chapter 4, text classification using a support vector machine was described. Many word features are generally required to perform highly precise text classification by machine learning. However, when many word features are used as the input of conventional learning techniques, they overfit the training data, and the classification accuracy for unknown data decreases.

Therefore, it was necessary to reduce the dimension of the features used in learning by selecting a moderate number of features which contain much information. However, it was difficult for the classifier to learn with sufficiently high accuracy with only several hundred features. Consequently, text classification was performed using a support vector machine, a machine learning technique developed to avoid overfitting.

We evaluated how feature selection affects the performance of SVM. Classification accuracy is very high without feature selection, and the classifier created by SVM with part-of-speech filtering has the highest accuracy. What we can conclude from those results is the following: if we have a sufficient amount of labeled samples and the task is modeled as a binary-class problem, it is better to use support vector machines with part-of-speech filtering. If we have only a small amount of labeled samples, transduction for large-margin classifiers is effective. Furthermore, when we have very few labeled samples, active labeling for ETM models is useful for improving the classification performance.

6.2 Future Directions

Although all of the text classification methods introduced in this report are flat or two-level layered classifications, it is natural that the structure of categories forms a network, such as the link structure of the Internet. Furthermore, although we assumed only a static structure of categories in this report, there are many cases where the structure of categories changes and the texts in the categories are updated. We intend to aim our research toward machine learning for the network structure of texts and online learning for dynamic structures.

References

Attias00 Attias, H.: A Variational Bayesian Framework for Graphical Models, Proc. of the 12th Advances in Neural Information Processing Systems (NIPS-99), pp. 209–215 (2000).
CMUdata 20 Newsgroups Data Set, http://www.cs.cmu.edu/textlearning.
Cohn96 Cohn, D., Ghahramani, Z. and Jordan, M.: Active Learning with Statistical Models, Journal of Artificial Intelligence Research, Vol. 4, pp. 129–145 (1996).
Cortes95 Cortes, C. and Vapnik, V.: Support Vector Networks, Machine Learning, Vol. 20, pp. 273–297 (1995).
Cover91 Cover, T. and Thomas, J.: Elements of Information Theory, John Wiley & Sons (1991).
Dempster77 Dempster, A. P., Laird, N. M. and Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, Vol. 39, pp. 1–38 (1977).
Dumais98 Dumais, S., Platt, J., Heckerman, D. and Sahami, M.: Inductive Learning Algorithms and Representations for Text Categorization, Proc. of the Seventh International Conference on Information and Knowledge Management (CIKM-98), pp. 148–155 (1998).
Joachims98 Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. of the 10th European Conference on Machine Learning (ECML-98), pp. 137–142 (1998).
Joachims99 Joachims, T.: Transductive Inference for Text Classification using Support Vector Machines, Proc. of the 16th International Conference on Machine Learning (ICML-99), pp. 200–209 (1999).
A Lexical-Semantics Based Method for Multi-label Text Categorization Using WordNet (2005).

Matsumoto97 Matsumoto, Y., Kitauchi, A., Yamashita, T., Hirano, Y., Imaichi, O. and Imamura, T.: Japanese Morphological Analysis System ChaSen Manual, NAIST Technical Report NAIST-IS-TR97007 (1997).
McCallum98 McCallum, A. K. and Nigam, K.: Employing EM and Pool-based Active Learning for Text Classification, Proc. of the 15th International Conference on Machine Learning (ICML-98), pp. 350–358 (1998).
McLachlan87 McLachlan, G. J. and Basford, K. E.: Mixture Models, Marcel Dekker (1988).

Abbreviations

IEICE – Institute of Electronics, Information and Communication Engineers
IPSJ – Information Processing Society of Japan
SIG-NLP – Special Interest Group on Natural Language Processing
JSAI – Japanese Society for Artificial Intelligence
NLTK – Natural Language Toolkit
SVM – Support Vector Machine

Appendix

Dataset: Context

This dataset is a collection of newsgroup documents. It contains samples from 20 categories, with each category containing 1000 samples. One fourth of the dataset was partitioned off as the testing dataset.

Content

There is a file (list.csv) that contains a reference to each document_id number and the newsgroup it is associated with. There are also 20 files that contain all of the documents, one file per newsgroup. In this dataset, duplicate messages have been removed, and the original messages only contain "From" and "Subject" headers (20K messages total). Each message in the bundled files begins with these four headers:

Newsgroup: alt.newsgroup
Document_id: xxxxxx
From: Cat
Subject: Meow Meow Meow

The Newsgroup and Document_id can be referenced against list.csv.

Organization
– Each newsgroup file in the bundle represents a single newsgroup.
– Each message in a file is the text of some newsgroup document that was posted to that newsgroup.

This is a list of the 20 newsgroups:

• comp.graphics
• comp.os.ms-windows.misc
• comp.sys.ibm.pc.hardware
• comp.sys.mac.hardware
• comp.windows.x
• rec.autos
• rec.motorcycles
• rec.sport.baseball
• rec.sport.hockey
• sci.crypt
• sci.electronics

• sci.med
• sci.space
• misc.forsale
• talk.politics.misc
• talk.politics.guns
• talk.politics.mideast
• talk.religion.misc
• alt.atheism
• soc.religion.christian

Acknowledgements

Ken Lang is credited by the source for collecting this data. The source of the data files is: http://qwone.com/~jason/20Newsgroups/

Inspiration

• This dataset can be used to train and evaluate text document classifiers.