SAVITRIBAI PHULE PUNE UNIVERSITY

A PROJECT REPORT ON

“Pattern Extraction & Analysis
Using Lingo Clustering Algorithm”
SUBMITTED TOWARDS THE
FULFILLMENT OF THE REQUIREMENTS OF
BACHELOR OF ENGINEERING (Computer Engineering)
BY
Rohit Kumar Singh B120784260
Sayyad Mohasin Dastagir B120784264
Dube Snehal Anil B120784216
Pingle Kaustubh Kailas B120784255
Under The Guidance of
Prof P. K. Deshmukh

Department of Computer Engineering
GOVT COLLEGE OF ENGINEERING AND RESEARCH AWASARI (KD.), DIST-PUNE-412405
Academic Year: 2017-2018

Govt. College of Engineering and Research, Awasari (Kd), Pune-412405
Department Of Computer Engineering

CERTIFICATE

This is to certify that the Project entitled-

“Pattern Extraction & Analysis
Using Lingo Clustering Algorithm”

Submitted by

Rohit Kumar Singh B120784260
Sayyad Mohasin Dastagir B120784264
Pingle Kaustubh Kailas B120784255
Dube Snehal Anil B120784216

is a bona fide work carried out by the students under the supervision of Prof. P. K. Deshmukh, and it is submitted towards the partial fulfillment of the requirements of the Bachelor of Engineering (Computer Engineering) project.

Prof. P. K. Deshmukh
Internal Guide
Dept. of Computer Engg.

Dr. S. U. Ghumbre
H.O.D.
Dept. of Computer Engg.

Abstract

Data mining refers to the extraction of information from large datasets; it is also called information mining. It is applied in numerous fields such as medicine, environment, education, and crime. In the base paper for this work, crash investigation and analysis of flights are carried out. Flight crashes may be caused by pilot error, mechanical failure, bad weather, sabotage, or other human error. This research investigates international flight crashes from 1908 to 2009 using the LINGO clustering data mining technique. The search results clustering problem is defined as the automatic, online grouping of similar documents in a search results list returned from a search engine.

In this report, we present Lingo—a novel algorithm for clustering search results, which emphasizes cluster description quality. We describe methods used in the algorithm: algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. The research study is performed for all attributes present in datasets to recognize hidden patterns as well as determine similarity among the airplane crashes.

The crash investigation is a major research area and the major techniques employed in this investigation are: statistics, grid computing, cloud computing, digital image processing and data mining. With data mining, we can parse through a vast amount of information and find out unknown patterns of aircraft mishap. The major aim of this research is to utilize data mining techniques to find out unknown patterns in the international flight crash dataset. The work is carried out using LINGO clustering data mining technique and cosine similarity measure.

Acknowledgments

It gives us great pleasure to present the preliminary project report on “Pattern Extraction & Analysis using Lingo Clustering Algorithm”. We would like to take this opportunity to thank our internal guide, Prof. P. K. Deshmukh, for giving us all the help and guidance we needed. We are really grateful for their kind support; their valuable suggestions were very helpful.

We are also grateful to Dr. S. U. Ghumbre, Head of the Computer Engineering Department, Government College of Engineering, Awasari Khurd, for his indispensable support and suggestions.

Finally, our special thanks go to Miss Gangurde for providing resources for our project, such as a laboratory with all the needed software platforms and a continuous Internet connection.

Rohit Kumar Singh
Sayyad Mohasin Dastagir
Pingle Kaustubh Kailas
Dube Snehal Anil

(BE Computer Engg.)

Contents

1 Synopsis
1.1 Project Title
1.2 Project Option
1.3 Internal Guide
1.4 Sponsorship and External Guide
1.5 Technical Keywords (As per ACM Keywords)
1.6 Problem Statement
1.7 Abstract
1.8 Goals and Objectives
1.9 Relevant mathematics associated with the Project
1.10 Names of Conferences / Journals where papers can be published
1.11 Review of Conference/Journal Papers supporting Project idea
1.12 Plan of Project Execution
2 Technical Keywords
2.1 Area of Project
2.2 Technical Keywords
3 Introduction
3.1 Project Idea
3.2 Motivation of the Project
3.3 Literature Survey
4 Problem Definition and scope
4.1 Problem Statement
4.1.1 Goals and objectives
4.1.2 Statement of scope
4.2 Major Constraints
4.3 Methodologies of Problem solving and efficiency issues
4.4 Outcome
4.5 Applications
4.6 Hardware Resources Required
4.7 Software Resources Required
5 Project Plan
5.1 Project Estimates
5.1.1 Reconciled Estimates
5.1.2 Project Resources
5.2 Risk Management w.r.t. NP Hard analysis
5.2.1 Risk Identification
5.2.2 Risk Analysis
5.2.3 Overview of Risk Mitigation, Monitoring, Management
5.3 Project Schedule
5.3.1 Project task set
5.3.2 Task network
5.3.3 Timeline Chart
5.4 Team Organization
5.4.1 Team structure
5.4.2 Management reporting and communication
6 Software requirement specification
6.1 Introduction
6.1.1 Purpose and Scope of Document
6.1.2 Overview of responsibilities of Developer
6.2 Usage Scenario
6.2.1 User profiles
6.2.2 Use-cases
6.2.3 Use Case View
6.3 Data Model and Description
6.3.1 Data Description
6.3.2 Data objects and Relationships
6.4 Functional Model and Description
6.4.1 Data Flow Diagram
6.4.2 Activity Diagram
6.4.3 Non Functional Requirements
6.4.4 State Diagram
6.4.5 Design Constraints
6.4.6 Software Interface Description
7 Detailed Design Document using Annexure A and B
7.1 Introduction
7.2 Architectural Design
7.3 Data design (using Annexures A and B)

GCOEARA, Department of Computer Engineering 2017-18 IV

7.3.1 Internal software data structure
7.3.2 Global data structure
7.3.3 Temporary data structure
7.3.4 Database description
7.4 Component Design
7.4.1 Class Diagram
8 Project Implementation
8.1 Introduction
8.2 Tools and Technologies Used
8.3 Methodologies/Algorithm Details
8.3.1 Algorithm 1/Pseudo Code
8.3.2 Algorithm 2/Pseudo Code
8.4 Verification and Validation for Acceptance
9 Software Testing
9.1 Type of Testing Used
9.2 Test Cases and Test Results
10 Results
10.1 Screen shots
10.2 Outputs
11 Deployment and Maintenance
11.1 Installation and un-installation
11.2 User help
12 Conclusion and Future Scope
References
Annexure A Laboratory assignments on Project Analysis of Algorithmic Design
Annexure B Laboratory assignments on Project Quality and Reliability Testing of Project Design
Annexure C Project Planner
Annexure D Reviewers Comments of Paper Submitted
Annexure E Plagiarism Report

Annexure F Term-II Project Laboratory Assignments
Annexure G Information of Project Group Members

List of Figures

6.1 Use case diagram
6.2 State transition diagram
7.1 Architecture diagram
7.2 Class Diagram

List of Tables

4.1 Hardware Requirements
5.1 Risk Table
5.2 Risk Probability definitions
5.3 Risk Impact definitions
6.1 Use Cases
A.1 IDEA Matrix

CHAPTER 1

SYNOPSIS

1.1 PROJECT TITLE
“Pattern Extraction & Analysis using Lingo Clustering Algorithm”

1.2 PROJECT OPTION
Research Implementation Level Project

1.3 INTERNAL GUIDE
Prof. P. K. Deshmukh
Assistant Professor, Computer Department
Government College of Engineering, Awasari Khurd.

1.4 EXTERNAL GUIDE

1.5 TECHNICAL KEYWORDS (AS PER ACM KEYWORDS)
Clustering, Singular Value Decomposition, Data mining, Lingo Algorithm

1.6 PROBLEM STATEMENT
Flight crashes are investigated using the LINGO clustering data mining technique and cosine similarity, in order to identify the aboard and ground fatality rates by operator and location and to find similarities among the plane crashes.

1.7 ABSTRACT

In this report, we present Lingo, a novel algorithm for clustering search results that emphasizes cluster description quality. We describe the methods used in the algorithm: algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. The research study is performed to identify the aboard/ground fatality rate by operator and location, as well as to determine similarities among the airplane crashes.

The crash investigation is a major research area and the major techniques employed in this investigation are: statistics, grid computing, cloud computing, digital image processing and data mining. With data mining, we can parse through a vast amount of information and find out unknown patterns of aircraft mishap. The major aim of this research is to utilize data mining techniques to find out unknown patterns in the international flight crash dataset. The work is carried out using LINGO clustering data mining technique and cosine similarity measure.

1.8 GOALS AND OBJECTIVES
1.8.1 Aim
The purpose is to use data mining techniques to find unknown patterns in the international flight crash dataset, rather than relying on the traditional way of finding patterns. Because Lingo produces readable cluster labels, it is also a better alternative to conventional clustering. The system will automatically generate a report at the end of the day, which will be helpful in many ways.

1.8.2 Scope
We cluster the dataset based on the analysis of three factors: Location, Type, and Operator. These factors are considered with respect to weather, hardware issues, people responsible, technical issues, “crashed into”, “disappeared”, mid-air collision, fuel issues, etc.
In this project, each cluster is analyzed by two factors: first, the fatality rate by operator, and second, the fatality rate by location. On that basis, similarities among the plane crashes are found.
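
The two analysis factors above can be illustrated with a minimal Java sketch. It assumes that the fatality rate of a group is the total number of fatalities divided by the total number of people aboard; this definition, the class `FatalityRates`, the nested `Crash` type, and the method `rateBy` are hypothetical names introduced only for illustration and are not taken from the project code. Ground fatalities are left out of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: fatality rate grouped by operator or by location. */
final class FatalityRates {

    /** One crash row; only the columns needed for this analysis are modelled. */
    static final class Crash {
        final String operator;
        final String location;
        final int aboard;
        final int fatalities;

        Crash(String operator, String location, int aboard, int fatalities) {
            this.operator = operator;
            this.location = location;
            this.aboard = aboard;
            this.fatalities = fatalities;
        }
    }

    /**
     * Groups crashes by the given key and returns, for each group,
     * total fatalities divided by total people aboard (assumed definition).
     */
    static Map<String, Double> rateBy(List<Crash> crashes, Function<Crash, String> key) {
        Map<String, int[]> totals = new LinkedHashMap<>(); // value = {fatalities, aboard}
        for (Crash c : crashes) {
            int[] t = totals.computeIfAbsent(key.apply(c), k -> new int[2]);
            t[0] += c.fatalities;
            t[1] += c.aboard;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : totals.entrySet()) {
            int aboard = e.getValue()[1];
            // Groups with nobody recorded aboard are skipped to avoid dividing by zero.
            if (aboard > 0) {
                rates.put(e.getKey(), (double) e.getValue()[0] / aboard);
            }
        }
        return rates;
    }
}
```

Calling `FatalityRates.rateBy(crashes, c -> c.operator)` gives the rate per operator, and passing `c -> c.location` gives the rate per location, so one grouping routine serves both factors.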

1.9 RELEVANT MATHEMATICS ASSOCIATED WITH THE PROJECT
Mathematical Model:
Mathematical modeling is the process of using various mathematical structures (graphs, equations, diagrams, scatterplots, tree diagrams, and so forth) to represent real-world situations. The model provides an abstraction that reduces a problem to its essential characteristics.
The mathematical model is as follows:
Let S be the system for flight crash investigation,
S = {S0, Ip, Op, Fn, Sc, F, E}
Let S0 be the starting state,
S0 = {Initial State}
Let Ip be the set of input dataset attributes,
Ip = {date, time, location, operator, flight number, route, plane type, registration, cn/ln, aboard, fatalities, ground, summary}
Let Fn be the set of functions,
Fn = {Data_Preprocessing(), Frequent_Phrase_Extraction(), Cluster_Label_Induction(), Cluster_Content_Discovery(), Final_Cluster_Formation()}
Let Op be the set of outputs of the system,
Op = {Cluster 1, Cluster 2, Cluster 3, ..., Cluster n}
where Cluster 1, Cluster 2, ..., Cluster n ⊆ Dataset
Let E be the ending state,
E = {Ending State}
Let Sc be the success state of the system,
Sc = the browsed dataset loads properly into the database, so the system gives the proper output.
Let F be the failure state of the system,
F = the browsed dataset does not load properly into the database, so the system does not give the proper output.

1.10 NAMES OF CONFERENCES / JOURNALS WHERE PAPERS CAN BE PUBLISHED
• IEEE/ACM Conference/Journal 1
• Conferences/workshops in Engg colleges
• Online Journals

1.11 REVIEW OF CONFERENCE/JOURNAL PAPERS SUPPORTING PROJECT IDEA

Sr. no.: 01
Paper name: Flight Crash Investigation Using Data mining techniques.
Publication: IEEE
Area: DM
Year: 2017
Key Points: The factors considered in the analysis are as follows:
1. Fatality rate with operator
2. Fatality rate with location
The above factors were considered for ground/aboard fatality.
Problems & Future scope: The research work can be extended using other clustering techniques such as density-based and hierarchical clustering. The summary report of the dataset is used to identify better clusters using distance measures like cosine similarity.

Sr. no.: 02
Paper name: Improvement of aircraft accident investigation through expert systems.
Publication: Journal of Aircraft
Area: DMTA
Year: 2009
Key Points: Conducted a mathematical modeling analysis.
Problems & Future scope: It is also very important to conduct a study and systematization from the viewpoint of how to survive an accident, to mitigate injury and to reduce the number of injuries.

Sr. no.: 03
Paper name: Statistical Data Analyses on Aircraft Accidents in Japan: Occurrences, Causes, and Countermeasures.
Publication: American Journal of Operations Research
Area: Data Analyses
Year: 2015
Key Points: Attempted to investigate the past trend of accidents occurring in Japan for major air traffic services and the countermeasures taken to prevent aircraft accidents.
Problems & Future scope: How to deal with these troubles to prevent accidents remains a problem for future work.

1.12 PLAN OF PROJECT EXECUTION
• Extraction of information from Flight Crash DataSet Excel File
• Data Reduction/ Pre-processing for structure of DataSet
• Cluster DataSet using Lingo Clustering Algorithms
• Development of frontend and backend toward data mining
• Analysis, Verification, and Visualization of output graph
• Deployment of Result, Conclusion on the basis of graph
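
The pre-processing step of this plan can be illustrated with a minimal Java sketch that turns a crash summary into lower-case tokens. The class `Preprocessor`, the method `tokenize`, and the tiny stop-word list are hypothetical and chosen only for illustration. The sketch drops stop-words outright and applies no stemming, which is simpler than the pre-processing Lingo itself describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of summary pre-processing: lower-casing, splitting, stop-word removal. */
final class Preprocessor {

    // Tiny illustrative stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "on", "at", "was", "were", "into"));

    /** Splits a summary on non-alphanumeric characters and drops stop-words. */
    static List<String> tokenize(String summary) {
        List<String> tokens = new ArrayList<>();
        if (summary == null) {
            return tokens;
        }
        for (String t : summary.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For example, `Preprocessor.tokenize("Crashed into the mountain, in fog.")` yields the tokens `crashed`, `mountain`, and `fog`.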

CHAPTER 2

TECHNICAL KEYWORDS

2.1 AREA OF PROJECT

Data Mining:

Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. It is an essential process where intelligent methods are applied to extract data patterns. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the “knowledge discovery in databases” process, or KDD.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

2.2 TECHNICAL KEYWORDS
Clustering, Singular Value Decomposition, Data mining, Lingo Algorithm

CHAPTER 3

INTRODUCTION

3.1 PROJECT IDEA
We are trying to develop a “Pattern Extraction & Analysis using Lingo Clustering Algorithm” system that offers:

• Time saving
• All-in-one store
• Hands-on inspection
• Very small response time
• Readable cluster labels
• Energy-efficient clustering
• More cluster accuracy

3.2 MOTIVATION OF THE PROJECT
With an enormous growth of the Internet it has become very difficult for the users to find relevant documents. In response to the user’s query, currently available search engines return a ranked list of documents along with their partial content (snippets). If the query is general, it is extremely difficult to identify the specific document which the user is interested in. The users are forced to sift through a long list of off-topic documents. Moreover, internal relationships among the documents in the search result are rarely presented and are left for the user. One of the alternative approaches is to automatically group search results into thematic groups (clusters).
• Readable and unambiguous descriptions of the thematic groups are an important factor of the overall quality of clustering.
• Lingo is a next-generation text clustering algorithm capable of processing tens of gigabytes of text and millions of documents. Lingo can process both the whole collection and an arbitrary subset of it in near-real-time, which makes it particularly suitable as a component of text document analysis suites.

3.3 LITERATURE SURVEY

Sr. no.: 01
Paper name: Flight Crash Investigation Using Data mining techniques.
Publication: IEEE
Area: DM
Year: 2017
Key Points: The factors considered in the analysis are as follows:
1. Fatality rate with operator
2. Fatality rate with location
The above factors were considered for ground/aboard fatality.
Problems & Future scope: The research work can be extended using other clustering techniques such as density-based and hierarchical clustering. The summary report of the dataset is used to identify better clusters using distance measures like cosine similarity.

Sr. no.: 02
Paper name: Improvement of aircraft accident investigation through expert systems.
Publication: Journal of Aircraft
Area: DMTA
Year: 2009
Key Points: Conducted a mathematical modeling analysis.
Problems & Future scope: It is also very important to conduct a study and systematization from the viewpoint of how to survive an accident, to mitigate injury and to reduce the number of injuries.

Sr. no.: 03
Paper name: Statistical Data Analyses on Aircraft Accidents in Japan: Occurrences, Causes, and Countermeasures.
Publication: American Journal of Operations Research
Area: Data Analyses
Year: 2015
Key Points: Attempted to investigate the past trend of accidents occurring in Japan for major air traffic services and the countermeasures taken to prevent aircraft accidents.
Problems & Future scope: How to deal with these troubles to prevent accidents remains a problem for future work.

CHAPTER 4

PROBLEM DEFINITION AND

SCOPE

4.1 PROBLEM STATEMENT
Flight crashes are investigated using the LINGO clustering data mining technique and cosine similarity, in order to identify the aboard and ground fatality rates by operator and location and to find similarities among the plane crashes.
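
Cosine similarity, as used in the problem statement, measures the angle between two term vectors: cos(a, b) = (a · b) / (‖a‖ ‖b‖). A minimal Java sketch follows. The class name `CosineSimilarity` and the convention of returning 0 for an all-zero vector are assumptions made for illustration, not details taken from the project code.

```java
/** Hypothetical sketch: cosine similarity between two equally sized term vectors. */
final class CosineSimilarity {

    /** Returns (a . b) / (|a| |b|), or 0.0 if either vector is all zeros (assumed convention). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two crash summaries with the same term proportions score close to 1, and summaries that share no terms score 0.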

4.1.1 Goals and Objectives
The purpose is to use data mining techniques to find unknown patterns in the international flight crash dataset, rather than relying on the traditional way of finding patterns. Because Lingo produces readable cluster labels, it is also a better alternative to conventional clustering. The system will automatically generate a report at the end of the day, which will be helpful in many ways.

4.1.2 Statement of scope
We cluster the dataset based on the analysis of three factors: Location, Type, and Operator. These factors are considered with respect to weather, hardware issues, people responsible, technical issues, “crashed into”, “disappeared”, mid-air collision, fuel issues, etc.
In this project, each cluster is analyzed by two factors: first, the fatality rate by operator, and second, the fatality rate by location. On that basis, similarities among the plane crashes are found.

4.2 SOFTWARE CONTEXT
The major objective is to use data mining techniques to find out unknown patterns in the international flight crash dataset.

• Operating System: Windows 7
• Coding language: Java
• IDE: NetBeans IDE 8.0.1
• Database: MySQL

4.3 MAJOR CONSTRAINTS
1. Objective difficulties: language properties (syntax, inflection, text segmentation) and the definition of similarity between documents.

2. Subjective difficulties: the choice of a “good” cluster label.

Solving the problem: the idea is to split the process into two independent phases and combine them to produce the desired effect:
1. Cluster label candidate discovery
2. Cluster discovery
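
The first phase, cluster label candidate discovery, relies on finding phrases that occur frequently across documents. The Java sketch below counts contiguous word sequences (n-grams) instead of using suffix arrays, so it is a simpler stand-in rather than the technique the Lingo description names. The class `FrequentPhrases`, the method `extract`, and the parameters `maxWords` and `threshold` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: frequent phrase discovery by plain n-gram counting. */
final class FrequentPhrases {

    /**
     * Counts every contiguous word sequence of 1..maxWords tokens inside each document
     * and keeps those that occur more than `threshold` times in total.
     * Phrases never span two documents.
     */
    static Map<String, Integer> extract(List<List<String>> docs, int maxWords, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (int start = 0; start < doc.size(); start++) {
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxWords && start + len <= doc.size(); len++) {
                    if (len > 1) {
                        phrase.append(' ');
                    }
                    phrase.append(doc.get(start + len - 1));
                    counts.merge(phrase.toString(), 1, Integer::sum);
                }
            }
        }
        Map<String, Integer> frequent = new TreeMap<>(); // sorted for stable output
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                frequent.put(e.getKey(), e.getValue());
            }
        }
        return frequent;
    }
}
```

With the three tokenised summaries `[engine, failure, takeoff]`, `[engine, failure, landing]`, and `[engine, fire]`, `maxWords = 2`, and `threshold = 1`, the surviving candidates are `engine` (3), `engine failure` (2), and `failure` (2).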

4.4 METHODOLOGIES OF PROBLEM SOLVING AND EFFICIENCY ISSUES

4.4.1 Algorithmic Strategies
Designing an algorithm is a sound way to solve a problem. An algorithm is a step-by-step process for obtaining the result of a problem. There are usually various algorithms that solve a particular problem, out of which the most efficient one is chosen. In this project we use an algorithm for pattern detection and pattern recognition.

Algorithm:
4.4.1.1 LINGO Algorithm:
When designing a search results clustering algorithm, special attention must be paid to ensuring that both the content and the description (labels) of the resulting groups are meaningful to humans. As has been stated, “a good cluster, or document grouping, is one which possesses a good, readable description”. The majority of open text clustering algorithms follow a scheme where cluster content discovery is performed first, and then, based on the content, the labels are determined. But very often intricate measures of similarity among documents do not correspond well with the plain human understanding of what a cluster’s “glue” element has been.

To avoid such problems, Lingo reverses this process: we first attempt to ensure that we can create a human-perceivable cluster label and only then assign documents to it. Specifically, we extract frequent phrases from the input documents, hoping they are the most informative source of human-readable topic descriptions. Next, by reducing the original term-document matrix using SVD, we try to discover any existing latent structure of diverse topics in the search result.
Finally, we match group descriptions with the extracted topics and assign relevant documents to them.
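
The SVD reduction mentioned above needs a rule for how many topics (clusters) to keep. A common reading of Lingo is to take the smallest k for which the first k singular values account for at least a chosen fraction of the sum of all singular values. The Java sketch below shows only that selection step and assumes the singular values have already been computed by some SVD routine. The class `ClusterCount`, the method `chooseK`, and the parameter name `candidateLabelThreshold` are hypothetical.

```java
/** Hypothetical sketch: choosing the number of clusters k from the singular values of A. */
final class ClusterCount {

    /**
     * @param singularValues the diagonal of the SVD's singular-value matrix, in descending order
     * @param candidateLabelThreshold fraction in (0, 1] of the singular-value mass to retain
     * @return the smallest k whose first k singular values reach the threshold,
     *         or 0 if all singular values are zero
     */
    static int chooseK(double[] singularValues, double candidateLabelThreshold) {
        double total = 0.0;
        for (double s : singularValues) {
            total += s;
        }
        if (total == 0.0) {
            return 0;
        }
        double partial = 0.0;
        for (int k = 1; k <= singularValues.length; k++) {
            partial += singularValues[k - 1];
            double q = partial / total; // share of the singular-value mass kept by k topics
            if (q >= candidateLabelThreshold) {
                return k;
            }
        }
        return singularValues.length;
    }
}
```

For singular values 4, 3, 2, 1 and a threshold of 0.75, the running shares are 0.4, 0.7, and 0.9, so `chooseK` returns 3. Singular values beyond the rank of A are zero, so summing over all of them gives the same total as summing up to the rank.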

Input:
1: DATASET ← input documents (or snippets)

{STEP 1: Preprocessing}
2: for all d ∈ DATASET do
3:   perform text segmentation of d; {Detect word boundaries etc.}
4:   if language of d recognized then
5:     apply stemming and mark stop-words in d;
6:   end if
7: end for

{STEP 2: Frequent Phrase Extraction}
8: concatenate all documents;
9: Pc ← discover complete phrases;
10: Pf ← p : {p ∈ Pc ∧ frequency(p) > Term Frequency Threshold};

{STEP 3: Cluster Label Induction}
11: A ← term-document matrix of terms not marked as stop-words and with frequency higher than the Term Frequency Threshold;
12: Σ, U, V ← SVD(A); {Product of SVD decomposition of A}
13: k ← 0; {Start with zero clusters}
14: n ← rank(A);
15: repeat
16:   k ← k + 1;
17:   q ← (∑ Σii for i = 1..k) / (∑ Σii for i = 1..n);
18: until q ≥ Candidate Label Threshold;
19: P ← phrase matrix for Pf; {Phrases expressed in the same term space as A}
20: for all columns of Uk^T P do
21:   find the largest component mi in the column;
22:   add the corresponding phrase to the Cluster Label Candidates set;
23:   labelScore ← mi;
24: end for
25: calculate cosine similarities between all pairs of candidate labels;
26: identify groups of labels that exceed the Label Similarity Threshold;
27: for all groups of similar labels do
28:   select one label with the highest score;
29: end for

{STEP 4: Cluster Content Discovery}
30: for all L ∈ Cluster Label Candidates do
31:   create cluster C described with L;
32:   add to C all documents whose similarity to C exceeds the Snippet Assignment Threshold;
33: end for
34: put all unassigned documents in the “Others” group;

{STEP 5: Final Cluster Formation}
35: for all clusters do
36:   clusterScore ← labelScore × ‖C‖; {‖C‖ is the number of documents in cluster C}
37: end for

Table 5.2 Risk Probability definitions
Medium: Probability of occurrence is 26-75%

Table 5.3 Risk Impact definitions
High: 5-10% schedule impact, or some parts of the project have low quality

8.4 Verification and Validation for Acceptance

Verification and validation include the following:
1. Enter user name and password.
2. Database connectivity.
3. Load the flight crash investigation records into the SQL server (database).
4. Data pre-processing.
5. Apply the Lingo clustering algorithm and get the result as multiple clusters.
6. Create a pie chart of the result.
7. Create a line chart of the result.
8. Create a bar chart of the result.

CHAPTER 9

SOFTWARE TESTING

9.1 Type of Testing Used

All major testing types were included in the project:
1. Unit Testing
2. Integration Testing
3. System Testing

9.2 Test Cases and Test Results

Test Case Id: 1
Test Case Name: Enter user name and password
Test Case Description: If the credentials are correct, the login steps are performed by the user.
Test Step: 1. Fill out the username and password and log in.
Expected Result: The application should provide the respective page and display the following options to the user:
1. Login
2. Cancel
Actual Result: Login successful.
Status: Pass

Test Case Id: 2
Test Case Name: Enter user name and password
Test Case Description: If the credentials are correct, the login steps are performed by the user.
Test Step: 1. Fill out the username and password and log in.
Expected Result: The application should provide the respective page and display the following options to the user:
1. Login
2. Cancel
Actual Result: Login unsuccessful.
Status: Failed
Error: Invalid username and password.

Test Case Id: 3
Test Case Name: Database connectivity
Test Case Description: The application is properly connected to the database.
Test Step: Check whether the database is properly connected.
Expected Result: Connected properly.
Actual Result: Successfully done.
Status: Pass

Test Case Id: 4
Test Case Name: Load the flight crash investigation records into the SQL server (database)
Test Case Description: Check that the records are inserted one by one into the database.
Test Step: 1. Download the flight crash investigation records from the Internet.
2. Insert the records into the SQL server (database).
Expected Result: Records loaded successfully.
Actual Result: Successfully done.
Status: Pass

Test Case Id: 5
Test Case Name: Data pre-processing
Test Case Description: Processing of the text input.
Test Step: Pre-process the data for analysis of the result.
Expected Result: Data is pre-processed.
Actual Result: Successfully done.
Status: Pass

Test Case Id: 6
Test Case Name: Apply the Lingo clustering algorithm and get the result as multiple clusters
Test Case Description: Take the flight crash investigation records from the database and apply the Lingo clustering algorithm to them.
Test Step: 1. Fetch all flight crash investigation records from the database.
2. Use the Lingo clustering algorithm.
3. Result.
Expected Result: Result as multiple clusters.
Actual Result: Success.
Status: Pass

Test Case Id: 7
Test Case Name: Create a pie chart of the result
Test Case Description: Show the cluster percentages as a pie chart.
Test Step: 1. Cluster the data using Lingo clustering.
2. Show the cluster percentages as a pie chart.
Expected Result: Success.
Actual Result: Success.
Status: Pass

Test Case Id: 8
Test Case Name: Create a line chart of the result
Test Case Description: Show the cluster percentages as a line chart.
Test Step: 1. Cluster the data using Lingo clustering.
2. Show the cluster percentages as a line chart.
Expected Result: Success.
Actual Result: Success.
Status: Pass

Test Case Id: 9
Test Case Name: Create a bar chart of the result
Test Case Description: Show the cluster percentages as a bar chart.
Test Step: 1. Cluster the data using Lingo clustering.
2. Show the cluster percentages as a bar chart.
Expected Result: Success.
Actual Result: Success.
Status: Pass

CHAPTER 10

RESULTS

10.1 Screen shots

10.1.1 Load Data and Cluster Window

10.1.2 Loading the DataSet

10.1.3 Clustering Window

10.1.4 Selecting Attributes for Clustering

10.1.5 Clustered Crashes in Particular Cluster

10.2 Outputs

We use the Lingo clustering algorithm to create multiple clusters, i.e. C1, C2, ..., Cn, which are reliable and reduce human effort. The final output is the set of clusters.
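
The cluster percentages shown in the charts can be derived from the cluster sizes, as in the minimal Java sketch below. The class `ClusterShares` and the method `percentages` are hypothetical names introduced for illustration. The sketch takes each percentage relative to the sum of all cluster sizes, which is an assumption: if a crash is assigned to more than one cluster, that sum is larger than the number of crashes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: share of each cluster, as a percentage, for pie/bar/line charts. */
final class ClusterShares {

    /** Maps each cluster label to 100 * size / (sum of all sizes); all zeros if the sum is 0. */
    static Map<String, Double> percentages(Map<String, Integer> clusterSizes) {
        int total = 0;
        for (int size : clusterSizes.values()) {
            total += size;
        }
        Map<String, Double> shares = new LinkedHashMap<>(); // keeps the input order
        for (Map.Entry<String, Integer> e : clusterSizes.entrySet()) {
            shares.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return shares;
    }
}
```

For two clusters of sizes 30 and 10, the shares are 75% and 25%.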

10.2.1 Showcase of Precautions for particular Cluster

10.2.2 Bar Chart for particular attribute

10.2.3 Line Chart for particular attribute

10.2.4 Pie Chart for particular attribute

CHAPTER 11

DEPLOYMENT AND

MAINTENANCE

11.1 Installation and un-installation

11.2 User help

CHAPTER 12

CONCLUSION AND FUTURE SCOPE

Unlike the base paper, this project does not use RapidMiner, a software tool that offers an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. Instead, the core part of the project is the Lingo clustering algorithm.
After generating the graphs, trends, and patterns, we can present the precautions, results, and conclusions. Using them to avoid or reduce further crashes is the final aim of the project.

THEORETICAL RESULT

We evaluated the performance of the Lingo algorithm by means of a large set of experiments and obtained the following results:
• The proposed algorithm is able to mine all attributes of the datasets.
• To our knowledge, this is the first attempt to combine data mining of flight crash data with precautionary measures.
• Compared with the existing RapidMiner approach, Lingo works on both the content and the descriptions (labels) of the resulting groups, so the labels are meaningful to humans.
• In our experiments, Lingo clustering was orders of magnitude faster than state-of-the-art algorithms for all considered parameter settings and data sets.
• Lingo clustering is faster than or competitive with state-of-the-art approaches, especially on denser data sets.

CONCLUSION AND FURTHER WORK

The Lingo clustering technique was used to find the clusters and fatalities for the flight crash investigation. Various graphical representations and patterns were generated, along with precautionary measures to avoid flight crashes.

This work can be extended with other clustering techniques such as density-based and hierarchical clustering. In future versions, appropriate precautions could be generated automatically. A major direction for further research is real-time event similarity detection. Moving from data mining to machine learning would bring further advantages: additional sensors on the aircraft, combined with machine learning, could compare the current situation with the history of past crashes and suggest measures to avoid a similar outcome, so that specific instances are captured instead of generalizations.
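
As a rough illustration of the event similarity idea mentioned above, the following Python sketch ranks past crash summaries by their word overlap with a new event description. It is an assumption-laden toy, not part of the implemented system: the function names are hypothetical, the input is plain text instead of sensor data, and Jaccard similarity on word sets is only one of many possible measures.

```python
def tokens(text):
    """Lowercase a description and return its set of alphabetic words."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return set(cleaned.split())


def most_similar(event, past_summaries, top_n=3):
    """Rank past crash summaries by Jaccard similarity to a new event.

    Returns up to `top_n` pairs (index into past_summaries, similarity),
    highest similarity first; ties keep the original order.
    """
    ev = tokens(event)
    scored = []
    for idx, summary in enumerate(past_summaries):
        past = tokens(summary)
        union = ev | past
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(ev & past) / len(union) if union else 0.0
        scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, round(score, 2)) for score, idx in scored[:top_n]]
```

A real-time system would replace the word sets with features derived from live flight data, but the ranking step would keep the same shape: score the current situation against each past record and surface the closest matches together with their precautions.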

REFERENCES

1. Sharma, S., & Sabitha, A. S. Flight Crash Investigation Using Data Mining Techniques. Department of Computer Science and Engineering, Amity University, Uttar Pradesh, Noida, India.
2. Oluwatuyi, O., & Ileri, O. N. (2013). Air disaster and its implications in the developing countries: a case study of Nigeria. Modern Social Science Journal, 2014, Article-ID. SAFETY, A. (2002). Australian Aviation Accidents Involving Fuel Exhaustion and Starvation.
3. Airplanes, C. (1959). Statistical Summary of Commercial Jet Airplane Accidents. Worldwide Operations, 2008.
4. Nazeri, Z., Donohue, G., & Sherry, L. (2008). Analyzing Relationships Between Aircraft Accidents and Incidents. In Proceedings of the International Conference on Research in Air Transportation.
5. Mugtussids, I. B. (2000). Flight Data Processing Techniques to Identify Unusual Events.
6. Iwadare, K., & Oyama, T. (2015). Statistical Data Analyses on Aircraft Accidents in Japan: Occurrences, Causes, and Countermeasures. American Journal of Operations Research, 5(03), 222.
7. Lagos, A., Motevalli, M., & Sakata, N. (2005, March). Analysis of the Effect of Milestone Aviation Accidents on Safety Policy, Regulation, and Technology. In 46th Annual Transportation Research Forum, Washington, DC, March 6-8, 2005 (No. 208180). Transportation Research Forum.
8. Milosovski, G., Bil, C., & Simon, P. (2009). Improvement of aircraft accident investigation through expert systems. Journal of Aircraft, 46(1), 10-24.
9. Chappell, S. L. (1990). Pilot performance research for TCAS (No. 902357). SAE Technical Paper.

ANNEXURE A

LABORATORY ASSIGNMENTS ON PROJECT ANALYSIS OF ALGORITHMIC DESIGN

To develop the problem under consideration and justify feasibility using concepts of knowledge canvas and IDEA Matrix.

Refer for IDEA Matrix and Knowledge canvas model. Case studies are given in this book. The IDEA Matrix is represented in the following form. The knowledge canvas represents the identification of opportunity for the product. Feasibility is represented with respect to the business perspective.

I
Increase: Effortless and increase the speed for finding unknown patterns.
Improve: Improves efficiency and pattern recognition to find out unknown patterns.
Ignore: Ignore old data processing algorithms.

D
Drive: Lingo, next-generation text clustering algorithm drives the project.
Deliver: Data mining techniques to find out unknown patterns in the international flight crash dataset.
Decrease: Proxy and readymade processing technique.

E
Educate: Educate of system over thumbs entry or register policy.
Eliminate: Eliminate and reduce the flight crashes by precautionary measures.
Evaluate: Time required for clustering is very less.

A
Accelerate: Lingo can process large sets of documents in a matter of seconds or minutes.
Associate: Associate the record with respective cluster.
Avoid: Avoid dissimilarity among crashes report.

Table A.1: IDEA Matrix

Project problem statement feasibility assessment using NP-Hard, NP-Complete or satisfiability issues using modern algebra and/or relevant mathematical models.

Input x, output y, y = f(x)
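
One possible reading of this model for the present system is given below. It is offered as an assumption, not as the authors' formal definition: the input x is taken to be the set of crash records, the output y the set of labelled clusters, and f the Lingo clustering step. The symbols D, d_i and k are introduced here for illustration only; clusters are not required to be disjoint, since Lingo allows a record to belong to more than one cluster.

```latex
x = D = \{d_1, d_2, \dots, d_n\}, \qquad
y = \{C_1, C_2, \dots, C_k\}, \qquad
y = f(x), \qquad
C_j \subseteq D \ \text{ for } j = 1, \dots, k .
```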

ANNEXURE B

LABORATORY ASSIGNMENTS ON PROJECT QUALITY AND RELIABILITY TESTING OF PROJECT DESIGN

It should include assignments such as

Use of divide and conquer strategies to exploit distributed/parallel/concurrent processing of the above to identify objects, morphisms, overloading in functions (if any), and functional relations and any other dependencies (as per requirements). It can include Venn diagram, state diagram, function relations, i/o relations; use this to derive objects, morphism, overloading.

Use of the above to draw functional dependency graphs and relevant software modeling methods and techniques, including UML diagrams or other necessities, using appropriate tools.

Testing of project problem statement using generated test data (using mathematical models, GUI, function testing principles, if any), selection and appropriate use of testing tools, testing of UML diagram's reliability. Also write black box test cases for each identified function. You can use Mathematica or an equivalent open source tool for generating test data.

Additional assignments by the guide. If the project type is Entrepreneur, refer ?,?,?, ?

ANNEXURE C

PROJECT PLANNER

Project Planner For Flight Crash Investigation System

ANNEXURE D

REVIEWERS COMMENTS OF PAPER SUBMITTED

(At least one technical paper must be submitted in Term-I on the project design in the conferences/workshops in IITs, Central Universities or UoP Conferences or equivalent International Conferences Sponsored by IEEE/ACM)

1. Paper Title:

2. Name of the Conference/Journal where paper submitted :

3. Paper accepted/rejected :

4. Review comments by reviewer :

5. Corrective actions if any :

ANNEXURE E

PLAGIARISM REPORT

Plagiarism report

ANNEXURE F

TERM-II PROJECT LABORATORY ASSIGNMENTS

1. Review of design and necessary corrective actions taking into consideration the feedback report of Term I assessment, and other competitions/conferences participated like IIT, Central Universities, University Conferences or equivalent centers of excellence etc.

2. Project workstation selection, installations along with setup and installation report preparations.

3. Programming of the project functions, interfaces and GUI (if any) as per 1st Term term-work submission using corrective actions recommended in Term-I assessment of Term-work.

4. Test tool selection and testing of various test cases for the project performed and generate various testing result charts, graphs etc. including reliability testing.

Additional assignments for the Entrepreneurship Project:

5. Installations and Reliability Testing Reports at the client end.

ANNEXURE G

INFORMATION OF PROJECT GROUP MEMBERS

One page for each student.

GCOEARA, Department of Computer Engineering 2017-18 52

1. Name :

2. Date of Birth :

3. Gender :

4. Permanent Address :

5. E-Mail :

6. Mobile/Contact No. :

7. Placement Details :

8. Paper Published :
