MCA-20-32: Data Mining and Integration using R
Type: Compulsory
Contact Hours: 4 hours/week
Examination Duration: 3 Hours
Mode: Lecture
External Maximum Marks: 75
External Pass Marks: 30 (i.e. 40%)
Internal Maximum Marks: 25
Total Maximum Marks: 100
Total Pass Marks: 40 (i.e. 40%)
Instructions to paper setter for End semester examination:
The total number of questions shall be nine. Question number one will be compulsory and will consist of short/objective-type questions from the complete syllabus. In addition to the compulsory first question, there shall be four units in the question paper, each consisting of two questions. The student will attempt one question from each unit in addition to the compulsory question. All questions will carry equal marks.
Course Objectives: The objective of this course is to provide in-depth coverage of data mining and data integration, along with their implementation in the R programming language.
Course Outcomes (COs): At the end of this course, the student will be able to:
MCA-20-32.1 understand the fundamental concepts of data warehousing and data mining;
MCA-20-32.2 acquire skills to implement data mining techniques;
MCA-20-32.3 learn schema matching, mapping and integration strategies;
MCA-20-32.4 implement data mining techniques in R to meet the market job requirements.
UNIT – I
Data Warehouse: A Brief History, Characteristics, Architecture for a Data Warehouse. Data Mining: Introduction, Motivation, Importance, Knowledge Discovery Process, Data Mining Functionalities, Interesting Patterns, Classification of Data Mining Systems, Major Issues. Data Preprocessing: Overview, Data Cleaning, Data Integration, Data Reduction, Data Transformation and Data Discretization, Outliers.
UNIT – II
Data Mining Techniques: Clustering - Requirements for Cluster Analysis, Clustering Methods - Partitioning Methods, Hierarchical Methods. Decision Tree - Decision Tree Induction, Attribute Selection Measures, Tree Pruning. Association Rule Mining - Market Basket Analysis, Frequent Itemset Mining using Apriori Algorithm, Improving the Efficiency of Apriori. Concept of Nearest Neighborhood and Neural Networks.
UNIT – III
Data Integration: Architecture of Data Integration. Describing Data Sources: Overview and Desiderata, Schema Mapping Language, Access Pattern Limitations. String Matching: Similarity Measures, Scaling Up String Matching. Schema Matching and Mapping: Problem Definition, Challenges, Matching and Mapping Systems. Data Matching: Rule-Based Matching, Learning-Based Matching, Matching by Clustering.
UNIT – IV
R Programming: Advantages of R over other Programming Languages, Working with Directories and Data Types in R, Control Statements, Loops, Data Manipulation and Integration in R. Exploring Data in R: Data Frames, R Functions for Data in Data Frame, Loading Data Frames. Decision Tree Packages in R, Issues in Decision Tree Learning, Hierarchical and K-means Clustering Functions in R, Mining Algorithm Interfaces in R.
Text Books:
⦁ J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier India.
⦁ A. Doan, A. Halevy, Z. Ives, Principles of Data Integration, Morgan Kaufmann Publishers.
⦁ S. Acharya, Data Analytics Using R, McGraw Hill Education (India) Private Limited.
Reference Books:
⦁ G.S. Linoff, M.J.A. Berry, Data Mining Techniques, Wiley India Pvt. Ltd.
⦁ A. Berson, S.J. Smith, Data Warehousing, Data Mining & OLAP, Tata McGraw-Hill.
⦁ J. Horbulyk, Data Integration Best Practices.
⦁ Jared P. Lander, R For Everyone, Pearson India Education Services Pvt. Ltd.