• Class Number 9185
  • Term Code 3460
  • Class Info
  • Unit Value 6 units
  • Mode of Delivery In Person
  • COURSE CONVENER
    • Charini Nanayakkara
  • LECTURER
    • Charini Nanayakkara
    • Prof Graham Williams
  • Class Dates
  • Class Start Date 22/07/2024
  • Class End Date 25/10/2024
  • Census Date 31/08/2024
  • Last Date to Enrol 29/07/2024

Real-world data are commonly messy, distributed, and heterogeneous. This course introduces core concepts of data cleaning, standardisation, and data integration, which aim to convert and map raw data into formats that allow more efficient and convenient use and analysis. The course also discusses data quality, management, and storage issues as relevant to data analytics.

Learning Outcomes

Upon successful completion, students will have the knowledge and skills to:

  1. Critically reflect upon different data sources, types, formats and structures,
  2. Justify and apply data cleaning, preprocessing, and standardisation for data analytics,
  3. Apply data integration concepts and techniques to heterogeneous and distributed data,
  4. Interpret, assess and discuss data quality measurements,
  5. Understand and be able to use advanced data wrangling, data integration, and database techniques as relevant to data analytics.

Books

  1. Data matching - Concepts and techniques for record linkage, entity resolution and duplicate detection (Peter Christen, Springer, 2012). This book is a required text for major parts of the course. There are several copies available in the ANU library.
  2. Data mining: Concepts and techniques, 3rd edition (Jiawei Han, Micheline Kamber and Jian Pei, Morgan Kaufmann, 2011). Note: This is also the textbook for the Data Mining course (COMP3425 and COMP8410).
  3. Data mining with Rattle and R is a useful book if you plan to use Rattle in this course as well as the Data Mining course (COMP3425 and COMP8410).


Software

  1. Pandas (which is included in Anaconda), based on Python.
  2. Matplotlib (also included in Anaconda), based on Python.
  3. Rattle, based on R.
  4. Code repository for the Data wrangling with Python book: https://github.com/jackiekazil/data-wrangling
  5. Code repository for data wrangling with Pandas: https://github.com/fonnesbeck/statistical-analysis-python-tutorial (see 2. Data wrangling with Pandas)

Staff Feedback

Students will be given feedback in the following forms in this course:

  • Individual written comments for all assessments. For each assignment task, when the marks are released, we will also provide the students with a marking feedback document. This document will contain general feedback for each task in the assignment and what criteria we followed for the marking. We will also let students know the distribution of the marks so the students can see where they sit in the course overall.
  • Verbal comments. During interactive lectures, we will further discuss the assignment and quiz marks and provide feedback to students. We plan to give feedback for quiz 1 in week 3, quiz 2 in week 6, and quiz 3 in week 9 interactive lectures. We will also discuss the distribution of marks during the lectures.
  • Feedback during labs. Labs for the course will start from week 3 onwards. During each lab, tutors will engage with students both individually and as a group to discuss and give feedback on lab tasks that students are required to complete.
  • Self-assessment and feedback. In the first week of the course, we will release a sample set of Python scripts that students can use for self-assessment. This is to ensure that students understand the level of programming skills required throughout the course.
  • We will also provide additional feedback to groups, to individuals, and/or focus groups should the need arise.

Student Feedback

ANU is committed to the demonstration of educational excellence and regularly seeks feedback from students. Students are encouraged to offer feedback directly to their Course Convener or through their College and Course representatives (if applicable). The feedback given in these surveys is anonymous and provides the Colleges, University Education Committee and Academic Board with opportunities to recognise excellent teaching, and opportunities for improvement. The Surveys and Evaluation website provides more information on student surveys at ANU and reports on the feedback provided on ANU courses.

Other Information

The use of Generative AI Tools (e.g., ChatGPT) is permitted in this course (excluding the final exam), provided that proper citation and the prompts used are supplied, along with a description of how the tool contributed to the assignment. Guidelines regarding appropriate citation and use can be found on the ANU library website https://libguides.anu.edu.au/generative-ai

Marks will reflect the contribution of the student rather than the contribution of the tools. Further guidance on appropriate use should be directed to the convener for this course.

Class Schedule

Week/Session Summary of Activities Assessment
1 Introduction to Data Wrangling
Lecture 1: What is data wrangling, and course overview.
Lecture 2: The data wrangling process; understanding data.
Lecture 3: Data extraction and storage, data warehousing.
Interactive lecture: Discussion of issues raised in recorded lectures 1 to 3.
Reading material (all students)
  • Rahm and Do (2000): Data cleaning: Problems and current approaches.
  • New York Times article (2014): For Big Data Scientists, ‘Janitor Work’ is Key Hurdle to Insights
  • Chu et al. (2016): Data cleaning: Overview and emerging challenges
2 Data quality, exploration and cleaning
Lecture 4: Web scraping and geocoding of data.
Lecture 5: Data quality assessment, data quality dimensions, data profiling, data visualisation, real-world data is dirty.
Lecture 6: Resolving data quality issues, data cleaning overview, dealing with missing data.
Interactive lecture: Discussion of issues raised in recorded lectures 4 to 6.
Reading material (COMP8430 students only)
  • Strong, Lee and Wang (1997): Data Quality in Context.

Online quiz 1 (progress questions weeks 1 and 2, average of two best quiz marks is worth 3% of total course mark).
Release of assignment 1.
3 Data pre-processing
Lecture 7: Data transformation, aggregation and reduction, metadata.
Lecture 8: Data parsing and standardisation, special case of personal data.
Lecture 9: Example data cleaning using Rattle (R based) and Python (Pandas).
Tutorial/lab 1: Data exploration using Rattle and practical Pandas exercises.
Interactive lecture: Discussion of issues raised in recorded lectures 7 to 9.
Reading material (COMP8430 students only)
  • Krishnan, Haas, Franklin and Wu (2016): Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations.

Participation in 65% (4 out of 6) labs contributes to 2% of total course marks.
4 Data integration
Lecture 10: Overview of data integration and its importance.
Lecture 11: Schema mapping and matching.
Lecture 12: Overview of record linkage (process, history, challenges) and data fusion.
Tutorial/lab 2: Data cleaning and preprocessing using practical Rattle and Pandas exercises.
Interactive lecture: Discussion of issues raised in recorded lectures 10 to 12.
Reading material (all students)
  • First two chapters of Christen (2012): Data Matching – Introduction and the Data Matching Process.

Participation in 65% (4 out of 6) labs contributes to 2% of total course marks.
Release of assignment 2.
5 Record linkage
Lecture 13: Blocking and indexing for record linkage.
Lecture 14: More on blocking/indexing (phonetic encoding).
No practical lab/tutorial (work on assignment 1).
Interactive lecture: Discussion of issues raised in recorded lectures 13 to 14.
Reading material (all students)
  • Chapter 4 of Christen (2012): Data Matching – Indexing.

Online quiz 2 (progress questions weeks 2 to 5, average of two best quiz marks is worth 3% of total course mark).
6 Record linkage (2)
Lecture 15: Record linkage comparison (basics).
Lecture 16: Record linkage comparison (string comparison functions).
Tutorial/lab 3: Record linkage blocking using Python.
Interactive lecture: Discussion of issues raised in recorded lectures 15 to 16.
Reading material (all students)
  • Chapter 5 of Christen (2012): Data matching – Field and Record Comparison.

Participation in 65% (4 out of 6) labs contributes to 2% of total course marks.
Assignment 1 due (Friday 30 August).
Release of assignment 3.
Release of assignment 4 (COMP8430 only).
7 Record linkage (3)
Lecture 17: Record linkage classification (basics).
Lecture 18: Record linkage classification (advanced).
Tutorial/lab 4: Record linkage comparison using Python.
Interactive lecture: Discussion of issues raised in recorded lectures 17 to 18.
Reading material (all students)
  • Chapter 6 of Christen (2012): Data Matching – Classification.

Participation in 65% (4 out of 6) labs contributes to 2% of total course marks.
8 Record linkage (4)
Lecture 19: Record linkage scalability evaluation, test data generation.
Lecture 20: Record linkage quality evaluation, clerical review.
Tutorial/lab 5: Record linkage classification using Python.
Interactive lecture: Discussion of issues raised in recorded lectures 19 to 20.
Reading material (all students)
  • Chapter 7 of Christen (2012): Data Matching – Evaluation of Matching Quality and Completeness.

Online quiz 3 (progress questions weeks 6 to 8, average of two best quiz marks is worth 3% of total course mark).
Participation in 65% (4 out of 6) labs contributes to 2% of total course marks.
Assignment 2 due (Friday 27 September).
9 Advanced record linkage
Lecture 21: Data fusion, merging records after integration.
Lecture 22: Group linkage, collective linkage, active learning, geocode matching, linking temporal and dynamic data, and real-time linkage.
Lecture 23: Privacy aspects in data wrangling, and privacy-preserving record linkage.
Tutorial/lab 6: Record linkage evaluation using Python.
Interactive lecture: Discussion of issues raised in recorded lectures 21 to 23.
Reading material (all students)
  • Chapters 8 and 9 of Christen (2012): Data Matching – Privacy Aspects of Data Matching and Further Topics and Research Directions (all students).
  • Schnell et al. (2009): Privacy-Preserving Record Linkage using Bloom Filters (COMP8430 students only).

Participation in 65% (4 out of 6) labs contributes to 2% of total course marks.
10 Data fusion and ontologies
Lecture 24: Ontology mapping and matching.
Lecture 25: Wrangling dynamic data and data streams and location (spatial) data.
Interactive lecture: Discussion of issues raised in recorded lectures 24 to 25 and assignment issues and questions.
11 Applying Python linkage program to other provided data sets and linkage evaluation. Assignment 3 due (Friday 18 October).
12 Course summary - Interactive lecture Summary of course topics (the main important aspects for the students to get out of this course), and discussion of the final examination. Assignment 4 due (Friday 25 October - COMP8430 students only).

Tutorial Registration

Sign-up for lab sessions will be available via either Wattle or the My Timetable (MyTT) system.

Assessment Summary

Assessment task Value Due Date Return of assessment Learning Outcomes
Practical data exploration 10 % 30/08/2024 27/09/2024 1, 2, 4
Practical data cleaning 15 % 27/09/2024 11/10/2024 1, 2, 4
Practical record linkage 20 % 18/10/2024 01/11/2024 1, 2, 3, 4, 5
Online Quizzes 3 % * * 1, 2, 3, 4
In-person Labs 2 % * * 1, 2, 4, 5
Final examination 50 % * * 1, 2, 3, 4, 5

* If the Due Date and Return of Assessment date are blank, see the Assessment Tab for specific Assessment Task details
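As a rough illustration of how the weights in the table combine, the following Python sketch computes a total course mark. The weights come from the table above and the "two best quizzes" rule from the class schedule. Everything else is an assumption made for the example: the function and dictionary names are hypothetical, all marks are taken to be on a 0 to 100 scale, a missing quiz counts as zero, and lab credit is treated as all-or-nothing once 4 of the 6 labs are attended. The actual calculation used by the course may differ.

```python
# Weights (percent of the total course mark) from the assessment summary table.
WEIGHTS = {
    "exploration": 10,
    "cleaning": 15,
    "linkage": 20,
    "quizzes": 3,
    "labs": 2,
    "exam": 50,
}


def quiz_component(quiz_marks):
    """Average of the two best quiz marks (assumed 0-100 scale).

    Assumption: with fewer than two attempts, the missing quiz counts as 0.
    """
    best_two = sorted(quiz_marks, reverse=True)[:2]
    return sum(best_two) / 2


def lab_component(labs_attended):
    """Assumption: full lab credit once 4 of the 6 labs are attended, else none."""
    return 100.0 if labs_attended >= 4 else 0.0


def total_mark(marks, quiz_marks, labs_attended):
    """Weighted total; `marks` holds exploration, cleaning, linkage and exam marks."""
    components = dict(marks)
    components["quizzes"] = quiz_component(quiz_marks)
    components["labs"] = lab_component(labs_attended)
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


# Made-up marks for a hypothetical student.
example = total_mark(
    {"exploration": 80, "cleaning": 70, "linkage": 60, "exam": 50},
    quiz_marks=[40, 90, 70],
    labs_attended=5,
)
print(round(example, 1))
```

With these made-up inputs the quiz component is the average of 90 and 70, and the weighted total comes to 59.9.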

Policies

ANU has educational policies, procedures and guidelines, which are designed to ensure that staff and students are aware of the University’s academic standards, and implement them. Students are expected to have read the Academic Misconduct Rule before the commencement of their course. Other key policies and guidelines include:

Assessment Requirements

The ANU is using Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. For additional information regarding Turnitin please visit the Academic Skills website. In rare cases where online submission using Turnitin software is not technically possible, or where not using Turnitin software has been justified by the Course Convener and approved by the Associate Dean (Education) on the basis of the teaching model being employed, students shall submit assessment online via ‘Wattle’ outside of Turnitin, or failing that in hard copy, or through a combination of submission methods as approved by the Associate Dean (Education). The submission method is detailed below.

Moderation of Assessment

Marks that are allocated during Semester are to be considered provisional until formalised by the College examiners meeting at the end of each Semester. If appropriate, some moderation of marks might be applied prior to final results being released.

Participation

You are expected to go to every laboratory session.

Assessment Task 1

Value: 10 %
Due Date: 30/08/2024
Return of Assessment: 27/09/2024
Learning Outcomes: 1, 2, 4

Practical data exploration

This assessment covers the topics of data quality, data exploration, and data profiling as presented in the first few weeks of the course. It also includes questions about what data wrangling is, why it is important, and how it fits into the broader field of data analytics.
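As an informal illustration of what data profiling involves, the sketch below counts missing and distinct values per column using only the Python standard library. The sample data, the set of missing-value markers, and the function name are all made up for this example and are not taken from the assignment.

```python
import csv
import io
from collections import Counter

# A tiny made-up data set with some typical quality problems.
RAW = """id,age,postcode
1,34,2600
2,,2601
3,n/a,2600
4,29,
"""

# Assumption: these strings are treated as markers for missing values.
MISSING = {"", "n/a", "na", "null"}


def profile(rows):
    """Return per-column counts of missing and distinct values."""
    report = {}
    for column in rows[0].keys():
        values = [row[column].strip() for row in rows]
        present = [v for v in values if v.lower() not in MISSING]
        report[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return report


rows = list(csv.DictReader(io.StringIO(RAW)))
report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

In practice the same summary is usually obtained with Pandas or Rattle, as used in the labs; the standard-library version is shown only to keep the example self-contained.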

Assessment Task 2

Value: 15 %
Due Date: 27/09/2024
Return of Assessment: 11/10/2024
Learning Outcomes: 1, 2, 4

Practical data cleaning

This assessment covers the topics of data integration and data cleaning, with a focus on identifying possible data quality problems in data sets and taking the necessary steps to correct them. It will reflect real-world data cleaning, where students will need to make decisions based on a final data analysis task.

Assessment Task 3

Value: 20 %
Due Date: 18/10/2024
Return of Assessment: 01/11/2024
Learning Outcomes: 1, 2, 3, 4, 5

Practical record linkage

This assessment covers the topic of record linkage, with a focus on identifying and applying appropriate record linkage techniques in each step of the process: blocking, comparison, classification, and evaluation. Students will work with different data sets in this assessment.
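The four steps named above can be sketched in a few lines of Python. This is a minimal illustration, not the assignment's framework: the records, the ground-truth matches, the first-letter blocking key, the use of `difflib.SequenceMatcher` as the string comparison function, and the 0.75 threshold are all assumptions chosen to keep the example small.

```python
from difflib import SequenceMatcher
from itertools import product

# Two tiny made-up data sets: (record id, surname, given name).
DATA_A = [("a1", "smith", "john"), ("a2", "miller", "anna"), ("a3", "meyer", "peter")]
DATA_B = [("b1", "smyth", "john"), ("b2", "miller", "anne"), ("b3", "taylor", "mary")]
TRUE_MATCHES = {("a1", "b1"), ("a2", "b2")}  # hypothetical ground truth


def block_key(record):
    """Blocking: records are only compared if their surnames share a first letter."""
    return record[1][0]


def candidate_pairs(data_a, data_b):
    """Keep only the record pairs that fall into the same block."""
    return [(a, b) for a, b in product(data_a, data_b) if block_key(a) == block_key(b)]


def similarity(a, b):
    """Comparison: average approximate string similarity over both name fields."""
    sims = [SequenceMatcher(None, a[i], b[i]).ratio() for i in (1, 2)]
    return sum(sims) / len(sims)


def classify(pairs, threshold=0.75):
    """Classification: a pair is a match if its similarity reaches the threshold."""
    return {(a[0], b[0]) for a, b in pairs if similarity(a, b) >= threshold}


def evaluate(predicted, true_matches):
    """Evaluation: precision and recall against the known true matches."""
    true_positives = len(predicted & true_matches)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    return precision, recall


pairs = candidate_pairs(DATA_A, DATA_B)
predicted = classify(pairs)
precision, recall = evaluate(predicted, TRUE_MATCHES)
print(len(pairs), sorted(predicted), precision, recall)
```

Blocking reduces the nine possible record pairs to three candidates, and the threshold classifier then keeps the two pairs with similar names. Real data sets need more careful choices at every step, which is what the lectures and Christen (2012) cover.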

Assessment Task 4

Value: 3 %
Learning Outcomes: 1, 2, 3, 4

Online Quizzes

Online Quizzes will cover topics of data quality, data exploration, data profiling, data integration, data cleaning, and record linkage.

Assessment Task 5

Value: 2 %
Learning Outcomes: 1, 2, 4, 5

In-person Labs

Labs will cover topics of data quality, data exploration, data profiling, data cleaning, and record linkage.

Assessment Task 6

Value: 50 %
Learning Outcomes: 1, 2, 3, 4, 5

Final examination

The final examination will cover all content taught during the course.

Hurdle - Students must obtain a final exam mark of at least 45% AND a total mark over 50% to pass the course.
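The hurdle combines two conditions, which can be written as a small check. This sketch reads the wording literally: "at least 45%" as greater than or equal to 45, and "over 50%" as strictly greater than 50. Whether a total of exactly 50 passes is an assumption here and should be confirmed with the convener. The function name is hypothetical.

```python
def passes_course(exam_mark, total_mark):
    """Return True if both hurdle conditions hold (marks in percent).

    Assumption: "over 50%" is read literally as strictly greater than 50.
    """
    return exam_mark >= 45 and total_mark > 50


# A high total mark does not compensate for an exam mark below the hurdle.
print(passes_course(exam_mark=44, total_mark=72))
print(passes_course(exam_mark=45, total_mark=51))
```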

Academic Integrity

Academic integrity is a core part of our culture as a community of scholars. At its heart, academic integrity is about behaving ethically. This means that all members of the community commit to honest and responsible scholarly practice and to upholding these values with respect and fairness. The Australian National University commits to embedding the values of academic integrity in our teaching and learning. We ensure that all members of our community understand how to engage in academic work in ways that are consistent with, and actively support, academic integrity. The ANU expects staff and students to uphold high standards of academic integrity and act ethically and honestly, to ensure the quality and value of the qualification that you will graduate with. The University has policies and procedures in place to promote academic integrity and manage academic misconduct. Visit the Academic honesty & plagiarism website for more information about academic integrity and what the ANU considers academic misconduct. The ANU offers a number of services to assist students with their assignments, examinations, and other learning activities. The Academic Skills and Learning Centre offers a number of workshops and seminars that you may find useful for your studies.

Online Submission

The ANU uses Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. While the use of Turnitin is not mandatory, the ANU highly recommends Turnitin is used by both teaching staff and students. For additional information regarding Turnitin please visit the ANU online website.

Hardcopy Submission

None. All assessment submissions are electronic through Wattle.

Late Submission

No submission of assessment tasks without an extension after the due date will be permitted. If an assessment task is not submitted by the due date, a mark of 0 will be awarded.

Referencing Requirements

The Academic Skills website has information to assist you with your writing and assessments. The website includes information about Academic Integrity including referencing requirements for different disciplines. There is also information on Plagiarism and different ways to use source material. Any use of artificial intelligence must be properly referenced. Failure to properly cite use of Generative AI will be considered a breach of academic integrity.

Extensions and Penalties

Extensions and late submission of assessment pieces are covered by the Student Assessment (Coursework) Policy and Procedure. The Course Convener may grant extensions for assessment pieces that are not examinations or take-home examinations. If you need an extension, you must request an extension in writing on or before the due date. If you have documented and appropriate medical evidence that demonstrates you were not able to request an extension on or before the due date, you may be able to request it after the due date.

Privacy Notice

The ANU has made a number of third-party online databases available for students to use. Use of each online database is conditional on student end users first agreeing to the database licensor’s terms of service and/or privacy policy. Students should read these carefully. In some cases student end users will be required to register an account with the database licensor and submit personal information, including their: first name; last name; ANU email address; and other information. In cases where student end users are asked to submit ‘content’ to a database, such as an assignment or short answers, the database licensor may only use the student’s ‘content’ in accordance with the terms of service, including any (copyright) licence the student grants to the database licensor. Any personal information or content a student submits may be stored by the licensor, potentially offshore, and will be used to process the database service in accordance with the licensor’s terms of service and/or privacy policy. If any student chooses not to agree to the database licensor’s terms of service or privacy policy, the student will not be able to access and use the database. In these circumstances students should contact their lecturer to enquire about alternative arrangements that are available.

Distribution of grades policy

The Academic Quality Assurance Committee monitors the performance of students, including attrition, further study and employment rates and grade distribution, and College reports on quality assurance processes for assessment activities, including alignment with national and international disciplinary and interdisciplinary standards, as well as qualification type learning outcomes. Since first semester 1994, ANU has used a grading scale for all courses. This grading scale is used by all academic areas of the University.

Support for students

The University offers students support through several different services. You may contact the services listed below directly or seek advice from your Course Convener, Student Administrators, or your College and Course representatives (if applicable).

  • ANU Health, safety & wellbeing for medical services, counselling, mental health and spiritual support
  • ANU Accessibility for students with a disability or ongoing or chronic illness
  • ANU Dean of Students for confidential, impartial advice and help to resolve problems between students and the academic or administrative areas of the University
  • ANU Academic Skills supports you to make your own decisions about how you learn and manage your workload.
  • ANU Counselling promotes, supports and enhances mental health and wellbeing within the University student community.
  • ANUSA supports and represents all ANU students
Charini Nanayakkara
comp3430@anu.edu.au
Research Interests: Record linkage, Machine learning, Data mining
By Appointment

Charini Nanayakkara
u6507558@anu.edu.au
Research Interests: Record linkage, Machine learning, Data mining
By Appointment

Prof Graham Williams
u8303784@anu.edu.au
