Processing of semi-structured documents such as internet pages, Really Simple Syndication (RSS) feeds and their accompanying news items, and Portable Document Format (PDF) brochures is considered from the perspective of interpreting the content. This course considers the document and its various genres as a fundamental object for business, government and community. For this, the course covers four broad areas:
1. Information Retrieval (IR),
2. Natural Language Processing (NLP),
3. Machine/Deep Learning (M/DL) for documents, and
4. relevant tools for the Web.
The course covers basic tasks in these areas, including content collection and extraction, formal and informal NLP, Information Extraction (IE), IR, classification, and analysis. Fundamental probabilistic techniques for performing these tasks and some common software systems will be covered, though no area will be covered in great depth.
(The older version: This course considers the “document” and its various genres as a fundamental object for business, government and community, such as web pages, social media feeds, news items, and PDF brochures. The goal is to introduce concepts and hands-on tools for automated understanding of large amounts of text. For this, the course covers four broad areas: (A) information retrieval, (B) natural language processing, (C) machine learning for documents, and (D) relevant tools for the web. Tasks include content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks, and some common software systems will be covered, though no area will be covered in great depth.)
Learning Outcomes
Upon successful completion, students will have the knowledge and skills to:
- differentiate between the basic probabilistic theories of language, document structure, and their applications to text generation and analysis.
- apply, analyse, and evaluate methods for natural language processing, word representation, document feature engineering, information retrieval, text classification, text clustering, and language modelling.
- create automated workflows for document analysis, information retrieval, and text classification tasks by developing and deploying typical algorithms, code libraries, and software (e.g., for natural language processing and machine/deep learning).
- demonstrate intermediate proficiency at designing, justifying, conducting, and reporting on such analysis and evaluation experiments at document and corpus levels.
Indicative Assessment
- Assignments (45%) [LO 1,2,3,4]
- Written final exam (55%) [LO 1,2,3,4]
The ANU uses Turnitin to enhance student citation and referencing techniques, and to assess assignment submissions as a component of the University's approach to managing Academic Integrity. While the use of Turnitin is not mandatory, the ANU highly recommends Turnitin is used by both teaching staff and students. For additional information regarding Turnitin please visit the ANU Online website.
Workload
Lectures, laboratory sessions and self-study to a total of 130 hours.
Inherent Requirements
None
Requisite and Incompatibility
Prescribed Texts
The following reference books will be used.
- Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze, Cambridge University Press, 2008.
- Foundations of Statistical Natural Language Processing, C.D. Manning and H. Schütze, MIT Press, 1999.
Fees
Tuition fees are for the academic year indicated at the top of the page.
Commonwealth Support (CSP) Students
If you have been offered a Commonwealth supported place, your fees are set by the Australian Government for each course. At ANU 1 EFTSL is 48 units (normally 8 x 6-unit courses). More information about your student contribution amount for each course is available at Fees.
- Student Contribution Band: 2
- Unit value: 6 units
If you are a domestic graduate coursework student with a Domestic Tuition Fee (DTF) place or an international student, you will be required to pay course tuition fees (see below). Course tuition fees are indexed annually. Further information for domestic and international students about tuition and other fees can be found at Fees.
Where there is a unit range displayed for this course, not all unit options below may be available.
Units | EFTSL |
---|---|
6.00 | 0.12500 |
Course fees
- Domestic fee paying students
Year | Fee |
---|---|
2025 | $5280 |
- International fee paying students
Year | Fee |
---|---|
2025 | $6720 |
Offerings, Dates and Class Summary Links
ANU utilises MyTimetable to enable students to view the timetable for their enrolled courses, browse, then self-allocate to small teaching activities / tutorials so they can better plan their time. Find out more on the Timetable webpage.
Class summaries, if available, can be accessed by clicking on the View link for the relevant class number.
Second Semester
Class number | Class start date | Last day to enrol | Census date | Class end date | Mode Of Delivery | Class Summary |
---|---|---|---|---|---|---|
8841 | 21 Jul 2025 | 28 Jul 2025 | 31 Aug 2025 | 24 Oct 2025 | In Person | N/A |