We are aiming for an incremental return to campus in accordance with guidelines provided by NSW Health and the Australian Government. Until this time, learning activities and assessments will be planned and scheduled for online delivery where possible, and unit-specific details about face-to-face teaching will be provided on Canvas as the opportunities for face-to-face learning become clear.

Unit of study

DATA3404: Data Science Platforms

This unit of study provides a comprehensive overview of the internal mechanisms of data science platforms and of the systems that manage large data collections. These skills are needed for successful performance tuning and for understanding the scalability challenges faced when processing Big Data. This unit builds upon the second-year unit DATA2001 'Data Science - Big Data and Data Diversity' and correspondingly assumes a sound understanding of SQL and data analysis tasks. The first part of this unit focuses on mechanisms for large-scale data management. It provides a deep understanding of the internal components of a data management platform. Topics include: physical data organisation and disk-based index structures, query processing and optimisation, and database tuning. The second part focuses on the large-scale management of big data in a distributed architecture. Topics include: distributed and replicated databases, information retrieval, data stream processing, and web-scale data processing. The unit will be of interest to students seeking an introduction to data management tuning, disk-based data structures and algorithms, and information retrieval. It will be valuable to those pursuing careers as Software Engineers, Data Engineers, Database Administrators, and Big Data Platform specialists.

Code DATA3404
Academic unit Computer Science
Credit points 6
Prerequisites: DATA2001 OR DATA2901 OR ISYS2120 OR INFO2120 OR INFO2820
Prohibitions: INFO3504 OR INFO3404
Assumed knowledge:
This unit of study assumes that students have previous knowledge of database structures and of SQL. The prerequisite material is covered in DATA2001 or ISYS2120. Familiarity with a programming language (e.g. Java or C) is also expected.

At the completion of this unit, you should be able to:

  • LO1. demonstrate experience in using and tuning data science platforms
  • LO2. understand different physical data organisations including data partitioning and data replication
  • LO3. understand disk-based indexing structures such as B-Trees, extensible hashing and bitmap indexes
  • LO4. understand the principles of query processing and query optimisation
  • LO5. understand the principles of (distributed) data science platforms
  • LO6. understand data sharding algorithms and data replication protocols
  • LO7. make effective physical data design decisions
  • LO8. identify a performance problem and effectively tune the performance of a (distributed) data processing system

Unit outlines

Unit outlines will be available two weeks before the first day of teaching for 1000-level and 5000-level units, or one week before the first day of teaching for all other units.

There are no unit outlines available online for previous years.