Transport for NSW – Customer Success Story
Data-Driven helps TfNSW build the data foundation for an Operational Data Lake with self-service capability to enable advanced analytics previously not possible.
Transport for NSW (TfNSW) manages one of the largest fleets of vehicles in Australia. This includes Sydney CBD and NSW regional Buses, Ferries, Light Rail, Trains and Metro. The near real-time telemetry generated by these vehicles provides analytical opportunities to improve transport services by measuring service performance, optimising routes and reducing trip delays. However, the sheer volume of data presents data management challenges, which in turn make the data difficult to mine.
Data-Driven partnered with TfNSW to deliver the data foundation for the Operational Data Lake platform with self-service and advanced analytics capabilities previously not possible. Both internal and external users can use this platform to perform advanced analytics and analyse historical data, looking for opportunities to reduce trip delays, improve customer service and measure operator service performance.
Historical GTFS data has always been too large and costly to store and analyse efficiently. Every TfNSW vehicle sends its position and other telemetry every 10 seconds, which results in a huge stockpile of data.
In the past, this GTFS realtime data was not kept anywhere; at any given time, only the latest copy was published on the Transport Open Data Hub. This made it impossible to obtain insights from past services and ruled out valuable capabilities such as predicting trip delays, optimising trip routes, and reporting on and analysing service performance to improve customer services.
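To give a sense of scale, the 10-second reporting interval can be turned into a rough message count. The sketch below is illustrative only: the fleet size of 1,000 vehicles is a hypothetical figure (the actual fleet size is not stated here), and it assumes every vehicle reports around the clock, so the result is an upper bound rather than a measured volume.

```python
SECONDS_PER_DAY = 24 * 60 * 60
REPORT_INTERVAL_S = 10  # reporting interval stated in the text


def reports_per_day(vehicles: int, interval_s: int = REPORT_INTERVAL_S) -> int:
    """Upper bound on position reports per day, assuming 24-hour operation."""
    return vehicles * (SECONDS_PER_DAY // interval_s)


def reports_per_year(vehicles: int, interval_s: int = REPORT_INTERVAL_S,
                     days: int = 365) -> int:
    """Upper bound on position reports per year under the same assumption."""
    return reports_per_day(vehicles, interval_s) * days


if __name__ == "__main__":
    hypothetical_fleet = 1_000  # assumption for illustration, not a TfNSW figure
    print(reports_per_day(1))                    # one vehicle: 8640 reports/day
    print(reports_per_day(hypothetical_fleet))   # 8640000 reports/day
    print(reports_per_year(hypothetical_fleet))  # 3153600000 reports/year
```

Even under these simplified assumptions, a single vehicle produces 8,640 reports a day, which is why retaining history quickly becomes a storage and cost problem.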
Key Operational Data-related challenges to overcome:
TfNSW’s vision was to build an Operational Data Lake (ODL), a unified next-generation data and analytics platform, leveraging native Azure Cloud services and open source Big Data technologies, e.g. Databricks and Spark. The ODL platform service offerings include, but are not limited to:
The requirements for the GTFS data self-service analytics were:
The Operational Data Lake Data Foundation solution by Data-Driven was the perfect starting point for TfNSW as it was designed to be highly scalable and extensible while providing a platform for citizen data scientists and business users to perform advanced analytics.
Modern data platform capable of ingesting and storing very large volumes of operational data, in real time or in batch, in a well-governed, secure and cost-effective manner
Real-time positions and telemetry for every TfNSW mode are now tracked and stored for analysis
Self-service analytics and data-sharing on those operational data sets can be done by internal and public users
TfNSW data scientists and analysts can perform advanced analytics and machine learning on operational data
The business value will grow further as machine learning and analysis are applied to the data. Here are some examples of insights that are now possible with the right data:
The architecture uses native Microsoft Azure technologies to reduce the learning curve for operators of the data platform, which also makes it easy to extend for new use cases as they come onboard. Azure Data Factory and Azure Functions are used to ingest data, with the choice depending on the data access method and the frequency of incoming data. Storage and data life cycle management are handled by Azure Data Lake Storage Gen2 components.
Azure Databricks is the compute engine used to transform millions of IoT files into usable Big Data within the cost and performance constraints. The Unified Data Platform workspaces of Databricks were the perfect solution for the self-service capability, allowing internal data analysts to explore data and run experiments in a secure and controlled manner. Governance around cluster use, data access, data management and security is handled by Azure Databricks RBAC to ensure each user sees only the data they are meant to see.
A key technical challenge to overcome was storing the millions of JSON files generated per day by the IoT devices on each vehicle. Delta Lake was used to process the raw operational data and to provide data integrity, ACID transactions and data versioning, adding a governance layer to the Data Lake. The use of Delta Lake and Databricks allows “Hot” analytics to be performed on the Delta Lake with a rapid response time across vast amounts of data.
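The small-file problem described above can be illustrated without any cluster. The following is a minimal sketch in plain Python, not the Delta Lake or Databricks implementation used on the platform: it groups many small JSON messages by day and appends each group to a single newline-delimited file per date partition. The field names `vehicle_id` and `timestamp` (epoch seconds), the `date=YYYY-MM-DD` folder layout and the `part-0000.ndjson` file name are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Iterable, List


def partition_key(record: dict) -> str:
    """Derive a UTC date partition (YYYY-MM-DD) from the record's epoch timestamp."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return ts.strftime("%Y-%m-%d")


def compact(raw_messages: Iterable[str], out_dir: Path) -> Dict[str, int]:
    """Group small JSON messages by day and append each group to one NDJSON file.

    Returns the number of records written per date partition.
    """
    groups: Dict[str, List[dict]] = defaultdict(list)
    for raw in raw_messages:
        record = json.loads(raw)
        groups[partition_key(record)].append(record)

    counts: Dict[str, int] = {}
    for day, records in groups.items():
        part_dir = out_dir / f"date={day}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # One larger file per partition instead of one tiny file per message.
        with open(part_dir / "part-0000.ndjson", "a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        counts[day] = len(records)
    return counts
```

Calling `compact(messages, Path("out"))` with a list of JSON strings returns a mapping from each date to the number of records written. A table format such as Delta Lake adds what this sketch lacks, namely a transaction log, ACID guarantees and versioning on top of the compacted files.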
Transport for NSW is a government-run enterprise responsible for the delivery and development of safe, integrated and efficient transport systems for the people of NSW, including transport planning, strategy, policy, procurement and other non-service delivery functions across all modes of transport: Buses, Ferries, Light Rail, Trains and Metro.
Transport for NSW works hand-in-hand with operating agencies, private operators and industry partners to deliver customer-focused services and projects in order to make NSW a better place to live, work and visit.