Case Study: Enabling self-service and advanced analytics for TfNSW on Azure

Empowering Transport for NSW with a self-service analytics platform on Microsoft Azure for faster insights and improved decision-making.

Executive Summary

Data-Driven helps TfNSW build the data foundation for an Operational Data Lake with self-service capability, enabling advanced analytics that were not previously possible.

Transport for NSW (TfNSW) manages one of the largest fleets of vehicles in Australia, covering Sydney CBD and NSW regional buses, ferries, light rail, trains, and metro. The near real-time telemetry generated by these vehicles provides analytical opportunities to improve transport services by measuring service performance, optimizing routes, and reducing trip delays. However, the sheer volume of data makes it difficult to manage and to mine.

Data-Driven partnered with TfNSW to deliver the data foundation for the Operational Data Lake platform, with self-service and advanced analytics capabilities that were not previously possible. Both internal and external users can use this platform to perform advanced analytics on historical data and identify opportunities to reduce trip delays, improve customer service, and measure operator service performance.


“TfNSW needed a solution to capture real-time data for every vehicle in motion across the state. This solution just gives us that so that we mine nuggets from this data at a later date. We now have an ability to self-service without waiting for someone else to curate operational data.”

Sandeep Mathur
Program Manager, Transport for NSW

Challenges

Historical operational data too large and costly to store efficiently and analyse.

Historical GTFS (General Transit Feed Specification) data has always been too large and costly to store efficiently and analyze. Every TfNSW vehicle sends its position and other telemetry every 10 seconds, which results in a huge stockpile of data.

In the past, this GTFS Realtime data was not retained; at any given time, only the latest copy is published on the Transport Open Data Hub. This made it impossible to analyze past services, ruling out valuable insights such as predicting trip delays, optimizing trip routes, and reporting on service performance to improve customer service.
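Because only the latest copy of the feed is published, retaining history means saving every snapshot under a name that a later snapshot cannot overwrite. The following is a minimal sketch of that idea, not the TfNSW implementation: the function name `snapshot_path`, the `raw/gtfs-realtime` root folder, and the `date=`/`hour=` folder layout are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def snapshot_path(mode: str, fetched_at: datetime,
                  root: str = "raw/gtfs-realtime") -> PurePosixPath:
    """Return a date-partitioned path for one feed snapshot.

    The folder layout and file name are illustrative assumptions,
    not the layout used by TfNSW.
    """
    if fetched_at.tzinfo is None:
        raise ValueError("fetched_at must be timezone-aware")
    ts = fetched_at.astimezone(timezone.utc)  # normalise to UTC
    return (PurePosixPath(root) / mode
            / f"date={ts:%Y-%m-%d}" / f"hour={ts:%H}"
            / f"vehiclepos_{ts:%Y%m%dT%H%M%SZ}.json")


# Two snapshots fetched 10 seconds apart land in different files,
# so the later one never overwrites the earlier one.
first = snapshot_path("buses", datetime(2020, 3, 1, 9, 30, 0, tzinfo=timezone.utc))
second = snapshot_path("buses", datetime(2020, 3, 1, 9, 30, 10, tzinfo=timezone.utc))
print(first)
print(second)
```

Partitioning by date and hour is one common choice because it lets later queries read only the time range they need.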

Key operational data-related challenges to overcome:

  • There is no centralized, scalable data platform to ingest, store, and process the vast volume of operational data in a cost-effective and performant way.
  • There is no platform to perform advanced analytics to mine valuable transport operational data.
  • Self-service analytics requires a technical platform, operational support, and the necessary operational data sets (including relevant master and reference data), along with defined roles and processes.
  • Sharing operational data between systems is difficult due to the organization’s low maturity in data management capabilities.

The Solution

Building an Operational Data Lake capable of ingesting and storing infinite data and allowing self-service analytics

TfNSW’s vision was to build an Operational Data Lake (ODL), a unified next-generation data and analytics platform, leveraging native Azure Cloud services and open-source Big Data technologies, e.g. Databricks and Spark.

The ODL platform service offerings include but are not limited to:
  • Continuous collection/curation of diverse transport operational data sets
  • Data management
  • Self-service analytics
  • Platform services for advanced analytics, for example, AI/ML/DS.
The requirements for the GTFS data self-service analytics were:
  • Allow both internal and external users to access historical GTFS data sets
  • Enable TfNSW data scientists to perform advanced analytics and run machine learning experiments in a cost-efficient manner on operational data
  • Support data discovery, with built-in monitoring and cost management tools
  • Ensure data privacy and security, with robust governance controls supported by the platform
  • Deliver insights to the organization in an automated, interactive, and near real-time manner
Azure Data Factory, Azure DevOps, Databricks, Power BI

Key Business Outcomes

ODL Data Foundation ready for Advanced Analytics
Operational Data Lake

Modern data platform capable of ingesting and storing infinite operational data in real time or batch in a well-governed, secure and cost-effective manner.

Infinite Cost Efficient Storage

Real-time positions and telemetry for every TfNSW mode are now tracked and stored for analysis.

Self-Service Analytics

Self-service analytics and data-sharing on those operational data sets can be done by internal and public users.

The Operational Data Lake Data Foundation solution by Data-Driven was the perfect starting point for TfNSW as it was designed to be highly scalable and extensible, while providing a platform for citizen data scientists and business users to perform advanced analytics. The true business value will grow even greater in the future as machine learning and analysis are applied to the operational data. Here are some examples of insights now possible with the right data:
  • The ability to validate insurance claims by verifying a vehicle’s exact location at a point in time in the past
  • Performing machine learning on all historical ferry routes to optimise routes and minimise delays
  • Predicting the possibility of vehicle breakdowns or trip delays by analysing historical trips
  • Understanding the effect of the weather on delays by comparing historic vehicle data with weather data
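The first example above, verifying where a vehicle was at a past point in time, reduces to a lookup over that vehicle's retained position history. The sketch below shows one way to do this in plain Python, assuming the history is already sorted by time. The `PositionReport` type, the `position_at` function, the 30-second staleness limit, and the coordinates are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional, Sequence


class PositionReport(NamedTuple):
    reported_at: datetime
    latitude: float
    longitude: float


def position_at(history: Sequence[PositionReport], when: datetime,
                max_age: timedelta = timedelta(seconds=30)) -> Optional[PositionReport]:
    """Return the latest report at or before `when`.

    `history` must be sorted by `reported_at` in ascending order.
    Returns None when no report exists at or before `when`, or when
    the latest one is older than `max_age` (an assumed limit).
    """
    times = [report.reported_at for report in history]
    index = bisect_right(times, when)  # number of reports at or before `when`
    if index == 0:
        return None
    report = history[index - 1]
    if when - report.reported_at > max_age:
        return None
    return report


# Made-up history: three reports, 10 seconds apart.
start = datetime(2020, 1, 1, 8, 0, 0, tzinfo=timezone.utc)
history = [
    PositionReport(start + timedelta(seconds=10 * k), -33.860 - 0.001 * k, 151.210)
    for k in range(3)
]
print(position_at(history, start + timedelta(seconds=14)))
```

The staleness limit matters for a claim check: if the nearest report is minutes old, the lookup returns nothing rather than a position that may no longer be accurate.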

“With the ODL platform we are able to ingest and process 500GB, millions of various data files a day, in real time and batch efficiently, which is unprecedented in NSW Transport. The ODL is a great example of building a next-generation Cloud-based data and analytics platform using native Azure services. We can deliver what we had in mind with the Azure ODL because it is a flexible, rich in services and features, high-performant and easily extensible.”


The Technical Solution

The architecture uses native Microsoft Azure technologies to reduce the learning curve for operators of the data platform, which also makes it easy to extend as new use cases come onboard. Azure Data Factory and Azure Functions ingest data, with the choice depending on the data access method and the frequency of incoming data. Storage and data life cycle management are handled by Azure Data Lake Storage Gen2 components. Azure Databricks is the compute engine used to transform millions of IoT files into usable Big Data within the cost and performance constraints.

The Unified Data Platform workspaces of Databricks were the perfect solution for the self-service capability, allowing internal data analysts to explore data and run experiments in a secure and controlled manner. Governance around cluster use, data access, data management, and security is handled by Azure Databricks RBAC controls to ensure each user sees only the data they are meant to see.

A key technical challenge to overcome was storing millions of JSON files per day from the IoT devices for each vehicle. Delta Lake was used to process the raw operational data while providing data integrity, ACID transactions, and data versioning to add a governance layer to the Data Lake. The use of Delta Lake and Databricks allows “Hot” analytics to be performed on the Delta Lake with a rapid response time across vast amounts of data.
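The small-file problem described here can be illustrated without a cluster. The sketch below is a plain-Python stand-in, not the Databricks and Delta Lake implementation: it merges many one-record JSON files into a single JSON Lines file so that later reads open one file instead of thousands. The function name `compact_json_files`, the `_source_file` lineage field, and the assumption that each file holds one JSON object are illustrative choices only.

```python
import json
from pathlib import Path


def compact_json_files(source_dir: Path, target_file: Path) -> int:
    """Merge every *.json file in `source_dir` into one JSON Lines file.

    Assumes each source file contains a single JSON object.
    Returns the number of records written.
    """
    target_file.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with target_file.open("w", encoding="utf-8") as out:
        # Sort for a deterministic record order in the output.
        for path in sorted(source_dir.glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            record["_source_file"] = path.name  # keep lineage to the raw file
            out.write(json.dumps(record, sort_keys=True) + "\n")
            count += 1
    return count
```

A table format such as Delta Lake goes further than this sketch by adding the ACID transactions and data versioning mentioned above, which a plain file merge does not provide.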

Transport for NSW

Transport for NSW is a government-run enterprise responsible for delivering and developing safe, integrated, and efficient transport systems for the people of NSW, including transport planning, strategy, policy, procurement, and other non-service delivery functions across all modes of transport: Buses, Ferries, Light Rail, Trains, and Metro. NSW Transport works hand-in-hand with operating agencies, private operators, and industry partners to deliver customer-focused services and projects to make NSW a better place to live, work, and visit.

Government

Transportation

NSW, Australia
