Performance and benchmarking

We understand that benchmarking and monitoring performance are not just routine processes; they are essential to ensuring the safety and effectiveness of our AI-driven educational tools. Benchmarking is crucial for:

  • Ensuring Safe AI: By continuously monitoring performance, we can promptly identify and address any issues, ensuring that our AI remains a safe and reliable educational resource.

  • Quality Improvement: Performance metrics provide our team with valuable insights into areas where we can enhance the quality and effectiveness of NSWEduChat, ensuring it meets the evolving needs of NSW Public Education.

Our benchmarking approach

To ensure that NSWEduChat aligns with the specific needs of NSW Public Education, we:

  • Utilise a Proprietary List of Questions: Our benchmarking process involves testing all models against a comprehensive, proprietary list of questions. These questions are tailored to reflect the diverse and dynamic curriculum of NSW Public Education.

  • Transparent Redirection: Based on each candidate model’s results in the benchmarking process above, NSWEduChat identifies the model most likely to answer the user’s query accurately. The tool’s orchestrator sends the query to the chosen model, returns its response to the user, and transparently notifies the user which model is being used. This helps optimise every interaction with NSWEduChat for educational value and user satisfaction.
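
The redirection step above can be pictured with a minimal sketch. It assumes that benchmark results are stored as a per-subject score for each candidate model and that the orchestrator simply picks the highest scorer. The source does not describe the actual selection logic, so the model names, subjects, scores, and the `choose_model` and `route` functions below are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical benchmark results: subject -> {model name -> score in [0, 1]}.
# These figures are illustrative only, not real NSWEduChat results.
BENCHMARK_SCORES: Dict[str, Dict[str, float]] = {
    "english": {"model_a": 0.90, "model_b": 0.84},
    "science": {"model_a": 0.78, "model_b": 0.88},
}


def choose_model(subject: str, scores: Dict[str, Dict[str, float]]) -> str:
    """Return the model with the highest benchmark score for the subject."""
    by_model = scores[subject]
    return max(by_model, key=by_model.get)


def route(query: str, subject: str,
          models: Dict[str, Callable[[str], str]]) -> str:
    """Send the query to the chosen model and say which model answered."""
    name = choose_model(subject, BENCHMARK_SCORES)
    answer = models[name](query)
    return f"[Answered by {name}] {answer}"


# Stand-in models that only echo the query; a real system would call an LLM.
fake_models = {
    "model_a": lambda q: f"A says: {q}",
    "model_b": lambda q: f"B says: {q}",
}
print(route("What is photosynthesis?", "science", fake_models))
```

In this sketch the notification is a simple prefix on the response; how NSWEduChat actually presents the model name to the user is not stated in the source.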

Comparison with leading GenAI chatbots

When compared to the two most popular free-to-use generative AI tools, NSWEduChat stands out, scoring higher in our benchmark:

  • NSWEduChat: 86%

  • Popular free AI A: 77%

  • Popular free AI B: 71%

Benchmarks conducted 13 August 2024.

Continuous improvement

Our journey doesn't end with achieving high benchmark scores. We are committed to ongoing improvement, ensuring that NSWEduChat remains at the forefront of educational AI technology, consistently providing high-quality, tailored educational experiences to our users.

We believe that transparency in our performance and benchmarking processes is key to building trust and ensuring the highest standards of educational excellence. By openly sharing our achievements and areas for growth, we invite our users to join us on our journey of continuous improvement and innovation in AI-driven education.

Current benchmark outcomes

The benchmarking exercises help to identify subjects where NSWEduChat may find it harder to give accurate responses. This can happen because:

  • The app may lack sufficient information on the subject.

  • The app may lack sufficient information on the context, e.g., Australia-specific questions.

  • The app responds best to questions that can be answered in many ways. It is more likely to struggle with subjects that require a single specific answer.

Users should review every output produced using generative AI to ensure accuracy. The benchmark scores help to identify subjects where extra caution should be taken when using NSWEduChat. Currently, those subjects are Maths Extension 1 and Maths Extension 2. NSWEduChat will be updated to improve these benchmark scores.
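
As a rough illustration of how benchmark scores could be used to flag subjects for extra caution, the sketch below compares each subject's score with a cut-off. The threshold, the `subjects_needing_caution` function, and every score shown are assumptions made for the example; the source does not publish per-subject scores or the rule used to flag a subject.

```python
from typing import Dict, List

# Hypothetical cut-off below which a subject is flagged for extra caution.
CAUTION_THRESHOLD = 0.70


def subjects_needing_caution(scores: Dict[str, float],
                             threshold: float = CAUTION_THRESHOLD) -> List[str]:
    """Return, in alphabetical order, subjects scoring below the threshold."""
    return sorted(s for s, score in scores.items() if score < threshold)


# Placeholder scores for illustration only; these are not published results.
example_scores = {
    "Maths Extension 1": 0.55,
    "Maths Extension 2": 0.50,
    "English Standard": 0.90,
}
print(subjects_needing_caution(example_scores))
```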

This page has recently been modified to reflect the latest information.
