QtCS2024 AI tooling for Qt developers
==Session Summary==
Qt already has a network of bots that augment development workflows: the Cherry-Pick Bot, the Submodule Update Bot, the Flake8 Bot for Python, and so on.
With the rise of LLMs, a couple of bots in this vein have been implemented with good success, even considering their limitations.
* API Header Review Bot: Identifies changes to public headers, summarizes them, and flags the change for review before the next release. (A sketch of this flow follows the list.)
** Uses GPT-4 for analysis. Results are generally good, but in its current state the inputs are not comprehensive and do not represent a full "API change" across multiple change reviews.
** Useful enough to at least flag changes.
* CI Failure Analysis Bot: Analyzes the failure log, test sources, and change diff to determine whether the change caused the failure. May suggest fixes when they are obvious.
** Very good results during a proof-of-concept trial run in the Qt Company's H2 2024 bugfix sprint.
** Estimated accuracy of at least 90% in determining whether a change caused the CI failure, based on manual sampling and review of the outputs.
** Identifies infrastructure issues as the cause of a failure.
** Identifies flaky tests as the cause of a failure.
** Limited by the 128k-token context window, which covers all but the largest changes.
** Limited to analyzing atomic changes; it cannot take in a full relation chain or topic.
** Sometimes blames multiple changes for a failure, with ambiguous analysis results, but even so it is usually correct that the changes are related.
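For illustration, here is a minimal sketch of what the header-review step could look like, assuming the OpenAI Python client and a plain unified diff as input. The filtering heuristic and the prompt are guesses, not the bot's actual implementation, and the Gerrit plumbing (fetching the diff, posting the summary back) is omitted.

<syntaxhighlight lang="python">
# Hedged sketch: filter a change's diff down to public-header hunks and ask
# an LLM to summarize them. Not the real bot's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def public_header_hunks(diff_text: str) -> str:
    """Keep only hunks touching public headers (Qt private headers end in _p.h)."""
    kept, keep = [], False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            path = line.rsplit(" b/", 1)[-1]
            keep = path.endswith(".h") and not path.endswith("_p.h")
        if keep:
            kept.append(line)
    return "\n".join(kept)


def summarize_api_change(diff_text: str) -> str | None:
    header_diff = public_header_hunks(diff_text)
    if not header_diff:
        return None  # no public API touched, nothing to flag
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You review C++ API changes. Summarize what changed in "
                        "the public API and flag anything that needs review "
                        "before the next release."},
            {"role": "user", "content": header_diff},
        ],
    )
    return response.choices[0].message.content
</syntaxhighlight>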
==Session Owners==
Daniel Smith
==Notes==
Daniel has been working with LLMs and the Qt review systems.
The session gave a primer on Large Language Models, along with Daniel's findings and learnings.
Use Case 1: API Change Identification. In production, it will ask you to give feedback on a particular bug ticket.
Use Case 2: CI Failure Analysis (work in progress).
Prompt engineering: leveraging JSON structure to get the LLM to provide a useful analysis (see the sketch after these notes).
Looking for ways to provide feedback on the ticket.
Looking for new ideas; they must be non-intrusive and must not have a high false-positive rate.
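As an illustration of the JSON-structured approach, here is a hedged sketch assuming an OpenAI-style chat API with JSON output mode; the field names and output schema are illustrative, not the bot's actual ones. The truncation guard reflects the 128k-token context limitation mentioned above.

<syntaxhighlight lang="python">
# Hedged sketch: bundle the inputs as JSON and ask for a JSON verdict. Field
# names and the output schema are illustrative guesses, not the real schema.
import json
from openai import OpenAI

client = OpenAI()
MAX_LOG_CHARS = 200_000  # crude guard to stay within a 128k-token context window

SYSTEM_PROMPT = (
    "You are given a CI failure log, the failing test's source, and the change "
    "diff, all packed into one JSON object. Decide whether the change caused "
    "the failure. Answer as JSON with the keys: change_caused_failure (bool), "
    "likely_cause (one of 'change', 'flaky_test', 'infrastructure'), "
    "reasoning (string), suggested_fix (string or null)."
)


def analyze_ci_failure(failure_log: str, test_source: str, change_diff: str) -> dict:
    payload = json.dumps({
        "failure_log": failure_log[-MAX_LOG_CHARS:],  # keep the tail, where errors usually are
        "test_source": test_source,
        "change_diff": change_diff,
    })
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumption: a 128k-context model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": payload},
        ],
    )
    return json.loads(response.choices[0].message.content)
</syntaxhighlight>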
==Q and A and Discussion==
A RAG (retrieval-augmented generation) database would help provide better contextual knowledge; a sketch of the idea follows.
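This is a minimal sketch of what such a RAG lookup might look like, using OpenAI embeddings and a toy in-memory index. The corpus, models, and similarity search are all illustrative assumptions; no such Qt component exists.

<syntaxhighlight lang="python">
# Hedged sketch of a RAG step: embed past failure analyses, then retrieve the
# closest ones as extra context for the prompt. Everything here is a toy.
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> list[float]:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return result.data[0].embedding


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm


# Toy "database": previously resolved failures and their root causes.
corpus = [
    "tst_qnetworkreply timeout on macOS: infrastructure issue, proxy host was down",
    "tst_qwidget intermittent failure on Wayland: known flaky test on that platform",
]
index = [(doc, embed(doc)) for doc in corpus]


def retrieve_context(failure_log: str, k: int = 2) -> list[str]:
    query = embed(failure_log)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]  # prepend these to the analysis prompt
</syntaxhighlight>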
Discussion about training the models with feedback (using CI failure analysis as the example); the fine-tuning approach has not shown much promise so far.
How to minimize hallucinations, etc.

[[File:LLMs In the Qt Development Environment.pdf|thumb|453x453px|Slide deck from the session]]
[[Category:QtCS2024]]