Creating a problem and explaining it, not just solving it, seems like a good way to test whether an LLM (or a would-be AGI) is just a probabilistic parrot. Generating good questions demands a genuinely solid grasp of the underlying knowledge, which is a level beyond pattern matching.
If a model can pass this kind of comprehensive check, it will be that much more reliable.
Title: ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions
Summary:
Artificial intelligence (AI) could have a huge impact on traditional teaching methods in education. Several concerns have been raised, especially regarding the use of ChatGPT to write new essays. However, AI programs such as ChatGPT can also complement teaching. In this study, the authors generated 60 multiple-choice questions with ChatGPT 3.5: they uploaded author-written text and asked ChatGPT to create multiple-choice questions with explanations for both the correct and incorrect answer choices. Unfortunately, ChatGPT produced a correct answer with appropriate explanations for only 32% of the questions (19 out of 60), and in many cases it failed to provide explanations for the incorrect choices. An additional 25% of the questions had incorrect or misleading answers. In most courses, a score of 32% would be a failing grade. Despite these issues, instructors may still find ChatGPT useful for creating practice tests with explanations, with the caveat that the output may require extensive editing.
Paper: https://www.sciencedirect.com/science/article/pii/S2374289523000313
There has been a lack of evaluation criteria for clinical applications of LLMs in healthcare, so the guidance on evaluating the application of large language models in clinical research, published in JAMA in October, is a good reference. Its main points:
- The application to be evaluated should be clinically meaningful; the ideal report covers a wide range of experimental conditions related to the clinical environment rather than a single narrow area.
- The experimental conditions must be described in sufficient detail to be reproducible by a knowledgeable user. Every LLM has several adjustable parameters, such as temperature, that affect the output, even if a chatbot interface does not necessarily expose them. Moreover, the version of the model itself can change over time and must be specified.
- Multiple replications are needed to establish the reliability of the responses. Because text generation is stochastic, each run on the same prompt can produce quite different responses, so the same request should be repeated multiple times to estimate the effect or association and to measure its variability. The authors also recommend considering how much alternative prompts change the output, which can serve as a useful sensitivity analysis for establishing the robustness of the results (a minimal repeated-runs sketch is included at the end of this post).
- In general, comparisons between LLM versions are better suited to other journals. Model-to-model comparisons are essential in many AI papers, but most non-specialist readers are not interested in how much better a new model is than its predecessors, especially when models are being replaced at such a fast pace. Studies that evaluate the most up-to-date model versions will attract the most attention.
- In addition to basic accuracy metrics, the characteristics of inaccurate or flawed responses should be described. As with biomarkers, the value of an accuracy figure depends heavily on the context in which the technology is applied and on the consequences of errors. For example, a mostly accurate answer to a case presentation that recommends an alternative but reasonable treatment and a highly inaccurate answer that recommends a contraindicated treatment can lead to very different outcomes. The authors recommend making an effort, through examples of such responses, to understand when and how the model produces false results.
- Bias and fairness should be evaluated. As with other forms of AI, models trained on broad swaths of the internet are prone to giving biased responses. Despite fine-tuning efforts to suppress or minimize such responses, authors should examine whether modifying prompts to reflect individual characteristics produces different results (a sketch of such a perturbation check is at the end of this post).
- Confidentiality should be protected. One of the most promising applications of LLMs may be the summarization and interpretation of clinical data. As has been pointed out in the context of manuscript and grant review, simply uploading such data to a third party constitutes a breach of confidentiality. There are ways to maintain appropriate protection (e.g., hosting the LLM within an institutional firewall), but they require further consideration and investment in resources. Authors who apply LLMs to clinical data should report how confidentiality was maintained.
- As Turing advised, where possible, authors should compare LLM results against an existing expert reference standard rather than simply describing the model output (a simple agreement calculation is sketched at the end of this post). It is strongly recommended that these reference standards be provided as part of the publication, in line with the journal's other data-sharing policies. If the reference standard comes from clinicians or other human participants, authors should carefully consider whether human-participants review is required; in many cases, the individuals who established the reference standard may themselves be considered study participants. Even if the researchers believe the study would likely be considered exempt, that determination belongs to the institutional review board.
Paper: https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2809977
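
To make the reproducibility and repeated-runs points above concrete, here is a minimal sketch that pins the sampling parameters, records the model version the API reports back, and reruns the same prompt several times to gauge response variability. It assumes the OpenAI Python client (openai >= 1.0) with an OPENAI_API_KEY in the environment; the prompt, model name, temperature, and run count are illustrative choices, not the JAMA authors' protocol.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Write one multiple-choice question (A-D) about hypertension and state the correct letter."
MODEL = "gpt-3.5-turbo"   # report the exact model name/version used
TEMPERATURE = 0.7         # report every sampling parameter you set
N_RUNS = 10               # repeated runs to estimate variability

runs = []
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": PROMPT}],
    )
    runs.append({
        "model_version": resp.model,  # resolved model version echoed by the API
        "text": resp.choices[0].message.content.strip(),
    })

# Crude variability measure: how many distinct responses did the same prompt produce?
distinct = Counter(r["text"] for r in runs)
print(f"model versions seen: {sorted({r['model_version'] for r in runs})}")
print(f"{len(distinct)} distinct responses across {N_RUNS} runs")
```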
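For the bias and fairness point, one simple check is to vary only a patient descriptor in an otherwise identical prompt and compare the answers. This is a sketch under the same assumptions as above (OpenAI Python client); the case vignette and descriptor list are made up for illustration.

```python
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "A 55-year-old {descriptor} patient presents with chest pain and shortness of breath. "
    "What is the most appropriate next diagnostic step?"
)
DESCRIPTORS = ["male", "female", "Black", "white", "Hispanic"]  # attributes to perturb

answers = {}
for d in DESCRIPTORS:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # low temperature so differences reflect the prompt, not sampling noise
        messages=[{"role": "user", "content": TEMPLATE.format(descriptor=d)}],
    )
    answers[d] = resp.choices[0].message.content.strip()

# Review (manually or with a rubric) whether recommendations differ systematically
# across descriptors; consistent divergence flags a potential fairness problem.
for d, text in answers.items():
    print(f"--- {d} ---\n{text}\n")
```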
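Finally, for comparing model output against an expert reference standard, a chance-corrected agreement statistic such as Cohen's kappa is one reasonable option (the JAMA piece does not prescribe a specific statistic; this is just an illustration, and the answer keys below are hypothetical).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical answer keys on the same set of cases
expert_reference = ["A", "C", "B", "D", "A", "B", "C", "D"]
model_answers    = ["A", "C", "D", "D", "A", "C", "C", "D"]

kappa = cohen_kappa_score(expert_reference, model_answers)
print(f"Agreement with the expert reference (Cohen's kappa): {kappa:.2f}")
```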