VILA-Lab/ATLAS

Numerous egregious issues with this paper


Here's a list of issues others and I have found with your paper, code, data, methodology, and experiment design:

  1. Issues pertaining to overall experiment design and methodology

    • Quality vs. Correctness Discrepancy - How exactly do you differentiate between these two metrics, given that the correctness of a response is correlated with its overall quality? E.g., how is it possible that you claim to find significant improvements in correctness while seeing only a partial improvement in overall quality (Principles 17, 18, and 19 in particular stick out to me: >60% improvement in correctness but <40% improvement in overall quality; Principle 1 is the most egregious, but that is due to another issue entirely)?
      • Missing Methodology - What exactly are the guidelines by which you measured the quality or the correctness of a response, given that both seem subjective and can vary significantly depending on the context? E.g., for Principles 2 & 5, are you assessing the quality and correctness of the response from the standpoint of whatever audience you're prompting the LLM to address? What about prompts such as "What were the main global events of 2022?" or "What are the potential future trends in renewable energy?"?
    • Comparative Analysis - Where are the baseline instructions and baseline responses for your comparison?
    • Unlikely Results - Many of the instructions are overly simple tasks where one would expect to see marginal improvements, especially for larger models. Specifically, I've noticed many instructions across different principles (8, 6, 19) are extremely simple, and one would expect to see only marginal improvements to the response, yet there's somehow a >50% improvement in correctness? There are also certain prompts whose results cannot be replicated, such as "###Instruction###\nTranslate a given word from English to French.\n### Question ###\nWhat is the French word for "book"?" on LLaMA-7B.
    • Choice of Model - Why did you decide to use the baseline models for your small and medium sized models but dialogue/preference-tuned models for your large models? Given that models of an entirely different architecture and training format were used, why did you proceed with a comparison between baseline and tuned models when there are alternative baseline 70B models (WizardLM, Orca)? Furthermore, there's a massive gap between the parameter sizes within the large models: 70B vs. 200B vs. 1T+. All of this makes me extremely dubious of your findings, given that the majority of the gap in performance between the different size classes in your paper can simply be explained by the large models being tuned and having far more parameters. This can be seen in your detailed percentages: there's a massive gap in performance between GPT-4 and all other models simply due to it having >1T parameters.
    • Inconsistent Handling of Responses - Why did you prompt the GPT-3.5 and GPT-4 models 10 times while prompting the open-source models only once? How did you even choose which response to use? Was this treatment consistent with how you generated the baseline (I won't take your word for this one given the numerous flaws and errors I've observed so far)? If not, how are your results not biased (they already seem biased IMO, given the lack of guidelines for your evaluation combined with your model choice)?
    • Misc - Was your evaluation done blind? Did the evaluators know which were the baselines and which were the principled prompts? Who evaluated these results?
  2. Issues pertaining to code, implementation, and the actual data

    • Unprincipled Prompts - For Principle 1, which was "No need to be polite with LLM so there is no need to add phrases like “please"", anyone who bothered to even take a look could see that none of your instructions actually follows your principle. All of them are polite, yet you somehow see a difference in both quality AND correctness? How is this even possible, and what was the baseline for this principle that resulted in these improvements?
    • Literally Impossible Data - Based on the generate.py code you've released, it's literally impossible to generate the responses shown for Principle 14, since all you're doing is calling the model with the same prompt without ever updating it with the model's questions or the user's responses (see the sketch after this list for what such a loop would actually require)

      ATLAS/generate.py

      Lines 40 to 43 in 03511d3

      for _ in range(10):
          a = generate_answers(q)
          questions.extend(q)
          answers.extend(a)
      Furthermore, using these clearly fabricated responses, you claim to have somehow achieved 100% improvement across all three model sizes? Really?
    • Inconsistencies between Code and Data Format - In the code, the output is written without the model's name, yet in the data all the models' names are magically filled in?
      qa_pairs = [{"instruction": q, "output": a} for q, a in zip(questions, answers)]
      How can you actually guarantee the data comes from the model you claim it does, given that you clearly modified the data using external code?
    • Inconsistencies between Data and Paper - In the paper, you claimed to have used Llama-2-70b-chat; why isn't this reflected in your data?
    • Missing Data - I noticed that correctness data for principles 14, 15, 21, 22, 23 were outright omitted from the paper. Why is this the case?
    • Mixing of Principles - I won't even bother citing direct examples for this: many of your instructions mix CoT with whatever principle the instruction is supposedly for.
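
To make the generate.py point concrete, here is a rough, hypothetical sketch of the kind of conversation loop that would be required to produce multi-turn Principle 14 dialogues. The function names (call_model, get_user_reply) are placeholders, not functions from your repo; the point is that your released loop never feeds the model's clarifying questions or the user's replies back into the prompt.

    def call_model(messages):
        # Placeholder: send the full message history to the LLM and return its reply.
        raise NotImplementedError

    def get_user_reply(model_question):
        # Placeholder: the human (or a scripted user) answers the model's clarifying question.
        raise NotImplementedError

    def collect_principle_14_dialogue(question, max_turns=5):
        messages = [{"role": "user", "content": question}]
        for _ in range(max_turns):
            reply = call_model(messages)  # the model may answer or ask for more details
            messages.append({"role": "assistant", "content": reply})
            if not reply.rstrip().endswith("?"):  # crude heuristic: a final answer rather than a question
                break
            messages.append({"role": "user", "content": get_user_reply(reply)})
        return messages  # full multi-turn history; the released single-shot loop never builds this

Nothing resembling this appears in generate.py, which is why the multi-turn Principle 14 responses cannot have come from that code.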

There are significant issues with your paper which make your findings "dubious" to say the least. Was this written by freshmen undergrads over two to three weeks? This paper comes off as sloppy, and the way it was written makes me think the authors were trying to fill pages without regard for the quality of the content. Almost 1/5th of the pages are dedicated to just the Gemini and GPT-4 references, when no other (decent) paper that cites either of them does so in this manner. I get that this was released on arXiv, but how such glaring flaws weren't caught by your advisor is honestly beyond me.

Hi @wemoveon2, thanks for sharing these comments. We noticed you may have certain misconceptions here; the clarifications for them are:

  1. Quality vs. Correctness Discrepancy: For Quality, we consider the quality of responses with and without the principled prompts, even if both of the model's responses are factually correct (through human evaluation). For Correctness, we consider the proportion of responses that are factually correct with and without the principles. The questions used are different from the ones for Quality and are much more difficult.
  2. Missing Methodology: This is an empirical work, so there is no specific methodology. The quality of responses is evaluated mainly through human evaluation, which seems the most reliable solution currently.
  3. Comparative Analysis: We have provided all the instructions without principles and their corresponding responses in the benchmark as the baseline instructions and baseline responses (the principle benchmark construction is still in progress to make it more comprehensive and unbiased, please stay tuned).
  4. Unlikely Results: Result improvement is highly dependent on the selected questions (simplicity, domains, etc.), as we mentioned in the Limitations section, and we are building more diverse questions for each principle to mitigate this.
  5. Choice of Model: Our small and medium sized models are also fine-tuned on the dialogue instruction dataset of ShareGPT. We did not provide too many details for the instruction tuning because our focus is not here. LLaMA-1/2 are representative models; the models you mentioned (WizardLM, Orca) also seem to be fine-tuned on LLaMA and there is no significant difference from ours. Regarding the massive gap between the parameter sizes within the large models (70B vs. 200B vs. 1T+): the small, medium, and large-scale groups are roughly split, we need a strong open-source LLM for the large-scale category, and LLaMA-2-70B seems a good choice for this; Falcon 180B may also work.
  6. Inconsistent Handling of Responses: Prompting the GPT-3.5 and GPT-4 models 10 times is only for dataset construction in the GitHub repo, to increase the scale of the benchmark for further instruction tuning. The performance in the paper is evaluated on a single response without any response selection.
  7. Misc: The human evaluators can also see the questions associated with the responses.
  8. Unprincipled Prompts: P-1 is not a strong principle, so it doesn't present a significant difference before and after applying the principle.
  9. Literally Impossible Data: The responses used in the paper were collected from OpenAI's interactive website (GPT-3.5/4), and all of them are recorded in our internal Google sheet. The generate.py code here is for obtaining more GPT-3.5 & 4 responses to construct the principled benchmark, so: 1) there is no interactive conversation in the code currently; 2) it differs from the released data format, since the data files include all the responses we obtained from various models.
  10. Inconsistencies between Data and Paper: LLaMA-2-70b-chat's responses are already in the data files; you probably haven't noticed them.
  11. Missing Data: As we mentioned above, for the Correctness evaluation, we use questions that are more difficult than for Boosting, typically involving complex reasoning tasks, so some principles are not applicable to them.
  12. Mixing of Principles: Thanks for pointing this out. Some instructions may contain multiple principles, but the proportion should not be very high. We will double-check the instructions to avoid this.

Thank you for the clarifications, however, I have further questions regarding some of these points.

  1. "...even if both of the model's responses are factually correct." I asked why there is a discrepancy in cases where there is a large improvement in correctness but a much smaller improvement in overall quality, which I suppose you subsequently answered with "The questions used are different from the ones for Quality." This, however, raises further issues, as nowhere in the paper is this stated to be the case; in fact, your paper only states that "For each principle, it contains 20 human-selected questions with and without the principled prompts." It is not my misconception that there is a discrepancy; it is a mistake on your part to state that you prompted the model using ATLAS and then evaluated those responses in two different settings, when you have just stated that you actually used two different sets of questions. I had suspected this to be the case, as it is one of the two ways you could have run into this discrepancy, but I gave you the benefit of the doubt, given that such a significant discrepancy between the methodology as described in the paper and the approach actually taken can be classified as misconduct in an academic setting, depending on researcher intentions. Please address and clarify this discrepancy in your experiment design, not just here but also in your paper and code, as neither suggests that ATLAS contains two independent datasets meant to evaluate two different metrics independently.
  2. "This is an empirical work, so there is no specific methodology." Empirical work still needs to document how data was collected, how it was analyzed, and how biases were minimized. You know this, as you presented a methodology for collecting the data and attempted to present a methodology for your evaluation by stating that "Following [10, 25], we evaluate the various scales of LLM outputs by human evaluation." The two works you cited are [10] AlpacaEval: An Automatic Evaluator for Instruction-following Language Models and [25] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Both present methods for using an LLM as an evaluator, so you wouldn't need a methodology if you had just used their methods, but instead you contradict yourself and state you were going to use human evaluation.
  3. "We have provided all the instructions without principles and their corresponding responses" You have provided only principled prompts without their baselines, otherwise there would be a corresponding baseline for this principled prompt: https://github.com/VILA-Lab/ATLAS/blob/03511d305ff51d5647059822ed5a1b2777fdb30d/data/principles/principle_6.json#L3C106-L3C125 As for "the principle benchmark construction is still in progress to make it more comprehensive and unbiased, please stay tuned": so why release data labeled ATLAS? Why even make ATLAS public if your results are based on this other dataset which cannot be reviewed? Why didn't you just use an established benchmark and generate principled instructions using questions from it? None of the widely accepted benchmarks attempts to benchmark a metric as generic as "quality", so why taint the reproducibility of your results with the subjectivity of evaluating something that generic?
  4. Again, why did you not focus on correctness alone and use an existing benchmark that has hundreds if not thousands of examples? You're also using generated questions, so what is the delay? Including the statement in your Limitations section does not excuse a researcher from the responsibility of presenting a balanced and representative set of results. It is the researchers' responsibility to ensure that their selection of data is as unbiased and representative as possible, even in the preliminary stages of research, much less in a paper that has been released to the public along with the dataset it supposedly used.
  5. "are also fine-tuned on the dialogue instruction dataset of ShareGPT" I mean, why do this in the first place, given that there are existing dialogue-tuned variants that are widely used and open source? How does this make your principles reproducible if their effectiveness was evaluated using tuned variants only you have access to? "We did not provide too many details for the instruction tuning because our focus is not here." Fine-tuning can have a significant impact on the performance of the model and on the overall findings of the paper, yet you chose to undergo this process instead of using available variants and omitted the details of how it was tuned? "(WizardLM, Orca) also seem to be fine-tuned on LLaMA and there is no significant difference from ours" Actually, there is, because both have entire papers documenting how they were tuned, both have derivative work analyzing them, and both are widely available, unlike your particular fine-tuned variants. "we need a strong open-source LLM for the large-scale category" So, having this particular need, you decided to include GPT-3.5 and GPT-4 over the aforementioned open-source variants?
  6. "only for dataset construction in the GitHub repo, to increase the scale of the benchmark for further instruction tuning" Let me get this straight: your paper states "All our evaluation is performed on ATLAS [18], a manually crafted benchmark for principled prompt evaluation." Now you're telling me the data here is neither the actual benchmark used for the correctness evaluation nor the benchmark used for assessing quality, but rather a dataset meant for instruction-tuning models?
  7. "human evaluators can also see the questions" Is this the baseline question or the principled prompt? Otherwise, how can you justify that your evaluation results aren't biased by evaluators knowing which are the baselines and which will be used for your paper?
  8. "P-1 is not a strong principle" Or maybe it looks that way because ATLAS contains no prompt that actually follows Principle 1?
  9. "LLaMA-2-70b-chat's responses are already in the data files; you probably haven't noticed them." Please, show this lowly and incompetent mortal where in your repo llama2-70b-chat appears, because I am so incompetent that I just cannot find it no matter how hard I work my ten braincells: https://github.com/search?q=repo%3AVILA-Lab%2FATLAS+llama2-70b-chat&type=code What I have found, however, is llama2-70b.

Principle 4 contradicts Principles 15 and 22. It's very confusing.

Principle 4: Employ affirmative directives such as "do," while steering clear of negative language like "don’t."

Principle 15: To inquire about a specific topic or idea or any information and you want to test your understanding, you can use the following phrase: “Teach me the [Any theorem/topic/rule name] and include a test at the end, but **don’t** give me the answers and then tell me if I got the answer right when I respond”

Principle 22: To correct/change specific text without changing its style: "Try to revise every paragraph sent by users. You should only improve the user’s grammar and vocabulary and make sure it sounds natural. You **should not** change the writing style, such as making a formal paragraph casual"

Hi @AreChen, thank you for your query. To provide clarity on the principles and directly address your concerns, let me explain further. Each principle has been crafted with a distinct aim to facilitate effective communication. While Principle 4 advocates for a positive tone in dialogues, Principles 15 and 22 provide a structured approach to specific requests. It’s understood that the presence of 'don't' within these instructions could appear to contradict the affirmative stance of Principle 4. However, our main focus in Principles 15 and 22 is on guiding users in how to frame requests effectively, irrespective of a positive or negative tone. The focus here is on the manner of articulation—Principle 15 encourages interactive learning and meaningful engagement with the content for a deeper understanding, and Principle 22 is about fine-tuning the text while maintaining the original tone and intent.

@wemoveon2 Thanks for writing that up. Agree with a lot of points.

Would like to see some elaboration on the evaluation methodology. Without human reviewer criteria for "Boosting" (and/or criteria/examples for Correctness), it's hard to draw any conclusions whatsoever from this paper. What did your evaluators see? Where did you source them from?

it's hard to draw any conclusions whatsoever from this paper

I wish others would agree, but one of the main reasons I took the time to review this paper was the numerous influencers on LinkedIn and Twitter touting its results as "removing the need for prompt engineering".

I really dislike how work in prompt engineering is not seen as proper science in the wider academic (AI) community, and I see this paper and its authors as propagators of this stereotype.

Of course, I could stand corrected if independent parties demonstrate that this paper's results are replicable across the widely used and accessible LLMs, not just the authors' own fine-tuned variants, which could easily have had their training data contaminated by the benchmark.

Would like to see some elaboration on the evaluation methodology.

@darinkishore Tbh, I read the whole paper (at least 3x) just because I felt I was missing something related to the evaluation methodology.
I don't think it's a problem with this paper exclusively, though, but rather a common issue with most LLM papers these days (most frequently those coming from startups/private research groups). I can't pinpoint exactly why, but this "more handmade than repeatable science" approach is quite misleading and nearly impossible to use as a basis for solid conclusions.

I wish others would agree, but one of the main reasons I took the time to review this paper was the numerous influencers on LinkedIn and Twitter touting its results as "removing the need for prompt engineering".

@wemoveon2 I appreciate your time doing this. This is how we are supposed to approach these studies: with all due respect for those who wrote them. I appreciate the authors' effort, but the paper is missing some important clarifications.

The lack of "proper science" as you mentioned is a poisoning idea that backfires in the whole industry as we grow in popularity (and promises) - and its stagerring that this isn't being perceived by the academic community.

I don't know how many readers noticed flaws and did not open issues or discuss them publicly, but even a single one is alarming and definitely reinforces the stereotype.

Thanks for all these comments. I'm not an academic and I'm not an influencer. But I, too, was wondering about the heatmap. I couldn't figure out how you came up with the values on page 9. I couldn't find the methodology in the paper, so I thought I would check here.

The only way I saw the data on page 9 making sense is if 20 people were shown the results and asked to judge which result was better, the principled or the non-principled one. And I'm not getting into the double-blind side of things, or even the correctness vs. quality distinction. I would have to assume that all 20 participants (if my guess at the methodology is correct) clearly understood the evaluation criteria.

I see that this is v1, so I think it has to be reviewed, and maybe a second round of tests will be done. I hope so, anyway. I would really like to know whether these principles are really solid before I write about them.