
Fix eos not stopping issue when batch_size >1 and set ignore_eos to False #1287

Closed

Conversation

heyuanliu-intel

@heyuanliu-intel heyuanliu-intel commented Aug 23, 2024

What does this PR do?

This PR fixes the issue where generation does not stop at the EOS token when the batch size is greater than 1 and ignore_eos is set to False.
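For context, here is a minimal sketch of the kind of per-sequence EOS tracking a batched greedy decoding loop needs. This is an illustrative simplification under assumed interfaces (an HF-style model returning .logits), not the actual patch in this PR:

import torch

def generate_batched(model, input_ids, eos_token_id, max_new_tokens, ignore_eos=False):
    # 1 = still generating, 0 = already produced EOS (tracked per sequence in the batch).
    unfinished = torch.ones(input_ids.shape[0], dtype=torch.long, device=input_ids.device)
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]
        next_tokens = torch.argmax(logits, dim=-1)
        if not ignore_eos:
            # Sequences that already finished keep emitting EOS as padding.
            next_tokens = next_tokens * unfinished + eos_token_id * (1 - unfinished)
            unfinished = unfinished * (next_tokens != eos_token_id).long()
        input_ids = torch.cat([input_ids, next_tokens.unsqueeze(-1)], dim=-1)
        # Stop only once every sequence in the batch has produced EOS.
        if not ignore_eos and unfinished.max() == 0:
            break
    return input_ids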

How to reproduce it?

python run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--max_new_tokens 1024 \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--batch_size 2 \
--attn_softmax_bf16 \
--limit_hpu_graphs \
--reuse_cache \
--trim_logits \
--no-ignore_eos \
--prompt "Please introduce yourself in 10 words." "How are you?"

The generation never stops, even after the EOS token is produced.

@heyuanliu-intel heyuanliu-intel changed the title Fix eso not stopping issue when batch_size >1 Fix eos not stopping issue when batch_size >1 and set ignore_eos to False Aug 23, 2024
@imangohari1
Contributor

@heyuanliu-intel
Thanks.
We likely need to run CI jobs on this to make sure there are no side effects on other models.
Do you have access to CI systems to do so?

@heyuanliu-intel
Author

Do you have access to CI systems to do so?

I don't have access to the CI systems.

@sywangyi sywangyi mentioned this pull request Sep 18, 2024
@libinta
Collaborator

libinta commented Sep 24, 2024

@heyuanliu-intel please run on your Gaudi machine with
GAUDI2_CI=1 RUN_SLOW=1 python -m tests/test_text_generation_example.py xxx for your specific case

@libinta
Collaborator

libinta commented Oct 1, 2024

@heyuanliu-intel if you have access to Gaudi2, please run locally with the command below:
GAUDI2_CI=true RUN_SLOW=1 python -m tests/test_text_generations .....

@mounikamandava
Contributor

LGTM.

@vidyasiv
Contributor

@heyuanliu-intel, can you run on an 8-HPU bare-metal Gaudi2 and paste test results at least for:

setup:

  • export GAUDI2_CI=1
  • export RUN_SLOW=1
  • pip install .[tests]

tests:

  • (fast tests) python -m pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py tests/test_trainer_seq2seq.py
  • (text-gen) python -m tests/test_text_generation_example.py

We want to make sure this doesn't introduce failures.

Contributor

@vidyasiv vidyasiv left a comment


see request for testing

@vidyasiv
Contributor

vidyasiv commented Nov 14, 2024

Results of CI run (build 370):

  1. fast_tests: 81 passed
  2. single card tests: tests/test_examples.py::CausalLanguageModelingExampleTester::test_run_clm_gpt2_single_card - AssertionError: 46.1763 not less than or equal to 46.13290500000001
  3. text-gen: FAILED tests/test_text_generation_example.py::test_text_generation_fp8[token0-mistralai/Mixtral-8x7B-v0.1-2-48-True-2048-2048-1147.5] - assert 1074.9889343503237 >= ((2 - 1.05) * 1147.5)
  4. multi card tests: 11 failed, 34 passed; failures also seen on main (build 368)

@heyuanliu-intel can you verify that failures 2 and 4 are independent of your changes?

Setup:

Run on main:

  • pytest -v -s tests/test_examples.py::CausalLanguageModelingExampleTester::test_run_clm_gpt2_single_card --token <>
  • pytest -v -s tests/test_text_generation_example.py::test_text_generation_fp8[token0-mistralai/Mixtral-8x7B-v0.1-2-48-True-2048-2048-1147.5]

Then run same commands w/ your changes and paste test results on PR.

@heyuanliu-intel
Author

I will try it.

@heyuanliu-intel
Author

heyuanliu-intel commented Nov 15, 2024

For the second case (pytest -v -s tests/test_text_generation_example.py::test_text_generation_fp8[token0-mistralai/Mixtral-8x7B-v0.1-2-48-True-2048-2048-1147.5]), it still fails without my PR applied.

    with TemporaryDirectory() as tmp_dir:
            command.append(f"--output_dir {tmp_dir}")
            command.append(f"--token {token.value}")

            pattern = re.compile(r"([\"\'].+?[\"\'])|\s")

            if fp8:
                env_variables["TQDM_DISABLE"] = "1"
                if measure_command is not None:
                    measure_command.append(f"--token {token.value}")
                    env_variables["QUANT_CONFIG"] = os.path.join(
                        path_to_example_dir, "text-generation/quantization_config/maxabs_measure_include_outputs.json"
                    )
                    measure_command = [x for y in measure_command for x in re.split(pattern, y) if x]
                    print(f"\n\nMeasure Command to test: {' '.join(measure_command[:-2])}\n")
                    proc = subprocess.run(measure_command, env=env_variables)

                    # Ensure the run finished without any issue
                    # Use try-except to avoid logging the token if used
                    try:
>                       assert proc.returncode == 0
E                       AssertionError: The following command failed:
E                       python3 /root/optimum-habana/examples/gaudi_spawn.py --use_deepspeed --world_size 2 /root/optimum-habana/examples/text-generation/run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --batch_size 1 --use_kv_cache --reuse_cache --bucket_size 128 --bucket_internal --use_hpu_graphs --trim_logits

tests/test_text_generation_example.py:280: AssertionError
========================================================================================================== short test summary info ==========================================================================================================
FAILED tests/test_text_generation_example.py::test_text_generation_fp8[token0-mistralai/Mixtral-8x7B-v0.1-2-48-True-2048-2048-1147.5] - AssertionError: The following command failed:
======================================================================================================= 1 failed in 893.20s (0:14:53) =======================================================================================================
root@sysid674639:~/optimum-habana#

@heyuanliu-intel
Author

For the first case (pytest -v -s tests/test_examples.py::CausalLanguageModelingExampleTester::test_run_clm_gpt2_single_card --token <>): this case passes both with and without my PR.

***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-15 08:13:58,717 >>   Num examples = 240
[INFO|trainer.py:1855] 2024-11-15 08:13:58,717 >>   Batch size = 4
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:01<00:00, 33.12it/s]
***** eval metrics *****
  epoch                       =        2.0
  eval_accuracy               =     0.4194
  eval_loss                   =     3.0499
  eval_runtime                = 0:00:02.95
  eval_samples                =        240
  eval_samples_per_second     =    127.818
  eval_steps_per_second       =     31.954
  max_memory_allocated (GB)   =      92.88
  memory_allocated (GB)       =      28.74
  perplexity                  =    21.1136
  total_memory_available (GB) =      93.55
PASSED

======================================================================================================= 1 passed in 71.32s (0:01:11) ========================================================================================================

@heyuanliu-intel
Author

So my summary:

  1. The first case passes with my PR applied.
  2. The second case still fails even without my PR.

@vidyasiv
Contributor

vidyasiv commented Nov 15, 2024

So my summary:

1. The first case passes with my PR applied.

2. The second case still fails even without my PR.

The 1st case could be due to host-to-host variation. Thanks for testing.

With -v and -s you should be able to see what the failure is for the 2nd case if you scroll.

Did you pass --token <> for the second case too? I realized I missed it above. If it still fails, check the following:

  • Maybe you can check whether you have access to the model: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1.
  • You need to be on 8 HPUs.
  • It could be that you have to run measure and then quant for fp8, and the specific test command jumps straight to quant and can't find the generated folder. Try running something like pytest -v -s tests/test_text_generation_example.py -k mixtral --token (your token) (this will run measure), followed by the test, whose command is
python3 /root/optimum-habana/examples/gaudi_spawn.py --use_deepspeed --world_size 2 /root/optimum-habana/examples/text-generation/run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --batch_size 48 --use_kv_cache --max_new_tokens 2048 --reuse_cache --bucket_size 128 --bucket_internal --use_hpu_graphs --trim_logits --max_input_tokens 2048 --limit_hpu_graphs

I tested on your behalf, but if you still face issues on your side we can get on a call. Everyone should be able to run the tests.

@vidyasiv
Contributor

For the 2nd test:

pytest -v -s tests/test_text_generation_example.py::test_text_generation_fp8[token0-mistralai/Mixtral-8x7B-v0.1-2-48-True-2048-2048-1147.5] --token ()
Got a particularly slow host.
On this PR (#1287):

Throughput (including tokenization) = 975.8450231790654 tokens/second
Number of HPU graphs                = 71
Memory allocated                    = 67.17 GB
Max memory allocated                = 94.2 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 317.2299808960006 seconds

On main:

Input tokens
Throughput (including tokenization) = 975.924843479448 tokens/second
Memory allocated                    = 67.16 GB
Max memory allocated                = 93.79 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 316.1472292409999 seconds

Nearly the same, so this can also be considered a non-issue.

Testing the PR with the given command:

python  examples/text-generation/run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--max_new_tokens 1024 \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--batch_size 2 \
--attn_softmax_bf16 \
--limit_hpu_graphs \
--reuse_cache \
--trim_logits \
--no-ignore_eos \
--prompt "Please introduce yourself in 10 words." "How are you?"

Output

Input/outputs:
input 1: ('Please introduce yourself in 10 words.',)
output 1: ('Please introduce yourself in 10 words.\nI am a friendly and curious person.',)

input 2: ('How are you?',)
output 1: ('How are you? I hope you are doing well.\nI am writing to you today to ask for your help. As you may know, I am a big fan of your work and I have been following your career for many years. I must say, you are one of the most talented and dedicated actors of our time.\n\nI am reaching out to you because I am in need of your help. I am producing a new film and I am looking for an actor to play the lead role. I believe that you would be perfect for the part and I would be honored if you would consider it.\n\nThe film is a drama about a man who is struggling to come to terms with a personal tragedy. It is a complex and challenging role, but I believe that you have the depth and range to bring it to life.\n\nI understand that you must be very busy, but I hope that you will take the time to consider this opportunity. I would be happy to discuss the project further with you and answer any questions you may have.\n\nThank you for your time and consideration. I look forward to hearing from you soon.\n\nSincerely,\n[Your Name]',)


Stats:
----------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 848.3231944449811 tokens/second
Number of HPU graphs                = 2329
Memory allocated                    = 13.63 GB
Max memory allocated                = 13.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 36.30529521800054 seconds

Testing without the --no-ignore_eos option

python  examples/text-generation/run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--max_new_tokens 1024 \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--batch_size 2 \
--attn_softmax_bf16 \
--limit_hpu_graphs \
--reuse_cache \
--trim_logits \
--prompt "Please introduce yourself in 10 words." "How are you?"

Output (@heyuanliu-intel, let me know if this is the expected result without --no-ignore_eos)

Input/outputs:
input 1: ('Please introduce yourself in 10 words.',)
output 1: ('Please introduce yourself in 10 words.\nI am a friendly and curious person.01. What is your name?\nMy name is Sherlock Holmes.\n02. What is your occupation?\nI am a consulting detective.\n03. What is your favorite hobby?\nSolving mysteries and uncovering the truth.\n04. What is your favorite food?\nIrish stew and scones.\n05. What is your favorite drink?\nA good strong cup of tea.\n06. What is your favorite place to visit?\nThe British Museum.\n07. What is your favorite book?\n"The Adventures of Sherlock Holmes" by Arthur Conan Doyle.\n08. What is your favorite music?\nClassical music, particularly Chopin.\n09. What is your favorite sport?\nI do not have a favorite sport, as I find physical activity to be a waste of time.\n10. What is your favorite thing to do on a rainy day?\nSolving a puzzling case or reading a good book. The 10 Best Sherlock Holmes Quotes\nSherlock Holmes is one of the most iconic fictional characters in history, known for his incredible powers of observation, his keen mind, and his ability to solve even the most complex of mysteries. Here are 10 of the best Sherlock Holmes quotes that showcase his wit, intelligence, and unique perspective on the world:\n1. "Data! Data! Data! I can\'t make bricks without clay." - This quote, often misattributed to Sherlock Holmes, is a reminder that facts and evidence are the foundation of any good investigation.\n2. "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." - This quote highlights the importance of approaching a case with an open mind and not jumping to conclusions based on incomplete information.\n3. "The game is afoot!" - This quote is often used to signal that an investigation is underway, and it has become a catchphrase for fans of the character.\n4. "Elementary, my dear Watson!" - This quote is often used to indicate that something is obvious or simple, but it also has a deeper meaning as it highlights the contrast between Holmes\' logical, analytical mind and Watson\'s more emotional and intuitive approach to solving cases.\n5. "The world is full of obvious things which nobody by any chance ever observes." - This quote highlights the idea that the most obvious clues are often overlooked, and that it takes a unique and observant mind like Holmes\' to notice them.\n6. "I have no knowledge of the future. I can only see the present." - This quote is often used to remind readers that Holmes is not a fortune teller or a predictor of the future, but rather a detective who uses his powers of observation and deduction to solve crimes in the present.\n7. "The art of detection is not a subtle, delicate thing, but a brutal and often unsavory process." - This quote highlights the idea that solving crimes is not always a glamorous or pleasant task, but rather a difficult and unpleasant one that requires a strong stomach and a willingness to confront unpleasantness.\n8. "I am the only being who can see the truth, and I am the only one who can uncover it." - This quote highlights the idea that Holmes is a unique and special individual, with a gift for seeing the truth that others may miss.\n9. "The game is afoot! And I am the only player." - This quote is often used to indicate that Holmes is ready to take on a new case, and that he is the only one who can solve it.\n10. "The world is a stage, and I am the only actor." 
- This quote highlights the idea that Holmes sees himself as a performer, with a unique role to play in the world of crime and detection. It also highlights his sense of self-importance and his belief that he is the only one who truly understands the game of detection.\nOverall, these quotes showcase the unique personality and perspective of Sherlock Holmes, and they highlight the key elements of his character that have made him such an enduring and beloved figure in popular culture. The 10 Best Books on the Art of Storytelling\nStorytelling is an essential part of human communication, and the art of storytelling has been passed down through generations. Whether you\'re a writer, a marketer, or simply',)

input 2: ('How are you?',)
output 1: ('How are you? I hope you are doing well.\nI am writing to you today to ask for your help. As you may know, I am a big fan of your work and I have been following your career for many years. I must say, you are one of the most talented and dedicated actors of our time.\n\nI am reaching out to you because I am in need of your help. I am producing a new film and I am looking for an actor to play the lead role. I believe that you would be perfect for the part and I would be honored if you would consider it.\n\nThe film is a drama about a man who is struggling to come to terms with a personal tragedy. It is a complex and challenging role, but I believe that you have the depth and range to bring it to life.\n\nI understand that you must be very busy, but I hope that you will take the time to consider this opportunity. I would be happy to discuss the project further with you and answer any questions you may have.\n\nThank you for your time and consideration. I look forward to hearing from you soon.\n\nSincerely,\n[Your Name] \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',)


Stats:
---------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 274.32692529299936 tokens/second
Number of HPU graphs                = 17
Memory allocated                    = 13.63 GB
Max memory allocated                = 13.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 23.226845264000076 seconds
---------------------------------------------------------------------------------------------------------------

@vidyasiv
Contributor

@heyuanliu-intel please clarify whether the above output without --no-ignore_eos is expected or not.

@heyuanliu-intel
Author

@vidyasiv Yes, the output is expected. If you run without --no-ignore_eos, the output length depends on the value of --max_new_tokens. If you run with --no-ignore_eos, generation should stop when the eos token is met.
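To spell out the flag semantics described above, here is a rough sketch of the behavior; this is not the actual run_generation.py code, and the should_stop helper and its arguments are hypothetical names for illustration:

def should_stop(all_sequences_hit_eos: bool, tokens_generated: int,
                max_new_tokens: int, ignore_eos: bool) -> bool:
    # Default behavior (ignore_eos=True, i.e. --no-ignore_eos not passed):
    # always generate max_new_tokens, regardless of EOS.
    if ignore_eos:
        return tokens_generated >= max_new_tokens
    # With --no-ignore_eos (ignore_eos=False): stop as soon as every sequence
    # in the batch has produced EOS, or when max_new_tokens is reached.
    return all_sequences_hit_eos or tokens_generated >= max_new_tokens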

Contributor

@vidyasiv vidyasiv left a comment


@regisss please take a look

@libinta libinta added the run-test label (Run CI for PRs from external contributors) Nov 21, 2024
@regisss
Collaborator

regisss commented Nov 25, 2024

@heyuanliu-intel I cannot reproduce the issue on main; can you check and let me know whether that is the case on your side too?

@heyuanliu-intel
Author

@regisss I have verified this on the main branch and I can't reproduce the issue there now. Maybe it has been fixed in another way.

@regisss
Collaborator

regisss commented Nov 27, 2024

@regisss I have verified this on the main branch and I can't reproduce the issue there now. Maybe it has been fixed in another way.

Okay, let's keep this PR open until the next release in case the issue appears again.

@regisss regisss removed the run-test label (Run CI for PRs from external contributors) Nov 27, 2024