microsoft / MInference Public

Fork 38
Star 801

Issues: microsoft/MInference

[ToDo]: V0.1.6 Iteration Plan

#50 opened Jul 18, 2024 by iofu728

Open

[ToDo]: V0.1.5 Iteration Plan

#27 by iofu728 was closed Jul 24, 2024

Closed


35 Open 26 Closed


Issues list

[Question]: How can I reproduce the FullAttention results on the Ruler dataset question

Further information is requested

#87 opened Nov 25, 2024 by LfieLike

[Question]: CUDA error: an illegal memory access was encountered when running benchmark_e2e.py question

Further information is requested

#86 opened Nov 20, 2024 by lepangdan

[Feature Request]: Is it possible to get the returned logsumexp in streamingllm forward? feature request

New feature or request

#85 opened Nov 17, 2024 by 311dada

[Question]: Discrepancy in Pre-filling Time and Memory Consumption on Single A100 question

Further information is requested

#84 opened Nov 15, 2024 by lepangdan

[Question]: Am I using minference correctly? question

Further information is requested

#83 opened Oct 30, 2024 by YLGH

[Question]: analysis of attention scores (too sparse) question

Further information is requested

#82 opened Oct 19, 2024 by wiluen

[Question]: sparsity of minference question

Further information is requested

#78 opened Sep 23, 2024 by susu1210

[Bug]: Torch not found: can't install with pip install (Python 3.12, CUDA 12.6 Update 1, PyTorch 2.4.1) bug

Something isn't working

#77 opened Sep 20, 2024 by atemerev

[Question]: Could you provide more examples about other attention usage, e.g., dilated1, streaming, snapkv question

Further information is requested

#76 opened Sep 18, 2024 by gaow0007

[Bug]: loc("Minference/minference/ops/pit_sparse_flash_attention_v2.py":110:23): error: operation scheduled before its operands bug

Something isn't working

#75 opened Sep 18, 2024 by leoyuppieqnew

[Feature Request]: Support LLaVA Model feature request / Low generation speed feature request

New feature or request

#74 opened Sep 18, 2024 by ThisisBillhe

[Question]: what is the speedup of attention kernel of current implementation? question

Further information is requested

#73 opened Sep 10, 2024 by foreverpiano

Performance Degradation when Using MInference with Qwen2-7B-Instruct Model question

Further information is requested

#71 opened Aug 26, 2024 by yumingfan-0219

[Bug]: vllm executor.driver_worker. 'RayWorkerWrapper' object has no attribute 'model_runner' bug

Something isn't working

#67 opened Aug 8, 2024 by TPLink32

[Question]: Confusion about Optimal Search Pattern Configuration question

Further information is requested

#64 opened Aug 6, 2024 by Dianaia

[Question]: It seems that minference does not currently support tensor parallelism under vllm, right? Because in a multi-card environment, the head_id here is incorrect compared to a single card feature request

New feature or request

question

Further information is requested

#62 opened Aug 4, 2024 by zh2333

[Question]: Why is every head config saved with "vertical_and_slash"? question

Further information is requested

#57 opened Jul 29, 2024 by fmmoret

Does MInference supports CUDA11.8? question

Further information is requested

#56 opened Jul 29, 2024 by hensiesp32

Shape of slash mismatch when input batchsize > 1 bug

Something isn't working

#53 opened Jul 23, 2024 by polarispw

[Question]: attn_type="minference" and attn_type= "hf" got different result question

Further information is requested

#52 opened Jul 21, 2024 by qiling1345

[ToDo]: V0.1.6 Iteration Plan plan

#50 opened Jul 18, 2024 by iofu728

3 tasks

[Question]: Question about the settings of vertical_size and slash_size in vertical_and_slash pattern question

Further information is requested

#47 opened Jul 17, 2024 by ALUKErnel

[Question]: Does vertical_slash_sparse_attention supported to concatenate all batches into a single row for operation like flash_attn_2_cuda.varlen_fwd? question

Further information is requested

#46 opened Jul 17, 2024 by Amanda-Barbara

[Question]: ModuleNotFoundError: No module named 'minference.cuda' question

Further information is requested

#45 opened Jul 16, 2024 by lai-serena

[Question]: Why is running MInference/examples/run_vllm.py not as fast as running vllm alone? question

Further information is requested

#43 opened Jul 16, 2024 by zjjznw123
