Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Update CI image and unit tests #283

Closed
wants to merge 25 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
c104c72
Update incremental code test coverage report online address
phoenixdong Nov 5, 2024
c703f01
Merge branch 'FlagOpen:main' into update_CI
phoenixdong Nov 14, 2024
240f504
Coverage Online Report Location Advance
phoenixdong Nov 14, 2024
276ef62
Merge branch 'FlagOpen:main' into update_CI
phoenixdong Nov 28, 2024
a726f48
Add export testing and adjust the testing status verification method
phoenixdong Nov 28, 2024
2e9fad0
Some implementations of updating code format and format checking
phoenixdong Nov 28, 2024
45f7220
Merge branch 'FlagOpen:main' into update_CI
phoenixdong Dec 8, 2024
72c7f33
Megatron Unit Test Update, Fix or Skip
phoenixdong Dec 8, 2024
fe9b1f8
FlagScale Unit Test Repair
phoenixdong Dec 8, 2024
79513ad
Update the CI image and add the Dockerfile for the CI
phoenixdong Dec 8, 2024
6f4cac4
Update tests/scripts/unit_tests/test_subset.sh
phoenixdong Dec 8, 2024
4eec993
Merge branch 'FlagOpen:main' into update_CI
phoenixdong Dec 9, 2024
5a97612
Add diff-cover
phoenixdong Dec 9, 2024
a9329b7
Update Dockerfile.ci
phoenixdong Dec 9, 2024
2a7f852
Update tests/scripts/unit_tests/test_subset.sh
phoenixdong Dec 9, 2024
6f9a32a
Update tests/scripts/unit_tests/test_subset.sh
phoenixdong Dec 9, 2024
9705478
Update megatron/tests/unit_tests/transformer/moe/test_a2a_token_dispa…
phoenixdong Dec 9, 2024
a4c0d46
Update and test
phoenixdong Dec 9, 2024
afde473
Adjust the temporary file path for coverage to avoid coverage loss ca…
phoenixdong Dec 9, 2024
6d03c43
Update cov-temp-flagscale prefix path
phoenixdong Dec 9, 2024
103cfbd
Clean old coverage files
phoenixdong Dec 9, 2024
28fb8b7
Update clean old coverage files
phoenixdong Dec 9, 2024
880ccb2
Update clean old coverage files
phoenixdong Dec 9, 2024
3773b67
Update clean old coverage files
phoenixdong Dec 9, 2024
6937807
Update clean old coverage files
phoenixdong Dec 9, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .coveragerc
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
directory = coverage

[run]
data_file = .coverage_$LOCAL_RANK
data_file = /workspace/report/$COMMIT_ID/cov-temp-flagscale/.coverage_$LOCAL_RANK
15 changes: 15 additions & 0 deletions .github/workflows/all-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,15 +11,23 @@ concurrency:
cancel-in-progress: true

jobs:
# Megatron Report Clean
megatron-report-clean:
uses: ./.github/workflows/report-clean.yml
with:
backend: megatron

# Megatron Unit Tests with Matrix
megatron-unit-tests:
needs: megatron-report-clean
uses: ./.github/workflows/unit-tests.yml
strategy:
matrix:
subset:
- data
- dist_checkpointing
- distributed
- export
- fusions
- inference
- models
Expand All @@ -33,8 +41,15 @@ jobs:
backend: megatron
subset: ${{ matrix.subset }}

# FlagScale Report Clean
flagscale-report-clean:
uses: ./.github/workflows/report-clean.yml
with:
backend: flagscale

# Flagscale Unit Tests with Matrix
flagscale-unit-tests:
needs: flagscale-report-clean
uses: ./.github/workflows/unit-tests.yml
strategy:
matrix:
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/coverage-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ jobs:
test-coverage:
runs-on: self-hosted
container:
image: localhost:5000/flagscale_cicd:v1.5
image: localhost:5000/flagscale_cicd:v2.0-pytorch-2.5.1-cuda-12.4.131-ngc-24.05
ports:
- 80
volumes:
Expand Down
7 changes: 4 additions & 3 deletions .github/workflows/format.yml
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ env:

jobs:
format:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04

steps:
- name: Checkout Code
Expand All @@ -37,8 +37,9 @@ jobs:

- name: Run Black
run: |
black --verbose --include $INCLUDE_FILES ./ --diff
black --verbose --include "$INCLUDE_FILES" ./ --check || { echo "Code formatting does not comply with Black's rules. Please reformat the code according to Black and resubmit."; exit 1; }

- name: Run Isort
run: |
isort --verbose --profile black $INCLUDE_FILES --diff --known-local-folder flagscale
isort --verbose --profile black $INCLUDE_FILES --check-only --diff --known-local-folder flagscale || { echo "Import order does not comply with isort rules. Please fix the import order and resubmit."; exit 1; }

2 changes: 1 addition & 1 deletion .github/workflows/functional-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ jobs:
functional-test:
runs-on: self-hosted
container:
image: localhost:5000/flagscale_cicd:v1.5
image: localhost:5000/flagscale_cicd:v2.0-pytorch-2.5.1-cuda-12.4.131-ngc-24.05
ports:
- 80
volumes:
Expand Down
32 changes: 32 additions & 0 deletions .github/workflows/report-clean.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
name: Clean Old Report

on:
workflow_call:
inputs:
backend:
required: true
type: string

jobs:
clean-report:
runs-on: self-hosted
container:
image: localhost:5000/flagscale_cicd:v2.0-pytorch-2.5.1-cuda-12.4.131-ngc-24.05
ports:
- 80
volumes:
- /home/flagscale_cicd/flask/static:/workspace/report
- /home/flagscale_cicd/flask/config:/workspace/config
options: --hostname flagscale_cicd

steps:
- name: Clean Old Report Report
run: |
REPORT_ADDR=$(cat "/workspace/config/report_address")
echo "Clean old Report report at the http://${REPORT_ADDR}/${{github.sha}}/cov-report-${{ inputs.backend }}/diff-cover-report-${{ inputs.backend }}.html"
if [ -d "/workspace/report/${{ github.sha }}/cov-report-${{ inputs.backend }}" ]; then
rm -r /workspace/report/${{ github.sha }}/cov-report-${{ inputs.backend }}
fi
if [ -d "/workspace/report/${{ github.sha }}/cov-temp-${{ inputs.backend }}" ]; then
rm -r /workspace/report/${{ github.sha }}/cov-temp-${{ inputs.backend }}
fi
2 changes: 1 addition & 1 deletion .github/workflows/unit-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ jobs:
unit-test:
runs-on: self-hosted
container:
image: localhost:5000/flagscale_cicd:v1.5
image: localhost:5000/flagscale_cicd:v2.0-pytorch-2.5.1-cuda-12.4.131-ngc-24.05
ports:
- 80
volumes:
Expand Down
100 changes: 100 additions & 0 deletions docker/Dockerfile.ci
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
FROM nvcr.io/nvidia/pytorch:24.05-py3

ENV DEBIAN_FRONTEND noninteractive
ENV TZ=Asia/Shanghai


##############################################################################
# To avoid "curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)" or "fetch-pack: unexpected disconnect while reading sideband packet".
##############################################################################
# lowSpeedTime=300s lowSpeedLimit=100B
RUN git config --global http.lowSpeedTime 300 \
&& git config --global http.lowSpeedLimit 100 \
&& git config --global http.postBuffer 524288000


##############################################################################
# Change apt source to Ksyun
##############################################################################
RUN sed -i "s#\S\+#http://apt.ksyun.cn/ubuntu/#2" /etc/apt/sources.list && \
> /etc/apt/apt.conf.d/docker-clean && \
> /etc/dpkg/dpkg.cfg.d/pkg-config-hook-config


##############################################################################
# Install basic utilities
##############################################################################
RUN apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common build-essential autotools-dev \
nfs-common pdsh \
curl wget vim tmux less unzip \
htop iftop iotop ca-certificates openssh-client openssh-server \
rsync iputils-ping net-tools sudo \
tzdata psmisc screen libx11-dev llvm-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*


##############################################################################
# Uninstall unnecessary packages and their dependencies
##############################################################################
RUN pip install --upgrade pip && pip install pip-autoremove && \
pip-autoremove torch torchvision torchaudio torch-tensorrt transformer_engine \
pytorch-quantization pytorch-triton \
flash-attn tensorboard apex cudf dask-cudf \
cugraph cugraph-dgl cugraph-pyg cugraph-service-server -y


##############################################################################
# Install PyTorch
##############################################################################
RUN pip install --upgrade pip \
&& pip install --no-cache-dir torch==2.5.1 torchvision torchaudio \
-f https://download.pytorch.org/whl/cu124/torch_stable.html -v \
|| { echo 'PyTorch installation failed'; exit 1; }


##############################################################################
# Install, run, and test dependent environments and data
##############################################################################
RUN pip install pytest pytest-cov pytest_mock pytest-random-order \
pre-commit black isort diff-cover \
zarr tensorstore==0.1.45 wrapt tiktoken omegaconf setuptools_scm hydra-core Ray==2.40.0 numpy==1.26.4 pillow==10.4.0 \
git+https://github.com/fanshiqing/[email protected] nltk==3.8.1 \
&& python -m nltk.downloader -d /root/nltk_data punkt


# apex
RUN cd /workspace \
&& git clone https://github.com/NVIDIA/apex \
&& cd apex \
&& pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./


# flash-attention
# Supported flash-attn versions are >= 2.1.1, <= 2.6.3.
# flash-attn==2.6.3
RUN cd /workspace \
&& git clone https://github.com/Dao-AILab/flash-attention.git \
&& cd flash-attention \
&& git checkout c1d146c \
&& git submodule update --init --recursive \
&& MAX_JOBS=96 python setup.py install


# TransformerEngin
RUN cd /workspace \
&& git clone -b stable https://github.com/NVIDIA/TransformerEngine.git \
&& cd TransformerEngine \
&& git submodule update --init --recursive \
&& pip install .


# xformers
RUN cd /workspace \
&& git clone https://github.com/facebookresearch/xformers.git \
&& cd xformers \
&& git submodule update --init --recursive \
&& pip install -v -U .
2 changes: 1 addition & 1 deletion megatron/.coveragerc
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
directory = coverage

[run]
data_file = .coverage_$LOCAL_RANK
data_file = /workspace/report/$COMMIT_ID/cov-temp-megatron/.coverage_$LOCAL_RANK
3 changes: 2 additions & 1 deletion megatron/megatron/core/models/gpt/gpt_layer_specs.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,7 @@ def get_gpt_layer_with_transformer_engine_spec(
qk_layernorm: Optional[bool] = False,
multi_latent_attention: Optional[bool] = False,
fp8: Optional[str] = None,
use_te: Optional[bool] = True,
) -> ModuleSpec:
"""Use this spec to use lower-level Transformer Engine modules (required for fp8 training).

Expand All @@ -72,7 +73,7 @@ def get_gpt_layer_with_transformer_engine_spec(
ModuleSpec: Module specification with TE modules
"""
mlp = _get_mlp_module_spec(
use_te=True, num_experts=num_experts, moe_grouped_gemm=moe_grouped_gemm, fp8=fp8
use_te=use_te, num_experts=num_experts, moe_grouped_gemm=moe_grouped_gemm, fp8=fp8
)

if multi_latent_attention:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -366,6 +366,8 @@ def get_query_key_value_tensors(
query = torch.cat([q_no_pe, q_pos_emb], dim=-1)

# key: [s, b, n, 192]
# https://github.com/NVIDIA/Megatron-LM/pull/1203
k_pos_emb = k_pos_emb.repeat_interleave(self.num_attention_heads_per_partition, dim=2)
key = torch.cat([k_no_pe, k_pos_emb], dim=-1)

query = query.contiguous()
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
from typing import List, Tuple
from unittest import mock

import os
import pytest
import torch

Expand Down Expand Up @@ -280,6 +281,8 @@ def test_load_distribution(self, parallelization_along_dp, tmp_path_dist_ckpt):

assert loaded_state_dict.keys() == state_dict.keys()

# The mock function running internally was not called
@pytest.mark.skipif(os.getenv('FLAGSCALE_SKIP') == '1', reason="FLAGSCALE_SKIP is enabled, skipping test.")
@pytest.mark.parametrize('state_dict_device', ['cpu', 'cuda'])
@pytest.mark.flaky
def test_memory_usage(self, state_dict_device, tmp_path_dist_ckpt):
Expand Down
38 changes: 24 additions & 14 deletions megatron/tests/unit_tests/models/test_bert_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import pytest
import torch
from packaging.version import Version as PkgVersion
from packaging.version import parse
from pytest_mock import mocker

from megatron.core.models.bert.bert_layer_specs import (
Expand All @@ -16,7 +17,7 @@
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.utils import is_te_min_version
from megatron.core.utils import is_te_min_version, get_te_version
from tests.unit_tests.test_utilities import Utils


Expand Down Expand Up @@ -161,26 +162,35 @@ def test_transformer_engine_version_1_7_to_1_10_flash_attn(self, mocker):

@pytest.mark.internal
def test_transformer_engine_version_1_7_to_1_10_rng_error(self, mocker):
os.environ['NVTE_FLASH_ATTN'] = '0'
os.environ['NVTE_FUSED_ATTN'] = '0'

bert_layer_with_transformer_engine_spec.submodules.self_attention.params[
'attn_mask_type'
] == AttnMaskType.padding
mocker.patch("megatron.core.utils.get_te_version", return_value=PkgVersion("1.8"))
with pytest.raises(Exception) as exc_info:
# Get the current version of Transformer Engine
te_version = f"{get_te_version().major}.{get_te_version().minor}"

# Check if the version is between 1.7 and 1.10
if parse("1.7") <= parse(te_version) <= parse("1.10"):
# Expect an exception during BertModel initialization
with pytest.raises(Exception) as exc_info:
self.bert_model = BertModel(
config=self.transformer_config,
num_tokentypes=0,
transformer_layer_spec=bert_layer_with_transformer_engine_spec,
vocab_size=100,
max_sequence_length=4,
)
# Verify the exception message matches the expected error
assert str(exc_info.value) == (
"Linear.__init__() got an unexpected keyword argument 'rng_tracker_name' when "
"instantiating TERowParallelLinear when instantiating SelfAttention when "
"instantiating TransformerLayer"
)
else:
# For versions outside the range, initialize the model without expecting an exception
self.bert_model = BertModel(
config=self.transformer_config,
num_tokentypes=0,
transformer_layer_spec=bert_layer_with_transformer_engine_spec,
vocab_size=100,
max_sequence_length=4,
)
assert str(exc_info.value) == (
"Linear.__init__() got an unexpected keyword argument 'rng_tracker_name' when "
"instantiating TERowParallelLinear when instantiating SelfAttention when "
"instantiating TransformerLayer"
)

@pytest.mark.internal
def test_transformer_engine_version_1_7_to_1_10_unfused_attention(self, mocker):
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.

import os
import pytest
import torch

Expand Down Expand Up @@ -66,10 +67,11 @@ def test_capacity_forward_backward(self, tp_size, ep_size):
)
container.dispacher_capacity_test()

# Skip because not running in internal and flaky
@pytest.mark.skipif(os.getenv('FLAGSCALE_SKIP') == '1', reason="FLAGSCALE_SKIP is enabled, skipping test.")
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.internal
@pytest.mark.timeout(120)
@pytest.mark.internal
@pytest.mark.parametrize("tp_size,ep_size", [(1, 8), (8, 1), (4, 2), (1, 1)])
@pytest.mark.flaky
def test_capacity_padding_forward_backward(self, tp_size, ep_size):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ def setup_method(self, method, use_cpu_initialization=False, swiglu=True):
_set_random_seed(seed_=123, data_parallel_random_init=False)
tf_config.moe_grouped_gemm = True
transformer_layer_spec = get_gpt_layer_with_transformer_engine_spec(
self.num_experts, moe_grouped_gemm=True
self.num_experts, moe_grouped_gemm=True, use_te=False
)
self.grouped_mlp = MoELayer(tf_config, transformer_layer_spec.submodules.mlp.submodules)
self.grouped_mlp = Float16Module(self.grouped_mlp, self.args).module
Expand Down
Loading
Loading