DPU embedding table implementation with DPU multiColumn strategy #21

Open
wants to merge 98 commits into
base: master
98 commits
2dfece2
first version of multi-table
Mar 15, 2022
d675ea3
minor changes
Mar 15, 2022
0862d75
Added type casting for host code
justin-wong-ce Mar 20, 2022
f21a0a5
Added custom Pytorch submodule
justin-wong-ce Mar 20, 2022
8121f5f
Updated submodule commit ref
justin-wong-ce Mar 25, 2022
7ab442f
updated submodules refs
justin-wong-ce Mar 27, 2022
69b4f8d
added dpu_set_ptr
Mar 30, 2022
5c8ffc7
changed lookup() casting
justin-wong-ce Mar 31, 2022
dddafd4
Updated makefile to compile load generator
justin-wong-ce May 6, 2022
3eea8bb
emb_host runs with ./run.sh -br toy, still seg faults
justin-wong-ce May 11, 2022
8e77888
makefile and load_generator runs
justin-wong-ce May 17, 2022
dbaadca
pushing latest version which stalls
May 19, 2022
223af36
updated commit refs, updated ./upmem structure
justin-wong-ce May 20, 2022
3a419dd
Merge branch 'loadgen' of https://github.com/UBC-ECE-Sasha/PIM-Embedd…
justin-wong-ce May 20, 2022
a53c196
profiling dpu vs cpu
Jun 1, 2022
9d67b74
adding new dlrm
Jun 1, 2022
09844e4
added dpu_per_rank flexibility
Jun 29, 2022
274b0b0
pointing to own submodule
SylvanBrocard Jul 1, 2022
ce57e2c
added clangd config file
SylvanBrocard Jul 1, 2022
0006b08
updated for includes
SylvanBrocard Jul 1, 2022
e3ef711
fixed validation function
SylvanBrocard Jul 4, 2022
ad83789
code refacto + naming
Jul 4, 2022
dcee4d3
streamlined kernel
SylvanBrocard Jul 4, 2022
acfafb9
fix synthetic_populate random generation
Jul 4, 2022
fa10a74
codebase refactoring
Jul 4, 2022
01caff0
fix WARNINGS
Jul 5, 2022
d57fa3c
small scenario
Jul 5, 2022
4689c68
small scenario
Jul 5, 2022
96059d3
add tracing facilities to Makefile
Jul 5, 2022
af67194
comments & naming
Jul 5, 2022
0599741
fix MAX_NR_BATCHES
Jul 5, 2022
3687488
redo populate mram
Jul 5, 2022
b88ef69
populate fix
SylvanBrocard Jul 5, 2022
4a81d72
moved post-processing outside callback
SylvanBrocard Jul 5, 2022
f5d4f72
removed mem_reset
SylvanBrocard Jul 5, 2022
98ec3d9
prevent overflow in synthetic tests
SylvanBrocard Jul 5, 2022
4b06262
code refacto for emb_tables
Jul 5, 2022
fcfcec3
add constant def && profiling app func
Jul 5, 2022
4d5ee8c
fixed post-processing
SylvanBrocard Jul 5, 2022
ab4f839
fix missing NR_TASKLETS && code cleaning
Jul 5, 2022
eddb382
preventing overflow in tests
SylvanBrocard Jul 5, 2022
8f98712
explicit loop control
SylvanBrocard Jul 5, 2022
ae53265
explicit for loop
SylvanBrocard Jul 5, 2022
37f0a30
use uint64_t for dimentional variables && naming && result buffer all…
Jul 5, 2022
56818db
restored handling for large nr of indices
SylvanBrocard Jul 6, 2022
94bb9a2
Merge previous commits
SylvanBrocard Jul 6, 2022
52a5a37
commented debug code
SylvanBrocard Jul 6, 2022
c1e4584
wrong assert
SylvanBrocard Jul 6, 2022
0907925
format && replact uint32_t by uint64_t
Jul 6, 2022
1f6856b
preparation for async
SylvanBrocard Jul 6, 2022
eefea21
imports cleanup
SylvanBrocard Jul 6, 2022
e76261e
unused parameter
SylvanBrocard Jul 6, 2022
25dd8ba
fixed time counter
SylvanBrocard Jul 6, 2022
986c409
added perf counters + experiment
SylvanBrocard Jul 6, 2022
22ab4fd
BUG & PERF fix : separate alloc_dpu_results
Jul 6, 2022
190dd61
preparation for async
SylvanBrocard Jul 6, 2022
4197901
refacto : dont use precompiler constants
Jul 6, 2022
377c960
32 bits int in kernel
SylvanBrocard Jul 6, 2022
0de53a0
formatting
SylvanBrocard Jul 6, 2022
11c891a
now supports >512 batches
SylvanBrocard Jul 6, 2022
946f44c
refacto : separates indices, offsets and input info allocation
Jul 6, 2022
4b6a46c
param : switch to original
Jul 6, 2022
c6426db
fixed
SylvanBrocard Jul 6, 2022
973ae76
put counters in ifdef blocks
SylvanBrocard Jul 6, 2022
2dd274c
updated clangd
SylvanBrocard Jul 6, 2022
24b382d
imports cleanup
SylvanBrocard Jul 6, 2022
3fd8ffe
unused parameter
SylvanBrocard Jul 6, 2022
b8a5b02
fixed time counter
SylvanBrocard Jul 6, 2022
de8acbb
added perf counters + experiment
SylvanBrocard Jul 6, 2022
9ac37f3
32 bits int in kernel
SylvanBrocard Jul 6, 2022
eaca4e2
formatting
SylvanBrocard Jul 6, 2022
5fc81ac
now supports >512 batches
SylvanBrocard Jul 6, 2022
390254b
put counters in ifdef blocks
SylvanBrocard Jul 6, 2022
0cf34fb
updated clangd
SylvanBrocard Jul 6, 2022
4cafc19
measuring verification time
SylvanBrocard Jul 6, 2022
b16b4f4
pipeline stage 1 [build_synthetic_input_data -> synthetic_inference]
Jul 6, 2022
06d3d35
Merge remote-tracking branch 'origin/sbrocard/dpu_profiling' into dge…
Jul 7, 2022
a407ff6
few merge fix
Jul 7, 2022
94daeba
post processing rank callback
Jul 7, 2022
08d15c4
bug fix
SylvanBrocard Jul 7, 2022
4224a9f
Merge branch 'merge_tmp' into pipelinemerge
Jul 8, 2022
5eda412
Merge commit '94daeba2e6264733e4c46a2ccd9d2704bdb6b8ba' into dgerin/p…
Jul 8, 2022
c4fe841
BUGGY rank_mapping not functional
Jul 7, 2022
df2e350
merge with host pipelining (synthetic_data->inference->gather_dpu_res…
dgerinmem Jul 8, 2022
e2c0cb8
fix inconsistency : release mode host side compilation
Jul 8, 2022
817b8ab
Merge remote-tracking branch 'origin/upmem_internal' into pipeline
Jul 11, 2022
140d845
flexible rank/embedding mapping feature
Jul 13, 2022
ab57898
clock_gettime() fix clock type : use Wall time
Jul 13, 2022
1ad486d
bugfix alloc_dpus
Jul 18, 2022
6c3cc4f
multicol && rm2 perf
Jul 20, 2022
9f8b654
refacto separate CPU look and CPU/DPU check
Jul 29, 2022
72bd435
[bugfix] nr_ranks used before read actual number of ranks
Aug 23, 2022
04b4b60
[app] code cleaning & add comments about mapping
Sep 29, 2022
a705c77
[bugfix] memory leak missing free()
Sep 29, 2022
d13cf0a
[bugfix] bad loop informations cause memory fault
Sep 29, 2022
d87cb6c
[app] use DPU ASYNC jobs
Sep 29, 2022
9028bbd
[app] move lenght structure allocation out of critical func
Sep 29, 2022
766531c
[profiling] fix APP function profiling
Sep 29, 2022
5 changes: 4 additions & 1 deletion .gitmodules
@@ -9,4 +9,7 @@
 url = https://github.com/UBC-ECE-Sasha/PIM-DeepRecSys.git
 [submodule "PIM-dlrm-new"]
 path = PIM-dlrm-new
-url = https://github.com/UBC-ECE-Sasha/PIM-dlrm-new.git
+url = https://github.com/upmem/PIM-dlrm.git
+[submodule "PIM-Pytorch"]
+path = PIM-Pytorch
+url = https://github.com/UBC-ECE-Sasha/PIM-Pytorch.git
1 change: 1 addition & 0 deletions PIM-Pytorch
Submodule PIM-Pytorch added at 9b4f65
2 changes: 1 addition & 1 deletion PIM-dlrm-new
Submodule PIM-dlrm-new updated 1 files
+100 −3 dlrm_dpu_pytorch.py
52 changes: 52 additions & 0 deletions upmem/.clangd
@@ -0,0 +1,52 @@
CompileFlags:
Add:
[
-DMAX_INDICES_PER_LOOKUP=32,
-DMAX_BATCH_SIZE=64,
-DEMBEDDING_DIM=64,
-DBATCH_SIZE=64,
-DNR_RUN=100,
-DMAX_INDICES_PER_LOOKUP=32,
-DEMBEDDING_DEPTH=50000,
-DNR_EMBEDDING=9,
-DNR_TASKLETS=16,
]
---
If:
PathMatch: src/dpu/.* # Configuration for dpu binaries
CompileFlags:
Add:
[
--target=dpu-upmem-dpurte,
-I../../PIM-common,
-I../../include,
]
---
If:
PathMatch: src/.*
PathExclude: .*\/dpu/.* # Configuration for host binaries
CompileFlags: # Tweak the parse settings
Add:
[
-Iupmem/PIM-common,
-I../include,
-I/usr/include/dpu,
-ldpu,
-I/usr/lib/gcc/x86_64-linux-gnu/8/include,
-DNR_TASKLETS=16,
-lm,
--std=c11,
-fPIC,
-D_POSIX_C_SOURCE=199309L,
]
---
If:
PathMatch: include/.*
CompileFlags: # Tweak the parse settings
Add:
[
-shared,
-Wl,
-I/usr/include/dpu,
-ldpu,
]
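
Note on the `.clangd` flags: they mirror constants that the build otherwise passes with `-D`, so clangd can parse sources that depend on them. A source file relying on such constants might guard itself roughly as in the sketch below. This is only an illustration under stated assumptions: the fallback values are copied from the `.clangd` block in this diff, and `batch_result_elems` is a hypothetical helper that does not come from this PR.

```c
#include <assert.h>   /* static_assert (C11) and assert */
#include <stddef.h>

/* Fallback defaults, copied from the .clangd flags in this diff.
 * A real build would normally override them with -D options. */
#ifndef MAX_BATCH_SIZE
#define MAX_BATCH_SIZE 64
#endif
#ifndef BATCH_SIZE
#define BATCH_SIZE 64
#endif
#ifndef EMBEDDING_DIM
#define EMBEDDING_DIM 64
#endif
#ifndef NR_TASKLETS
#define NR_TASKLETS 16
#endif

/* Compile-time consistency checks: an inconsistent set of -D flags
 * fails the build instead of corrupting memory at run time. */
static_assert(BATCH_SIZE <= MAX_BATCH_SIZE, "BATCH_SIZE exceeds MAX_BATCH_SIZE");
static_assert(EMBEDDING_DIM > 0, "EMBEDDING_DIM must be positive");
static_assert(NR_TASKLETS >= 1, "NR_TASKLETS must be at least 1");

/* Hypothetical helper (not from the PR): number of result elements
 * produced for one batch on one embedding table. */
static size_t batch_result_elems(void)
{
    return (size_t)BATCH_SIZE * (size_t)EMBEDDING_DIM;
}
```

The `static_assert` form needs C11, which matches the `--std=c11` flag used for the host sources.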
Binary file added upmem/.run.sh.swp
Binary file not shown.
237 changes: 74 additions & 163 deletions upmem/Makefile
@@ -1,168 +1,79 @@
# Setting defaults
.PHONY:all

# DPUs information
NR_TASKLETS=16

# limits
MAX_EMBEDDING_DIM=64
MAX_NR_EMBEDDING=2000
MAX_INDICES_PER_LOOKUP=160
MAX_BATCH_SIZE=32
MAX_INDICES_PER_LOOKUP_RAND=160

# embedding information
EMBEDDING_DIM=64
EMBEDDING_DEPTH=500000
NR_EMBEDDING=50

# inputs information
BATCH_SIZE=32
INDICES_PER_LOOKUP=120
RAND_INPUT_SIZE=0

# benchmark information
NR_RUN=5
CHECK_RESULTS=1

# CPU/DPU check rounding error margin
CHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN=5000

build: build/bench build/embdpu

build/bench: build/emblib
gcc -O3 -Wl,-rpath=build -Wall -mavx -msse4 -lm -lz -lpthread -L build/ -lemb -I include \
-DINDICES_PER_LOOKUP=${INDICES_PER_LOOKUP} \
-DEMBEDDING_DIM=${EMBEDDING_DIM} \
-DEMBEDDING_DEPTH=${EMBEDDING_DEPTH} \
-DNR_RUN=${NR_RUN} \
-DBATCH_SIZE=${BATCH_SIZE} \
-DNR_EMBEDDING=${NR_EMBEDDING} \
-o build/emb synthetic_dataset_embedding.c `dpu-pkg-config --cflags --libs dpu`

build/emblib: src/*.c include/*.h
mkdir -p build
gcc -O3 -Wall -mavx -msse4 -lm -lz -lpthread -I include \
-DINDICES_PER_LOOKUP=${INDICES_PER_LOOKUP} \
-DCHECK_RESULTS=${CHECK_RESULTS} \
-DRAND_INPUT_SIZE=${RAND_INPUT_SIZE} \
-DMAX_INDICES_PER_LOOKUP_RAND=${MAX_INDICES_PER_LOOKUP_RAND} \
-DMAX_NR_EMBEDDING=${MAX_NR_EMBEDDING} \
-DMAX_BATCH_SIZE=${MAX_BATCH_SIZE} \
-DCHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN=${CHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN} \
-shared -o build/libemb.so -fPIC src/*.c `dpu-pkg-config --cflags --libs dpu`

build/embdpu: src/dpu/*.c
mkdir -p build
# TODO : why flto fails
dpu-clang -O3 -flto=thin -I include \
-DMAX_EMBEDDING_DIM=${MAX_EMBEDDING_DIM} \
-DNR_TASKLETS=${NR_TASKLETS} \
-DMAX_INDICES_PER_LOOKUP=${MAX_INDICES_PER_LOOKUP} \
-DMAX_BATCH_SIZE=${MAX_BATCH_SIZE} \
-o build/embdpu src/dpu/dpu_embedding.c

PROJECT = emb_host
PROJECT_LIB = emblib.so
EXE_DPU ?= emb_dpu_lookup
BUILD_DIR ?= build
NR_DPUS ?= 32
NR_TASKLETS ?= 14
COUNTER_CONFIG ?= "COUNT_CYCLES"
SHOW_DPU_LOGS ?= 1
MAX_ENC_BUFFER_MB ?= 1
NR_TABLES ?= 8
MAX_NR_BATCHES ?= 32
RT_CONFIG ?= "ALL"

# TEST with c_test.py
DPU_TEST ?= 0

ifeq ($(DPU_TEST),1)
NR_COLS ?= 6
NR_DPUS = 8
else
NR_COLS ?= 16
NR_DPUS ?= 32
endif

# Version information
VERSION = 0.0.0

# Application sources and artifacts
APP_BIN = $(BUILD_DIR)/$(PROJECT)
APP_LIB = $(BUILD_DIR)/$(PROJECT_LIB)
APP_SOURCES =
APP_MAIN = src/emb_host.c
APP_OBJS = $(patsubst %.c,$(BUILD_DIR)/%.o,$(APP_SOURCES) $(APP_MAIN))

# Includes
INC = -Iinclude

# Test sources and artifacts
TEST_BIN = $(BUILD_DIR)/$(PROJECT)_tests
TEST_SOURCES = $(APP_SOURCES) tests/main.c
TEST_OBJS = $(patsubst %.c,$(BUILD_DIR)/tests/%.o,$(TEST_SOURCES))

# Generated dependency files
DEPS = $(APP_OBJS:.o=.d) \
$(TEST_OBJS:.o=.d)

# Compiler options
CC = gcc
COMMON_CFLAGS = -lm --std=c11 -fPIC # -Wall -Wextra -MMD -Werror
SHARED_CFLAGS = -shared -Wl,-soname,$(PROJECT)
DPU_OPTS = `dpu-pkg-config --cflags --libs dpu`

# Tools
CLANG_FORMAT = clang-format

# Debug/Release mode
ifneq ($(DEBUG),)
COMMON_CFLAGS += -g -DDEBUG
BUILD_DIR := $(BUILD_DIR)/debug
else
COMMON_CFLAGS += -O3
BUILD_DIR := $(BUILD_DIR)/release
endif

CFLAGS += $(COMMON_CFLAGS) \
-DVERSION=$(VERSION) \
-DNR_DPUS=$(NR_DPUS) \
-DNR_TASKLETS=$(NR_TASKLETS) \
-DCOUNTER_CONFIG=$(COUNTER_CONFIG) \
-DNR_TABLES=$(NR_TABLES) \
-DNR_COLS=$(NR_COLS) \
-DMAX_NR_BATCHES=$(MAX_NR_BATCHES) \
-DHOST=1 \
-DDPU_BINARY=\"dpu/$(EXE_DPU)\" \
-DMAX_ENC_BUFFER_MB="$(MAX_ENC_BUFFER_MB)" \
-D_POSIX_C_SOURCE=199309L # For clock_gettime

# define SHOW_DPU_LOGS in the source if we want DPU logs
ifeq ($(SHOW_DPU_LOGS), 1)
CFLAGS+=-DSHOW_DPU_LOGS
endif

# Silence make
ifneq ($(V),)
SILENCE =
else
SILENCE = @
endif

# Fancy output
SHOW_COMMAND := @printf "%-15s%s\n"
SHOW_CC := $(SHOW_COMMAND) "[ $(CC) ]"
SHOW_CLEAN := $(SHOW_COMMAND) "[ CLEAN ]"
SHOW_GEN := $(SHOW_COMMAND) "[ GEN ]"
SHOW_MAKE := $(SHOW_COMMAND) "[ MAKE ]"
SHOW_FORMAT := $(SHOW_COMMAND) "[ FORMAT ]"

##############################################################################################
# Default target and help message
##############################################################################################
DEFAULT_TARGET = $(APP_BIN)

all: $(DEFAULT_TARGET) $(APP_LIB) dpu
.PHONY: all

# Take care of compiler generated depedencies
-include $(DEPS)

##############################################################################################
# Application
##############################################################################################
$(APP_BIN): $(APP_OBJS)
$(SHOW_CC) $@
$(SILENCE)$(CC) -o $@ $(APP_OBJS) $(DPU_OPTS)

$(APP_LIB): $(APP_OBJS)
$(SHOW_CC) $@
$(SILENCE)$(CC) -o $@ $(APP_OBJS) $(DPU_OPTS) $(SHARED_CFLAGS)

$(BUILD_DIR)/%.o: %.c
$(SHOW_CC) $@
$(SILENCE)mkdir -p $(dir $@)
$(SILENCE)$(CC) $(CFLAGS) $(INC) -c $< -o $@ $(DPU_OPTS)

##############################################################################################
# DPU Application
##############################################################################################

export BUILD_DIR
export DEBUG
export NR_DPUS
export NR_TASKLETS
export COUNTER_CONFIG
export EXE_DPU
export NR_TABLES
export NR_COLS
export MAX_NR_BATCHES

dpu:
$(SHOW_MAKE) $@
$(SILENCE)$(MAKE) -C src/dpu

##############################################################################################
# Tests
##############################################################################################



##############################################################################################
# Cleanup
##############################################################################################
clean:
$(SHOW_CLEAN) $(BUILD_DIR)
$(SILENCE)rm -rf $(BUILD_DIR)
([ -d "build" ] && rm -r build/ && rm -f out.cpu.log out.dpu.log) || [ ! -d "build" ]

.PHONY: clean
testdpu: FORCE
./build/emb

##############################################################################################
# Format
##############################################################################################
format:
$(SHOW_FORMAT) $@
$(SILENCE)$(CLANG_FORMAT) -i src/*.c # src/*.h include/*.h
$(SILENCE)$(MAKE) format -C src/dpu
tracedpu: FORCE
dpu-profiling functions -o dpu.json -a -A \
-f build_synthetic_input_data \
-f synthetic_inference \
--external-function ./build/libemb.so:lookup \
--external-function ./build/libemb.so:populate_mram \
--external-function ./build/libemb.so:gather_rank_embedding_results \
-- ./build/emb

.PHONY: format
FORCE: ;
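
Note on `CHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN`: the Makefile describes it as the rounding error margin for the CPU/DPU check. A comparison of that kind could look like the sketch below. It is a minimal illustration, not the code from this PR: it assumes the results are 32-bit integers, and the names `abs_diff_i32` and `first_mismatch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default; the Makefile in this diff sets the value to 5000. */
#ifndef CHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN
#define CHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN 5000
#endif

/* Hypothetical helper: absolute difference of two int32 values,
 * computed in 64 bits so the subtraction cannot overflow. */
static int64_t abs_diff_i32(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    return d < 0 ? -d : d;
}

/* Hypothetical helper: returns the index of the first element whose
 * CPU and DPU results differ by more than `margin`, or `n` if every
 * element is within the margin. */
static size_t first_mismatch(const int32_t *cpu, const int32_t *dpu,
                             size_t n, int64_t margin)
{
    assert(cpu != NULL && dpu != NULL);
    for (size_t i = 0; i < n; i++) {
        if (abs_diff_i32(cpu[i], dpu[i]) > margin) {
            return i;
        }
    }
    return n;
}
```

A caller would pass `CHECK_RESULT_ABSOLUTE_ROUNDING_ERROR_MARGIN` as `margin` and treat any return value smaller than `n` as a failed check. Returning the index rather than a boolean makes it easy to log which element diverged.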