Commit b8697b7

Release briefing 5.0 (#144)
* Initial commit for release briefing 5.0 * Add the date * Add empty slides for incompatibilities and deprecations * Add New Features * Start General Enhancements * Start slides on Deprecations * Add Bug Fixes section Signed-off-by: Carl Pearson <cwpears@sandia.gov> * BugFixes: improve styling Signed-off-by: Carl Pearson <cwpears@sandia.gov> * Start organizational section * Description of Breaking Changes. Signed-off-by: Hariprasad Kannan <hkannan@gmail.com> * General Enhancements: StaticBatchSize Signed-off-by: Adrien Taberner <adrien.taberner@outlook.com> * General Enhancements: first touch Signed-off-by: Adrien Taberner <adrien.taberner@outlook.com> * Remove stray \item in the StaticBatchSize slide * General Enhancements: UnorderedMap Signed-off-by: Adrien Taberner <adrien.taberner@outlook.com> * add build system updates * Removed blank slide titled Incompatibilities Signed-off-by: Hariprasad Kannan <hkannan@gmail.com> * Add concent to Deprecations Signed-off-by: Jan Ciesko <jan.ciesko@gmail.com> * add CUDA backend updates * add slides for sycl and openmptarget * Add slides for the OpenACC backend. 
Signed-off-by: Seyong Lee <lees2@ornl.gov> * Add KUG Europe slide to organizational * Small fixes in New Features * Small fixes in Backend Updates * Reword simd type conversions in General Enhancements * Fix typo in Deprecations * Some rewording in Breaking Changes * General Enhancements: Array and reduction identity Signed-off-by: Adrien Taberner <adrien.taberner@outlook.com> * rename No to Number * Add slides for HIP backend * Fix typo in Backend Updates * bug fixes: improve wording Signed-off-by: Carl Pearson <cwpears@sandia.gov> * Update MDSpan explanation and other minor fixes Signed-off-by: Christian Trott <crtrott@sandia.gov> * Separate Deprecations and Breaking Changes * Use bold font for main points in Breaking Changes * Mention Kokkos_ENABLE_DEPRECATED_CODE_5 in Deprecations * rework build system slides * Fix a typo in BuildSystemUpdates Co-authored-by: Daniel Arndt <arndtd@ornl.gov> * some rewording of backend updates * Training & user groups * Do not mention minimum requirements for OpenMPTarget * Fix spelling of ROCm * small wording fixes * some small wording changes in build section * use >= hopper * SRP Internships * Fix typo in CUDA language CMake version * Add missing $$ --------- Signed-off-by: Carl Pearson <cwpears@sandia.gov> Signed-off-by: Adrien Taberner <adrien.taberner@outlook.com> Signed-off-by: Jan Ciesko <jan.ciesko@gmail.com> Signed-off-by: Seyong Lee <lees2@ornl.gov> Signed-off-by: Christian Trott <crtrott@sandia.gov> Co-authored-by: Daniel Arndt <arndtd@ornl.gov> Co-authored-by: Dong Hun Lee <donlee@sandia.gov> Co-authored-by: Carl Pearson <cwpears@sandia.gov> Co-authored-by: Hariprasad KANNAN <3788945+science-enthusiast@users.noreply.github.com> Co-authored-by: Adrien Taberner <adrien.taberner@outlook.com> Co-authored-by: Jakob Bludau <bludauj@ornl.gov> Co-authored-by: Jan Ciesko <jan.ciesko@gmail.com> Co-authored-by: Seyong Lee <lees2@ornl.gov> Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com> Co-authored-by: Christian Trott 
<crtrott@sandia.gov> Co-authored-by: Julien Bigot <julien.bigot@cea.fr> Co-authored-by: Damien L-G <dalg24@gmail.com>
1 parent 31db85c commit b8697b7

13 files changed (+1159, -1 lines)
Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
1+
%==========================================================================
2+
3+
\begin{frame}[fragile]
4+
5+
{\Huge Backend Updates}
6+
7+
\vspace{10pt}
8+
9+
\end{frame}
10+
11+
%==========================================================================
12+
13+
\begin{frame}[fragile]
14+
15+
{\Huge CUDA}
16+
17+
\vspace{10pt}
18+
19+
\end{frame}
20+
21+
%==========================================================================
22+
23+
\begin{frame}[fragile]{128-bit CAS-based device atomics on $\geq$ Hopper and CUDA $\geq$ 12.8}
24+
25+
\begin{figure}[ht]
26+
\centering
27+
\begin{tikzpicture}
28+
\begin{axis}[
29+
%title={Atomics on $10^8$\texttt{Kokkos::complex<double>} \textbf{without} contention},
30+
width=0.8\textwidth,
31+
height=0.25\textwidth,
32+
grid=major,
33+
ymin=0,
34+
ybar,
35+
ybar=2pt,
36+
bar width=5pt,
37+
enlargelimits=0.15,
38+
legend style={at={(0.5,1.6)},
39+
anchor=north,legend columns=-1},
40+
ylabel={Speedup},
41+
symbolic x coords={add,sub,fetch\_add,fetch\_sub,fetch\_mul,fetch\_div},
42+
xtick=data,
43+
x tick label style={rotate=45,anchor=east},
44+
%nodes near coords,
45+
%nodes near coords align={vertical},
46+
]
47+
\addplot coordinates {(add,60.9383491542) (sub,61.3601695793) (fetch\_add,62.2689318092) (fetch\_sub,63.6933886833) (fetch\_mul,61.4623888556) (fetch\_div,27.1370696705)};
48+
\addplot coordinates {(add,7.5113136277) (sub,7.3840995425) (fetch\_add,7.2957874994) (fetch\_sub,7.3351913106) (fetch\_mul,7.3557821451) (fetch\_div,1.2353750828)};
49+
\legend{NVIDIA H100, NVIDIA RTX 5080 (Blackwell)}
50+
\end{axis}
51+
\end{tikzpicture}
52+
\end{figure}
53+
\begin{itemize}
54+
\item Atomics on $10^8$ \texttt{Kokkos::complex<double>} \textbf{without} contention
55+
\begin{itemize}
56+
\item Speedup $\approx 60\times$ on H100 and $\approx 7\times$ on RTX 5080.
57+
\item Same performance for \texttt{int128} and \texttt{Kokkos::complex<double>}.
58+
\item Division is more costly, so switching to CAS-based atomics yields a smaller speedup.
59+
\end{itemize}
60+
\end{itemize}
61+
62+
\end{frame}
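
The CAS-based approach named on this slide can be illustrated on the host with a compare-and-swap retry loop. This is a minimal sketch under stated assumptions, not the Kokkos device implementation: it packs a 64-bit \texttt{std::complex<float>} into a \texttt{std::atomic<std::uint64\_t>} as a stand-in for the 128-bit \texttt{Kokkos::complex<double>} case, and the helper names \texttt{pack}, \texttt{unpack}, and \texttt{cas\_fetch\_add} are hypothetical.

```cpp
#include <atomic>
#include <complex>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; none of these are Kokkos APIs.
// A complex<float> occupies 8 bytes, so it fits into one 64-bit atomic word.
inline std::uint64_t pack(std::complex<float> z) {
  static_assert(sizeof(std::complex<float>) == sizeof(std::uint64_t),
                "complex<float> must fit exactly into 64 bits");
  std::uint64_t bits;
  std::memcpy(&bits, &z, sizeof(bits));
  return bits;
}

inline std::complex<float> unpack(std::uint64_t bits) {
  std::complex<float> z;
  std::memcpy(&z, &bits, sizeof(z));
  return z;
}

// CAS-based fetch-add: read the current value, compute the sum, and try to
// swap it in. If another thread changed the word in between, the exchange
// fails, `expected` is refreshed with the current value, and the loop retries.
// Returns the value that was stored before the addition.
inline std::complex<float> cas_fetch_add(std::atomic<std::uint64_t>& target,
                                         std::complex<float> value) {
  std::uint64_t expected = target.load(std::memory_order_relaxed);
  for (;;) {
    const std::uint64_t desired = pack(unpack(expected) + value);
    if (target.compare_exchange_weak(expected, desired,
                                     std::memory_order_relaxed)) {
      return unpack(expected);
    }
  }
}
```

Each retry recomputes the arithmetic, so an expensive operation such as division spends relatively more time outside the exchange itself, which is consistent with the smaller speedup shown for \texttt{fetch\_div}.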
63+
64+
%==========================================================================
65+
66+
\begin{frame}[fragile]{Effect of contention on \texttt{atomic\_add} on Hopper}
67+
68+
\begin{figure}[ht]
69+
\centering
70+
\begin{tikzpicture}
71+
\begin{loglogaxis}[
72+
%title={\texttt{atomic\_add} of \texttt{Kokkos::complex<double>} with $10^8$ workers},
73+
width=0.4\textwidth,
74+
height=0.4\textwidth,
75+
log basis x = 10,
76+
grid=major,
77+
xmin=1,
78+
xlabel={Number of target addresses},
79+
ylabel={Slowdown}
80+
]
81+
\addplot+[mark=x] coordinates {
82+
(1,430493.9171899781)
83+
(10,15898.3990852844)
84+
(100,915.3459702246)
85+
(1000,43.9645742601)
86+
(10000,3.777034824)
87+
(100000,0.9787170675)
88+
(1000000,0.8960538272)
89+
(10000000,1.0)
90+
};
91+
\addplot+[mark=o] coordinates {
92+
(1,68593.7222119811)
93+
(10,4665.2018278071)
94+
(100,314.4430296381)
95+
(1000,4.9384166747)
96+
(10000,0.7793873907)
97+
(100000,0.07931081247)
98+
(1000000,0.04049346243)
99+
(10000000,0.04297514822)
100+
};
101+
\legend{Lock-based,CAS-based}
102+
\end{loglogaxis}
103+
\end{tikzpicture}
104+
\caption{\texttt{atomic\_add} of \texttt{Kokkos::complex<double>} with $10^8$ workers}
105+
\end{figure}
106+
\vspace{-0.8cm}
107+
\begin{itemize}
108+
\item Under high contention, CAS-based atomics slow down in much the same way as lock-based atomics.
109+
\end{itemize}
110+
111+
\end{frame}
112+
113+
114+
\begin{frame}[fragile]{Leverage larger kernel arguments}
115+
116+
\begin{itemize}
117+
\item Allows launching kernels with up to 32\,kB of arguments; the previous limit was 4\,kB.
118+
\item Enables us to side-step the ``Constant Cache'' launch mechanism in Kokkos.
119+
\item Affects functors in the 4\,kB to 32\,kB size range. No effect on smaller or larger functors.
120+
\item This changes the synchronization behavior for functors in this range, because it eliminates the implicit synchronization that was required when using the constant cache buffer.
121+
\item Does not apply when Clang is used as the CUDA compiler, nor to GPUs older than Volta (i.e., compute capability lower than 7.0).
122+
\end{itemize}
123+
124+
\end{frame}
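
The conditions listed on this slide can be summarized as a single predicate. The sketch below is only an illustration of those conditions: the function name and parameters are hypothetical, it is not how Kokkos selects a launch mechanism internally, and the byte thresholds are the slide's rounded 4\,kB and 32\,kB values, with the boundary handling being an assumption.

```cpp
#include <cstddef>

// Thresholds taken from the slide's rounded "4kB" and "32kB" figures.
// The exact limits enforced by the CUDA toolkit may differ (assumption).
constexpr std::size_t small_arg_limit = 4 * 1024;
constexpr std::size_t large_arg_limit = 32 * 1024;

// Hypothetical predicate, not a Kokkos API: returns true if a functor of the
// given size would now be passed as a kernel argument instead of going
// through the constant cache launch mechanism.
constexpr bool affected_by_large_kernel_args(std::size_t functor_bytes,
                                             bool clang_is_cuda_compiler,
                                             int compute_capability_major) {
  return functor_bytes > small_arg_limit &&   // smaller functors: unchanged
         functor_bytes <= large_arg_limit &&  // larger functors: unchanged
         !clang_is_cuda_compiler &&           // not applied with Clang
         compute_capability_major >= 7;       // Volta or newer
}
```

Code that relied on the implicit synchronization of the constant cache path for functors in this range may want to check its explicit fences.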
125+
126+
%==========================================================================
127+
128+
\begin{frame}[fragile]
129+
130+
{\Huge SYCL}
131+
132+
\vspace{10pt}
133+
134+
\end{frame}
135+
136+
%==========================================================================
137+
138+
\begin{frame}[fragile]{Use unsigned integer type as \texttt{size\_type} in SYCL}
139+
140+
\begin{itemize}
141+
\item \texttt{SYCL} now uses an unsigned integer type as \texttt{size\_type}.
142+
\item \texttt{size\_type} is now an unsigned integer type across all backends.
143+
\end{itemize}
144+
145+
\end{frame}
146+
147+
%==========================================================================
148+
149+
\begin{frame}[fragile]
150+
151+
{\Huge OpenMPTarget}
152+
153+
\vspace{10pt}
154+
155+
\end{frame}
156+
157+
%==========================================================================
158+
159+
\begin{frame}[fragile]{Allow \texttt{parallel\_scan} to start anywhere in OpenMPTarget}
160+
161+
\begin{itemize}
162+
\item Previously, \texttt{parallel\_scan} with a \texttt{RangePolicy} needed to start at index 0.
163+
\item Now any starting index smaller than the end index is supported.
164+
\end{itemize}
165+
166+
\end{frame}
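
To make the meaning of a non-zero starting index concrete, the sketch below emulates a scan over a half-open range serially in plain C++. It is a stand-in for the semantics only and does not use Kokkos: \texttt{serial\_scan} and \texttt{exclusive\_scan\_range} are hypothetical names, and the functor signature \texttt{(index, update, is\_final)} is modeled on the usual scan functor shape as an assumption.

```cpp
#include <cstddef>
#include <vector>

// Serial stand-in for a scan over [begin, end). NOT the Kokkos implementation:
// it simply visits the indices in order and threads the running value through.
template <class Functor>
long serial_scan(std::size_t begin, std::size_t end, Functor&& f) {
  long update = 0;  // the running value starts at zero at `begin`, not at index 0
  for (std::size_t i = begin; i < end; ++i) {
    f(i, update, /*is_final=*/true);
  }
  return update;
}

// Exclusive prefix sum of in[begin..end), written to out[begin..end).
// Elements outside the range are left untouched. Returns the range total.
inline long exclusive_scan_range(const std::vector<long>& in,
                                 std::vector<long>& out,
                                 std::size_t begin, std::size_t end) {
  return serial_scan(begin, end,
                     [&](std::size_t i, long& update, bool is_final) {
                       if (is_final) out[i] = update;  // sum of earlier entries
                       update += in[i];
                     });
}
```

For an input of \texttt{\{1, 2, 3, 4, 5\}} and the range from index 2 to 5, only the last three output entries are written, and the prefix sum restarts from zero at index 2.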
167+
168+
%==========================================================================
169+
170+
\begin{frame}[fragile]{Upcoming removal of OpenMPTarget}
171+
\begin{large}
172+
We decided to remove OpenMPTarget in an upcoming release!
173+
\end{large}
174+
175+
\begin{itemize}
176+
\item Never reached feature parity.
177+
\item Lower performance than native backends (CUDA, HIP, SYCL).
178+
\item Practically no users.
179+
\item Little interest in support by any institution.
180+
\end{itemize}
181+
\end{frame}
182+
183+
%==========================================================================
184+
185+
\begin{frame}[fragile]
186+
187+
{\Huge OpenACC}
188+
189+
\vspace{10pt}
190+
191+
\end{frame}
192+
193+
%==========================================================================
194+
195+
\begin{frame}[fragile]{Allow \texttt{parallel\_scan} to start anywhere in OpenACC}
196+
197+
\begin{itemize}
198+
\item Previously, \texttt{parallel\_scan} with a \texttt{RangePolicy} needed to start at index 0.
199+
\item Now any starting index smaller than the end index is supported.
200+
\end{itemize}
201+
202+
\end{frame}
203+
204+
%==========================================================================
205+
206+
\begin{frame}[fragile]{Support \texttt{Kokkos\_Random} algorithms API in OpenACC}
207+
208+
\begin{itemize}
209+
\item OpenACC now supports the \texttt{Kokkos\_Random} algorithms API.
210+
\item Can be inefficient if the actual team size is different from the default team size.
211+
\end{itemize}
212+
213+
\end{frame}
214+
215+
%==========================================================================
216+
217+
\begin{frame}[fragile]{Support \texttt{partition\_space} API in OpenACC}
218+
219+
\begin{itemize}
220+
\item OpenACC now supports the \texttt{partition\_space} API.
221+
\item Execution space instances created by \texttt{partition\_space} will use OpenACC async IDs in a reserved range (from 64 to 191), which are assigned in a round-robin manner.
222+
\end{itemize}
223+
224+
\end{frame}
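
The round-robin assignment described on this slide can be sketched as a small counter that cycles through the reserved range. This is an illustration only: the class name \texttt{AsyncIdRoundRobin} is hypothetical, and starting at 64 and wrapping back to 64 after 191 are assumptions about how the range is traversed, not details confirmed by the slide.

```cpp
// Hypothetical illustration of round-robin assignment over the reserved
// OpenACC async ID range 64..191 (128 IDs). Not the Kokkos implementation,
// and not thread-safe.
class AsyncIdRoundRobin {
 public:
  static constexpr int first_id = 64;
  static constexpr int last_id = 191;

  // Returns the next async ID and advances, wrapping after last_id.
  int next() {
    const int id = first_id + counter_;
    counter_ = (counter_ + 1) % (last_id - first_id + 1);
    return id;
  }

 private:
  int counter_ = 0;  // offset of the next ID within the reserved range
};
```

Under these assumptions, creating more than 128 instances would cause some of them to share an async ID.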
225+
226+
%==========================================================================
227+
228+
\begin{frame}[fragile]{Support custom scalar reduction in \texttt{parallel\_reduce} with a \texttt{RangePolicy} in OpenACC}
229+
230+
\begin{itemize}
231+
\item OpenACC now supports custom scalar reduction with \texttt{parallel\_reduce} and \texttt{RangePolicy}.
232+
\item Supports both built-in reducers with custom scalar types, and custom reducers with custom scalar types.
233+
\end{itemize}
234+
\end{frame}
235+
236+
%==========================================================================
237+
238+
\begin{frame}[fragile]
239+
240+
{\Huge HIP}
241+
242+
\vspace{10pt}
243+
244+
\end{frame}
245+
246+
%==========================================================================
247+
248+
\begin{frame}[fragile]{Improved performance and new architecture support}
249+
250+
\begin{itemize}
251+
\item Fix a performance regression introduced in 4.6 when using lightweight
252+
kernels (\texttt{Experimental::WorkItemProperty::HintLightWeight}) in \texttt{parallel\_reduce}
253+
\item Prefer smaller block sizes for \texttt{parallel\_for} when
254+
the requested parallelism is less than the available concurrency
255+
\item Use atomic builtins for \texttt{atomic\_fetch\_{min/max}} with
256+
floating-point types instead of our own implementation
257+
\item Add support for \texttt{Navi4} architecture (Radeon AI PRO R9700, Radeon
258+
RX 9070 XT)
259+
\end{itemize}
260+
\end{frame}
261+
262+
%==========================================================================
263+
\begin{frame}[fragile]{ROCm 7.1}
264+
\begin{itemize}
265+
\item Avoid using ROCm 7.1 if possible: \textbf{you may get incorrect results}
266+
\item On MI100 and MI200 series, use
267+
\texttt{-DKokkos\_ENABLE\_IMPL\_HIP\_MALLOC\_ASYNC=OFF}
268+
\item On MI300 series, we cannot compile the test suite yet. You will very likely
269+
also need to use \texttt{-DKokkos\_ENABLE\_IMPL\_HIP\_MALLOC\_ASYNC=OFF}
270+
\end{itemize}
271+
\end{frame}
