Commit 19ba009

ldh4, cwpearson, masterleinad, PaulGannay, and JBludau authored
Release briefing 4.7 (#137)
* Initial commit for 4_7
* 4.7: BugFixes content from Trevis Morvany
  Signed-off-by: Carl Pearson <cwpears@sandia.gov>
* Add slides for deprecation and incompatibilities
* Unify slides
* Add benches results for fp16 improvements
* 4.7: BugFixes content from Paul Zehner
  Signed-off-by: Carl Pearson <cwpears@sandia.gov>
* 4.7: merge/cleanup bugfixes content
  Signed-off-by: Carl Pearson <cwpears@sandia.gov>
* add build system update slides
* Content for General Enhacements
  Signed-off-by: Jan Ciesko <jan.ciesko@gmail.com>
* Add slides on backend updates from Rahul
* Updated date
* Match formatting between GeneralEnhancements and other sections
  Mainly with use of \texttt{}
* Added starter slides on organizational section
* Add coverage of Kokkos Graphs in Generan Enhancements
* Applying suggestions from reviews
* Add example for subview constructor fix
* Minor cleanup
* add new feature slides and fix typo in organizational
* Move some content from General to Backend
* Add graphs section to New Features
* Add graphs section to New Features
* Add Other Performance Improvements slide to Backend Updates
* Update graph example and structured binding example
* Spelling
* Add one more example for host-side graph node
* format
* Removed a placeholder slide
* Improve breaking changes section
* Release 4.7: Add trademark slides
* Fix an example
* Prefer [[deprecated]] attribute to KOKKOS_DEPRECATED macro
* Tweak to enhancements section
* Add Kokkos 5 deprecation warning slide
* Update partition_space slide
* Update atomic slide
* Other minor things
* Add SiPearl attribution of SVE backend
---------
Signed-off-by: Carl Pearson <cwpears@sandia.gov>
Signed-off-by: Jan Ciesko <jan.ciesko@gmail.com>
Co-authored-by: Carl Pearson <cwpears@sandia.gov>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: Paul Gannay <paul.gannay@cea.fr>
Co-authored-by: Jakob Bludau <bludauj@ornl.gov>
Co-authored-by: Jan Ciesko <jan.ciesko@gmail.com>
Co-authored-by: tcclevenger <tccleve@sandia.gov>
Co-authored-by: Nicolas Morales <nmmoral@sandia.gov>
Co-authored-by: Damien L-G <dalg24@gmail.com>
Co-authored-by: Christian Trott <crtrott@sandia.gov>
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
1 parent f06aea5 commit 19ba009

9 files changed

+893
-1
lines changed
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
1+
%==========================================================================
2+
3+
\begin{frame}[fragile]
4+
5+
{\Huge Backend Updates}
6+
7+
\vspace{10pt}
8+
9+
\end{frame}
10+
11+
12+
%==========================================================================
13+
14+
% Examples
15+
16+
% note: always keep the [fragile] for your frames!
17+
18+
%\begin{frame}[fragile]{Title}
19+
% Contents
20+
%\end{frame}
21+
22+
%==========================================================================
23+
\begin{frame}[fragile]{CUDA and SYCL}
24+
\begin{itemize}
25+
\item CUDA: Add support for AMPERE87 architecture (Jetson Orin Nano)
26+
\item CUDA: Support RDC with Clang 17+ and use new offload driver
27+
\item SYCL: Add support for Intel DG2 GPUs such as the Arc Alchemist GPUs
28+
\item SYCL: Allow using non-trivially-copyable comparators with oneDPL
29+
\end{itemize}
30+
\end{frame}
31+
32+
%==========================================================================
33+
34+
\begin{frame}[fragile]{Improve half-float performance for CUDA and SYCL backends}
35+
\begin{itemize}
36+
\item CUDA and SYCL: Directly use the available fp16 mathematical functions instead of casting back and forth to fp32
37+
\end{itemize}
38+
\begin{tikzpicture}
39+
\begin{axis}[
40+
width = 0.85*\textwidth,
41+
height = 0.75*\textheight,
42+
major x tick style = transparent,
43+
ybar=2*\pgflinewidth,
44+
bar width=14pt,
45+
ymajorgrids = true,
46+
ylabel = {Exec time (µs)},
47+
symbolic x coords={Memory Bound Kernel, Compute Bound Kernel},
48+
xtick = data,
49+
scaled y ticks = false,
50+
enlarge x limits=0.25,
51+
ymin=0,
52+
legend style={at={(0.3,0.75)},anchor=west},
53+
]
54+
\addplot
55+
coordinates {(Memory Bound Kernel, 32) (Compute Bound Kernel, 102)};
56+
57+
\addplot
58+
coordinates {(Memory Bound Kernel, 15.8) (Compute Bound Kernel, 102)};
59+
60+
\addplot
61+
coordinates {(Memory Bound Kernel, 14.7) (Compute Bound Kernel, 63)};
62+
63+
\legend{float 32, fp16 old, fp16 new}
64+
\end{axis}
65+
\end{tikzpicture}
66+
\end{frame}
67+
68+
% Bench details:
69+
% - NVIDIA A100
70+
% - 2^20 (1 million) elements
71+
% - Memory bound kernel is doing:
72+
% tmp = init(i);
73+
% res(i) = sqrt(cos(tmp) + sin(tmp));
74+
% - Compute bound is doing 16 times the work of Memory Bound
75+
76+
%==========================================================================
77+
78+
\begin{frame}[fragile]{Other Performance Improvements}
79+
\begin{itemize}
80+
\item {Improving atomic performance for \texttt{op\_fetch}}
81+
\begin{itemize}
82+
\item \texttt{atomic\_op\_fetch} was not specialized as diligently as \texttt{atomic\_fetch\_op} to leverage hardware support or vendor APIs, so it fell back to the compare-and-swap (CAS) implementation
83+
\item \texttt{atomic\_op\_fetch} is now expressed as ``\texttt{op} applied to the result of \texttt{atomic\_fetch\_op}'', so it systematically benefits from the specializations already written
84+
\item The specialized \texttt{atomic\_add\_fetch} is 10x to 100x faster than CAS on GPUs
85+
\end{itemize}
86+
\item Labels are now passed \emph{by reference} in all Kokkos Tools APIs (improving performance)
87+
\end{itemize}
88+
\end{frame}
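
The idea of expressing \texttt{op\_fetch} through \texttt{fetch\_op} can be sketched with standard C++ atomics. This is only an illustration under stated assumptions: it uses \texttt{std::atomic} rather than the Kokkos atomics, and the helper names \texttt{atomic\_add\_fetch\_sketch} and \texttt{atomic\_add\_fetch\_cas} are hypothetical, not Kokkos APIs. The first helper reuses the (typically hardware-backed) \texttt{fetch\_add}; the second shows the kind of CAS loop a generic fallback would use.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not a Kokkos API): add_fetch built on top of fetch_add.
template <class T>
T atomic_add_fetch_sketch(std::atomic<T>& obj, T val) {
  // fetch_add returns the value stored *before* the update.
  T old = obj.fetch_add(val);
  // Applying the same op to the old value yields the value *after* the update.
  return old + val;
}

// Hypothetical generic fallback: the same operation as a compare-and-swap loop.
template <class T>
T atomic_add_fetch_cas(std::atomic<T>& obj, T val) {
  T expected = obj.load();
  // On failure, compare_exchange_weak reloads `expected` with the current value,
  // so the loop retries with fresh data until the exchange succeeds.
  while (!obj.compare_exchange_weak(expected, expected + val)) {
  }
  return expected + val;
}
```

Both helpers return the updated value; the first one needs a single atomic read-modify-write, while the second may retry under contention.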
89+
90+
%==========================================================================
91+
92+
\begin{frame}[fragile]{OpenMPTarget}
93+
\begin{itemize}
94+
\item Remove support for non-LLVM compilers as part of the strategy to only support LLVM compilers in the backend.
95+
\item LLVM compilers support extensions to OpenMP directives that allow \textit{grid}-style kernel launches, which suit GPUs better and avoid the overhead of OpenMP's fork-join model.
96+
\end{itemize}
97+
\end{frame}
98+
99+
%==========================================================================
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
%==========================================================================
2+
3+
\begin{frame}[fragile]
4+
5+
{\Huge Deprecations and other breaking changes}
6+
7+
\vspace{10pt}
8+
9+
\end{frame}
10+
11+
%==========================================================================
12+
13+
\begin{frame}[fragile]{Incompatibilities}
14+
15+
\textbf{SYCL Backend}
16+
\begin{itemize}
17+
\item The minimum required \textbf{IntelLLVM} version has been raised from \textbf{2023.0.0} to \textbf{2024.2.1}.
18+
This change aligns with the Intel HPC Toolkit used for CI testing and
19+
resolves critical issues with sorting algorithms.
20+
\end{itemize}
21+
\textbf{DualView Debugging}
22+
\begin{itemize}
23+
\item The option \texttt{Kokkos\_ENABLE\_DEBUG\_DUALVIEW\_MODIFY\_CHECK} has
24+
been deprecated and is now \textbf{always enabled}. Previously, its default
25+
value was dependent on the \texttt{Kokkos\_ENABLE\_DEBUG} option.
26+
\item \textbf{Rationale:} Enabling this check provides valuable debug information
27+
for \texttt{DualView::modify[\_\{device,host\}]} calls without a
28+
significant performance penalty. It also simplifies the configuration
29+
process for users by reducing the number of available build options.
30+
\end{itemize}
31+
\end{frame}
32+
33+
%==========================================================================
34+
35+
\begin{frame}[fragile]{Deprecations}
36+
\begin{itemize}
37+
\item Deprecate \texttt{KOKKOS\_MEMORY\_ALIGNMENT[\_THRESHOLD]} macros
38+
\item Deprecate \texttt{KOKKOS\_NONTEMPORAL\_PREFETCH\_\{LOAD,STORE\}} macros
39+
\item Deprecate \texttt{Kokkos::MemoryManaged} as alias for default memory traits
40+
\begin{code}[keywords={MemoryManaged}]
41+
using MemoryManaged [[deprecated]] = Kokkos::MemoryTraits<>;
42+
// ^^^^^^^^^^^^^^^^^^^^^^
43+
// added default template argument
44+
// to avoid spelling out the integer
45+
// value of the empty bitmask
46+
\end{code}
47+
\end{itemize}
48+
\end{frame}
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
1+
%==========================================================================
2+
3+
\begin{frame}[fragile]
4+
5+
{\Huge Bug Fixes}
6+
7+
\vspace{10pt}
8+
9+
\end{frame}
10+
11+
%==========================================================================
12+
13+
% Examples
14+
15+
% note: always keep the [fragile] for your frames!
16+
17+
%\begin{frame}[fragile]{Example list}
18+
% \begin{itemize}
19+
% \item Item 1
20+
% \item Item 2 with some \texttt{code}
21+
% \begin{itemize}
22+
% \item Sub-item 2.1
23+
% \item Sub-item 2.2
24+
% \end{itemize}
25+
% \end{itemize}
26+
%\end{frame}
27+
28+
%\begin{frame}[fragile]{Example code}
29+
% \begin{code}[keywords={std}]
30+
% #include <iostream>
31+
%
32+
% int main() {
33+
% std::cout << "hello world\n";
34+
% }
35+
% \end{code}
36+
%\end{frame}
37+
38+
%\begin{frame}[fragile]{Example table}
39+
% \begin{center}
40+
% \begin{tabular}{l|l}
41+
% a & b \\\hline
42+
% c & d
43+
% \end{tabular}
44+
% \end{center}
45+
%\end{frame}
46+
47+
%==========================================================================
48+
49+
\begin{frame}[fragile]{General Bug Fixes}
50+
\begin{itemize}
51+
\item Fix a memory leak from an early exit when using \texttt{--kokkos-tools-help} % 8074
52+
\item Add missing fences for async Random init with unified memory % 8105
53+
\item More robust checks on subview constructor % #8210
54+
\begin{code}[keywords={std}]
55+
View<T**, LayoutLeft> a(N,N);
56+
57+
// Previously allowed, but a(1, ALL) is strided and does not fit LayoutLeft.
58+
View<T*, LayoutLeft> sub_a(a, 1, ALL); // Now a runtime error
59+
\end{code}
60+
61+
\end{itemize}
62+
\end{frame}
63+
64+
%==========================================================================
65+
66+
\begin{frame}[fragile]{General Bug Fixes}
67+
\begin{itemize}
68+
\item SIMD:
69+
\begin{itemize}
70+
\item Fix compile errors with \texttt{Kokkos\_ARCH\_NATIVE=ON} % 7912
71+
\item Fix fallback SIMD masked reductions that used incorrect identity elements % 8115
72+
\end{itemize}
73+
\item Compilers:
74+
\begin{itemize}
75+
\item Apply a workaround for a segfault in \texttt{SharedAllocationTracker} with GCC 12.2, 12.3 and 12.4 % 8223
76+
\item Fix compilation with C++23-capable compilers that provide an \texttt{mdspan} implementation % #8234
77+
\end{itemize}
78+
\end{itemize}
79+
\end{frame}
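
Why the identity element matters for a masked reduction can be sketched in plain C++. This is a scalar stand-in, not the Kokkos SIMD implementation: the names \texttt{masked\_reduce}, \texttt{masked\_sum}, \texttt{masked\_product}, \texttt{masked\_min} and \texttt{masked\_max} are hypothetical, and the slide does not say which identity was wrong. Lanes that are masked off must contribute a value that leaves the result unchanged, and that value differs per operation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical scalar model of a masked reduction (not a Kokkos API).
template <class T, std::size_t N, class Op>
T masked_reduce(const std::array<T, N>& v, const std::array<bool, N>& mask,
                T identity, Op op) {
  T result = identity;
  for (std::size_t i = 0; i < N; ++i) {
    // Masked-off lanes contribute the identity, which must not change the result.
    result = op(result, mask[i] ? v[i] : identity);
  }
  return result;
}

// Each operation needs its own identity element.
template <class T, std::size_t N>
T masked_sum(const std::array<T, N>& v, const std::array<bool, N>& m) {
  return masked_reduce(v, m, T{0}, [](T a, T b) { return a + b; });
}

template <class T, std::size_t N>
T masked_product(const std::array<T, N>& v, const std::array<bool, N>& m) {
  return masked_reduce(v, m, T{1}, [](T a, T b) { return a * b; });
}

template <class T, std::size_t N>
T masked_min(const std::array<T, N>& v, const std::array<bool, N>& m) {
  // Identity for min is the largest representable value, not zero.
  return masked_reduce(v, m, std::numeric_limits<T>::max(),
                       [](T a, T b) { return a < b ? a : b; });
}

template <class T, std::size_t N>
T masked_max(const std::array<T, N>& v, const std::array<bool, N>& m) {
  // Identity for max is the lowest representable value.
  return masked_reduce(v, m, std::numeric_limits<T>::lowest(),
                       [](T a, T b) { return a < b ? b : a; });
}
```

Using zero as the identity for every operation would, for example, make a masked minimum over positive values return zero as soon as one lane is masked off.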
80+
81+
%==========================================================================
82+
83+
\begin{frame}[fragile]{Backend Bug Fixes}
84+
\begin{itemize}
85+
\item HPX: Constrain the \texttt{hpx\_thread\_buffer} size used with \texttt{TeamPolicy} setup % #8147
86+
\item HIP and SYCL:
87+
\begin{itemize}
88+
\item An \texttt{MDRangePolicy} of rank 4 or more would be iterated incorrectly, leading to some iterations being evaluated more than once for large enough loops % 7880
89+
\end{itemize}
90+
\item HIP:
91+
\begin{itemize}
92+
\item The \texttt{ConstantMemory} launch mechanism would sporadically fail due to a \texttt{hipEventSynchronize} error % 8094
93+
\item Fix launch of intermediate size functors in graph % #8188
94+
\end{itemize}
95+
\item Serial: Fix a memory leak in internal instance data % 8042
96+
\item OpenMP Target and OpenACC: An out-of-bounds access would occur in \texttt{Random\_UniqueIndex} under certain circumstances % 8077
97+
\end{itemize}
98+
\end{frame}
99+
100+
101+
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
%==========================================================================
2+
3+
\begin{frame}[fragile]
4+
5+
{\Huge Build Systems Updates}
6+
7+
\vspace{10pt}
8+
9+
\end{frame}
10+
11+
%==========================================================================
12+
13+
\begin{frame}[fragile]{General Build System Updates}
14+
\begin{itemize}
15+
\item Require GCC 10.4 for C++20 builds to avoid an ISO C++20 bug %8130
16+
\item Error out when both \texttt{BUILD\_SHARED\_LIBS} and \texttt{RELOCATABLE\_DEVICE\_CODE} are enabled. \\ Vendors do not support this combination; Kokkos now checks for it % #8196
17+
\item Support more \texttt{nvcc} arguments with \texttt{nvcc\_wrapper}: \\ \texttt{--ftz}, \texttt{--prec-div}, and \texttt{--prec-sqrt} %7930
18+
\item Add NVIDIA Blackwell architecture support to the makefiles \\ \textbf{Makefiles are officially deprecated} %8055
19+
20+
\end{itemize}
21+
\end{frame}
22+
23+
%==========================================================================
24+
\begin{frame}[fragile]{Compiler and linker flag check}
25+
\centering
26+
\textbf{We now check the compiler and linker flags at configure time with the given CXX compiler} %7891
27+
\begin{itemize}
28+
\item Uses CMake's compiler and linker checks
29+
\item Uses \texttt{CMAKE\_CXX\_FLAGS} and the flags Kokkos sets
30+
\item Not used when \texttt{kokkos\_launch\_compiler} script is used
31+
\item If you suspect a false positive, please tell us
32+
\end{itemize}
33+
\end{frame}
34+
35+
%==========================================================================
36+
37+
%\begin{frame}[fragile]{Build C++20 module}
38+
% \centering
39+
% \textbf{We are working on supplying C++20 modules} %8132
40+
% \begin{itemize}
41+
% \item For now only available for \texttt{StdAlgorithms}
42+
% \item Enable with \texttt{Kokkos\_ENABLE\_EXPERIMENTAL\_CXX20\_MODULES}
43+
% \item We test it with \texttt{clang-19}
44+
% \item Ongoing integration, more code already in \texttt{develop}
45+
% \end{itemize}
46+
%\end{frame}
47+
48+
%==========================================================================
