# Performance Tips

## Avoid global variables

A global variable might have its value, and therefore its type, change at any point, which makes it difficult for the compiler to optimize code that uses it. Variables should be local, or passed as arguments to functions, whenever possible; any code that is performance-critical or being benchmarked should be inside a function.

If a global's value is constant, declaring it as such greatly improves performance:

```julia
const DEFAULT_VAL = 0
```

Uses of non-constant globals can be optimized by annotating their types at the point of use:

```julia
global x = rand(1000)

function loop_over_global()
    s = 0.0
    for i in x::Vector{Float64}
        s += i
    end
    return s
end
```

Passing arguments to functions is better style. It leads to more reusable code and clarifies what the inputs and outputs are.

Note

All code in the REPL is evaluated in global scope, so a variable defined and assigned at top level will be a global variable. Variables defined at top level scope inside modules are also global.

In the REPL,

```julia
julia> x = 1.0
```

is equivalent to

```julia
julia> global x = 1.0
```

so all of the performance issues discussed previously apply.

## Measure performance with @time and pay attention to memory allocation

A useful tool for measuring performance is the @time macro. Here we repeat the example with the global variable above, but this time with the type annotation removed:

```julia
julia> x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global()
  0.017705 seconds (15.28 k allocations: 694.484 KiB)
496.84883432553846

julia> @time sum_global()
  0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
```

On the first call (@time sum_global()) the function gets compiled, so you should not take the results of that run seriously. Note that the second run still reports a significant amount of allocation. Here we are just computing a sum over the elements of a vector of 64-bit floats, so there should be no need to allocate (heap) memory. Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type stability or the creation of many small temporary arrays; consequently, the generated code is likely far from optimal. Take such indications seriously and follow the advice below.

If we instead pass x as an argument to the function, it no longer allocates memory (the remaining allocation reported below is due to running the @time macro in global scope) and is significantly faster after the first call:

```julia
julia> x = rand(1000);

julia> function sum_arg(x)
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_arg(x)
  0.007701 seconds (821 allocations: 43.059 KiB)
496.84883432553846

julia> @time sum_arg(x)
  0.000006 seconds (5 allocations: 176 bytes)
496.84883432553846
```

The 5 allocations seen here are from running the @time macro itself in global scope. If we instead run the timing in a function, we can see that indeed no allocations are performed:

```julia
julia> time_sum(x) = @time sum_arg(x);

julia> time_sum(x)
  0.000001 seconds
496.84883432553846
```

In some situations, your function may need to allocate memory as part of its operation, and this can complicate the simple picture above. In such cases, consider using one of the tools below to diagnose problems, or write a version of your function that separates allocation from its algorithmic aspects (see Pre-allocating outputs).
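A hedged sketch of that separation (the names running_means and running_means! are hypothetical): the mutating kernel does the algorithmic work, while a thin wrapper performs the single allocation.

```julia
# Hypothetical example: separate allocation from the algorithm.
# running_means!(out, x) does the work; running_means(x) only allocates.
function running_means!(out::AbstractVector, x::AbstractVector)
    s = 0.0
    for i in eachindex(x, out)
        s += x[i]
        out[i] = s / i          # mean of x[1:i]
    end
    return out
end

running_means(x) = running_means!(similar(x, float(eltype(x))), x)
```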

Note

For more serious benchmarking, consider the BenchmarkTools.jl package which among other things evaluates the function multiple times in order to reduce noise.
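For instance, a minimal sketch with BenchmarkTools (assuming the package is installed; the $ interpolates x so the benchmark is not penalized for the non-constant global):

```julia
using BenchmarkTools

x = rand(1000)
@btime sum_arg($x)   # runs many samples and reports the minimum time
```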

## Tools

Julia and its package ecosystem include tools that may help you diagnose problems and improve the performance of your code:

• Profiling allows you to measure the performance of your running code and identify lines that serve as bottlenecks. For complex projects, the ProfileView package can help you visualize your profiling results.
• Unexpectedly-large memory allocations, as reported by @time, @allocated, or the profiler (through calls to the garbage-collection routines), hint that there might be issues with your code. If you don't see another reason for the allocations, suspect a type problem. You can also start Julia with the --track-allocation=user option and examine the resulting *.mem files to see information about where those allocations occur (see Memory allocation analysis, and the small example after this list).
• @code_warntype generates a representation of your code that can be helpful in finding expressions that result in type uncertainty. See @code_warntype below.
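As a quick check of the second point, @allocated reports the heap bytes allocated by evaluating an expression; a minimal sketch, reusing sum_arg from above:

```julia
# Sketch: measure heap allocation of a single call.
x = rand(1000)
bytes = @allocated sum_arg(x)
```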

## Avoid containers with abstract type parameters

When working with parameterized types, including arrays, it is best to avoid parameterizing with abstract types where possible.

Consider the following:

```julia
julia> a = Real[]
Real[]

julia> push!(a, 1); push!(a, 2.0); push!(a, π)
3-element Array{Real,1}:
 1
 2.0
 π = 3.1415926535897...
```

Because a is an array of abstract type Real, it must be able to hold any Real value. Since Real objects can be of arbitrary size and structure, a must be represented as an array of pointers to individually allocated Real objects. If we instead only allow numbers of the same type, e.g. Float64, to be stored in a, these can be stored more efficiently as a contiguous block of memory:

```julia
julia> a = Float64[]
Float64[]

julia> push!(a, 1); push!(a, 2.0); push!(a, π)
3-element Array{Float64,1}:
 1.0
 2.0
 3.141592653589793
```

## Type declarations

In many languages with optional type declarations, adding declarations is the principal way to make code run faster. This is not the case in Julia. In Julia, the compiler generally knows the types of all function arguments, local variables, and expressions. However, there are a few specific instances where declarations are helpful.

### Avoid fields with abstract type

Types can be declared without specifying the types of their fields:

```julia
julia> struct MyAmbiguousType
           a
       end
```

This allows a to be of any type. This can often be useful, but it has a downside: for objects of type MyAmbiguousType, the compiler will not be able to generate high-performance code. The reason is that the compiler uses the types of objects, not their values, to determine how to build code. Unfortunately, very little can be inferred about an object of type MyAmbiguousType:

```julia
julia> b = MyAmbiguousType("Hello")
MyAmbiguousType("Hello")

julia> c = MyAmbiguousType(17)
MyAmbiguousType(17)

julia> typeof(b)
MyAmbiguousType

julia> typeof(c)
MyAmbiguousType
```

The values of b and c have the same type, yet their underlying representation of data in memory is very different. Even if you stored only numeric values in field a, the fact that the memory representation of a UInt8 differs from that of a Float64 also means that the CPU needs to handle them using two different kinds of instructions. Since the required information is not available in the type, such decisions have to be made at run-time. This slows performance.

You can do better by declaring the type of a. Here, we are focused on the case where a might be any one of several types, in which case the natural solution is to use parameters. For example:

```julia
julia> mutable struct MyType{T<:AbstractFloat}
           a::T
       end
```

This is a better choice than

```julia
julia> mutable struct MyStillAmbiguousType
           a::AbstractFloat
       end
```

because the first version specifies the type of a from the type of the wrapper object. For example:

```julia
julia> m = MyType(3.2)
MyType{Float64}(3.2)

julia> t = MyStillAmbiguousType(3.2)
MyStillAmbiguousType(3.2)

julia> typeof(m)
MyType{Float64}

julia> typeof(t)
MyStillAmbiguousType
```

The type of field a can be readily determined from the type of m, but not from the type of t. Indeed, in t it's possible to change the type of the field a:

```julia
julia> typeof(t.a)
Float64

julia> t.a = 4.5f0
4.5f0

julia> typeof(t.a)
Float32
```

In contrast, once m is constructed, the type of m.a cannot change:

```julia
julia> m.a = 4.5f0
4.5f0

julia> typeof(m.a)
Float64
```

The fact that the type of m.a is known from m's type, together with the fact that its type cannot change mid-function, allows the compiler to generate highly-optimized code for objects like m but not for objects like t. Of course, all of this is true only if we construct m with a concrete type. We can break this by explicitly constructing it with an abstract type:

```julia
julia> m = MyType{AbstractFloat}(3.2)
MyType{AbstractFloat}(3.2)

julia> typeof(m.a)
Float64

julia> m.a = 4.5f0
4.5f0

julia> typeof(m.a)
Float32
```

For all practical purposes, such objects behave identically to those of MyStillAmbiguousType. It's quite instructive to compare the sheer amount of code generated for a simple function

```julia
func(m::MyType) = m.a+1
```

using

```julia
code_llvm(func, Tuple{MyType{Float64}})
code_llvm(func, Tuple{MyType{AbstractFloat}})
```

For reasons of length the results are not shown here, but you may wish to try this yourself. Because the type is fully specified in the first case, the compiler doesn't need to generate any code to resolve the type at run-time, which results in shorter and faster code.

### Avoid fields with abstract containers

The same best practices also work for container types:

```julia
julia> struct MySimpleContainer{A<:AbstractVector}
           a::A
       end

julia> struct MyAmbiguousContainer{T}
           a::AbstractVector{T}
       end
```

For example:

```julia
julia> c = MySimpleContainer(1:3);

julia> typeof(c)
MySimpleContainer{UnitRange{Int64}}

julia> c = MySimpleContainer([1:3;]);

julia> typeof(c)
MySimpleContainer{Array{Int64,1}}

julia> b = MyAmbiguousContainer(1:3);

julia> typeof(b)
MyAmbiguousContainer{Int64}

julia> b = MyAmbiguousContainer([1:3;]);

julia> typeof(b)
MyAmbiguousContainer{Int64}
```

For MySimpleContainer, the object is fully specified by its type and parameters, so the compiler can generate optimized functions. In most instances, this will probably suffice.

While the compiler can now do its job perfectly well, there are cases where you might wish that your code could do different things depending on the element type of a. Usually the best way to achieve this is to wrap your specific operation (here, foo) in a separate function:

```julia
julia> function sumfoo(c::MySimpleContainer)
           s = 0
           for x in c.a
               s += foo(x)
           end
           s
       end
sumfoo (generic function with 1 method)

julia> foo(x::Integer) = x
foo (generic function with 1 method)

julia> foo(x::AbstractFloat) = round(x)
foo (generic function with 2 methods)
```

This keeps things simple, while allowing the compiler to generate optimized code in all cases. However, there are cases where you may need to declare different versions of the outer function for different element types or types of the AbstractVector of the field a in MySimpleContainer. You could do it like this:

```julia
julia> function myfunc(c::MySimpleContainer{<:AbstractArray{<:Integer}})
           return c.a[1]+1
       end
myfunc (generic function with 1 method)

julia> function myfunc(c::MySimpleContainer{<:AbstractArray{<:AbstractFloat}})
           return c.a[1]+2
       end
myfunc (generic function with 2 methods)

julia> function myfunc(c::MySimpleContainer{Vector{T}}) where T <: Integer
           return c.a[1]+3
       end
myfunc (generic function with 3 methods)
```

```julia
julia> myfunc(MySimpleContainer(1:3))
2

julia> myfunc(MySimpleContainer(1.0:3))
3.0

julia> myfunc(MySimpleContainer([1:3;]))
4
```

### Annotate values taken from untyped locations

It is often convenient to work with data structures that may contain values of any type (arrays of type Array{Any}). But, if you're using one of these structures and happen to know the type of an element, it helps to share this knowledge with the compiler:

```julia
function foo(a::Array{Any,1})
    x = a[1]::Int32
    b = x+1
    ...
end
```

Here, we happened to know that the first element of a would be an Int32. Making an annotation like this has the added benefit that it will raise a run-time error if the value is not of the expected type, potentially catching certain bugs earlier. In the case that the type of a[1] is not known precisely, x can instead be declared via x = convert(Int32, a[1])::Int32, which also accepts anything convertible to an Int32.

However, such annotations can even hurt performance if the type is constructed at run-time. In the code below:

```julia
function nr(a, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(a)
    c = (b + 1.0f0)::Complex{ctype}
    abs(c)
end
```

Here, the annotation of c harms performance. To write performant code involving types constructed at run-time, use the function-barrier technique discussed below, and ensure that the constructed type appears among the argument types of the kernel function so that the kernel operations are properly specialized by the compiler. For example, in the above snippet, as soon as b is constructed, it can be passed to another function k, the kernel. If, for example, function k declares b as an argument of type Complex{T}, where T is a type parameter, then a type annotation appearing in an assignment statement within k of the form:

```julia
c = (b + 1.0f0)::Complex{T}
```

does not hinder performance (but does not help either), since the compiler can determine the type of c at the time k is compiled.
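A hedged sketch of this barrier pattern, using the hypothetical kernel name k from the text (nr_barrier is likewise a made-up name):

```julia
# The kernel is compiled once per concrete Complex{T}, so the
# annotated expression inside it costs nothing at run time.
k(b::Complex{T}) where {T} = abs((b + 1.0f0)::Complex{T})

function nr_barrier(a, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(a)
    return k(b)   # dispatch here specializes k on typeof(b)
end
```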

### Be aware of when Julia avoids specializing

As a heuristic, Julia avoids automatically specializing on argument type parameters in three specific cases: Type, Function, and Vararg. Julia will always specialize when the argument is used within the method, but not if the argument is just passed through to another function. This usually has no performance impact at runtime and improves compiler performance. If you find it does have a performance impact at runtime in your case, you can trigger specialization by adding a type parameter to the method declaration. Here are some examples:

This will not specialize:

```julia
function f_type(t)  # or t::Type
    x = ones(t, 10)
    return sum(map(sin, x))
end
```

but this will:

```julia
function g_type(t::Type{T}) where T
    x = ones(T, 10)
    return sum(map(sin, x))
end
```

These will not specialize:

```julia
f_func(f, num) = ntuple(f, div(num, 2))
g_func(g::Function, num) = ntuple(g, div(num, 2))
```

but this will:

```julia
h_func(h::H, num) where {H} = ntuple(h, div(num, 2))
```

This will not specialize:

```julia
f_vararg(x::Int...) = tuple(x...)
```

but this will:

```julia
g_vararg(x::Vararg{Int, N}) where {N} = tuple(x...)
```

One only needs to introduce a single type parameter to force specialization, even if the other types are unconstrained. For example, this will also specialize, and is useful when the arguments are not all of the same type:

```julia
h_vararg(x::Vararg{Any, N}) where {N} = tuple(x...)
```

Note that @code_typed and friends will always show you specialized code, even if Julia would not normally specialize that method call. You need to check the method internals if you want to see whether specializations are generated when argument types are changed, i.e., if (@which f(...)).specializations contains specializations for the argument in question.
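For example, a hedged sketch of such a check (the specializations field is a compiler internal, and its contents vary across Julia versions):

```julia
# Sketch: inspect whether calling f_type added a specialization.
m = @which f_type(Int)   # the Method handling this call
m.specializations        # internal list of generated specializations
```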

## Break functions into multiple definitions

Writing a function as many small definitions allows the compiler to directly call the most applicable code, or even inline it. Here is an example of a "compound function" that should really be written as multiple definitions:

```julia
using LinearAlgebra

function mynorm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return maximum(svdvals(A))
    else
        error("mynorm: invalid argument")
    end
end
```

This can be written more concisely and efficiently as:

```julia
norm(x::Vector) = sqrt(real(dot(x, x)))
norm(A::Matrix) = maximum(svdvals(A))
```

It should however be noted that the compiler is quite efficient at optimizing away the dead branches in code written like the mynorm example.

## Write "type-stable" functions

When possible, it helps to ensure that a function always returns a value of the same type. Consider the following definition:

```julia
pos(x) = x < 0 ? 0 : x
```

Although this seems innocent enough, the problem is that 0 is an integer (of type Int) and x might be of any type. Thus, depending on the value of x, this function might return a value of either of two types. This behavior is allowed, and may be desirable in some cases. But it can easily be fixed using the zero function:

```julia
pos(x) = x < 0 ? zero(x) : x
```

There is also a oneunit function, and a more general oftype(x, y) function, which returns y converted to the type of x.

## Avoid changing the type of a variable

An analogous "type-stability" problem exists for variables used repeatedly within a function:

```julia
function foo()
    x = 1
    for i = 1:10
        x /= rand()
    end
    return x
end
```

Local variable x starts as an integer, and after one loop iteration becomes a floating-point number (the result of the / operator). This makes it more difficult for the compiler to optimize the body of the loop. There are several possible fixes:

• Initialize x with x = 1.0 (sketched after this list)
• Declare the type of x explicitly as x::Float64 = 1
• Use an explicit conversion by x = oneunit(Float64)
• Initialize with the first loop iteration, to x = 1 / rand(), then loop for i = 2:10
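A minimal sketch of the first fix (foo_stable is a hypothetical name); x now has type Float64 throughout:

```julia
function foo_stable()
    x = 1.0              # Float64 from the start
    for i = 1:10
        x /= rand()      # stays Float64 on every iteration
    end
    return x
end
```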

## Separate kernel functions (aka, function barriers)

Many functions follow a pattern of performing some set-up work, and then running many iterations to perform a core computation. Where possible, it is a good idea to put these core computations in separate functions. For example, the following contrived function returns an array of a randomly-chosen type:

```julia
julia> function strange_twos(n)
           a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
           for i = 1:n
               a[i] = 2
           end
           return a
       end;

julia> strange_twos(3)
3-element Array{Float64,1}:
 2.0
 2.0
 2.0
```

This should be written as:

```julia
julia> function fill_twos!(a)
           for i = eachindex(a)
               a[i] = 2
           end
       end;

julia> function strange_twos(n)
           a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
           fill_twos!(a)
           return a
       end;

julia> strange_twos(3)
3-element Array{Float64,1}:
 2.0
 2.0
 2.0
```

Julia's compiler specializes code for argument types at function boundaries, so in the original implementation it does not know the type of a during the loop (since it is chosen randomly). Therefore the second version is generally faster, since the inner loop can be recompiled as part of fill_twos! for different types of a. The second form is also often better style and can lead to more code reuse.

## Types with values-as-parameters

Let's say you want to create an N-dimensional array that has size 3 along each axis. Such arrays can be created like this:

```julia
julia> A = fill(5.0, (3, 3))
3×3 Array{Float64,2}:
 5.0  5.0  5.0
 5.0  5.0  5.0
 5.0  5.0  5.0
```

This approach works very well: the compiler can figure out that A is an Array{Float64,2} because it knows the type of the fill value (5.0::Float64) and the dimensionality ((3, 3)::NTuple{2,Int}). But now say you want to write a function that creates a 3×3×... array in arbitrary dimensions; you might be tempted to write:

```julia
julia> function array3(fillval, N)
           fill(fillval, ntuple(d->3, N))
       end
array3 (generic function with 1 method)

julia> array3(5.0, 2)
3×3 Array{Float64,2}:
 5.0  5.0  5.0
 5.0  5.0  5.0
 5.0  5.0  5.0
```

This works, but (as you can verify using @code_warntype array3(5.0, 2)) the problem is that the output type cannot be inferred: the argument N is a value of type Int, and type-inference does not (and cannot) predict its value in advance. This means that code using the output of this function has to be conservative, checking the type on each access of A; such code will be very slow.

Now, one very good way to solve such problems is by using the function-barrier technique. However, in some cases you might want to eliminate the type-instability altogether. In such cases, one approach is to pass the dimensionality as a parameter, for example through Val{T}() (see "Value types"):

```julia
julia> function array3(fillval, ::Val{N}) where N
           fill(fillval, ntuple(d->3, Val(N)))
       end
array3 (generic function with 1 method)

julia> array3(5.0, Val(2))
3×3 Array{Float64,2}:
 5.0  5.0  5.0
 5.0  5.0  5.0
 5.0  5.0  5.0
```

Julia has a specialized version of ntuple that accepts a Val{::Int} instance as the second parameter; by passing N as a type-parameter, you make its "value" known to the compiler. Consequently, this version of array3 allows the compiler to predict the return type.

However, making use of such techniques can be surprisingly subtle. For example, it would be of no help if you called array3 from a function like this:

```julia
function call_array3(fillval, n)
    A = array3(fillval, Val(n))
end
```

Here, you've created the same problem all over again: the compiler can't guess what n is, so it doesn't know the type of Val(n). Attempting to use Val, but doing so incorrectly, can easily make performance worse in many situations. (Only in situations where you're effectively combining Val with the function-barrier trick, to make the kernel function more efficient, should code like the above be used.)

An example of correct usage of Val would be:

```julia
function filter3(A::AbstractArray{T,N}) where {T,N}
    kernel = array3(1, Val(N))
    filter(A, kernel)
end
```

In this example, N is passed as a parameter, so its "value" is known to the compiler. Essentially, Val(T) works only when T is either hard-coded/literal (Val(3)) or already specified in the type-domain.

## The dangers of abusing multiple dispatch (aka, more on types with values-as-parameters)

Once one learns to appreciate multiple dispatch, there's an understandable tendency to go overboard and try to use it for everything. For example, you might imagine using it to store information, e.g.

```julia
struct Car{Make, Model}
    year::Int
    ...more fields...
end
```

and then dispatch on objects like Car{:Honda,:Accord}(year, args...).

This might be worthwhile when either of the following is true:

• You require CPU-intensive processing on each Car, and it becomes vastly more efficient if you know the Make and Model at compile time and the total number of different Make or Model values that will be used is not too large.
• You have homogeneous lists of the same type of Car to process, so that you can store them all in an Array{Car{:Honda,:Accord},N}.

When the latter holds, a function processing such a homogeneous array can be productively specialized: Julia knows the type of each element in advance (all objects in the container have the same concrete type), so Julia can "look up" the correct method calls when the function is being compiled (obviating the need to check at run-time) and thereby emit efficient code for processing the whole list.

When these do not hold, then it's likely that you'll get no benefit; worse, the resulting "combinatorial explosion of types" will be counterproductive. If items[i+1] has a different type than items[i], Julia has to look up the type at run-time, search for the appropriate method in method tables, decide (via type intersection) which one matches, determine whether it has been JIT-compiled yet (and do so if not), and then make the call. In essence, you're asking the full type-system and JIT-compilation machinery to basically execute the equivalent of a switch statement or dictionary lookup in your own code.

Some run-time benchmarks comparing (1) type dispatch, (2) dictionary lookup, and (3) a "switch" statement can be found on the mailing list.

Perhaps even worse than the run-time impact is the compile-time impact: Julia will compile specialized functions for each different Car{Make, Model}; if you have hundreds or thousands of such types, then every function that accepts such an object as a parameter (from a custom get_year function you might write yourself, to the generic push! function in Julia Base) will have hundreds or thousands of variants compiled for it. Each of these increases the size of the cache of compiled code, the length of internal lists of methods, etc. Excess enthusiasm for values-as-parameters can easily waste enormous resources.

## Access arrays in memory order, along columns

Multidimensional arrays in Julia are stored in column-major order. This means that arrays are stacked one column at a time. This can be verified using the vec function or the syntax [:] as shown below (notice that the array is ordered [1 3 2 4], not [1 2 3 4]):

```julia
julia> x = [1 2; 3 4]
2×2 Array{Int64,2}:
 1  2
 3  4

julia> x[:]
4-element Array{Int64,1}:
 1
 3
 2
 4
```

This convention for ordering arrays is common in many languages like Fortran, Matlab, and R (to name a few). The alternative to column-major ordering is row-major ordering, which is the convention adopted by C and Python (numpy) among other languages. Remembering the ordering of arrays can have significant performance effects when looping over arrays. A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly. Essentially this means that looping will be faster if the inner-most loop index is the first to appear in a slice expression. Keep in mind that indexing an array with : is an implicit loop that iteratively accesses all elements within a particular dimension; it can be faster to extract columns than rows, for example.

Consider the following contrived example. Imagine we wanted to write a function that accepts a Vector and returns a square Matrix with either the rows or the columns filled with copies of the input vector. Assume that it is not important whether rows or columns are filled with these copies (perhaps the rest of the code can be easily adapted accordingly). We could conceivably do this in at least four ways (in addition to the recommended call to the built-in repeat):

```julia
function copy_cols(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[:, i] = x
    end
    return out
end

function copy_rows(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[i, :] = x
    end
    return out
end

function copy_col_row(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for col = inds, row = inds
        out[row, col] = x[row]
    end
    return out
end

function copy_row_col(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for row = inds, col = inds
        out[row, col] = x[col]
    end
    return out
end
```

Now we will time each of these functions using the same random input vector:

```julia
julia> x = randn(10000);

julia> fmt(f) = println(rpad(string(f)*": ", 14, ' '), @elapsed f(x))

julia> map(fmt, [copy_cols, copy_rows, copy_col_row, copy_row_col]);
copy_cols:    0.331706323
copy_rows:    1.799009911
copy_col_row: 0.415630047
copy_row_col: 1.721531501
```

Notice that copy_cols is much faster than copy_rows. This is expected because copy_cols respects the column-based memory layout of the Matrix and fills it one column at a time. Additionally, copy_col_row is much faster than copy_row_col because it follows our rule of thumb that the first element to appear in a slice expression should be coupled with the inner-most loop.

## Pre-allocating outputs

If your function returns an Array or some other complex type, it may have to allocate memory. Unfortunately, oftentimes allocation and its converse, garbage collection, are substantial bottlenecks. Sometimes you can circumvent the need to allocate memory on each function call by preallocating the output. As a trivial example, compare

```julia
julia> function xinc(x)
           return [x, x+1, x+2]
       end;

julia> function loopinc()
           y = 0
           for i = 1:10^7
               ret = xinc(i)
               y += ret[2]
           end
           return y
       end;
```

with

```julia
julia> function xinc!(ret::AbstractVector{T}, x::T) where T
           ret[1] = x
           ret[2] = x+1
           ret[3] = x+2
           nothing
       end;

julia> function loopinc_prealloc()
           ret = Vector{Int}(undef, 3)
           y = 0
           for i = 1:10^7
               xinc!(ret, i)
               y += ret[2]
           end
           return y
       end;
```

```julia
julia> @time loopinc()
  0.529894 seconds (40.00 M allocations: 1.490 GiB, 12.14% gc time)
50000015000000

julia> @time loopinc_prealloc()
  0.030850 seconds (6 allocations: 288 bytes)
50000015000000
```

Preallocation has other advantages, for example by allowing the caller to control the "output" type from an algorithm. In the example above, we could have passed a SubArray rather than an Array, had we so desired.

## More dots: Fuse vectorized operations

Julia has a special dot syntax that converts any scalar function into a "vectorized" function call, and any operator into a "vectorized" operator, with the special property that nested "dot calls" are fusing: they are combined at the syntax level into a single loop, without allocating temporary arrays. If you use .= and similar assignment operators, the result can also be stored in-place in a pre-allocated array (see above; a short sketch of this follows the timings below).

For example, consider these two ways of computing the same polynomial:

```julia
julia> f(x) = 3x.^2 + 4x + 7x.^3;

julia> fdot(x) = @. 3x^2 + 4x + 7x^3; # equivalent to 3 .* x.^2 .+ 4 .* x .+ 7 .* x.^3
```

Both f and fdot compute the same thing. However, fdot (defined with the help of the @. macro) is significantly faster when applied to an array:

```julia
julia> x = rand(10^6);

julia> @time f(x);
  0.019049 seconds (16 allocations: 45.777 MiB, 18.59% gc time)

julia> @time fdot(x);
  0.002790 seconds (6 allocations: 7.630 MiB)

julia> @time f.(x);
  0.002626 seconds (8 allocations: 7.630 MiB)
```

That is, fdot(x) is ten times faster and allocates 1/6 the memory of f(x), because each * and + operation in f(x) allocates a new temporary array and executes in a separate loop. (Of course, if you just do f.(x) then it is as fast as fdot(x) in this example, but in many contexts it is more convenient to just sprinkle some dots in your expressions rather than defining a separate function for each vectorized operation.)
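As the sketch promised above, with .= the fused loop can write into a pre-allocated output array instead of allocating a fresh result on every call (out is a buffer introduced here for illustration):

```julia
x = rand(10^6)
out = similar(x)           # allocate once, reuse across calls
@. out = 3x^2 + 4x + 7x^3  # fused loop writes in-place, no temporaries
```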

## Consider using views for slices

In Julia, an array "slice" expression like array[1:5, :] creates a copy of that data (except on the left-hand side of an assignment, where array[1:5, :] = ... assigns in-place to that portion of array). If you are doing many operations on the slice, this can be good for performance because it is more efficient to work with a smaller contiguous copy than it would be to index into the original array. On the other hand, if you are just doing a few simple operations on the slice, the cost of the allocation and copy operations can be substantial.

An alternative is to create a "view" of the array, which is an array object (a SubArray) that actually references the data of the original array in-place, without making a copy. (If you write to a view, it modifies the original array's data as well.) This can be done for individual slices by calling view, or more simply for a whole expression or block of code by putting @views in front of that expression. For example:

```julia
julia> fcopy(x) = sum(x[2:end-1]);

julia> @views fview(x) = sum(x[2:end-1]);

julia> x = rand(10^6);

julia> @time fcopy(x);
  0.003051 seconds (7 allocations: 7.630 MB)

julia> @time fview(x);
  0.001020 seconds (6 allocations: 224 bytes)
```

Note both the roughly 3× speedup and the decreased memory allocation of the fview version of the function.

## Copying data is not always bad

Arrays are stored contiguously in memory, lending themselves to CPU vectorization and fewer memory accesses due to caching. These are the same reasons that it is recommended to access arrays in column-major order (see above). Irregular access patterns and non-contiguous views can drastically slow down computations on arrays because of non-sequential memory access.

Copying irregularly-accessed data into a contiguous array before operating on it can result in a large speedup, such as in the example below. Here, a matrix and a vector are being accessed at 800,000 of their randomly-shuffled indices before being multiplied. Copying the views into plain arrays speeds up the multiplication even with the cost of the copying operation:

```julia
julia> using Random

julia> x = randn(1_000_000);

julia> inds = shuffle(1:1_000_000)[1:800000];

julia> A = randn(50, 1_000_000);

julia> xtmp = zeros(800_000);

julia> Atmp = zeros(50, 800_000);

julia> @time sum(view(A, :, inds) * view(x, inds))
  0.412156 seconds (14 allocations: 960 bytes)
-4256.759568345458

julia> @time begin
           copyto!(xtmp, view(x, inds))
           copyto!(Atmp, view(A, :, inds))
           sum(Atmp * xtmp)
       end
  0.285923 seconds (14 allocations: 960 bytes)
-4256.759568345134
```

Provided there is enough memory for the copies, the cost of copying the view to an array is far outweighed by the speed boost from doing the matrix multiplication on a contiguous array.

## Avoid string interpolation for I/O

When writing data to a file (or other I/O device), forming extra intermediate strings is a source of overhead. Instead of:

```julia
println(file, "$a $b")
```

use:

```julia
println(file, a, " ", b)
```

The first version of the code forms a string, then writes it to the file, while the second version writes values directly to the file. Also notice that in some cases string interpolation can be harder to read. Consider:

```julia
println(file, "$(f(a))$(f(b))")
```

versus:

```julia
println(file, f(a), f(b))
```

## Optimize network I/O during parallel execution

When executing a remote function in parallel:

```julia
using Distributed

responses = Vector{Any}(undef, nworkers())
@sync begin
    for (idx, pid) in enumerate(workers())
        @async responses[idx] = remotecall_fetch(foo, pid, args...)
    end
end
```

is faster than:

```julia
using Distributed

refs = Vector{Any}(undef, nworkers())
for (idx, pid) in enumerate(workers())
    refs[idx] = @spawnat pid foo(args...)
end
responses = [fetch(r) for r in refs]
```

The former results in a single network round-trip to every worker, while the latter results in two network calls: first by the @spawnat and the second due to the fetch (or even a wait). The fetch/wait is also executed serially, resulting in overall poorer performance.

## Performance annotations

Sometimes you can enable better optimization by promising certain program properties.

• Use @inbounds to eliminate array bounds checking within expressions. Be certain before doing this: if the subscripts are ever out of bounds, you may suffer crashes or silent corruption.
• Use @fastmath to allow floating-point optimizations that are correct for real numbers, but lead to differences for IEEE numbers. Be careful when doing this, as it may change numerical results. This corresponds to the -ffast-math option of clang.
• Write @simd in front of for loops to promise that the iterations are independent and may be reordered. Note that in many cases, Julia can automatically vectorize code without the @simd macro; it is only beneficial in cases where such a transformation would otherwise be illegal, including cases like allowing floating-point re-associativity and ignoring dependent memory accesses (@simd ivdep). Again, be very careful when asserting @simd, as erroneously annotating a loop with dependent iterations may result in unexpected results. In particular, note that setindex! on some AbstractArray subtypes is inherently dependent upon iteration order. This feature is experimental and could change or disappear in future versions of Julia.

Note

The common idiom of using 1:n to index into an AbstractArray is not safe if the array uses unconventional indexing, and may cause a segmentation fault if bounds checking is turned off. Use LinearIndices(x) or eachindex(x) instead (see also Arrays with custom indices).

Here is an example with both @inbounds and @simd markers (we use @noinline to prevent the optimizer from being too clever and defeating our benchmark):

```julia
@noinline function inner(x, y)
    s = zero(eltype(x))
    for i = eachindex(x)
        @inbounds s += x[i]*y[i]
    end
    return s
end

@noinline function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = eachindex(x)
        @inbounds s += x[i] * y[i]
    end
    return s
end

function timeit(n, reps)
    x = rand(Float32, n)
    y = rand(Float32, n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s += inner(x, y)
    end
    println("GFlop/sec        = ", 2n*reps / time*1E-9)
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop/sec (SIMD) = ", 2n*reps / time*1E-9)
end

timeit(1000, 1000)
```

On one machine, this produced:

```
GFlop/sec        = 1.9467069505224963
GFlop/sec (SIMD) = 17.578554163920018
```

(GFlop/sec measures the performance, and larger numbers are better.)

Here is an example with all three kinds of markers. This program first calculates the finite difference of a one-dimensional array, and then evaluates the L2-norm of the result:

```julia
function init!(u::Vector)
    n = length(u)
    dx = 1.0 / (n-1)
    @fastmath @inbounds @simd for i in 1:n # by asserting that u is a Vector we can assume it has 1-based indexing
        u[i] = sin(2pi*dx*i)
    end
end

function deriv!(u::Vector, du)
    n = length(u)
    dx = 1.0 / (n-1)
    @fastmath @inbounds du[1] = (u[2] - u[1]) / dx
    @fastmath @inbounds @simd for i in 2:n-1
        du[i] = (u[i+1] - u[i-1]) / (2*dx)
    end
    @fastmath @inbounds du[n] = (u[n] - u[n-1]) / dx
end

function mynorm(u::Vector)
    n = length(u)
    T = eltype(u)
    s = zero(T)
    @fastmath @inbounds @simd for i in 1:n
        s += u[i]^2
    end
    @fastmath @inbounds return sqrt(s)
end

function main()
    n = 2000
    u = Vector{Float64}(undef, n)
    init!(u)
    du = similar(u)

    deriv!(u, du)
    nu = mynorm(du)

    @time for i in 1:10^6
        deriv!(u, du)
        nu = mynorm(du)
    end

    println(nu)
end

main()
```

```
$ julia wave.jl
  1.207814709 seconds
4.443986180758249

$ julia --math-mode=ieee wave.jl
  4.487083643 seconds
4.443986180758249
```

Here, the option --math-mode=ieee disables the @fastmath macro, so that we can compare results.

Note that @fastmath also assumes that NaNs will not occur during the computation, which can lead to surprising behavior:

```julia
julia> f(x) = isnan(x);

julia> f(NaN)
true

julia> f_fast(x) = @fastmath isnan(x);

julia> f_fast(NaN)
false
```

## Treat Subnormal Numbers as Zeros

Subnormal numbers, formerly called denormal numbers, are useful in many contexts, but incur a performance penalty on some hardware. A call set_zero_subnormals(true) grants permission for floating-point operations to treat subnormal inputs or outputs as zeros, which may improve performance on some hardware. A call set_zero_subnormals(false) enforces strict IEEE behavior for subnormal numbers.

Below is an example where subnormals noticeably impact performance on some hardware:

```julia
function timestep(b::Vector{T}, a::Vector{T}, Δt::T) where T
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i=2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow(a::Vector{T}, nstep::Integer) where T
    b = similar(a)
    for t=1:div(nstep,2)                # Assume nstep is even
        timestep(b,a,T(0.1))
        timestep(a,b,T(0.1))
    end
end

heatflow(zeros(Float32,10),2)           # Force compilation
for trial=1:6
    a = zeros(Float32,1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)
end
```

This gives an output similar to

```
  0.002202 seconds (1 allocation: 4.063 KiB)
  0.001502 seconds (1 allocation: 4.063 KiB)
  0.002139 seconds (1 allocation: 4.063 KiB)
  0.001454 seconds (1 allocation: 4.063 KiB)
  0.002115 seconds (1 allocation: 4.063 KiB)
  0.001455 seconds (1 allocation: 4.063 KiB)
```

Note how each even iteration is significantly faster.

This example generates many subnormal numbers because the values in a become an exponentially decreasing curve, which slowly flattens out over time.

Treating subnormals as zeros should be used with caution, because doing so breaks some identities, such as x-y == 0 implies x == y:

```julia
julia> x = 3f-38; y = 2f-38;

julia> set_zero_subnormals(true); (x - y, x == y)
(0.0f0, false)

julia> set_zero_subnormals(false); (x - y, x == y)
(1.0000001f-38, false)
```

In some applications, an alternative to zeroing subnormal numbers is to inject a tiny bit of noise. For example, instead of initializing a with zeros, initialize it with:

```julia
a = rand(Float32,1000) * 1.f-9
```

## @code_warntype

The macro @code_warntype (or its function variant code_warntype) can sometimes be helpful in diagnosing type-related problems. Here's an example:

```julia
julia> @noinline pos(x) = x < 0 ? 0 : x;

julia> function f(x)
           y = pos(x)
           return sin(y*x + 1)
       end;

julia> @code_warntype f(3.2)
Variables
  #self#::Core.Compiler.Const(f, false)
  x::Float64
  y::UNION{FLOAT64, INT64}

Body::Float64
1 ─      (y = Main.pos(x))
│   %2 = (y * x)::Float64
│   %3 = (%2 + 1)::Float64
│   %4 = Main.sin(%3)::Float64
└──      return %4
```

Interpreting the output of @code_warntype, like that of its cousins @code_lowered, @code_typed, @code_llvm, and @code_native, takes a little practice. Your code is being presented in a form that has been heavily digested on its way to generating compiled machine code. Most of the expressions are annotated by a type, indicated by ::T (where T might be Float64, for example). The most important characteristic of @code_warntype is that non-concrete types are displayed in red; since this document is written in Markdown, which has no color, red text is denoted here by uppercase.

The following examples may help you interpret expressions marked as containing non-concrete types:

• Function body starting with Body::UNION{T1,T2}
  • Interpretation: function with unstable return type
  • Suggestion: make the return value type-stable, even if you have to annotate it
• invoke Main.g(%%x::Int64)::UNION{FLOAT64, INT64}
  • Interpretation: call to a type-unstable function g
  • Suggestion: fix the function, or if necessary annotate its return value
• invoke Base.getindex(%%x::Array{Any,1}, 1::Int64)::ANY
  • Interpretation: accessing elements of poorly-typed arrays
  • Suggestion: use arrays with better-defined types, or if necessary annotate the type of individual element accesses
• Base.getfield(%%x, :(:data))::ARRAY{FLOAT64,N} WHERE N
  • Interpretation: getting a field that is of non-leaf type. In this case, ArrayContainer had a field data::Array{T}. But Array needs the dimension N, too, to be a concrete type.
  • Suggestion: use concrete types like Array{T,3} or Array{T,N}, where N is now a parameter of ArrayContainer

## Performance of captured variables

Consider the following example, which defines and returns an inner function:

```julia
function abmult(r::Int)
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end
```

Function abmult returns a function f that multiplies its argument by the absolute value of r. The inner function assigned to f is called a "closure". Because the captured variable r is reassigned, the compiler must treat it conservatively (it is stored in a heap-allocated "box"), and accesses to r from within the closure are slow. One workaround is to copy r into a new, type-annotated local variable:

```julia
function abmult2(r0::Int)
    r::Int = r0
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end
```

An even better solution is to use a let block to create a new, never-reassigned binding:

```julia
function abmult3(r::Int)
    if r < 0
        r = -r
    end
    f = let r = r
        x -> x * r
    end
    return f
end
```

The let block creates a new variable r whose scope is only the inner function, so the closure captures a binding that is never reassigned; this second technique fully recovers language performance in the presence of captured variables. Note that this is a rapidly-evolving aspect of the compiler, and it is likely that future releases will not require this degree of programmer annotation to attain performance. In the meantime, some user-contributed packages like FastClosures automate the insertion of let statements as in abmult3.
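For instance, a hedged sketch using the @closure macro from FastClosures (assuming the package is installed; abmult4 is a made-up name):

```julia
using FastClosures

function abmult4(r::Int)
    if r < 0
        r = -r
    end
    # @closure rewrites the anonymous function roughly as the
    # explicit `let r = r` block in abmult3 does.
    f = @closure x -> x * r
    return f
end
```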

## Checking for equality with a singleton

When checking whether a value is equal to some singleton, it can be better for performance to check for identity (===) instead of equality (==). The same advice applies to using !== over !=. These types of checks frequently occur, e.g., when implementing the iteration protocol and checking whether iterate has returned nothing.
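A minimal sketch (count_items is a hypothetical name): iterate returns the nothing singleton when the sequence is exhausted, and === compares against that singleton by identity:

```julia
function count_items(itr)
    n = 0
    y = iterate(itr)
    while y !== nothing          # identity check against the singleton
        n += 1
        y = iterate(itr, y[2])   # y[2] is the iteration state
    end
    return n
end
```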