How to Read Occupancy - 廃棄されたなにか

The CUDA profiler produces a log like the one below, and one of its fields is occupancy.

# CUDA_PROFILE_LOG_VERSION 1.4
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,occupancy
timestamp=[ 2155.302 ] method=[ _Z10fhaar1dwtdiPf ] gputime=[ 7.808 ] cputime=[ 74.730 ] occupancy=[ 1.000 ]

The meaning of occupancy is described as follows.

The 'occupancy' label gives the warp occupancy - percentage of the 
 maximum warp count in the GPU - for a particular method launch. 
 An occupancy of 1.000 means the chip is completely full.

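occupancy: an English-Japanese dictionary gives translations such as "possession", but here it means the warp occupancy.
It indicates what fraction of the maximum possible number of warps on the GPU a given CUDA Kernel is using.
When the chip is completely full, i.e. every available warp slot is in use, the value is 1.0.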

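With Compute Capability 1.3, the occupancy is 1.0 when 1024 threads can run at the same time on a single SM.
Of course, a single Block can use at most 512 threads, so you cannot reach 1.0 unless the Kernel is written so that multiple Blocks can run at the same time.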

To raise the occupancy by running many threads, you need to keep down the number of registers used per thread.
You also need to keep down the amount of Shared Memory used per Block.

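For example, if a Block has 512 threads, each thread uses 16 registers or fewer, and the Block uses 8 KiB of Shared Memory or less, then 2 Blocks can run at the same time (an SM of Compute Capability 1.3 has 16384 registers and 16 KiB of Shared Memory), so an occupancy of 1.0 can be achieved.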

To limit the number of registers a Kernel uses, add the --maxrregcount <N> option to nvcc, where N specifies the maximum number of registers for a single Kernel.

The nvcc help describes it as follows.

--maxrregcount <N> (-maxrregcount)
        Specify the maximum amount of registers that GPU functions can use. Until
        a function-specific limit, a higher value will generally increase the performance
        of individual GPU threads that execute this function. However, because thread
        registers are allocated from a global register pool on each GPU, a higher
        value of this option will also reduce the maximum thread block size, thereby
        reducing the amount of thread parallelism. Hence, a good maxrregcount value
        is the result of a trade-off.
        If this option is not specified, then no maximum is assumed. Otherwise the
        specified value will be rounded to the next multiple of 4 registers until
        the GPU specific maximum of 128 registers.

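Reducing the number of registers lets more threads run, but in exchange the performance of each thread drops. Conversely, increasing the number of registers raises the performance of each thread but reduces the number of threads that can run at the same time, so there is a trade-off.
I think the value that actually gives the application its best performance can only be found by trial and error for each function.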

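Even without actually running the profiler, you can easily check the occupancy by entering values into the Excel sheet CUDA_Occupancy_calculator.xls, which is included in NVIDIA_CUDA_SDK.
It also shows which factor is limiting the occupancy.
On Linux, if NVIDIA_CUDA_SDK is installed in your home directory, the sheet is at the following location.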

/home/<username>/NVIDIA_CUDA_SDK/tools/CUDA_Occupancy_calculator.xls

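With a normal installation, various documents are also placed in the following location, so I think it is worth taking a look at them once.
(There is also HTML documentation generated with doxygen.)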

/usr/local/cuda/doc