実装して学ぶデータ処理システムその2 - union-find によるスポット検出

作図に手間取っているので、予定を変更して、スポット検出アルゴリズムについて説明する。元ネタは Kabsch, Wolfgang. "Integration, scaling, space-group assignment and post-refinement" Acta Crystallographica Section D (2010): 133-144 の 2.5 節だが、詳細不明なため、XDS の実装をそのまま説明することは不可能である。そこで、今後のテストに十分な程度に簡略化したアルゴリズムを実装する。いずれ、MOSFLM や Cheetah や CrystFEL のアルゴリズムについても説明したい。

XDS の場合、スポット検出は strong pixel の検出と spot の検出の二段階で行われる。局所的に設定される閾値を超えたピクセルが strong pixel となり、それが MINIMUM_NUMBER_OF_PIXELS_IN_A_SPOT 以上固まっているものが spot となる。さらに、スポットの重心座標と、スポット内での最も明るいピクセルのズレが計算され、これが SPOT_MAXIMUM_CENTROID 未満の spot だけが最終的に受理される。

スポットに含まれるピクセルを列挙する際、XDS は 4 近傍(スポットの上下左右のこと。斜め隣を含まない)を調べる。また、連続した前後のフレームの同じピクセルも調べる。したがって、あるピクセルと直接隣接しうるピクセルは 6 ピクセルである。隣接領域を調べるには深さ優先探索をすればよいが、XDS では Rem のアルゴリズムを用いているらしい。ざっと眺めた限りでは、union-find 木の親戚のようだ。

以下では、1つのフレーム内で、union-find を使ったスポット探索を実装してみた。MAXIMUM_CENTROID 基準は未実装である。R では引数が値渡しであるため、union-find 木の効率的な実装が難しい。ここではグローバル変数を <<- で書き換えるという汚い(?)手段を使っている。

閾値は、

each pixel value is compared with the mean value and standard deviation of surrounding pixels in the same image and classified as a strong pixel if its value exceeds the mean by a given multiple (typically 3–5) of the standard deviation.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2815666/

という記述に従って MEAN + SD * STRONG_PIXEL (デフォルトでは 3) とした。実際の XDS では、2 * NBX + 1, 2 * NBY + 1 ピクセル四方の近傍でこの処理を行うのだが、R で実装すると重いうえ、うまくいかなかった(threshold が小さくなりすぎる)ので、画像全体で共通の値とした。このテスト画像では、ほとんど背景が flat なので問題ないが、diffuse scatter や ice / lipid ring がある画像ではうまくいかないだろう。COLSPOT では BKGINIT.cbf や GAIN.cbf も使われているはずだが、その詳細も不明である。

3.2 節には、INTEGRATE 時の背景の見積もりの場合は、 strong pixel を取り除いて残った background pixel の強度分布が正規分布になるまで反復するという記述があるが、COLSPOT の閾値決定でも反復しているかどうかは不明である。強い反射を取り除くと、残されたピクセルの平均値も標準偏差を低下するので、threshold は必ず低下する。

# simple spot finding

#  readCBF と scale_img は http://d.hatena.ne.jp/biochem_fan/20150714/1436865404 で実装した
img1 <- readCBF("/mnt/data/DIALS-thaumatin/th_8_2_0001.cbf") # must be full-path
STRONG_PIXEL <- 3

# TRUE が strong pixel を意味するマスク
mask <- matrix(FALSE, nrow=nrow(img1), ncol=ncol(img1))

# mark strong pixels

nstrong <- 0
while (TRUE) { # loop until convergence
  oldn <- nstrong
  bg_pixels <- img1[mask==FALSE]
  threshold <- mean(bg_pixels) + STRONG_PIXEL * sd(bg_pixels)
  mask <- mask | (img1 > threshold)
  nstrong <- sum(mask==TRUE)
  if (oldn == nstrong) break
  cat("threshold =", threshold, "strong =", nstrong, "\n")
}

# Union-find algorithm
#  use global variables because R doesn't have pass-by-reference

get_group <- function(x) {
  if (spot_labels[x] == x) {
    return(x)
  } else {
    grp <- get_group(spot_labels[x])
    spot_labels[x] <<- grp
    return(grp)
  }
}

unite <- function(x, y) {
  spot_labels[get_group(y)] <<- get_group(x)
}

# union-find が使うラベル表
spot_labels <- (1:length(img1))
spot_labels[mask==FALSE] <- -1 

# strong spot の一覧
pos <- which(mask, arr.ind=TRUE)

# 1st pass: unite adjacent pixels
for (i in 1:nrow(pos)) {
  x <- pos[i, 1] # in R, fast scan is the 1st dimension
  y <- pos[i, 2]
  ind <- (y - 1) * NX + x # y and x are both 1-indexed in R

  if (y + 1 <= NY) {
    if (mask[x, y + 1]) {
      unite(ind, ind + NX)
    }
  }
  if (x + 1 <= NX) {
    if (mask[x + 1, y]) {
      unite(ind, ind +1)
    }
  }
}

# 2nd-pass: update group labels and count spot area and centroid
spot_area <- rep(0, length(spot_labels))
centroidx <- rep(0, length(spot_labels))
centroidy <- rep(0, length(spot_labels))

for (i in 1:nrow(pos)) {
  x <- pos[i, 1]
  y <- pos[i, 2]
  ind <- (y - 1) * NX + x
  
  grp <- get_group(ind)
  spot_area[grp] <- spot_area[grp] + 1
  centroidx[grp] <- centroidx[grp] + x
  centroidy[grp] <- centroidy[grp] + y
}
centroidx <- centroidx / spot_area
centroidy <- centroidy / spot_area

# display spot list

image(1:nrow(img1), 1:ncol(img1), scale_img(img1, 0, 10), useRaster=TRUE, asp=1,
      col=grey(seq(0, 1, length.out=256)))

MIN_AREA <- 3 # MINIMUM_NUMBER_OF_PIXELS_IN_A_SPOT
for (i in which(spot_area > MIN_AREA)) {
  cat(centroidx[i], centroidy[i], spot_area[i], "\n")
  points(centroidx[i], centroidy[i], pch=1, col=2)
}

# 見つけたスポットを指数付けしてみる。B は前回、XPARM.XDS からパースした
error <- c()
for (i in which(spot_area > MIN_AREA)) {
  hkl_raw <- t(B) %*% (xy2S(centroidx[i], centroidy[i]) - S0)
  hkl <- round(hkl_raw)
  error <- c(error, max(abs(hkl_raw - hkl)))
}
sum(error > 0.3) / length(error) 
# 約 9 割が正しいスポットだった。なお、本データを XDS で処理すると、
# 94% 程度が正しく指数付けできるスポットである