Clustering a large, very sparse, binary matrix in R
I have a large, sparse binary matrix (roughly 39,000 x 14,000; rows mostly have a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:
    d <- dist(inputMatrix, method = "binary")
    hc <- hclust(d, method = "complete")
The first step doesn't finish, so I'm not sure how the second step would fare. Are there any approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary").
It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in C++ code which handles the data as 64-bit unsigned integers for speed. The Jaccard distance of two bit vectors x and y is the number of set bits in x ^ y divided by the number of set bits in x | y, where ^ is the xor operator. The Hamming weight calculation is used to count the number of set bits whenever the result of the xor or the or is non-zero.
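As a quick sanity check of that formula, here is a minimal R example (my own illustration, not from the original answer) comparing it against dist():

    x <- c(1, 0, 1, 1, 0)
    y <- c(1, 1, 0, 1, 0)
    # Two positions differ (xor); four positions have a 1 in either vector (or)
    sum(xor(x, y)) / sum(x | y)           # 0.5
    dist(rbind(x, y), method = "binary")  # also 0.5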
I've put the code on GitHub at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary") for a few random datasets.
On a dataset of 39,000 rows and 14,000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
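That size is consistent with a standard "dist" object, which stores one 8-byte double per pair of rows; a quick back-of-the-envelope check (my own arithmetic):

    n <- 39000
    n * (n - 1) / 2 * 8 / 1024^3  # ~5.67 GiB of doubles for the lower triangle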
bDist.cpp
    #include <Rcpp.h>
    using namespace Rcpp;

    //countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
    const uint64_t m1  = 0x5555555555555555; //binary: 0101...
    const uint64_t m2  = 0x3333333333333333; //binary: 00110011..
    const uint64_t m4  = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
    const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...

    int countBits(uint64_t x) {
      x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
      x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
      x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits
      return (x * h01) >> 56;         //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
    }

    // [[Rcpp::export]]
    int countBitsFromRaw(RawVector rv) {
      uint64_t* x = (uint64_t*)RAW(rv);
      return(countBits(*x));
    }

    // [[Rcpp::export]]
    NumericVector bDist(RawMatrix mat) {
      int nr(mat.nrow()), nc(mat.ncol());
      int nw = nr / 8;
      NumericVector res(nc * (nc - 1) / 2);
      // Access the raw data as unsigned 64 bit integers
      uint64_t* data = (uint64_t*)RAW(mat);
      uint64_t a(0);
      // Work through each possible combination of columns (rows in the original integer matrix)
      for (int i = 0; i < nc - 1; i++) {
        for (int j = i + 1; j < nc; j++) {
          uint64_t sx = 0;
          uint64_t so = 0;
          // Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
          for (int k = 0; k < nw; k++) {
            uint64_t o = data[nw * i + k] | data[nw * j + k];
            // If (x | y) == 0 then (x ^ y) will also be 0
            if (o) {
              // Use the Hamming weight method to calculate the number of set bits
              so = so + countBits(o);
              uint64_t x = data[nw * i + k] ^ data[nw * j + k];
              if (x) {
                sx = sx + countBits(x);
              }
            }
          }
          res(a++) = (double)sx / so;
        }
      }
      return (res);
    }
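As a quick spot check of the exported bit-counting helper (my own example; it assumes bDist.cpp has been compiled with sourceCpp() as in the R source below):

    library(Rcpp)
    sourceCpp("bDist.cpp")
    # 64 bits packed into 8 raw bytes, 3 of them set
    countBitsFromRaw(packBits(as.raw(c(1, 1, 0, 1, rep(0, 60)))))  # 3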
R source
library("rcpp") library("plyr") sourcecpp("bdist.cpp") # converts binary integer vector packed raw vector, # padding out @ end make input length multiple of packwidth packrow <- function(row, packwidth = 64l) { packbits(as.raw(c(row, rep(0, (packwidth - length(row)) %% packwidth)))) } as.packedmatrix <- function(x, packwidth = 64l) { usemethod("as.packedmatrix") } # converts binary integer matrix packed raw matrix # padding out @ end make input length multiple of packwidth as.packedmatrix.matrix <- function(x, packwidth = 64l) { stopifnot(packwidth %% 8 == 0, class(x) %in% c("matrix", "matrix")) storage.mode(x) <- "raw" if (ncol(x) %% packwidth != 0) { x <- cbind(x, matrix(0l, nrow = nrow(x), ncol = packwidth - (ncol(x) %% packwidth))) } out <- packbits(t(x)) dim(out) <- c(ncol(x) %/% 8, nrow(x)) class(out) <- "packedmatrix" out } # converts integer matrix as.matrix.packedmatrix <- function(x) { out <- rawtobits(x) dim(out) <- c(nrow(x) * 8l, ncol(x)) storage.mode(out) <- "integer" t(out) } # generates random sparse data testing main function makerandomdata <- function(nobs, nvariables, maxbits, packed = false) { x <- replicate(nobs, { y <- integer(nvariables) y[sample(nvariables, sample(maxbits, 1))] <- 1l if (packed) { packrow(y, 64l) } else { y } }) if (packed) { class(x) <- "packedmatrix" x } else { t(x) } } # reads binary matrix file or character vector # borrows first bit of code read.table readpackedmatrix <- function(file = null, text = null, packwidth = 64l) { if (missing(file) && !missing(text)) { file <- textconnection(text) on.exit(close(file)) } if (is.character(file)) { file <- file(file, "rt") on.exit(close(file)) } if (!inherits(file, "connection")) stop("'file' must character string or connection") if (!isopen(file, "rt")) { open(file, "rt") on.exit(close(file)) } lst <- list() <- 1 while(length(line <- readlines(file, n = 1)) > 0) { lst[[i]] <- packrow(as.integer(strsplit(line, "", fixed = true)[[1]]), packwidth = packwidth) <- + 1 } out <- do.call("cbind", lst) class(out) <- "packedmatrix" out } # wrapper c++ code binarydist <- function(x) { if (class(x) != "packedmatrix") { x <- as.packedmatrix(x) } dst <- bdist(x) attr(dst, "size") <- ncol(x) attr(dst, "diag") <- attr(dst, "upper") <- false attr(dst, "method") <- "binary" attr(dst, "call") <- match.call() class(dst) <- "dist" dst } x <- makerandomdata(2000, 400, maxbits = 5, packed = true) system.time(bd <- binarydist(x))
From the original answer:
Other things to consider would be doing some pre-filtering of comparisons between rows with single ones, since the distance between two such rows will either be 0 for duplicates or 1 for any other possibility.
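A minimal sketch of that pre-filtering idea (my own illustration, not part of the original answer): rows containing a single 1 are fully described by the column holding that 1, so they can be grouped without computing any pairwise distances.

    ones <- rowSums(m)
    single <- which(ones == 1)
    # For a single-1 row, the column index of its 1 identifies it completely:
    # rows sharing an index are duplicates (distance 0); any other pair of
    # these rows has distance 1
    groups <- max.col(m[single, , drop = FALSE])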
A couple of relatively straightforward options that might be faster without needing any new code are the vegdist function from the vegan package and the Dist function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact that it supports parallelisation.
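Hedged sketches of how those two might be called (the argument choices here are my reading of the packages' documentation, so double-check them before relying on this):

    library(vegan)
    d1 <- vegdist(m, method = "jaccard", binary = TRUE)  # presence/absence Jaccard

    library(amap)
    d2 <- Dist(m, method = "binary", nbproc = 4)  # nbproc = number of parallel threads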