python - Difference between happybase table.scan() and HBase Thrift scannerGetList()
I have two versions of a Python script that scans a table in HBase, 1000 rows at a time, in a while loop. The first one uses happybase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows
    while variable:
        for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
            print(key)
        new_key = key
The second one uses the HBase Thrift interface, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/
    scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
    data = hbase.scannerGetList(scanner_id, 1000)
    while len(data):
        for dbpost in data:
            print(dbpost.row)  # row key of the returned row result
        data = hbase.scannerGetList(scanner_id, 1000)
The row keys in the database are numbers. The problem is that something weird happens at one particular row.
happybase prints (rows):
... 100161632382107648 10016177552 10016186396 10016200693 10016211838 100162138374537217 (point of interest) 193622937692155904 193623435597983745...
and the Thrift scanner prints (rows):
... 100161632382107648 10016177552 10016186396 10016200693 10016211838 100162138374537217 (point of interest) 100162267416506368 10016241167 10016296927 ...
This does not happen at the boundary of the next 1000 rows (the point of row_start=new_key or the next data = scannerGetList() call), but in the middle of a batch, and it happens every time.
I would say the second script, with scannerGetList(), is doing it right.
Why is happybase doing it differently? Is it considering timestamps, or some other internal happybase/HBase logic? Will it eventually scan the whole table, just in a different order?
PS. I know the happybase version will scan and print the 1000th row twice, and that scannerGetList() will ignore the first row in the next data. That is not the point; the magic happens in the middle of a 1000-row batch.
I'm not sure about your data, but your loops are not identical. Your Thrift version uses a single scanner, while your happybase example repeatedly creates a new scanner. Also, the happybase version imposes a scanner limit, while the Thrift version does not.
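To make the structural difference concrete, here is a minimal sketch that imitates both loop shapes over an in-memory sorted list of keys. No HBase is involved: `fake_scan`, `KEYS`, and the two driver functions are hypothetical stand-ins, and `fake_scan` simply assumes an inclusive `row_start` and an optional `limit`. It only illustrates one continuous scan versus restarted, limited scans; it does not explain the jump seen in the question.

```python
import bisect

# Hypothetical stand-in data: row keys sorted as byte strings.
KEYS = sorted(str(n).encode() for n in range(1, 26))


def fake_scan(row_start=b"", limit=None):
    """Yield keys >= row_start (inclusive), at most `limit` of them."""
    start = bisect.bisect_left(KEYS, row_start)
    stop = len(KEYS) if limit is None else min(len(KEYS), start + limit)
    for key in KEYS[start:stop]:
        yield key


def single_scanner():
    """One scanner, read to the end (the shape of the Thrift loop)."""
    return list(fake_scan())


def restarted_scanners(page_size):
    """A new limited scan per page, restarted at the last key seen.

    page_size must be at least 2, otherwise the loop never advances.
    """
    seen = []
    new_key = b""
    while True:
        page = list(fake_scan(row_start=new_key, limit=page_size))
        seen.extend(page)
        if len(page) < page_size:
            break
        new_key = page[-1]  # inclusive start: this key is returned again
    return seen


one_pass = single_scanner()
paged = restarted_scanners(page_size=10)
print(len(one_pass), len(paged))
```

In this sketch the restarted variant returns every page-boundary key twice, so it yields more rows than the single pass even though both cover the same set of keys.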
With Thrift you need to do the bookkeeping yourself, and you need duplicate code (the scannerGetList() call) for your loop, so perhaps that's what is causing the confusion.
The right approach with happybase would be this:
    table = connection.table(tablename)
    for key, data in table.scan(row_start=new_key, batch_size=1000):
        print(key)
        if some_condition:
            break  # this will cleanly close the scanner
Note: there are no nested loops here. Another benefit is that happybase will close the scanner when you're done with it, while your Thrift version does not.
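If you stay on the raw Thrift interface, the bookkeeping can be folded into a generator so that scannerGetList() appears only once and the scanner is closed in a `finally` block. This is only a sketch and has not been run against a real cluster: `iter_rows` and `FakeClient` are hypothetical names, and the fake client merely mimics the three scanner calls (`scannerOpenWithStop`, `scannerGetList`, `scannerClose`) so that the example runs without HBase.

```python
def iter_rows(client, table, start=b"", stop=b"", batch=1000):
    """Yield rows from one Thrift scanner, closing it when done."""
    scanner_id = client.scannerOpenWithStop(table, start, stop, [])
    try:
        while True:
            rows = client.scannerGetList(scanner_id, batch)
            if not rows:
                break  # an empty list means the scanner is exhausted
            for row in rows:
                yield row
    finally:
        client.scannerClose(scanner_id)


class FakeClient(object):
    """Hypothetical in-memory stand-in for the Thrift client.

    It ignores table, start, stop and columns and just serves a list.
    """

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = {}
        self.closed = []

    def scannerOpenWithStop(self, table, start, stop, columns):
        scanner_id = len(self._pos) + 1
        self._pos[scanner_id] = 0
        return scanner_id

    def scannerGetList(self, scanner_id, nb_rows):
        begin = self._pos[scanner_id]
        self._pos[scanner_id] = begin + nb_rows
        return self._rows[begin:begin + nb_rows]

    def scannerClose(self, scanner_id):
        self.closed.append(scanner_id)


client = FakeClient(["row-%d" % i for i in range(7)])
rows = list(iter_rows(client, "mytable", batch=3))
```

When the caller stops early, calling `.close()` on the generator (or letting it be garbage collected) runs the `finally` block, so the scanner is still released.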