python - Difference between happybase table.scan() and HBase Thrift scannerGetList()


I have two versions of a Python script that scan an HBase table 1000 rows at a time in a while loop. The first one uses happybase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows

    while variable:
        for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
            print(key)
        new_key = key

The second one uses the HBase Thrift interface, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/

    scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
    data = hbase.scannerGetList(scanner_id, 1000)

    while len(data):
        for dbpost in data:
            print(row_of_dbpost)  # placeholder from the question: the row key of dbpost
        data = hbase.scannerGetList(scanner_id, 1000)

The row keys in the database are numbers. The problem is that at one particular row something weird happens:

happybase prints (rows):

    ...
    100161632382107648
    10016177552
    10016186396
    10016200693
    10016211838
    100162138374537217 (point of interest)
    193622937692155904
    193623435597983745
    ...

and the Thrift scanner prints (rows):

    ...
    100161632382107648
    10016177552
    10016186396
    10016200693
    10016211838
    100162138374537217 (point of interest)
    100162267416506368
    10016241167
    10016296927
    ...
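The Thrift excerpt is what one would expect if the keys are stored as strings, since HBase orders row keys lexicographically by bytes rather than numerically. Below is a quick check in plain Python that uses only the keys quoted above; the variable names are made up for illustration, and the check says nothing about why happybase jumped.

```python
# Row keys as printed by the Thrift scanner in the question.
thrift_order = [
    "100161632382107648",
    "10016177552",
    "10016186396",
    "10016200693",
    "10016211838",
    "100162138374537217",
    "100162267416506368",
    "10016241167",
    "10016296927",
]

# The two keys happybase printed right after the point of interest.
happybase_next = ["193622937692155904", "193623435597983745"]

# Plain string sorting compares character by character, like byte order.
is_lexicographic = sorted(thrift_order) == thrift_order

# Numeric sorting gives a different sequence for the same keys.
numeric_order = sorted(thrift_order, key=int)

# Both happybase keys sort after every key in the Thrift excerpt.
jump_is_forward = min(happybase_next) > max(thrift_order)
```

If these checks hold, the happybase run appears to have skipped ahead within the same ordering instead of following a different one.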

This does not happen at the boundary to the next 1000 rows (row_start=new_scan or the next data = scannerGetList call), but in the middle of a batch. And it happens every time.

I think the second script, the one with scannerGetList, is doing it right.

Why is happybase doing it differently? Is it considering timestamps or some other internal happybase/HBase logic? Will it eventually scan the whole table, just in a different order?

PS. I know that the happybase version will scan and print the 1000th row twice, and that scannerGetList will ignore the first row in the next data. That is not the point: the magic happens in the middle of a 1000-row batch.

I'm not sure about your data, but your loops are not identical. Your Thrift version uses a single scanner, while your happybase example repeatedly creates a new scanner. Also, your happybase version imposes a scanner limit, while your Thrift version does not.
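To make the structural difference concrete, here is a minimal sketch that runs without HBase. `fake_scan` is a hypothetical stand-in for a scanner over a sorted list of keys (inclusive `row_start`, optional `limit`); it is not the happybase or Thrift API, and the termination rule in `restart_every_batch` is an assumption, since the question only shows `while variable:`.

```python
from bisect import bisect_left


def fake_scan(sorted_keys, row_start=None, limit=None):
    """Hypothetical scanner: yield keys >= row_start, in order, up to limit."""
    start = 0 if row_start is None else bisect_left(sorted_keys, row_start)
    stop = len(sorted_keys) if limit is None else start + limit
    for key in sorted_keys[start:stop]:
        yield key


def restart_every_batch(sorted_keys, limit):
    """Mimic the question's first loop: a new, limited scanner per pass."""
    if limit < 2:
        raise ValueError("limit must be at least 2 or the loop never advances")
    seen = []
    new_key = None
    while True:
        batch = list(fake_scan(sorted_keys, row_start=new_key, limit=limit))
        seen.extend(batch)
        if len(batch) < limit:  # assumed stop condition: a short batch
            break
        new_key = batch[-1]  # inclusive start, so this key comes back again
    return seen


def single_scanner(sorted_keys):
    """Mimic the question's second loop: one scanner, no limit."""
    return list(fake_scan(sorted_keys))


keys = ["%03d" % i for i in range(7)]  # made-up, already sorted keys
restarted = restart_every_batch(keys, limit=3)
single = single_scanner(keys)
```

In this model the restarted variant visits the same keys in the same order but repeats each batch's last key, while the single scanner returns every key exactly once.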

With Thrift you need to do the bookkeeping yourself, and you need duplicate code (the scannerGetList() call) for your loop, so perhaps that's what is causing your confusion.

The right approach with happybase is this:

    table = connection.table(tablename)
    for key, data in table.scan(row_start=new_key, batch_size=1000):
        print(key)
        if some_condition:
            break  # this will cleanly close the scanner

Note: no nested loops here. The other benefit is that happybase will properly close the scanner when you're done with it, while your Thrift version does not.
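If you stay with raw Thrift, the duplicated scannerGetList() call and the missing close could be folded into a small generator. This is only a sketch: `iter_scanner` is a hypothetical helper, it assumes the scanner has already been opened and that `client` exposes `scannerGetList` and `scannerClose` as in the HBase Thrift interface, and `FakeClient` is an in-memory stand-in so the example runs without a server.

```python
def iter_scanner(client, scanner_id, batch_size=1000):
    """Yield rows from an already opened scanner and always close it."""
    try:
        while True:
            rows = client.scannerGetList(scanner_id, batch_size)
            if not rows:
                break  # an empty batch means the scanner is exhausted
            for row in rows:
                yield row
    finally:
        client.scannerClose(scanner_id)


class FakeClient(object):
    """Hypothetical in-memory stand-in so the sketch runs without HBase."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.closed = []

    def scannerGetList(self, scanner_id, nb_rows):
        batch, self.rows = self.rows[:nb_rows], self.rows[nb_rows:]
        return batch

    def scannerClose(self, scanner_id):
        self.closed.append(scanner_id)


client = FakeClient(["a", "b", "c", "d", "e"])
all_rows = list(iter_scanner(client, scanner_id=7, batch_size=2))

early = FakeClient(["a", "b", "c"])
scanner = iter_scanner(early, scanner_id=8, batch_size=2)
first = next(scanner)
scanner.close()  # stopping early still runs the finally block
```

The calling code then becomes a single flat for loop, much like the happybase version above.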

