StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets. StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms. The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Features

  • Collection-Level Operations
  • Low-Level Python API
  • String libraries, splitting, sorting, and shuffling large textual dataset
  • JavaScript docs
  • Python docs
  • Substring Search

Project Samples

Project Activity

See All Activity >

Categories

JSON

License

Apache License V2.0

Follow StringZilla

StringZilla Web Site

Other Useful Business Software
MongoDB Atlas runs apps anywhere Icon
MongoDB Atlas runs apps anywhere

Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
Start Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of StringZilla!

Additional Project Details

Programming Language

C++

Related Categories

C++ JSON Software

Registered

2023-10-18