Pragmatic CS #6: Bloom filters, protecting and finding secrets, Serverless Supercomputer
Embrace the era of serverless supercomputing
As you’d know from last week’s issue, I’ve been pretty excited about Rust recently. Microsoft has also said that Rust is the industry’s best chance at safe systems programming. With generics in Rust playing an important role in its popularity, we are now seeing Go experimenting with the addition of generics.
Using a Bloom Filter to Build A Static Full-Text Search Engine
A Bloom filter stores a 'fingerprint' (a number of hash values) of all input values instead of the raw input. The result is a low-memory-footprint data structure.
The Bloom filter is a space-efficient way to tell, with some confidence, whether an element is in a set, without storing the elements themselves. The trade-off is a false-positive rate: the filter may report an element as present when it was never added, although it never reports a false negative. This implementation uses Rust and WebAssembly, allowing you to play with these up-and-coming technologies. Add this to your dev blog!
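To make the fingerprint idea concrete, here is a minimal Python sketch of a Bloom filter. It is not the Rust/WebAssembly implementation from the linked article; the class name, the SHA-256 based double hashing, and the sizing from the standard textbook formulas are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Minimal sketch: each item sets k bit positions in an m-bit array."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        n, p = expected_items, false_positive_rate
        self.num_bits = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from two halves of one SHA-256 digest (double hashing).
        # This hashing scheme is an assumption; faster non-cryptographic hashes are common.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Index the words of one hypothetical blog post.
words = "a bloom filter stores fingerprints instead of raw input".split()
post_filter = BloomFilter(expected_items=len(words))
for word in words:
    post_filter.add(word)
```

A search page could keep one such filter per post and test each query word with `word in post_filter`. A hit only means the post possibly contains the word, so occasional irrelevant results are the price of the small footprint.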
Best practices for managing and storing secrets
The nature of git means that if a secret gets overlooked in history, it is compromised forever, as anyone with access to the repository can find it in previous revisions of the codebase.
Avoid wildcard git commands like git add . (I know many developers who use this!). Instead, add each file by name when making a commit, and use git status to review tracked and untracked files first. Add sensitive files to .gitignore, but don’t rely solely on this practice, as some secrets might still slip through. Use automated secrets scanning on git repositories, keeping in mind that scanners might miss secrets in bytecode.
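As a rough illustration of what automated secrets scanning does, here is a small Python sketch that checks text line by line against a few regular expressions. The rule names and patterns are assumptions made for this example (real scanners ship far larger rule sets plus entropy checks), and the key shown is the placeholder value used in AWS documentation.

```python
import re

# Hypothetical, deliberately small rule set for illustration only.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded_assignment": re.compile(
        r"(?i)\b(?:password|secret|api_key|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str):
    """Return (line_number, rule_name) pairs for every line that matches a rule."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


sample = (
    "DEBUG = True\n"
    'AWS_KEY = "AKIAIOSFODNN7EXAMPLE"\n'
    'password = "correct-horse-battery"\n'
)
findings = scan_text(sample)
```

Because this only reads text, it would see nothing useful in a compiled file, which is exactly the blind spot mentioned above.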
Other approaches to consider:
use solutions like git secret to encrypt secrets in git repositories
store secrets in local environment variables
use secrets management systems such as HashiCorp Vault or AWS Key Management Service
whitelist IP addresses
opt for short-lived secrets where possible
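The environment-variable approach can be sketched in a few lines of Python. The helper name `load_secret` and the variable name in the usage comment are hypothetical; the point is that the value never appears in the source file, and a missing value fails loudly instead of falling back to a hardcoded default.

```python
import os


def load_secret(name: str) -> str:
    """Read a secret from the environment, raising if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


# Hypothetical usage: the variable is set outside the repository,
# for example in the shell or the deployment platform's configuration.
# api_token = load_secret("PAYMENTS_API_TOKEN")
```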
Decompiling Python Bytecode in Public Repos to Find Secrets
Continuing the theme of protecting secrets, thousands of GitHub repositories have been found to contain secrets hidden inside their bytecode.
Cached bytecode is a low-level internal performance optimization, which is the kind of thing Python was supposed to free us from having to think about! The contents of .pyc files are inscrutable without special tools like a disassembler or decompiler. And when these files are buried inside __pycache__, they’re easy to overlook. Many text editors and IDEs hide these folders and files from the source tree to avoid cluttering up the screen, making it easy to forget that they even exist.
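To see why compiled files leak secrets, here is a small Python sketch that compiles a snippet in memory and walks the constants of the resulting code object. The source string and the secret values are made up for the example. A real .pyc file also begins with a short header before the marshalled code object, which this in-memory version skips.

```python
import marshal
import types

# Hypothetical module containing two hardcoded secrets.
source = (
    'API_KEY = "hunter2-example"\n'
    "\n"
    "def connect():\n"
    '    return "db-password-example"\n'
)

# compile() produces the code object; marshal.dumps() gives the serialized
# form that a .pyc file stores after its header.
code = compile(source, "settings.py", "exec")
payload = marshal.dumps(code)


def string_constants(code_obj):
    """Yield every string constant in a code object, including nested functions."""
    for const in code_obj.co_consts:
        if isinstance(const, str):
            yield const
        elif isinstance(const, types.CodeType):
            yield from string_constants(const)
        elif isinstance(const, (tuple, frozenset)):
            yield from (c for c in const if isinstance(c, str))


recovered = list(string_constants(marshal.loads(payload)))
```

Both secret strings come back in `recovered` without any decompiler, so deleting the secret from the .py file is not enough if the compiled file was already committed.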
Using The Idea of a Serverless Supercomputer to Achieve Near-interactive Completion Times
I always thought serverless architecture was just about rapid and convenient scaling, but this article really opened my eyes to the paradigm shift that serverless makes possible.
Imagine an application that would normally take 1 hour to run on a single machine… What if instead you could spin up 3600 lambdas that each run for one second to return near instantaneous results?
Any process that can be chunked up and processed in pieces in a distributed fashion is suitable for this massively parallel runtime: software compilation, software testing, data visualisation, image and video encoding, analysis, compression and so on.
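The chunk-and-fan-out pattern can be sketched locally in Python. This is only a stand-in: threads play the role of lambdas, and `process_chunk` is a hypothetical unit of work. A real deployment would invoke functions through the cloud provider's API, and startup, network and merge overhead would push the total above the ideal "one hour divided by 3600 workers" figure.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split items into at most num_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    # Hypothetical unit of work; in the article's scenario this would be
    # one slice of a compile, test, or encoding job.
    return sum(x * x for x in chunk)


def fan_out(items, num_workers):
    """Process chunks in parallel, then merge the partial results."""
    chunks = split_into_chunks(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


items = list(range(3600))
total = fan_out(items, num_workers=36)
```

The pattern only works this cleanly when chunks are independent and the merge step is cheap, which is why the workloads listed above are good candidates.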