A data lake is an excellent way to store and query structured and unstructured data; in our case, market data from various providers. Rather than pulling our hair out agreeing on the exact schema our market data should be in, then spending our valuable lifetime transforming the data and deciding how to host it, let's just dump the data into S3 and let the data lake technology do the heavy lifting for us.
Welcome to AWS Lake Formation
How does it work? Let’s go through the steps.
Upload Market Data
Query any market data provider and stick the data straight into S3; no need for a long discussion about schemas.
An FX trading exchange. More information here.
We have a tool that downloads minutely FX Spot data.
python epython/marketdata/utils/fxcm.py --help
Usage: fxcm.py [OPTIONS]

Options:
  --years TEXT  comma separated list of years  [required]
  --pairs TEXT  comma separated list of pairs  [required]
  --help        Show this message and exit.
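As a rough sketch of how a downloader with this interface can be structured (the option names mirror the help output above; the `argparse` wiring and the `download_pair_year` helper are assumptions for illustration, not the tool's actual implementation):

```python
import argparse


def parse_csv_option(value):
    """Split a comma separated option value into a clean list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    # Mirrors the --years / --pairs options shown in the help output.
    parser = argparse.ArgumentParser(prog="fxcm.py")
    parser.add_argument("--years", required=True,
                        help="comma separated list of years")
    parser.add_argument("--pairs", required=True,
                        help="comma separated list of pairs")
    return parser


def download_pair_year(pair, year):
    """Placeholder: fetch minutely FX Spot data for one pair/year."""
    print(f"downloading {pair} {year}")


def main(argv=None):
    args = build_parser().parse_args(argv)
    # One download per (pair, year) combination.
    for pair in parse_csv_option(args.pairs):
        for year in parse_csv_option(args.years):
            download_pair_year(pair, year)


if __name__ == "__main__":
    main()
```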
Run for 1 year and 1 ccy:
python epython/marketdata/utils/fxcm.py --years 2018 --pairs USDJPY
Or knock ourselves out with data:
python epython/marketdata/utils/fxcm.py \
    --years=2014,2015,2016,2017,2018 \
    --pairs=EURUSD,GBPUSD,NZDUSD,USDCAD,USDCHF,USDJPY
All the downloaded data now goes to S3, placed under a path with a partitioning syntax. Below is one example.
Notice they are just gzipped CSV files, exactly as they were downloaded from FXCM.
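A minimal sketch of writing such files to a partitioned S3 path with boto3. The bucket name, the `fxcm/pair=…/year=…` layout, and the function names here are assumptions for illustration, since the actual path isn't reproduced above:

```python
from pathlib import Path

# Hypothetical bucket; the real destination follows whatever
# partitioning syntax the upload tool actually uses.
BUCKET = "epython-marketdata"


def partition_key(provider, pair, year, filename):
    """Build a Hive-style partitioned key, e.g.
    fxcm/pair=USDJPY/year=2018/USDJPY_2018.csv.gz"""
    return f"{provider}/pair={pair}/year={year}/{filename}"


def upload_file(local_path, provider, pair, year):
    """Upload one gzipped CSV exactly as downloaded -- no transformation."""
    import boto3  # deferred so the key logic works without AWS credentials
    key = partition_key(provider, pair, year, Path(local_path).name)
    boto3.client("s3").upload_file(str(local_path), BUCKET, key)
    return key


if __name__ == "__main__":
    upload_file("USDJPY_2018.csv.gz", "fxcm", "USDJPY", 2018)
```

Partitioning by pair and year means Athena can later prune whole prefixes when a query filters on those columns.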
A Tokyo-based Bitcoin exchange.
python -m epython.marketdata.utils.bitflyer --help
Usage: bitflyer.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  download  Download BitFlyer marketdata from TAY Global Ltd to dest folder.
  upload    Upload to epyton-marketdata from local src folder
Remember, our data is large (years of minutely data across many pairs) and it is distributed. We can still run complicated queries on that same data fast, really fast.
Athena query results are stored in the S3 bucket epython-athena.
This means a repeated query can return very quickly, but it also means that the bucket progressively gets larger.
We could solve this by adding a lifecycle rule to the bucket.
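For reference, running such a query programmatically looks roughly like this. The result location comes from the bucket named above; the database, table, and column names in the SQL are hypothetical:

```python
import time

RESULT_LOCATION = "s3://epython-athena/"  # bucket named in the text

# Hypothetical database/table/columns, purely for illustration.
SQL = """
SELECT pair, year, avg(close) AS avg_close
FROM marketdata.fx_spot
WHERE pair = 'USDJPY' AND year = 2018
GROUP BY pair, year
"""


def run_query(sql, database="marketdata"):
    import boto3  # deferred: needs AWS credentials to actually run
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": RESULT_LOCATION},
    )["QueryExecutionId"]
    # Poll until the query finishes; result files land in RESULT_LOCATION.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state, qid
        time.sleep(1)


if __name__ == "__main__":
    print(run_query(SQL))
```

Every execution leaves a result file in the output bucket, which is exactly why it grows and needs the lifecycle rule below.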
Apply to all objects in the bucket
We could transition data into cheaper storage if there were a regulatory requirement to keep it, but in this case we do not need that.
Now we choose to expire and delete objects from this bucket after 7 days.
Acknowledge that objects in this bucket will live only for 7 days.
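The same rule clicked through in the console can be expressed with boto3, which is handy if the bucket setup lives in code. The rule ID here is an assumption; the 7-day expiry and the apply-to-all-objects filter match the steps above:

```python
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "expire-athena-results",  # rule name is an assumption
            "Status": "Enabled",
            "Filter": {"Prefix": ""},       # apply to all objects in the bucket
            "Expiration": {"Days": 7},      # expire and delete after 7 days
        }
    ]
}


def apply_lifecycle(bucket="epython-athena"):
    import boto3  # deferred: needs AWS credentials to actually run
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_CONFIG
    )


if __name__ == "__main__":
    apply_lifecycle()
```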