This is work needed after milestone 2.
Lets start by discussing what queries we want our system to support.
We are going to create fake "metadata" associated with our 1000 time series, so that we can test a workflow in which we query timeseries by their metadata in a relational database, get results, and look for similar other time series. There are 2 ways this can play out:
We'll take the following steps to set up a REST API to support our queries, both on the relational databse, as well as on the socket-server database you are writing.
(1) keeping the ids of the 1000 timeseries from previous project missives, add random metadata for a floating point field blarg
which takes continuous values on [0,1] drawn from a Uniform distribution.
(2) randomly drawn from letters A-F to create a field level
associated with each id. This is kinda like throwing a dice: there are discrete enumerated values A/B/C/D/E/F that can result
(3) throw this information along with the mean (mean
) and standard deviation(std
) of the time series into a relational database. Thus the main table in this database has 5 columns: the id
being the primary keys, and 4 others: mean
, std
, blarg
, and level
. You can get things working locally and test with sqlite, but for production you want to use postgres on your EC2 instance.
(4) support an additional query supported by your custom database server which returns a time-series for an id. Expose this to your command line program: you will need to support an additional operation in your protocol. You should use your custom storage-manager that you wrote in part 8 of the project to fetch time series for this purpose from disk. At this stage make sure that when you store time series into the database, you use the storage manager interface as well. (upto now there has been no notion of storing the time series in the "database", you have probably just come up with your own scheme to store them on disk in some folder).
Now we will write a REST API that queries this relational database, the custom similarity database that you wrote, and the Storage Manager.
Endpoints you must support:
/timeseries
GET
should send back a json with metadata from all the time series/timeseries
POST
adds a new timeseries into the database given a json which has a key for an id and a key for the timeseries, and returns the timeseries/timeseries/id
should send back metadata and the timeseries itself in a JSON payload/timeseries?mean_in=1.5-1.53
type queries should now be supported and send back only metadata. For continuous variables you only need to support range queries with string mean_in=1.5-1.53
whereas for discrete variables(level
here) you need to support equals and in level_in=A,B,C
or level=A
. For simpilicity we wont support equals or less-than/greater-than type queries on continuous variables. Finally we'll support only one query at a time (so no combined SELECTs on the database). This will enable you to develop either in DBAPI2 or SQLAlchemy (your choice, although I think that using Flask-SQLAlchemy is a bit easier)./simquery
GET
should take a id=the_id
querystring and use that as an id into the database to find the timeseries that are similar, sending back the ids of (say) the top 5./simquery
POST
should take a timeseries as an input in a JSON, carry out the query, and return the appropriate ids as well. This is an unusual use of POST
.This concludes part 10. Now you can see how a query might be made: use 4. to make a filter. Use 3 to draw the timeseries on the results page. Provide an additional link for each result that uses 5 to compute the similarity, showing the results using 3 again. We can also support a separate uploaded query using 6 and show the results. (The only remaining thread there is how to set things up to display the query timeseries which is NOT part of the database. The answer is to use the same code on the browser side that you used to display the time series earlier: this will ensure that your JS code is also modular.)