A hierarchically cached word counter for public web pages (HTML documents), exposed as an API through a gRPC service.
This program accesses a given location on the web and counts words in the HTML document found there, except:
- when the user asks for an internal resource such as file://etc/fstab or https://cluster.local
- when the user asks for a page that is too large, such as https://download.ubuntu.org/installation_image.iso
- (check simplewc/simplewc/tests/test_security.py for more cases.)
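The exclusions above can be sketched as a pre-flight check. This is a minimal illustration, not the project's actual implementation (see test_security.py for the real cases). `ALLOWED_PROTOCOLS` and `MAX_CONTENT_SIZE` mirror the names in simplewc.config, while `UnsafeURIError`, `check_uri`, and `check_size` are hypothetical names. The sketch only inspects the URI text; a production check would presumably also resolve the host name before deciding.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_PROTOCOLS = ('http', 'https')
MAX_CONTENT_SIZE = 2 ** 24  # 16 MiB


class UnsafeURIError(ValueError):
    """Hypothetical error type for rejected requests."""


def check_uri(uri: str) -> None:
    """Raise UnsafeURIError if the URI does not look like a public web page."""
    parts = urlsplit(uri)
    if parts.scheme not in ALLOWED_PROTOCOLS:
        raise UnsafeURIError(f'You can only access {ALLOWED_PROTOCOLS} protocol')
    host = parts.hostname
    if not host:
        raise UnsafeURIError('URI has no host')
    if host == 'localhost' or host.endswith('.local'):
        raise UnsafeURIError('You cannot access Local URI')
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return  # not a literal IP address; DNS resolution is out of scope here
    if not address.is_global:
        raise UnsafeURIError('You cannot access Local URI')


def check_size(content_length: int) -> None:
    """Raise UnsafeURIError if the announced body size exceeds the limit."""
    if content_length > MAX_CONTENT_SIZE:
        raise UnsafeURIError('Requested page is too big')
```

Both functions raise on rejection and return nothing on success, so a caller can run them before any download starts.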
Internally, this program reuses and caches many parts, including the most recent results and recently acquired HTML documents.
- You can call the API on the same HTML resource multiple times. In most cases it will trigger only one download of the HTML document from the target public web host.
- You can request multiple word counts on a single document. The results will be streamed to you.
- The most recently accessed queries and previously retrieved HTML documents can live in separate data stores, so you can use cheap storage for documents and an in-memory cache server for the request cache.
- Transport Layer
  - gRPC
- Service Layer
  - gRPC servicer implementation using our model
- Data Model Layer
  - class HTMLDocument
  - class
- Data storage layer
  - Multiple databases
Install pytest
> python -m pip install pytest
Run tests
> python -m pytest
To write additional unit tests, note that we provide MockDocumentStorage and MockQueryCache.
- Deploy test DBs
  - Insecure Redis running at localhost:50001
  - Insecure MongoDB running at localhost:27017
- Install the package (or set PYTHONPATH/pipenv/venv if you wish)
> pip install ./simplewc
- Run server program
> python -m simplewc
- (In another terminal) run example_client
> python simplewc/simplewc/example_client.py
TODO: regression tests can be included in test.
Use the gRPC service rpc CountWords (WordCountRequest) returns (stream WordCount) in your favorite language.
- CountWords returns a gRPC stream of WordCount
- WordCountRequest contains one URI to the HTML document and the word(s) to count
- WordCount contains one URI, one word, and its number of appearances
- Error codes and messages are reported as gRPC standard error codes
An example client script is provided in simplewc/simplewc/example_client.py.
The simplest example would be:
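import grpc
# Import paths below are assumed from the protoc output directory; adjust to your setup.
from simplewc.protos.wc_pb2 import WordCountRequest
from simplewc.protos.wc_pb2_grpc import WordCountServiceStub

channel = grpc.insecure_channel('localhost:50051')
stub = WordCountServiceStub(channel)
# One request can carry several words; the results arrive as a stream.
response_stream = stub.CountWords(WordCountRequest(
    uri='https://virtusize.jp', words=['fit', 'size', 'virtusize']))
for r in response_stream:
    print(f'\tAt {r.uri}, word {r.word} appears {r.count} time(s)')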
The API is provided in the form of a gRPC rpc.
/* WordCountService services word counting based on WordCountRequest message.
 *
 * Note on caching:
 * - The implementation of this service may contain internal caching on
 *   HTML document.
 * - Request multiple word count in a single uri rather than calling
 *   any services multiple times. */
service WordCountService {
    /* Service each word's occurrence in a certain uri.
     * If error happens, it will cut a stream and send gRPC error code with
     * detailed message instead of WordCount stream */
    rpc CountWords (WordCountRequest) returns (stream WordCount);
}
Messages are provided in the form of protobuf messages. For details, please refer to the proto file at simplewc/simplewc/protos/wc.proto, which looks like ...
/* WordCountRequest represents a word count query. You can specify multiple
 * words at the same time */
message WordCountRequest {
    string uri = 1;
    repeated string words = 2;
}
/* WordCount represents a word and a occurrence of it in uri */
message WordCount {
    string word = 1;
    string uri = 2;
    uint32 count = 3;
}
The example client runs as follows:
```text
Try to find 3 different words in a URL
	At https://virtusize.jp, word fit appears 4 time(s)
	At https://virtusize.jp, word size appears 0 time(s)
	At https://virtusize.jp, word virtusize appears 4 time(s)
Try to find nothing
No word found
Inaccessible host: non existing https://virtusize.co.jp
RPC Error <_Rendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "We could not reach a server of requested URI"
debug_error_string = "{"created":"@1553340540.602000000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"We could not reach a server of requested URI","grpc_status":13}"
>
Inaccessible host: 127.0.0.1
RPC Error <_Rendezvous of RPC that terminated with:
status = StatusCode.PERMISSION_DENIED
details = "You cannot access Local URI"
debug_error_string = "{"created":"@1553340540.603000000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"You cannot access Local URI","grpc_status":7}"
>
Inaccessible host: file:///etc/apt/sources.list
RPC Error <_Rendezvous of RPC that terminated with:
status = StatusCode.PERMISSION_DENIED
details = "You can only access ('http', 'https') protocol"
debug_error_string = "{"created":"@1553340540.604000000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"You can only access ('http', 'https') protocol","grpc_status":7}"
>
```
- The user sends a request (uri, multiple words)
- Check whether it is a safe request
- Open a stream
- For every word,
  - Check if the (uri/word) combination is in the result cache
    - If so, return the result. Do not update the TTL of the cache.
  - If not, check local memory to see whether we already loaded the HTML document.
    - If we have the document in local memory, return the result and update the query cache
  - If not, check the document storage in the local network.
    - If we have the document in storage, update the recent query cache and return the result
    - If we don't have it there either, get it over the internet
      - Store both the HTML document and the recent result
- Close the stream when
  - the last result has been sent
  - an error occurs
    - Send the error code and a detailed message
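The lookup order above can be sketched as a generator. This is a simplified illustration rather than the servicer's actual code: plain dictionaries stand in for the Redis query cache and the MongoDB document storage, `fetch` is a hypothetical callable that downloads a page, and splitting on whitespace is a placeholder for real HTML parsing. The safety check and error handling are left out.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


def count_words(
    uri: str,
    words: Iterable[str],
    query_cache: Dict[Tuple[str, str], int],
    doc_storage: Dict[str, Counter],
    fetch: Callable[[str], str],
) -> Iterator[Tuple[str, str, int]]:
    """Yield (uri, word, count), consulting the cheapest source first."""
    local_doc: Optional[Counter] = None  # document kept in memory for this request
    for word in words:
        key = (uri, word)
        if key in query_cache:
            # 1. Result cache hit: return it as-is (no TTL refresh).
            yield uri, word, query_cache[key]
            continue
        if local_doc is None:
            # 2. Not in local memory yet: try the document storage.
            local_doc = doc_storage.get(uri)
            if local_doc is None:
                # 3. Last resort: download, count every word once, store it.
                local_doc = Counter(fetch(uri).split())
                doc_storage[uri] = local_doc
        count = local_doc[word]  # Counter returns 0 for missing words
        query_cache[key] = count
        yield uri, word, count
```

Because the function is a generator, each result can be sent to the client as soon as it is known, which matches the streamed response of the API.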
As we are not using authentication, use simplewc.servicer:serve_insecure. You may want to refer to simplewc/simplewc/__main__.py for an example of launching a server for test purposes.
- Requirements
  - Insecure Redis running at localhost:50001
  - Insecure MongoDB running at localhost:27017
Modify simplewc.config to configure these.
Also, MockDocumentStorage and MockQueryCache are provided to run without Redis and MongoDB.
Currently no Helm package, compose file, or even Dockerfile is provided. I don't expect you to use this in production. But if you are interested...
Currently this program is configured via a Python file. simplewc.config is configured as follows:
ALLOWED_PROTOCOLS = ('http', 'https')
MAX_CONTENT_SIZE = 2 ** (10 + 10 + 4) # 16.0 MiB
MAX_GRPC_SERVER_THREADS = 16
INSECURE_HOST = 'localhost'
INSECURE_PORT = 50001
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0
CACHE_EXPIRE = '600'
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'wc_doc_cache'
MONGO_COLLECTION = 'wc_doc_collection'
MONGO_TTL = 3600
You may want to edit this to use getenv, such as getenv('REDIS_HOST'), so that it can be configured from an env file. Or edit it directly at build time to follow the immutable infrastructure pattern.
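A getenv-based variant of the configuration could look like the sketch below. It covers only a few of the settings, and using environment variable names identical to the constant names is an assumption, not something the project defines. Each default repeats the value currently hard-coded in simplewc.config.

```python
import os

# Every setting falls back to the current hard-coded value when the
# environment variable is not set. Variable names are assumed.
REDIS_HOST = os.getenv('REDIS_HOST', 'localhost')
REDIS_PORT = int(os.getenv('REDIS_PORT', '6379'))
REDIS_DB = int(os.getenv('REDIS_DB', '0'))
CACHE_EXPIRE = os.getenv('CACHE_EXPIRE', '600')  # kept as a string, as in the original

MONGO_HOST = os.getenv('MONGO_HOST', 'localhost')
MONGO_PORT = int(os.getenv('MONGO_PORT', '27017'))
MONGO_TTL = int(os.getenv('MONGO_TTL', '3600'))
```

Environment variables are always strings, so numeric settings are converted explicitly; a malformed value then fails at import time instead of later inside a database client.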
Currently this program expects insecure internal communication. We don't expect privilege checks on the databases.
For example, the Redis singleton is created as follows.
RedisQueryCache(REDIS_HOST, REDIS_PORT, REDIS_DB)
The RedisQueryCache class (and the MongoDB one too) has extra options to configure security. Update this call to secure internal connections.
Prepare grpc_tools and mypy-protobuf in your dev environment, then run
> python -m grpc_tools.protoc -Isimplewc/simplewc/protos --python_out=simplewc/simplewc/protos --grpc_python_out=simplewc/simplewc/protos --mypy_out=simplewc/simplewc/protos wc.proto
On Windows, add
--plugin=protoc-gen-mypy=path\to\mypy-protobuf\python\protoc_gen_mypy.bat
- If the encoding is not specified in the HTML, this only works with UTF-8 encoded pages.
  - We can improve this with encoding guessing. There are good OSS implementations such as cchardet
- A user can only search for a word with a maximum length of 4MB
- Web page cache
  - Counting all the words and counting one word have the same time complexity, O(n).
    - So we save the counts of all words in a document
  - The average HTML document size is ~30KB
  - Google estimates that about 60 billion web pages exist
    - We can cache 0.00005% of web pages per 1 GB.
    - We can store 1% of web pages with only 20TB of storage
    - But we are not building a Google-scale API here.
    - Or, we can store ~2 million web pages in 64GB of storage
      - It takes roughly a minute (~51 seconds at 10 Gbit/s) to fill up 64 GB of storage with a 10G internet connection, without any overhead.
  - But HTML document size varies greatly from page to page, so let's create a strategy to expire data and maintain 50% disk usage
    - The minimum lifetime of each cache record will be (storage size)/(internet speed)
      - NOTE: (bytes) / (bytes/time) = time
    - The maximum lifetime of each cache record must be set by the user
    - To keep the cache size ~ storage size/2,
      - as (v_create - v_expire) * dt = delta_storage, we can solve an ODE
      - or, as a very rough approximation, we can use ((1-t) max + t min), where t = min(current storage usage / (storage size/2), 1)
- Query result cache
  - Each record will take ~4KB (1KB of URL, 1KB of word, 4 bytes of count + overhead)
  - We can save ~4 million query results in 16GB of memory.
  - We can set a TTL on cached results and keep them in LRU fashion.
- Choice of Database Solution
  - Web page cache
    - Cheap (disk-based) storage, TTL support, document database: MongoDB
  - Query result cache
    - In-memory, fast membership check, LRU support: Redis
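
The expiry strategy for the web page cache can be sketched as a linear interpolation between the user-set maximum lifetime and the minimum lifetime (storage size divided by network speed), so that records live longest when the cache is empty and shortest once usage reaches half of the storage. The function names are hypothetical, "10G" is read as 10 Gbit/s, and the 3600-second maximum is an assumption borrowed from MONGO_TTL in the configuration.

```python
def min_lifetime(storage_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Shortest useful lifetime: the time needed to fill the storage once."""
    return storage_bytes / bandwidth_bytes_per_s


def record_ttl(used_bytes: float, storage_bytes: float,
               max_ttl: float, min_ttl: float) -> float:
    """Interpolate the TTL between max_ttl (empty) and min_ttl (half full)."""
    t = min(used_bytes / (storage_bytes / 2), 1.0)
    return (1 - t) * max_ttl + t * min_ttl


GB = 10 ** 9
storage = 64 * GB
bandwidth = 10 * GB / 8  # 10 Gbit/s expressed in bytes per second

floor = min_lifetime(storage, bandwidth)           # 51.2 seconds
empty_ttl = record_ttl(0, storage, 3600, floor)    # 3600.0, the user-set maximum
half_full_ttl = record_ttl(storage / 2, storage, 3600, floor)  # 51.2, the minimum
```

Once usage passes half of the storage, t is clamped to 1, so new records keep the minimum lifetime until expiry brings usage back down.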