- provide a fast and persistent data pipeline for documents
- be as open as possible for modern data analytics/machine learning/AI frameworks
- make use of those frameworks, do not try to reproduce them
- make them comparable by storing their results in a common format
- gather as much information as possible about documents
  - Metadata
  - Authorization
  - inner structure
  - relations
  - categories
- be open source
- use Kotlin
- use Kotlin coroutines for modern development
- data is streaming data, it's a fact
- rely on Apache Kafka, it's a beast
- have a well designed solution domain
- Start a Kafka Cluster
I have provided a docker-compose.yml file, which allows you to run a Confluent Kafka stack as a Docker swarm with the command

```bash
docker stack deploy -c docker-compose.yml my-confluent
```

where you replace the value cixin with the name of your current host.
If you go this way, you will have to find the container ID of the container running the
confluentinc/cp-kafka-rest:latest image (e.g. with docker ps) and then execute the commands
provided in create_topics.txt in this directory.
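If you would rather not exec into the REST proxy container, the topics could also be created programmatically. Below is a minimal Kotlin sketch using Kafka's AdminClient; the topic names, partition count and replication factor are placeholders only, the authoritative list is whatever create_topics.txt contains.

```kotlin
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.admin.AdminClientConfig
import org.apache.kafka.clients.admin.NewTopic

// Alternative to exec-ing into the REST container: create the topics with the
// Kafka AdminClient. Topic names, partition count and replication factor here
// are placeholders; the authoritative commands live in create_topics.txt.
fun createPipelineTopics(bootstrapServers: String = "localhost:9092") {
    val props = mapOf<String, Any>(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG to bootstrapServers)
    AdminClient.create(props).use { admin ->
        val topics = listOf("documents", "metadata", "chunks")   // placeholder names
            .map { NewTopic(it, 1, 1.toShort()) }                // 1 partition, replication factor 1
        admin.createTopics(topics).all().get()                   // wait until the topics exist
    }
}
```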
- Initialise an IntelligencePipeline:

```kotlin
val pipeline = IntelligencePipeline(bootstrapServers, stateDir)
```

where bootstrapServers is something like localhost:9092 and stateDir should point to a local directory (used by the state stores)
- register your Ingestors:

```kotlin
pipeline?.registerIngestor(DirectoryIngestor("src/test/resources"))
```

- register your MetadataProducers:

```kotlin
pipeline?.registerMetadataProducer(TikaMetadataProducer())
pipeline?.registerMetadataProducer(HashMetadataProducer())
```

- start the pipeline:
```kotlin
launch {
    pipeline?.run()
}
```

- make sure to set advertised.host.name in the Kafka broker configuration
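Putting the steps together, a minimal end-to-end sketch could look like the following. The pipeline classes and calls are taken from the snippets above; the main function, the directory values and the runBlocking wrapper are just one possible way to wire them up.

```kotlin
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Illustrative wiring of the quickstart steps above. The pipeline classes come
// from this repository; runBlocking is only one way to provide a coroutine
// scope for launch { }.
fun main(): Unit = runBlocking {
    val bootstrapServers = "localhost:9092"    // your Kafka cluster
    val stateDir = "pipeline-state"            // local directory for the state stores

    val pipeline = IntelligencePipeline(bootstrapServers, stateDir)

    pipeline.registerIngestor(DirectoryIngestor("src/test/resources"))
    pipeline.registerMetadataProducer(TikaMetadataProducer())
    pipeline.registerMetadataProducer(HashMetadataProducer())

    // run() is long-running, so start it in its own coroutine
    launch {
        pipeline.run()
    }
}
```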
- Use case 1: Two MetadataProducers produce independent and possibly conflicting data ==> there has to be a way to know about those conflicting results and act accordingly (e.g. choose one, or take an average)
- Use case 2: To detect sentence boundaries, a ChunkProducer makes an educated guess. Then a human makes some saner choices. Those choices are preserved when the document content changes.
- Use case 3: a rather dumb ChunkProducer makes a first pass. Then a better system improves upon it. This might be useful for systems where the second, better producer cannot work efficiently with large documents (e.g. StanfordNLP)
- give each producer a weight: the higher the weight, the more important its results (see the sketch after this list)
- why does exactly once not work?
- how to handle deletes?
- robust exception handling: test with a rogue MetadataProducer
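For use case 1 and the weighting idea, conflict resolution could look roughly like the sketch below. MetadataValue is a simplified stand-in for whatever metadata types this repository actually defines; only the "prefer the highest weight" strategy is shown.

```kotlin
// Simplified stand-in for illustration only; the real metadata types in this
// repository will look different.
data class MetadataValue(val producer: String, val weight: Int, val values: Map<String, String>)

// Resolve conflicting values from independent producers by preferring the
// producer with the highest weight (one of the strategies mentioned above).
fun resolveByWeight(candidates: List<MetadataValue>): Map<String, String> {
    val merged = mutableMapOf<String, Pair<Int, String>>()   // key -> (weight, value)
    for (candidate in candidates) {
        for ((key, value) in candidate.values) {
            val current = merged[key]
            if (current == null || candidate.weight > current.first) {
                merged[key] = candidate.weight to value
            }
        }
    }
    return merged.mapValues { it.value.second }
}
```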
- Quickstart Guide
- Finer Domain Model: Sentences, Tokens, etc.
- Participants, Capabilities and Requirements
- check that it is running with a local cluster
- Keep a representation off Kafka (simple file handling)
- make it available on Github
- Basic result reporting
- acceptable test coverage
- very basic GUI
- simple pipeline with in-memory map
- simple regex-based keyword extraction
- make it available online somewhere
- Ingestion runs on a schedule
- Large file strategy: many metadata producers take too long, e.g. StanfordNLP => tokenize this
- Intelligent reprocessing of Metadata
- better GUI
- coroutines for MetadataProducers and where it makes sense
- Update to Kafka 2.x, check its testing infrastructure
- Multiple IntelligencePipeline instances (IPs) in parallel
- An IP can be rescheduled
- uberjar or something similar
- Possibility for external process to attach via batch
- more Ingestors
- Ingestors based on Kafka Connect
- One output client (DBMS or graph DB) based on Kafka Connect
- (optional) REST API
Have different frameworks compete and provide means to select/evaluate the best option
# Design Considerations
How can the inner structure (i.e. sections, paragraphs, sentences, words, tokens) be designed in a streaming environment?
==> at the moment: Chunks
==> do not keep the whole text within Kafka as one block, but use messages to describe the structure (e.g. "ADD SENTENCE")
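One possible shape for such structure-describing messages, sketched as a Kotlin sealed class; the names and fields are illustrative and not the project's actual domain model.

```kotlin
// Illustrative only: one possible way to describe document structure as a
// stream of small messages instead of storing the whole text as one block.
sealed class StructureEvent {
    abstract val documentId: String
}

data class AddChunk(
    override val documentId: String,
    val chunkId: String,
    val parentId: String?,        // e.g. the section or paragraph this chunk belongs to
    val text: String
) : StructureEvent()

data class AddSentence(
    override val documentId: String,
    val chunkId: String,
    val start: Int,               // offsets into the chunk text
    val end: Int
) : StructureEvent()

data class RemoveChunk(
    override val documentId: String,
    val chunkId: String
) : StructureEvent()
```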