Commit 56408ac7 authored by Arne Øslebø

Merge branch 'master' of gitlab.geant.org:gn4-3-wp8-t3.1-soc/soctools

parents ed318249 e3028708
README.md:
SOCTools aims to be easy to install, with all components fully integrated.
* [Architecture](doc/architecture.md)
* [Installation](doc/install.md)
* [Data ingestion](doc/dataingestion.md)
* [Example use case](doc/usecase.md)
## License
doc/architecture.md:
# Architecture
SOCTools is a collection of tools for collecting, enriching and analyzing logs and other security data, for sharing threat information, and for incident handling. Many SOCs will already have some tools in place that they want to continue to use. A key feature of SOCTools is therefore its flexible architecture, which makes it simple to integrate existing tools even if they are not directly supported by SOCTools. It is also easy to select which components of SOCTools to install.
## High level architecture
<img src="images/high_level_arch.png" width=640>
The high level architecture is shown in the figure above and consists of the following components:
* Data sources - the platform supports data from many common sources such as system logs, application logs and IDS alerts, and it is simple to add support for other sources. The main method for sending data into SOCTools is through Filebeat.
* High volume data sources - while the main platform is able to scale to high traffic volumes, it will in some cases be more convenient to have a separate setup for very high volume data like NetFlow. Some NRENs might also have an existing setup for this kind of data that they do not want to change. Data sources like this have their own storage system. If real-time processing is done on the data, alerts from it can be shipped to other components in the architecture.
* Data transport - [Apache NiFi](https://nifi.apache.org/), the key component, collects data from the data sources, normalizes it, does simple data enrichment and then ships it to one or more of the other components in the architecture.
* Storage - in the current version all storage is done in [Elasticsearch](https://opendistro.github.io/for-elasticsearch/), but it is easy to change the data transport so that data is sent to other log analysis tools like Splunk or Humio (a small storage sketch follows this list).
* Manual analysis - in the current version [Kibana](https://opendistro.github.io/for-elasticsearch/) is used for manual analysis of collected data.
* Enrichment - this component enriches the collected data either before or after storage. In the current version this is done as part of the data transport component before data is sent to storage.
* Threat analysis - collects and analyzes threat intelligence data, and is a typical source of enrichment data. The current version uses [MISP](https://www.misp-project.org/).
* Automatic analysis - automatic real-time analysis of collected data, to be added in later versions of SOCTools. It can range from simple scripts checking thresholds to advanced machine learning algorithms.
* Incident response - [TheHive and Cortex](https://thehive-project.org/) are used for this, and new cases can be created automatically from manual analysis in Kibana.
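As a rough illustration of what the storage component receives, the sketch below writes a single normalized event to Elasticsearch with the Python client (8.x API). This is only a sketch: SOCTools itself ships data through NiFi, and the host, index name and event fields here are assumptions.
```
# Illustration only: SOCTools sends data to Elasticsearch via NiFi, not via
# this client. Host, index name and event fields are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://soctools.example.com:9200")

event = {
    "timestamp": "2021-02-05T10:05:09.000Z",
    "client": {"ip": "172.22.0.1"},
    "verb": "GET",
}

# Index the normalized event; searching and dashboards then happen in Kibana.
es.index(index="logs-custom-unknown", document=event)
```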
### Authentication
SOCTools uses [Keycloak](https://www.keycloak.org/) to provide single sign on to all web interfaces of the various components.
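The components talk to Keycloak over standard OpenID Connect. As a hypothetical illustration of that protocol (not a SOCTools API), the sketch below fetches a token from Keycloak's token endpoint; the realm, client and credentials are invented for the example.
```
# Hypothetical example of Keycloak's OIDC token endpoint; the realm name,
# client_id and credentials are placeholders, not SOCTools defaults.
import requests

resp = requests.post(
    "https://soctools.example.com/auth/realms/example/protocol/openid-connect/token",
    data={
        "grant_type": "password",
        "client_id": "example-client",
        "username": "analyst",
        "password": "changeme",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]
```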
## NiFi pipeline
The main job of NiFi is to collect data from various sources, enrich it and send it to storage, which is currently Elasticsearch. The NiFi pipeline is organized into two main process groups, "Data processing" and "Enrichment data".
### Enrichment data
This process group is essentially a collection of "cron jobs" that run regularly to update the various enrichment data used by "Data processing" to enrich collected data (a sketch of such a fetch job follows the list). The current version supports the following enrichment data:
* Umbrella top 1 million domains - http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
* Alexa top 1 million - http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
* Tor exit nodes - https://check.torproject.org/torbulkexitlist
* MaxMind GeoLite2-City database - Requires a free account. https://dev.maxmind.com/geoip/geoip2/geolite2/
* MISP - NiFi automatically downloads new IOCs from the MISP instance that is part of SOCTools. IP addresses and host names are then enriched to show whether they are registered in MISP.
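As a sketch of what one of these "cron jobs" amounts to, the Python below downloads the Tor exit node list and builds a lookup set. In SOCTools this is done by scheduled NiFi processors; the code is only illustrative.
```
# Illustrative version of an enrichment-data fetch job; in SOCTools this is
# handled by scheduled NiFi processors, not Python.
import requests

TOR_EXIT_LIST = "https://check.torproject.org/torbulkexitlist"

# The list contains one exit-node IP address per line.
exit_nodes = set(requests.get(TOR_EXIT_LIST, timeout=30).text.split())

# During enrichment, a membership test tags an address as a Tor exit node.
print("192.0.2.1" in exit_nodes)
```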
### Data processing
The processing group is divided into three parts:
* Data input - receives data, normalizes it and converts it to JSON. This part also adds attributes to the data that specify which field names to enrich.
* Enrichment - enriches the data; it currently supports enriching IP addresses, domain names and fully qualified domain names (FQDNs). A minimal sketch of this step follows below.
* Data output - sends data to storage. In a future version, data will also be sent to other tools doing real-time stream processing of the data.
Each part contains a process group called "Custom ..." where it is possible to add new processors to the pipeline that will not be overwritten when upgrading to newer versions of SOCTools.
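To make the enrichment step concrete, here is a minimal sketch of what it does to a record: look the field value up in the enrichment data and attach the result. The added field names are assumptions; the real pipeline is built from NiFi processors.
```
# Minimal sketch of the enrichment step; output field names are assumptions.
def enrich_ip(record, ip, exit_nodes):
    # Attach lookup results alongside the original data.
    record["enrichment"] = {
        "ip": ip,
        "tor_exit_node": ip in exit_nodes,
    }
    return record

event = {"client": {"ip": "172.22.0.1"}, "verb": "GET"}
print(enrich_ip(event, event["client"]["ip"], exit_nodes={"192.0.2.1"}))
```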
doc/dataingestion.md:
# Data ingestion
SOCTools monitors itself, which means that there is already support for receiving and parsing logs from the following components:
* MISP
* HAProxy
* Kibana
* Keycloak
* MySQL
* ZooKeeper
* NiFi
* Elasticsearch
In addition there is also support for:
* Suricata EVE logs
* Zeek logs
Additional logs can be sent to the SOCTools server on port 6000 using Filebeat. The typical configuration is:
```
filebeat.inputs:
- type: log
  paths:
    - /opt/nifi/nifi-current/logs/nifi-app.log
  # The extra field tells NiFi which parser to route the data to.
  fields:
    log_type: nifi

output.logstash:
  # SOCTools listens for Filebeat traffic on port 6000.
  hosts: ["soctools.example.com:6000"]
  workers: 3
  loadbalance: true
```
The extra field log_type tells NiFi how to route the data to the correct parser. The following values are currently supported (a conceptual routing sketch follows the list):
* elasticsearch
* haproxy
* keycloak
* kibana
* misp
* mysql
* nifi
* suricata
* zeek
* zookeeper
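Conceptually, the routing step is a simple dispatch on log_type, as in the sketch below; in SOCTools this decision is made inside NiFi, and unknown values fall through to the custom/unknown path described in the next section.
```
# Conceptual sketch of log_type routing; NiFi makes this decision internally.
SUPPORTED = {
    "elasticsearch", "haproxy", "keycloak", "kibana", "misp",
    "mysql", "nifi", "suricata", "zeek", "zookeeper",
}

def route(event):
    log_type = event.get("fields", {}).get("log_type", "")
    # Unsupported types end up in the logs-custom-unknown index.
    return log_type if log_type in SUPPORTED else "unknown"

print(route({"fields": {"log_type": "nifi"}}))  # -> nifi
```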
Support for shipping logs over TLS will be added in a future version of SOCTools.
## New log types
Log types that are not yet supported can be sent to SOCTools on port 6006 using Filebeat, with a configuration similar to the one above. By default, new data types are sent to the index logs-custom-unknown. Proper parsing of new log types can be added to the process group "Custom data inputs".
To specify fields that should be enriched, the following attributes can be added to the flow records:
* enrich_ip1 and enrich_ip2
* enrich_domain1 and enrich_domain2
* enrich_fqdn1 and enrich_fqdn2
Each attribute should be set to the [NiFi RecordPath](https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html) of the field to be enriched.
### Enrichment example
Assume you have the following log data:
```
{
  "timestamp" : "2021-02-05T10:05:09.000Z",
  "client" : {
    "ip" : "172.22.0.1"
  },
  "verb" : "GET"
}
```
You want to enrich the client IP, so you set the attribute enrich_ip1 to the value "/client/ip". To see more examples and to see how logs are parsed, take a look at the process group "Data processing"->"Data input"->"SOCTools" in the NiFi GUI.
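RecordPath evaluation happens inside NiFi, but the following sketch shows what a simple path like /client/ip selects in the record above (real RecordPath supports much more, such as predicates and functions).
```
# Toy resolver for simple RecordPath-like paths such as "/client/ip";
# NiFi's real RecordPath language is far more expressive.
def resolve(record, path):
    value = record
    for part in path.strip("/").split("/"):
        value = value[part]
    return value

record = {
    "timestamp": "2021-02-05T10:05:09.000Z",
    "client": {"ip": "172.22.0.1"},
    "verb": "GET",
}

print(resolve(record, "/client/ip"))  # -> 172.22.0.1
```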