Observability #
Background #
Observing the running state of a cluster in order to understand the status of a distributed system is a new challenge. Logging in to individual servers one at a time does not scale to a large number of distributed servers. In such cases, the recommended operation and maintenance approach is telemetry based on observable data. Tracing, metrics and logging are the main ways to obtain observable data about system status.
APM (application performance monitoring) monitors and diagnoses the performance of the system by collecting, storing and analyzing the observable data of the system. Its main functions include performance metric monitoring, call stack analysis, service topology, etc.
DBPlusEngine is not responsible for gathering, storing or displaying APM data; it only provides the necessary information to the APM system. In other words, DBPlusEngine generates valuable data and submits it to the relevant systems through standard protocols or plug-ins. Tracing obtains tracking information about SQL parsing and SQL execution. DBPlusEngine supports SkyWalking, Zipkin, Jaeger and OpenTelemetry by default, and users can also develop customized components through plug-ins.
- Use Zipkin or Jaeger
Provide the correct Zipkin or Jaeger server information in the agent configuration file.
- Use OpenTelemetry
OpenTelemetry was formed by merging OpenTracing and OpenCensus in 2019. To use it, fill in the appropriate configuration in the agent configuration file according to the OpenTelemetry SDK Autoconfigure Guide.
- Use SkyWalking
Enable the SkyWalking plugin configuration file and configure the SkyWalking apm-toolkit.
- Use SkyWalking’s automatic monitor probe
In cooperation with the Apache SkyWalking team, the DBPlusEngine team has created the ShardingSphere automatic monitor probe, which automatically sends performance data to SkyWalking. Note that the automatic probe cannot be used together with the DBPlusEngine plugin probe.
Metrics are used to collect and display statistical indicators of the cluster. DBPlusEngine supports Prometheus by default.
Challenges #
Tracing and metrics need to collect system information through event tracking. A large amount of event tracking code makes the kernel code messy, hard to maintain, and hard to customize or extend.
Goal #
The goal of the DBPlusEngine observability module is to provide as many performance and statistical indicators as possible while keeping the kernel code isolated from the event tracking code.
Core Concept #
Agent #
The agent is based on bytecode enhancement and a plugin design, and provides tracing, metrics and logging features. Enable a plugin in the agent to collect data and send it to the integrated third-party APM system.
APM #
APM is the abbreviation for application performance monitoring. It is used for performance diagnosis of distributed systems, including call chain display, service topology analysis and so on.
Tracing #
The agent collects tracing data between distributed services or internal processes and then sends it to the APM system.
Metrics #
Metrics are system statistical indicators collected by the agent. They are written to a time series database periodically, and a third-party UI can then display the metrics data.
Usage Norms #
Compile source code #
Download the DBPlusEngine source code from GitHub, then compile it.
git clone --depth 1 https://github.com/apache/shardingsphere.git
cd shardingsphere
mvn clean install -Dmaven.javadoc.skip=true -Dcheckstyle.skip=true -Drat.skip=true -Djacoco.skip=true -DskipITs -DskipTests -Prelease
Output directory: shardingsphere-agent/shardingsphere-agent-distribution/target/apache-shardingsphere-${latest.release.version}-shardingsphere-agent-bin.tar.gz
Agent configuration #
Directory structure
Create an agent directory, and extract the agent distribution package into it.
mkdir agent
tar -zxvf apache-shardingsphere-${latest.release.version}-shardingsphere-agent-bin.tar.gz -C agent
cd agent
tree
.
├── conf
│ ├── agent.yaml
│ └── logback.xml
├── plugins
│ ├── shardingsphere-agent-logging-base-${latest.release.version}.jar
│ ├── shardingsphere-agent-metrics-prometheus-${latest.release.version}.jar
│ ├── shardingsphere-agent-tracing-jaeger-${latest.release.version}.jar
│ ├── shardingsphere-agent-tracing-opentelemetry-${latest.release.version}.jar
│ ├── shardingsphere-agent-tracing-opentracing-${latest.release.version}.jar
│ └── shardingsphere-agent-tracing-zipkin-${latest.release.version}.jar
└── shardingsphere-agent.jar
- Configuration file
agent.yaml is the agent configuration file. The available plug-ins are Jaeger, OpenTracing, Zipkin, OpenTelemetry, Logging and Prometheus. To start a plug-in, remove its name from ignoredPluginNames.
applicationName: shardingsphere-agent
ignoredPluginNames:
  - Jaeger
  - OpenTracing
  - Zipkin
  - OpenTelemetry
  - Logging
  - Prometheus
plugins:
  Prometheus:
    host: "localhost"
    port: 9090
    props:
      JVM_INFORMATION_COLLECTOR_ENABLED: "true"
  Jaeger:
    host: "localhost"
    port: 5775
    props:
      SERVICE_NAME: "shardingsphere-agent"
      JAEGER_SAMPLER_TYPE: "const"
      JAEGER_SAMPLER_PARAM: "1"
  Zipkin:
    host: "localhost"
    port: 9411
    props:
      SERVICE_NAME: "shardingsphere-agent"
      URL_VERSION: "/api/v2/spans"
      SAMPLER_TYPE: "const"
      SAMPLER_PARAM: "1"
  OpenTracing:
    props:
      OPENTRACING_TRACER_CLASS_NAME: "org.apache.skywalking.apm.toolkit.opentracing.SkywalkingTracer"
  OpenTelemetry:
    props:
      otel.resource.attributes: "service.name=shardingsphere-agent"
      otel.traces.exporter: "zipkin"
  Logging:
    props:
      LEVEL: "INFO"
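The ignoredPluginNames rule can be modeled in a few lines. The following Python sketch is only an illustration of the rule as described above, not the agent's actual implementation; the function name `enabled_plugins` and the assumption that names are matched exactly (case-sensitively) are hypothetical.

```python
# Hypothetical model of the ignoredPluginNames rule; not the agent's real code.
# The plug-in names are the ones listed in the default agent.yaml.
ALL_PLUGINS = ["Jaeger", "OpenTracing", "Zipkin", "OpenTelemetry", "Logging", "Prometheus"]


def enabled_plugins(ignored_plugin_names):
    """Return the plug-ins that would start: every plug-in not listed as ignored."""
    ignored = set(ignored_plugin_names)  # assumption: exact, case-sensitive match
    return [name for name in ALL_PLUGINS if name not in ignored]


if __name__ == "__main__":
    # Prometheus has been removed from the ignored list, so only it starts.
    print(enabled_plugins(["Jaeger", "OpenTracing", "Zipkin", "OpenTelemetry", "Logging"]))
```

With the default configuration, where all six names are listed, this model returns an empty list, which matches the statement that no plug-in starts until its name is removed.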
- Parameter description:
Name | Description | Value range | Default value |
---|---|---|---|
JVM_INFORMATION_COLLECTOR_ENABLED | Start JVM collector | true, false | true |
SERVICE_NAME | Tracking service name | Custom | shardingsphere-agent |
JAEGER_SAMPLER_TYPE | Jaeger sample rate type | const, probabilistic, ratelimiting, remote | const |
JAEGER_SAMPLER_PARAM | Jaeger sample rate parameter | const: 0 or 1; probabilistic: 0.0 - 1.0; ratelimiting: > 0 (number of samples per second); remote: requires the remote service address in JAEGER_SAMPLER_MANAGER_HOST_PORT | 1 (const type) |
SAMPLER_TYPE | Zipkin sample rate type | const, counting, ratelimiting, boundary | const |
SAMPLER_PARAM | Zipkin sample rate parameter | const: 0 or 1; counting: 0.01 - 1.0; ratelimiting: > 0; boundary: 0.0001 - 1.0 | 1 (const type) |
otel.resource.attributes | OpenTelemetry properties | String key-value pairs (comma-separated) | service.name=shardingsphere-agent |
otel.traces.exporter | Tracing exporter | zipkin, jaeger | zipkin |
otel.traces.sampler | OpenTelemetry sample rate type | always_on, always_off, traceidratio | always_on |
otel.traces.sampler.arg | OpenTelemetry sample rate parameter | traceidratio: 0.0 - 1.0 | 1.0 |
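Before starting the agent, it can help to check a sampler setting against the value ranges in the table. The Python sketch below is a hypothetical pre-flight check for the Jaeger rows only; the function name `check_jaeger_sampler` is an assumption, and the remote type is deliberately not covered because its parameter is an address rather than a number.

```python
def check_jaeger_sampler(sampler_type, param):
    """Return True if param is inside the range the table lists for sampler_type.

    Hypothetical helper: it encodes only the documented ranges for
    JAEGER_SAMPLER_TYPE / JAEGER_SAMPLER_PARAM and is not part of the agent.
    """
    value = float(param)  # props values are written as strings in agent.yaml
    if sampler_type == "const":
        return value in (0.0, 1.0)      # const: 0 or 1
    if sampler_type == "probabilistic":
        return 0.0 <= value <= 1.0      # probabilistic: 0.0 - 1.0
    if sampler_type == "ratelimiting":
        return value > 0                # ratelimiting: > 0 samples per second
    raise ValueError("sampler type not covered by this sketch: " + sampler_type)


if __name__ == "__main__":
    print(check_jaeger_sampler("const", "1"))            # default configuration
    print(check_jaeger_sampler("probabilistic", "1.5"))  # outside 0.0 - 1.0
```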
Used in DBPlusEngine-Proxy #
- Startup script
Add the absolute path of shardingsphere-agent.jar to the start.sh startup script of ShardingSphere Proxy.
nohup java ${JAVA_OPTS} ${JAVA_MEM_OPTS} \
-javaagent:/xxxxx/agent/shardingsphere-agent.jar \
-classpath ${CLASS_PATH} ${MAIN_CLASS} >> ${STDOUT_FILE} 2>&1 &
- Launch plugin
bin/start.sh
After a normal startup, the plugin's startup log appears in the DBPlusEngine proxy log, and the collected data can be viewed at the configured address.
Using the Agent Log Collection Function in DBPlusEngine-Driver #
Slow-Query-Log #
Background
The Slow-Query-Log feature is used to log SQL statements that take longer than a certain amount of time to execute, allowing DBAs and developers to identify potentially problematic SQL statements, and is a great source of reference for database and SQL management.
Parameter explanation
- Slow-query-log: whether to enable slow query logging. The default value is true.
- Long-query-time: the slow query time threshold. Only SQL statements that take longer than this threshold to execute are written to the slow query log. The value is configured in milliseconds (ms) and the default is 5000.
Prerequisite
- The Agent uses SLF4J for log bridging output, so it requires the application to have an SLF4J dependency and the relevant log output configuration.
- The slow-query-logging feature is based on Agent technology and requires the application to be configured with javaagent to enable the Agent at startup.
Example configuration
- Agent Configuration
conf/agent.yaml is the Agent configuration file and the default configuration is as follows.
plugins:
  logging:
    BaseLogging:
      props:
        slow-query-log: true
        long-query-time: 5000
Where slow-query-log and long-query-time are configured with default values, indicating:
- turn on slow-query-logging;
- record slow-query-log when SQL execution takes more than 5000 milliseconds.
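The interaction of the two parameters can be expressed as a single condition. The Python sketch below is an illustration under the assumption that "takes more than" means strictly greater than the threshold; the function name `is_slow_query` is hypothetical and this is not the agent's code.

```python
def is_slow_query(elapsed_ms, slow_query_log=True, long_query_time=5000):
    """Decide whether a statement would be written to the slow query log.

    Hypothetical model of the documented rule. The defaults mirror agent.yaml:
    slow-query-log: true, long-query-time: 5000 (milliseconds).
    """
    # Assumption: "longer than the threshold" is a strict comparison.
    return slow_query_log and elapsed_ms > long_query_time


if __name__ == "__main__":
    print(is_slow_query(7200))                        # exceeds the default threshold
    print(is_slow_query(7200, slow_query_log=False))  # feature switched off
```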
- Application logging configuration
Take the commonly used logback as an example of logging output and configure it as follows.
<configuration>
    <property name="log.context.name" value="project-using-DBPlusEngine-Driver" />
    <property name="log.charset" value="UTF-8" />
    <property name="log.pattern" value="[%-5level] %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %logger{36} - %msg%n" />
    <contextName>${log.context.name}</contextName>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder charset="${log.charset}">
            <pattern>${log.pattern}</pattern>
        </encoder>
    </appender>
    <logger name="SLOW-QUERY" level="info" additivity="false">
        <appender-ref ref="STDOUT" />
    </logger>
    <root>
        <level value="INFO" />
        <appender-ref ref="STDOUT" />
    </root>
</configuration>
Description: The logger name of the slow-query-log output is SLOW-QUERY.
Slow-query-log format
db: {database name} query_time: {query time} sql_type: {sql type}
{sql}
- db: the name of the database;
- query_time: the time taken for SQL execution, unit in ms;
- sql_type: the type of SQL (SELECT, INSERT, UPDATE, DELETE, or OTHER for any other type);
- sql: the specific SQL statement to be executed;
Example:
[WARN ] 2023-01-04 14:55:04.035 [http-nio-8888-exec-7] SLOW-QUERY - db: sharding_db query_time: 21 sql_type: SELECT
SELECT id,user_id,uuid,status,create_time,update_time,is_deleted AS deleted FROM t_order
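Because each entry follows a fixed two-line layout, it can be parsed with a small regular expression. The Python sketch below is a hypothetical parser for entries shaped like the example above; the names `HEADER` and `parse_entry` are assumptions, and the pattern assumes the database name and SQL type contain no spaces.

```python
import re

# Matches the "db: ... query_time: ... sql_type: ..." part of the first line.
# Assumption: database names and SQL types contain no whitespace.
HEADER = re.compile(
    r"db: (?P<db>\S+) query_time: (?P<query_time>\d+) sql_type: (?P<sql_type>\S+)"
)


def parse_entry(header_line, sql_line):
    """Parse one two-line log entry into a dict, or return None if it does not match."""
    match = HEADER.search(header_line)
    if match is None:
        return None
    return {
        "db": match.group("db"),
        "query_time_ms": int(match.group("query_time")),  # documented unit is ms
        "sql_type": match.group("sql_type"),
        "sql": sql_line.strip(),
    }


if __name__ == "__main__":
    entry = parse_entry(
        "[WARN ] 2023-01-04 14:55:04.035 [http-nio-8888-exec-7] SLOW-QUERY - "
        "db: sharding_db query_time: 21 sql_type: SELECT",
        "SELECT id,user_id,uuid,status,create_time,update_time,is_deleted AS deleted FROM t_order",
    )
    print(entry["db"], entry["query_time_ms"], entry["sql_type"])
```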
General-Query-Log #
Background
When the general-query-log feature is enabled, the system records every executed SQL statement together with information such as the database the statement ran against, the execution time and the SQL type, which makes audit operations easier for enterprises.
Parameter explanation
- General-query-log: General-query-log has only one parameter, with a value of true enabling full logging and a value of false disabling the feature.
Prerequisite
- The Agent uses SLF4J for log bridging output, so it requires the application to have an SLF4J dependency and the relevant log output configuration.
- General-query-log functionality is based on Agent technology and requires the application to be configured with javaagent to enable the Agent at startup.
Example configuration
- Agent Configuration
conf/agent.yaml is the Agent configuration file and the default configuration is as follows.
plugins:
  logging:
    BaseLogging:
      props:
        general-query-log: true
Description: general-query-log is true to enable general query logging, false to disable general query logging. The value is configured at startup.
- Application logging configuration
Take the commonly used logback as an example of logging output and configure it as follows.
<configuration>
    <property name="log.context.name" value="project-using-DBPlusEngine-Driver" />
    <property name="log.charset" value="UTF-8" />
    <property name="log.pattern" value="[%-5level] %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %logger{36} - %msg%n" />
    <contextName>${log.context.name}</contextName>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder charset="${log.charset}">
            <pattern>${log.pattern}</pattern>
        </encoder>
    </appender>
    <logger name="GENERAL-QUERY" level="info" additivity="false">
        <appender-ref ref="STDOUT" />
    </logger>
    <root>
        <level value="INFO" />
        <appender-ref ref="STDOUT" />
    </root>
</configuration>
Description: The logger name of the general-query-log output is GENERAL-QUERY.
General-query-log format
db: {database name} query_time: {query time} sql_type: {sql type}
{sql}
- db: database name;
- query_time: the time taken for SQL execution, unit in ms;
- sql_type: the type of SQL (SELECT, INSERT, UPDATE, DELETE, or OTHER for any other type);
- sql: the specific SQL statement to be executed;
Example:
[INFO ] 2023-01-04 14:55:04.035 [http-nio-8888-exec-7] GENERAL-QUERY - db: sharding_db query_time: 21 sql_type: SELECT
SELECT id,user_id,uuid,status,create_time,update_time,is_deleted AS deleted FROM t_order
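For audit purposes, the general query log can be summarized with a short script. The Python sketch below counts statements per SQL type from lines shaped like the example above; the function name `count_by_sql_type` is hypothetical, and the pattern assumes the default log layout shown in this section.

```python
import re
from collections import Counter

# Matches only the first line of each entry, identified by the GENERAL-QUERY logger name.
# Assumption: the log pattern from the logback example in this section is used.
ENTRY = re.compile(r"GENERAL-QUERY - db: \S+ query_time: \d+ sql_type: (\S+)")


def count_by_sql_type(lines):
    """Count general-query-log entries per sql_type; SQL text lines are skipped."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[INFO ] 2023-01-04 14:55:04.035 [http-nio-8888-exec-7] GENERAL-QUERY - "
        "db: sharding_db query_time: 21 sql_type: SELECT",
        "SELECT id,user_id,uuid,status,create_time,update_time,is_deleted AS deleted FROM t_order",
    ]
    print(dict(count_by_sql_type(sample)))
```

The same loop could group by database name instead by capturing the db field, should the audit need a per-database view.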