vtune

April 19, 2024 / March 6, 2024 by icyhearts


collect

# /opt/intel/oneapi/vtune/latest/bin64/vtune -help collect
Intel(R) VTune(TM) Profiler Command Line Tool
Copyright (C) 2009 Intel Corporation. All rights reserved.

-c, -collect=<string>         Choose an analysis type.

 Perform a data collection of the specified analysis type.
 See the list of available analysis types below.

Action Options:

-allow-multiple-runs | -no-allow-multiple-runs (default)
                              Enable multiple runs to achieve more precise
                              results for hardware event-based collections.
                              When disabled, the collector multiplexes events
                              running a single collection, which lowers result
                              precision.
-analyze-kvm-guest | -no-analyze-kvm-guest (default)
                              Enable to analyze KVM guest OS running on the
                              system. This option is applicable to hardware
                              event-based analysis types only.
-analyze-system | -no-analyze-system (default)
                              Enable to analyze all processes running on the
                              system. When disabled, only the attached process
                              and its children are analyzed. This option is
                              applicable to hardware event-based analysis types
                              only.
-app-working-dir=<string>     Specify a directory where the application will be
                              run.
-auto-finalize | -no-auto-finalize
                              The option is deprecated. Please use
                              -finalization-mode=none instead. Turn on/off
                              automatic result finalization after data
                              collection/import. --no-auto-finalize option also
                              turns off the summary report (--no-summary).
-call-stack-mode=user-only | user-plus-one | all
                              Choose how to show system functions in the stack.
-cpu-mask=<string>            Specify CPU(s) to collect data on (for example:
                              2-8,10,12-14). This option is applicable to
                              hardware event-based analysis types only.
-custom-collector=<string>    Provide a command line for launching an external
                              collection tool. You can later import custom
                              collection data (time intervals and counters) in
                              a CSV format to the VTune Profiler result.
-data-limit=<integer> (1000)  Limit the amount of raw data to be collected by
                              setting the maximum possible result size (in MB).
                              VTune Profiler starts collecting data from the
                              beginning of the target execution and ends when
                              the limit for the result size is reached. For
                              unlimited data size, specify 0.
-discard-raw-data | -no-discard-raw-data (default)
                              Discard raw collector data from the result upon
                              finalization.
-d, -duration=<string>        Specify a duration for collection (in seconds).
                              Required for system-wide collection. Can also be
                              'unlimited'.
-finalization-mode=full | fast | deferred | none
                              Define finalization mode: full - perform full
                              finalization; fast (default) - reduce loaded
                              sample count to speed up post-processing;
                              deferred - calculate only binary checksum for
                              finalization on another machine; none - skip
                              finalization.
-finalization-mode=full | fast | deferred | none (fast)
                              Finalization may take significant system
                              resources. For a powerful target system, select
                              full mode to apply immediately after collection.
                              Otherwise, shorten finalization or defer it to
                              run on another system (compute checksums only).
-follow-child (default) | -no-follow-child
                              Collect data on processes launched by the target
                              process (recommended for applications launched by
                              a script).
-inline-mode=on | off         Choose to show or hide inline functions in the
                              stack.
-k, -knob=<string>            Set knob value for selected analysis type as
                              -knob knobName=knobValue. For a list of knobs
                              available for an analysis, enter: -help collect
                              <analysis_type>.
-kvm-guest-kallsyms=<string>  Specify a local path to the /proc/kallsyms file
                              copied from the guest OS for proper symbol
                              resolution.
-kvm-guest-modules=<string>   Specify a local path to the /proc/modules file
                              copied from the guest OS for proper symbol
                              resolution.
-loop-mode=loop-only | loop-and-function | function-only
                              Choose to show or hide loops in the stack.
-mrte-mode=auto | native | mixed | managed (auto)
                              Select a profiling mode. The Native mode does not
                              attribute data to managed source. The Mixed mode
                              attributes data to managed source where
                              appropriate. The Managed mode tries to limit
                              attribution to managed source when available.
-r, -result-dir=<string> (r@@@{at})
                              Specify result directory path. The default name
                              for a result directory is r@@@{at}, where @@@ is
                              the incremented number of the result, and {at} is
                              a two- or three-letter abbreviation for the
                              analysis type.
-resume-after=<double>        Specify time (in seconds, with fractions allowed)
                              to delay data collection after the application
                              starts. For example, 1.56 is 1 sec 560 msec.
-return-app-exitcode | -no-return-app-exitcode (default)
                              Return the target exit code instead of the
                              command line interface exit code.
-ring-buffer=<double> (0)     Limit the amount of raw data to be collected by
                              setting the timer enabling the analysis only for
                              the last seconds before the target or collection
                              is terminated. For unlimited data size, specify
                              0.
-search-dir=<string>          Specify search directories for binary and symbol
                              files. When the files are in multiple
                              directories, use the search-dir option multiple
                              times so that all the necessary directories are
                              searched.
-source-search-dir=<string>   Specify search directories for source files. When
                              your source files are in multiple directories,
                              use the source-search-dir option multiple times
                              so that all the necessary directories are
                              searched.
-start-paused                 Start data collection paused.
-strategy=<string>            Specify details for parent and child processes
                              analysis.
                              Format:<process_name1>:<profiling_mode>,<process_
                              name2>:<profiling_mode>,... Available profiling
                              mode values are: trace:trace, trace:notrace,
                              notrace:notrace, notrace:trace. This option is
                              not applicable to hardware event-based analysis
                              types.
-summary (default) | -no-summary
                              Turn on/off showing the summary report after data
                              collection/import.
-target-duration-type=veryshort | short | medium | long (short)
                              Estimate the application duration time. This
                              value affects the size of collected data. For
                              long running targets, sampling interval is
                              increased to reduce the result size. For hardware
                              event-based analysis types, the duration estimate
                              affects a multiplier applied to the configured
                              Sample after value.
-target-install-dir=<string>  Specify a path to VTune Profiler on the remote
                              system. If the default location is used, this
                              path is automatically supplied.
-target-pid=<unsigned integer>
                              Attach collection to a running process specified
                              by process ID.
-target-ports=<string>        Specify a network port used by the target
                              collector on the remote system.
-target-process=<string>      Attach collection to a running process specified
                              by process name.
-target-system=<string>       Define target system for remote collection.
                              Supported <string> values:
                              android - for Android systems.
                              ssh:user@target - for Linux systems, where <user>
                              is a user name and <target> is a network name of
                              the remote system accessed via SSH (usually IP
                              address).
-target-tmp-dir=<string>      Specify a directory on the remote system where
                              performance results are temporarily stored. By
                              default, /tmp directory is used.
-trace-mpi | -no-trace-mpi (default)
                              Configure collectors to trace MPI code, and
                              determine MPI rank IDs in case of a non-Intel MPI
                              library implementation.

Global Options:

-q, -quiet                    Suppress non-essential messages
-user-data-dir=<string>       Specify the base directory for result paths
                              provided by --result-dir option. By default, the
                              current working directory is used.
-v, -verbose                  Print additional information

Examples:

 1) Perform the hotspots collection on the given target.

    vtune -collect hotspots a.out

 The default naming template for result directories is r@@@{at}, where:
 @@@ is an increasing numeric sequence automatically assigned by vtune;
 {at} is an abbreviation of the analysis type.

 2) Collect the results into the 'r001tr' result directory.

    vtune -collect threading -r r001tr a.out

 Use '-help collect <analysis type>' for more information about each analysis type.

Available Analysis Types:

 Want to characterize and identify relevant analysis types for your workload?

   performance-snapshot
      Get a quick snapshot of your application performance and identify next
      steps for deeper analysis.

 Want to find out where your application spends time and optimize your algorithms?

   hotspots
      Identify the most time consuming functions and lines of source code.

   anomaly-detection
      Preview feature - should we keep it, change it, or drop it? Send us your
      comments: mailto:parallel.studio.support@intel.com?subject=VTune
      Profiler: Anomaly Detection - preview feedback. Identify performance
      anomalies by profiling critical code at the microsecond level. Anomaly
      Detection uses Intel Processor Trace technology for fine-grained
      analysis.

   memory-consumption
      Analyze memory consumption by your application, its distinct memory
      objects and their allocation stacks.

 Want to see how efficiently your code is using the underlying hardware?

   uarch-exploration
      Analyze CPU microarchitecture bottlenecks affecting the performance of
      your application.

   memory-access
      Measure a set of metrics to identify memory access related issues.

 Want to assess the compute efficiency of your multi-threaded application?

   threading
      Discover how well your application is using parallelism to take advantage
      of all available CPU cores.

   hpc-performance
      Analyze performance aspects of compute-intensive applications, including
      CPU and GPU utilization. Get information on OpenMP efficiency, memory
      access, and vectorization.

 Want to see how efficiently your code is using I/O?

   io
      Analyze utilization of IO subsystems, CPU, and processor buses.

 Want to explore GPU/FPGA usage for your application?

   gpu-offload
      Explore code execution on various CPU and GPU cores on your platform,
      estimate how your code benefits from offloading to the GPU, and identify
      whether your application is CPU or GPU bound.

   gpu-hotspots
      Analyze the most time-consuming GPU kernels, characterize GPU utilization
      based on GPU hardware metrics, identify performance issues caused by
      memory latency or inefficient kernel algorithms, and analyze GPU
      instruction frequency per certain instruction types.

   fpga-interaction
      Analyze CPU/FPGA interaction issues through these ways: 1. Focus on the
      kernels running on the FPGA. 2. Identify the most time-consuming kernels.
      3. Look at the corresponding metrics on the device side (like Occupancy
      or Stalls). 4. Correlate with CPU and platform profiling data.

 Want to explore CPU, GPU and power usage for your application/system?

   system-overview
      Analyze general behavior of Linux or Android target system and correlate
      power and performance metrics with IRQ handling.

   graphics-rendering
      Preview feature. Analyze the CPU/GPU utilization of your code running on
      the Xen virtualization platform. Explore GPU utilization per GPU engine
      and GPU hardware metrics that help understand where performance
      improvements are possible.

   platform-profiler
      Platform Profiler collects coarse-grained, system-level metrics for
      extended profiling of minutes to hours. Software architects can identify
      workloads or phases of workloads that use hardware inefficiently and need
      tuning. Infrastructure architects can see if the current hardware
      configuration is a good match for most workloads.

 Want to profile applications using Intel Transactional Synchronization Extensions or run on systems with Intel Software Guard Extensions?

   tsx-exploration
      Analyze Intel Transactional Synchronization Extensions (Intel TSX) usage.

   tsx-hotspots
      Analyze hotspots inside transactions for systems with the Intel
      Transactional Synchronization Extensions (Intel TSX) feature enabled.

   sgx-hotspots
      Analyze hotspots inside security enclaves for systems with the Intel
      Software Guard Extensions (Intel SGX) feature enabled.
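The action options above can be combined on a single command line. Here is a minimal sketch. It assumes VTune is installed at the oneAPI path used in this post. The functions attach a collection to an already running process (-target-pid), stop it after a fixed number of seconds (-duration), and write the result to a chosen directory (-result-dir). The names run and vtune_attach and the DRY_RUN switch are hypothetical helpers, not part of VTune. With DRY_RUN=1 the command is printed instead of executed.

```shell
#!/bin/sh
# Assumed install path (the one used in this post); override with VTUNE=...
VTUNE=${VTUNE:-/opt/intel/oneapi/vtune/latest/bin64/vtune}

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: vtune_attach <analysis-type> <pid> <seconds> <result-dir>
# Attaches to a running process and stops collecting after <seconds>.
vtune_attach() {
    run "$VTUNE" -collect "$1" -target-pid "$2" -duration "$3" -result-dir "$4"
}
```

After sourcing these functions, "vtune_attach hotspots 1234 30 r_attach" would profile PID 1234 for 30 seconds. Prefix the call with DRY_RUN=1 to inspect the generated command first.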

memory-access

# /opt/intel/oneapi/vtune/latest/bin64/vtune -help collect memory-access
Intel(R) VTune(TM) Profiler Command Line Tool
Copyright (C) 2009 Intel Corporation. All rights reserved.

 Measure a set of metrics to identify memory access related issues (for
 example, specific for NUMA architectures). This analysis type is based on
 the hardware event-based sampling collection.

 To modify the analysis type, use the configuration options (knobs) as
 follows:
 -collect memory-access -knob <knobName>=<knobValue>
 Multiple -knob options are allowed and can be followed by additional collect
 action options, as well as global options, if needed.

sampling-interval

  Specify an interval (in milliseconds) between CPU samples.

  Default value: 5
  Possible values: numbers between 0.01 and 1000

analyze-mem-objects

  Enable the instrumentation of dynamic memory allocation/de-allocation and
  map hardware events to such memory objects. This option may cause
  additional runtime overhead due to the instrumentation of all system memory
  allocation/de-allocation API.

  Default value: false
  Possible values: true false

mem-object-size-min-thres

  Specify a minimal size of dynamic memory allocations to analyze. This
  option helps reduce runtime overhead of the instrumentation.

  Default value: 1024
  Possible values: numbers between -2147483648 and 2147483647

dram-bandwidth-limits

  Evaluate maximum achievable local DRAM bandwidth before the collection
  starts. This data is used to scale bandwidth metrics on the timeline and
  calculate thresholds.

  Default value: true
  Possible values: true false

analyze-openmp

  Instrument and analyze OpenMP regions to detect inefficiencies such as
  imbalance, lock contention, or overhead on performing scheduling, reduction
  and atomic operations.

  Default value: false
  Possible values: true false
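The memory-access knobs listed above are passed as repeated "-knob name=value" pairs. This is a minimal sketch for a launched target, where the program path and its arguments are placeholders. It turns on analyze-mem-objects and raises mem-object-size-min-thres so that only larger allocations are instrumented, which the help text says reduces runtime overhead. The help text does not state the threshold unit, so bytes is assumed here. The vtune_memory_objects function, the run helper, and DRY_RUN are hypothetical.

```shell
#!/bin/sh
# Assumed install path (the one used in this post); override with VTUNE=...
VTUNE=${VTUNE:-/opt/intel/oneapi/vtune/latest/bin64/vtune}

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: vtune_memory_objects <result-dir> <min-size> <app> [app-args...]
# Enables memory-object analysis and skips allocations smaller than <min-size>.
vtune_memory_objects() {
    result_dir=$1
    min_size=$2
    shift 2
    # "--" separates VTune options from the target command line.
    run "$VTUNE" -collect memory-access \
        -knob analyze-mem-objects=true \
        -knob mem-object-size-min-thres="$min_size" \
        -result-dir "$result_dir" -- "$@"
}
```

Raising the threshold trades coverage of small allocations for lower instrumentation overhead. The default listed above is 1024.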

hotspots

# /opt/intel/oneapi/vtune/latest/bin64/vtune -help collect hotspots
Intel(R) VTune(TM) Profiler Command Line Tool
Copyright (C) 2009 Intel Corporation. All rights reserved.

 Identify the most time consuming functions and drill down to see time spent
 on each line of source code. Focus optimization efforts on hot code for the
 greatest performance impact.

 To modify the analysis type, use the configuration options (knobs) as
 follows:
 -collect hotspots -knob <knobName>=<knobValue>
 Multiple -knob options are allowed and can be followed by additional collect
 action options, as well as global options, if needed.

sampling-mode

  User-Mode Sampling(sw) mode use for: profiles longer than a few seconds,
  profiling a single process or a process-tree, profiling Python and Intel
  runtimes. Hardware Event-Based Sampling(hw) mode use for: profiles shorter
  than a few seconds, profiling all processes on a system, including kernel.

  Default value: sw
  Possible values: sw hw

sampling-interval

  Specify an interval (in milliseconds) between CPU samples for the Hardware
  sampling mode. Sampling interval for the Software sampling mode is fixed
  (10ms).

  Default value: 5
  Possible values: numbers between 0.01 and 1000

enable-stack-collection

  Enable collection of call stacks.

  Default value: false
  Possible values: true false

stack-size

  Specify the size of a raw stack (in bytes) to process. Zero value in
  command line means unlimited size. You may set arbitrary stack size value
  in the custom analysis configuration.

  Default value: 1024
  Possible values: 0 1024 2048 4096

enable-characterization-insights

  Get additional performance insights such as the efficency of hardware usage
  and vectorization, and learn next steps. Note: this option collects CPU
  events in the counting mode.

  Default value: true
  Possible values: true false
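The hotspots knobs can switch from the default user-mode sampling (sw) to hardware event-based sampling (hw). According to the description above, hw suits profiles shorter than a few seconds and also covers the kernel. The sketch below sets sampling-interval explicitly, because the help text says it only applies in hw mode. The 1 ms value is an arbitrary illustrative choice. It also enables stack collection with stack-size=4096, one of the listed values. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Assumed install path (the one used in this post); override with VTUNE=...
VTUNE=${VTUNE:-/opt/intel/oneapi/vtune/latest/bin64/vtune}

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: vtune_hw_hotspots <result-dir> <app> [app-args...]
# Hardware sampling every 1 ms (only honored in hw mode), with call stacks.
vtune_hw_hotspots() {
    result_dir=$1
    shift
    run "$VTUNE" -collect hotspots \
        -knob sampling-mode=hw \
        -knob sampling-interval=1 \
        -knob enable-stack-collection=true \
        -knob stack-size=4096 \
        -result-dir "$result_dir" -- "$@"
}
```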

Quick commands

memory-access

  /opt/intel/oneapi/vtune/latest/bin64/vtune -collect memory-access -v -d 20 \
      -r /media/disk1/fordata/web_server/like12/package/vtune/results/gt441-memory-access-gameid-v2-numa-round5-debug \
      -knob sampling-interval=0.03 -data-limit=3072 \
      -call-stack-mode=all -inline-mode=on -finalization-mode=full
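The quick command above only collects data. You can then inspect the result directory given with -r from the command line using the -report action. Below is a minimal sketch that reuses the same hypothetical run/DRY_RUN helper. The vtune_report function prints the summary report by default, or another report type passed as the second argument. Which report types apply depends on the analysis type, and "vtune -help report" lists them.

```shell
#!/bin/sh
# Assumed install path (the one used in this post); override with VTUNE=...
VTUNE=${VTUNE:-/opt/intel/oneapi/vtune/latest/bin64/vtune}

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: vtune_report <result-dir> [report-type]
# Defaults to the summary report when no type is given.
vtune_report() {
    run "$VTUNE" -report "${2:-summary}" -result-dir "$1"
}
```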

