Update dataset README
zhujiem committed Aug 21, 2023
1 parent a469b6e commit fd2654e
Showing 18 changed files with 71 additions and 21 deletions.
5 changes: 5 additions & 0 deletions Android/README.md
@@ -1,6 +1,9 @@
## Android_v1
Android (https://www.android.com) is a popular open-source mobile operating system and has been used by many smart devices. However, Android logs are rarely available in public for research purposes. We provide some Android log files, which were collected by Android smartphones with heavily instrumented modules installed. The Android architecture comprises five levels, including the Linux Kernel, Libraries, Application Framework, Android Runtime, and System Applications. We provide a sample log file printed by the Application Framework.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
@@ -12,6 +15,8 @@ If you use this dataset from loghub in your research, please cite the following
## Android_v2
Android (https://www.android.com) is a popular open-source mobile operating system and has been used by many smart devices. However, Android logs are rarely available in public for research purposes. We provide some Android log files, which were collected by Android smartphones with heavily instrumented modules installed. The logs cover two types of issues, and each type has over 10 duplicate issue logs. However, due to the high complexity of Android's multi-threading system, it is difficult to pinpoint the abnormal log points.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
3 changes: 3 additions & 0 deletions Apache/README.md
@@ -3,6 +3,9 @@ Apache HTTP Server (https://httpd.apache.org) is one of the most popular web ser

For more detailed information, please visit the Public Security Log Sharing Site: http://log-sharing.dreamhosters.com.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
3 changes: 3 additions & 0 deletions BGL/README.md
@@ -3,6 +3,9 @@ BGL is an open dataset of logs collected from a BlueGene/L supercomputer system

For more detailed information, please visit the project page: https://www.usenix.org/cfdr-data#hpc4.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following paper.
+ Adam J. Oliner, Jon Stearley. [What Supercomputers Say: A Study of Five System Logs](http://ieeexplore.ieee.org/document/4273008/), in Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007.
9 changes: 9 additions & 0 deletions HDFS/README.md
@@ -10,6 +10,9 @@ We have preprocessed the dataset for easy use in research, including:
+ Event_occurrence_matrix.csv
+ HDFS.npz

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use the HDFS_v1 dataset from loghub in your research, please cite the following papers.
+ Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan. [Detecting Large-Scale System Problems by Mining Console Logs](https://people.eecs.berkeley.edu/~jordan/papers/xu-etal-sosp09.pdf), in Proc. of the 22nd ACM Symposium on Operating Systems Principles (SOSP), 2009.
@@ -24,6 +27,9 @@ HDFS (http://hadoop.apache.org/hdfs) is the Hadoop Distributed File System desig

The log set was collected by aggregating logs from the HDFS system in our lab at CUHK for research purpose, which comprises one name node and 32 data nodes. The logs are aggregated at the node level. However, three nodes have been repaired and unfortunately some logs are lost. The logs have a huge size (over 16GB) and are provided as-is without further modification or labelling, which may involve both normal and abnormal cases.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
@@ -44,6 +50,9 @@ The data are collected through instrumenting the HDFS system. We have converted

For more detailed information, please visit the dataset project: https://mtracer.github.io/TraceBench.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following paper.
+ Jingwen Zhou, Zhenbang Chen, Ji Wang, Zibin Zheng, and Michael R. Lyu. [TraceBench: An Open Data Set for Trace-oriented Monitoring](http://zbchen.github.io/Papers_files/cloudcom2014.pdf), in Proceedings of the 6th IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2014.
4 changes: 3 additions & 1 deletion HPC/README.md
@@ -2,9 +2,11 @@

HPC is an open dataset of logs collected from System 20 of the high performance computing cluster at the [Los Alamos National Laboratories](http://www.lanl.gov/). But the link (http://institutes.lanl.gov/data/fdata/) to the original data has been out of service. The log has been used for benchmarking log parsing methods in the following papers, where you may find more details about the usage of this dataset.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.

+ Adetokunbo Makanju, A. Nur Zincir-Heywood, Evangelos E. Milios. [Clustering Event Logs Using Iterative Partitioning](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.503.7668&rep=rep1&type=pdf), in Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
+ Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](http://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf), in Proc. of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
3 changes: 3 additions & 0 deletions Hadoop/README.md
@@ -12,6 +12,9 @@ Each application has been run for several times, simulating both normal and abno

We provide the labeled abnormal/normal job IDs in `abnormal_label.txt`.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following paper.
+ Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen. [Log Clustering Based Problem Identification for Online Service Systems](http://ieeexplore.ieee.org/document/7883294/), International Conference on Software Engineering (ICSE), 2016.
2 changes: 2 additions & 0 deletions HealthApp/README.md
@@ -2,6 +2,8 @@

HealthApp is a mobile application for Android devices. We collected the application logs from an Android smartphone after 10+ days of use.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
5 changes: 2 additions & 3 deletions LICENSE
@@ -1,4 +1,3 @@
The datasets are freely available for research or academic work, subject to
-the following conditions: Any usage or distribution of the loghub datasets
-should [cite the loghub paper](https://github.com/logpai/loghub/blob/master/CITATION)
-or refer to the repository https://github.com/logpai/loghub.
+the following condition: For any usage or distribution of the loghub datasets,
+please refer to the loghub repository https://github.com/logpai/loghub.
3 changes: 3 additions & 0 deletions Linux/README.md
@@ -2,6 +2,9 @@

Linux logs are usually located at `/var/log/`. The dataset was collected from `/var/log/messages` on a Linux server over a period of 260+ days, as part of the [Public Security Log Sharing Site](http://log-sharing.dreamhosters.com/) project.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
4 changes: 3 additions & 1 deletion Mac/README.md
@@ -1,7 +1,9 @@
## Mac

We collected the MacOS logs from `/var/log/system.log` on a Macbook after 7 days of use.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
3 changes: 3 additions & 0 deletions OpenSSH/README.md
@@ -2,6 +2,9 @@

OpenSSH is the premier connectivity tool for remote login with the SSH protocol. We collected the log from an OpenSSH server in our lab over a period of 28+ days.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
3 changes: 3 additions & 0 deletions OpenStack/README.md
@@ -4,6 +4,9 @@ OpenStack (https://www.openstack.org) is a cloud operating system that controls

For the usage of this dataset, please refer to an example: https://github.com/nailo2c/deeplog/blob/master/example/preprocess.py

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar. [DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning](https://acmccs.github.io/papers/p1285-duA.pdf), in Proc. of ACM Conference on Computer and Communications Security (CCS), 2017.
3 changes: 3 additions & 0 deletions Proxifier/README.md
@@ -2,6 +2,9 @@

Proxifier (https://www.proxifier.com) is a software program, allowing network applications that do not support working through proxy servers to operate through a SOCKS or HTTPS proxy and chains. We collected the Proxifier logs from a desktop computer in our lab.

### Download
The raw logs are available for downloading at https://github.com/logpai/loghub.

### Citation
If you use this dataset from loghub in your research, please cite the following papers.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
32 changes: 16 additions & 16 deletions README.md
@@ -5,30 +5,30 @@ Loghub maintains a collection of system logs, which are freely accessible for AI

**Logs currently available**:

-| Dataset | Description | Labeled | Time Span | #Lines | Unzipped Size | Contributed By |
+| Dataset | Description | Labeled | Time Span | #Lines | Raw Size | Contributed By |
| :---------------------------- | :--------| :--------: | --------: | ---------: | ------: | :------: |
|<tr><th colspan=7 align="center">:open_file_folder: **Distributed systems**</th></tr>|
-| [HDFS_v1](./HDFS#hdfs_v1) | Hadoop distributed file system log | :heavy_check_mark: | 38.7 hours | 11,175,629 | 1.47GB | [link](https://www.sigops.org/sosp/sosp09/papers/xu-sosp09.pdf) |
+| [HDFS_v1](./HDFS#hdfs_v1) | Hadoop distributed file system log | :heavy_check_mark: | 38.7 hours | 11,175,629 | 1.47GB | [Link](https://www.sigops.org/sosp/sosp09/papers/xu-sosp09.pdf) |
| [HDFS_v2](./HDFS#hdfs_v2) | Hadoop distributed file system log| | N.A. | 71,118,073 | 16.06GB | |
-| [HDFS_v3](./HDFS#hdfs_v3_tracebench) | Instrumented HDFS trace log (TraceBench) | :heavy_check_mark: | N.A. | 14,778,079 | 2.96GB | [link](http://zbchen.github.io/Papers_files/cloudcom2014.pdf) |
-| [Hadoop](./Hadoop) | Hadoop mapreduce job log | :heavy_check_mark: | N.A. | 394,308 | 48.61MB | [link](http://ieeexplore.ieee.org/document/7883294/) |
+| [HDFS_v3](./HDFS#hdfs_v3_tracebench) | Instrumented HDFS trace log (TraceBench) | :heavy_check_mark: | N.A. | 14,778,079 | 2.96GB | [Link](http://zbchen.github.io/Papers_files/cloudcom2014.pdf) |
+| [Hadoop](./Hadoop) | Hadoop mapreduce job log | :heavy_check_mark: | N.A. | 394,308 | 48.61MB | [Link](http://ieeexplore.ieee.org/document/7883294/) |
| [Spark](./Spark) | Spark job log || N.A. | 33,236,604 | 2.75GB | |
| [Zookeeper](./Zookeeper) | ZooKeeper service log | | 26.7 days | 74,380 | 9.95MB | |
-| [OpenStack](./OpenStack) | OpenStack infrastructure log | :heavy_check_mark: | N.A. | 207,820 | 58.61MB | [link](https://acmccs.github.io/papers/p1285-duA.pdf) |
+| [OpenStack](./OpenStack) | OpenStack infrastructure log | :heavy_check_mark: | N.A. | 207,820 | 58.61MB | [Link](https://acmccs.github.io/papers/p1285-duA.pdf) |
|<tr><th colspan=7 align="center">:open_file_folder: **Super computers**</th></tr>|
-| [BGL](./BGL) | Blue Gene/L supercomputer log | :heavy_check_mark: | 214.7 days | 4,747,963 | 708.76MB | [link](http://ieeexplore.ieee.org/document/4273008/) |
-| [HPC](./HPC) | High performance cluster log | | N.A. | 433,489 | 32.00MB | [link](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.503.7668&rep=rep1&type=pdf) |
-| [Thunderbird](./Thunderbird) | Thunderbird supercomputer log | :heavy_check_mark: | 244 days | 211,212,192 | 29.60GB | [link](http://ieeexplore.ieee.org/document/4273008/) |
+| [BGL](./BGL) | Blue Gene/L supercomputer log | :heavy_check_mark: | 214.7 days | 4,747,963 | 708.76MB | [Link](http://ieeexplore.ieee.org/document/4273008/) |
+| [HPC](./HPC) | High performance cluster log | | N.A. | 433,489 | 32.00MB | [Link](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.503.7668&rep=rep1&type=pdf) |
+| [Thunderbird](./Thunderbird) | Thunderbird supercomputer log | :heavy_check_mark: | 244 days | 211,212,192 | 29.60GB | [Link](http://ieeexplore.ieee.org/document/4273008/) |
|<tr><th colspan=7 align="center">:open_file_folder: **Operating systems**</th></tr>|
| [Windows](./Windows) | Windows event log | | 226.7 days | 114,608,388 | 26.09GB | |
-| [Linux](./Linux) | Linux system log | | 263.9 days | 25,567 | 2.25MB | [link](http://log-sharing.dreamhosters.com) |
+| [Linux](./Linux) | Linux system log | | 263.9 days | 25,567 | 2.25MB | [Link](http://log-sharing.dreamhosters.com) |
| [Mac](./Mac) | Mac OS log | | 7.0 days | 117,283 | 16.09MB | |
|<tr><th colspan=7 align="center">:open_file_folder: **Mobile systems**</th></tr>|
| [Android_v1](./Android#android_v1) | Android framework log | | N.A. | 1,555,005 | 183.37MB | |
-| [Android_v2](./Android#android_v2) | Android framework log | | N.A. | 30,348,042 | 3.37GB | |
+| [Android_v2](./Android#android_v2) | Android framework log | | N.A. | 30,348,042 | 3.38GB | |
| [HealthApp](./HealthApp) | Health app log | | 10.5 days | 253,395 | 22.44MB | |
|<tr><th colspan=7 align="center">:open_file_folder: **Server applications**</th></tr>|
-| [Apache](./Apache) | Apache web server error log | | 263.9 days | 56,481 | 4.90MB | [link](http://log-sharing.dreamhosters.com) |
+| [Apache](./Apache) | Apache web server error log | | 263.9 days | 56,481 | 4.90MB | [Link](http://log-sharing.dreamhosters.com) |
| [OpenSSH](./OpenSSH) | OpenSSH server log | | 28.4 days | 655,146 | 70.02MB | |
|<tr><th colspan=7 align="center">:open_file_folder: **Standalone software**</th></tr>|
| [Proxifier](./Proxifier) | Proxifier software log | | N.A. | 21,329 | 2.42MB | |
@@ -37,11 +37,11 @@ Loghub maintains a collection of system logs, which are freely accessible for AI
### Datasets download
We host only a small sample (2k lines) of each log dataset on Github. If you are interested in these raw datasets, please download them [via Zenodo](https://doi.org/10.5281/zenodo.1144100).

-:telescope: We proudly announce that the loghub datasets have been downloaded [**90000+**](https://zenodo.org/record/3227177) times by more than [**380+ organizations**](https://github.com/logpai/loghub/wiki/Loghub) from both industry and academia.
+:bell: We proudly announce that the loghub datasets have attained [**90000+ total downloads**](https://doi.org/10.5281/zenodo.1144100) by more than [**450 organizations**](https://github.com/logpai/loghub/wiki/Loghub-download-list) from both industry and academia.
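
To relate a downloaded file to the `#Lines` and size columns of the table, a minimal Python sketch such as the following could count lines and bytes in a local log file. The helper name `summarize_log` and the example path are hypothetical and are not part of loghub; the sketch assumes a plain-text log with newline-terminated records.

```python
import os


def summarize_log(path):
    """Return (line_count, size_bytes, avg_bytes_per_line) for a log file."""
    line_count = 0
    # Read in binary mode so undecodable bytes in raw logs cannot raise errors.
    with open(path, "rb") as f:
        for _ in f:
            line_count += 1
    size_bytes = os.path.getsize(path)
    # Guard against empty files to avoid dividing by zero.
    avg = size_bytes / line_count if line_count else 0.0
    return line_count, size_bytes, avg


if __name__ == "__main__":
    # Hypothetical path: replace it with a log file from your local copy.
    sample = "HDFS/HDFS_2k.log"
    if os.path.exists(sample):
        lines, size, avg = summarize_log(sample)
        print(f"{lines} lines, {size} bytes, {avg:.1f} bytes/line")
```

Streaming the file line by line keeps memory use flat, which matters for the multi-gigabyte raw datasets listed in the table.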


### Citation
-:bell: Please cite the following paper if you use the loghub datasets for research.
+Please cite the following paper if you use the loghub datasets for research.
+ Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics](https://arxiv.org/abs/2008.06448). IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.


@@ -68,12 +68,12 @@ We host only a small sample (2k lines) of each log dataset on Github. If you are
| WWW'23 | Liming Wang, Hong Xie, Ye Li, Jian Tan, John C.S. Lui. [Interactive Log Parsing via Light-weight User Feedback](https://arxiv.org/abs/2301.12225). ACM Web Conference, 2023. |
| TSC'23 | Siyu Yu, Pinjia He, Ningjiang Chen, Yifan Wu. [Brain: Log Parsing with Bidirectional Parallel Tree](https://ieeexplore.ieee.org/document/10109145). IEEE Transactions on Services Computing, 2023. |

-:bell: If you use loghub datasets in your paper, please feel free to make a PR to add your paper to the table.
+:bulb: If you use loghub datasets in your paper, please feel free to make a PR to add your paper to the table.

### Discussion
-Welcome to join our WeChat group for any question and discussion. Also, you can [open a discussion here](https://github.com/logpai/loghub/discussions/new/choose).
+Welcome to join our WeChat group for any question and discussion. Alternatively, you can [open a discussion here](https://github.com/logpai/loghub/discussions/new/choose).

![Scan QR code](https://cdn.jsdelivr.net/gh/logpai/logpai.github.io@master/img/wechat.png)

### License
-The datasets are freely available for research or academic work, subject to the condition that usage or distribution of loghub datasets should [cite the loghub paper](https://github.com/logpai/loghub/blob/master/CITATION) or refer to [the loghub repository](https://github.com/logpai/loghub).
+The datasets are freely available for research or academic work. For any usage or distribution of the datasets, please refer to the loghub repository https://github.com/logpai/loghub.