Data Collection Pipeline

About The Project

This is an extensible downloading framework that allows the user to add new sources with minimal code changes. For each new source, the user needs to write a Scrapy spider script; the rest of the downloading and metadata-file creation is handled by the respective pipelines. If required, users can add their own custom pipelines. The framework automatically transfers the downloaded data to a Google Cloud bucket. For more information on writing Scrapy spiders and pipelines, refer to the documentation. Data Collection Pipeline's developer documentation is meant for its adopters, developers, and contributors.

The developer documentation helps you get familiar with the bare necessities, giving you a quick and clean approach to getting up and running. If you are looking for ways to customize the workflow, or just want to break things down and build them back up, head to the reference section to dig into the mechanics of Data Collection Pipeline.

Data Collection Pipeline is based on an open platform: you are free to use any programming language to extend or customize it, but we prefer to use Python to perform smart scraping.

The developer documentation provides you with a complete set of guidelines that show you how to:

  • Install Data Collection Pipeline
  • Configure Data Collection Pipeline
  • Customize Data Collection Pipeline
  • Extend Data Collection Pipeline
  • Contribute to Data Collection Pipeline

Built With

We have used Scrapy as the base of this framework.

  • Scrapy

Summary

This summary mentions the key advantages and limitations of this smart crawler service.

Youtube Crawler

  • Key Points and Advantages:
    • Gets language-relevant channels from YouTube and downloads videos from them (70%-80% relevancy with the language, based on manual analysis).
    • Can fetch channels with Creative Commons videos and download the videos in them as well (70% relevancy with the language).
    • Can download using file mode (a file manually filled with video IDs) or channel mode.
    • Youtube-dl can fetch any number of videos from a channel and download them.
    • The YouTube crawler downloads files at a rate of at most 2000 hours per day and at least 800 hours per day.
    • The YouTube crawler is more convenient, and it is a main source of easily accessible Creative Commons data.
    • It can be deployed on Zyte, a cloud service used for scraping/crawling.
    • License information of videos is available in the metadata.

       

  • Limitations:
    • The YouTube API cannot return more than 500 videos per channel (when using YOUTUBE_API mode in the configuration).
    • The YouTube API is restricted to 10000 tokens per day in free mode.
      1. 10000 tokens can be used to get license info for 10000 videos (in any mode).
      2. 10000 tokens can be used to get 5000 channels (in YOUTUBE_API mode).
      3. Youtube-dl can be used to get all videos freely (in YOUTUBE_DL mode).
    • Cannot fetch data from a specific playlist. (Solution: fetch the video IDs of the playlist using youtube-dl, put them in a file, and download in file mode; a sketch follows this list.)
    • In rare cases you might get a "Too many requests" error from youtube-dl. (Solution: rerun the application with the same sources.)
    • Cannot download private videos or videos that require user information.
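
The playlist workaround above can be scripted. Below is a minimal sketch, assuming the youtube_dl Python package is installed; the playlist URL is a placeholder, and the printed URLs would then be pasted into the file-mode CSV described in the Usage section.

    # Hypothetical sketch: list the video URLs of a playlist with youtube-dl
    # so they can be placed in the CSV used by YouTube file mode.
    import youtube_dl

    playlist_url = "https://www.youtube.com/playlist?list=PLxxxxxxxx"  # placeholder

    # extract_flat lists the playlist entries without downloading anything
    ydl_opts = {"extract_flat": True, "quiet": True}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(playlist_url, download=False)

    for entry in info.get("entries", []):
        print("https://www.youtube.com/watch?v=" + entry["id"])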

Web Crawler

  • Key Points and Advantages:
    • The web crawler can download audio in a specific language, but with only around 50-60% relevance.
    • The web crawler downloads files at a rate of at least 2000 hours per day.
    • It is a faster means of downloading data.
    • The Creative Commons license of videos can be identified, if available, while crawling websites.

       

  • Limitations:
    • The web crawler is not finely tuned yet, so downloaded content might have low language relevance.
    • It cannot be deployed on Zyte free accounts; it can only be deployed on Zyte paid accounts, where Docker container creation can be customised.
    • License information of videos cannot be identified automatically by the web crawler and requires some manual intervention.

Getting Started

To get started, install the prerequisites and clone the repo to the machine on which you wish to run the framework.

Prerequisites

  1. Install the ffmpeg library using the commands mentioned below.

    • For any Linux-based operating system (Ubuntu preferred):

      sudo apt-get install ffmpeg
      
    • For macOS:

      brew install ffmpeg
      
    • Windows users can follow the installation steps at https://www.ffmpeg.org

  2. Install Python version 3.6.

  3. Get credentials from the Google developer console for Google Cloud Storage access.

Installation

  1. Clone the repo using

    git clone https://github.com/Open-Speech-EkStep/data-acquisition-pipeline.git
    
  2. Go inside the directory

    cd data-acquisition-pipeline
    
  3. Install python requirements

    pip install -r requirements.txt

  4. Install gcloud utils

    Download from: https://cloud.google.com/sdk/docs/install#linux
    
    gcloud init
    

Usage

This framework allows the user to download media files from a web source (YouTube, xyz.com, etc.) and creates the respective metadata file from the data extracted from each file. To use any existing source or to add a new source, refer to the steps below. The framework can also crawl the internet for media in a specific language; for web crawling, refer to the web crawl configuration below.

Common configuration steps:

Setting credentials for Google cloud bucket

You can set the credentials for the Google Cloud bucket in credentials.json, which can be found in the project root folder. Add the credentials in the following manner:

{"Credentials":{ YOUR ACCOUNT CREDENTIAL KEYS }}

Note: All configuration files can be found at the following path: data-acquisition-pipeline/data_acquisition_framework/configs/

Bucket configuration

Bucket configurations for data transfer are set in storage_config.json:

"bucket": "ekstepspeechrecognition-dev",          Your bucket name
"channel_blob_path": "scrapydump/refactor_test",  Path to directory where downloaded files is to be stored
"archive_blob_path": "archive",                   Folder name in which history of download is to be maintained
"channels_file_blob_path": "channels",            Folder name in which channels and its videos are saved 
"scraped_data_blob_path": "data_to_be_scraped"    Folder name in which CSV for youtube file mode is stored

Note:
1. The scraped_data_blob_path folder should be present inside the channel_blob_path folder.
2. The name of the CSV file used in YouTube file mode must be the same as the source_name given above.
3. (only for the datacollector_urls and datacollector_bing spiders) To auto-configure the language parameter of channel_blob_path from web_crawl_config.json, use <language> in channel_blob_path (see the example below).
    "eg: for tamil : data/download/<language>/audio - this will replace <language> with tamil."
4. The archive_blob_path and channels_file_blob_path folders will be auto-generated in the bucket with the given names.
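
Putting these fields together, a minimal sketch of a complete storage_config.json could look like the following; the bucket name and paths are illustrative only, and the <language> placeholder applies only to the datacollector_urls and datacollector_bing spiders.

    {
        "bucket": "ekstepspeechrecognition-dev",
        "channel_blob_path": "data/download/<language>/audio",
        "archive_blob_path": "archive",
        "channels_file_blob_path": "channels",
        "scraped_data_blob_path": "data_to_be_scraped"
    }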

Metadata file configurations

Metadata file configurations are set in config.json (an illustrative example follows the notes below):

mode: 'complete'                        This should not be changed
audio_id: null                          If you want to give a custom audio id, add it here
cleaned_duration: null                  If you know the cleaned duration of the audio, add it here
num_of_speakers: null                   Number of speakers present in the audio
language: Hindi                         Language of the audio
has_other_audio_signature: False        If the audio has multiple speakers in the same file (True/False)
type: 'audio'                           Type of media (audio or video)
source: 'Demo_Source'                   Source name
experiment_use: False                   If it is for experimental use (True/False)
utterances_files_list: null
source_website: ''                      Source website url
experiment_name: null                   Name of the experiment if experiment_use is True
mother_tongue: null                     Accent of the language (Bengali, Marathi, etc...)
age_group: null                         Age group of the speaker in the audio
recorded_state: null                    State in which the audio was recorded
recorded_district: null                 District of the state in which the audio was recorded
recorded_place: null                    Recording location
recorded_date: null                     Recording date
purpose: null                           Purpose of the recording
speaker_gender: null                    Gender of the speaker
speaker_name: null                      Name of the speaker

Note:
1. If any field's information is not available, keep its value as null.
2. If speaker_name or speaker_gender is given, the same value will be used for all the files in the given source.
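
For orientation only, here is a hypothetical rendering of these fields as JSON; the actual config.json in the repository may nest or group the fields differently, so treat this purely as an illustration of the values described above.

    {
        "mode": "complete",
        "audio_id": null,
        "cleaned_duration": null,
        "num_of_speakers": null,
        "language": "Hindi",
        "has_other_audio_signature": false,
        "type": "audio",
        "source": "Demo_Source",
        "experiment_use": false,
        "utterances_files_list": null,
        "source_website": "",
        "experiment_name": null,
        "mother_tongue": null,
        "age_group": null,
        "recorded_state": null,
        "recorded_district": null,
        "recorded_place": null,
        "recorded_date": null,
        "purpose": null,
        "speaker_gender": null,
        "speaker_name": null
    }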

Youtube download configurations

  • You can set download mode [file/channel] in youtube_pipeline_config.py
    mode = 'file'  # [channel,file]
    
    In file mode you will store a CSV file, whose name must be the same as the source name, in scraped_data_blob_path. The CSV must contain the URLs of YouTube videos, the speaker name, and the gender as three different columns. The URL is a mandatory field; you can leave speaker name and gender blank if the data is not available.
    Given below is the structure of the CSV.
     video_url,speaker_name,speaker_gender
    https://www.youtube.com/watch?v=K1vW_ZikA5o,Ram_Singh,male
    https://www.youtube.com/watch?v=o82HIOgozi8,John_Doe,male
    ...
    
  • Common configurations in youtube_pipeline_config.py (a consolidated sketch follows this list)
    # Common configurations
    "source_name": "DEMO",                              This is the name of the source you are downloading
    batch_num = 1                                       Number of videos to be downloaded per batch
    youtube_service_to_use = YoutubeService.YOUTUBE_DL  This field chooses which service to use for getting video information
    only_creative_commons = False                       Whether to download only Creative Commons videos (True/False)

    Possible values for youtube_service_to_use: (YoutubeService.YOUTUBE_DL, YoutubeService.YOUTUBE_API)
  • File mode configurations in youtube_pipeline_config.py
    # File Mode configurations
    file_speaker_gender_column = 'speaker_gender'     Gender column name in csv file
    file_speaker_name_column = "speaker_name"         Speaker name column name in csv file
    file_url_name_column = "video_url"                Video url column name in csv file
    license_column = "license"                        Video license column name in csv file
    
  • channel mode configuration in youtube_pipeline_config.py
    # Channel mode configurations
    channel_url_dict = {}             Channel url dictionary (This will download all the videos from the given channels with corresponding source names)
    
    Note:
    1. In channel_url_dict, the keys must be the urls and values must be their channel names
    2. To get list of channels from youtube API, channel_url_dict must be empty
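
Putting the fragments above together, a consolidated sketch of a filled-in youtube_pipeline_config.py for channel mode might look like the following. The values are illustrative, and the YoutubeService enum defined here is only a stand-in for the one shipped with the framework.

    # Hypothetical consolidated youtube_pipeline_config.py (channel mode); values are illustrative.
    from enum import Enum

    class YoutubeService(Enum):  # stand-in for the framework's enum, for illustration only
        YOUTUBE_DL = "youtube_dl"
        YOUTUBE_API = "youtube_api"

    mode = 'channel'                                    # [channel, file]
    source_name = "DEMO"
    batch_num = 1
    youtube_service_to_use = YoutubeService.YOUTUBE_DL  # or YoutubeService.YOUTUBE_API
    only_creative_commons = False

    # File mode column names (used only when mode = 'file')
    file_speaker_gender_column = 'speaker_gender'
    file_speaker_name_column = "speaker_name"
    file_url_name_column = "video_url"
    license_column = "license"

    # Channel mode: keys are channel URLs, values are channel names;
    # leave empty ({}) to let the configured YouTube service discover channels.
    channel_url_dict = {
        "https://www.youtube.com/channel/1": "channel_name_a",
        "https://www.youtube.com/channel/2": "channel_name_b",
    }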
    

Youtube API configuration

  • Automated Youtube fetching configuration in youtube_api_config.json
    # Youtube API configurations
    "language" : "hindi",                             Type of language for which search results are required.
    "language_code": "hi",                            Language code for the specified language.
    "keywords":[                                      The search keywords to be given in youtube API query
        "audio",
        "speech",
        "talk"
    ],
    "words_to_ignore":[                               The words that are to be ignored in youtube API query
        "song",
        "music"
    ],
    "max_results": 20                                 Maximum number of channels or results that is required.
    

Web Crawl Configuration

  • Web crawl configuration in web_crawl_config.json (use this only for the datacollector_bing and datacollector_urls spiders)
    "language": "gujarati",                           Language to be crawled
    "language_code": "gu",                            Language code for the specified language.
    "keywords": [                                     Keywords to query
        "talks audio",
        "audiobooks",
        "speeches",
    ],
    "word_to_ignore": [                               Words to ignore while crawling
        "ieeexplore.ieee.org",
        "dl.acm.org",
        "www.microsoft.com"
    ],
    "extensions_to_ignore": [                         Formats/extensions to ignore while crawling
        ".jpeg",
        "xlsx",
        ".xml"
    ],
    "extensions_to_include": [                        Formats/extensions to include while crawling
        ".mp3",
        ".wav",
        ".mp4",
    ],
    "pages": 1,                                       Number of pages to crawl
    "depth": 1,                                       Nesting depth for each website
    "continue_page": "NO",                            Field to continue/resume crawling
    "last_visited": 200,                              Last visited results count
    "enable_hours_restriction": "YES",                Restrict crawling based on hours of data collected
    "max_hours": 1                                    Maximum hours to crawl
    

Adding new spider

As already mentioned, our framework is extensible to any new source. To add a new source, the user just needs to write a spider for that source.
To add a spider you can follow the Scrapy documentation or you can check our sample spider; a minimal sketch is also shown below.
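
For orientation, here is a minimal hypothetical spider sketch. The spider name, start URL, and yielded item fields are assumptions made for illustration; the real integration with the download and metadata pipelines should follow the sample spider in the repository.

    import scrapy

    class DemoAudioSpider(scrapy.Spider):
        # Hypothetical spider: collects media links from a single page.
        name = "datacollector_demo"                          # illustrative spider name
        start_urls = ["https://example.com/audio-archive"]   # illustrative source

        def parse(self, response):
            # Yield every media link found on the page; the framework's
            # download and metadata pipelines are expected to handle the rest.
            for href in response.css("a::attr(href)").getall():
                if href.endswith((".mp3", ".wav", ".mp4")):
                    yield {"file_urls": [response.urljoin(href)]}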

Running services

Make sure the Google credentials are present in the credentials.json file in the project root folder.

Youtube spider in channel mode:

  1. In data_acquisition_framework/configs, do the following:

    • Open config.json and change language and type to your respective use case.

    • Open storage_config.json and change bucket and channel_blob_path to your respective gcp paths.(For more info on these fields, scroll above to Bucket configuration)

    • Open youtube_pipeline_config.py and change mode to channel (e.g. mode='channel').

  2. There are two ways to download videos of youtube channels:

    • You can hardcode the channel url and channel name.

    • You can use the youtube-utils service (youtube-dl/YouTube Data API) to fetch channels and their respective video information.

  3. To download by hardcoding the channel urls, do the following:

    • Open data_acquisition_framework/configs/youtube_pipeline_config.py and do the following:

    • Add the channel URLs and their names to the channel_url_dict variable.

      eg. channel_url_dict = { "https://www.youtube.com/channel/1": "channel_name_a", "https://www.youtube.com/channel/2":"channel_name_b" }

    • Set the youtube_service_to_use variable to either YoutubeService.YOUTUBE_DL or YoutubeService.YOUTUBE_API for collecting video info.

    • If YoutubeService.YOUTUBE_API is chosen, then get an API key for the YouTube Data API from the Google developer console and store it in a file called .youtube_api_key in the project root folder. Generate .youtube_api_key from

    • From the project root folder, run the following command:

      scrapy crawl datacollector_youtube --set=ITEM_PIPELINES='{"data_acquisition_framework.pipelines.youtube_api_pipeline.YoutubeApiPipeline": 1}'
      
    • This will start fetching the videos from YouTube for the given channels and downloading them to the bucket.

  4. To download by using youtube-utils service, do the following:

    • Open data_acquisition_framework/configs/youtube_pipeline_config.py and do the following:

      • Assign channel_url_dict = {} (if it is not empty, this will not work).
      • Set the youtube_service_to_use variable to either YoutubeService.YOUTUBE_DL or YoutubeService.YOUTUBE_API for collecting video info.
      • If YoutubeService.YOUTUBE_API is chosen, then get an API key for the YouTube Data API from the Google developer console and store it in a file called .youtube_api_key in the project root folder.
    • Open data_acquisition_framework/configs/youtube_api_config.json and change the fields to your requirements. (For more info, check the Youtube API configuration above.)

    • From the project root folder, run the following command:

       scrapy crawl datacollector_youtube --set=ITEM_PIPELINES='{"data_acquisition_framework.pipelines.youtube_api_pipeline.YoutubeApiPipeline": 1}'
      
    • This will start fetching the videos from YouTube for the given channels and downloading them to the bucket.

Youtube spider in file mode:

  1. In data_acquisition_framework/configs, do the following:

    • Open config.json and change language and type to your respective use case.

    • Open storage_config.json and change bucket and channel_blob_path to your respective gcp paths.(For more info on these fields, scroll above to Bucket configuration)

    • Open youtube_pipeline_config.py and do the following:

      • Change mode to file (e.g. mode='file').
      • Change source_name to your requirement so that videos get downloaded to that folder in the Google storage bucket.
  2. Next Steps:

    • Create a file in the following format:

      e.g. source_name.csv with the content below (the license column is optional). Here, source_name in source_name.csv is the name you gave in the youtube_pipeline_config.py file; it should be the same.

      video_url,speaker_name,speaker_gender,license
      https://www.youtube.com/watch?v=K1vW_ZikA5o,Ram_Singh,male,Creative Commons
      https://www.youtube.com/watch?v=o82HIOgozi8,John_Doe,male,Standard Youtube
      ...

    • Now, to upload this file to Google Cloud Storage, do the following:

      • Open the channel_blob_path folder that you gave in storage_config.json and create a folder there named data_to_be_scraped.
      • Upload the file that you created in the previous step to this folder.
    • From the project root folder, run the following command:

       scrapy crawl datacollector_youtube --set=ITEM_PIPELINES='{"data_acquisition_framework.pipelines.youtube_api_pipeline.YoutubeApiPipeline": 1}'
      
    • This will start fetching the videos mentioned in the file from YouTube and downloading them to the bucket.

Bing Spider

  • Configure data_acquisition_framework/configs/web_crawl_config.json for your requirements.

  • Starting datacollector_bing spider with audio pipeline.

    • From project root folder, run the following:

      scrapy crawl datacollector_bing
      

Urls Spider

  • Configure data_acquisition_framework/configs/web_crawl_config.json for your requirements.

  • Starting datacollector_urls spider with audio pipeline.

    • Make sure to put the URLs to crawl in data_acquisition_framework/urls.txt.
    • From project root folder, run the following:

      scrapy crawl datacollector_urls
      

Selenium google crawler

  • It is capable of crawling Google search results for a given language and exporting them to a urls.txt file. This urls.txt file can be used with the datacollector_urls spider to crawl all the search-result websites and download the media along with their metadata.

  • A dedicated README can be found in the selenium_google_crawler folder: Readme for selenium google crawler

Selenium youtube crawler for file mode and api mode

  • It is capable of crawling YouTube videos using the YouTube API or from a list of files containing YouTube video IDs, with the channel name as the filename.

  • A dedicated README can be found in the selenium_youtube_crawler folder: Readme for selenium youtube crawler

Tutorials Reference

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the [XYZ] License. See LICENSE for more information.

Git Repository

https://github.com/Open-Speech-EkStep/data-acquisition-pipeline

Contact

Connect with the community on Gitter

Acknowledgements