Tuesday 8 November 2011

Web scraping and data automation

Collecting data in one place from diverse sources such as web sites, email, and FTP servers, then transforming, cleansing, and arranging it in a proper format, is a primary need in many industries.

Companies that rely on Google Analytics and similar online marketing platforms, which generate or provide reports online, need to download and process that data on a day-to-day basis. This calls for automated data scraping via an automation script, or an automated ETL process built with one ETL tool or another.

Perl, Selenium, iMacros, and Talend are all handy for this.

Automated web scraping involves fetching data from various sources, e.g. web sites, email accounts, web APIs, FTP servers, and shared locations. Tools commonly used for this include:
  1. Perl scripts
  2. iMacros
  3. Selenium with Perl bindings
  4. Talend components, which can fetch data from APIs, email accounts, and FTP servers
I have encountered numerous such scenarios and implemented them successfully (minimal Perl sketches for several of them follow the list below):

1) Fetch data from a web services API
2) Fetch data from an FTP server
3) Fetch data from a Gmail account
4) Fetch data by passing parameters to custom-built web sites
5) Fetch data from an FTP server on a regular schedule
6) Fetch encrypted data from an FTP server and decrypt it
7) Fetch data from shared locations such as a network drive or an S3 bucket
8) Fetch data from Excel sheets in various formats
9) Fetch data from CSV/TSV files
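
For scenario 1, here is a minimal Perl sketch using LWP::UserAgent and the JSON module. The endpoint, api_key parameter, and "rows" response key are hypothetical placeholders for whatever your provider actually exposes:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::UserAgent;
  use JSON;

  # Hypothetical endpoint and api_key -- replace with your provider's details.
  my $url = 'https://api.example.com/v1/report?api_key=SECRET&date=2011-11-08';

  my $ua  = LWP::UserAgent->new( timeout => 30 );
  my $res = $ua->get($url);
  die 'API request failed: ' . $res->status_line unless $res->is_success;

  # Assume the API returns JSON; decode it into a Perl data structure.
  my $data = decode_json( $res->decoded_content );
  print "Fetched ", scalar @{ $data->{rows} }, " rows\n";  # 'rows' key is hypothetical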
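
For scenarios 2, 5, and 6, Net::FTP covers the download, and decryption can be delegated to the gpg command-line tool (this assumes GnuPG is installed and the key is already imported). The host, credentials, and file names below are placeholders:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Net::FTP;

  # Hypothetical host, credentials, and file names.
  my $ftp = Net::FTP->new( 'ftp.example.com', Timeout => 60 )
      or die "Cannot connect: $@";
  $ftp->login( 'user', 'pass' ) or die 'Login failed: ' . $ftp->message;
  $ftp->cwd('/exports')         or die 'cwd failed: ' . $ftp->message;
  $ftp->binary;                 # binary mode avoids newline mangling
  $ftp->get('daily_report.csv.gpg') or die 'get failed: ' . $ftp->message;
  $ftp->quit;

  # Scenario 6: decrypt with GnuPG; assumes gpg is on the PATH.
  system( 'gpg', '--batch', '--output', 'daily_report.csv',
          '--decrypt', 'daily_report.csv.gpg' ) == 0
      or die 'gpg decryption failed';

For scenario 5, scheduling the same script via cron turns a one-off download into a regular feed.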
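
For scenario 3, Gmail exposes IMAP on imap.gmail.com port 993 over SSL, so Mail::IMAPClient can pull report mails. The account details and search filter below are placeholders, and the exact search criteria will depend on how your reports arrive:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Mail::IMAPClient;

  # Ssl => 1 requires IO::Socket::SSL; credentials are placeholders.
  my $imap = Mail::IMAPClient->new(
      Server   => 'imap.gmail.com',
      Port     => 993,
      Ssl      => 1,
      User     => 'user@gmail.com',
      Password => 'secret',
  ) or die "IMAP connect failed: $@";

  $imap->select('INBOX') or die 'select failed: ' . $imap->LastError;

  # Hypothetical filter: unread mails from the reporting system.
  my @ids = $imap->search('UNSEEN FROM "reports@example.com"');
  for my $id (@ids) {
      my $raw = $imap->message_string($id);   # full raw message
      # parse $raw (e.g. with MIME::Parser) and save the attachment here
  }
  $imap->logout;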
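
For scenario 4, WWW::Mechanize handles sites that expect a login form and query-string parameters. The URLs and form field names here are hypothetical:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use WWW::Mechanize;

  my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on any failed request

  # Hypothetical login form -- adjust field names to the target site.
  $mech->get('https://reports.example.com/login');
  $mech->submit_form( with_fields => { username => 'user', password => 'pass' } );

  # Request the report by passing date parameters in the query string.
  $mech->get('https://reports.example.com/export?from=2011-11-01&to=2011-11-08&format=csv');
  open my $fh, '>', 'report.csv' or die $!;
  print {$fh} $mech->content;
  close $fh;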
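
For scenarios 8 and 9, Text::CSV (with a tab separator for TSV files) and Spreadsheet::ParseExcel cover the flat-file and .xls cases. File names are placeholders:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Text::CSV;
  use Spreadsheet::ParseExcel;

  # TSV: same parser as CSV, different separator.
  my $csv = Text::CSV->new( { binary => 1, sep_char => "\t" } )
      or die 'Cannot use Text::CSV: ' . Text::CSV->error_diag;
  open my $fh, '<', 'report.tsv' or die $!;
  while ( my $row = $csv->getline($fh) ) {
      print join( '|', @$row ), "\n";
  }
  close $fh;

  # Excel (.xls): walk the first worksheet cell by cell.
  my $parser   = Spreadsheet::ParseExcel->new;
  my $workbook = $parser->parse('report.xls') or die $parser->error;
  my ($sheet)  = $workbook->worksheets;
  my ( $rmin, $rmax ) = $sheet->row_range;
  my ( $cmin, $cmax ) = $sheet->col_range;
  for my $r ( $rmin .. $rmax ) {
      my @cells = map {
          my $c = $sheet->get_cell( $r, $_ );
          defined $c ? $c->value : '';
      } $cmin .. $cmax;
      print join( '|', @cells ), "\n";
  }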

Once you have downloaded data from these sources, Talend Open Studio, an open-source ETL tool, can help you transform that data and load it into an open-source database such as MySQL.
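
If you prefer to stay in Perl rather than Talend for a simple load, DBI can push the cleansed rows into MySQL. A minimal sketch, assuming a hypothetical warehouse database and daily_report table:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  # Hypothetical database, table, and credentials.
  my $dbh = DBI->connect( 'dbi:mysql:database=warehouse;host=localhost',
      'etl_user', 'secret', { RaiseError => 1, AutoCommit => 0 } );

  my $sth = $dbh->prepare(
      'INSERT INTO daily_report (report_date, metric, value) VALUES (?, ?, ?)'
  );

  # @rows would come from one of the fetch/parse steps above.
  my @rows = ( [ '2011-11-08', 'visits', 1234 ] );
  $sth->execute(@$_) for @rows;

  $dbh->commit;
  $dbh->disconnect;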

Using this idea, one can build one's own data warehouse free of cost.

Reports can then be generated with another open-source report development tool such as BIRT, Jasper, or Pentaho, and delivered to stakeholders by email on a day-to-day basis.
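
The delivery step is easy to script as well. A minimal sketch using MIME::Lite to mail a generated PDF; the addresses and file name are placeholders:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use MIME::Lite;

  my $msg = MIME::Lite->new(
      From    => 'etl@example.com',
      To      => 'stakeholders@example.com',
      Subject => 'Daily report - 2011-11-08',
      Type    => 'multipart/mixed',
  );
  $msg->attach( Type => 'TEXT', Data => "Please find today's report attached.\n" );
  $msg->attach(
      Type        => 'application/pdf',
      Path        => 'daily_report.pdf',
      Filename    => 'daily_report.pdf',
      Disposition => 'attachment',
  );
  $msg->send;   # uses sendmail by default; pass ('smtp', 'mail.example.com') for SMTP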

In short, the above can be summarized in the following steps:
1) Automate the process of data collection
2) Transform the data and load it into a database
3) Build a warehouse
4) Generate PDF reports
5) Circulate the reports to stakeholders

This is a very common requirement in industry; automating it saves an enormous amount of manual work and gives you better insight into what is happening with your data.

Who can benefit from this?
1) All digital marketing companies
2) All companies that generate reports via web sites
3) All who generate web site log data
4) All who need to receive reports on their data on a day-to-day basis

Please let me know if you need further information on this.