Upload the Historian data into Hadoop HDFS

Jun 25, 2012 at 11:28 AM

Hi,

 

I have installed openPDC on a Windows machine and installed Hadoop on another machine running SUSE. I want to upload the historian data (c:\OpenPDC\archive) into HDFS. How can I do it?

Need help.

 

Many thanks

 

Coordinator
Jun 29, 2012 at 9:57 PM

Lots of fun in this exercise - I suggest you start this journey here:

Thanks!
Ritchie

Jul 2, 2012 at 3:42 PM

Many thanks Ritchie,

But I want to store the data directly in HDFS. As I mentioned, the openPDC software is running on one machine and Hadoop is running on another machine.

When configuring the openPDC, there is an option for a remote server for historian data. How do I configure it so that the historian data is stored on the remote computer?

I will highly appreciate your response.

 

Many thanks again

Coordinator
Jul 4, 2012 at 4:26 PM

Ah - yes. There are two components that work together to make this happen:

  1. The Java-based HDFS bridge code - this "accepts" files onto Hadoop using a modified version of FTP that will validate the checksum of the uploaded file.
  2. The C#.NET-based local openHistorian Hadoop replication provider, which will automatically push .D files to the HDFS bridge and retry if the checksum doesn't match.

The Hadoop replication provider can be enabled in the openPDC.exe.config file as it is not enabled by default. You can use the "XML Configuration Editor" to modify the config file and adjust settings for the Hadoop provider and local historian.

Relevant settings (a sketch of how these might appear in the config file follows the list):

  • ArchiveLocation - specifies the local path(s) for .D files to push to Hadoop; multiple paths can be separated by a semicolon
  • ReplicaLocation - specifies the FTP-style URI of the Hadoop HDFS bridge
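For illustration, here is a rough sketch of how these settings might look in openPDC.exe.config - the category name, descriptions, and example values below are only assumptions; the XML Configuration Editor will show the exact names your installation exposes:

    <!-- Sketch only: category name, attributes and values are illustrative -->
    <statHadoopReplicationProvider>
      <add name="Enabled" value="True" description="Determines if the Hadoop replication provider is enabled." encrypted="false" />
      <add name="ArchiveLocation" value="C:\Program Files\openPDC\Archive" description="Local path(s) containing .D files; separate multiple paths with a semicolon." encrypted="false" />
      <add name="ReplicaLocation" value="ftp://hadoop-head-node:25" description="FTP-style URI of the Hadoop HDFS bridge." encrypted="false" />
    </statHadoopReplicationProvider>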

Basically, the local historian archives data to a local .D file as normal - once this local file is "full" and gets timestamped (i.e., rolled over and closed), it will automatically get pushed to Hadoop (i.e., replicated to Hadoop over the HDFS bridge).

You can make the local historian files very small (again, settings in the config file) so that they roll over very often, and you can set the maximum number of files you want to maintain locally for posterity.

It is recommended that you maintain at least a small cache of the files locally so you can validate that the data is being archived properly using the provided historian tools, e.g., the Historian Viewer and the Historian Playback Utility.

Once the files are on Hadoop, you will need to use Map/Reduce jobs to query, analyze, and extract data. Source code is provided to do map/reduce processing of the .D files, and other parties have developed tools to "extract" .D archived data once it is stored on Hadoop, e.g.:

https://pdihistorian.svn.sourceforge.net/svnroot/pdihistorian/PDI_Historian/

Hope that helps!

Ritchie

Jul 5, 2012 at 7:44 PM
Dear Ritchie Carroll
Many thanks for the valuable information; I really appreciate what you have posted. As you know, I have only just jumped into Hadoop and openPDC and can barely swim (I mean I do not have much knowledge about Hadoop and openPDC).
Do I have to use both methods together, or can I use only one method to upload the data? If only one method can solve the problem, then I would choose the second option (HadoopReplication).
I have made the following changes in the openPDC Configuration Editor, but it does not work.
StatHadoopReplication:
(I)   ArchiveLocation: c:\program files\openPDC\statics\stat_arch.d (where my historian data are stored)
(II)  Deletingorginalfiles: True
(III) Enabled: True
(IV)  ReplicaLocation: hdfs://134.83.35.24/user/mukhtaj/input/, then I tried hdfs://134.83.35.24:9000/user/mukhtaj/input/ (the IP address is the address of the machine where Hadoop is installed, and /user/mukhtaj/input/ is the location where the HDFS files are stored)
I did not make any other changes anywhere.
Could you please guide me step by step on how I can solve this problem?
Coordinator
Jul 5, 2012 at 8:43 PM
Edited Jul 5, 2012 at 8:52 PM

You do have to use both components in order to move files to Hadoop.

The HDFS bridge code defines a special "FTP" service for Hadoop that will validate the checksum of historian files that were uploaded using the openPDC Hadoop replication provider.

You will need to run the HDFS bridge to start listening for historian files to be transferred from the openPDC. This is a Java application that will need to be started on Hadoop; we typically run this application on the head node. I'm not going to be the best person to provide detailed instructions on how to get things going on Hadoop, however.

Note that it will not transfer a file until the local openPDC fills up one historian file (i.e., files will not transfer until the .D file "rolls over"). This could take a while with the default settings and not many reporting PMU devices.

The default archive size is 100 MB; if you want to test the transfer more quickly, make the openPDC archive file size very small (say 5 megabytes or so). This setting can be found in the openPDC.exe.config file under settings/ppaArchiveFile/FileSize.
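As a rough sketch, that fragment of openPDC.exe.config might look something like the following - the surrounding attributes and description are assumptions, and the value is taken to be in megabytes based on the 100 MB default mentioned above:

    <ppaArchiveFile>
      <!-- Sketch only: shrink the archive so it rolls over (and therefore replicates) quickly while testing -->
      <add name="FileSize" value="5" description="Size (in MB) of the archive file." encrypted="false" />
    </ppaArchiveFile>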

Thanks!
Ritchie

Jul 5, 2012 at 9:59 PM
Many thanks, Ritchie.

 
If I have to run the HDFS bridge on the head node, then I have a problem: my Hadoop is running on one machine (say computer1) and openPDC is running on another machine (say computer2). How will the program connect to computer2 to instruct the HadoopReplicator? Where do I have to make the changes (i.e., do I need to make changes in the HDFSBridge program)?

Second, in my last post I listed the changes I made in the configuration editor; could you please check that they are correct?

Third, I have another idea: I could store the historian data in a MySQL database and then use some application to upload the data into HDFS directly. Could you please let me know how I can store the historian data in MySQL instead of the c:\program files\OpenPDC\archive folder? There are some Hadoop applications that can be used to upload data into HDFS from a database.

 

many thanks

Coordinator
Jul 5, 2012 at 10:22 PM
Edited Jul 5, 2012 at 10:27 PM
  1. The ReplicaLocation setting specifies the URI of the Hadoop HDFS bridge (e.g., ftp://134.83.35.24:25) - the HadoopReplicator running on the openPDC will connect to the HDFS bridge running on Hadoop using this URI. Again, you will need to run the HDFS bridge Java-based application on the Hadoop system - it will then "listen" for files from the openPDC running on another computer.
  2. Your settings look good except for ReplicaLocation (see item 1 and the sketch after this list).
  3. You have options here. You can certainly change your local historian to save data to MySQL instead of the openHistorian and then use other tools to migrate and query MySQL data on Hadoop. Be forewarned, however: the openHistorian is designed to handle archiving of high-speed PMU data - you can only get so much speed when archiving through MySQL (see this post and search for MySQL in the discussions list for more details on archiving to MySQL).
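To make item 1 concrete, here is a hedged illustration of the change - the port number, and whether a path is needed after the host, depend entirely on how your HDFS bridge is configured (25 is just the example port used above):

    what was tried:     hdfs://134.83.35.24:9000/user/mukhtaj/input/   (an HDFS URI - not what the replication provider expects)
    what is expected:   ftp://134.83.35.24:25                          (an FTP-style URI pointing at the HDFS bridge)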

Thanks!
Ritchie

Jul 12, 2012 at 4:19 PM

I have installed the hdfs-over-ftp mentioned at http://openpdc.codeplex.com/wikipage?title=Using Hadoop (Developers)&referringTitle=Documentation#copy_data_into_hadoop. The hdfs-over-ftp is running as a process on the Hadoop machine. On the openPDC side, I have enabled the HadoopReplicationProvider, set the archive location (c:\program files\openpdc\statistics\stat_archive.d) and the replication location (ftp:134.83.35.24:9000). But the files are still not uploaded to Hadoop.

Does anybody have any suggestions?

 

Many thanks

 

Jul 15, 2012 at 11:21 PM

Hi,

 

Does anyone know about the following statement (HadoopReplication)?

 

string[] credentials = replicaUri.UserInfo.Split(':');

Does this mean the username and password?
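My guess is that in .NET, Uri.UserInfo returns the "user:password" portion embedded in a URI, so this line would pull FTP credentials out of the ReplicaLocation URI. A minimal sketch of what I mean (the URI, user name, and password below are made up for illustration):

    using System;

    class UserInfoExample
    {
        static void Main()
        {
            // Hypothetical ReplicaLocation URI with embedded credentials (user "hdfs", password "secret").
            Uri replicaUri = new Uri("ftp://hdfs:secret@134.83.35.24:25/");

            // UserInfo is the "user:password" part of the URI, i.e. "hdfs:secret".
            string[] credentials = replicaUri.UserInfo.Split(':');

            Console.WriteLine(credentials[0]); // "hdfs"   - user name
            Console.WriteLine(credentials[1]); // "secret" - password
        }
    }

Is that right?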

 

Many thanks

Aug 26, 2013 at 3:47 PM
Dear Sirs,

We are at the beginning of a study where we expect to try out AWS Elastic MapReduce to help us process openPDC data, so we'd like to know if anybody else is doing the same thing so we can exchange info on this.
The big challenge is to understand how to process the .d file, stored in an S3 bucket, using Hive on an EMR cluster.
We are trying a simplification of this, turning the .d file into a .csv file so Hive can understand it and we can run some queries on it.

All best,

Sergio Mafra
Coordinator
Aug 27, 2013 at 2:32 PM
Edited Aug 27, 2013 at 2:33 PM
It should be easy enough to automate converting .D files into CSV data - there will be a sizable increase in disk space requirements, however.

You can also check out, in the openPDC source code, the map/reduce code that will read and extract data from .D files:

<openPDC source root>\Hadoop\Current Version

I would think keeping the data in .D form will save processing time and disk space. We used the map-reduce code to analyze and extract data from Hadoop as needed.

Also, I suppose I would be remiss not to mention that you should start investigating openHistorian 2.0 as well (openhistorian.codeplex.com)...

Thanks!
Ritchie