Resolved by WBS00001!

Reading Through An Rss File To Get Specific Data

I have been able to get and save the attached file to my hard drive. I can read it, but I cannot figure out how to pull out just the data I want. In this case it is a stock quote for VZ. I want to pull out just the symbol and stock price.

I would appreciate any examples of how to accomplish that.

Thank You




You can load the file into an EZ_builder variable like this:

$FileText =FileReadAll(filename)

filename would be either a literal string (in quotes) or a variable set to the file name (and path) you wish to read.

Then, with the entire file in the variable $FileText, you can search for what you want by using the IndexOf function:

$StartPos =IndexOf($FileText,"VZ")

This assumes there is nothing else in the file that starts with the capital letters VZ, which is not a safe bet. If that is a concern, you could try a space at the end of the search phrase, like this: "VZ " instead of "VZ"

If "VZ" is found, you can then start getting the text from there on by using the SubString( string1, start, length ) function, like this:

if($StartPos >0) #"VZ" was found
  $VZText  = SubString( $FileText, $StartPos, Length($FileText) -$StartPos )

That will put everything from VZ on from the text into the $VZText variable.
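For illustration, here is the same find-then-take-the-rest logic in Python. Python's str.find is the analogue of IndexOf; it is 0-based and returns -1 when nothing is found, so the not-found check differs slightly from the EZ-Script version. The sample text is made up.

```python
# Hypothetical file contents; the real text would come from FileReadAll
file_text = "DJIA 16433.09 VZ 44.87 -0.3%"

start_pos = file_text.find("VZ")     # analogue of IndexOf
if start_pos >= 0:                   # "VZ" was found
    vz_text = file_text[start_pos:]  # everything from "VZ" to the end
```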

But you want just the VZ and the price. How to get that depends on how the text is structured. Suppose it is something like this: "VZ closed at $22.55."

then you would do another search in the new $VZText variable looking for the dollar sign:

$StartPos =IndexOf($VZText,"$" )

What you do then depends on how the price is terminated. Is there always a period after it, or a space? If it's a period, look for the period that ends the price (keeping in mind that a price like $22.55 contains a decimal point of its own). To do that you will first have to further reduce the contents of $VZText. Something like this:

$VZText =SubString($VZText,$StartPos,10)

That will get the dollar sign and the next 9 characters. How many characters it gets isn't all that critical. Just so long as it gets enough to be sure to get the entire price every time.

Then you search for the period:

$EndPos =IndexOf($VZText,".")

  #Then you do this:
$FinalQuote ="VZ = "+SubString($VZText,$StartPos,$EndPos -$StartPos)

Or however you want to phrase it. You don't need to pull the symbol as such, since you already know what it is; only the price.
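Putting the whole chain together, here is a Python sketch of the same steps, using a made-up sample line. Note that a price like $22.55 contains a period of its own, so the sketch skips past the decimal point before looking for the terminating period.

```python
file_text = "Market update. VZ closed at $22.55. Volume was light."

# Step 1: everything from "VZ" on
vz_text = file_text[file_text.find("VZ"):]

# Step 2: the "$" plus the next 9 characters
dollar = vz_text.find("$")
vz_text = vz_text[dollar:dollar + 10]

# Step 3: the price ends at the period AFTER the decimal point
decimal_point = vz_text.find(".")
end_pos = vz_text.find(".", decimal_point + 1)
final_quote = "VZ = " + vz_text[:end_pos]
```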

You may have to tweak the $StartPos and/or $EndPos variables to get exactly what you want, adding 1 or subtracting 1 as needed.

There are parts above that could be eliminated or combined. They are in this explanation so as to take things step by step.

Let me know if you need further explanation.


It may be that the file has some weird characters in it, or my coding is bad, but I can't get your example to locate any text. The $StartPos never gets greater than 0. I have attached the file.


Ah, I see. It's an HTML file. I was picturing it as a simple text file with the VZ thing being only one small part of it. VZ is all over this file, in a repeating kind of pattern, I guess. I've only given it a cursory pass so I don't know yet. You should have gotten something other than 0 from IndexOf, though, assuming you did not add the space ("VZ " as opposed to "VZ"). It appears there are no instances of "VZ ".

Having said that, the fact it is HTML can be a plus because that format tends to spell out the various parts of what is displayed. Now that I have an example I can give you a better response.


The script language cannot work with the file because it has quote characters in it. There are bugs in the language which cause operations on strings that already contain quotes to throw errors or simply not work as expected. That is what is happening when I try to perform operations on even small portions of the file by simple copy and paste. If I get rid of the quote characters, it can then be processed properly. There is a function (ToLine) which will get rid of the tab characters, as well as any other unreadable characters, so that's not a problem.

To make it work within the script language the file would first have to be scrubbed of all quote characters. Then it should be able to be read into a variable by a File reading command and processed. The quote characters in the file are not necessary for processing the data to get the stock information from it.

Perhaps whatever you are using to get the data and place it into a file could be modified to get rid of the quote characters in the process? Otherwise you will have to perform a separate process to do that.

I may be able to help you further if I knew the process by which you get the data into a file on your hard drive.


I am using the following line to get the data and place it on my hard drive.


I found it in another script from another post. It works, gets the file and puts it on my computer, but that's all I can do with it.

Thanks for all your time and effort.


I see. Interesting. But you could bypass the writing to file part like this:

$RSSText =HTTPGet("" )

That will put the contents from the web site directly into a variable. Of course, that still leaves us with the same problem concerning the quote characters. There is no way I know of to scrub the quote marks from within the script, since it won't work with strings that already contain quote characters. If there were only 2 quote marks, one at the beginning and one at the end, it would be fine, but that's not how it is.

The only way I can think of off hand would be to go with the original command that you posted causing the data to go to a disk file. Then, using the Exec command, call a separate program which, upon starting, will read the file in and scrub all the quote characters, then write the result back to the same file. Then bring that back into ARC by a FileReadAll command. A very circuitous route but one that is workable. I can write a simple program to do that. I'll do that and see how it goes and get back to you. Maybe in the meantime someone can post here with a better solution. Good or bad, at least what I will do will work.
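The scrub step itself is simple in principle. Here is a minimal Python stand-in for the idea (the real HTMLScrub.exe is a compiled program and the file names are hypothetical, so this is only an illustration):

```python
def scrub(text: str) -> str:
    """Strip the quote characters (and tabs) that trip up the script language."""
    return text.replace('"', "").replace("\t", " ")

# The external program would wrap this in file I/O, roughly:
#   open("ModVZ.txt", "w").write(scrub(open("VZ.txt").read()))
```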

Question though, what do you want to do with the data once it's extracted?


Pass it to the robot to follow the market. Right now if I try to have it spoken, there is just too much other stuff in the file.

There may be another source for the data.


Thanks for all the help. I will keep trying to see if there are other options.


I have a solution for you. Not great, but workable. I have attached a zip file to this post. It contains 2 files: HTMLScrub.exe and HTMLScrub.ini. Put them in whatever directory you like. I have also uploaded a project, called "HTML Reader", with a script for reading the data from the website you used before to get the quote data. In it you will find a script called "Test". This script will read a modified file derived from the website download. It first calls the website and downloads the data as you did before. It then calls the HTMLScrub.exe program, which will modify the file and make it ready for reading by the script. Then the HTMLScrub program will automatically close itself. It only takes a couple of seconds to do its thing.

The rest of the script reads the modified file line by line and looks for the data to read out, speaking it as it finds it.

At the moment it says the data to the PC via SayWait commands. To make it go to the EZB4 you will need to change those SayWaits to SayEzbWait.

As you go through the script you will see places you can modify the text to say more what you wish for each category.

Before you can make it all do anything, however, you will need to put in the path to where the VZ.txt file should go. You will need to do this in two places.

One is in the HTMLScrub.ini file. Open it with Notepad and you will see a line which currently says "FilePath=C:\VZ.txt". Change the path to wherever you want the VZ.txt file to go. I believe you had it going to "C:\Users\Public\Documents" before; that will be fine here too. You can make the name of the file whatever you wish as well. So you could change the "C:\VZ.txt" to "C:\Users\Public\Documents\VZ.txt". Then save the file.

That's all that is needed as far as the HTMLScrub.exe program is concerned. It will modify the file and create an output file which will have the prefix "Mod" attached to it. In this case it would be ModVZ.txt. It will be the ModVZ.txt file that the script will use.

Then you need to modify the filepaths in the script to match as follows:

In the script are the following lines:

#Enter the full path and name of the files to be used
$VZFile ="C:\VZ.txt"
$ModVZFile ="C:\ModVZ.txt"
  #The full path and name of the HTML scrubber program.
$ScrubProg ="F:\HTML_ScrubD7\HTMLScrub.exe" 

You will need to change the paths of the various parts to whatever you will use. If you change the name of the file that will have to be reflected in the script as well.

For instance you would do the following if you put C:\Users\Public\Documents\VZ.txt in the HTMLScrub.ini file:

$VZFile ="C:\VZ.txt"
$ModVZFile ="C:\ModVZ.txt"

     #Would be changed to:

$VZFile ="C:\Users\Public\Documents\VZ.txt"
$ModVZFile ="C:\Users\Public\Documents\ModVZ.txt"

The $ScrubProg ="F:\HTML_ScrubD7\HTMLScrub.exe" line would be changed to wherever you put the HTMLScrub.exe and HTMLScrub.ini files. If you put them in "C:\Program Files\HTML_Scrub", for example, then that is what you would put in the script, as in:

$ScrubProg ="C:\Program Files\HTML_Scrub\HTMLScrub.exe"

And that's it. If all went well, it should get the latest data from the website and then say what the values are when you run the script.

Let me know if you have trouble and/or questions as to how it all works.


Just to let you know, I revised the project (HTML Reader) to make the code more efficient and uploaded it again.

EDIT I forgot to mention that the HTMLScrub.exe file also had to be modified to go along with the revised project file. I have attached it to this post. Please unzip the executable and put it in the directory where you placed the other one. You might want to save the old HTMLScrub.exe file just in case; maybe rename it to something else and leave it in the directory. Then copy the new one to that directory. Same with the HTML Reader project file: rename the current one you have and download the new one. So, basically: new project, new executable; old project, old executable.

If you use the new one you will have to change the settings in the Test script as well. Like you did with the old one. There is a new setting in the new project file to make it easier to change from sending the speech to the PC or the robot. At the top are the lines:

$Silence =0
$SendSpeechToPC =1
$SendSpeechToRobot =2
$SendSpeechTo =$SendSpeechToPC #Default. Change as needed

Just set $SendSpeechTo to whatever you wish to do.
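In other words, the setting just selects one of three output branches. A Python sketch of the idea (the function and message strings here are made up for illustration; the real commands are SayWait and SayEZBWait):

```python
SILENCE, SEND_TO_PC, SEND_TO_ROBOT = 0, 1, 2
send_speech_to = SEND_TO_PC  # default; change as needed

def speak(text: str, target: int) -> str:
    # Stand-ins for the SayWait (PC) and SayEZBWait (robot) commands
    if target == SEND_TO_PC:
        return "PC: " + text
    if target == SEND_TO_ROBOT:
        return "Robot: " + text
    return ""  # silence
```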


Back from Labor Day weekend, will give it a try.


It all works, with the exception of the Percent Change line. It reads some of the control characters before it gets to the data, but I may be able to figure it out by editing the script. Thanks for all your hard work; hopefully this will be useful for others also.


Another question. I am using another file; I can locate the string I am looking for, but not the value. Do you count the lines excluding the line I found the string on, or do I include it?


Sorry you're having problems with the script. I'll take the second question first. Yes, you include the line the string is on in the count.

What is happening with the Percent Change line is that they changed the color in the HTML code from red to green, and I was keying in on the phrase 'red>', which is now 'green>'. So, to get it back for the moment, you can change that part in the script. Obviously that is not a good long-term solution since it will change again. I was afraid of this and had a backup ( read 'more complicated' :) ) plan to overcome this.

In that regard, I am working out a different method for getting to the data desired. The data are bounded by a > and < symbol pair, or a '>' and the '&nbsp' phrase (&nbsp is a non-breaking space). There is one which has just '&nbsp' phrases bounding it.

The problem is there may be multiple such symbols and phrases in the incoming lines of text. So the method I will employ will allow for multiple characters or phrases to be sent to the line processing routine (:ExtractVal) so that the method can take the text away in chunks until it gets to the right one. Sort of like Pacman taking bites of the text until it gets to the data you want. This will be a better long term solution since it uses the basic structure of the line instead of any specific text in it.

To do that you would have to count the '>' or the '&nbsp' phrases (or whatever may be used in the future) ahead of the one that is just in front of the data to be spoken. Then, what is left of the line is processed as before. In this case the line looks like this:

<TD nowrap align=right width=50%><font color=green>2.37%< font>&nbsp

  #Here, we want to use the '>' symbol
  #After the first bite is taken out of it, the result will be:
<font color=green>2.37%< font>&nbsp

The next IndexOf for the '>' symbol that is done on the line will now find the '>' just ahead of the 2.37% value we are looking for.

To do this, I will change the $StartStr lines to include multiple symbols, or phrases, separated by the vertical bar character ( | ). It is usually found on the same key as the backslash ( \ ). In this case it would be:

$StartStr =">|>" 

  #In other cases it could be like this:
$StartStr =">|>|>" #Two '>' symbols ahead of the one we want.

In other cases there are no extra symbols ahead of the one we want like this one:

<TD nowrap>&nbsp As of: 8 Sep 2015 16:00:00 EDT&nbsp 

Here we key in on a &nbsp phrase and there is only one ahead of the data, so the $StartStr will simply be:

$StartStr ="&nbsp"

No separation characters needed.
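As a sketch of this bite-taking approach, here is a Python version of an :ExtractVal-style routine. The helper name and the '<' terminator default are assumptions for illustration; the real routine is EZ-Script.

```python
def extract_val(line: str, start_str: str, end_char: str = "<") -> str:
    """Take a 'bite' out of the line for each '|'-separated marker,
    then return what remains up to the next end character."""
    for marker in start_str.split("|"):
        pos = line.find(marker)
        if pos < 0:
            return ""                    # marker not found
        line = line[pos + len(marker):]  # bite off up to and including it
    end = line.find(end_char)
    return line[:end] if end >= 0 else line

line = "<TD nowrap align=right width=50%><font color=green>2.37%</font>&nbsp"
extract_val(line, ">|>")  # skips the TD and font tags, leaving "2.37%"
```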

I should have this ready sometime today and will simply put the revised script code directly in the post for you to copy and paste into your script, along with instructions.


Before you put too much work into this: I started thinking about it, and I think it would be better to get a status on the overall market than individual stocks. So I found a website that provides data on market indexes. It is

I have attached the two files in a zip.

Let me know what you think.


I'm afraid you didn't attach the files to the post. Nonetheless, the same principles will apply regardless of what website is used, so the same code can be used.

What is it you would like to have read out from the site? If it is all the indexes, that can be done without many modifications; perhaps put in a loop for the reading of the text from the website, index by index, for code efficiency. If it is specific indexes, that can be done by creating a string containing all the appropriate index names.


No, I don't think all the indexes (although someone else might want them all). I would like DJIA, NASDAQ, and NYA. Let me try attaching one more


While doing the revisions I found more characters that the script language doesn't like, so I was forced to revise the HTMLScrub program yet again. So I'm afraid you will have to overwrite the existing one with the one I have attached to this post. At least this time it will be compatible with the previous script. I tested it with many sites and ARC took all the modified results with no problem, so maybe this will be the last revision.

Also I have attached the new script for the new web site. It contains extensive revisions so you may want to just use it and change the appropriate lines again. These are:

$VZFile ="C:\VZ.txt"
$ModVZFile ="C:\ModVZ.txt"
  #The full path and name of the HTML scrubber program.
$ScrubProg ="F:\HTML_ScrubD7\HTMLScrub.exe" 

Change the paths to where the files are on your computer.

There is a new variable:

$IndexesToGet ="DJIA,COMP,NYA" #Note: COMP is NASDAQ

It contains the indexes you mentioned. If you want to add more just add them to the list with a comma in between each. The names placed in the list will, however, need to be the ones in the actual web site code as in the example shown (COMP is NASDAQ). They also need to be the same case as in the web code. Looks like they are all uppercase.

Basically the script works by looping through the $IndexesToGet string, extracting each name in turn (plus a prefix specific to the area of code we are looking for) and searching for that name in the modified web code. Once found, the lines which follow are short and simple to get the data from. The outer RepeatWhile loop extracts the names of the indexes to find, while the inner RepeatWhile loop actually finds each one and processes the data.
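The outer/inner loop idea can be sketched in Python like this. The sample text and its layout are made up; the real script works on the scrubbed web code.

```python
# Made-up stand-in for the scrubbed web code
mod_data = "x DJIA</a> 16433.09 x COMP</a> 4822.34 x NYA</a> 10107.44 x"
indexes_to_get = "DJIA,COMP,NYA"  # COMP is NASDAQ

quotes = {}
for name in indexes_to_get.split(","):  # outer loop: each index name
    pos = mod_data.find(name)           # inner part: locate it in the code
    if pos < 0:
        continue                        # index not present on the page
    tail = mod_data[pos:]
    quotes[name] = tail.split()[1]      # value assumed to follow the tag
```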

Let me know if you have problems or don't understand something.


I used the script in the zip file and modified all the location variables, but HTMLScrub is not creating the vzmod file. I searched my entire hard drive just to be sure I didn't misplace it, but it is not anywhere. Thoughts?


Never mind, figured it out. I removed the config file and added it back; it all works. Will test and see if there are any issues.


Another question. At this link

are regular stock market updates. Would it be possible to set up a script that would check and, if a new update is issued, maybe read off just the 1st paragraph? Not the whole update; that's way too much.


I was running some tests on the Yahoo web site before posting a response and found out it is nearly a megabyte in size. That makes for a huge file to process. In addition, it has tons of characters which must be scrubbed before the Script Language will accept it. It takes 2 to 3 minutes to scrub the whole thing, depending on the speed of the computer used. In this case particularly, that is unnecessary, since we only need a small portion of the code at the beginning. By comparison, the Marketwatch site was only about 70K total in size.

I decided the easiest way to address this problem would be to place a limit on the file size processed. This is done by setting a number in the HTMLScrub.ini file. Currently this number is set to 100K, but it can be changed to whatever is desired by changing the number in the ini file. What it does is take up to 100K bytes of whatever web site is used. If you need it to take more, you can change the number in the ini file. Just keep in mind: the more it has to take in, the longer it will take to scrub it for use by ARC.

When you run the new HTMLScrub.exe program for the first time, it will automatically add those two lines to the ini file so you don't have to do anything in that regard.
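The size cap amounts to slicing the raw download before scrubbing. A Python sketch of the idea (the cap here counts characters, and the function is a made-up stand-in for the compiled program, which reads the limit from its ini file):

```python
def scrub_capped(raw: str, max_size: int = 100_000) -> str:
    """Scrub at most max_size of the raw page, ignoring the rest."""
    chunk = raw[:max_size]
    return chunk.replace('"', "").replace("\t", " ")
```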

I will address your question in the next post.

The new HTMLScrub.exe file is attached to this post.


I have a question before I use the new HTMLScrub file: will it cause a problem with the older version? The older version is working like a champ, and I have tweaked the script so it does exactly what I want. Let me know your thoughts before I download and install.


It should be fine. The web site that relates to the current script is well under 100K in size. Of course it's always wise to save the older executable file just in case, but I can't think of anything this change would affect with what you have now.


Ok, I've been looking at the code for the Yahoo site and I have to say I have never seen so much stuff in the code for a site, ever. You have to go through over 300K of code before you get to the stuff you want, even though, on the web page, it looks like the data is just a little way down the page. This means, as the scrub program stands now, it has to load about 350K of the site to get to that data. That, in turn, means it takes about 30 seconds to scrub the data. And that means an additional 30-second delay from the time you issue the command to fetch and read the data until you hear it being spoken. There is, however, a way around that, but it will require that the scrubber program be a bit more clever about what it does.

What would happen is that the script would send out some data to a disk file before starting the HTMLScrub program. When the program starts it would look in this file to see if anything is there (or if it exists at all). If so, it would use that data to filter the incoming code from the web site such that it would only process a relatively small amount of it. Enough to contain the required stuff and not much more.

The script would send two things to the file. A string to be used for a starting point for the filter and a number which represents how much data to get after that point. In this way, the program would only have to scrub a small amount of data/code. It would be much faster than it is currently, even with the existing site being used now.
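That filter step can be sketched as a slice anchored at the start string. The marker and length values here are hypothetical; the real ones would come from the file the script writes out for the scrubber.

```python
def prefilter(raw: str, start_marker: str, length: int) -> str:
    """Keep only `length` characters beginning at the first occurrence of
    start_marker, so only a small slice needs to be scrubbed."""
    pos = raw.find(start_marker)
    if pos < 0:
        return raw                  # marker missing: fall back to everything
    return raw[pos:pos + length]
```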

It would not take much of a change to the script. Just a few lines added to the part just before the call to the scrubber. So none of what you have done would have to be redone. If you send me what you have now, I can add it to it for you and send it back. I've been working with this file transfer thing in another project so I already have tested code ready to go for it.

Basically it would not take much to add, yet it would give the scrubber a large speed boost, effectively causing it to take about the same brief amount of time to do its thing regardless of the size of the web site file. The new executable required would also be fully compatible with your script as it stands now, so no other changes would be needed.

BTW, on an unrelated topic, I can make the scrubber program come up in a minimized state so you would not see it at all while it is doing its thing, if you like.


Try this site. I think it may not be so much stuff to sift through.

Will send the script in another message.


Here is the script I'm using. Eventually I will use the quotedate and quotetime variables, but I'm not ready yet.

If you see any efficiencies that can be added or fixed within the script feel free to do so.

Thanks for all your help; I would've never gotten this far without your assistance.


Also, I am going to open another request for help. It has to do with the calendar one that I opened earlier.

Take a look at this site.

I would like to take this file, have the computer scan the dates, compare them to the current date, and, if one is within a week, start sending reminders verbally until the date has passed; then it can forget about it.

I don't care about times, just dates. Is it possible to scan the file and pull the information, date, and read the status/info text?
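For what it's worth, the date check being described is straightforward once the dates are parsed from the file; a Python sketch of the "within a week and not yet past" test:

```python
from datetime import date, timedelta

def needs_reminder(event_date: date, today: date) -> bool:
    """True when the event is today or within the next 7 days."""
    return today <= event_date <= today + timedelta(days=7)
```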


Opening another request, since it is different. Also a different/better website.

Let me know your thoughts. I think we could use a lot of what we've done already. The only part missing, I think, is the date comparison/lookup.


WBS00001 - I want to thank you for all your help. I couldn't have ever gotten this far without your assistance. I know that you have spent a lot of time helping me out. I believe your HTMLSCRUB file should be able to be used by many other users who didn't know how to get past all the bad characters in those HTML files. I am not sure if you had a chance to look at the modified script I sent you, but if you did, did you see anything that I should change to make it better?

I am still trying to study and learn the script you provided for searching through the modified HTML file and finding the data that is useful. It is confusing but will keep trying.

At this point, please forget my other requests. I really do appreciate and understand the time you devoted to helping me. I do believe others will benefit from your work.

Thanks again.



You're very welcome. Glad to help, really. Besides, I find what you're trying to do interesting. While figuring out how to do these things I learn stuff too. Useful stuff. "Scraping" things off web pages (which is what this technique is often called) can be applied to a lot of things.

I have given your code a cursory once-over and it appears you are learning well. What you have written shows that you are catching on to what is going on in the script and the overall concept of getting the data you want. When I get the chance I'll look it over in more detail, but it seems fine for now.

Mostly I'm modifying the scrub program so that it can handle pretty much any size web page. It will be compatible with what you have now concerning the Marketwatch site, so there should be no problem there. It will, however, be much faster in the scrubbing process for all sites. The code in the script to make this happen will be minimal, with just a few lines added to the beginning. A few lines will be added to the HTMLScrub.ini file as well, but I will add them for the current sites before I send it and the new executable to you. Of course, I will explain all that in a post when I upload the new files.

Anyway, once this is done and working on your end, I'll be glad to help you with the other things you want to do concerning the time and date things. The requests you made have inspired me to look into making a script for doing those sorts of functions. It would be something that would be called with Command Control functions for ease of use. I feel sure it would be helpful to many, myself included. :)


Look forward to trying out the modified Scrub program. How do you make that Scrub program? I know it is done outside of ARC.


I use a software development system called "Delphi." It's quite popular in Europe but not as much in the US. Michelin USA uses it. I know because I introduced them to an early version of it while I worked for them in their central engineering dept. Last I heard they were using more up to date versions still. It's kind of like the Microsoft development systems, but it is not married to any one platform like the Microsoft stuff is. When I left Michelin and struck out on my own, I bought the Enterprise version to do independent software development. An investment of several thousand dollars. Ironically, one of my first jobs as a contractor was for "The Children's Place" chain of stores, using the very early version of the system again. They used it as their Point of Sales software and I was able to keep it running with new devices for a few more years.

Ok, that's more than you ever wanted to know, I'm sure. Suffice it to say, it's a rapid prototyping development system which allows me to develop programs fairly quickly. Though this one is taking a while because I can't devote as much time to it as I would like. Still, I expect to have it ready in a couple of days, along with something with the Yahoo site.

I've managed to come up with a method that will allow the ARC script to find out where the HTMLScrub program is and use that information to get to the HTMLScrub.ini file to find out the rest of what it needs to know to run in conjunction with it. There will be no need to enter the file path to the HTMLScrub program manually in the scripts. So the Scrub program can be placed anywhere at any time and the script will find it. That also allows strings to be stored in the HTMLScrub.ini file that can't be used in the scripts.

Additionally, the program now has a Setup and an Error mode. It can be called using a script so as to bring it up in Setup Mode to make changes to the options and the other things in it. In other words you don't have to access the HTMLScrub.ini file directly to make changes and additions to it.

The Error mode will halt the program if there is an error in the scrubbing process so that you can see what the problem is. That's optional, however.

Finally, I've managed to have it do its thing without popping up to be seen in the process as it does now. It still pops up, but it is invisible, so there will be no indication anything even happened. That is also optional, but I think it will be best to leave it in invisible mode. Even if it were visible, it goes so fast now, even on large files, that it's there and gone in a moment.


I look forward to testing out the new HTMLScrub file. I checked out Delphi, I think I will keep that for my phase 2 learning. Thanks.


Sorry it's taking so long for the update. I ran into a couple of snags. Overcame them, but then I decided to separate out the common parts of the code from the 3 websites I've been working with. Not too big a deal right now, but, as more and more sites of various sorts are added it will be best for the long run. In doing so, I've had to introduce the concepts of CommandControl being used to make "function calls" and the use of init scripts. The latter is probably something you are already familiar with at least. :)

In any event, the basic ideas are still the same and the code to doing it hasn't really changed. Just that what was done with local Gotos before is now being done with CommandControl instructions. I've made a project with it all in it and that is what I will upload to the cloud. Plus, of course, the usual update of the scrub program attached in the usual way.

Since the scrub program is a bit more involved now I will be sending the next version in an installer to simplify the process of putting it all in. I've put all the files I test with in the same directory as you use so you won't have to make any changes to the settings in that regard in the future. All in all, it should be pretty much plug and play.


How do you plan on handling different web sites, different data? What I have been doing is placing the scrub file in different directories and updating the ini with that data. Works great, have it working across a few sites. I used the script you uploaded and just modified it to handle the different sites, path names, and the varying data. No hurry on the update, I am using the other and learning as I go.


Also, I can't find your HTML script in the cloud anymore. Did you move it or rename it? I need a clean copy so I can test out some things; I modified the one I originally downloaded.


I'm not quite sure what you are doing. From what I read, it seems like you are placing a copy of the HTMLScrub.exe file and the HTMLScrub.ini in different directories, then modifying the ini file to suit the needs of each website, then using one script to access them all, calling each from its own directory. Is that it? If not, please clarify.

What I'm doing basically is using a different script for each site, but always the same HTMLScrub.exe file and the HTMLScrub.ini files in the same place. All the website scripts are placed in one Script Manager I call "Main". If I want to call them from yet another script, I would simply call the various other website scripts via CommandControl instructions. Such as:


Each website script has different needs as far as getting the appropriate data is concerned and has a few lines at the beginning to determine which site to scrub.

Each Website script also sends a name to the scrub program indicating which site it is to scrub. This is done via a file called XFer.cfg. The scrub program uses that name to determine how to handle the raw file generated by the script concerning Starting Position and how much of the raw file to scrub. All that information for each website is in the one ini file.

I'm going on the assumption that fresh data from the website is desired every time it is accessed, so there is no need to store data on a per-website basis. However, that could be done using the name sent in the XFer.cfg file. Currently, all the files related to the operation are stored in the same directory as the Scrub program. They also have more generic names now: the raw data from the website generated by the script's HTTPGet command is called, appropriately enough, RawData.txt, and the scrubbed file ModData.txt, instead of the VZ names, since those were just a holdover from the past. You can call them whatever you wish, though, since the initialization script for the Main script manager gets those names from the ini file now. They no longer need to be declared explicitly in the scripts. Nor do the paths to them (or anything else); that's all handled automatically.

Just saw your latest post. I'll put the old project back up for you to download. I took it down because it is outdated now.


@WBS, your third party apps are neat - have you considered creating plugins instead? It's much easier than you think, and since microsoft made .Net free, it doesn't cost you anything. I also publish all my plugins open-source which you can view for additional examples.

Here's a tutorial on step by step to create a plugin:



Thank you for the kudos. I finally took a few to really look at the tutorial you suggested and found it to be very well detailed. Looks like I don't have to know all about C# to at least try the steps listed and see how it all comes together for myself. I've had the Community version for some months now and played around with it some, but only using VB. No doubt I can use C# as well, still the same components and the like after all. And an algorithm is an algorithm regardless of the language.

Anyway I'll give it a go when I finish my latest round of software stuff I'm doing now. Still, I wonder if I could twist the interface you have for Microsoft .NET and make it work with Delphi? A DLL is a DLL also, after all. Just need to get the calling conventions right. I've used C++ DLLs in my Delphi code from time to time. The newer versions can even work with .NET as well. Might look into that too.


if you can add the ARC and ezb as a reference to your project in Delphi, then it will work I suspect. I haven't used Delphi since around 1998 or so...

I assure you c# is incredibly powerful, very popular and really easy to use. Think you will like it :)


WBS....I have been trying not to bother you, but I have a problem that I can't seem to overcome. I have been using the HTMLScrub program to look through data on this website. My goal is to search by date and then output information for the current date. I have successfully done this with other sites using versions of your code. When I do it on this site, I can print out the line of data when it finds the string I am looking for. But the moment I try to use the substring command to go through and pick out the information I want, I get the error "Input string was not in the correct format." Any thoughts about what I am doing wrong? Thanks.


No problem. You're not bothering me at all. Could you put one of the input strings which is causing the error message in your next post? That's one error I've never gotten for the SubString instruction.

I really need to turn loose of the code I have for the update, I just can't seem to stop tinkering with it. :)


This code is not complete. Right now I'm just playing with it, but if you run it you will get the error. If you could tell me what I'm doing wrong, that would be awesome.

$PrintDebugOn =False
$Silence =0
$SendSpeechToPC =1
$SendSpeechToRobot =2
$skipit =0
$stockstatus =0
$foundit =-1
$point =0
$SendSpeechTo =$SendSpeechToPC #Default. Change as needed

if ($month=9)
  $mymonth="Sep"
endif

$todaydate =$mymonth+" "+$day
$thedate =$todaydate

#Enter the full path and name of the files to be used
$VZFile ="C:\users\public\documents\calendar\asteroid.txt"
$ModVZFile ="C:\users\public\documents\calendar\Modasteroid.txt"

#The full path and name of the HTML scrubber program.
$ScrubProg ="C:\users\public\documents\calendar\HTMLScrub.exe"

$IndexesToGet ="<li>"
$IndexesToGet ="DJIA,COMP,NYA" #Note: COMP is NASDAQ

FileDelete($VZFile) #Clears the file and prevents adding to it needlessly
FileWrite($VZFile,HTTPGet("" ))
FileReadClose($ModVZFile)
Sleep(2000)
Exec($ScrubProg) #Modifies $VZFile and creates $ModVZFile
Sleep(5000) #May need to be adjusted depending on file size.
FileReadReset($ModVZFile)

$TheIndexesToGet =$IndexesToGet+"," #So we don't mess up the original

#$Prefix is used to get the correct line with the index name.
#This is needed because the index name is used in many lines.
$Prefix ="<td class=symb-col><a href= investing index "

repeatwhile($TheIndexesToGet !="")
  $IndexPhrase =$Prefix+Split($TheIndexesToGet,",",0) #First index
  if ($PrintDebugOn)
    Print($IndexPhrase)
  endif
  FileReadReset($ModVZFile) #Always start from the beginning
  $ExitLoop =False
  repeatwhile(FileReadEnd($ModVZFile)=0 AND $ExitLoop =False)
    $TheText =FileReadLine($ModVZFile)
    $StartPos =IndexOf($TheText,"Asteroid")
    $foundit =$StartPos

    if ($foundit > -1)
      $caldate =SubString($TheText,5,6)
      $data =SubString($TheText,10,10)
      $thedate =$caldate

      if ($todaydate = $thedate)
        Sleep(5000)
        Print($todaydate)
      endif

      # if ($v1>0)
      #   Print($line)
      #   $stopreading =$stopreading+1
      #   Sleep(3000)
      # endif

      # if ($thisdate=$closedate and $ready="yes")
      #   $arraycounter =$arraycounter+1
      #   $objectname[$arraycounter] =$asteroidname
      #   $objectdist[$arraycounter] =$distance
      # endif

      # $x =FileReadEnd("C:\users\public\documents\calendar\ModAsteroid.txt")
    endif
  endrepeatwhile
endrepeatwhile

#sort test
repeatuntil($sortcounter = $arraycounter)
  Print($sortcounter+" "+$arraycounter)
  if ($objectdist[$thenumber] < $objectdist[$sortcounter+1])

Say("There are "+$arraycounter+" objects approaching Earth today")
Say("The object approaching closest is "+$objectname[$thenumber]+" and it will be within "+$objectdist[$thenumber]+" myles of Earth")

Sleep(5000)
FileReadClose("c:\users\public\documents\calendar\ModAsteroid.txt")
#halt
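As an aside, the fixed-offset date extraction the script above is attempting is easier to reason about in isolation. Here is the same idea sketched in Python; the line layout and offsets are invented purely for illustration:

```python
# A made-up scrubbed line: the date sits at a fixed position in the line,
# mirroring the SubString($TheText, start, length) calls in the script above.
line = "Item Sep 14 Asteroid 2016 AB passes today"

def extract_field(text: str, start: int, length: int) -> str:
    """Mimic EZ-Script's SubString(text, start, length).
    EZ-Script positions are 1-based; Python slices are 0-based."""
    return text[start - 1:start - 1 + length]

caldate = extract_field(line, 6, 6)  # grabs the 6 characters "Sep 14"
todaydate = "Sep 14"
print(caldate == todaydate)
```

The fragile part, in either language, is that the offsets only hold if every scrubbed line shares the same layout; a search (IndexOf) for a known marker before slicing is more robust.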


Ok. I found out what it is. At first I thought it might be the ampersand or the colon, but after eliminating those it still did it. I had to use a process of binary elimination to finally get the string down to what it actually was. Turns out it appears to be another bug. Anytime there is a capital "E" in a string, it seems to be interpreted as an Exponent reference IF it is followed by a number.

So E2, E0, E1234, whatever, will cause the same error. This will be somewhat more difficult to eliminate. I hate to have the scrub process take out all occurrences of E0, E1, E2 ... E9. That will take a lot of extra time. I could have a space inserted any place there is a capital E, but that would cause problems with actual words beginning with a capital E as well. Probably I'll have to do a special search for the E-Number combination and deal with that, as opposed to the simple search and replace the program does now. Hopefully it will still be nearly as fast, even on a large file. I think I'll allow for an update to a line in the ini file to tell the program what characters to eliminate or replace. That way you can add to it as needed, rather than my having to send a revised executable every time. Of course, the problem with that is that it would not have helped in this particular case. Maybe I can figure a way to include this sort of problem as well. We'll see.
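The targeted search described above (touch a capital E only when a digit follows it, leaving words like "Earth" alone) is a single regular-expression replacement. A Python sketch of the idea, not the Scrub program's actual code:

```python
import re

def defuse_exponents(text: str) -> str:
    """Replace a capital E that is directly followed by a digit with a
    lowercase e, so the script interpreter cannot mistake it for
    E-notation. Ordinary words like 'Earth' are untouched because the
    lookahead (?=\\d) only matches when a digit comes next."""
    return re.sub(r"E(?=\d)", "e", text)

print(defuse_exponents("Asteroid E2016 approaches Earth"))
```

Lower-casing is just one workable substitution; inserting a character after the E, as considered later in the thread, works the same way with a different replacement string.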

But before I can send a replacement executable file to replace the one you have, I need to know if the ini file you are using has the following lines in it:

[FileLimits]
MaxFileSize=100000

The number may be different but that doesn't matter. That way I'll know which of the earlier versions of the program I need to revise to solve this problem.

Also, where this site is concerned, are you wanting to be able to search the whole site or just that part that deals with the current month and year? If just the current month or so, the scrub process could be made much faster by dealing with considerably less of the code from the webpage.


Glad to know it wasn't something I did wrong. I looked for 3 days until I gave up.

Yes, I have the file limit set to 100000. Also in the INI are the path settings.

I am not sure it is worth the extra work. Is there really anyone else who is ever going to run into this problem?

What if you replaced E with EE? Would that fix the problem of it being considered an exponent? That could be edited back out with the script to change it back to E if needed. But again, I'm not sure it's worth the effort.

I am just looking at the current month and year.


I'm afraid EE won't work. Even if the E is preceded by other letters (or symbols), the same problem occurs. So EE0 still causes the error, as will ASEFGHeE01.

That said, I believe I have found an answer that will not require extra programming. Changing E to E' does not seem to have any effect on the pronouncing/saying of the words, so it looks like there is an easy solution after all.


Here is an update to the version of program you have. I tried it with your script and the directories you used before (C:\Users\Public\Documents\calendar) so it should be ready to go. Seems to locate the date just fine now. Let me know if you have any problems with it.

BTW, when the script crashes, it can leave files open. That can make it impossible to do things like save a modified version of the Modasteroid.txt file if you do any modifications by hand for testing. Here is a short script for closing the files used:


You may want to copy and paste it into a script called "CloseFiles" or some such. That way you can close the main files involved, by running the script, should there be a problem. I ran into that issue several times while tracking down this problem.


Ok, good idea. I have had the open files many times while working out the different scripts. Will take your advice and make a close file script.

Will let you know how the new file works.

Thanks for the hard work. Will be in touch.


The new scrub file changed the layout of the modified file, so I am going back and reworking my scripts to use the new files. Not a big problem, certainly not a complaint. But I do have a favor to ask: the original script file that pulled the market indexes does not work anymore. I can't quite figure that one out. Would you be willing to take a look and repost it using the new scrub program?

Other than that, things are working well.



Sure. Be glad to have a look at them. Sorry the latest change caused problems. I guess that extra character on the Es made more of a change than one would think.

All in all, maybe now might be a good time to get with the new version of things I have been working on. What I have works with the two Yahoo sites and the Marketwatch site. So you could start those from that basis if you like. The whole thing can be totally separate from what you have now as well. No overwriting of what you have.

The way the code goes is pretty much like before. The only major difference is that, instead of Gotos to go to parts of the code that are common to many functions of the script, CommandControl instructions are now used. For example, the Goto to ExtractVal is now CC($_Main, ScriptStartWait, $_ExtractVal). That way you don't have to repeat the same Goto script in each script that needs that bit of code, nor do you have to update every one when it comes time to make a change to one of the Goto routines.

What I can do is integrate whatever you have now into the new system, so you won't have to climb that hill at least. So send whatever you have, finished or not, and I can integrate it and get it working to that point at least.

While on the subject, I was wondering why you are using separate directories and multiple copies of the scrub program? Perhaps there is something I need to do to make the program setup more versatile?


Let me answer your last question first. I was keeping them separate for 2 reasons. The first was that if I got something working, I didn't want to take any chances on messing it up. The second was that as I was working on a script I would like to pull up the modified file to check the layout, so I kept it separate. After thinking about it, I probably didn't need to do that.

I was never able to get the MarketWatch site to work to my satisfaction. Also, I will try working on the Calendar site. What do you mean when you say send what I have? The scripts I have set up that use the HTMLScrub program? I made a couple on my own. I also have the modified HTML script that is not working, but that I tweaked for my use. I am not at the computer with the scripts on it, so I will send you another post with those scripts in an hour or so.


I have attached a zip file with 2 scripts. One is the market index script that I modified but which stopped working with the new HTML program. The other is an ISS pass script that I am working on. Hopefully these are what you wanted to review. The market index script is the one I need to get fixed, but following your script, which works well, is complicated for


Finally got to looking at the files. The problem I found was in this line:

$Prefix ="#<td class=symb-col><a href= investing index "

It should be:

$Prefix ="<td class=symb-col><a href=/investing/index/"

This is likely because of the latest executable I sent to you. In one version I had replaced the forward slash symbol ( / ) with a space during the scrubbing run. I later realized that was not necessary and stopped doing it, both to give the scrub process a little speed boost and because I found the slashes were needed in some cases. If you're using the new program file with this script now, that would be why it suddenly stopped working. Fortunately, that seems to be the only search string like that in the script. Sorry for the problem, but at least you will now know what to look for in any other scripts which may give you faults with the new executable. The executable I'm using with the latest version of the whole thing works the same way, so any changes made will still be good to go with it as well.


Just a quick note to let you know I haven't forgotten about you or given up. Simply haven't had much time to work on the scrubber and associated scripts this past week. This week should be better.


WBS00001. No problem. I went back to v2 of the scrubber for the stock market indexes and it is working fine. Take your time, I appreciate your support and help.


As Wall-E would say, "ta-da"

I have FINALLY uploaded a project with all the new stuff. Assuming you even care at this point. I wouldn't blame you if you didn't. It's been a MONTH after all. I'm very sorry for letting you down like that. It took so long because I didn't have as much time to work on it as I thought I would and I went wild with new paradigms, going through iteration after iteration until I boiled it all down into something sane. At least I think it's sane. The insane are always the last to know they're insane.

I've attached the new scrub program which has ancillary programs along with it for displaying errors to aid in troubleshooting. Also there is a much expanded ini file and an editing program to go along with it. The installer will place a new icon on your desktop from which to access the editor. It can also be accessed from within the "Main" script manager by pressing the Start button for the "ScrubIniEditor" script in the TextScrub script manager.

The attached file is a zip file containing an installation program to simplify the installation process. I've made the default installation path based on what you had earlier, but you can change it to whatever you wish. However, if you go ahead and use the default directory, at least we will both have the same setup, which should make troubleshooting easier. In any case, putting it into an entirely different directory will keep the new stuff from interfering with the old stuff.

I plan to add much more to the methodology I have started with this project, but I am going to freeze this project as it is now. No matter what else I may do in the future, I will be able to call this back up for making changes independently of anything else. That way you won't have to worry about changing your scripts again, once adapted to this way of doing things.

In addition is a project I have uploaded to the cloud which contains all the latest and greatest. It's called "Scraping Data From Web Pages." It has several Script Managers, but the one labeled "Main" is the one you will be interested in most. It contains the scripts dealing with the stock market webpages. The ones to the left contain scripts to support the ones in the Main Script Manager. In each Script Manager is a help script. CommonHelp, MainHelp, and TextScrubHelp. The one you will use to start off is MainHelp. Starting it will bring up a file with what you need to get started (that is, after you have installed the program files from the attached file).

The MainHelp file discusses things such as the Parameter Passing scheme and Stack usage, but mostly it goes over the changes brought about by the "Function Calling" methods used now and why this was done. It also goes over the layout of the example project.

I'm sure you will have questions galore, feel free to ask away. Or .. curse me and the very ground I walk on for changing it yet again. :) Seriously though, I believe in the end you will actually find it easier to use. It still does things in the same general way as before, just that the actual tasks are carried out by other scripts. This is done through a pseudo parameter passing scheme and CommandControl instructions instead of direct Goto commands within the same script, as is the case now. Overall, it should require minimal changes to your existing scripts.


WBS00001 - The HTMLScrubber program no longer works with the latest version of EZ-Builder. I think the HTTPGet writes the file out differently. Any thoughts?


Send me an example and I'll see if anything had changed that would affect it


Do you mean the code? Via email?


DJ. There are 3 files. One is the script; the 2 other files must be placed in the Public Documents directory, in a folder named test, on the C drive.

This script runs with the older version of ARC but not the new. Something with either FileWrite or HTTPGet might be affecting it.

Run this on a pc with the older version and then try on a pc with the newer version.

Please let me know if you need more info.


And any suggestions for script improvement would be appreciated. :D


I see what happened - i'll post an update today for you


Use the latest ARC and your code will work without needing to strip any characters. I modified your EZ-Script a little...


$VZFile ="C:\temp\test.txt"

$todaydate =($year + "-" + $month + "-" + $day)

$url1 ="" + $todaydate + "&end_date=" + $todaydate + "&api_key=3jMUBZIcgvFXVAMouSZqlWnbjg5N1sPmgsHiEyCD"

if (FileExists($VZFile))
  FileDelete($VZFile) #Clears the file and prevents adding to it needlessly

$endOfFile = FileReadEnd($VZFile)

  $line = FileReadLine($VZFile)

  $field1 = IndexOf($line,"name:")
  $field2 = IndexOf($line,"absolute_magnitude_h: ")
  $field3 = IndexOf($line,"feet:")
  $field4 = IndexOf($line,"is_potentially_hazardous_asteroid: ")
  $field5 = IndexOf($line,"close_approach_date:")
  $field6 = IndexOf($line,"miles_per_hour:")
  $field7 = IndexOf($line," miss_distance:")

  if ($field1 > -1)
    $field1a = Length($line)
    $name = SubString($line,$field1+5,$field1a-7)

  if ($field2 > -1)
    $field2a = Length($line)
    $mag = SubString($line,$field2+22,$field2a-24)

  if ($field3 > -1)
    $line = FileReadLine($VZFile)
    $field3a = IndexOf($line,"estimated_diameter_min: ")

    $line = FileReadLine($VZFile)
    $field3b = IndexOf($line,"estimated_diameter_max: ")

  if ($field4 > -1)
    $haz = SubString($line,$field4+35,3)

  if ($field5 > -1)
    $field5a = IndexOf($line," ,")
    $closedate = SubString($line,$field5+20,$field5a-21)

  if ($field6 > -1)
    $MPH = SubString($line,$field6+15,7)

  if ($field7 > -1)
    $line = FileReadLine($VZFile)
    $line = FileReadLine($VZFile)
    $line = FileReadLine($VZFile)
    $line = FileReadLine($VZFile)
    $field7a = Length($line)
    $miles = SubString($line,7,$field7a-8)
    $ready = 1
    $field7 = -1

    # print($name)
    # print($mag)
    # print($diamin)
    # print($diamax)
    # print($haz)
    # print($Closedate)
    # print($MPH)
    # print($miles)
    # sleep(3000)

  if ($ready = 1)

    if ($ahaz[$arraycounter] = "fal")
      $hazard = $hazard + 1

  $endOfFile = FileReadEnd($VZFile)

repeatuntil($sortcounter = $arraycounter)

  if ($amiles[$thenumber] < $amiles[$sortcounter+1])

if ($hazard = 0)
  $hazardstatement =(" None of these encounters are rated as potentially hazardous.")
else
  $hazardstatement =("with " + $hazard + " rated as potentially hazardous.")

SayWait(" There are " + $arraycounter + " Near Earth Asteroid flybys " + $hazardstatement)

if ($arrayCounter > 0)
  SayWait(" The asteroid designated " + $aname[$thenumber] + " will safely pass by earth at a distance of " + Round($amiles[$thenumber]) + " miles.")
  SayWait("It is estimated to be between " + Round($adiamin[$thenumber]) + " feet and " + Round($adiamax[$thenumber]) + " feet in diameter and is traveling at a speed of " + Round($aMPH[$thenumber]) + " MPH.")


DJ. Did you actually get the script to work? I downloaded and installed the latest version of ARC, used your script, and it downloads and writes the file, but it can't find any info for today, and I know there is some. Am I doing something wrong?


Ok, I see. I need to reparse through the data to find my values. Much work. I have probably 5 others I need to redo. I did get this one working though. Thanks.


I'm a bit late to this party. Yea! The left paren issue has been resolved! Unfortunately, there are other things commonly found in the strings which make up a web page that will cause problems and crashes, so filtering will still be needed in other cases. For example, quote marks, or LF and CR characters in cases in which the user wishes to take the entire web page contents into a single variable at once.

If you look at the output file, you will see that search terms like name: and absolute_magnitude_h: are now (without filtering) name : and absolute_magnitude_h : , with a space before the colon which the filter program must previously have taken out. So you would need to change those search words, as well as the others, to include the space for them to work properly. That's why they are not being found currently.

EDIT: Ah, I see dbeard, you have found the problem while I was checking things out and creating this reply.


WBS00001, are there other things your program was clearing out that are going to cause a problem? With the update, your Scrub program no longer works.

I have got the script working after fixing a lot of the parsing. :)


@dbeard In what way is the Scrub program not working?

The Scrub program does some pre-emptive character replacement in addition to the left paren removal. It also deletes double quotes and changes E to e. It gets rid of quotes because strings in the Scripts are bounded by them. So any in a string read from a web page will cause a crash or truncated string. This isn't a bug though, just a fact of life. Lower casing E prevents the problem when an uppercase E is followed directly by a number. The script interpreter will try to treat it as an E-notation value and crash in so doing. Also a backslash can sometimes cause problems so they are removed, as well as the semi-colon.
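The replacement list described above (drop double quotes, backslashes, and semicolons; defuse a capital E when a digit follows) can be sketched as a small filter. This is only an illustration of the idea, not the Scrub program's actual code:

```python
import re

def scrub(text: str) -> str:
    """Strip characters that crash EZ-Script string handling, per the
    list in the post above, then lower-case any capital E that is
    directly followed by a digit so it cannot be read as E-notation."""
    for ch in ('"', "\\", ";"):
        text = text.replace(ch, "")
    return re.sub(r"E(?=\d)", "e", text)

print(scrub('say "E2 dist"; path\\file'))
```

A real scrubber would likely read the replacement list from the ini file, as mentioned earlier in the thread, rather than hard-coding it.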

As your project currently stands it seems to be free from any of those problems, given the web page you are using.

I have a new version of the scrub program which is more versatile than what you are currently using should you be interested. It is used pretty much like you are using what you have now. Switching to it would not be too difficult. If you post your project to the cloud I could download it and modify it to use the new stuff, then upload it back.


Thanks WBS, the capital E has been resolved for the next release


WBS00001 - I will let you know once in the cloud.


OK, wbs00001, it is out there. if you want to talk via email, it is


String quotes are working as well for next release. Need a few more tests before i put it online

User-inserted image


@dbeard After giving it more thought, and the fact that DJ will soon be fixing the quote problem and the capital E problem, I'm thinking you will be better off without the Scrub program at all. Just do things like DJ last showed and let it go at that. Your only real problem was with the left paren anyway. At least with the script you posted. I see you have other scripts for other web pages as well. I haven't checked them out.

The only other thing the Scrub program can do for you is to limit the amount of text gotten from the web page. That can be helpful in that you can avoid having to wade through many lines of text to get to the text with the information you want, which can speed up response time. If there were an instruction to set the file pointer to a specific position in the file, that too could, effectively, be done in the script.

Of course, it can still provide substitutions, exchanging a character or group of characters for another character or group of characters, including non-printable characters. But I think now most, if not all, substitutions can be made in the script instead. There is still the problem of LF or CR characters crashing the script if loaded into a variable, though. That would happen if you decided to load the entire web page at once into a single variable using the FileReadAll script instruction. Reading it line by line avoids that, however.
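Setting an effective "file pointer" by text position, as described above, amounts to slicing the page from a marker. A minimal Python sketch of that idea (the page text and marker are invented for illustration):

```python
# A made-up page with header junk before the part we care about.
page = "<html>...lots of header junk...<td class=symb-col>DJIA 34,000</td>"

def from_marker(text: str, marker: str) -> str:
    """Return the page text starting at the first occurrence of marker,
    or the whole text if the marker is absent (a safe fallback)."""
    pos = text.find(marker)
    return text if pos < 0 else text[pos:]

print(from_marker(page, "<td class=symb-col>"))
```

Skipping ahead like this before any line-by-line parsing is what saves the response time mentioned above.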


Is that a feature request for the HTTPGet command? To limit the amount of data?


WBS0001 - Ok, thanks for all the help. DJ, can you make the data limit something the user can decide, like an added parameter to the command. Sometimes good to get all the data and sometimes not. Also, DJ, could you delete my app from the cloud. please.


@DJSures If it were a feature, it would be good if there were 2 parameters associated with the HTTPGet function. Right now the Scrub program can limit the data gathered from the web page in 2 ways. One is by supplying an integer specifying a start position from which to begin and another integer specifying how much to get starting at that point. If the amount to get is beyond the end of the web page text, it simply gets whatever text there is instead.

Alternatively, two strings can be specified. The Scrub program uses the first string as a search parameter to locate the start position that way, and the other to specify an end position to indicate where to end. Additionally, the search for the end position does not begin until after the point at which the start string was found in the page text (plus the length of the start string). However, if the start position string is not found, then the end string is ignored and the entire web page is returned instead. If the start string is found, but not the end string, then the web page text is gathered to the end of the web page. It also pops up error messages should either of those conditions occur. There is a way to suppress those error messages should the user desire not to have them appear.

The program allows for mixing the two methods as well. For example a string to find the start position and an integer to specify how much text to get from the point at which the start string was found.

So, basically, to do the same thing, the two parameters would have to be such that they can take either an integer or a string.
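The mixed integer-or-string start/end scheme described above could look like the following in Python. This is a sketch of the behavior as described in the post, with the fallbacks included, and is not the Scrub program's source:

```python
def window(text, start, amount=None):
    """Extract a slice of page text.

    start  -- an int position, or a string to search for
    amount -- an int count of characters, or a string whose first match
              after the start marks the end; None means 'to the end'
    If a start string is not found, the whole text is returned; if an
    end string is not found, text is gathered to the end of the page,
    matching the fallback behavior described above.
    """
    if isinstance(start, str):
        pos = text.find(start)
        if pos < 0:
            return text  # start string missing: return everything
        begin = pos + len(start)  # begin after the start string itself
    else:
        begin = start
    if amount is None:
        return text[begin:]
    if isinstance(amount, int):
        return text[begin:begin + amount]
    end = text.find(amount, begin)  # end search starts after the start match
    return text[begin:] if end < 0 else text[begin:end]

print(window("aaSTART-data-ENDzz", "START", "END"))
```

In a typed language the two parameters would be a union or overloaded type; in EZ-Script terms, each parameter would simply accept either an integer or a string.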


@dbeard You can delete the project yourself by going to it in the cloud. A delete option should appear for you since you posted it.


Thanks! I don't think the parameters are necessary after reading your description. The EZ-Script SubString will do the same thing

# Get 50 characters after the 10th character
$x = SubString(HTTPGet(""), 10, 50)

*Note: that won't work for you currently because of the quote issue in the current ARC, until the next ARC update. But it works on mine :D


@DJSures Since we are on the topic of fixing errors in the script, I would like to take this opportunity to reiterate the fact that the script will also crash if the strings from a web page contain Line Feed or Carriage Return characters and the entire web page is loaded at once using FileReadAll. Correcting this would be helpful to folks who would like to work with the entire text from a web page at once from a single variable, as opposed to, reading the file line by line.

EDIT: Removed the request for optional parameters in FileReadAll since the SubString function will do that as well.

Thanks, once again, for all your efforts.


You bet! Line feed has also been resolved