Discussion:
slowness of IPS + https on T5220 class hardware
Philip Brown
2013-01-18 16:13:42 UTC
In attempting to diagnose why Solaris 11 installs are ludicrously slow
through Oracle OpsCenter, I came across the following nastiness:
It appears that pkg operations over SSL are mindbogglingly slow for
T5220 class hardware.

I know that the CPUs on that generation of "Niagara" hardware are weak,
but they do have an SSL accelerator. In dealing with scp transfers
within our network, I compared an internally compiled scp to the
Solaris-supplied scp, and the Solaris scp, which uses the acceleration
support, was significantly faster.

So I'm wondering if there is an issue with Python not using the
acceleration hardware for SSL, or something?

Comparison numbers:


From x4100 class hardware --
***@ovm-svr3:~# time pkgrepo list -s
https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
real 0m16.664s
user 0m13.317s
sys 0m0.512s
***@ovm-svr3:~# time pkgrepo list -s
https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
real 0m13.931s
user 0m12.588s
sys 0m0.433s

From T5220 hardware --
***@its-zones6:~# time pkgrepo list -s
https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
real 0m47.065s
user 0m45.310s
sys 0m1.316s
***@its-zones6:~# time pkgrepo list -s
https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
real 0m47.921s
user 0m45.327s
sys 0m1.277s

This may seem like a somewhat artificial test, but it matches up with
what matters: time to install Solaris.
Doing an install of Solaris over the https IPS URL took *3 hours* on a
T5220, vs the expected 1 hour or less.
Philip Brown
2013-01-18 16:19:04 UTC
On 1/18/13 8:13 AM, Philip Brown wrote:
>
>
> From x4100 class hardware --
> ***@ovm-svr3:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> real 0m16.664s
> user 0m13.317s
> sys 0m0.512s
> ...
>
> From T5220 hardware --
> ***@its-zones6:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> ...
> ***@its-zones6:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> real 0m47.921s
> user 0m45.327s
> sys 0m1.277s


Oops... I forgot. Before someone asks:

***@its-zones6:~# uname -v
11.1
Shawn Walker
2013-01-18 17:53:18 UTC
On 01/18/13 08:13, Philip Brown wrote:
> In attempting to diagnose why Solaris 11 installs are ludicrously slow
> through Oracle OpsCenter, I came across the following nastiness:
> It appears that pkg operations over SSL are mindbogglingly slow for
> T5220 class hardware.
>
> I know that the CPUs on that generation of "Niagara" hardware are weak,
> but they do have an SSL accelerator. In dealing with scp transfers
> within our network, I compared an internally compiled scp to the
> Solaris-supplied scp, and the Solaris scp, which uses the acceleration
> support, was significantly faster.
>
> So I'm wondering if there is an issue with Python not using the
> acceleration hardware for SSL, or something?

pkg(5) uses pycurl for all of its transport needs, which is just a
wrapper around libcurl.

libcurl uses libopenssl which, as far as I know, automatically uses the
crypto acceleration available.

I'm uncertain why Python would be implicated here.

If this is truly a crypto acceleration issue, then the same issue should
show up with curl.

Perhaps more data is required before analysis can be performed.

-Shawn
Philip Brown
2013-01-18 18:17:40 UTC
On 01/18/13 09:53 AM, Shawn Walker wrote:
>
> libcurl uses libopenssl which, as far as I know, automatically uses
> the crypto acceleration available.
>
> I'm uncertain why Python would be implicated here.
>
> If this is truly a crypto acceleration issue, then the same issue
> should show up with curl.
>
> Perhaps more data is required before analysis can be performed.


Fair enough. I'm happy to run some timing tests, if you can provide me with


1. a suggested large file target url for SSL speed testing
2. command line invocation options to more directly test
python/libcurl/whatever else needs to be compared
Shawn Walker
2013-01-18 18:34:34 UTC
On 01/18/13 10:17, Philip Brown wrote:
> On 01/18/13 09:53 AM, Shawn Walker wrote:
>>
>> libcurl uses libopenssl which, as far as I know, automatically uses
>> the crypto acceleration available.
>>
>> I'm uncertain why Python would be implicated here.
>>
>> If this is truly a crypto acceleration issue, then the same issue
>> should show up with curl.
>>
>> Perhaps more data is required before analysis can be performed.
>
>
> Fair enough. I'm happy to run some timing tests, if you can provide me with
>
>
> 1. a suggested large file target url for SSL speed testing

I'd suggest using the find utility to find the largest file in a
repository's publisher directory; they'll have names like
'00fe950a74bb641431343ea8d7848222bb402709' and should be in a
directory like '$repo/publisher/solaris/file/xx/'.

You can then retrieve the file with something like:

https://$repo/file/0/$filename
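Putting that together on the repository host, something like this
should work (a sketch; the repository root below is a placeholder):

  repo=/export/ipsrepo    # wherever the repository actually lives
  # largest file payload; the 5th column of 'ls -ln' is the size
  find "$repo/publisher/solaris/file" -type f -exec ls -ln {} + |
      sort -n -k5 | tail -1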

> 2. command line invocation options to more directly test
> python/libcurl/whatever else needs to be compared

Use curl's standard options; there's nothing special.
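For example, to time a raw fetch of one of those files over SSL while
bypassing pkg entirely (hostname and hash below are just the ones from
earlier in this thread; -k skips certificate verification for testing):

  time curl -k -s -o /dev/null \
    https://oracle-oem-oc-mgmt-sunspot:8002/IPS/file/0/00fe950a74bb641431343ea8d7848222bb402709

If curl shows the same x86-vs-SPARC gap that pkgrepo did, transport is
implicated; if it doesn't, the bottleneck is in the client.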

However, I would also mention that in general, amd64-based systems will
have better single-threaded performance for pkg(5). Your earlier test
was not so much a test of transport performance as it was of the
client's ability to parse and store JSON data.

A simple DTrace script that just tallied the amount of execution time
required for every function called during an execution of pkgrepo should
be enough to corroborate.

I suspect that if you did that, you'd see immediately that transport was
unlikely to be the source of the performance difference.

-Shawn
Philip Brown
2013-01-18 18:32:56 UTC
On 01/18/13 10:34 AM, Shawn Walker wrote:
>
> A simple DTrace script that just tallied the amount of execution time
> required for every function called during an execution of pkgrepo
> should be enough to corroborate.


Please provide this simple script, and I'll run it on my prior example
of pkgrepo list.
Shawn Walker
2013-01-18 18:56:58 UTC
On 01/18/13 10:32, Philip Brown wrote:
> On 01/18/13 10:34 AM, Shawn Walker wrote:
>>
>> A simple DTrace script that just tallied the amount of execution time
>> required for every function called during an execution of pkgrepo
>> should be enough to corroborate.
>
>
> Please provide this simple script, and I'll run it on my prior example
> of pkgrepo list.

Although it's not a "simple script" and perhaps overkill for this case,
there's one already written that could be used here:

http://www.dtracebook.com/index.php/Languages:py_calltime.d

Example usage:

dtrace -c '/usr/bin/pkg info pkg' -s py_calltime.d
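The core of what that script measures reduces to something like the
following (a rough sketch using the python provider probes shipped with
the Solaris Python; unlike py_calltime.d it doesn't account for nested
or recursive calls):

  dtrace -n '
    python$target:::function-entry { self->ts = vtimestamp; }
    python$target:::function-return /self->ts/ {
        /* tally on-CPU time per Python function name */
        @[copyinstr(arg1)] = sum(vtimestamp - self->ts);
        self->ts = 0;
    }' -c "pkgrepo list -s https://oracle-oem-oc-mgmt-sunspot:8002/IPS"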

Further analysis will be up to you.

-Shawn
Bart Smaalders
2013-01-18 22:06:18 UTC
On 01/18/13 08:13 AM, Philip Brown wrote:
>
>
>
> From x4100 class hardware --
> ***@ovm-svr3:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> real 0m16.664s
> user 0m13.317s
> sys 0m0.512s
> ***@ovm-svr3:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> real 0m13.931s
> user 0m12.588s
> sys 0m0.433s
>
> From T5220 hardware --
> ***@its-zones6:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> real 0m47.065s
> user 0m45.310s
> sys 0m1.316s
> ***@its-zones6:~# time pkgrepo list -s
> https://oracle-oem-oc-mgmt-sunspot:8002/IPS >/dev/null
> real 0m47.921s
> user 0m45.327s
> sys 0m1.277s

Note that this ratio of about 4:1 performance difference is much
better than the ratio of their SPECint values. Oracle doesn't
publish them, but you can get an idea on T1-T3 by dividing the
SPECint_rate by the number of threads. Note that this DOESN'T
work on T4, since those machines will do out-of-order issue
and use all available chip pipelines for one thread, so
single-thread performance is much better than you'd estimate
using this method. For the older chips, though, this seems to
match empirical observations.

SPECint_rate2006 of the T5220 is 83.2; that benchmark used 64 threads.
SPECint_rate2006 of the DL360G5 is 73; that benchmark used 4 threads.

e.g. about a 14:1 per-thread difference (83.2/64 ≈ 1.3 per thread vs
73/4 ≈ 18.3). In cases where no downloads are required, I've seen pkg a
factor of 10 slower on T1/T2 than fast Intel kit.

Since Python performance is pretty much proportional to SPECint
(per-thread rating), these machines will always be significantly slower
than Intel hardware on single-threaded, non-parallel workloads.
Note that you can use parallel zone update in U1 to put more of
those slow but plentiful threads to work at the same time.

T4 performance is much better running pkg.

- Bart

--
Bart Smaalders Solaris Core OS
***@oracle.com http://blogs.oracle.com/barts
"You will contribute more with Mercurial than with Thunderbird."
"Civilization advances by extending the number of important
operations which we can perform without thinking about them."
Philip Brown
2013-01-18 22:34:52 UTC
On 01/18/13 02:06 PM, Bart Smaalders wrote:
> ...
> e.g. about 14:1 difference. In cases where no downloads are
> required, I've seen pkg a factor of 10 slower on T1/T2 than
> fast Intel kit.
...

Thanks for the numbers, Bart.

If that were the only limiting case... sounds like a design flaw to me
then.
Oracle is still officially supporting all(?) sun4v machines.
If IPS is really performing *that* badly... sounds like it needs some
design tweaks to make it perform better on the full range of
Oracle-supported hardware.


That being said... the issues I'm seeing can't only be due to that.
OpsCenter 3.1.1 driven installs go ludicrously slowly, and usually time
out after 3 hours.

But if I do a manually triggered, "installadm" based install, I can get
Solaris 11 installed to an older SPARC (a T2000, even) in under 1 hour.
Actually, closer to half an hour.

So it seems like there are multiple issues at play here. One could be
the slower JSON parsing performance, perhaps. The other may be poor SSL
management on top of that.

Comparison of differences ---

* Direct installadm + net boot: no SSL; goes directly to the standard
Solaris 11 pkg service on port 80.
Console output says that the combined stages of "figure out which
packages I need" and "actually download the packages" take only around
15 minutes. The download itself takes only 10 minutes:
20:17:00 Download: Completed 585.26 MB in 555.40 seconds (1.1M/s)
Then running the requisite 88,000 actions takes another 15 minutes or so:
20:30:58 Actions: Completed 88134 actions in 776.30 seconds.
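(That works out to roughly 113 actions per second.)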

* OpsCenter: over SSL, and goes to a funky redirection port+URL,
https://server:8002/IPS/
Additionally, opscenter seems to want to include every Solaris 11
package ever released in its pkg repo.
So whereas "pkgrepo list |wc -l" only returns around 4400 in the direct
case above, it returns over 9000 in the opscenter case.

If "more packages in repo" degrades install time, even when the client
doesnt even need the extra packages... there's a problem in efficiency
of IPS mechanisms somewhere.
Shawn Walker
2013-01-18 23:04:03 UTC
On 01/18/13 14:34, Philip Brown wrote:
...
> If "more packages in repo" degrades install time, even when the client
> doesnt even need the extra packages... there's a problem in efficiency
> of IPS mechanisms somewhere.

More packages in the repository will cause a small increase in planning
time because there are more possible versions to evaluate; I don't see
how that's avoidable. With that said, only for repositories with
70,000+ unique package versions has that difference started to be
noticeable in the past, and only at 100,000+ was it "significant".

On my (several years old) x86 system, it takes less than 30 seconds to
plan an update operation when configured with a combination of
repositories with 120,000+ unique package versions.

-Shawn
Philip Brown
2013-01-18 23:56:06 UTC
On 1/18/13 3:04 PM, Shawn Walker wrote:
> On 01/18/13 14:34, Philip Brown wrote:
> ...
>> If "more packages in repo" degrades install time, even when the client
>> doesnt even need the extra packages... there's a problem in efficiency
>> of IPS mechanisms somewhere.
>
> More packages in the repository will cause a small increase in
> planning time because there are more possible versions to evaluate, I
> don't see how that's avoidable. With that said, only for repositories
> with 70,000+ unique package versions has that difference started to be
> noticeable in the past and only at 100,000+ was it "significant".


Huh. Well, that's good news, I guess. But then that seems to leave the
culprit to be
a) SSL transport + T5220
and/or
b) the front-end proxying

And given that faster systems don't seem to have problems going through
the same proxy, it wouldn't seem to be the proxy that's the problem.

PS: I looked at that DTrace script and tried it out, but it wasn't
useful. It only reported use of "stat.py" or something like that, no
other Python function.
So I'm lost there.
Bart Smaalders
2013-01-19 01:54:32 UTC
On 01/18/13 02:34 PM, Philip Brown wrote:
> On 01/18/13 02:06 PM, Bart Smaalders wrote:
>> ...
>> e.g. about 14:1 difference. In cases where no downloads are
>> required, I've seen pkg a factor of 10 slower on T1/T2 than
>> fast Intel kit.
> ...
>
> Thanks for the numbers, Bart.
>
> If that were the only limiting case... sounds like a design flaw to me
> then.
> Oracle is still officially supporting all(?) sun4v machines.
> If IPS is really performaning *that* badly... sounds like it needs some
> design tweaks to make it perform better on the full range of oracle
> supported hardware.
>
>
> That being said... The issues I'm seeing, cant only be due to that.
> OpsCenter 3.1.1 driven installs, go ludicrously slowly, and usually time
> out after 3 hours.
>
> But, if I do a manually triggered, "installadm" based install, I can get
> solaris 11 installed to an older sparc (a T2000, even), under 1 hour.
> Actually, closer to half an hour
>
> So seems like there's multiple issues at play here. One could be the
> slower JSON parsing performance, perhaps. The other may be poor SSL
> management on top of that.
>
> Comparison of differences ---
>
> * Direct installadm+net boot: no SSL, goes directly to port 80 standard
> solaris 11 pkg services
> Console output says that the combined stages of "figure out which
> packages I need"
> and "actually download the packages", takes around 15 minutes only.
> Download itself only takes 10 minutes
> 20:17:00 Download: Completed 585.26 MB in 555.40 seconds (1.1M/s)
> Then running the requisite 88,000 actions, takes another 15 minutes or so.
> 20:30:58 Actions: Completed 88134 actions in 776.30 seconds.
>
> * OpsCenter: Over SSL, and goes to funky redirection port+URL
> https://server:8002/IPS/
> Additionally, opscenter seems to want to include every solaris11 package
> ever released, in its pkg repo.
> So whereas "pkgrepo list |wc -l" only returns around 4400 on the direct
> case above,
> it returns over 9000 for the opscenter case.
>
> If "more packages in repo" degrades install time, even when the client
> doesnt even need the extra packages... there's a problem in efficiency
> of IPS mechanisms somewhere.
>
>
>


There's a slight difference, but nothing substantial. Ops Center must
be doing something silly.

Is the Intel kit slow under ops center as well, or just the SPARC?

> But if I do a manually triggered, "installadm" based install, I can get
> Solaris 11 installed to an older SPARC (a T2000, even) in under 1 hour.
> Actually, closer to half an hour.

That's what I expect.

- Bart



--
Bart Smaalders Solaris Core OS
***@oracle.com http://blogs.oracle.com/barts
"You will contribute more with Mercurial than with Thunderbird."
"Civilization advances by extending the number of important
operations which we can perform without thinking about them."
Philip Brown
2013-01-23 17:46:25 UTC
On 01/18/13 05:54 PM, Bart Smaalders wrote:
> On 01/18/13 02:34 PM, Philip Brown wrote:
>> ...
> There's a slight difference, but nothing substantial. Ops Center must
> be doing something silly.
>
> Is the Intel kit slow under ops center as well, or just the SPARC?
>
>

Unfortunately, I had some issues getting our x86 machines to use
opscenter this week, even though they were working previously.

But the good news is, I finally managed to get an "apples to apples"
comparison.
Previously, I was comparing (opscenter repo + SSL) vs (manual oracle.com
mirror, no SSL).
Reminder: the opscenter repo has over 9,000 packages; the manual repo
has only about 4,400.

Now I find out that https://opscenter:8002/IPS is a redirect to
http://opscenter:11000, so I can do speed tests to the same repo with
and without the opscenter SSL proxy.
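(The redirect is visible with a plain HEAD request, e.g.

  curl -skI https://oracle-oem-oc-mgmt-eridu:8002/IPS/ | grep -i location

assuming the proxy advertises it via a Location header rather than
rewriting the request internally.)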

From an x4100, accessing either with "pkgrepo list" takes about the
same amount of time: around 11 seconds.

From the T5220, likewise about the same either way: around 45 seconds.

***@its-zones6:~# time pkgrepo list -s
https://oracle-oem-oc-mgmt-eridu:8002/IPS |wc -l
9828

real 0m47.224s
user 0m45.507s
sys 0m1.339s


***@its-zones6:~# time pkgrepo list -s
http://oracle-oem-oc-mgmt-eridu:11000 |wc -l
9828

real 0m45.345s
user 0m43.920s
sys 0m1.236s


So it seems like my new premise quoted at the top of this email is in
effect: "design flaw: IPS repo/access tools need to be tweaked so that
they perform to a usable degree on older sun4v class machines".

As a comparison, here's the same machine doing activity to the smaller,
manually created repo:
***@its-zones6:~# time pkgrepo list -s http://sunspot |wc -l
4784

real 0m19.631s
user 0m18.434s
sys 0m0.975s
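Worth noting: ~9,800 packages in ~45 seconds versus ~4,800 in ~20
seconds is close to linear scaling with package count, which again
points at client-side parsing, rather than transport, as the dominant
cost.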



The "manually created repo", was created by downloading a repo image
from oracle.
***@its-zones6:~# time pkgrepo list -s http://sunspot entire
solaris entire 0.5.11,5.11-0.175.1.0.0.24.2:20120919T190135Z
solaris entire 0.5.11,5.11-0.175.0.10.1.0.0:20120918T160900Z

In contrast, the opscenter auto-created repo has just about every
package ever released.
"pkgrepo list entire" gives everything between
solaris entire 0.5.11,5.11-0.175.1.1.0.4.0:20121106T001344Z
...
solaris entire 0.5.11,5.11-0.151.0.1:20101105T054056Z


This is even more than what is available on
http://pkg.oracle.com/solaris/release
Philip Brown
2013-02-26 19:34:21 UTC
On 01/23/13 09:46 AM, Philip Brown wrote:
> On 01/18/13 05:54 PM, Bart Smaalders wrote:
>> There's a slight difference, but nothing substantial. Ops Center must
>> be doing something silly.
>>
>> Is the Intel kit slow under ops center as well, or just the SPARC?
>>
>>
>
> Unfortunately, I had some issues getting our x86 machines to use
> opscenter this week, even though they were working previously.
>

I've finally been able to get back to investigating the problems we've
been having.
Turns out, it's not SSL... because opscenter doesn't use SSL for the
actual package transfers.
The HUGE slowness problem we saw seems to be a combination of:
1. something weird opscenter does
2. something weird IPS does
3. a misconfiguration that leaked in from somewhere.

The good news is, I can now positively identify ALL of the above. So I'm
posting a summary of findings to the list.

Background: opscenter is a distributed control system, with a "master"
controller and assorted proxies to distribute load. Solaris 11
installation is supposed to be handed off to a "proxy controller".

It turns out that the proxy controller, for purposes of IPS installs, is
literally an Apache proxy, with a confusing multi-level httpd
configuration. It's supposed to be a caching proxy, so I think it serves
out the packages from cache after the initial load.

I explored the opscenter http configs, and found this shocking comment:

# The pkg client opens 20 parallel connections to the server when performing
# network operations.

This turned out to be the key. The MaxClients knob had gotten set too
low, and it was starved for working connections.
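(For anyone else chasing this, the quick check is something like

  grep -rn MaxClients /var/opt/sun/xvm/ 2>/dev/null

though that path is from memory and may differ between Ops Center
versions. MaxClients needs to comfortably exceed 20 x the number of
concurrently installing clients, since each pkg client may open up to
20 connections.)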

Okay, this fixes my immediate problem. But what does that say about
IPS? Seems like there are multiple problems there.

First of all: it shouldn't degrade into glacial speed when it can't
open 20 full connections!!!

Secondly... why is it being so obnoxious about so many connections? I
decided to put it to the test.

I created 20 x 100 MB files and downloaded them, first with "wget file1
file2 file3 ..." and then with "wget file1 & wget file2 & ...".
When dumping the data to /dev/null, I was surprised to find that 20 in
parallel was actually faster: about 30 seconds vs 34 seconds, usually.
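In shell terms, the two tests were roughly the following (a
reconstruction in bash syntax; the URL base is a placeholder):

  base=http://oracle-oem-oc-mgmt-eridu:11000/files

  # serial: a single wget fetches the 20 files back to back
  time wget -q -O /dev/null "$base"/file{1..20}

  # parallel: 20 simultaneous connections
  time { for i in {1..20}; do
           wget -q -O /dev/null "$base/file$i" &
         done; wait; }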

However: IPS doesn't dump downloaded packages to /dev/null. So time for
some more realistic tests!

When I set my tests to save the files to a ZFS filesystem (with
atime=off), I found that the transfer times were much more variable
and, generally speaking, there was no significant difference between the
two methods. They all took around 1 minute, or 1:30 on hardware with
slower disks. Results tended to be within 1 second of each other.

So, I would suggest that IPS be fixed to use fewer connections. It's
currently being obnoxious to any http front end, and for no significant
benefit.
At the very minimum, it needs to handle connection starvation better.
Shawn Walker
2013-02-26 19:46:18 UTC
On 02/26/13 11:34, Philip Brown wrote:
> ...
> This turned out to be the key. The MaxClients knob had gotten set too
> low, and it was starved for working connections.
> ...
> First of all: it shouldn't degrade into glacial speed when it can't
> open 20 full connections!!!

If a web server can't spare at least 20 connections at any point,
there's definitely a configuration issue.

> Secondly... why is it being so obnoxious about so many connections? I
> decided to put it to the test.

20 is hardly "so many connections"; have you seen how many connections a
torrent client opens?

> I created 20 x 100 MB files and downloaded them, first with "wget file1
> file2 file3 ..." and then with "wget file1 & wget file2 & ...".
> When dumping the data to /dev/null, I was surprised to find that 20 in
> parallel was actually faster: about 30 seconds vs 34 seconds, usually.

Yes, which is part of why IPS uses parallel connections. We'd rather
use HTTP pipelining, but most HTTP servers implement that poorly, and
some proxies are completely busted when it comes to it.

> However: IPS doesn't dump downloaded packages to /dev/null. So time for
> some more realistic tests!
>
> When I set my tests to save the files to a ZFS filesystem (with
> atime=off), I found that the transfer times were much more variable
> and, generally speaking, there was no significant difference between the
> two methods. They all took around 1 minute, or 1:30 on hardware with
> slower disks. Results tended to be within 1 second of each other.
>
> So, I would suggest that IPS be fixed to use fewer connections. It's
> currently being obnoxious to any http front end, and for no significant
> benefit.
> At the very minimum, it needs to handle connection starvation better.

There's no "fix" here; the value of 20 was chosen after careful
evaluation for determining the optimal number of parallel connections.
In a properly configured environment, it currently provides the best
performance.

The current transport system has advantages and disadvantages, and
alternatives will be investigated at some point in the future, but for
now proper configuration is important for maximal performance.

HTTP in general is not an optimal transport for large amounts of file data.

-Shawn
Philip Brown
2013-02-26 20:32:23 UTC
On 02/26/13 11:46 AM, Shawn Walker wrote:
> On 02/26/13 11:34, Philip Brown wrote:
>> So, I would suggest that IPS be fixed to use fewer connections. It's
>> currently being obnoxious to any http front end, and for no significant
>> benefit.
>> At the very minimum, it needs to handle connection starvation better.
>
> There's no "fix" here; the value of 20 was chosen after careful
> evaluation for determining the optimal number of parallel connections.
> In a properly configured environment, it currently provides the best
> performance.


Yeah... the trouble with that is that many environments are not
"properly configured". Solaris still needs to perform well in
non-optimal conditions.

>
> The current transport system has advantages and disadvantages, and
> alternatives will be investigated at some point in the future, but for
> now proper configuration is important for maximal performance.

It could be enlightening if you put the evaluation data up for
public scrutiny.
As shown in my email, when our environment was "properly configured",
there was no performance difference in my tests.
Shawn Walker
2013-02-26 20:39:49 UTC
On 02/26/13 12:32, Philip Brown wrote:
> On 02/26/13 11:46 AM, Shawn Walker wrote:
>> On 02/26/13 11:34, Philip Brown wrote:
>>> So, I would suggest that IPS be fixed to use fewer connections. It's
>>> currently being obnoxious to any http front end, and for no significant
>>> benefit.
>>> At the very minimum, it needs to handle connection starvation better.
>>
>> There's no "fix" here; the value of 20 was chosen after careful
>> evaluation for determining the optimal number of parallel connections.
>> In a properly configured environment, it currently provides the best
>> performance.
>
>
> Yeah... the trouble with that is that many environments are not
> "properly configured". Solaris still needs to perform well in
> non-optimal conditions.

Again, HTTP is not the optimal transport option. The best transport
option remains local file repositories or file repositories accessed via
NFS.

>>
>> The current transport system has advantages and disadvantages, and
>> alternatives will be investigated at some point in the future, but for
>> now proper configuration is important for maximal performance.
>
> It could be enlightening if you put the evaluation data up for
> public scrutiny.
> As shown in my email, when our environment was "properly configured",
> there was no performance difference in my tests.

Your own experiments indicated it was a few seconds faster. A few years
ago the performance advantage was determined to be ~20% for the typical
install/update case -- which is a much larger test case.

-Shawn
Philip Brown
2013-02-26 22:55:31 UTC
On 02/26/13 12:39 PM, Shawn Walker wrote:
>
> Again, HTTP is not the optimal transport option. The best transport
> option remains local file repositories or file repositories accessed
> via NFS.


You might wanna share that with the OpsCenter group inside of Oracle, then.

They offer no option to configure a client that way (as far as I can see).
They only do http.

Speaking of which, though... how does IPS do anything different for
"local file repositories", since regular pkg servers are also specified
through "Http" type urls?

Or by "local", do you mean "file:///, but local disk instead of NFS" in
this case ?
Shawn Walker
2013-02-26 23:48:27 UTC
On 02/26/13 14:55, Philip Brown wrote:
> On 02/26/13 12:39 PM, Shawn Walker wrote:
>>
>> Again, HTTP is not the optimal transport option. The best transport
>> option remains local file repositories or file repositories accessed
>> via NFS.
>
>
> You might wanna share that with the OpsCenter group inside of Oracle, then.
>
> They offer no option to configure a client that way (as far as I can see).
> They only do http.

Well, it's difficult if you want to proxy access to multiple sources
that could be a mix of both file and HTTP(S) repositories.

But I'd provide that feedback to them as they should be aware of the
capabilities of pkg(5).

> Speaking of which, though... how does IPS do anything different for
> "local file repositories", since regular pkg servers are also specified
> through "Http" type urls?

Repositories accessed via 'file:///' scale better because NFS and local
disk access are all in-kernel. But most importantly, in those cases,
the client won't (and doesn't need to) create a local copy of retrieved
data in /var/pkg/publisher/$name/file so there's significantly less
read/write I/O.
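For example, on a client with an NFS-mounted copy of a repository,
switching over is just (hostname, paths, and the origin being replaced
are all assumptions here):

  mount -F nfs repohost:/export/ipsrepo /mnt/repo
  pkg set-publisher -G https://oracle-oem-oc-mgmt-eridu:8002/IPS \
      -g file:///mnt/repo solaris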

-Shawn
Philip Brown
2013-02-27 00:32:45 UTC
On 2/26/13 3:48 PM, Shawn Walker wrote:
>
>> Speaking of which, though... how does IPS do anything different for
>> "local file repositories", since regular pkg servers are also specified
>> through "Http" type urls?
>
> Repositories accessed via 'file:///' scale better because NFS and
> local disk access are all in-kernel. But most importantly, in those
> cases, the client won't (and doesn't need to) create a local copy of
> retrieved data in /var/pkg/publisher/$name/file so there's
> significantly less read/write I/O.
>

So I think you are saying that connecting to a
svc:/application/pkg/server:default

is no longer the best way to go?

It also sort of begs a follow-up question: if this isn't the best way
any more, then how about Oracle starts offering an "anonymous NFS" type
URL for pkg.oracle.com?
Martin Bochnig
2013-02-27 00:55:44 UTC
On Wed, Feb 27, 2013 at 12:32 AM, Philip Brown <***@usc.edu> wrote:
> It also sort of begs a follow-up question: if this isn't the best way
> any more, then how about Oracle starts offering an "anonymous NFS" type
> URL for pkg.oracle.com?


Phil: Run your next performance comparison against OpenSXCE with SVR4
and OpenCSW's pkgutil fetching

http://svr4.opensxce.org/sparc/5.11/

(Even though I still have cycles in that repo, it should still be
another dimension compared to IPS ... )




--
regards

%martin bochnig
http://svr4.opensxce.org/sparc/5.11/
http://opensxce.org/
http://wiki.openindiana.org/oi/MartUX_OpenIndiana+oi_151a+SPARC+LiveDVD
http://www.youtube.com/user/MartUXopensolaris
https://twitter.com/MartinBochnig
Shawn Walker
2013-02-27 18:48:00 UTC
On 02/26/13 16:55, Martin Bochnig wrote:
> On Wed, Feb 27, 2013 at 12:32 AM, Philip Brown<***@usc.edu> wrote:
>> It also sort of begs a follow-up question: if this isn't the best way
>> any more, then how about Oracle starts offering an "anonymous NFS" type
>> URL for pkg.oracle.com?
>
>
> Phil: Run your next performance comparison against OpenSXCE with SVR4
> and OpenCSW's pkgutil fetching
>
> http://svr4.opensxce.org/sparc/5.11/
>
> (Even though I still have cycles in that repo, it should still be
> another dimension compared to IPS ... )

pkg(5) optimises for update performance over initial install
performance. I have no doubt that the retrieval of single archive
package files is superior for initial installs.

-Shawn
Philip Brown
2013-02-27 19:35:53 UTC
From: Shawn Walker [***@oracle.com]
>pkg(5) optimises for update performance over initial install
>performance. I have no doubt that the retrieval of single archive
>package files is superior for initial installs.

Personally, I count "installing additional packages after the initial OS install" as "updates" also.

Lately, especially with the new so-called "fast lookup database", it seems like pkg's update performance isn't so great in that area either.
After running "pkg install image/gnuplot" on an x4100 with S11.1, here are some timing results:

time pkg uninstall image/gnuplot
real 0m20.940s

And then reinstalling JUST that package, with no dependencies needed any more:
time pkg install image/gnuplot
real 0m20.340s

In contrast, doing an install of gnuplot on "Oracle Linux" (aka redhat) on an Ultra 20:
# time yum install gnuplot
real 0m16.513s
# time yum remove gnuplot
real 0m5.707s


In the real world, it seems like IPS's approach of "deal with individual files instead of full package tarballs" is a lot like HTTP pipelining: a great idea in theory, but providing little to no benefit most of the time, and sometimes making things even slower. Very noticeably slower.


"well, the environment isnt tuned right" isnt an acceptable reply to this, btw.
A robust OS should not require OracleDB level tuning skills to run well. It should tune itself to gracefully handle non-optimal deployments. That's sort of the definition of robust.
I didnt have to do any "tuning" to the redhat/oraclelinux box.


Even in your original case, where you may have been arguing about an existing package getting a revision update:
If packages are suitably granular, then even for pathological cases of "1 file in 100 has been changed", the speed difference between the two approaches isn't going to be bad with high-speed internet.

However, for the *common* case, where lots of files out of 100 have been touched... downloading a single packaged update would seem to be pretty efficient in theory: opening a single connection and then getting max throughput out of it, instead of having to use 20 connections to attempt to get over the latency problems of requesting 50 separate files, merely to update *one* package.

You guys seem to be happy with IPS's performance in some kind of synthetic tests you are running. Unfortunately, when it comes to real world usage, the actual performance is lacking.
Are you going to stick to your "tests", or are you going to do something about it?
Shawn Walker
2013-02-27 19:46:25 UTC
On 02/27/13 11:35, Philip Brown wrote:
...
> You guys seem to be happy with IPS's performance in some kind of
> synthetic tests you are running. Unfortunately, when it comes to real
> world usage, the actual performance is lacking. Are you going to
> stick to your "tests", or are you going to do something about it?

Performance is constantly evaluated to determine how it can be improved
while attempting to balance the tradeoffs involved against design goals
and supportability.

I will have to agree to disagree with the "real world" performance
conclusions reached here.

-Shawn
Bart Smaalders
2013-02-28 01:24:33 UTC
On 02/27/13 11:35, Philip Brown wrote:
>
> From: Shawn Walker [***@oracle.com]
>> pkg(5) optimises for update performance over initial install
>> performance. I have no doubt that the retrieval of single archive
>> package files is superior for initial installs.
>
> Personally, I count "installing additional packages after the initial OS install" as "updates" also.
>

The most frequent packaging operation performed on customer machines
is updating to the next SRU. Very few machines get packages installed
and uninstalled frequently.

SRUs typically affect a small fraction of the files in a package, if
that package is affected at all. As a result, updating only the
changed files rather than downloading every file in affected packages is
a big win.

Once a machine is provisioned and in production, package installations
are rare - and uninstalls are very rare indeed. I can see this when
I look at the package history of my own desktop machine, which dates
back to 2/18/2009 and stretches across multiple OpenSolaris and Oracle
Solaris releases; customers report the same thing.

We believe - and many of our customers confirm - that IPS
delivers a huge reduction in the cost of software maintenance, and a
concomitant reduction in machine downtime. Compare for a moment
downloading patches by hand, manually resolving packaging and patching
dependencies, reviewing patch readme files, and taking machines out
of production to upgrade to the next OS release - all the steps
required to follow best practices on S10 patching - with what is
necessary on S11:

# pkg update

We made the best practice the default practice in S11 - and it's now
really easy for customers to stay up-to-date with respect to patches.

We continue to look for performance wins in IPS. However, we feel that
improving the available automation and reducing the amount of human
effort required to manage a Solaris instance is of far greater import
to our customers than optimizing for a single, seldom performed task
that happens to be easily micro-benchmarked.

- Bart


--
Bart Smaalders Solaris Core OS
***@oracle.com http://blogs.oracle.com/barts
"You will contribute more with Mercurial than with Thunderbird."
"Civilization advances by extending the number of important
operations which we can perform without thinking about them."
Philip Brown
2013-02-28 03:13:45 UTC
On 2/27/13 5:24 PM, Bart Smaalders wrote:
>
> We believe - and many of our customers confirm - that IPS
> delivers a huge reduction in the cost of software maintenance, and a
> concomitant reduction in machine downtime. [etc., etc]

Certainly, but that's because of the customer-facing workflow. It has
nothing to do with the specific back-end implementation.
To take a silly hypothetical: if you suddenly replaced the entire
download/pkgadd/pkg-delete mechanism with dpkg/apt-get behind the
scenes, it wouldn't matter to the customers, so long as the front end
remained the same and performed approximately the same.
However... moving on...

> We continue to look for performance wins in IPS. However, we feel that
> improving the available automation and reducing the amount of human
> effort required to manage a Solaris instance is of far greater import
> to our customers than optimizing for a single, seldom performed task
> that happens to be easily micro-benchmarked.

A fair conclusion in and of itself. However, in addition to pure speed,
the current implementation can lead to difficulties in debugging
performance problems with the package flow, as I've just had to deal
with for the last few weeks.

So my point is, if there aren't truly significant gains, simpler is better.
And/or... it could be beneficial to customers if you would provide an
option (config or command line, doesn't matter) to force single-connection
downloads when desired.
Bart Smaalders
2013-02-28 17:21:24 UTC
On 02/27/13 19:13, Philip Brown wrote:
>
> So my point is, if there aren't truly significant gains, simpler is better.
> And/or... it could be beneficial to customers if you would provide an
> option (config or command line, doesn't matter) to force single-connection
> downloads when desired.

The effect of single connections is very dependent on the average size
of files being retrieved. In particular, packages with small files need
lots of connections to overcome the inherent latency of setting up file
transfers.
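As a back-of-the-envelope illustration (numbers invented purely for
the arithmetic): at 40ms of setup latency per request, 1,000 small
files fetched one at a time spend 40 seconds waiting on round trips
alone; spread across 20 connections, that drops to about 2 seconds.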

- Bart

--
Bart Smaalders Solaris Core OS
***@oracle.com http://blogs.oracle.com/barts
"You will contribute more with Mercurial than with Thunderbird."
"Civilization advances by extending the number of important
operations which we can perform without thinking about them."
Philip Brown
2013-03-06 17:31:28 UTC
On 2/27/13 5:24 PM, Bart Smaalders wrote:
>
> The most frequent packaging operation performed on customer machines
> is updating to the next SRU. Very few machines get packages installed
> and uninstalled frequently.
>
> SRUs typically affect a small fraction of the files in a package, if
> that package is affected at all.

If the customer is patching every month or something.
But not all customers do that.
As a counterexample, we sometimes don't patch our machines for as long as
(*redacted for security reasons*:)

Have you compared in those cases?

For that matter, have you bothered doing an actual survey of what
customers WANT, as opposed to making your own conclusions based on what
you believe customers do most frequently?


I, for example, have spent WEEKS of time attempting to debug and
fine-tune OS installs in the last few months.
A faster install process could have saved me literally days of effort.
Guess which case matters the most to *this* customer?

Updates are supposed to be done transparently in the background anyway.
So it really doesn't *matter* whether updates are fast or slow.

In contrast, OS installs are a blocking factor. Therefore, it actually
does matter how fast they happen, because you cannot do anything with
the hardware until the installation process has finished.


So you are spending a lot of effort optimizing in a way that doesn't
benefit the customer, and you are not optimizing in the case that would
give direct benefit to the customer.
Bart Smaalders
2013-03-06 19:03:51 UTC
On 03/06/13 09:31, Philip Brown wrote:

>
> If the customer is patching every month or something.
> But not all customers do that.
> As a counterexample, we sometimes don't patch our machines for as long as
> (*redacted for security reasons*:)

That's fine... even if you patch only every six months, using IPS is
still far faster than fiddling with LU or dropping to single-user mode,
splitting mirrors, running patchadd, and the like.

>
> In contrast, OS installs are a blocking factor. Therefore, it actually
> does matter how fast they happen, because you cannot do anything with
> the hardware until the installation process has finished.

I think you'll find that initial OS installs are significantly
faster w/ IPS & AI than with jumpstart.

About the only thing that's significantly slower is single package
install and removal, and if you add the time needed for a human to
compute the transitive closure of dependencies, IPS wins there too.

Look, we do focus on performance... but optimizing for single package
addition and removal on EOL hardware known to be slow for any
single-threaded operation is not likely to happen. Pkg install
on a T4 or x86 box is perfectly acceptable. There are other areas
we need to tune, such as change-facet and the like.
>
>
> I, for example, have spent WEEKS of time attempting to debug and
> fine-tune OS installs in the last few months. A faster install
> process could have saved me literally days of effort. Guess which
> case matters the most to *this* customer?

Why not take a system, start with a basic set of packages,
remove stuff, and then write an AI manifest to duplicate that
install? Simple, easy, and reasonably fast.
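Something like this as a starting point (a sketch; splicing the list
into the manifest is left to taste):

  # capture the package set of the trimmed-down reference system
  pkg list -H | awk '{print $1}' > pkglist.txt

  # then name those packages in the AI manifest's
  # <software_data action="install"> section so every install
  # reproduces the same set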

If you have lots of systems installing at the same time off the
same local repository, be sure to configure Apache front ends, or
use NFS repositories for maximum performance.

- Bart




--
Bart Smaalders Solaris Core OS
***@oracle.com http://blogs.oracle.com/barts
"You will contribute more with Mercurial than with Thunderbird."
"Civilization advances by extending the number of important
operations which we can perform without thinking about them."
Philip Brown
2013-03-06 22:29:06 UTC
On 3/6/13 11:03 AM, Bart Smaalders wrote:
>
> Look, we do focus on performance... but optimizing for single package
> addition and removal on EOL hardware known to be slow for any
> single-threaded operation is not likely to happen. Pkg install
> on a T4 or x86 box is perfectly acceptable.

When was the last time you compared performance of installs of, let's
say, Oracle Linux, to Solaris, on the same x86 platform?
(kickstart vs AI install?)

As I previously mentioned, that is something that large-site admins care
about when tuning new platforms or target configuration templates.

If you haven't run a comparison in the last year, I suggest that you run
one again and let folks know what results you see.
Shawn Walker
2013-02-27 18:47:24 UTC
On 02/26/13 16:32, Philip Brown wrote:
> On 2/26/13 3:48 PM, Shawn Walker wrote:
>>
>>> Speaking of which, though... how does IPS do anything different for
>>> "local file repositories", since regular pkg servers are also specified
>>> through "Http" type urls?
>>
>> Repositories accessed via 'file:///' scale better because NFS and
>> local disk access are all in-kernel. But most importantly, in those
>> cases, the client won't (and doesn't need to) create a local copy of
>> retrieved data in /var/pkg/publisher/$name/file so there's
>> significantly less read/write I/O.
>>
>
> So I think you are saying that connecting to a
> svc:/application/pkg/server:default
>
> is no longer the best way to go?
>
> It also sort of begs a follow-up question: if this isn't the best way
> any more, then how about Oracle starts offering an "anonymous NFS" type
> URL for pkg.oracle.com?

The context of the original discussion was around local package
repositories, so my responses were based on that.

Long-haul transport scenarios such as pkg.oracle.com are clearly not
suitable for general NFS usage.

-Shawn